Artificial Intelligence Applications in Chemistry

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.fw001

In Artificial Intelligence Applications in Chemistry; Pierce, T., et al.;
ACS Symposium Series; American Chemical Society: Washington, DC, 1986.

ACS SYMPOSIUM SERIES 306

Artificial Intelligence
Applications in Chemistry
Thomas H. Pierce, EDITOR
Rohm and Haas Company

Bruce A. Hohne, EDITOR
Rohm and Haas Company

Developed from a symposium sponsored by


the Division of Computers in Chemistry
at the 190th Meeting
of the American Chemical Society,
Chicago, Illinois, September 8-13, 1985

American Chemical Society, Washington, DC 1986

Library of Congress Cataloging-in-Publication Data
Artificial intelligence applications in chemistry.
(ACS symposium series, ISSN 0097-6156; 306)
"Developed from a symposium sponsored by the
Division of Computers in Chemistry at the 190th
meeting of the American Chemical Society, Chicago,
Illinois, September 8-13, 1985."
Includes bibliographies and indexes.
1. Chemistry—Data processing—Congresses.
2. Artificial intelligence—Congresses.
I. Pierce, Thomas H., 1952- . II. Hohne, Bruce
A., 1954- . III. American Chemical Society.
Division of Computers in Chemistry. IV. American
Chemical Society. Meeting (190th: 1985: Chicago, Ill.)
V. Series.
QD39.3.E46A78 1986 542'.8 86-3315
ISBN 0-8412-0966-9

Copyright © 1986
American Chemical Society
All Rights Reserved. The appearance of the code at the bottom of the first page of each
chapter in this volume indicates the copyright owner's consent that reprographic copies of the
chapter may be made for personal or internal use or for the personal or internal use of specific
clients. This consent is given on the condition, however, that the copier pay the stated per
copy fee through the Copyright Clearance Center, Inc., 27 Congress Street, Salem, MA 01970,
for copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Law. This
consent does not extend to copying or transmission by any means—graphic or electronic—for
any other purpose, such as for general distribution, for advertising or promotional purposes,
for creating a new collective work, for resale, or for information storage and retrieval systems.
The copying fee for each chapter is indicated in the code at the bottom of the first page of the
chapter.
The citation of trade names and/or names of manufacturers in this publication is not to be
construed as an endorsement or as approval by ACS of the commercial products or services
referenced herein; nor should the mere reference herein to any drawing, specification, chemical
process, or other data be regarded as a license or as a conveyance of any right or permission,
to the holder, reader, or any other person or corporation, to manufacture, reproduce, use, or
sell any patented invention or copyrighted work that may in any way be related thereto.
Registered names, trademarks, etc., used in this publication, even without specific indication
thereof, are not to be considered unprotected by law.
PRINTED IN THE UNITED STATES OF AMERICA

ACS Symposium Series
M. Joan Comstock, Series Editor

Advisory Board

Harvey W. Blanch, University of California, Berkeley
Alan Elzerman, Clemson University
John W. Finley, Nabisco Brands, Inc.
Marye Anne Fox, The University of Texas, Austin
Martin L. Gorbaty, Exxon Research and Engineering Co.
Roland F. Hirsch, U.S. Department of Energy
Rudolph J. Marcus, Consultant, Computers & Chemistry Research
Vincent D. McGinniss, Battelle Columbus Laboratories
Donald E. Moreland, USDA, Agricultural Research Service
W. H. Norton, J. T. Baker Chemical Company
James C. Randall, Exxon Chemical Company
W. D. Shults, Oak Ridge National Laboratory
Geoffrey K. Smith, Rohm & Haas Co.
Charles S. Tuesday, General Motors Research Laboratory
Douglas B. Walters, National Institute of Environmental Health
C. Grant Willson, IBM Research Department

FOREWORD
The ACS SYMPOSIUM SERIES was founded in 1974 to provide a
medium for publishing symposia quickly in book form. The
format of the Series parallels that of the continuing ADVANCES
IN CHEMISTRY SERIES except that, in order to save time, the
papers are not typeset but are reproduced as they are submitted
by the authors in camera-ready form. Papers are reviewed under
the supervision of the Editors with the assistance of the Series
Advisory Board and are selected to maintain the integrity of the
symposia; however, verbatim reproductions of previously
published papers are not accepted. Both reviews and reports of
research are acceptable, because symposia may embrace both
types of presentation.

PREFACE

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.pr001

ARTIFICIAL INTELLIGENCE (AI) is not a new field, as AI dates back to the
beginnings of computer science. It is not even new to the field of chemistry,
as the DENDRAL project dates back to the early 1960s. AI is, however, just
beginning to emerge from the ivory towers of academia. To many people it is
still just a buzz word associated with no real applications. Because AI work
involves people from multiple disciplines, the work is difficult to locate and
the application is sometimes difficult to understand.
We decided that now would be a good time for an AI book for several
reasons: (1) enough applications can now be presented to expose newcomers
to many of the possibilities that AI has to offer, (2) showing what everyone
else is doing with AI should generate new interest in the field, and (3) we felt
an overview was needed to collect the different areas of AI applications to
help people who are starting to apply AI techniques to their disciplines. The
final and possibly most important reason is our personal interest in the field.
Chemistry is an ideal field for applications in AI. Chemists have been
using computers for years in their day-to-day work and are quite willing to
accept the aid of a computer. In addition, the DENDRAL project,
throughout its long history, has graduated many chemists already trained in
AI. It is not surprising that chemistry is one of the leading areas for AI
applications. Scientists have been developing the theories of chemistry for
centuries, but the standard approach taken by a chemist to solve a problem
is heuristic; past experience and rules of thumb are used. AI offers a method
to combine theory with these rules. These systems will not replace chemists,
as is commonly thought; but rather, these programs will assist chemists in
performing their daily work.
Computer applications developed from theoretical chemistry tend to be
algorithmic and numerical by nature. AI applications tend to be heuristic
and symbolic by nature. Multilevel expert systems combine these techniques
to use the heuristic power of expert systems to direct numerical calculations.
They can also use the results of numerical calculations in their symbolic
processing. The problems faced by chemists today are so complex that most
require the added power of the multilevel approach to solve them.
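The multilevel idea can be illustrated with a very small sketch. In the Python below, a heuristic rule decides whether a numerical routine should be run at all, and the numerical result is then turned back into a symbolic conclusion. Everything here is an assumption made for illustration: the Arrhenius expression is used only as a convenient stand-in for "a numerical calculation", and the rule, the function names, and the threshold are hypothetical rather than taken from any system described in this book.

```python
import math

def arrhenius_rate(prefactor: float, activation_energy: float, temperature: float) -> float:
    """Numerical level: rate constant k = A * exp(-Ea / (R * T))."""
    gas_constant = 8.314  # J/(mol*K)
    return prefactor * math.exp(-activation_energy / (gas_constant * temperature))

def classify_reaction(facts: dict) -> str:
    """Symbolic level: a hypothetical rule directs the numerical calculation."""
    # Heuristic rule: this toy rule base only knows how to treat thermal reactions.
    if facts["activation"] != "thermal":
        return "no estimate"
    rate = arrhenius_rate(facts["A"], facts["Ea"], facts["T"])
    # The number is converted back into a symbol the rule base can reason with.
    return "fast" if rate > 1.0 else "slow"

if __name__ == "__main__":
    print(classify_reaction({"activation": "thermal", "A": 1e13, "Ea": 50_000.0, "T": 298.0}))
    print(classify_reaction({"activation": "photochemical"}))
```

The point of the split is that the symbolic layer never needs to know how the rate is computed, only when the calculation is worth running and what its result means.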
Defining exactly which applications constitute AI is difficult in any
field. The problem in chemistry is even worse because chemical applications
that use AI methods often use numerical calculations. Some applications
that are strictly numerical accomplish tasks similar to AI programs. The key
feature used to limit the scope of this book was symbolic processing. The
work presented includes expert systems, natural language applications, and
manipulation of chemical structures.
The book is divided into five sections. The first chapter is outside this
structure and is an overview of the technology of expert systems.
The book's first section, on expert systems, is a collection of expert-
system applications. Expert systems can simplistically be thought of as
computerized clones of an expert in a particular specialty. Various schemes
are used to capture the expert's knowledge of the specialty in a manner that
the computer can use to solve problems in that field. Expert-systems
technology is the most heavily commercialized area in AI as shown by the
wide variety of applications that use this technology. These applications help
show the breadth of problems to which AI has been applied. Much of the
work from other sections of the book also uses expert-system techniques in
some manner.
The second section, on computer algebra, details chemical applications
whose emphasis is on the mathematical nature of chemistry. As chemical
theories become increasingly complex, the mathematical equations have
become more difficult to apply. Symbolic processing simplifies the
construction of mathematical descriptions of chemical phenomena and helps
chemists apply numerical techniques to simulate chemical systems. Not only does
computer algebra help with complex equations, but the techniques can also
help students learn how to manipulate mathematical structures.
The third section, on handling molecular structures, presents the
interface between algebra and chemical reactions. The storage of molecular
representations in a computer gives the chemist the ability to manipulate
abstract molecular structures, functional groups, and substructures. The
rules that govern the changes in the molecular representations vary with each
approach. Molecules can be described as connected graphs, and the
theorems of graph theory can be used to define their similarity. Another
approach uses heuristic rules for chemical substructures to define and display
molecules.
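To make the graph view of a molecule concrete, here is a minimal Python sketch. It stores atoms as labeled nodes and bonds as undirected edges, checks that the graph is connected, and computes a crude similarity score. The encoding, the function names, and the choice of a Jaccard overlap of bonded element pairs are assumptions made for illustration; they are not the methods used in the chapters of this section.

```python
from collections import deque

# Hypothetical encoding: atom index -> element symbol, bonds as index pairs.
# Hydrogens are omitted to keep the example short.
ETHANOL = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
DIMETHYL_ETHER = ({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)])

def adjacency(atoms, bonds):
    """Build a neighbor table from the bond list."""
    neighbors = {index: set() for index in atoms}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def is_connected(atoms, bonds):
    """Breadth-first search: True if every atom is reachable from the first one."""
    if not atoms:
        return True
    neighbors = adjacency(atoms, bonds)
    start = next(iter(atoms))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(atoms)

def bond_pairs(atoms, bonds):
    """Set of element pairs joined by a bond, for example {('C', 'O')}."""
    return {tuple(sorted((atoms[a], atoms[b]))) for a, b in bonds}

def similarity(mol_a, mol_b):
    """Jaccard overlap of bonded element pairs: a deliberately crude measure."""
    pairs_a, pairs_b = bond_pairs(*mol_a), bond_pairs(*mol_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

if __name__ == "__main__":
    print(is_connected(*ETHANOL))
    print(similarity(ETHANOL, DIMETHYL_ETHER))
```

A real system would compare much richer graph features (ring systems, bond orders, substructures), but the data structure is the same: nodes, edges, and functions that walk them.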
The fourth section, on organic synthesis, discusses methods to construct
complex organic syntheses using simple one-step reactions. Many groups
have used the computer to search for synthetic pathways for chemical
synthesis in the past. Each approach must deal with the problem of multiple
possible pathways for each step in the reaction. The chapters in this section
apply AI techniques to select "good" paths in the synthesis.
The final section, on analytical chemistry, is a combination of structure-
elucidation techniques and instrumental optimizations. Instrumental analysis
can be broken into several steps: method development, instrumental
optimization, data collection, and data analysis. The trend today in
analytical instrumentation is computerization. Data collection and analysis
are the main reasons for this. The chapters in this section cover all aspects of
the process except data collection. Organic structure elucidation is really an
extension of data analysis. These packages use spectroscopic data to
determine what structural fragments are present and then try to determine
how those fragments are connected. Different people have used both
individual spectroscopic techniques and combinations of techniques to solve
this very difficult problem. This area holds great promise for future work in
AI.
We gratefully acknowledge the efforts of all the authors who contributed
their time and ideas to the symposium from which this book was
developed. We also thank the staff of the ACS Books Department for their
helpful advice. Finally, we acknowledge the encouragement and support we
received from our management at Rohm and Haas Company.

THOMAS H. PIERCE
Rohm and Haas Company
Spring House, PA 19477

BRUCE A. HOHNE
Rohm and Haas Company
Spring House, PA 19477

January 13, 1986

1
Artificial Intelligence: The Technology of Expert Systems

Dennis H. Smith

Biotechnology Research and Development, IntelliGenetics, Inc., Mountain View,
CA 94040
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch001

Expert systems represent a branch of artificial intelligence that has
received enormous publicity in the last two to three years. Many
companies have been formed to produce computer software for what is
predicted to be a substantial market. This paper describes what is meant
by the term expert system and the kinds of problems that currently appear
amenable to solution by such systems. The physical sciences and
engineering disciplines are areas for application that are receiving
considerable attention. The reasons for this and several examples of recent
applications are discussed. The synergism of scientists and engineers with
machines supporting expert systems has important implications for the
conduct of chemical research in the future; some of these implications are
described.

Expert systems represent a sub-discipline of artificial intelligence (AI). Before beginning a
detailed discussion of such systems, I want to outline my paper so that the focus and
objectives are clear. The structure of the paper is simple. I will:

• Describe the technology of expert systems

• Discuss some areas of application related to chemistry

• Illustrate these areas with some examples


Although the structure of the paper is simple, my goal is more complex. It is simply
stated, but harder to realize: I want to demystify the technology of applied artificial
intelligence and expert systems.
The word mystify means "to involve in mystery, to make difficult to understand, to
puzzle, to bewilder." Therefore, I will try to remove some of the mystery, to make things
easier to understand, to clarify what the technology is and what it can (and cannot) do.
I am going to discuss a special kind of computer software, but software nonetheless.

0097-6156/86/0306-0001$06.00/0 © 1986 American Chemical Society

Everything I will describe could be built from the ground up using assembly language,
BASIC or any other computer language. In the future, some expert systems will certainly
be built using languages such as Fortran, C or PASCAL as opposed to LISP and PROLOG,
which are currently in vogue. So there is no mystery here. What is different, but is still
not mysterious, is the approach taken by AI techniques toward solving symbolic, as opposed
to numeric, problems. I discuss this difference in more detail below. Most readers of this
collection of papers will be scientists and engineers, engaged in research, business or both.
They expect new technologies to have some substantial practical value to them in their
work, or they will not buy and use them. So I will stress the practicality of the technology.
Where is the technology currently? Several descriptions of the marketplace have
appeared over the last year. Annual growth rates for companies involved in marketing
products based on AI exceed 300%, far outstripping other new computer-based applications,
such as control and management of information networks, private telephone networks, and
automation of the home and factory. Of course, those are growth rates, not market sizes or
dollar volumes. The technology will ultimately be successful only to the extent that it does
useful work, by some measure. In this paper I illustrate some areas where useful work can
be, and is being, done. There are many expert systems under development at major
corporations, in the areas of chemistry, chemical engineering, molecular biology and so
forth. Because many of these systems are still proprietary, the examples I will discuss are
drawn from work that is in the public domain. However, the casual reader will easily be
able to generalize from my examples to his or her own potential applications.

The Technology of Expert Systems

I am going to begin my discussion of the technology of expert systems with two provocative
statements. The first is:
Knowledge engineering is the technology base of the "Second Computer Age"
It is possible to use knowledge, for example, objects, facts, data, rules, to manipulate
knowledge, and to cast it in a form in which it can be used easily in computer programs,
thereby creating systems that solve important problems.
The second statement is:
What's on the horizon is not just the Second Computer Age, it's the
important one!
We are facing a second computer revolution while still in the midst of the first one!
And it's probably the important revolution.

Characteristics and Values of Expert Systems. What leads me to make such bold and risky
statements? The answer can be summarized as follows. First, knowledge is power. You
can't solve problems using any technology unless you have some detailed knowledge about
the problem and how to solve it. This fact seems so obvious that it is unnecessary to state
it. Many systems will fail, however, because the builders will attempt to build such systems
to solve ill-defined problems.

Second, processing of this knowledge will become a major, perhaps dominant part of
the computer industry. Why? Simply because most of the world's problem solving
activities involve symbolic reasoning, not calculation and data processing. We have
constructed enormously powerful computers for performing calculations, our number
crunchers. We devote huge machines with dozens of disk drives to database management
systems. Our need for such methods of computing will not disappear in the future.
However, when we have to fix our car, or determine why a processing plant has shut down,
or plan an organic synthesis, we don't normally solve sets of differential equations or pose
queries to a large database. We might use such numerical solutions or the results of such
queries to help solve the problem, but we are mainly reasoning, not calculating.
How do we construct programs that aid us in reasoning as opposed to calculating? AI
is the underlying science. It has several sub-disciplines, including, for example, robotics,
machine vision, natural language understanding and expert systems, each of which will
make a contribution to the second computer age. My focus is on expert systems.
Knowledge engineering is the technology behind construction of expert systems, or
knowledge systems, or expert support systems. Such systems are designed to advise, inform
and solve problems. They can perform at the level of experts, and in some cases exceed
expert performance. They do so not because they are "smarter" but because they represent
the collective expertise of the builders of the systems. They are more systematic and
thorough. And they can be replicated and used throughout a laboratory, company or
industry at low cost.
There are three major components to an expert system:

• the knowledge base of facts and heuristics

• the problem-solving and inference engine

• an appropriate human-machine interface


The contents of a knowledge base, the facts and rules, or heuristics, about a problem
will be discussed shortly. The problem-solving and inference engine is the component of
the system that allows rules and logic to be applied to facts in the knowledge base. For
example, in rule-based expert systems, "IF-THEN" rules (production rules) in a knowledge
base may be analyzed in two ways:
• in the forward, or data-driven direction, to solve problems by asserting new
facts, or conditions, and examining the consequences, or conclusions

• in the backward, or goal-driven direction, to solve problems by hypothesizing
conclusions and examining the conditions to determine if they are true.
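The two directions of analysis can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the inference engine of any particular product: the rules are invented, each rule is simply a set of conditions paired with one conclusion, and the function names are hypothetical.

```python
# Each rule is (set of conditions, conclusion). The rules are invented examples.
RULES = [
    ({"pressure high", "valve closed"}, "line blocked"),
    ({"line blocked"}, "shut down pump"),
]

def forward_chain(facts, rules):
    """Data-driven: fire every rule whose conditions hold until nothing new appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

def backward_chain(goal, facts, rules, _seen=frozenset()):
    """Goal-driven: a goal holds if it is a known fact, or if some rule that
    concludes it has conditions that can all be established in turn."""
    if goal in facts:
        return True
    if goal in _seen:  # guard against circular rule chains
        return False
    for conditions, conclusion in rules:
        if conclusion == goal and all(
            backward_chain(condition, facts, rules, _seen | {goal})
            for condition in conditions
        ):
            return True
    return False

if __name__ == "__main__":
    observed = {"pressure high", "valve closed"}
    print(sorted(forward_chain(observed, RULES)))
    print(backward_chain("shut down pump", observed, RULES))
```

Forward chaining starts from the observations and derives everything that follows; backward chaining starts from a hypothesis and works back to the observations needed to support it. Both use the same rule base.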
For the purposes of this paper, I will not describe the inference procedures further. I
will also say very little about the human-machine interface. However, since expert systems
are designed to be built by experts and used by experts and novices alike, the interface is of
crucial importance. The examples discussed later illustrate how powerful interfaces are
implemented through use of high resolution bit-mapped graphics, menu and "button"
driven operations, a "mouse" as a pointing device, familiar icons to represent objects such
as schematics, valves, tanks, and so forth.

The Knowledge Base. The knowledge base holds symbolic knowledge. To be sure, the
knowledge base can also contain tables of numbers, ranges of numerical values, and some
numerical procedures where appropriate. But the major content consists of facts and
heuristics.
The facts in a knowledge base include descriptions of objects, their attributes and
corresponding data values, in the area to which the expert system is to be applied. In a
process control application, for example, the factual knowledge might include a description
of a physical plant or a portion thereof, characteristics of individual components, values
from sensor data, composition of feedstocks and so forth.
The heuristics, or rules, consist of the judgemental knowledge used to reason about
the facts in order to solve a particular problem. Such knowledge is often based on
experience, is used effectively by experts in solving problems and is often privately held.
Knowledge engineering has been characterized as the process by which this knowledge is
"mined and refined" by builders of expert systems. Again, using the motif of process
control, such knowledge might include rules on how to decide when to schedule a plant or
subsystem for routine maintenance, rules on how to adjust feedstocks based on current
pricing, or rules on how to diagnose process failures and provide advice on corrective
action.
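The split between facts and heuristics can be shown with a small Python sketch built on the process control motif. Facts are stored as objects with attributes and values; heuristics are stored as named rules whose conditions test those facts. The plant objects, attribute names, thresholds, and advice strings are all hypothetical and chosen only to illustrate the structure.

```python
from dataclasses import dataclass
from typing import Callable

# Facts: object -> attribute -> value. All names and numbers are hypothetical.
FACTS = {
    "pump_7": {"type": "centrifugal pump", "hours_since_service": 5200, "vibration_mm_s": 7.1},
    "feed_A": {"type": "feedstock", "price_per_ton": 310.0},
}

@dataclass
class Rule:
    """One heuristic: a judgmental test over the facts plus the advice it yields."""
    name: str
    condition: Callable[[dict], bool]
    advice: str

RULES = [
    Rule("routine maintenance",
         lambda facts: facts["pump_7"]["hours_since_service"] > 5000,
         "schedule pump_7 for routine maintenance"),
    Rule("bearing wear",
         lambda facts: facts["pump_7"]["vibration_mm_s"] > 10.0,
         "inspect pump_7 bearings"),
]

def applicable_advice(facts, rules):
    """Return the advice of every rule whose condition holds for the current facts."""
    return [rule.advice for rule in rules if rule.condition(facts)]

if __name__ == "__main__":
    for line in applicable_advice(FACTS, RULES):
        print(line)
```

Keeping the rules as data, separate from the code that applies them, is what lets an expert revise a threshold or add a rule without touching the rest of the program.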
Expert systems create value for groups of people, ranging from laboratory units to
entire companies, in several ways, by:
• capturing, refining, packaging, and distributing expertise: "an expert at your
fingertips";

• solving problems whose complexity exceeds human capabilities;

• solving problems where the required scope of knowledge exceeds any
individual's;

• solving problems that require the knowledge and expertise of several fields
(fusion);

• preserving the group's most perishable asset, the organizational memory;

• creating a competitive edge with a new technology.


The packaging of complex knowledge bases leads to powerful performance. This
performance is possible due to the thoroughness of the machine and the synthesis of
expertise from several experts. Similarly, if the knowledge base cuts across several
disciplines, the fusion of such knowledge creates additional value. An obvious value of
expert systems is what is referred to above as preserving the organizational memory. Many
organizations will have to confront the loss of some of their most valuable experts over the
next few years, whether through graduation, death, a new job, or retirement. Several
companies are turning to expert systems in order to capture the problem-solving expertise
of their most valuable people. This preserves the knowledge and makes it available in
easily accessible ways to those who must assume the responsibilities of the departing
experts.
Considering commercial applications of the technology, expert systems can create
value through giving a company a competitive edge. This consideration means that the
first companies to exploit this technology to build useful products will obviously be some
steps ahead of those that do not.

Some Areas of Application. I next summarize some areas of application where expert
systems exist or are being developed, usually by several laboratories. Some of these areas
are covered in detail in other presentations as part of this symposium. I want to emphasize
that this is a partial list primarily of scientific and engineering applications. A similar list
could easily be generated for operations research, economics, law, and so forth. Some of the
areas are outside strict definitions of the fields of chemistry and chemical engineering, but I
have included them to illustrate the breadth of potential applications in related disciplines.

• Medical diagnosis and treatment

• Chemical synthesis and analysis

• Molecular biology and genetic engineering

• Manufacturing: planning and configuration

• Signal processing: several industries

• Equipment fault diagnosis: several industries

• Mineral exploration

• Intelligent CAD

• Instrumentation: set-up, monitoring, data analysis

• Process control: several industries


Many readers will have read about medical applications, the MYCIN and
INTERNIST programs. There are many systems being developed to diagnose equipment
failures. Layout and planning of manufacturing facilities are obvious applications.
Chemistry and molecular biology systems were among the earliest examples of expert
systems and are now embodied in commercial systems.
There is a suite of related applications involving signal processing. Whether the data
are from images, oil well-logging devices, or military sensor systems, the problems are the
same: vast amounts of data, only some of which are amenable to numerical analysis. Yet
experts derive valid interpretations from the data. Systems have already been built to
capture this expertise.

There are many diagnosis and/or advisory systems under development, applied to
geology, nuclear reactors, software debugging and use, manufacturing and related financial
services.
There are several applications to scientific and engineering instrumentation which
are especially relevant to chemistry and chemical engineering. These include building into
instruments expertise in instrument control and data interpretation, to attempt to minimize
the amount of staff time required to perform routine analyses and to optimize the
performance of a system. There are several efforts underway in process control, focused
currently in the electrical power and chemical industries.
Before looking at some applications in more detail, let me briefly describe why the
number and scope of applications is increasing so dramatically.

The Technology is Maturing Rapidly. The work that computers are being required to do is
increasingly knowledge intensive. For example, instrument manufacturers are producing
more powerful computer systems that are integral to their product lines. These systems are
expected to perform more complex tasks all the time, i.e., to be in some sense "smarter".
Two developments are proceeding in parallel with this requirement for "smarter" systems.
The software technology for building expert systems is maturing rapidly. At the same
time, workstations that support AI system development are making a strong entry into the
computer market. For the first time, the hardware and software technology are at a point
where development of systems can take place rapidly.
Beginning in 1970, programming languages such as LISP became available. Such
languages made representation and manipulation of symbolic knowledge much simpler than
use of conventional languages. Around 1975, programming environments became
available. In the case of LISP, its interactive environment, INTERLISP, made system
construction, organization and debugging much more efficient. In 1980, research work led
to systems built on top of LISP that removed many of the requirements for programming,
allowing system developers to focus on problem solving rather than writing code. Some of
these research systems have now evolved to become commercial products that dramatically
simplify development of expert systems. Such products, often referred to as tools, are
specifically designed to aid in the construction of expert systems and are engineered to be
usable by experts who may not be programmers.
Supporting evidence for the effects of these developments is found by examining the
approximate system development time for some well known expert systems. Systems begun
in the mid-1960s, DENDRAL and MACSYMA, required of the order of 40-80 man-years to
develop. Later systems of similar scope required less and less development time, of the
order of several man-years, as programming languages and system building tools matured.
With current, commercially available tools, developers can expect to build a prototype of a
system, with some assistance, in the order of one month. The prototype that results
already performs at a significant level of expertise and may represent the core of a
subsequent, much larger system (examples are shown below). Such development times were
simply impossible to achieve with the limited tools that existed before mid-1984.

Developing Expert Systems. How has such rapid progress been achieved? The
improvement in hardware and software technologies is obviously important. Another
important factor is that people are becoming more experienced in actually building systems.
There has emerged, from the construction of many systems designed for diverse
applications, a strong model for the basic steps required in constructing an expert system.
The four major steps are as follows:

• Select an appropriate application

• Prototype a "narrow vertical slice"

• Develop the full system

• Field the system, including maintenance and updates


First, one must select an appropriate application. There are applications that are so
simple, that require so little expertise, that it is not worth the time and money to emulate
human performance in a machine. At the other end of the spectrum, there are many
problems whose methods of solution are poorly understood. For several reasons, these are
not good candidates either. In between, there are many good candidates, and in the next
section I summarize some of the rules for choosing them.
Second, a prototype of a final system is built. This prototype is specifically designed
to have limited, but representative, functionality. During development of the prototype,
many important issues are resolved, for example, the details of the knowledge
representation, the man-machine interface, and the complexity of the rules required for high
performance. Rapid prototyping is already creeping into the jargon of the community.
The latest expert system building tools are sufficiently powerful that one can sit down and
try various ideas on how to approach the problem, find out what seems logical and what
doesn't, reconstruct the knowledge base into an entirely different form, step through
execution of each rule and correct the rules interactively. This approach differs
substantially from traditional methods of software engineering.
The third step, however, reminds us that we do have to pay attention to good
software development practices if a generally used and useful system is to result from the
prototype. Development of a full system, based at least in part on the prototype, proceeds
with detailed specifications as the system architects define and construct its final form.
The last step is just as crucial as its predecessors. The system must be tested in the
field, and the usual requirements in the software industry for maintenance and updates
pertain.
The primary differences, then, between development of expert systems and more
traditional software engineering are found in steps one and two, above. First, the problems
chosen will involve symbolic reasoning, and will require the transfer of expertise from
experts to a knowledge base. Second, rapid prototyping, the "try it and see how it works,
then fix it or throw it away" approach will play an important role in system development.


The only phase of development of expert systems that I will say any more about is
the first, and in many ways the most crucial, step for those who are contemplating building
expert systems for the first time. How do you go about selecting an appropriate
application? Here are the basic criteria:

• Symbolic reasoning

• Availability and commitment of expert

• Importance of problem

• Scope of problem

• General agreement among specialists


• Data and test cases available

• Incremental progress possible


First, the application should involve symbolic reasoning. There is no point in trying
to develop an expert system to perform numerical calculations, for example, Fourier
transforms.
Second, there should be experts available who can solve the problems involved in the
selected application, and they must be committed to spending their time working with the
system and other experts in developing the knowledge base. If such experts are not
available, or will not commit to the effort, forget the application.
Third, the problem must be important. It must be a problem whose computer-aided
solution creates value by some measure. Such problems may require substantial expertise,
or they may be simple, repetitive, and labor-intensive tasks. No one will invest in a system
if the problems are infrequently encountered and can be solved quickly by persons of
normal intelligence.
Fourth, the scope of the application must be bounded. There must be some
specification of the functionality of the expert system and characteristics of the problems it
is expected to solve. Trying to build an expert system to solve the world's economic
problems is not a good application to choose. However, selecting a product mix from an oil
refinery based on the current state of supply and demand in the world's energy markets
might be a good application.
Fifth, there must be general agreement among experts on how to solve the problem,
on what constitutes the facts in the domain, and what are judgemental rules. Without such
agreement, the values mentioned previously of extending the knowledge base beyond any
single individual's contribution, and fusion of expertise across several domains will not be
realized. More practically, without general agreement, other experts will criticize the
performance of the system.
Sixth, there must be ample data and test cases available to convince the system
builders that some defined level of performance has been achieved. Although this may seem
obvious, some systems have been built and tuned to perform well on a single test case.
Needless to say, such systems usually fail when confronted with a second test case.


Seventh, it must be possible to build the system incrementally. It must be easy to
extend the knowledge base and modify its contents, because as you all know, rules often
change as new evidence is gathered. The progress of science and technology is always
working to make our knowledge inadequate or obsolete. We must learn new things; we
must be able to instruct the expert system accordingly.
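
The seven criteria above amount to a screening checklist. As a rough illustration only, the sketch below records a yes/no judgement for each criterion and reports which ones a candidate application fails. The field names and the example judgements are assumptions made for this sketch; only the observation that a Fourier transform is a numerical calculation, not symbolic reasoning, comes from the text.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Candidate:
    """Yes/no judgements for one candidate application (hypothetical field names)."""
    symbolic_reasoning: bool      # involves symbolic reasoning, not pure numerics
    expert_committed: bool        # an expert is available and committed
    important: bool               # a computer-aided solution creates value
    bounded_scope: bool           # functionality and problem class are specified
    specialists_agree: bool       # general agreement on facts and judgemental rules
    test_cases_available: bool    # ample data and test cases exist
    incremental: bool             # the knowledge base can be extended step by step


def failed_criteria(candidate: Candidate) -> List[str]:
    """Return the names of the criteria the candidate does not satisfy."""
    return [f.name for f in fields(candidate) if not getattr(candidate, f.name)]


# Illustrative judgement: computing Fourier transforms fails the first criterion.
# The remaining values are assumed, purely to complete the example.
fourier = Candidate(
    symbolic_reasoning=False,
    expert_committed=True,
    important=True,
    bounded_scope=True,
    specialists_agree=True,
    test_cases_available=True,
    incremental=True,
)
print(failed_criteria(fourier))
```

An empty result means the candidate passes the screen; any entry in the list names a criterion to revisit before committing to the application.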

Selected Applications

Biological Reactors. In this section I discuss some applications that are at least indirectly
related to chemical science and engineering. The first example, illustrated in Figure 1, is
derived from a simulation and diagnosis of a biological reactor that we put together for a
demonstration.

Because the expert system was not connected to a real reactor, we built a small table-
driven simulation to model the growth of cells in suspension. The graphical interface
includes images representing the reactor itself, several feed bins and associated valves. Also
shown in Figure 1 are several types of gauges, including a strip chart, monitors of various
states and alarm conditions, temperature, and the on/off state of heaters and coolers.
The simulation runs through a startup phase, then through an exponential growth
phase which is inhibited by one of several conditions. The expertise captured in the rules in
the knowledge base is designed to diagnose one of several possible faults in the system and
to take action to correct the condition. Growth inhibition may be caused by incorrect
temperature, depletion of nutrients, incorrect pH or contamination. The system is able to
diagnose the fault and to take action to adjust temperatures, the pH, add nutrients or
recommend the batch be discarded due to contamination. This is a simple example, but one
that illustrates several points mentioned earlier. The graphical interface is essential for
non-experts. The system was developed rapidly as a prototype. As such, it does useful things;
it can be examined, criticized, and refined, and it can represent the beginnings of a larger system.
Combinations of relatively simple rules can diagnose problems and take specific actions.
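
How a handful of simple rules can diagnose a fault and select an action may be sketched as follows. This is a minimal Python illustration, not the knowledge base used in the demonstration: the sensor names, the numeric thresholds and the action strings are all invented for the example.

```python
# Each rule pairs a condition on the sensor readings with a fault and an action.
# All names and numeric thresholds below are hypothetical.
RULES = [
    (lambda r: r["contaminated"],
     "contamination", "recommend discarding the batch"),
    (lambda r: not 36.0 <= r["temperature_c"] <= 38.0,
     "incorrect temperature", "adjust heater or cooler"),
    (lambda r: not 6.8 <= r["ph"] <= 7.4,
     "incorrect pH", "adjust pH"),
    (lambda r: r["nutrient_g_per_l"] < 1.0,
     "nutrient depletion", "add nutrients"),
]


def diagnose(readings):
    """Return a (fault, action) pair for every rule whose condition holds."""
    return [(fault, action) for condition, fault, action in RULES
            if condition(readings)]


# One simulated set of readings in which only the pH is out of range.
readings = {
    "contaminated": False,
    "temperature_c": 37.0,
    "ph": 6.0,
    "nutrient_g_per_l": 5.0,
}
for fault, action in diagnose(readings):
    print(f"{fault}: {action}")
```

Keeping the rules in a plain list makes it easy to add, remove or reorder them while the prototype is being refined.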

Communication Satellites. The next example illustrates an expert system similar to those
under development in process control and instrumentation companies. These systems are
designed to diagnose faults and suggest corrective actions.
An aerospace company in California monitors telecommunication satellites in
geosynchronous orbit, 23,000 miles away in space. When something goes wrong on that
satellite, $50 to $100 million are dependent on taking the right corrective action. This
company is using expert systems to capture the knowledge of the developers of the satellites
in diagnosing and correcting problems, and to make this knowledge available to all
operators responsible for monitoring the condition of on-board systems.
Like many modern instruments, their instrument, the satellite, is connected to their
computer systems through an interface, in this case an antenna dish that transfers data
from the satellite to computers at a ground control center.


Figure 1. Graphics screen for the prototype expert system for
diagnosing faults in a biological reactor. The screen shows a
schematic of the reactor, together with gauges, strip charts, and
"traffic lights" indicating the state of the reactor obtained from
sensor readings. (Reproduced with permission. Copyright 1983
IntelliCorp.)


What is especially interesting about their problem of diagnosis of failures and advice
on corrective measures is their treatment of the alarm conditions that trigger the execution
of the expert system. The first goal of their rules is to focus on the single, or small set of,
alarm(s) that are of highest priority, thereby ignoring what may be many lower priority
alarms for a single problem. This usually allows isolation of the problem to a specific
subsystem, such as the energy storage and heating system shown schematically in Figure 2.
When the problem is localized, the system provides advice on what actions to take, then
examines the other alarms to determine if they are of secondary importance or represent
concurrent, major problems. Here, the graphical presentations, for example, Figure 2,
provide information to the operator on which systems are being examined and where the
faults may occur.
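
The alarm-handling strategy described above can be sketched roughly as follows: select the highest-priority alarm(s), use them to localize the subsystem, then sort the remaining alarms into those that probably stem from the same problem and those that may signal a concurrent one. The actual rules are not given in the text, so the `Alarm` fields, the priority scale, the subsystem names and the "same subsystem means secondary" heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Alarm:
    name: str
    subsystem: str
    priority: int  # larger number = more urgent (assumed convention)


def triage(alarms: List[Alarm]) -> Tuple[List[Alarm], List[Alarm], List[Alarm]]:
    """Split alarms into (primary, secondary, possibly concurrent)."""
    if not alarms:
        return [], [], []
    top = max(a.priority for a in alarms)
    primary = [a for a in alarms if a.priority == top]
    focus = {a.subsystem for a in primary}          # localized subsystem(s)
    rest = [a for a in alarms if a.priority < top]
    # Assumed heuristic: lower-priority alarms in the localized subsystem are
    # consequences of the main fault; the others may be separate problems.
    secondary = [a for a in rest if a.subsystem in focus]
    concurrent = [a for a in rest if a.subsystem not in focus]
    return primary, secondary, concurrent


alarms = [
    Alarm("battery undervoltage", "energy storage", 9),
    Alarm("heater current low", "energy storage", 4),
    Alarm("transponder gain drift", "payload", 3),
]
primary, secondary, concurrent = triage(alarms)
print([a.name for a in primary])
print([a.name for a in secondary])
print([a.name for a in concurrent])
```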

Space Stations. The final example I have selected results from work done by the National
Aeronautics and Space Administration (NASA) in preparation for flying the space station.
NASA's general problem is that many space station systems must be repairable in orbit by
astronauts who will not be familiar with the details of all the systems. Therefore, NASA is
looking to the technology of expert systems to diagnose problems and provide advice to the
astronauts on how to repair the problems.
The problem they chose for their prototype is part of the life support system,
specifically the portion that removes CO2 from the cabin atmosphere. This system already
has been constructed, and NASA engineers are already familiar with its operation and how
it can fail. Using this information they were able to build as part of their knowledge base a
simple simulation for the modes of failure of each of the components in the system. The
life support system is modular, in that portions of it can be replaced, once a problem has
been isolated. The graphical representation chosen for the instrument schematic and panel
is shown in Figure 3.
On the left of Figure 3 is a schematic of the system, with hydrogen gas (the
consumable resource) flowing through a valve to the six-stage fuel cell. Cabin atmosphere
enters from the right, excess hydrogen plus CO2 exits at the H2 Sink, and atmosphere
depleted in CO2 exits at the Air Sink. There is a variety of pressure, flow, temperature and
humidity sensors on the system. The lower subsystem is a coolant loop that maintains
temperature and humidity in the fuel cell. On the right of Figure 3 is a schematic of an
instrument panel that contains many of the instruments the astronauts will actually see.
Each component in the schematic is active. Pointing to any component with a mouse
yields a menu of possible modes of failure for that component. Selection of a failure results
in setting parameters in the underlying knowledge base, which are of course reflected in the
settings of the meters and gauges on the instrument panel.
Simply pointing to the IDENTIFY button runs the rule system, which diagnoses the
problem and provides advice on action to take to fix the life support system. The
remainder of the screen is devoted to various switches and output windows that are used to
build and debug the knowledge base.
As an indication of how rapidly the technology of expert systems has matured, this
prototype was built in our offices by two people from NASA, one a programmer who knew
nothing about LISP, the other an expert on the life support system who knew nothing
about programming. Neither had seen KEE™, our system building tool, before receiving
training and beginning work on the prototype. The system, including all the graphics, the
simulation and the rules, was built in four weeks. It is capable of diagnosing many of the
important modes of failure of this portion of the life support system. Much work remains
to be done before a final version of the expert system is completed, but this prototype
provides an important starting point.

Figure 2. Graphics screen for a portion of the expert system developed for an
alarm advisory system for communications satellites. This screen displays a portion
of the battery and heater subsystem used to maintain thermal balance on the satellite.

Concluding Remarks

I have used this paper as an introduction to what amounts to a revolutionary change in the
software technologies of expert systems. At the same time, a revolution is occurring in
hardware technology as well. At the moment, tools for building high performance expert
systems run on special purpose hardware, LISP machines. These machines have been quite
expensive, making entry into this area difficult for many laboratory groups. Several things
are happening that are changing this situation dramatically. First, applications developed
on LISP machines can now be ported to midi and minicomputers, making replication of a
developed system much less expensive. Second, hardware vendors such as Xerox have
recently announced LISP machines at modest prices, just under $10,000 for one such
machine. Third, Texas Instruments has a contract to produce a VLSI implementation of its
LISP machine on a chip. Successful development of this chip will further reduce the cost of
a machine. Fourth, better programming environments are becoming available on midi and
minicomputers, and in the short run some of these systems will mature to the point where
significant work can be done, albeit at performances substantially below the LISP machines.
In the longer term, better hardware for symbolic computation will become available.
These machines will support large knowledge bases, and be able to perform rapid retrievals
of data from them. Logical inferences will be performed at much higher rates, approaching
those now achieved by arithmetic operations. Parallel architectures will further improve
the speed of symbolic computations, just as they have done for numeric computations. The
keyboard is already becoming obsolete in expert systems products. Menus accessed by
pointing devices, or special purpose, programmable touchpads are much easier for most
people to use. Speech and picture input is already achievable in simple systems; the
improvement of this technology will continue to revolutionize human-machine interactions.
An important characteristic of expert systems technology is that it can be added on to
existing technologies. Such systems are already compatible with modern distributed
computing environments, and can be networked easily with existing systems. Thus,
investments in hardware and software are protected, and machines of more conventional
architectures can be used as they are used now, for example, to support large data bases or
to perform numerical calculations. An expert system can make use of these machines,
passing requests for retrievals or calculations over a network, and gathering results to be
used in the problem solving activities.


Figure 3. Graphics screen for the prototype expert system developed by NASA for
diagnosis and repair of the life-support system. This portion of the system
strips cabin atmosphere of CO2. (Reproduced with permission. Copyright 1983
IntelliCorp.)


In my opinion, these technologies will have substantial impact on the practice of
chemistry and chemical engineering. Everyone is familiar already with the extent to which
computers have taken over routine tasks of data acquisition, reduction and presentation.
Machines for data interpretation are now being constructed. Robotics is another discipline
of AI that is now being used in simple systems to perform repetitive laboratory operations.
The fusion of vision and expert systems technologies with robotics will make the latter
much more flexible and adaptable to changing conditions. These changes, and many others
brought on by the new technologies, will probably not diminish the total number of jobs
available in the physical sciences, but they certainly will change what work is done in these
jobs. There is already a history of jobs requiring limited skills being displaced by
computers and automation. Expert systems will create additional displacements. At the
same time, more jobs related to building and maintaining such systems will become
available, but these jobs will require substantially more education and skills.
For jobs that already require substantial skills, expert systems will serve to make the
people holding these jobs more productive. An analogy has been made to engineers who
used to calculate trajectories by hand, but now use computers to perform these routine
tasks, thereby freeing their time for more intellectual pursuits. Chemists and chemical
engineers will see similar improvements to their own productivity.

RECEIVED January 24, 1986

2
A Knowledge-Engineering Facility
for Building Scientific Expert Systems

Charles E. Riese and J. D. Stuart

Radian Corporation, Austin, TX 78766-0948


Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch002

RuleMaster is a general-purpose software package for
building and delivering expert systems. Its features
include 1) knowledge acquisition by inductive learning,
2) no requirement for specialized artificial intelligence
programming skills, and 3) the ability to run on a wide range
of micro-computers and mini-computers. RuleMaster was
developed to enable scientists and engineers to
incorporate human-like decision making as part of their
computer applications. One such application is TOGA, an
expert system to diagnose faults in large transformers
based on gas chromatographic analysis of the insulating
oil.

0097-6156/86/0306-0018$06.00/0
© 1986 American Chemical Society

An expert system is a computer program which contains the captured
knowledge of an expert in some specific domain. The program is able
to give advice within the domain in much the same manner as the
human expert would, asking for information as it is needed,
volunteering partial diagnoses as they are reached, and functioning
with incomplete or possibly erroneous information. The expert
system is able to provide an explanation of the line of reasoning
upon demand.
Until recently, most expert system building took place in the
research departments of universities and a few major corporations.
The primary emphasis was investigation of artificial intelligence
principles, and the application was of secondary importance. The
expert systems tools used reflect this interest. They are typically
stand-alone AI computer systems, using special hardware and software
environments (usually Lisp-based) not commonly found in scientific
and engineering organizations.
But applications usually need a different type of computing
environment. The reasoning task, accomplished by AI techniques,
often constitutes ten percent or less of the code of an application.
The majority of the code is for conventional programming tasks, such
as data acquisition, data base access, numerical calculations, and
graphics. In each application domain, computer hardware and
software have been selected to match the needs of its tasks. In
established fields like chemistry, computer solutions have been
implemented and in use for years. It is not reasonable for the AI
component, a relatively small addition to the total system, to
dictate major changes to the computing environment.
While the original expert system approaches were suitable for
AI research, several types of problems are encountered when the
emphasis is shifted to scientific expert system applications.
In the original approaches, expert system building is slow and
expensive due to the amount of expert and knowledge engineer time
required to express and test rules. The cost of AI hardware and
special AI programmers makes small applications prohibitively
expensive. The expert systems are stand-alone programs, and it is
difficult or impossible to integrate their reasoning with existing
scientific software. Sometimes, finished expert systems cannot be
used in the field because they are too slow, or require
inappropriately expensive hardware.

Because of the current high demand for expert system
applications, software packages which are optimized for application
building, rather than for AI technique research, have been
developed. One of these is RuleMaster (1), which is designed to
extract expert reasoning and to incorporate it into a wide range of
scientific and engineering applications. In contrast with many
other AI approaches, RuleMaster is based on contemporary structured
programming principles. Conventional micro- and mini-computers may
be used by any computer professional to build expert systems
integrated with existing computer programs. A knowledge acquisition
system based on inductive learning speeds up the rule generation and
testing process. A procedural representation of the rule base is
automatically generated, providing consistency and completeness
checking and efficient run-time behavior. Embedding expert system
reasoning into existing systems is supported by two features:
access to external user programs from the RuleMaster rule language,
and the automatic generation of a C code representation of the
expert system.

RuleMaster Description

History. Radian Corporation is a technical consulting company,
employing about 1000 people. About half of Radian's business is in
the chemistry and chemical engineering fields. In 1981, Radian
management realized that expert systems capability could enhance and
complement existing consulting activities. Radian entered into an
agreement with Donald Michie, of Edinburgh University and
Intelligent Terminals Limited (ITL). For a number of years, he had
done research in inductive learning and in other expert system
techniques, and often used conventional structured programming
languages like Pascal. He noted that the special AI environments
were primarily useful for research into AI techniques, and were not
necessary for an expert systems package oriented toward building
applications. RuleMaster was designed and developed by ITL and
Radian during 1982 and 1983. Since then, both companies have
continued enhancing RuleMaster, and several dozen expert system
applications are under construction or completed.


Components. The two principal components of RuleMaster are:

Radial: a procedural, block structured language for
expressing decision rules, and

RuleMaker: the knowledge acquisition system; induces decision
trees from examples of expert decision-making, and
expresses these decision trees as executable
Radial code.

RuleMaster expert systems are represented as Radial programs. To
build an expert system, domain knowledge is normally entered in two
parts: a module structure and the bodies of the modules. The
structure defines the hierarchical organization of decisions used to
solve the problem. The code within each module defines the details
of one of these decisions.

RuleMaker is a knowledge extraction utility for building and
testing the decision logic contained within Radial modules. The
logic is specified as a table of examples of correct expert
decisions for each module. RuleMaker transforms each example set
into an equivalent decision tree, and automatically generates the
body of the module in the form of Radial code. System builders may
also choose to enter Radial code directly, although they usually
prefer to work with example tables.
Consultation of an expert system is accomplished by using its
Radial code representation as input to the Radial interpreter. The
interpreter first performs completeness and consistency checks, and
then provides interactive run-time support.

Inductive Learning (RuleMaker). Experts are best able to explain
complex concepts to human apprentices implicitly by using examples
of the expert's decision-making, rather than by explicitly stating
fundamental theoretical principles. The apprentice quickly
generalizes these example decisions to form working rules, which he
applies when similar situations are encountered.
RuleMaster's knowledge acquisition tool, RuleMaker, employs a
learning process similar to that of the apprentice. To teach a
concept to RuleMaker, the expert provides a set of examples (called
a training set) of correct decisions within some context. Each
training set contains a list of the attributes which are factors for
determining the choice of action. Each example contains a value for
each of the attributes, together with the specified actions to be
taken when that combination of attribute values is encountered. The
RuleMaker utility checks each training set for completeness and
consistency, and then generates a procedural representation of the
knowledge embodied in the examples.
To illustrate this, the example set of Figure 1 shows how a
simple corona detection decision (likely, possible, or unlikely) in
TOGA (Transformer Oil Gas Analysis) might be specified. TOGA is an
expert system that diagnoses faults in large electrical transformers
and will be described in detail later in this paper. The corona
decision is based on four attributes: H2, thermal, H2/C2H2, and
temperature. The attribute "H2" is the concentration of hydrogen
gas; it may be low, medium, or high, according to numerical ranges
set by the expert in another Radial module. "Thermal" refers to
thermally generated hydrocarbon gases, which may be absent, slight,
or definitely present. The other two attributes are the
hydrogen-to-acetylene ratio and the estimate of the temperature at
which the hydrocarbon gases were generated. A hierarchy of rules
supplied by the expert determines the value of each of these
attributes, based eventually on the numerical concentrations
received from the gas chromatograph.
The decision for each example is expressed as an "action-next
state" pair. The "action" is a reference to executable Radial code,
which consists of a sequence of Radial statements. These statements
may contain references to external programs in various languages
(this will be discussed further later). The "next state" describes
the context to which control is to pass after the action is
completed. For diagnostic expert systems, such as TOGA, the next
state will usually be the "goal" state of the module. This passes
control back to the calling module. For procedural expert systems,
such as robotics and instrumentation control applications, the
control will be transferred between several states within a module
to implement looping.
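The "action-next state" idea can be sketched outside Radial. The Python below is only an illustration under stated assumptions: the names run_module, SAMPLER, DIAGNOSIS, and the step limit are hypothetical and do not come from RuleMaster. Each state maps to a rule that returns an (action, next state) pair; a diagnostic module goes straight to GOAL, while a procedural module loops by naming its own state as the next state.

```python
GOAL = "GOAL"  # hypothetical name for the state that returns control to the caller

def run_module(rules, start, ctx, max_steps=100):
    """Run one module: in each state, pick an (action, next state) pair and apply it."""
    state, steps = start, 0
    while state != GOAL:
        if steps == max_steps:
            raise RuntimeError("module did not reach GOAL")
        action, state = rules[state](ctx)
        action(ctx)
        steps += 1
    return ctx

def take_reading(ctx):
    ctx["readings"] += 1

def report(ctx):
    ctx["reported"] = True

# Procedural style: the "sampling" state names itself as next state to loop.
SAMPLER = {
    "sampling": lambda ctx: (take_reading, "sampling")
    if ctx["readings"] < 3 else (report, GOAL),
}

# Diagnostic style: a single decision whose next state is GOAL.
def conclude_unlikely(ctx):
    ctx["result"] = "unlikely"

DIAGNOSIS = {"start": lambda ctx: (conclude_unlikely, GOAL)}

sampled = run_module(SAMPLER, "sampling", {"readings": 0, "reported": False})
diagnosed = run_module(DIAGNOSIS, "start", {})
print(sampled, diagnosed)
```

The step limit is a safeguard added for the sketch; the paper does not say how Radial handles a module that never reaches its goal state.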
The decision tree for the training set of Figure 1, as
generated by RuleMaker, is shown in Figure 2. The generated tree
agrees with all decisions represented in the example set, and
generalizes to reach decisions for unspecified portions of the
space. The rule induction algorithm, called ID3 (2), uses
information-theoretic techniques to reduce the number of decision
nodes in the generated tree.
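As a rough illustration of the information-theoretic idea, the following Python sketch builds a tree by repeatedly splitting on the attribute with the largest information gain, which is the textbook form of ID3. It is not RuleMaker's implementation. The example rows are hypothetical: they are loosely modeled on Figure 1, with every "don't care" entry replaced by an arbitrary concrete value, because how RuleMaker treats such entries is not described here.

```python
import math
from collections import Counter

ATTRS = ["H2", "thermal", "H2/C2H2", "temperature"]

# Hypothetical, fully specified examples loosely modeled on Figure 1;
# the last item of each row is the decision.
EXAMPLES = [
    ("high", "absent", "high", "low", "likely"),
    ("med", "absent", "high", "low", "likely"),
    ("high", "absent", "high", "moderate", "possible"),
    ("med", "absent", "high", "moderate", "possible"),
    ("high", "absent", "high", "high", "unlikely"),
    ("med", "absent", "high", "high", "unlikely"),
    ("med", "present", "high", "moderate", "unlikely"),
    ("med", "slight", "high", "moderate", "unlikely"),
    ("low", "absent", "high", "low", "unlikely"),
    ("high", "absent", "low", "low", "unlikely"),
]

def entropy(rows):
    """Shannon entropy (in bits) of the decision column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, col):
    """Entropy reduction obtained by splitting rows on attribute column col."""
    remainder = 0.0
    for value in sorted({row[col] for row in rows}):
        subset = [row for row in rows if row[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, cols):
    """Return a decision string (leaf) or a (column, branches, default) node."""
    decisions = Counter(row[-1] for row in rows)
    majority = decisions.most_common(1)[0][0]
    if len(decisions) == 1 or not cols:
        return majority
    best = max(cols, key=lambda c: information_gain(rows, c))
    rest = [c for c in cols if c != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, rest)
    return (best, branches, majority)

def classify(tree, example):
    """Walk the tree; unseen attribute values fall back to the node's majority."""
    while isinstance(tree, tuple):
        col, branches, default = tree
        tree = branches.get(example[col], default)
    return tree

tree = id3(EXAMPLES, list(range(len(ATTRS))))
print("root attribute:", ATTRS[tree[0]])
```

The majority-decision fallback plays the role of an ELSE branch: it is one simple way to reach a decision for attribute combinations that no example covers, and it is an assumption of this sketch.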

Rule Language (Radial). RuleMaster expert systems are expressed in
Radial, a block-structured interpreted language with a syntax
similar to Pascal and Ada. Radial is a simple, easy-to-learn
language which supports the full range of expert system
capabilities.

The building block of Radial, corresponding to the Pascal
procedure, is called a "module". The syntax within each module is
based on finite automata theory, to provide the control structures
needed to support both diagnostic and planning aspects of expert
systems applications. Other language features include recursive
routine calls, argument passing, scoped variables and functions,
abstract data types, and user-defined overloaded operators. Built-in
data types include string, integer, floating point, and boolean.

The Radial code for the decision tree of Figure 2 is shown in
Figure 3. This code was generated by RuleMaker. Experts have
difficulty correctly generating a deeply nested conditional phrase
like this, but they are able to inspect it for possible errors or
omissions.
TOGA uses the built-in numerical capabilities of Radial to
compute functions of concentration values, which are used
extensively in the rules. The ratio of hydrogen to acetylene
concentration in the corona rule is a simple example of this.
User-defined compound data types are used to handle blocks of data
as a single named structure. These features are invaluable in
building practical expert systems, but are not available with all
packages.

Most Radial code is constructed by RuleMaker from training sets


                                                          next
H2      thermal   H2/C2H2   temperature       action      state

high    -         high      low          =>  ( likely,    GOAL )
med     absent    high      low          =>  ( likely,    GOAL )
high    -         high      moderate     =>  ( possible,  GOAL )
med     absent    high      moderate     =>  ( possible,  GOAL )
high    -         high      high         =>  ( unlikely,  GOAL )
med     absent    high      high         =>  ( unlikely,  GOAL )
med     present   -         moderate     =>  ( unlikely,  GOAL )
med     slight    -         moderate     =>  ( unlikely,  GOAL )
low     -         -         -            =>  ( unlikely,  GOAL )
-       -         low       -            =>  ( unlikely,  GOAL )

Figure 1. Example set for corona rule.

Figure 2. Decision tree for corona determination.

IF (temp) IS
  "low" : IF (H2/C2H2) IS
      "high" : IF (H2) IS
          "low" : ( "unlikely" -> result, GOAL )
          "med" : ( "likely" -> result, GOAL )
          ELSE ( "likely" -> result, GOAL )
      ELSE ( "unlikely" -> result, GOAL )
  "moderate" : IF (H2/C2H2) IS
      "high" : IF (H2) IS
          "low" : ( "unlikely" -> result, GOAL )
          "med" : IF (thermal) IS
              "absent" : ( "possible" -> result, GOAL )
              "slight" : ( "unlikely" -> result, GOAL )
              ELSE ( "unlikely" -> result, GOAL )
          ELSE ( "possible" -> result, GOAL )
      ELSE ( "unlikely" -> result, GOAL )
  ELSE ( "unlikely" -> result, GOAL )

Figure 3. Corona determination rule induced from Figure 1
examples, as expressed in automatically generated Radial code.


of examples, as described in the previous section. However, Radial
code can also be entered directly by the system builders, if they so
desire.
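For readers unfamiliar with Radial, the induced rule of Figure 3 can be read as ordinary nested conditionals. The Python sketch below mirrors that nesting and then checks it against the example set as read from Figure 1, expanding each "-" (don't care) entry over every value of its attribute. The function and variable names are hypothetical, and the value set assumed for H2/C2H2 (low, high) is a guess, since the figures only show those two values.

```python
from itertools import product

def corona(temp, ratio, h2, thermal):
    """Nested conditionals mirroring Figure 3; returns the corona likelihood."""
    if temp == "low":
        if ratio == "high":
            if h2 == "low":
                return "unlikely"
            return "likely"          # covers "med" and the ELSE branch
        return "unlikely"
    if temp == "moderate":
        if ratio == "high":
            if h2 == "low":
                return "unlikely"
            if h2 == "med":
                return "possible" if thermal == "absent" else "unlikely"
            return "possible"        # ELSE branch of the H2 test
        return "unlikely"
    return "unlikely"                # ELSE branch of the temperature test

# Attribute order matches the columns of Figure 1.
DOMAINS = {
    "h2": ["low", "med", "high"],
    "thermal": ["absent", "slight", "present"],
    "ratio": ["low", "high"],        # assumed value set for H2/C2H2
    "temp": ["low", "moderate", "high"],
}

# (H2, thermal, H2/C2H2, temperature) -> decision; None stands for "-".
FIGURE_1 = [
    (("high", None, "high", "low"), "likely"),
    (("med", "absent", "high", "low"), "likely"),
    (("high", None, "high", "moderate"), "possible"),
    (("med", "absent", "high", "moderate"), "possible"),
    (("high", None, "high", "high"), "unlikely"),
    (("med", "absent", "high", "high"), "unlikely"),
    (("med", "present", None, "moderate"), "unlikely"),
    (("med", "slight", None, "moderate"), "unlikely"),
    (("low", None, None, None), "unlikely"),
    ((None, None, "low", None), "unlikely"),
]

def mismatches():
    """List every expanded example on which the rule and Figure 1 disagree."""
    bad = []
    for pattern, decision in FIGURE_1:
        choices = [[v] if v is not None else DOMAINS[k]
                   for k, v in zip(DOMAINS, pattern)]
        for h2, thermal, ratio, temp in product(*choices):
            if corona(temp, ratio, h2, thermal) != decision:
                bad.append((h2, thermal, ratio, temp, decision))
    return bad

print(mismatches())
```

An empty list means the rule reproduces every listed example, which is the agreement property claimed for the generated tree.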

Explanation. A user may ask for explanation of the line of
reasoning at any time during an expert system consultation.
RuleMaster presents explanation as a list of premises and
conclusions in English-like text. The explanation describes the
execution path which led up to the current conclusion or question.
Explanation is presented in proof ordering, which usually differs
from the order in which the questions and conclusions were
encountered. This is perceived as more relevant and understandable
than the time-ordered presentation of fired rules, as is present in
most expert system approaches.
A sample explanation for the corona decision is as follows:

    Since the estimated oil temperature is moderate
    when H2/C2H2 is above [illegible]
    and the concentration of H2 is medium
    and overheating of oil is absent
    it follows that a corona is possible

This text was constructed at run-time by the Radial interpreter
from text fragments provided beforehand by the system builders. It
displays, in English, the path through the corona decision tree
(Figure 2).
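The assembly of an explanation from pre-written fragments can be sketched as follows. This is a minimal Python illustration, not the Radial interpreter's mechanism: the FRAGMENTS and CONCLUSIONS tables, the explain function, and the connective words are all hypothetical, and the fragment wording only approximates the sample above.

```python
# Text fragments a system builder might supply, keyed by (attribute, value).
FRAGMENTS = {
    ("temp", "moderate"): "the estimated oil temperature is moderate",
    ("ratio", "high"): "H2/C2H2 is high",
    ("h2", "med"): "the concentration of H2 is medium",
    ("thermal", "absent"): "overheating of oil is absent",
}

CONCLUSIONS = {"possible": "a corona is possible"}

def explain(path, conclusion):
    """Render the tests on one path through a decision tree as premises."""
    lines = []
    for i, test in enumerate(path):
        connective = "Since" if i == 0 else "and"
        lines.append(f"{connective} {FRAGMENTS[test]}")
    lines.append(f"it follows that {CONCLUSIONS[conclusion]}")
    return "\n".join(lines)

# The path taken through the corona rule for the sample case.
PATH = [("temp", "moderate"), ("ratio", "high"), ("h2", "med"), ("thermal", "absent")]
print(explain(PATH, "possible"))
```

Because the explanation is generated from the path actually taken, it changes automatically when the examples, and hence the tree, are revised.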
When explanation is requested at intermediate points in a
session, just the reasoning for the current decision tree is
presented. By asking for elaboration, the user can inspect the
reasoning underlying the current rule. Elaboration of the corona
decision above would yield descriptions of the lines of reasoning
which determined the premises: that the oil temperature was
moderate, that the concentration of H2 was medium, etc. Elaboration
may be repeated until the user is satisfied or until all the steps
have been examined.

If explanation is requested at the end of a session, the entire
line of reasoning leading up to the latest top-level conclusion is
presented in proof order. Intermediate conclusions are derived
before they are used in premises.

The number of levels of explanation available depends on the
nesting of routine calls at run-time. The hierarchical organization
of modules makes it easier to control and understand the run-time
behavior of rule execution.

Explanation-driven expert system building leads to robust
systems. By testing to ensure that the right conclusions are
reached for the right reasons, the probability of the reasoning
being correct for unforeseen situations is enhanced. Quality
explanation also makes systems more useful as teaching tools.

External Processes. The Radial language supports interfacing to
software written in the various computer languages available under
UNIX: Fortran, C, Pascal, Lisp, etc. The Radial language takes
care of the details of passing arguments to and from external
routines. This capability allows Radial to be used just for the
reasoning portion of an application. The remainder of the
domain-dependent code can be written in whatever language is most
suitable.

External code may be used to obtain input (e.g., from data
bases, instrumentation, numerical data base routines, other
computers), to send output (to data bases, printers, graphic
devices), or to perform actions when decisions are reached (e.g.,
instrument control). The operators for user-defined data types will
usually be implemented with external routines.

A RuleMaster program may also be set up to be called from
another program. By combining several expert systems in this
manner, a large application can be modeled as a set of cooperating
experts.

C Code Generation. The primary representation of a RuleMaster
expert system is as Radial code, much of which is generated from
example tables. The building and testing is carried out by
interpreting this Radial program. The advantage of interpreting is
speed of development and support for interactive operation.

Once an application is tested and fairly stable, another
delivery mechanism is available. The system can generate a C source
code representation of the expert system. When compiled and
executed, the same behavior as the interpreted Radial version of the
expert system will be exhibited.

There are several reasons for using the C version of an expert
system. Although interpreted Radial is already faster than many
expert system approaches, compiled C is faster still. For expert
systems which get inputs from instrumentation (rather than from a
person at a keyboard) and need to respond in real time, this speed
may be essential. Another advantage is portability. The C code may
be compiled on computers other than the one on which the system was
developed. The third advantage is the small size of the compiled
code. For larger applications, the compiled object code is about
one eighth the size of the corresponding resident code for the
interpreted version. This allows expert systems to be delivered in
systems with limited computing resources, such as those embedded in
chemical instrumentation.

Efficiency. Much of the computer resource requirement of
traditional production rule expert systems is used to decide which
rules are legal for firing at each step of a consultation. With
RuleMaster, this part of the inference engine job is accomplished at
expert system building time, rather than at execution time. For
every knowledge base and inference engine combination, there is an
equivalent procedural representation. Radial is designed so that
this procedural representation can be determined at system building
time, and pointers between conditional branching statements are set
up at that time. During a session, it is only necessary to traverse
these rule pointers. The execution speed improvement resulting from
rule compilation increases with the size of the knowledge base.

A side effect of this approach is the consistency and
completeness checking which is performed at building time when the
rule pointers are being set up. Errors and oversights are caught at
this stage and corrected before the iterative development cycle is
continued. Most expert system approaches do not support error
detection of this type, so inconsistency and redundancy in their
knowledge bases are difficult to detect.

For even faster operation, C code versions of the expert system
may be used. This will result in at least an order of magnitude
faster response over the interpreted version.

Portability. RuleMaster is written in the C language, making it
portable to a wide range of micro- and mini-computers with the UNIX,
VMS, or PC-DOS operating systems. By late 1985, RuleMaster had been
installed on more than twenty brands of computers, ranging in size
from IBM PCs to large mini-computers.

TOGA: An Expert System for Transformer Fault Diagnosis

Hartford Steam Boiler Inspection and Insurance Company (HSB) insures
distribution transformers for power generation utility companies.
The cost to HSB when an insured transformer fails often exceeds a
million dollars. The possibility of losses of this magnitude has
given HSB the incentive to develop a transformer fault early
detection and diagnosis program, based on chemical analysis of the
transformer insulating oil.

Diagnostic Approach. Possible causes of transformer failure include
general insulation deterioration, overheating due to overload,
shorting at failed joints, corona activity near insulation, arcing,
and grounded core. Each failure mode causes heating of the oil,
which may be local and intense or widespread and moderate. The oil
decomposes when subjected to heat, and some of the decomposition
products are gases which dissolve in the oil: hydrogen, carbon
monoxide, carbon dioxide, and hydrocarbons. The relative
concentrations of the various gases depend on the heating history,
and are thereby related to the cause of failure. The concentrations
of these gases can be accurately measured with gas chromatographs,
and this information is used to diagnose the cause of an incipient
breakdown prior to catastrophic failure.

Diagnosing a transformer's condition from chemical analysis of
its oil is an expert skill which has been developed over the past 20
years. It is relatively easy to find skilled chemists who can
provide the chemical analysis, but experts who can diagnose a
transformer's condition from this data are rare. The diagnosis is
typically based on a mixture of science and heuristic rules
developed from years of experience.

Function of TOGA. An HSB employee, Richard I. Lowe, is one of the
handful of transformer diagnosis experts in the U.S. His rules have
been incorporated in an expert system called TOGA, which was built
with the RuleMaster expert system building package.

The function of TOGA is to transform the results of chemical
analysis, together with descriptive information about a transformer,
into a diagnosis of transformer condition and a recommended action.
The rules were created by a process of successive refinement, using
the HSB data base of past transformer histories as a source of test
cases.


Motivation. TOGA contains only a small portion of the knowledge of
the expert and its potential performance is limited to something
less than that of the expert. However, there are still a number of
reasons for building the system.

    Document Expert Techniques. Richard Lowe became an expert by
    making thousands of diagnostic decisions over more than
    twenty years. Most rules used in this diagnosis are
    heuristic (rather than based solely on theory) and they had
    not been written down very well by anyone. Building TOGA was
    an effective method for eliciting a consistent, complete, and
    tested description of the diagnostic rules. Value resides in
    the written expression of the rules, and not just in the
    computer program which executes them.

    Training. By using a diagnostic system built by an
    acknowledged expert, novices can quickly learn to diagnose
    transformers by observing decisions which are reached and
    lines of reasoning.

    Distribute Expertise. TOGA allows novices to perform as
    experts at chemistry laboratories and utility sites,
    especially for the simpler and more prevalent situations
    covered by the rules.

    Consistency. TOGA can be used to ensure that the same
    diagnosis and recommendation is made for the same transformer
    data at all locations and times. Thus, it can be a tool for
    both informing of and implementing standard diagnostic
    procedures.

    Automate Decision-making Process. For daily operation, TOGA
    is run automatically from gas chromatograph output and data
    bases (as opposed to interactively) to generate expert
    interpretation of data. This speeds up the data analysis
    task and removes the element of human error from routine
    diagnoses.

    Aid Expert With Complex Decisions. TOGA helps prevent the
    judgment mistakes which can occur when rare transformer
    conditions are encountered or when experts are forced to make
    a hurried diagnosis.

Validation. TOGA was validated by comparing its diagnoses to those
previously made by the expert who supplied the rules. A set of 859
test cases from a historical data base was used. The data base
contained the gas analysis results, transformer descriptive
information, and the expert's detailed diagnoses (which had been
prepared several years before TOGA was built). None of the cases
were used in rule construction.

The comparison (Table I) shows that the expert system is an
excellent representation of the expert's decision-making process.


Table I. TOGA Validation Results

                       TOGA and Expert:
Transformer
Condition            Agreed      Disagreed

No Problem             651            0
Problem                204            4

One would also like to compare the diagnoses with the actual
transformer condition, and not just with the expert's previous
assessment of the condition. Unfortunately, this is usually not
possible: it is expensive to remove a transformer from service,
open it up, and determine its condition. However, this was done for
ten of the 208 "problem" cases. Engineers overhauled these
transformers and determined the nature and cause of their problems.
For all ten of these cases, both the expert system and the expert
had made the correct diagnosis.
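The agreement implied by Table I can be worked out directly. The short Python sketch below reads the "Problem" row as 204 agreed and 4 disagreed, which is the reading consistent with the 859 test cases and the 208 "problem" cases quoted in the text; the variable and function names are illustrative only.

```python
# Table I counts, with the "Problem" row read as 204 agreed and 4 disagreed.
AGREED = {"No Problem": 651, "Problem": 204}
DISAGREED = {"No Problem": 0, "Problem": 4}

def agreement_rate(condition=None):
    """Fraction of cases on which TOGA and the expert agreed."""
    keys = [condition] if condition else list(AGREED)
    agreed = sum(AGREED[k] for k in keys)
    total = agreed + sum(DISAGREED[k] for k in keys)
    return agreed / total

total_cases = sum(AGREED.values()) + sum(DISAGREED.values())
print(total_cases)                          # 859
print(f"{agreement_rate():.1%}")            # 99.5%
print(f"{agreement_rate('Problem'):.1%}")   # 98.1%
```

These figures measure agreement with the expert's earlier diagnoses, not correctness against the physical condition of the transformers, which was checked for only ten cases.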

Operational Use. TOGA is used daily by chemists in Radian's
analytical laboratory to screen the analytical results for
indications of possibly faulty transformers. This helps ensure that
HSB can take quick action when it is necessary, and also helps
Radian's chemists learn the relationship between various hydrocarbon
gas concentrations and the transformer condition. At HSB, TOGA is
also used to diagnose transformers and prepare reports, which are
sent to the transformer owner after being verified by the expert.

Using RuleMaster

Knowledge Extraction. Expert systems are usually used to solve
hard problems for which the solution methodology is not
documented. An expert is a person who can provide the highest
quality answers or advice for a specific problem domain. Unless the
expert routinely teaches the problem-solving method, he or she will
probably have difficulty in clearly describing the method.
RuleMaster provides an example-based knowledge input mechanism
which practicing experts find comfortable to use.

For TOGA, a top-down procedure was used to begin the knowledge
extraction process. The expert was interviewed to determine the
terminology and the coarse framework of the solution method. He
selected a set of 40 transformer test cases to jog the memory during
the expert system building process. Then a list of possible
top-level decisions or actions was generated to define the scope of
the expert system. This consisted of the list of possible diagnoses
(no problem, corona, arcing, ...) and the list of recommendations
(no action, resample at specified time, remove transformer from
service, ...).
Then the expert was asked about the factors used to arrive at
each decision. Sometimes the factors were raw data available from
the gas chromatograph, and sometimes the factors were intermediate
attributes yet to be defined (like presence or absence of thermally
generated hydrocarbon gases). Whenever a new quantity was
introduced, the expert was asked about the factors used to determine
it. This process was repeated recursively until eventually the
entire solution was described in terms of chromatograph data.

At this point, there was enough information to build a first
prototype. Each intermediate or final conclusion defined a
decision module. These modules were organized into a hierarchical
structure. Within each module, example table structures were
created. Based on the interviewing records, a first cut at the
example sets was entered. At this point, a running prototype expert
system existed.
The value of this approach is that a running expert system is
rapidly created, without forcing the expert to articulate a general
problem-solving procedure. The prototype system is available for
the iterative knowledge refinement process, which draws out more
details of the decision-making procedure from the expert to
gradually build a complete and tested expert system.

Knowledge Refinement. The first prototype is only a rough
approximation of the expert's decision strategy. Many details are
missing. Refinement of the prototype is accomplished by a continuation
of example-based learning steps.

For TOGA, the 40 test cases formed the basis of knowledge refinement.
The prototype was exercised for each case. Wrong advice, or correct
advice reached for the wrong reasons, indicated the need for changes
to the knowledge base. Whenever one of these errors was encountered,
that test case was stepped slowly through the prototype expert system
again. The point where the prototype's reasoning differed from the
expert's reasoning specified exactly where the knowledge base needed
to be changed.

With the problem localized in the module hierarchy, the fix was easy.
Usually, it required adding a single example (matching the test case)
or correcting an existing example. Sometimes the error pointed out the
need for more detail, as when two different conclusions could be
reached for the same example vectors. In these cases, the expert was
asked to provide a new attribute, which could distinguish between the
two conclusions. On rare occasions, the expert and knowledge engineer
noticed that the module hierarchy no longer seemed suitable. This
suggested a possible re-organization of the module structure.
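The two checks used during refinement (rerunning the test cases, and spotting identical example vectors that lead to different conclusions) can be sketched in Python. This is only an illustration under stated assumptions, not RuleMaster code: the attribute names, the example rows, and the helpers `find_conflicts` and `failing_cases` are hypothetical.

```python
from collections import defaultdict

# Hypothetical example table for one decision module. Each row pairs an
# attribute vector with the expert's conclusion; all names are illustrative.
EXAMPLES = [
    ({"thermal_gases": "present", "attribute_b": "high"}, "arcing"),
    ({"thermal_gases": "absent", "attribute_b": "low"}, "no problem"),
    ({"thermal_gases": "present", "attribute_b": "high"}, "corona"),
]


def find_conflicts(examples):
    """Return attribute vectors that are paired with more than one conclusion.

    Such a clash means the expert has to supply a new attribute that
    distinguishes between the competing conclusions.
    """
    by_vector = defaultdict(set)
    for attributes, conclusion in examples:
        by_vector[tuple(sorted(attributes.items()))].add(conclusion)
    return {vector: sorted(conclusions)
            for vector, conclusions in by_vector.items()
            if len(conclusions) > 1}


def failing_cases(predict, test_cases):
    """Return (attributes, expected, advice) for each case the prototype gets wrong."""
    failures = []
    for attributes, expected in test_cases:
        advice = predict(attributes)
        if advice != expected:
            failures.append((attributes, expected, advice))
    return failures


CONFLICTS = find_conflicts(EXAMPLES)
```

Here `CONFLICTS` flags the first and third rows, which share one attribute vector but disagree on the diagnosis. A check like `failing_cases` only finds wrong advice; correct advice reached for the wrong reasons still has to be found by stepping through the case with the expert.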
Learning from examples is especially effective because the knowledge
representation (in the form of example tables) is close to the way
that experts normally think about their field. The translation from
the expert's notation to a more abstract rule language is done by the
inductive learning algorithm. Not only can knowledge be generated and
tested effectively in the form of example sets, but colleagues in the
field of expertise will be able to easily and thoroughly understand
the reasoning incorporated in the system.
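To make the idea of translating an example table into rules concrete, here is a small decision-tree induction sketch in Python. It is a generic entropy-based (ID3-style) procedure chosen for illustration; it should not be read as RuleMaster's actual algorithm. The attributes `gas_a` and `gas_b` and the example rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of conclusions."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def induce(examples, attributes):
    """Build a decision tree from (attribute dict, conclusion) examples.

    Returns a conclusion (leaf) or a nested dict {attribute: {value: subtree}}.
    """
    labels = [conclusion for _, conclusion in examples]
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        # Attributes exhausted: fall back to the most common conclusion.
        return Counter(labels).most_common(1)[0][0]

    def remainder(attribute):
        # Weighted entropy left after splitting on this attribute.
        groups = {}
        for row, label in examples:
            groups.setdefault(row[attribute], []).append(label)
        return sum(len(g) / len(examples) * entropy(g) for g in groups.values())

    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row, _ in examples}):
        subset = [(r, c) for r, c in examples if r[best] == value]
        branches[value] = induce(subset, rest)
    return {best: branches}


def classify(tree, row):
    """Follow the tree until a leaf (a conclusion string) is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[row[attribute]]
    return tree


# Hypothetical example table; the conclusions reuse recommendation names
# from the text, the attributes are made up.
EXAMPLES = [
    ({"gas_a": "high", "gas_b": "low"}, "resample"),
    ({"gas_a": "high", "gas_b": "high"}, "remove from service"),
    ({"gas_a": "low", "gas_b": "low"}, "no action"),
    ({"gas_a": "low", "gas_b": "high"}, "no action"),
]
TREE = induce(EXAMPLES, ["gas_a", "gas_b"])
```

In this sketch the induced tree does not depend on the order in which the rows are listed, and the tree supplies the control flow that nobody had to write by hand.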

Programming Skills. One of the first steps in building a RuleMaster
expert system is creating the module hierarchy for the prototype.
This requires skill in top-down design and structured programming.
People without some coursework and experience in these computer
science disciplines tend to make mistakes and flounder at this stage.

For the majority of the iterative refinement process, however, only
minimal computer skills are required. The modules are small enough
that their logic can be easily understood by anyone familiar with the
application. Changes are usually limited to editing examples, and the
example ordering is not important. The inductive learning algorithm
automatically takes care of control flow. Most of knowledge refinement
can be done by anyone who knows a little editing and file management.
This is often the expert himself.

Therefore, additional programmers with highly specialized skills are
not required to add an expert reasoning capability to an existing
computer program. The programmers already on the project can also
build the expert system. Not only does this save money, but these
people understand the problem and are likely to do a better job than
someone whose primary interest lies elsewhere.

Conclusions. TOGA is an expert system built with RuleMaster which has
been validated and is in daily use. The primary benefit from building
TOGA is that the transformer diagnostic knowledge now exists in a form
which can be used to pass the skill on to a new generation of
engineers. HSB will not lose its transformer diagnosis capability when
the current expert retires. Other employees can use the expert system
to diagnose transformers, or they can learn the technique by studying
a written version of the knowledge base.

Other applications built with RuleMaster demonstrate additional
reasons for building expert systems.

WILLARD (3) is a severe storms forecasting expert system which can
obtain all input data from National Weather Service data lines. When
severe storm situations occur, forecasters become very busy and do not
have time to utilize all the data which is available. The expert
system can take over the routine portion of the forecasting, leaving
the experts free to focus on the more difficult and critical portions
of the analysis.

TURBOMAC (4) diagnoses faults in large rotating machinery, such as
power generation turbines. This expert system allows field engineers
to incorporate the reasoning of one of the top experts in vibration
diagnosis in their maintenance and operational decisions.

GloveAId (5) predicts the most effective glove materials to choose for
protection against hazardous chemicals. There are no established
experts in this field, because many of the protection effectiveness
measurements are just now being performed. The inductive learning
aspect of RuleMaster is used to help organize the data which is
available and to suggest which measurements should be performed next.

The objective of QualAId is to provide advice on how much and what
type of quality assurance (QA) and quality control (QC) is needed for
various types of environmental analyses. The purpose of this system is
to provide consistently good advice to chemists whose primary field of
expertise is other than QA/QC.


Literature Cited

1. Michie, D.; Muggleton, S.; Riese, C. E.; Zubrick, S. M. Proc. of
   the First Conference on Artificial Intelligence Applications; IEEE
   Computer Society: Washington, D.C., 1984; pp 591-7.
2. Quinlan, J. R. In Expert Systems in the Micro-electronic Age;
   Michie, D., Ed.; Edinburgh Univ. Press: Edinburgh, U.K.;
   pp 168-201.
3. Zubrick, S. M.; Riese, C. E. Proc. 14th Conf. on Severe Local
   Storms; American Meteorological Society: Boston, MA, 1985;
   pp 117-122.
4. Stuart, J. D.; Vinson, J. W. Proc. 1985 ASME International
   Computers in Engineering Conference; American Society of
   Mechanical Engineers: New York, N.Y.; Vol. II, pp 319-328.
5. Keith, L. H.; Stuart, J. D. In "Artificial Intelligence
   Applications in Chemistry"; Hohne, B.; Pierce, T., Eds.; ACS
   Symposium Series (in print); American Chemical Society:
   Washington, D.C., 1985.

RECEIVED January 17, 1986

3
A Rule-Induction Program
for Quality Assurance-Quality Control and Selection
of Protective Materials
L. H. Keith and J. D. Stuart

Radian Corporation, Austin, TX 78766-0948


Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch003

This chapter describes two prototype expert systems for chemical
applications being developed using RuleMaster (1). The first, QualAId,
is a traditional type of system that encodes existing knowledge on how
much and what type of quality assurance (QA) and quality control (QC)
is needed for various types of environmental analyses. The second,
GloveAId, is being developed to help select the best glove material(s)
for protection against a wide variety of hazardous chemicals. However,
unlike the former example, the knowledge base for selecting the best
glove materials is not yet known. Therefore, experimental data is
being subjected to the rule-induction process of RuleMaster and the
resulting correlations are examined and tested to help formulate the
rules which are, in turn, used to build the expert system.

QualAId

The prototype of QualAId currently in existence is one small part of
the total framework needed for a useful expert system. The objective
of QualAId is to provide advice on how much and what type of QA/QC is
needed for various types of environmental analyses. The rules for
determining these needs have been derived from the American Chemical
Society (ACS) publication, "Principles of Environmental Analysis," (2)
and from various protocols and recommendations of the U.S.
Environmental Protection Agency (EPA).

This particular demonstration module only incorporates decisions
involving analysis of volatile and semivolatile organic compounds from
water. These compounds are, by definition, volatile enough to be
separated by gas chromatography (GC). The complete expert system will
incorporate decisions based upon any type of chemical in any type of
matrix and will also be capable of providing advice specifically for
selected EPA methods commonly in use, i.e., EPA Methods 624, 625,
1624, 1625, the various non-mass spectrometric 600 Methods, etc.
(Figure 1).

0097-6156/ 86/ 0306-0031 $06.00/ 0


© 1986 American Chemical Society


[Figure 1a, flow diagram. Recoverable labels: Litigation (Yes/No);
Importance (high, medium, low); QA/QC; sampleAId Routine, methodAId
Routine, Inorganic Routine, and Specific Methods Routine, each paired
with an Advice box.]

Figure 1a. Diagram of Modules for QualAId Expert System (first half).


[Figure 1b, flow diagram. Recoverable module labels: Determine Extent
of Method Verification and Validation; Determine Number of Samples
Planned; Determine Probable Analyte Concentration Range.]

Figure 1b. Diagram of Modules for QualAId Expert System (second half).


The purpose of this expert system is to provide consistently good
advice in both the types and amounts of QA/QC to use. There are many
decisions to make and errors are very expensive in terms of time and
money.

The expert system is comprised of a series of modules encompassing the
many varied aspects of decision-making. Information from each of these
modules is available to other modules to make decisions where they
require interrelated knowledge.

For example, the first module, Confidence Level, is key to many of the
decisions that will be made in other modules. The first query by the
computer asks the user whether the resulting analytical data will be
used for enforcement or litigation actions. If the answer is "yes,"
then a high level of confidence will be needed and the user is advised
of this assignment. If the answer is "no," then the user is asked to
specify how important he/she views the accuracy and precision of the
data.
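The logic of the Confidence Level module can be sketched as a short Python function. This is an illustration rather than QualAId's code: the function name is hypothetical, and the mapping of the user's stated importance (high, medium, low) directly onto the confidence level is an assumption suggested by Figure 1a, not something the text states.

```python
def confidence_level(for_litigation, importance=None):
    """Return the confidence level needed for the analytical data.

    for_litigation: True if the data will be used for enforcement or
        litigation actions; this always yields "high".
    importance: the user's own rating of accuracy and precision, asked
        only when for_litigation is False. Using it unchanged as the
        confidence level is an assumption of this sketch.
    """
    if for_litigation:
        return "high"
    if importance not in ("high", "medium", "low"):
        raise ValueError("importance must be 'high', 'medium', or 'low'")
    return importance
```

For example, `confidence_level(True)` returns `"high"` without asking for an importance rating, while `confidence_level(False, "medium")` returns `"medium"`.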
Routines awaiting future development will provide advice on the best
analytical methodology and sampling procedures, QA/QC needs for
inorganic, nonvolatile organic, and selected methodologies (Figure 1).
For the present system, these are skipped and the routine for general
QA/QC advice for volatile organics in water is entered.

The second module, Method, involves determining the level of
verification and validation to which the user's methodology has been
subjected. Verification is the general process used to decide whether
a method in question is capable of producing accurate and reliable
data. Validation is an experimental process involving external
corroboration by other laboratories (internal or external) of methods
or the use of reference materials to evaluate the suitability of
methodology (1). A menu of choices includes: (1) the method has only
been verified, (2) the method has been both verified and validated, or
(3) the method has been neither verified nor validated.

The third module, Samples, queries the user for how many samples will
be taken and the fourth, Concentration, for the expected range of
probable concentration values. The choices of probable concentration
values are: (1) High [> 10,000 parts-per-billion (ppb)]; (2) Medium
[10-10,000 ppb]; or (3) Low [< 10 ppb]. The fifth module, Detector,
queries the user for the detector that will be used in conjunction
with the GC analysis (Figure 2).
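The three concentration ranges can be written down as a small Python helper. In QualAId the user picks a range from a menu; mapping a numeric estimate onto that menu, as done here, is an assumption made for illustration, and the function name is hypothetical.

```python
def concentration_range(ppb):
    """Classify an expected analyte concentration given in ppb.

    Boundaries follow the ranges quoted in the text:
    High > 10,000 ppb, Medium 10-10,000 ppb, Low < 10 ppb.
    """
    if ppb < 0:
        raise ValueError("concentration cannot be negative")
    if ppb > 10_000:
        return "High"
    if ppb >= 10:
        return "Medium"
    return "Low"
```

Both end points of the Medium range (10 ppb and 10,000 ppb) are treated as Medium, which is how the bracketed ranges read.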
The information from these five modules is then used to provide a
series of advisory statements relating to whether the user will or
will not meet the stated confidence levels and, if not, what the
options are.

Figure 3 is the resulting advice for an example of a good QA/QC match
with the user's needs. In this example, a high level of confidence was
established, the methodology was both verified and validated, and two
samples were to be taken and analyzed by gas chromatography-mass
spectrometry (GC-MS) at levels below 10 parts-per-billion (ppb). These
conditions might be typical of analyses for
2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) in polluted water.


We need to establish what instrument you plan to use for the analysis.

Since the compound(s) you are analyzing are sufficiently volatile to be
separated by gas chromatography, I am assuming that you will use a GC for
your separations. Here are the detector choices we have to consider:
a = Mass spectrometer (general purpose)
b = Hall detector (in the halogen, nitrogen or sulfur mode)
c = Flame photometric detector (for phosphorous)
d = Photo ionization detector (for olefins and aromatics)
e = Electron capture detector (for pesticides, halogens, etc.)
f = Flame ionization detector (general purpose)

Choose one of these by typing the corresponding menu letter: [a,b,c,d,e,f] a

Figure 2. Queries for the Module Detector.


Advice:

You have your methodology both verified and validated as required for
measurements needing a high level of confidence. But, you must also
assure that your analyst is experienced in performing the type of
analysis you need, that you have standards for the analytes available,
and that you have a written quality assurance plan that documents good
laboratory practice.

You indicated that you need a high level of confidence in your data
and GC-MS is an instrument combination that will provide you with the
capability of generating that degree of quality. You should also use
capillary GC columns if at all possible in order to obtain the best
chromatographic resolution. When you have analytes that are
completely, or almost completely resolved from other compounds, the
quality of the mass spectra generated will be better. You will also
need documentation that the mass spectrometer is operating and is
tuned up correctly and that the resolution obtained from the gas
chromatograph meets your requirements. These requirements should be
met with QC standards for each instrument prior to analyzing your
samples.

Less than 10 ppb is the most difficult range of analytes to quantify
with a high degree of confidence. Quantitative measurements which do
not fall within the "Region of Quantitation" cannot be reported. This
region is defined as resulting from signal measurements which are 10
or more standard deviation units above the background noise. In order
to meet this requirement at levels below 10 ppb, very good cleanup
procedures are usually needed in addition to capillary GC columns and
experienced analysts. All measurements reported must also be within
the range of your calibration curves. If you cannot meet these
stringent requirements, then you will probably have to lower your
desired level of confidence in the data.

For a high level of confidence you will need to have both "field" and
"method" blanks. Field blanks are blanks from a similar source that do
not contain the analytes of interest. Control sites (uncontaminated
sites) are used to obtain field blanks and if field blanks are not
available, every effort should be made to obtain blank samples that
best simulate a sample that does not contain the analyte (such as a
simulated or synthetic field blank). Your method blanks will consist
of all solvents, resins, etc. that you will use for extracting,
concentrating and cleaning up the samples prior to analysis. You may
want about half of these unspiked and the remainder spiked with known
levels of your analyte standards. Similarly you may want to spike
about half of your field blanks with known levels of your analyte
standards so that any matrix effects will be identified during the
analysis. This plan would provide you with:

• 25% unspiked field blanks for control samples,
• 25% spiked field blanks for monitoring matrix effects,
• 25% unspiked method blanks for workup/instrument QC, and
• 25% spiked method blanks for workup/instrument QC.

The total number of blanks you would need, based on the number of
samples you plan to take, is: 4.

Advice: This will give you the required quality control

(RETURN continues)

[Note: 4 blanks were recommended even though only 2 samples were planned.]

Figure 3. Example of Advice Provided for a Good QA/QC Match with User
Needs.
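Two of the numerical rules quoted in the Figure 3 advice lend themselves to a short Python sketch: the "Region of Quantitation" test (a signal at least 10 standard deviation units above background) and the even four-way split of blanks. The function names are hypothetical, and the way QualAId derives the total number of blanks from the number of samples is not given in the text, so the total is simply taken as an input here.

```python
def in_region_of_quantitation(signal, background_mean, background_sd, factor=10.0):
    """True if the signal is at least `factor` standard deviations above background."""
    if background_sd <= 0:
        raise ValueError("background standard deviation must be positive")
    return (signal - background_mean) / background_sd >= factor


def blank_plan(total_blanks):
    """Split the blanks evenly (25% each) over the four categories in the advice."""
    if total_blanks <= 0 or total_blanks % 4 != 0:
        raise ValueError("this sketch needs a positive multiple of 4 blanks")
    share = total_blanks // 4
    return {
        "unspiked field blanks": share,
        "spiked field blanks": share,
        "unspiked method blanks": share,
        "spiked method blanks": share,
    }
```

With the four blanks recommended in Figure 3, `blank_plan(4)` assigns one blank to each category.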


GloveAId

GloveAId is an expert system being developed for the National
Toxicology Program. It has been programmed to choose from seven glove
materials the one most likely to provide the greatest protection at
the cheapest cost against a variety of chemicals. Chemical input is
selected by choosing one of seventeen chemical classes. Glove
tactility needs and the desired amount of protection (in units of
minutes) are also input. The computer provides advice as to the
probable best glove to select and, if none meet requested criteria, it
advises the best choice it has available and explains the limitations
of that choice with respect to the user's request. Factors used in
making the decisions include: chemical class, molecular weight,
volatility (boiling point), reaction with glove materials (weight
change), tactility and glove cost.

The prototype GloveAId system was developed using a data base
generated from chemical permeation measurements performed at Radian.
Experimental data from these tests were entered into a Lotus 1-2-3
spreadsheet and sorted by all classifiable respects in order to make
visual correlations with the protective character of seven different
glove materials. The data base consisted of 90 chemicals with
associated physical properties (molecular weight, boiling point and
linearity of the molecule), chemical class and measurements of
breakthrough times, steady-state permeation rate and degradation
characteristics. The latter consisted of percent weight change when a
piece of the material was immersed in the test chemical for four
hours. Each of the chemicals was tested against all seven glove
materials for weight change but only against four of the glove
materials for breakthrough and permeation rate data so that 1,300
measured values and 540 associated pieces of information were
available. Visual correlation of this data produced the protective
rating approximations listed in Table I.

It is time consuming and difficult for humans to make visual
comparisons of a numerical data set and draw the simplest possible
correlations between them; the larger the data set, the more difficult
this is to do. A lot of time and effort was expended to make the
approximate evaluations listed in Table I. When the data set is a
dynamic one, i.e., it is changing due to the addition of new data, it
simply adds to this problem. However, one strength of computer usage
is that such tasks can be performed with ease and, when this
capability is coupled to the ability to induce correlations or "rules"
from a data set, an extremely powerful tool for evaluating data is
created. This second way of evaluating the data is currently being
pursued and is described in more detail in the next section.

[Table I: protective rating approximations for the glove materials by
chemical class. The table body is not recoverable from this copy.]

The ratings in Table I are based only on the safety aspects of the
glove materials; i.e., protection from exposure to chemicals as
indicated by the majority of breakthrough times observed within the
members of a chemical class. However, tactility is often an additional
important ergonomic factor; it is impossible to perform delicate tasks
with thick, bulky gloves. Tactility of the gloves was rated
subjectively using a dime. If the features of a dime could be readily
felt through the glove, it was assigned a rating of "very good." If
the features were not very distinguishable through the glove, but the
dime could be easily picked up from a flat surface, it was assigned a
"moderate" rating. If the dime was picked up with difficulty from a
flat surface, the rating of the glove was "poor."
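The dime test amounts to a three-way decision rule, which can be written as a tiny Python function. The function and parameter names are hypothetical; the two yes/no observations stand in for the tester's subjective judgment.

```python
def tactility_rating(features_felt, picked_up_easily):
    """Rate glove tactility from the two dime-test observations.

    features_felt: the features of the dime can be readily felt through the glove.
    picked_up_easily: the dime can be easily picked up from a flat surface.
    """
    if features_felt:
        return "very good"
    if picked_up_easily:
        return "moderate"
    return "poor"
```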
Finally, the approximate costs of gloves from the seven materials were
also considered. All of this information was provided to the computer
with rules for prioritizing the choices, i.e., safety first, tactility
second, and cost third.

An introductory screen is printed after log-on into GloveAId. This is
followed by a menu from which the chemical class is chosen. Multiple
functional groups cannot be handled by the system yet. The user is
then queried for the amount of time that the glove needs to provide
protection. This is followed by a request for the tactility
requirements of the user.

The final screen summarizes the answers given to the computer and
provides the best advice possible from the information and rules
supplied (Figure 4). In this example, there were no gloves that met
the user's needs, so the computer provided the next best choices. The
recommended materials are a moderately tactile (nitrile) glove with
probable short protection time or a thick (butyl rubber) glove with
poor tactility but probable good protective properties. When safety
and tactility requirements can be met, then the most cost-effective
choice is provided.
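The prioritization just described (safety first, tactility second, cost third) can be sketched in Python as follows. The two gloves and their values are taken from the Figure 4 example; everything else is an assumption made for illustration, including the `Glove` class, the `recommend` function, the numeric tactility ranks, and the ordering used for the fallback list. It is not the GloveAId rule base.

```python
from dataclasses import dataclass

# Tactility ratings from the dime test, ranked from worst to best.
TACTILITY_RANK = {"poor": 0, "moderate": 1, "very good": 2}


@dataclass(frozen=True)
class Glove:
    material: str
    protection_min: float  # probable lower bound on protection time, minutes
    tactility: str
    cost: float            # approximate cost per pair, dollars


# Values as reported in the Figure 4 example (aldehyde class).
GLOVES = [
    Glove("Nitrile", 5, "moderate", 3.00),
    Glove("Butyl Rubber", 200, "poor", 10.00),
]


def recommend(gloves, required_min, required_tactility):
    """Return (requirements_met, gloves to suggest).

    If any glove meets both the safety and tactility requirements, the
    cheapest such glove is returned. Otherwise all gloves are returned,
    ordered by protection time, then tactility, then cost (an assumed
    reading of "safety first, tactility second, cost third").
    """
    need = TACTILITY_RANK[required_tactility]
    suitable = [g for g in gloves
                if g.protection_min >= required_min
                and TACTILITY_RANK[g.tactility] >= need]
    if suitable:
        return True, [min(suitable, key=lambda g: g.cost)]
    ranked = sorted(gloves, key=lambda g: (-g.protection_min,
                                           -TACTILITY_RANK[g.tactility],
                                           g.cost))
    return False, ranked
```

Calling `recommend(GLOVES, 200, "moderate")` reproduces the situation in Figure 4: neither glove satisfies both requirements, so both are offered as the closest alternatives.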
This prototype expert system is currently being tested by comparing
the GloveAId predictions before a chemical is tested with the best
gloves after their performance has been documented. To date, 62
additional chemicals have been tested. Fifty-six (56) of these (90%)
had one or more gloves correctly predicted by the expert system.
Although this is good for a prototype system, we are striving to
improve the percentage of total choices. Often, more than one glove
material will have very good breakthrough protection. For example,
with the 62 chemicals, there were a total of 132 gloves with good
performance. The expert system correctly advised only 60 of these
(45%).
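The two percentages quoted above measure different things, which a few lines of Python make explicit. The variable names are hypothetical; the counts are those given in the text.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Chemicals for which at least one well-performing glove was predicted.
chemical_hit_rate = percent(56, 62)

# Well-performing gloves (across all 62 chemicals) that the system advised.
glove_hit_rate = percent(60, 132)
```

Rounded to whole numbers these give the 90% and 45% reported in the text. The first counts a chemical as a success if any one good glove was named; the second counts every good glove individually, which is why it is so much lower.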

Rule Induction

The easiest way to describe the rule induction capabilities of
RuleMaster is to demonstrate its use with a relevant set of data. This
data consists of information on a series of nonhalogenated aromatic
compounds which were tested with five different glove materials. An
arbitrary protective rating was assigned to each test based on the
following breakthrough times:

Protective Rating    Breakthrough Time

Very Poor            <5 min.
Poor                 5 - <15 min.
Fair                 15 - <100 min.
Good                 100 - <200 min.
Best                 >= 200 min.
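The rating scale above maps directly onto a threshold lookup. The following Python sketch uses the standard-library `bisect` module; the function and constant names are hypothetical.

```python
from bisect import bisect_right

# Lower bounds (minutes) at which the rating steps up, per the table above.
THRESHOLDS = [5, 15, 100, 200]
RATINGS = ["Very Poor", "Poor", "Fair", "Good", "Best"]


def protective_rating(breakthrough_min):
    """Return the protective rating for a breakthrough time in minutes."""
    if breakthrough_min < 0:
        raise ValueError("breakthrough time cannot be negative")
    # bisect_right counts how many thresholds are <= the observed time.
    return RATINGS[bisect_right(THRESHOLDS, breakthrough_min)]
```

A breakthrough time exactly on a boundary falls into the higher class, so 5 minutes rates "Poor" and 200 minutes rates "Best", matching the "<" and ">=" signs in the table.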

Readily available information for each of the compounds consisted of
molecular weight and boiling point. In addition, the compounds were
assigned a linear (1) or non-linear (0) shape. In this example all
compounds were assigned (0) for shape designation since aromatic
compounds are not linear. Other measured data included the steady
state permeation rate and percent weight gain or loss. The permeation
rate usually exhibits the reverse trend from the breakthrough
time (i.e., as breakthrough time increases the permeation rate
usually decreases). Therefore, permeation rates were not included
in the data set since they seldom result in a different relative
protective rating than would be derived from breakthrough times.
However, weight gain or loss is a good indication of a chemical
either reacting with the protective material or being absorbed into
it.

    The specific glove and protection requirements are:

        Chemical type is aldehyde
        Protection time requirement in minutes is 200
        Tactility requirement is moderate tactile

    There are no glove materials in the data base
    meeting the requirements that you specified. The
    closest are:

        Nitrile
        Approximate cost is $3.00 per pair of gloves.
        Protection time is probably greater than 5 minutes
        Tactility is moderately tactile

        Butyl Rubber
        Approximate cost is $10.00 per pair of gloves.
        Protection time is probably greater than 200 minutes
        Tactility is not tactile

Figure 4. Example of Summary and Advice from GloveAId When User
Needs Are Not Met.

RuleMaker, a subsystem of RuleMaster, induces rules for all
situations from examples that may cover only some of the cases. At
the heart of the induction process is the creation of an induction
file, which in part includes examples indicating what the expert
system should do under different circumstances. Now, in the example
above, THE RULES FOR CORRELATING VARIOUS CHEMICAL AND PHYSICAL
PARAMETERS OF THE HAZARDOUS CHEMICALS TESTED WITH THE PROTECTIVE
ABILITY OF THE SELECTED GLOVE MATERIALS ARE NOT KNOWN; THEY WILL
HAVE TO BE INDUCED FROM THE ANALYTICAL DATA.

The RuleMaster induction file produced from the example data
set is shown in Figure 5. The name given to this induction file
module is "class10". The STATE in a module is essentially the name
of sub-modules that will carry out actions within a module. In this
simple example there are no sub-modules so the name given to the
state is "only".

The CONDITIONS section of the module consists of descriptions
of the various parameters upon which a decision will be based.
Each line in the conditions section is made up of three parts:

• the name of the decision parameter (for example, glove,
  molecular weight, etc.);

• the specified method of determining the parameter's value
  (for example, the statement [integer.read "What is the
  molecular weight?"] means the computer will display that
  question and will expect a numerical answer from the
  user); this part is denoted using square brackets; and

• the allowable values for the parameter. However, in this
  case we don't know what the allowable values for the
  parameter are, so any value is allowed by typing the word
  "integer". Later, after rules have been defined and the
  allowable values are known, they can be used to replace
  any integer. This will be an important part of the second
  phase when the expert system is refined to include this
  knowledge.

The experimental data in this illustration comprise the actual
rule base for RuleMaker. The first column of data in the EXAMPLES
section of Figure 5 consists of the glove material tested. The
second column of data consists of the molecular weights, and the
third column consists of the boiling points in degrees centigrade.



    /* CLASS: AROMATIC NOT HALOGENATED */

    MODULE: class10

    STATE: only
    CONDITIONS:
      glove   [ask "What is the glove type?"
               "Butyl_Rubber,Neoprene,Nitrile,PVA,PVC,Viton"]
              {Butyl_Rubber Neoprene Nitrile PVA PVC Viton}
      molwt   [integer.read "What is the molecular weight?"]  integer
      boilpt  [integer.read "What is the boiling point?"]     integer
      shape   [integer.read "What is the shape?"]             integer
      change  [integer.read "What is the percent change?"]    integer
    EXAMPLES:
      Butyl_Rubber  130  185  0   56  => (good,GOAL)
      Butyl_Rubber  106  144  0  180  => (fair,GOAL)
      Butyl_Rubber  106  138  0  181  => (fair,GOAL)
      Butyl_Rubber  106  136  0   80  => (fair,GOAL)
      Butyl_Rubber  106  133  0  188  => (fair,GOAL)
      Neoprene      148  183  0   64  => (fair,GOAL)
      Nitrile       148  183  0   11  => (best,GOAL)
      Nitrile       106  136  0   95  => (fair,GOAL)
      Nitrile       106  144  0   60  => (poor,GOAL)
      Nitrile        78   80  0   58  => (fair,GOAL)
      Nitrile       106  138  0   77  => (fair,GOAL)
      Nitrile       106  138  0   82  => (fair,GOAL)
      Nitrile       130  185  0   63  => (fair,GOAL)
      PVA           130  185  0   62  => (best,GOAL)
      PVA           148  183  0    0  => (best,GOAL)
      PVA           106  138  0    3  => (best,GOAL)
      PVA           106  144  0    0  => (best,GOAL)
      PVA           106  136  0    0  => (fair,GOAL)
      PVA            78   80  0    0  => (fair,GOAL)
      PVA           106  138  0    0  => (best,GOAL)
      PVC            78   80  0   40  => (very_poor,GOAL)
      PVC           106  138  0    8  => (very_poor,GOAL)
      Viton         106  138  0    1  => (best,GOAL)
      Viton         106  138  0    1  => (best,GOAL)
      Viton         148  183  0    0  => (best,GOAL)
      Viton          78   80  0    3  => (best,GOAL)
      Viton         130  195  0    0  => (best,GOAL)
      Viton         106  136  0    0  => (best,GOAL)
      Viton         106  144  0    1  => (best,GOAL)
    ACTIONS:
      best       [advise "This glove has a *best* rating."]
      good       [advise "This glove has a *good* rating."]
      fair       [advise "This glove has a *fair* rating."]
      poor       [advise "This glove has a *poor* rating."]
      very_poor  [advise "This glove has a *very poor* rating."]

Figure 5. Induction Module for Nonhalogenated Aromatic
Compounds. The symbol => means "then".


The fourth column pertains to the designation of a non-linear shape
(0), and the last column of data lists the percent change in weight
(gain or loss) when the material is soaked in the test chemical for 4
hours. The data within a row are associated with a specific compound,
but the compounds were listed in random order within a glove material
group in order to emphasize an important feature of RuleMaker:
information (data) can be entered as it is thought of. This is
an extremely important (and powerful) difference between RuleMaster
and other artificial intelligence programs, which are written in a
highly structured interrelative fashion. The powerful inductive
logic of RuleMaker enables this limitation to be ignored, and this
frees the user to add, change, or delete example data which influence
the rulemaking logic easily and at will. This feature is very
important when working with a growing, changing data base.

The part of the example to the right of the arrow (=>) is an
action/next-state pair. It indicates what will happen when the
specified combination of condition values occurs. In this example
the action is the designation of the relative protection of the
material (good, fair, etc.), and the next state is the word "GOAL",
which indicates that the goal of the module will have been reached
when the action section of the module has been carried out and the
computer can exit this particular module. Since there is only one
module in this simple example, the program would then end.
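
The structure of a single example line can be made concrete with a short sketch. The Python below is not part of RuleMaster; the function name `parse_example` and the field names are assumptions chosen to mirror the CONDITIONS names in Figure 5. It simply splits one example into its condition values and its action/next-state pair.

```python
# Illustrative only: RuleMaster reads the induction file itself.
# Field names follow the CONDITIONS section of Figure 5.
FIELDS = ("glove", "molwt", "boilpt", "shape", "change")


def parse_example(line):
    """Split one EXAMPLES line into (conditions, action, next_state)."""
    left, right = line.split("=>")
    tokens = left.split()
    # The first field is a name; the remaining four are integers.
    conditions = {FIELDS[0]: tokens[0]}
    for name, token in zip(FIELDS[1:], tokens[1:]):
        conditions[name] = int(token)
    # The right-hand side is the pair "(action, next_state)".
    parts = right.strip().strip("()").split(",")
    action, next_state = [part.strip() for part in parts]
    return conditions, action, next_state


conditions, action, next_state = parse_example(
    "Butyl_Rubber 130 185 0 56 => (good, GOAL)"
)
print(conditions["glove"], conditions["molwt"], action, next_state)
```

Here the action is `good` and the next state is `GOAL`, which in the single-module example means the program ends after the advice is given.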

The ACTIONS section of the module consists of two parts:

• the action keyword, corresponding to the action named on the
  right-hand side of each line in the EXAMPLES section, and

• the action that is to be carried out (for example, to
  advise the user by a print on the screen and/or a printer
  that "This glove has a *best* rating.").

After the information in the induction module is entered, the
program is assembled by the computer. During this phase, two
actions take place automatically with no further input from the
user:

1. Rules are induced from the examples given the computer,
   and

2. The actual program for running the computer is COMPILED
   AND WRITTEN by the computer itself!

These two actions by the computer are key to the success of
this project. This is because it would be impossible for a human to
consider all the possibilities of a large data set and to deduce the
best (simplest and therefore most cost-effective) rules for
choosing the best protective materials. And when the
data base is dynamically growing it would be impossible to use a
highly structured artificial intelligence system where the user had
to rewrite the program modifications himself every time there was a
change in the information.
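
To give a feel for what inducing a rule from examples involves, the following is a minimal sketch of a single numeric split. It is not RuleMaker's algorithm: the function name `best_threshold`, the choice of midpoints between adjacent observed values as candidate thresholds, and the error count used to rank them are all assumptions typical of decision-tree learners.

```python
def best_threshold(examples):
    """Return (threshold, errors) for the best split "value < threshold".

    examples is a list of (value, label) pairs. Candidate thresholds are
    the midpoints between adjacent distinct values (an assumption).
    Returns None when all values are identical.
    """
    values = sorted({value for value, _ in examples})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        below = [label for value, label in examples if value < threshold]
        above = [label for value, label in examples if value >= threshold]
        errors = 0
        for side in (below, above):
            # Every example outside the majority class of its side is an error.
            errors += len(side) - max(side.count(label) for label in set(side))
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best


# Butyl rubber rows of Figure 5: (molecular weight, rating).
butyl = [(130, "good"), (106, "fair"), (106, "fair"),
         (106, "fair"), (106, "fair")]
print(best_threshold(butyl))  # (118.0, 0)
```

For the butyl rubber rows the only candidate threshold is a molecular weight of 118, midway between 106 and 130, and it separates the examples without error.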


The rules induced by the computer are shown in Figure 6. The
program which the computer wrote for itself (in "Radial", which is
similar to a C-type language) is shown in Figure 7. Both of these
abbreviated notations say the same thing which, in English, is as
follows:

"The rules induced from the example data given are:

1. If the glove material is PVC, the rating is VERY POOR.

2. If the glove material is nitrile, and compounds have a
   molecular weight <118 and boiling points >=142°C, the
   rating is POOR.

3. If the glove material is neoprene, the rating is FAIR.

4. If the glove material is PVA, and the compounds have a
   molecular weight <92, or if the molecular weight is 92-118
   and the boiling point is <137°C, the rating is FAIR.

5. If the glove material is butyl rubber, and compounds have
   a molecular weight <118, the rating is FAIR.

6. If the glove material is nitrile, and the compounds have a
   molecular weight >=118 and <139, or if the molecular
   weight is <118 and the boiling point is <142°C, the rating
   is FAIR.

7. If the glove material is butyl rubber, and the compounds
   have a molecular weight >=118, the rating is GOOD.

8. If the glove material is Viton, the rating is BEST.

9. If the glove material is nitrile and the molecular weight
   is >=139, the rating is BEST.

10. If the glove material is PVA and the molecular weight is
    >=118, or if the molecular weight is 92-118 and the boiling
    point is >=137°C, the rating is BEST."
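
The same rules can be written as an ordinary function, which makes them easy to try on individual cases. The sketch below is a hand transcription of the decision tree in Figure 6 into Python, not output from RuleMaster; the function name `rate_glove` is an assumption. The three successive boiling-point tests for nitrile (137, 139, 142) all lead to "fair" below 142, so they are collapsed into one test here.

```python
def rate_glove(glove, molwt, boilpt):
    """Return the protective rating following the tree in Figure 6."""
    if glove == "Butyl_Rubber":
        return "fair" if molwt < 118 else "good"
    if glove == "Neoprene":
        return "fair"
    if glove == "Nitrile":
        if molwt < 92:
            return "fair"
        if molwt < 118:
            # Tests at 137 and 139 in Figure 6 also give "fair".
            return "fair" if boilpt < 142 else "poor"
        return "fair" if molwt < 139 else "best"
    if glove == "PVA":
        if molwt < 92:
            return "fair"
        if molwt < 118:
            return "fair" if boilpt < 137 else "best"
        return "best"
    if glove == "PVC":
        return "very_poor"
    if glove == "Viton":
        return "best"
    # Unlike the generated program in Figure 7, which treats any
    # remaining answer as Viton, this sketch rejects unknown names.
    raise ValueError(f"unknown glove material: {glove!r}")


print(rate_glove("Nitrile", 106, 144))  # poor
print(rate_glove("PVA", 106, 138))      # best
```

Both calls reproduce the ratings given for the corresponding rows of Figure 5.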

It is interesting to compare these rules with the first rules
that were estimated with no help from RuleMaster. These were the
rules used to construct the first prototype expert system, GloveAId,
for non-halogenated aromatic compounds:

1. If the glove material is PVC, the rating is VERY POOR.

2. If the glove material is nitrile, the rating is POOR.

3. If the glove material is butyl rubber, the rating is FAIR.

4. If the glove material is PVA, the rating is FAIR.

5. If the glove material is neoprene, the rating is FAIR.

6. If the glove material is Viton, the rating is BEST.

    <class10>

    0 (all states)
    1 only
      [glove]
        Butyl_Rubber : [molwt]
            <118  : => ( fair, GOAL )
            >=118 : => ( good, GOAL )
        Neoprene : => ( fair, GOAL )
        Nitrile : [molwt]
            <92  : => ( fair, GOAL )
            >=92 : [molwt]
                <118 : [boilpt]
                    <137  : => ( fair, GOAL )
                    >=137 : [boilpt]
                        <139  : => ( fair, GOAL )
                        >=139 : [boilpt]
                            <142  : => ( fair, GOAL )
                            >=142 : => ( poor, GOAL )
                >=118 : [molwt]
                    <139  : => ( fair, GOAL )
                    >=139 : => ( best, GOAL )
        PVA : [molwt]
            <92  : => ( fair, GOAL )
            >=92 : [molwt]
                <118 : [boilpt]
                    <137  : => ( fair, GOAL )
                    >=137 : => ( best, GOAL )
                >=118 : => ( best, GOAL )
        PVC : => ( very_poor, GOAL )
        Viton : => ( best, GOAL )

    The induced rule has 11 test nodes and 16 leaf nodes.

Figure 6. The Induced Rules for Nonhalogenated Aromatic
Compounds. The following are meanings assigned to symbols:
[...] means "If ... is"; => means "then"; and a colon means
"and".

    MODULE: class10

    STATE: only
    IF [ask "What is the glove type?"
        "Butyl_Rubber,Neoprene,Nitrile,PVA,PVC,Viton"] IS
      "Butyl_Rubber" : IF [ integer.read "What is the molecular weight?" < "118" ] IS
          "T" : [ advise "This glove has a *fair* rating.", GOAL ]
          ELSE [ advise "This glove has a *good* rating.", GOAL ]
      "Neoprene" : [ advise "This glove has a *fair* rating.", GOAL ]
      "Nitrile" : IF [ integer.read "What is the molecular weight?" < "92" ] IS
          "T" : [ advise "This glove has a *fair* rating.", GOAL ]
          ELSE IF [ integer.read "What is the molecular weight?" < "118" ] IS
            "T" : IF [ integer.read "What is the boiling point?" < "137" ] IS
                "T" : [ advise "This glove has a *fair* rating.", GOAL ]
                ELSE IF [ integer.read "What is the boiling point?" < "139" ] IS
                  "T" : [ advise "This glove has a *fair* rating.", GOAL ]
                  ELSE IF [ integer.read "What is the boiling point?" < "142" ] IS
                    "T" : [ advise "This glove has a *fair* rating.", GOAL ]
                    ELSE [ advise "This glove has a *poor* rating.", GOAL ]
            ELSE IF [ integer.read "What is the molecular weight?" < "139" ] IS
              "T" : [ advise "This glove has a *fair* rating.", GOAL ]
              ELSE [ advise "This glove has a *best* rating.", GOAL ]
      "PVA" : IF [ integer.read "What is the molecular weight?" < "92" ] IS
          "T" : [ advise "This glove has a *fair* rating.", GOAL ]
          ELSE IF [ integer.read "What is the molecular weight?" < "118" ] IS
            "T" : IF [ integer.read "What is the boiling point?" < "137" ] IS
                "T" : [ advise "This glove has a *fair* rating.", GOAL ]
                ELSE [ advise "This glove has a *best* rating.", GOAL ]
            ELSE [ advise "This glove has a *best* rating.", GOAL ]
      "PVC" : [ advise "This glove has a *very poor* rating.", GOAL ]
      ELSE [ advise "This glove has a *best* rating.", GOAL ]

    GOAL OF class10

Figure 7. The Computer-Generated Program for Using the Rules Induced by RuleMaker in
an Expert System for Advising Glove Materials To Be Used for Protection Against
Nonhalogenated Aromatic Compounds.

As can be seen by a comparison of the two rule sets, the one
induced by RuleMaster has significantly more refinement to it and
will come much closer to making accurate predictions than the
human-induced rule set.

It is useful to display these rules as a series of bar charts
in order to be able to view them in relation to one another. This
is presented in Figure 8 so that the human-induced ranges can be
compared to the ranges induced by RuleMaster. It is readily seen
that there is good agreement between the two ranges in that all of
the initial human assignments are still present in the RuleMaster
assignments. The notable difference is that there is considerably
more refinement to the possible choices in the RuleMaster chart.
The significance is that, based on the simpler human-induced rules, if
long term protection (more than 1 hour) was needed for working with
nonhalogenated aromatics, Viton was the only good choice. However,
Viton gloves are not only very expensive ($30 a pair), but they have
poor tactility, so work involving much dexterity is precluded when
wearing them. With the RuleMaster information new possibilities are
now available for consideration:

• If the compounds have molecular weights >=139 then nitrile
  may be used; nitrile gloves offer greater tactility and
  they are much less expensive than Viton.

• If the molecular weight of the compounds is >=118, or is
  between 92 and 118 with boiling points of 137°C or higher,
  then PVA may be used; PVA gloves have no better tactility
  properties than Viton gloves but they are cheaper so the
  expenses could be lowered.

Thus, the rules induced by RuleMaster offer possibilities for
reducing cost and allowing more dextrous work to be performed than
would have been available using the human-induced rules.

The important caveat to remember, however, is that the computer
has produced the best rules possible from the data it was given and
has extended those rules to cover examples past that data set where
possible. Thus, until proven with a sufficient number of examples
any set of rules must always be viewed simply as the best ADVICE
available. There can always be "outliers" caused by additional
factors that have not yet been discovered.

Once the computer has induced the rules governing a particular
set of complex data then it is easy for a human to check and see if
they are true. This can be done in two ways:

1. a simple Rule Table can be constructed, and

2. additional known examples can be analyzed to challenge the
   rules and see if they hold true; if they don't then
   additional data is given the computer so that modified rules
   can be induced.
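
The second kind of check, challenging a rule set with known examples, can be sketched in a few lines. The code below is illustrative and not part of RuleMaster: the names `HUMAN_RULES`, `human_rule`, and `find_mismatches` are assumptions. It applies the human-estimated rules listed earlier (one rating per glove material) to a handful of rows taken from Figure 5 and reports the rows where the rule and the observed rating disagree.

```python
# Human-estimated rules: one rating per glove material.
HUMAN_RULES = {
    "PVC": "very_poor",
    "Nitrile": "poor",
    "Butyl_Rubber": "fair",
    "PVA": "fair",
    "Neoprene": "fair",
    "Viton": "best",
}

# A few rows from Figure 5: (glove, molwt, boilpt, observed rating).
EXAMPLES = [
    ("Butyl_Rubber", 130, 185, "good"),
    ("Neoprene", 148, 183, "fair"),
    ("Nitrile", 148, 183, "best"),
    ("Nitrile", 106, 144, "poor"),
    ("PVA", 130, 185, "best"),
    ("PVC", 78, 80, "very_poor"),
    ("Viton", 148, 183, "best"),
]


def human_rule(glove, molwt, boilpt):
    """Rate by glove material alone, ignoring the other parameters."""
    return HUMAN_RULES[glove]


def find_mismatches(rule, examples):
    """Return (glove, molwt, boilpt, observed, predicted) where they differ."""
    mismatches = []
    for glove, molwt, boilpt, observed in examples:
        predicted = rule(glove, molwt, boilpt)
        if predicted != observed:
            mismatches.append((glove, molwt, boilpt, observed, predicted))
    return mismatches


for row in find_mismatches(human_rule, EXAMPLES):
    print(row)
```

On these rows the material-only rules miss the three cases in which molecular weight changes the rating (butyl rubber, nitrile, and PVA at higher molecular weights); these are the kind of examples that would be given back to the computer so that modified rules can be induced.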


[Figure 8: bar charts of protective rating based on breakthrough
time (scale running up to Good and Best) for each glove material,
in two panels: Human Induced Protective Ranges and RuleMaker
Induced Protective Ranges.]

Figure 8. Protective Ranges of Six Glove Materials Against
Nonhalogenated Compounds.


An example of the Rule Table that can be constructed from this
data set is Table II.

Now, once the Rule Table is constructed it is easy to check the
data again and visualize these relationships; that is, to verify
that they are true. But remember the lack of obvious relationships
when the example data was first examined.

The use of artificial intelligence, and specifically a rule
inductive program such as RuleMaster, is an excellent way that
meaningful relationships can be derived from the large and diverse
mass of data being produced. The use of artificial intelligence in
this way is referred to as "knowledge manufacturing". Thus, the
strongest features of a computer (to remember and correlate large
numbers of data) and humans (to be creative and to use reasoning
capabilities beyond that of a computer) are being used to solve very
complex problems.

Summary

In summary, RuleMaster is an expert system building package intended
to solve many of the problems involved in the construction of large
knowledge based programs. Its inductive learning system (RuleMaker)
allows rapid and effective acquisition of expert knowledge. The
Radial language allows structured organization of large quantities
of knowledge. Radial also provides a facility for presenting
ordered explanation of reasoning to any level of elaboration
required.

Use of an expert system in conjunction with a statistical
program for pattern recognition such as Ein*Sight or SIMCA is a
concept that offers an excellent probability of success in (1)
finding, (2) ordering, and (3) using the most selective chemical and
physical parameters for choosing the best protective materials to
use with a wide variety of hazardous chemicals. No other program
can be used both to help develop the rules needed for analysis of a
complex data base (by induction) and then to use these rules in a
logic sequence to provide a diagnostic decision. Furthermore, the
basis of any and all decisions made by the computer is completely
available on demand so that it can easily be checked and/or
verified.

The first prototype system used rules which were derived as
"best estimates" from a data base of about 1300 tests using 90
different chemicals. However, the prototype system is being revised
using computer-generated rules. Thus, it is becoming "smarter" and
better as its data base and the resulting rules derived from it are
expanded. Using a computer to evaluate large masses of data is not
novel, but using it to help generate rules by an inductive logic
process from large masses of data is an important new achievement.
One of the significant advantages of this expert system will be a
consistent, unbiased interpretation of the data in a rapid manner
once the expert system has been developed. And lastly, RuleMaster
is structured so that it is easy to add, change, or delete data from
the expert system so that it can continue to grow and improve with
use and experience. These features will be invaluable as the data
base continues to grow and change.


TABLE II. RULE TABLE FOR NONHALOGENATED AROMATIC COMPOUNDS

Glove Material   Rating   Condition
--------------   ------   --------------------------------------
BuR              Good     MW >= 118
BuR              Fair     MW < 118
Neoprene         Fair     (all cases)
Nitrile          Best     MW >= 139
Nitrile          Fair     MW < 118 and bp < 142,
                          or MW >= 118 and MW < 139
Nitrile          Poor     MW < 118 and bp >= 142
PVA              Best     MW >= 118,
                          or MW >= 92 and MW < 118 and bp >= 137
PVA              Fair     MW < 92,
                          or MW >= 92 and MW < 118 and bp < 137
Viton            Best     (all cases)
PVC              V Poor   (all cases)

MW  = Molecular weight
bp  = Boiling point
BuR = Butyl rubber
PVA = Polyvinyl acetate
PVC = Polyvinyl chloride


RECEIVED January 15, 1986

4
A Chemistry Diagnostic System for Steam Power Plants

James C. Bellows

Westinghouse Electric Corporation, Orlando, FL 32817

A diagnostic system for the steam system
chemistry of utility power plants is described.
It is an expert system which accepts data
from a monitoring system and generates
recommendations for action to improve the
chemistry of the plant. The monitors collect
data from important points in the steam
cycle. Data is transferred to a central data
center for transmission to a centralized
diagnostic center. At the diagnostic center,
the monitor readings are validated before
being used in the diagnosis of the power
plant. Recommendations are transmitted to the
data center for display. The removal of a
malfunctioning sensor from consideration is
given as an example of the operation of the
system.

Downtime at a steam power plant can be valued at as much as $1
million/day. The actual value depends upon the size of the plant
and the cost of replacement power. For 1000 MW nuclear plants,
such as those that supply approximately 50% of Chicago's
electricity, the $1 million/day is fairly accurate. One of the major
causes of downtime, especially unscheduled downtime, is corrosion
due to improper steam and water chemistry. Replacement of
corroded turbine blading often requires downtime of a month or
more. Replacement of corroded nuclear steam generators has
required on the order of 9 months. The chemistry of power plants
will be briefly reviewed. The goals of the chemistry diagnostic
system will be stated. The supporting monitoring system will be
briefly described, and capabilities of the current diagnostic
system described. The scheme for diagnosing monitors and removing
erroneous data from plant diagnosis will be outlined, and an
example of a sensor malfunction diagnosis will be given.

0097-6156/86/0306-0052$06.00/0
© 1986 American Chemical Society


Power Plant Chemistry

A power plant may be viewed as a chemical plant which has taken
by-product sale to the limit. It recycles the product and sells
only by-product. By most standards, it is a chemical plant, full
of reactor vessels, piping, pumps, and tanks. Since the principal
product is not a chemical, however, people tend to forget that a
power plant is a chemical plant. Chemistry has often been the
entry level position, and people were promoted to janitor. The
materials are generally chosen to optimize heat transfer and
mechanical strength and are not optimized for compatibility.
Figure 1 shows a simplified schematic of a power plant.

The condenser may consist of copper bearing alloys, such as
aluminum bronze, admiralty brass, and Muntz metal. Titanium is
also used, as are stainless and carbon steels. The purpose of the
condenser is to act as a sink for approximately 2/3 of the heat
produced in the boiler. The feedwater heaters are steam to liquid
heat exchangers and have steel or copper alloy tubing and usually
carbon steel shells. Restricting the discussion to fossil plants for
simplicity, the boiler has carbon and alloy steels which are
chosen for resistance to the 1000-1250°F thermal conditions more
than for corrosion resistance. The high pressure and intermediate
pressure turbines must be designed to operate with inlet
temperatures of the same range. The low pressure turbine operates
between approximately 700°F and 100°F. The final stages of the
low pressure turbine are supersonic. A large fossil turbine will
be over 10 feet in diameter and weigh on the order of 250,000
lb. It rotates at 3600 rpm. The centrifugal stresses in the last
stages of the low pressure turbine dictate high strength alloys in
the same region that concentrated salt solutions can form.

There are two fundamental types of boilers: once through and
recirculating. In the case of once through boilers, all the
feedwater is converted to steam as it passes through the boiler in
essentially a plug flow regime. Most once through boilers are
supercritical pressure (3500 to 4500 psi), so the distinction
between liquid and vapor is lost. In a recirculating boiler,
pressures are limited to about 2800 psi, and steam is separated
from liquid in a steam drum. The liquid is recirculated back to
the bottom of the boiler and the steam is superheated. In both
boilers, the steam is reheated after it is passed through the high
pressure turbine. Occasionally a second reheat after the
intermediate pressure turbine is found.

Which type of boiler is present in the system has a
significant influence on the fundamental chemistry used in the
plant. In once through boilers, no solids can be used so All
Volatile Treatment (AVT) is employed. AVT consists of extremely
pure water with the addition of ammonia, or other volatile amine,
for pH control, and hydrazine for oxygen scavenging. The exact
concentration of ammonia is chosen to minimize corrosion of the
feedwater heaters and depends upon the alloys used in their
construction. The hydrazine feed rate is determined by the amount
of oxygen in the feedwater. In recirculating boilers at pressures
over 1500 psi, the AVT regime is used, but a solid conditioning
agent may be added to the recirculating water in the boiler. The
purpose of this conditioning agent is to control pH and to
precipitate impurities in compounds which do not adhere to the
boiler surfaces. In the United States, this is usually a mixture
of disodium and trisodium phosphate. In other countries, sodium
hydroxide may be used. In both types of boilers the fundamental
chemistry problems are to avoid oxygen and to avoid deposits of
chemicals on the heat transfer surfaces. Failure to avoid either
of these conditions leads to corrosion, and ultimately to rupture
of boiler tubing.

[Figure 1: simplified schematic of a power plant.]

Considerable dissolution of salts may occur in the high
pressure steam. As the steam density decreases through the
turbine, the solubility of the salts decreases, and the salts
deposit on the turbine. Two categories of deposition exist.
Alkali metal hydroxides form stable water solutions at all
pressure and temperature conditions within the turbine. Sodium
hydroxide concentrations can be as high as 90%. These
concentrated hydroxide solutions lead to rapid stress corrosion
cracking of turbine materials and must be rigorously avoided. The
second case is represented by sodium chloride, which deposits as a
solid throughout most of the turbine. However, salts elevate the
boiling point of water enough that near the transition from
superheated to saturated steam, a region exists in which salt
solutions of 30% are stable. Sodium chloride solutions of this
concentration at temperatures of 100°C are quite corrosive and
lead to stress corrosion and corrosion fatigue of turbine alloys.

The problem of power plant chemistry becomes one of
determining which sources of chemicals are active at any given
time and whether the purification systems are working properly.
The condenser is a common source of impurities. On one side is
the steam, which must be maintained pure to a few parts per
billion; on the other is cooling water, which may be sea water.
The condenser will commonly consist of tens of thousands of tubes,
each of which is sealed to two tube sheets. The sum of all the
leaks must be on the order of pints per day. The condenser and
some other parts of the system operate below atmospheric
pressure. Oxygen and carbon dioxide from air leaking into the
system represent significant contaminants. Condensate polishers
are large ion exchange units which remove trace impurities in the
feedwater. They must be operated properly, or they may add more
impurities than they remove. Deaerating heaters remove dissolved
gases from the feedwater. Finally, a drum boiler is a still, and
the efficiency of liquid-vapor separation is critical to the
purity of the steam.

Definitions

At least in the power industry, the terms "monitoring" and
"diagnostics" are often used interchangeably or without careful
definition. Much confusion can arise when these terms are used. For
purposes of this paper, these terms and the terms "expert system"
and "malfunction" will be defined here.


Monitoring. Collection and manipulation of data for control and
diagnostic purposes. Monitoring systems will have sensors to
measure temperatures, pressures, flow rates, and the
concentrations of chemicals in streams. They may store this
information in a computer or a data logger. They may perform
transformations on the data, such as conversion of voltages to
engineering units, computation of averages, and comparison of two
values. They may provide alarms when variables are beyond
acceptable limits. Monitoring systems may even plot graphs.

Diagnosis (Diagnostics). Determination of condition and specific
cause of this condition. A diagnostic system determines that all
conditions are as they would be expected, or that there is some
malfunction of a component that is causing an undesired
condition. A diagnostic system may also generate recommendations
for correction of an undesired condition.

Expert System. A computer reasoning system based on rules
generated by questioning experts in a given field. Expert systems
generally consist of three parts: a rule entering and editing
program, a rule base, and an inference engine which takes data and
applies the rules to it to reach conclusions about the system
which generated the data.

Malfunction. Any condition in which a piece of equipment or
system is imperfect for any reason. Examples might be exhausted
condensate polishers, leaky condenser tubes, and sensors for which
the power has been unintentionally turned off. This definition of
malfunction includes deterioration of equipment due to normal
wear. By this definition, worn out car brakes are a malfunction.

Goals of the Diagnostic System

The goal of an artificial intelligence diagnostic system is to
provide the available expert advice to the user, in a time that is
probably faster than the human could deliver it. A number of
decisions have been made about the scope of the system which
should be stated here. One goal was to use only on-line monitors,
since only then could the system be responsible for the quality of
the data. The quality of manual analyses varies in unpredictable
ways, and we chose not to depend upon it. If one is to work with
on-line sensors as the primary source of data, then the validity
of sensors must be determined within the system. Since on-line
monitors are expensive, there is a corollary goal of a minimum
number of sensors consistent with diagnosis of important equipment
malfunctions and sensor condition. The diagnosis must be done
centrally so that experience gained from one plant can be
immediately available to other plants by rapid revision of the
diagnostic rules. Chemistry upsets in power plants generally
require several hours to develop, so transmission once or twice
per day to the Diagnostic Center would usually be adequate to
detect upsets which were developing slowly. To handle the upsets
that were faster than the regular transmission would detect, it
was decided that the data gathering computer at the plant should
be sophisticated enough to determine that something is happening
and make a special transmission of data at that time. The monitor
set has been chosen to allow high reliability diagnosis of common
power plant conditions, but it will support some unusual
conditions as well. Those unusual conditions are included in the
diagnostic system simply because the supporting data are
present. Finally, it was decided that no information which might
be relevant, including manual analysis data, should be rejected
completely, and manual entry points have been included for that
data. Manual entry of data requires validation of the data before
entry.

Monitoring System

In order for any diagnostic system to draw valid conclusions about
the condition of a plant, it must have an appropriate monitoring
system for gathering data. The monitoring system chosen is shown
schematically in Figure 1. Sensors are placed on the influent and
effluent streams of each chemically active component of the
plant. Thus, by looking at changes in concentrations from
condensate to condensate polisher effluent, as well as the
concentrations in the polisher effluent, one determines the
effectiveness of the polishers. For the chemical feed, the
polisher effluent is the influent to the zone, and the final feed
(economizer inlet) is the effluent. The sensor set is kept as
small as is reasonable, consistent with high certainty of sensing
malfunctions of the plant and of the sensors. The sensors used
are given in Table I.
The output of the sensors is transmitted to a data center in
the plant, which stores the data. Normally the data are
transmitted to the central diagnostic center at least once per day, and
the diagnosis is returned to the data center for display. The
data center also computes rates of change of variables. The data
and rates of change are compared with alarm limits and a more
sensitive limit, which we call a diagnosis activation limit. If
the diagnosis activation limit is reached, a special transmission
of data is made immediately so that a diagnosis may be made
immediately. It is believed that this is a suitable compromise
among the expense of continuously on-line diagnostics, the need
for immediate diagnosis of an upset, and the need to keep the
diagnostic system centralized to allow rapid improvement in
diagnosis as experience with the automated system develops.
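
The limit checking performed by the data center can be illustrated with a minimal sketch. It is not the plant software: the class and function names, the two-point rate calculation, and all numeric limits are hypothetical and chosen only for illustration. It assumes that each variable and each rate of change has an alarm limit and a lower, more sensitive diagnosis activation limit.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical limits for one monitored quantity."""
    activation: float  # diagnosis activation limit (the more sensitive one)
    alarm: float       # conventional alarm limit


def rate_of_change(previous: float, current: float, minutes: float) -> float:
    """Two-point rate of change, in units per minute (an assumed method)."""
    return (current - previous) / minutes


def classify(value: float, limits: Limits) -> str:
    """Return 'alarm', 'activate', or 'normal' for a value or a rate."""
    if value >= limits.alarm:
        return "alarm"
    if value >= limits.activation:
        return "activate"
    return "normal"


def needs_special_transmission(value: float, rate: float,
                               value_limits: Limits,
                               rate_limits: Limits) -> bool:
    """A special transmission is made as soon as either the value or its
    rate of change reaches the diagnosis activation limit."""
    return (classify(value, value_limits) != "normal"
            or classify(rate, rate_limits) != "normal")


# Illustrative numbers only: a reading that is still below its own
# activation limit but is rising quickly enough to trigger a transmission.
value_limits = Limits(activation=0.30, alarm=0.50)
rate_limits = Limits(activation=0.005, alarm=0.02)
rate = rate_of_change(previous=0.20, current=0.26, minutes=10.0)
send_now = needs_special_transmission(0.26, rate, value_limits, rate_limits)
```

Checking the rate as well as the value is what lets a fast upset be reported before the ordinary alarm limit is reached.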

Diagnostic System

Diagnosis is accomplished by the expert system. The central part
of the expert system is the rule base. The rule base consists of
ideas, called nodes, and rules which interconnect them as shown in
Figure 2. The upper node is the evidence; the lower node is the
conclusion. The rule between them will state that if the evidence
is known to be true with absolute certainty, then the conclusion
will be known to be true (or false) with a specific confidence.
This aspect of the rule is known as sufficiency. The rule will
also state that if the evidence is known to be false with absolute
certainty, then the conclusion will be known to be false (or true)
with another specific confidence. This aspect of the rule is
known as necessity. The sufficiency and necessity need not be
equal. There are many times when something may indicate the
presence of a condition but not be a necessary consequence of that
condition. The increases in monitor readings that occur at the
start of malfunctions are good examples of indicators which will
signal the presence of a malfunction, but when the malfunction
becomes stable at some severity, the increase will no longer be
present. Of course a high value for the monitor reading will then
be present. Evidence may be sensors or the conclusions from other
rules. Several rules may support a single conclusion and the same
evidence may be used for several rules.

Table I. Sensors for a once-through boiler

Sample locations (table columns, in order): Condensate Pump
Discharge; Condensate Polisher Effluent; Economizer Inlet;
Hot Reheat; Makeup

Sensor Description        Locations marked
Cation Conductivity       X X X X
Specific Conductivity     X X X X
Sodium                    X X X X
Chloride                  X X X X
Dissolved Oxygen          X X X
Hydrazine                 X
pH                        X X X
Silica                    X X
Air Exhaust               X
Makeup Flow
Electrical Load           X
Figure 2. Basic Step in an Expert System Rule Base. (Node 1, the
evidence, is connected by a rule to Node 2.)

When the system is used to diagnose the power plant
chemistry, the inference engine will activate all the rules for
which evidence exists. Thus all possible conclusions are examined
simultaneously. All the evidence for and against all conclusions
is always considered. Diagnosis of simultaneous malfunctions
occurs simply because evidence for those malfunctions exists. The
structural details of our expert system have been published
elsewhere (1).
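
The roles of sufficiency and necessity, and the activation of every rule for which evidence exists, can be made concrete with a short sketch. It is only an illustration under stated assumptions: the chapter does not give the numerical method, so the linear scaling of a rule by the confidence in its evidence and the MYCIN-style formula used to combine several rules are assumptions, and the node names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    evidence: str       # name of the evidence node
    conclusion: str     # name of the conclusion node
    sufficiency: float  # confidence in the conclusion if evidence is certainly true
    necessity: float    # confidence the conclusion is false if evidence is certainly false


def rule_contribution(rule: Rule, evidence_conf: float) -> float:
    """evidence_conf runs from -1 (certainly false) to +1 (certainly true).
    Linear scaling between the two extremes is an assumption."""
    if evidence_conf >= 0:
        return rule.sufficiency * evidence_conf
    # evidence_conf is negative here, so a positive necessity lowers confidence.
    return rule.necessity * evidence_conf


def combine(a: float, b: float) -> float:
    """MYCIN-style combination of two confidences in [-1, 1] (assumed)."""
    if a >= 0 and b >= 0:
        return a + b - a * b
    if a < 0 and b < 0:
        return a + b + a * b
    denom = 1 - min(abs(a), abs(b))
    return (a + b) / denom if denom else 0.0


def diagnose(rules: List[Rule], evidence: Dict[str, float]) -> Dict[str, float]:
    """Apply every rule whose evidence is available; all conclusions are
    evaluated in the same pass."""
    conclusions: Dict[str, float] = {}
    for rule in rules:
        if rule.evidence not in evidence:
            continue
        contribution = rule_contribution(rule, evidence[rule.evidence])
        previous = conclusions.get(rule.conclusion, 0.0)
        conclusions[rule.conclusion] = combine(previous, contribution)
    return conclusions


# Hypothetical rules: a rising reading suggests a leak strongly, but its
# absence says little (low necessity), as with the start of a malfunction.
rules = [
    Rule("cation conductivity rising", "condenser leak", 0.6, 0.1),
    Rule("condensate sodium high", "condenser leak", 0.5, 0.4),
]
result = diagnose(rules, {"cation conductivity rising": 1.0,
                          "condensate sodium high": -1.0})
```

Because each rule carries its own pair of values, an indicator such as a rising reading can support a conclusion strongly when present without counting heavily against it when absent.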
The nodes and rules for an expert system are based on expert
judgements. Usually, definitive statistics are unavailable for
the relationships between ideas, but experts have a good feel for
the relationships. We have found that when the information in the
rule set is broken down into small enough steps, experts tend to
have substantial agreement concerning the sufficiency and
necessity of evidence to a given conclusion. It is quite common
to find that what an experienced engineer considers to be one step
is in fact several. When the rule base is constructed, the small
steps are used. The use of small steps promotes clarity in the
rule base and, at times, provides experts with new insights.
Since the diagnostic process must be broken down into small steps,
the process of building the rule base is much like that of
training an able, but rather ignorant person.
It has been arbitrarily decided to say that any malfunction
for which there is less than 30% confidence is probably not
present with enough severity to cause concern. Between 30% and
50% confidence, one should be concerned that the malfunction may
be developing. This represents an early warning, but with
increased possibility of error. Between 50% and 70% confidence,
action is appropriate to confirm or disconfirm the presence of the
malfunction by collecting additional information, if necessary.
Above 75% confidence, a plant malfunction is present with enough
confidence that action ought to be taken to correct the
malfunction. Action on a sensor malfunction indication should
take place above 50% confidence, since by that time the system has
lost substantial sensitivity to the plant malfunctions supported
by the sensor.

Results for a Fossil Once-Through Steam System

There are currently over 50 malfunctions of a fossil once-through
steam system that can be diagnosed. Some of these malfunctions
are listed in Table II. It will be noted that some of these
malfunctions occur as sets of related malfunctions. In some cases
the members of the set are mutually exclusive, as in the ammonia
feed malfunctions. In other cases, such as contaminated makeup,
there is a malfunction which can be broken into smaller, more
detailed malfunctions. The system can diagnose a variety of
sensor malfunctions as well. The diagnosable malfunctions related
to each sensor are shown in Table III. To accomplish these
diagnoses, the system contains over 1300 rules. To test the
system we have made use of whatever monitoring data we have had
accessible. This has consisted of Steam Purity Analyzer System
(2) data which is single location data, Total Plant Survey (3)
data which is system wide but grab sample, and such plant data as
has been accumulated from record reviews and diagnostic
missions. None of these data sets conforms exactly to the monitor
system that is envisioned as input to the diagnostic system, so
estimates of values of other data which were deemed necessary to
test the system have been used. A number of diagnoses use the
rates of change of variables. Since the only available continuous
(one minute interval) data sets were for the Steam Purity Analyzer
System, these data sets were sometimes moved to other locations to
test the sensitivity of plant malfunction confidences to sensor
malfunctions which were known to be in the data. Where
intermediate values between grab samples or discrete readings were
necessary, they were either linearly interpolated with time or
made proportional to a sample for which continuous data were
available.
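
The two ways of filling in intermediate values can be sketched as follows. This is an illustration only; the function names and the numbers are hypothetical, and the proportional method is written under the assumption that the ratio between the estimated variable and the continuously monitored one stays constant.

```python
def interpolate_linear(t0: float, v0: float, t1: float, v1: float,
                       t: float) -> float:
    """Estimate a value at time t on the straight line between two
    grab samples (t0, v0) and (t1, v1)."""
    if t1 == t0:
        raise ValueError("the two samples must be taken at different times")
    fraction = (t - t0) / (t1 - t0)
    return v0 + fraction * (v1 - v0)


def scale_to_reference(sample_value: float, reference_at_sample: float,
                       reference_now: float) -> float:
    """Estimate a value by keeping it proportional to a continuously
    monitored reference variable (constant ratio is an assumption)."""
    return sample_value * reference_now / reference_at_sample


# Hypothetical grab samples at 600 and 720 minutes since midnight.
midpoint = interpolate_linear(600.0, 2.0, 720.0, 3.0, 660.0)

# Hypothetical reference that has risen from 0.20 to 0.30 since the sample.
scaled = scale_to_reference(2.0, 0.20, 0.30)
```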
Table II. Representative Malfunction Groups for a Once-Through
Boiler System

Malfunction Numbers    Description
1                      Condenser cooling water leak
2-8                    Contaminated makeup
9-12                   Air in leakage
13-17                  Polisher malfunctions
18-31                  Ammonia feed malfunctions
32-45                  Hydrazine feed malfunctions
46-47                  Contaminated feed chemicals
48-51                  Organic contamination
52                     Contaminated boiler

Table III. Number of Malfunctions Diagnosed for each Sensor

Sensor                   Number of Distinct Malfunctions
Cation Conductivity      5
Specific Conductivity    4
pH                       4
Dissolved Oxygen         3
Sodium                   2

One of the important tasks of the system is to diagnose the
sensor malfunctions and remove the malfunctioning sensors from
consideration in the plant diagnosis scheme. Chemical sensors are
high maintenance and high malfunction rate devices. If they were
not removed from consideration when they malfunction, they could
generate spurious plant malfunction diagnoses and discredit the
diagnostic system. The task of removing a malfunctioning sensor
from consideration is accomplished by taking the confidence that
there is a malfunction in the sensor and using it to drive a rule
that changes the sufficiency and necessity of rules coming from
that sensor so that the plant diagnostic system is less sensitive
to the information coming from that sensor. This is shown
schematically in Figure 3. Data from the sensor and other sensors
are used to drive rules that diagnose the sensor of interest
(4). The results of these rules are accumulated in a sensor
malfunction node. The confidence in this node is used to drive
rules which alter the sufficiency and necessity of other rules.
The scheme is analogous to setting a valve point based on the
values of a number of sensors.
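
A minimal sketch of this derating step is given here. The shape of the function is an assumption: the chapter states only that sensitivity starts at 100% and that the sensor is practically ignored once the sensor malfunction confidence reaches 70%, so a straight line between those two points is used for illustration. The function names are hypothetical.

```python
from typing import Tuple


def evaluation_factor(sensor_malfunction_conf: float,
                      cutoff: float = 0.7) -> float:
    """Sensitivity multiplier for rules fed by one sensor: 1.0 when there is
    no confidence in a sensor malfunction, falling linearly (assumed) to 0.0
    at the cutoff confidence and staying there."""
    clipped = min(max(sensor_malfunction_conf, 0.0), cutoff)
    return 1.0 - clipped / cutoff


def derate_rule(sufficiency: float, necessity: float,
                sensor_malfunction_conf: float) -> Tuple[float, float]:
    """Scale both the sufficiency and the necessity of a rule that takes
    its evidence from the suspect sensor."""
    factor = evaluation_factor(sensor_malfunction_conf)
    return sufficiency * factor, necessity * factor


# Hypothetical rule values, evaluated as confidence in the sensor
# malfunction grows from 0% to 35% to 70%.
healthy = derate_rule(0.8, 0.4, 0.0)
suspect = derate_rule(0.8, 0.4, 0.35)
ignored = derate_rule(0.8, 0.4, 0.7)
```

Scaling the rule strengths, rather than discarding the sensor outright, lets the plant diagnosis lose sensitivity to the sensor gradually as the evidence against it accumulates.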
An example of sensor validation and removal from
consideration illustrates the working of the diagnostic system.
The malfunctioning sensor is an acid cation exchanged conductivity
monitor, commonly called "cation conductivity." It consists of a
cation exchange resin in the hydrogen form followed by a
conductivity meter. The cation exchange resin removes ammonia
from the sample stream and the resulting conductivity provides a
good estimate of total ionic content, except for hydroxide. The
monitor is very sensitive to most of the impurities that are
important to power plants. However, when the cation exchange
resin is exhausted, the monitor reverts to a specific conductivity
monitor and the output is dominated by the ammonia concentration.
Figure 4 shows a test of the diagnostic system for an
incident of resin exhaustion for a cation conductivity sensor.
The data are a combination of real and synthesized plant data and
are given in Table IV. The condensate values for the condensate
sensor are those recorded during the actual exhaustion of the
cation resin column at a plant installation. The steam and
polisher effluent values would be reasonable based on the starting
value of the real sensors. All of the sensor values other than
the condensate cation conductivity were held constant to clearly
show the effect of the single malfunction on the confidence in
other malfunctions. Of particular interest is the condenser tube
leak, which has great sensitivity to the value of the cation
conductivity of the condensate. One sees that the confidence in
the sensor malfunction, the exhaustion of the resin in the cation
conductivity sensor, parallels the increase in the cation
conductivity reading. At first, the confidence in the condenser
tube leak also parallels the increase in the cation
conductivity. However, as the confidence in the malfunction of
the cation conductivity sensor increases, the confidence in the
tube leak peaks at 30% and declines with further increase in the
confidence in the sensor malfunction. The value of the evaluation
function, which is used to reduce the sensitivity of the plant
diagnosis to the malfunctioning sensor, is shown at the bottom of
Figure 4. It starts at 100% sensitivity and declines as the
sensor malfunction becomes more certain. By the time the sensor
malfunction confidence has reached 70%, the plant diagnosis
practically ignores the sensor. Of course, if the malfunction had
been a condenser leak, the condensate sodium would have increased
at the same time as the cation conductivity. The rule base would
have recognized this occurrence, the confidence in a malfunction
of the cation conductivity sensor would have been substantially
reduced, and the confidence in the condenser leak would have
increased due to the increases in both the cation conductivity and
the sodium concentration.

Table IV. Sensors related to a cation conductivity resin
exhaustion incident

Sensor                                     Value        Data Source
Condensate cation conductivity             See Fig. 4   Real
Condensate specific conductivity           7.87         Real
Condensate sodium                          2.2          Real
Makeup addition                            Off          Estimated
Steam cation conductivity                  0.17         Estimated
Steam specific conductivity                7.8          Estimated
Steam sodium                               2.1          Estimated
Polisher effluent specific conductivity    0.16         Estimated

Note: These values were held constant to show the effect of the
variation in the single variable.

Figure 3. Block Diagram of Sensor Validation. (Blocks: sensor and
other sensors; sensor diagnostics; sensor malfunction;
interpretation; validated interpretation; plant equipment
malfunctions.)

Figure 4. Cation Conductivity Sensor Malfunction: Data,
Confidences, and Evaluation Function. (Panels: condensate cation
conductivity; confidence, CF, in condensate cation conductivity
resin exhaustion; confidence, CF, in condenser leak; evaluation
function, SF. Horizontal axis: time, minutes since midnight.)

Data Center Displays

The data center displays the diagnosis and a number of different
utility screens. Figure 5 is a picture of the RECOMMENDATION
SUMMARY screen. It shows the actions which are most important to
improving the chemistry of the unit at the current time. They are
listed in priority based on confidence in the existence of the
malfunction and on the seriousness of the consequences of the
malfunction at its current severity. On the data center screen,
the recommendations are color coded, with red recommendations
having a confidence in the underlying malfunction of at least
70%. Yellow indicates 30-70% confidence, and green indicates
0-30% confidence. The rectangles on the right hand edge and along
the bottom are touch buttons to allow access to other screens.
They blink if new information is available on those screens.
Their color is determined by the color of the most urgent
information on the screen. The RECOMMENDATION screen shown in
Figure 6 displays the action, a cryptic reason for taking the
action, and the consequences of not taking action. The
consequences are as specific as the current state of knowledge
will allow.
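
The color coding and priority ordering can be sketched as below. The color boundaries follow the percentages quoted in the text; everything else is an assumption made for illustration. In particular, the text does not say how confidence and seriousness are combined into a priority, so a simple product is used, the seriousness weights and confidences are invented, and a confidence of exactly 30% is arbitrarily treated as yellow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Recommendation:
    action: str
    confidence: float   # confidence in the underlying malfunction, 0 to 1
    seriousness: float  # hypothetical 0-to-1 weight for the consequences


def color_for(confidence: float) -> str:
    """Red at 70% and above, yellow from 30% up to 70%, green below 30%."""
    if confidence >= 0.70:
        return "red"
    if confidence >= 0.30:
        return "yellow"
    return "green"


def prioritize(recommendations: List[Recommendation]) -> List[Recommendation]:
    """Most urgent first; the confidence-times-seriousness score is assumed."""
    return sorted(recommendations,
                  key=lambda r: r.confidence * r.seriousness,
                  reverse=True)


def button_color(recommendations: List[Recommendation]) -> str:
    """A touch button takes the color of the most urgent item on its screen."""
    order = ["green", "yellow", "red"]
    colors = [color_for(r.confidence) for r in recommendations]
    return max(colors, key=order.index) if colors else "green"


# Invented confidences and weights, for illustration only.
recs = [
    Recommendation("Repair air leak", 0.75, 0.4),
    Recommendation("Repair condenser leak", 0.55, 0.9),
    Recommendation("Regenerate polisher vessel", 0.80, 0.7),
]
ordered = [r.action for r in prioritize(recs)]
```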

Experience with a Generator Diagnostic System

Although the subject is hardly chemistry, it would be appropriate
to make mention of a companion project in the diagnosis of
conditions in electrical generators. Such a system is in
operation for seven power plants in Texas. While yet in the
prototype stage, it detected a developing generator malfunction
several hours before any alarms sounded. Since the malfunction
was known, appropriate resources could be mobilized before the
generator was taken out of service, and the problem was repaired
in four days. This particular malfunction normally requires two
to three weeks for repair when it is allowed to progress to the
point where the automatic generator control systems take the
generator out of service. Working with a customer in the final
stages of the development of the generator system has influenced
many decisions in both the generator and the chemistry diagnostic
systems.

Figure 5. Recommendation Summary Screen. The screen lists:
  Find and repair air leak above hotwell waterline within 100 hr.
  Reduce load and repair leak in condenser section 2 within 24 hr.
  Remove polisher vessel #3 from service and regenerate within 8 hr.
Touch buttons on the right hand edge: Select Unit, Diagnostic
Summary, Diagnostic Procedures, Explanation. A further row of
touch buttons runs along the bottom.

Figure 6. Recommendation Screen. The screen reads:
  ACTION: Remove polisher vessel #3 from service within 8 hr.
  and regenerate.
  REASON: There are significant and increasing acid
  concentrations in the boiler feedwater and steam.
  CONSEQUENCES (INACTION): Continued operation with significant
  acid concentrations will lead to acid corrosion of the boiler
  tubing and the turbine blading and steeples. Damage can be
  significant in 48 hr.

Summary

An artificial intelligence system for the chemistry of a fossil
once-through steam system has been constructed. It is based on
on-line monitors. It diagnoses both sensor and plant malfunction
and removes malfunctioning sensors from diagnosis of plant
malfunctions. The system has been tested off-line using real and
synthesized power plant data and is now ready for testing in a
plant.

Acknowledgments

The assistance of C. T. Kemper and S. Lowenfeld with the
artificial intelligence and of numerous Westinghouse Engineers in
providing data and information for the rule base is gratefully
acknowledged.
Portions of this paper were previously presented at the 45th
International Water Conference, Pittsburgh, Pa., October 22-24,
1984 and permission to republish them is gratefully acknowledged.

Literature Cited

1. Fox, M. S.; Kleinosky, P.; Lowenfeld, S. Proc. 8th Internat.
   Joint Conf. Artificial Intelligence, 1983, p. 158.

2. Bellows, J. C.; Carlson, G. L.; Pensenstadler, D. F. J.
   Materials Energy Systems 1983, 5, 43.

3. Peterson, S. H.; Bellows, J. C.; Pensenstadler, D. F.; Hickam,
   W. M. Proc. 40th Internat. Water Conf. 1979, p. 201.

4. Gonzalez, A. J.; Osborne, R. L.; Bellows, J. C. "On-Line
   Diagnosis of Instrumentation through Artificial
   Intelligence," presented at ISA Power Industry Symposium, New
   Orleans, La., May 1985.

RECEIVED January 10, 1986

5
A Real-Time Expert System for Process Control

Lowell B. Hawkinson, Carl G. Knickerbocker, and Robert L. Moore

LISP Machine Inc., Los Angeles, CA 90045

Expert systems technology can provide improvements in
analysis of process information, intelligent alarming,
process diagnosis, control and optimization of processes.
However, to realize these benefits, a real-time
expert system capability is required. A program
design is described which supports forward and
backward chaining inference in a real-time environment,
with dynamic measurement data. The knowledge
base for the program is implemented in structured
natural language form for application to a broad range
of process expert systems. Plant test results are
described.

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch005

In the real-time application of expert systems, a number of design
considerations, beyond those usually considered in expert systems,
become important. Execution efficiency is a prime consideration.
In conventional expert systems, the facts and knowledge upon which
the inference is based are static. In the industrial application,
the facts or process measurements are dynamic. In an industrial
application there may be several thousand measurements and alarms
which may significantly change in value or status in a few minutes.
The problem posed by an operator advisor, to give expert
diagnosis of plant condition and to recommend emergency actions or
economic optimization adjustments, illustrates these real-time
requirements. Some of the plant conditions which can occur
include:

1. Critical measurement failure. In this case, the information
   presented to the operator is incorrect. An expert system would
   use a process knowledge base to detect inconsistencies and to
   alert the operator.
2. Process upset. In this case, the expert system would identify
   underlying process problems, distinguishing causes from
   effects, and would advise the operator accordingly. Heuristic
   rules of optimization would be applied by the expert system to
   give control advice.

0097-6156/86/0306-0069$06.00/0
© 1986 American Chemical Society

In these examples, the expert system is simply applying the
expertise used in its development. The potential advantage of the
operator advisor is that this expertise is available quickly, on
any shift, for providing organized advice to the operator.
To meet these requirements, several design considerations must
be addressed:

1. Data access. An efficient real-time data interface must be
   established with the distributed measurement system.
2. Inference paradigms. The basic inference mechanisms of
   forward-chaining and backward-chaining must be integrated into
   a real-time execution environment.
3. Computational efficiency. The efficiency of inference is
   enhanced by program and knowledge-base structure and by machine
   speed. Also, heuristic procedures, as used by experts, can
   augment the deductive procedures of conventional inference.

The program developed by LMI in response to these design
requirements is called Process Intelligent Control (PICON). The
individual design considerations are addressed in the following
discussion.

Process Intelligent Control

The expert system package is designed to operate on a LISP machine
interfaced with a conventional distributed control system. The
design assumes that up to 20,000 measurement points and alarms may
be accessed. The Lambda machine from LMI was utilized. The
real-time data interface is via an integral Multibus connected to a
computer gateway in the distributed system.
Data transfers, in floating point engineering units or in
status states, are requested by the expert system. Thus the
distributed system does not transmit all measurements and alarms
on a fixed scan basis, but rather the process data are accessed as
required for inference. In a sense, the expert system is acting
like an expert operator, who focuses attention or scans the
process operation selectively, using expertise to determine
specific areas of attention.
The basic inference paradigms supported by the expert system are
forward-chaining and backward-chaining. Within the context of an
alarm advisor, there are requirements for both of these paradigms.
An expert process operator, during normal plant operation, will scan
key process information. This is for purposes of monitoring control
performance and detecting problems which may not cause explicit
alarms. The programming paradigm which reflects this approach is a
scanned forward-chaining inference. The heuristic rules which
determine possibly-significant-events are scanned, and rule
condition matching triggers an alert to the expert system monitor
program. Conventional alarms also may trigger an alert, if they are
heuristically ranked as possibly-significant-events.
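
The scanned forward-chaining step can be illustrated with a minimal
Python sketch. It is not PICON code (PICON ran on a LISP machine);
the Rule class, the scan function, the tag names, and the thresholds
are all hypothetical, chosen only to show heuristic rules being
scanned against current readings so that each match raises an alert.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A heuristic rule: a named condition on the current readings."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    priority: int = 0


def scan(rules: List[Rule], readings: Dict[str, float]) -> List[str]:
    """One forward-chaining scan.

    Every rule whose condition matches the readings is reported as a
    possibly-significant-event, highest priority first.
    """
    fired = [rule for rule in rules if rule.condition(readings)]
    fired.sort(key=lambda rule: -rule.priority)
    return [rule.name for rule in fired]


# Hypothetical rules and limits, for illustration only.
RULES = [
    Rule("reactor-temp-drifting-high",
         lambda d: d["reactor_temp"] > 180.0, priority=2),
    Rule("feed-flow-low",
         lambda d: d["feed_flow"] < 10.0, priority=1),
]

alerts = scan(RULES, {"reactor_temp": 185.0, "feed_flow": 12.0})
```

In a running system the scan would presumably be repeated on a
timer, with each alert passed to the monitor program.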


An expert process operator, once alerted, will focus attention on
the problem. This may involve invoking procedure rules for safety or
other reasons, and it may involve assembling information and primary
analyses to allow inference about the problem. Logic rules and
procedures are used when required for the diagnostic inference. The
expert system mimics the expert process operator in this regard:
Logic rules and procedures are invoked specifically when they are
required for diagnosis of a process problem, or as requested for a
specific step in inference.

In working through process control examples, we found that many
calculations, data checks, rate checks and other computationally
intensive tasks are done at the first level of inference.
Considerations of computational efficiency led to a design utilizing
two parallel processors with a shared memory (Figure 1). One of the
processors is a 68010 programmed in C code. This processor performs
computationally intensive, low level tasks which are directed by the
expert system in the LISP processor.

The processing of data applies a level of intelligence. Instead of
mere measurement values, the expert may base inference on trends or
patterns of measurements. Thus the system must be able to access
primitive functions of data, such as averages and trends of values,
and quality information, such as the presence of noise or
discontinuous values. Such functions are conveniently calculated in
the parallel 68010 processor, coded in C language for execution
efficiency.
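
As a rough illustration of such primitive functions of data, the
sketch below computes a trend and two quality flags from a window of
equally spaced samples. The original functions were written in C on
the 68010; this Python version, its function names, and its
thresholds are assumptions made for illustration and do not reproduce
the PICON routines.

```python
from statistics import mean, pstdev


def trend_slope(values, dt=1.0):
    """Least-squares slope of equally spaced samples, per unit time."""
    times = [i * dt for i in range(len(values))]
    t_bar, v_bar = mean(times), mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den  # requires at least two samples


def is_noisy(values, threshold):
    """Flag noise when scatter about the fitted trend line is large."""
    slope = trend_slope(values)            # slope per sample index
    i_bar = (len(values) - 1) / 2.0
    v_bar = mean(values)
    residuals = [v - (v_bar + slope * (i - i_bar))
                 for i, v in enumerate(values)]
    return pstdev(residuals) > threshold


def has_discontinuity(values, max_step):
    """Flag any jump between consecutive samples larger than max_step."""
    return any(abs(b - a) > max_step for a, b in zip(values, values[1:]))
```

The average itself is simply mean(values); higher-level rules could
then test these derived quantities rather than raw measurements.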
An expert, given time to do so, may utilize calculations to develop
inference results. For example, a material balance calculation
around a process unit may indicate a measurement inconsistency. To
mimic this expertise, general mathematical operations on
combinations of measurements or functions of measurements are
implemented in the parallel processor also.
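
The material balance example can be sketched as follows. This is a
hypothetical steady-state check, not the PICON implementation: the
function names and the tolerance convention (a fraction of total
inflow) are assumptions.

```python
def material_balance_residual(inflows, outflows, accumulation=0.0):
    """Inflow minus outflow minus accumulation, in consistent units."""
    return sum(inflows) - sum(outflows) - accumulation


def balance_inconsistent(inflows, outflows, tolerance, accumulation=0.0):
    """True when the imbalance exceeds `tolerance` times total inflow.

    A persistent imbalance beyond measurement error suggests that at
    least one flow measurement around the unit is inconsistent.
    """
    residual = material_balance_residual(inflows, outflows, accumulation)
    return abs(residual) > tolerance * sum(inflows)
```

For instance, with one feed of 100.0 and two product streams of 60.0
and 30.0, a 2% tolerance would flag the unit, while products of 60.0
and 39.0 would not.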
Higher levels of inference depend on the truth conditions of the
first level antecedent conditions, and thus higher levels of
inference involve pattern matching and chained-inference logic.
Higher level inference is done in the LISP processor, using various
expert system paradigms, while the first level antecedents, which
are computationally intensive, are evaluated in the parallel 68010
processor.

The expert system package is designed so that an algorithm of
reasonably arbitrary structure can be dynamically loaded into the
68010 from the LISP processor. This allows, for example, the expert
system to implement process-monitoring functionality in a dynamic
fashion, the equivalent of:

"look closely at the energy balance around the specific process
unit for the next few minutes."

The expert system design includes the ability to change the time
period of measurement and algorithm processing in individual cases.
Thus, in effect, the system can "focus attention" to a specific area
of the process plant, and put all associated measurements and rules
for that area on frequent scan. This can be done under control of
the LISP program. Thus, for example:

Figure 1. Design for the LMI system for process control using
two parallel processors with a shared memory.

- A back-chaining diagnostic expert system could reach a point
  where an inference test is required.
- The LISP program would tell the 68010 processor to "focus" on
  the measurements and low-level inferences required around a
  process unit.
- The inference could then be tested.
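
One way to picture the "focus" facility is as a scheduler that holds
a slow background scan period for every process unit and a
temporary, faster period for the unit under attention. The sketch
below is only an assumption about how such bookkeeping might look;
the ScanScheduler class, its method names, and the unit name are
hypothetical.

```python
class ScanScheduler:
    """Tracks the scan period (seconds) to use for each process unit."""

    def __init__(self, background_period):
        self.background_period = background_period
        self._focus = {}  # unit -> (period, expiry time)

    def focus(self, unit, period, duration, now):
        """Put `unit` on frequent scan for `duration` seconds."""
        self._focus[unit] = (period, now + duration)

    def period_for(self, unit, now):
        """Return the scan period currently in force for `unit`."""
        period, expiry = self._focus.get(
            unit, (self.background_period, float("inf")))
        if now >= expiry:
            # Focus has lapsed: fall back to the background scan.
            del self._focus[unit]
            return self.background_period
        return period


scheduler = ScanScheduler(background_period=60.0)
scheduler.focus("distillation-column-1", period=2.0, duration=300.0, now=0.0)
```

Here the diagnostic program would call focus() when an inference
test is needed, and the data-collection loop would consult
period_for() before each request.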

Another use of this "focus" facility is to scan the plant in a
background mode, focusing attention on parts of the plant to
evaluate unit process performance and detect subtle problems,
utilizing both the programmed knowledge of the expert process
operator and the expert process engineer. It is not practical to
examine an entire plant continuously with this intensity, but the
individual parts of the plant could be scanned in a background
mode. This is equivalent to the way a process engineer would
analyze plant performance during normal plant operation.

It should be noted that the ability to focus not only emulates the
way a human expert works, but also it avoids the problem associated
with overloading the distributed process system with requests for
information. While the expert system knows about all 20,000
measurement and alarm points in the process environment, only those
of interest to the expert system need be accessed.

The LISP environment contains the higher-level functionality of the
expert system. A truth-maintenance design structure is used. The
design assumption is that lower-level intelligent processing, done
in the 68010, will signal potentially significant process events.
Thus, only a table of truth condition triggers needs to be checked
by the LISP programs.

Some general examples of inference using the system:

- detecting process problems, particularly on complex combinations
  of conditions which require expertise for proper interpretation.
- focus inference, in which rules of all priorities are activated
  for a unit process. In the typical use, a
  possibly-significant-event (detected by a high priority procedure
  rule) would trigger a focus on the process unit, thus initiating
  the gathering of information required for inference around the
  process unit.
- diagnosis, a backward chaining inference procedure, which would
  be triggered by a possibly-significant-event or by operator
  request. Diagnosis uses the focus mechanism. An explanation is
  then given of the diagnostic conclusion.
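
The diagnosis step, backward chaining followed by an explanation, can
be sketched in a few lines of Python. The rule base, the fact names,
and the prove function are hypothetical and far simpler than a real
diagnostic system; the sketch also assumes the rules contain no
cycles.

```python
def prove(goal, rules, facts, trace):
    """Backward chaining with an explanation trace.

    `rules` maps a conclusion to a list of alternative premise lists;
    `facts` is the set of conditions observed in the process. The
    goal holds if it is observed, or if every premise of some rule
    concluding it can itself be proved.
    """
    if goal in facts:
        trace.append(f"{goal}: observed")
        return True
    for premises in rules.get(goal, []):
        mark = len(trace)
        if all(prove(p, rules, facts, trace) for p in premises):
            trace.append(f"{goal}: concluded from {', '.join(premises)}")
            return True
        del trace[mark:]  # discard the trace of a failed attempt
    return False


# Hypothetical rule base for a heat exchanger, for illustration only.
RULES = {
    "fouled-exchanger": [["outlet-temp-high", "coolant-flow-normal"]],
    "outlet-temp-high": [["temp-trend-rising"]],
}

explanation = []
diagnosed = prove("fouled-exchanger", RULES,
                  {"temp-trend-rising", "coolant-flow-normal"}, explanation)
```

After the call, the explanation list holds the chain of observed
facts and intermediate conclusions in the order they were
established, which is the raw material for explaining the diagnosis
to the operator.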

Summary and Future Extensions

Virtually all tasks which require the routine application of human
expertise, in an organized way, are candidates for expert systems.
The computer implementation of expertise has such advantages as
speed, around-the-clock availability, and ease of expansion of the
knowledge base. As such, expert systems represent the next
generation of higher level software, performing tasks presently
done by human operators.


Expert systems have been investigated for 20 years. The
implementation of expert systems is now being undertaken on a
widespread basis, due to the availability of hardware and software
tools which alleviate the "knowledge-engineer bottleneck", allowing
cost effective implementation. In a similar way, real-time
applications of expert systems require tools to allow
straightforward implementation. We have presented a
software/hardware structure which supports knowledge-base capture
and real-time inference for process applications.

In general, the LMI package (Figure 2) provides a knowledge-base
structure, facilities for acquiring the knowledge base in an
organized manner, and real-time collection of data with some
parallel processing of inference, and higher-level inference tools.
The individual applications require specific knowledge engineering,
which is facilitated using the tools we have described. The system
is currently installed at Texaco and Exxon facilities and is in
pilot plant or laboratory testing at seven additional sites.

Figure 2. General structure of the LMI package. (Diagram labels:
CAPTURE, RULES, DIAGRAM, I/O, RTIME, MEMORY.)

RECEIVED December 17, 1985

6
Interpretation and Design
of Chemically Based Experiments
with Expert Systems
David Garfinkel¹, Lillian Garfinkel¹, Von-Wun Soo², and Casimir A. Kulikowski²

¹University of Pennsylvania, Philadelphia, PA 19104
²Rutgers University, New Brunswick, NJ 08903
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch006

Expert system building programs, e.g., EXPERT, can now supervise
numerical calculations in addition to performing qualitative
reasoning and choosing among possible alternatives. This capability
can be used to interpret experiments, calculate optimal designs for
them, and automate model construction and manipulation, as well as
to resolve associated problems due to differing conceptual
frameworks and definitions. Three hierarchically arranged
applications are suggested to (a) determine and manage free Mg²⁺
levels; (b) construct an expert system to derive enzyme kinetic
models (including Mg²⁺) and fit them to data; (c) design experiments
(including enzyme kinetics) using minimal numbers of animals to
prove drugs safe and effective.

Expert systems, and artificial intelligence in general, are new
fields whose breadth of application, and indeed, whose exact
definitions, are not yet completely settled. It is sometimes claimed
that no two experts on artificial intelligence agree exactly on what
its definition is. Definitions of expert systems at least agree on
the necessity for expertise, but even here there are differences in
emphasis and in priority.
Expert systems, which evolved from many sources, were recognized as
a distinct system type because of a large body of work on medical
consultation problems. The resulting systems, such as MYCIN, CASNET,
and INTERNIST/CADUCEUS, essentially solved what are considered
classification problems, by choosing among a set of possible
diagnostic or treatment alternatives. Such systems have usually
obtained information by asking the user questions. They have usually
performed qualitative reasoning with "knowledge" rules of the type:
if conditions A are true then conclude hypothesis B with probability
X. There exist other types of expert systems, such as DENDRAL, which
produces interpretations of quantitative experimental evidence, and
MOLGEN, which formulates plans for the design of experiments. Most
expert systems have been written in some variant of LISP or a
related language, which were originally not as well suited for
calculation as for logical manipulation. More recently it has been
possible to get an expert system to supervise calculations, digest
considerable masses of observational data, and draw conclusions
which are not strictly computational, as in the case of ELAS and the
oil-well drilling programs. These involve the EXPERT system builder
(1), which has the following advantages: it is written in FORTRAN
and can therefore easily communicate with FORTRAN programs; a PROLOG
version has also recently been prepared; it has data base
capabilities; and it is good at explaining what it is doing and why.
Interaction between artificial intelligence and modeling has evolved
to where modeling societies routinely program artificial
intelligence sessions at meetings, and are forming technical
committees on this subject.
This paper reflects the past activities of some of its authors in
computer modeling of the chemical aspects of biological systems.
This activity requires expertise in both model-building and in the
relevant biology. It also involves examination of the actions of and
results obtained by experts, like that routinely done in building
expert systems. It also involves keeping track of and coherently
explaining sequences of decisions, which expert systems are equipped
to do.
In this paper we are concerned with a set of relatively similar
possible applications involving management of calculations and of
modeling. These involve actions (calculation, information retrieval,
and "intelligent" reasoning) at more than one hierarchical level.
Particular attention will be given to the design and interpretation
of experiments in enzyme kinetics. Designing an experiment may
involve computation of optimal conditions, and its interpretation
may involve fitting of optimal parameters of a model, but
non-numerical reasoning procedures are also involved. Attention is
therefore required to the kinds of reasoning employed in designing
experiments and to the critiquing of the reasoning and techniques
involved in such experiments. A high-level description of an
experimental design cycle can be given in such steps as: definition
of the problem (what questions are to be addressed? what hypotheses
are to be tested?); quantitative modeling; design and then
performance of the necessary experiments; analysis of the results;
and then model reinterpretation and possible problem redefinition
(2).

A Problem of Definition

The process of building expert systems usually involves determining
the conceptual framework and pattern of decision making of experts
(often one outstanding expert). These are often not written down and
may not be clearly explainable because there is heavy reliance on
heuristics and even hunches. However, we would like to suggest that
this may not be the only way to apply expertise. We have encountered
workers in different fields handling the same subject matter
differently because they have different conceptual frameworks and
different jargon as well as different heuristics and priorities. We
offer the following example involving a relatively simple multiple
equilibrium calculation.


Although there is no controversy about the basic definition of
stability constants, physical chemists and biochemists handle the
concepts involved and the resulting calculations differently.
Physical chemists think in terms of reactive species and biochemists
in terms of total concentrations of components. A further source of
confusion is the differing definitions of "apparent constant". To a
physical chemist the stability constant for MgATP formation

\[ \mathrm{Mg^{2+} + ATP^{4-} \rightleftharpoons MgATP^{2-}} \]

is defined as

\[ K_{\mathrm{MgATP}} = \frac{[\mathrm{MgATP^{2-}}]}{[\mathrm{Mg^{2+}}]\,[\mathrm{ATP^{4-}}]} \]
For a given temperature the standard state is at zero ionic
strength. The constant observed experimentally at finite ionic
strengths would be considered "apparent". A biochemist would call
such a constant "intrinsic". The presence of interfering ions (H⁺
and K⁺) which form H and K chelates of ATP by binding to ATP⁴⁻ would
be handled by calculations involving the corresponding equilibria.

Biochemists handle these calculations differently, and define
apparent constants in terms of total components. Thus an apparent
constant for MgATP at low pH in the presence of K⁺ would be defined
as

\[ K_{\mathrm{app}} = \frac{K_{\mathrm{MgATP}}}{1 + [\mathrm{H^{+}}]\,K_{\mathrm{HATP}} + [\mathrm{K^{+}}]\,K_{\mathrm{KATP}}} \]
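
To make the difference between the two conventions concrete, the
sketch below converts an intrinsic MgATP stability constant into the
biochemist's apparent constant by dividing by one plus the H⁺ and K⁺
binding terms. The formula is the usual competing-equilibria form,
taken here as an assumption about what is intended; the function
name is hypothetical and the numerical constants are round
placeholder values, not data from this chapter.

```python
def apparent_constant(k_mgatp, h, k_hatp, k, k_katp):
    """Apparent MgATP constant in terms of total (non-Mg-bound) ATP.

    k_mgatp : intrinsic constant [MgATP2-] / ([Mg2+][ATP4-]), 1/M
    h, k    : free H+ and K+ concentrations, M
    k_hatp, k_katp : stability constants of HATP and KATP, 1/M
    """
    return k_mgatp / (1.0 + h * k_hatp + k * k_katp)


# Placeholder values chosen only to show the size of the effect.
K_INTRINSIC = 1.0e4   # 1/M, hypothetical
K_HATP = 1.0e7        # 1/M, hypothetical
K_KATP = 10.0         # 1/M, hypothetical

k_app = apparent_constant(K_INTRINSIC, h=1.0e-7, k_hatp=K_HATP,
                          k=0.1, k_katp=K_KATP)
```

With these placeholders each competing ion contributes a term of
about 1 to the denominator, so the apparent constant is roughly one
third of the intrinsic one. A discrepancy of that order can arise
simply from which definition is used.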
While it is relatively easy to show that the two calculations are
equivalent in simple systems, it is not so easy with more complex in
vivo systems, as when these equilibria are studied with ³¹P NMR
spectra from perfused or intact organs. We recently (3) became
involved in a controversy where a 4-fold difference in magnesium ion
level was calculated from substantially identical NMR spectra as a
result of such differences in definition. Our experience indicates
that an intelligent program to supervise such calculations would be
quite useful.
In such a situation an intelligent program may function as an
"intelligent interface", a program which can translate information
from one conceptual framework to another. Even though there are many
experts in the subject matter involved, programs of this type would
be useful for the many others who are not expert in the subject
matter or the calculations involved or who have difficulties in
communication. The advent of software for small expert systems on
microcomputers would add the advantage of convenience as well.


Applications

We describe here three possible applications of expert systems to
supervise calculations and design experiments which are largely
chemically based, although they have biological content as well.
These are arranged in a hierarchically increasing order of
complexity (i.e., each level needs the capabilities of the preceding
one). The simplest of these applications is to supervise complex
equilibrium calculations. The example described is of a type which
often occurs in studying biological systems where it is necessary to
control concentrations of reactive species. Such calculations are
often not properly handled.

Calculations Involving Magnesium Ions

Enough examples of poorly designed experiments and poor calculations
involving magnesium ions exist in the biochemical literature to
indicate a need for a better method. This also applies to other
equilibria of comparable complexity, as with other metal ions.
Experiments involving enzyme kinetics are particularly affected.
Magnesium ions affect many enzymes by binding strongly both to the
enzymes and to important reactants such as ATP. In a review on the
kinetics of magnesium-dependent enzymes, Morrison (4) stated "it is
unfortunate that studies on many metal-activated enzymes . . . have
been undertaken using conditions that preclude interpretation of the
data."
The relevant calculations are commonly handled poorly, because the
equilibrium equations involved are difficult to solve manually (but
not with computers). The few calculations that are actually reported
in the biochemical literature use simplified methods of limited and
frequently unknown validity. Large excesses of magnesium ion are
frequently used in experiments, perhaps in an attempt to avoid such
calculations. The relevant theory is well worked out and there are
excellent reviews. The limitation appears to involve diffusion of
this knowledge to the (mathematically) inexpert user, which is one
of the motivations for building expert systems.
The computational and other (e.g., data base and design)
capabilities to meet these needs can be specified. We may need to
determine how much magnesium ion (or other substance of interest in
an equilibrium system) is present in a cell interior or a solution
emulating the cell interior. Here a complex series of equilibria may
be affected by conditions such as temperature or ionic strength. Or
it may be necessary to work through a pattern of concentrations of
some particular molecular or ionic species to determine an ultimate
effect, or to keep particular species or particular side effects
within certain limits while changing others. Computations may have
to start from any of the participating substances which are either
to be controlled or are observable.
Computation of amounts of species present in straightforward
equilibria can usually be done without much difficulty, e.g., (5).
Some of the other requirements mentioned above are demanding enough
to define a minimal interesting problem in artificial intelligence
("toy problem"). Included are conversions among sets of conditions
(i.e., different temperature or ionic strength), which have caused
considerable difficulty, and which could be handled by providing an
expert system with the necessary conversion algorithms and data.
Such a system would include a program similar to that of Storer and
Cornish-Bowden to do equilibrium calculations. A
communication-control subprogram would be linked to an expert model
by using the EXPERT knowledge-base shell (or system-builder) which
is advantageous here because it can interact with procedures such as
those written in FORTRAN for numerical computation. Additional
programs and a small data base, which EXPERT can handle, would keep
track of which chemical was what array element, and other
requirements mentioned above. The system could be used to answer
questions such as:
    How could I add to a solution combinations of ATP and magnesium
    ion so their chelate is constant and free ATP varies
    systematically so as to define families of curves with
    (different) constant magnesium ion?
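
A question of this kind reduces to an equilibrium calculation run in
both directions. The sketch below treats only the single equilibrium
Mg + ATP = MgATP with one stability constant; competing ions such as
H⁺ and K⁺, which a real calculation would have to include, are
deliberately ignored. The function names and the numerical values
are hypothetical.

```python
import math


def totals_for_targets(k_mgatp, chelate, free_atp):
    """Totals to add so that the chelate and free ATP hit their targets.

    From K = [MgATP] / ([Mg][ATP]) the required free Mg follows, and
    each total is the free form plus the chelate.
    """
    free_mg = chelate / (k_mgatp * free_atp)
    return {
        "free_mg": free_mg,
        "total_mg": free_mg + chelate,
        "total_atp": free_atp + chelate,
    }


def chelate_from_totals(k_mgatp, total_mg, total_atp):
    """Forward problem: chelate concentration from the totals added.

    Solves c**2 - (Mt + At + 1/K) * c + Mt * At = 0 and returns the
    smaller root, which is the physically meaningful one.
    """
    b = total_mg + total_atp + 1.0 / k_mgatp
    return (b - math.sqrt(b * b - 4.0 * total_mg * total_atp)) / 2.0


# Hold the chelate at 1 mM while stepping free ATP (placeholder values).
plan = [totals_for_targets(1.0e4, 1.0e-3, a)
        for a in (1.0e-4, 2.0e-4, 4.0e-4)]
```

Feeding each planned pair of totals back through
chelate_from_totals() should return the 1 mM target, which gives a
simple consistency check of the kind a supervising program could run
automatically.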

This type of capability could be extended to magnesium-controlled
enzymes without substantial expertise regarding their kinetics by
adding another limited expert system to manage simple calculations
involving modifications to their kinetics. This would require adding
a small data base of the binding and inhibition constants of
magnesium ion with important enzymes. We have assembled this
information for some of the enzymes we have worked with (6). This
would permit answering questions like:

    How much magnesium ion can I add to solution X without
    inhibiting enzyme Y by more than 10%?
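
How such a question is answered depends on the inhibition mechanism
of the particular enzyme, which the data base would have to supply.
Purely as an illustration, the sketch below assumes the simplest
case, in which free magnesium ion lowers the rate by the factor
1 / (1 + [Mg]/Ki). This model, the function names, and the value of
Ki are assumptions and are not taken from the chapter.

```python
def fractional_inhibition(free_mg, ki):
    """Fraction of activity lost, assuming v/v0 = 1 / (1 + [Mg]/Ki)."""
    ratio = free_mg / ki
    return ratio / (1.0 + ratio)


def max_free_inhibitor(ki, max_inhibition):
    """Largest free Mg giving no more than `max_inhibition` loss."""
    if not 0.0 <= max_inhibition < 1.0:
        raise ValueError("max_inhibition must be in [0, 1)")
    return ki * max_inhibition / (1.0 - max_inhibition)


# Hypothetical inhibition constant of 9 mM; 10% inhibition allowed.
limit = max_free_inhibitor(9.0e-3, 0.10)
```

Under this assumed model the limit is Ki/9, here about 1 mM of free
magnesium ion. Converting that into an amount to add to solution X
would then require the equilibrium calculation for everything in the
solution that binds magnesium.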

Calculations Involving Enzyme Kinetics

At the next hierarchical step would be an expert system for the
design of experiments in enzyme kinetics (and mathematically similar
systems like transport kinetics). Such a system would largely arise
from our experience in modeling enzyme kinetics. It could
systematically perform, correctly, routine operations that are
either not done or done incorrectly, because they are too tedious or
require particular expertise. (For this subject there exists a
sizeable body of well-worked out theory, and yet considerable work
is done as if this theory did not exist). Such an expert system
could offer the expert user better modeling strategy and
completeness and the inexperienced user the advantage of
"friendliness". As an extreme example, we modeled (2) what is
probably the best data set in the phosphofructokinase literature,
and improved on its interpretation (which included modeling) by the
original experimenters, who are highly expert in this subject. It
was found that one important effect was not determined by their
data, but could have been with a few additional measurements. If an
expert system such as that described below were available, this
could have been done before their experiments were concluded (and
left permanently incomplete). Also, the entire interpretation task
would have required considerably less time and effort.
In working with enzyme and transport kinetics we already have a
program of considerable sophistication, PENNZYME (8), to fit
experimental data to rate laws by optimization methods and to
display the results of the fitting process. This program would
require extension to perform experimental design functions (such as
calculating design and information matrices). For most applications
it would be best for the scientist to remain in the loop. An
interface between PENNZYME and EXPERT such that EXPERT could direct
PENNZYME's calculation functions would be very similar to the
interface between EXPERT and several oil-well logging programs
(9-10). To help in assessing and documenting modeling applications
it would be desirable to have EXPERT produce a record of its
actions, decisions, and reasoning, in addition to the chemical or
biological output. This would require only a straightforward
extension of EXPERT's very good existing capabilities for explaining
its actions to a user on-line.
The major o p e r a t i o n s t h a t would have t o be performed by such an
e x p e r t system a r e :

Selection of a conceptual model. As the first step in modeling, it is
necessary to decide what kind of conceptual model to try. For an
enzyme this includes a choice of mechanism and an indication of the
numerical values that go with it (determination of the best values
comes later). Probably this will be better done by an expert human
than by a program for some time. Examples of rules (domain knowledge)
for enzyme kinetics which are applicable (regardless of the methods
of calculation used) are:
1. Kinases usually have Km's for ATP considerably lower than
tissue levels of ATP.
2. Most other Km's approximate the usual tissue level of the
substrate involved.
3. Certain classes of enzymes tend to have characteristic
mechanisms. (Examples: transaminases often have ping-pong mechanisms;
kinases usually do not.)
4. The commonly used linearized plots of kinetic data are a
usable initial guide to determining the mechanism.
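
Rules 1 to 3 above could be held by a program in a form like the following. This is only a minimal sketch: the function names, the string labels, and the factor `atp_fraction` are hypothetical (the text says only that kinase Km's for ATP are "considerably lower" than tissue levels), and the ordering returned for enzyme classes not named in rule 3 is an arbitrary default.

```python
def initial_km_guess(enzyme_class, substrate, tissue_level_mM, atp_fraction=0.1):
    """Suggest a starting Km (mM) from rules 1 and 2.

    atp_fraction is a hypothetical constant standing in for
    "considerably lower"; it is not a value given in the text.
    """
    if enzyme_class == "kinase" and substrate == "ATP":
        return atp_fraction * tissue_level_mM   # rule 1
    return tissue_level_mM                      # rule 2


def likely_mechanisms(enzyme_class):
    """Rule 3: order in which to try mechanisms for a class of enzymes."""
    if enzyme_class == "transaminase":
        return ["ping-pong", "sequential"]
    # Kinases usually are not ping-pong; other classes get the same
    # default order here only because the text gives no preference.
    return ["sequential", "ping-pong"]


print(initial_km_guess("kinase", "ATP", tissue_level_mM=5.0))
print(likely_mechanisms("transaminase"))
```

Such guesses would serve only as starting values; the fitting step described later determines the best values.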

Selection of a computational model. Once a conceptual model has been
selected, it is necessary to encode it in a form usable for
calculation, i.e., a rate law giving the velocity of the enzyme as a
function of the relevant chemical concentrations. An expert model
would include determination of the situations where a given rate law
should be tried, together with control information that determines
how this is to be done. The expert model obtains such information by
querying the user or by deduction from its knowledge rules using
results from past calculations. Means for deriving such rate laws
exist, e.g., the KINAL program of Cornish-Bowden (11), which we have
modified (PROKINAL) to facilitate interfacing with EXPERT to derive
rate laws automatically. The operations involved include:
1. Obtaining the proper rate law from an existing library or
generating a new one, as with KINAL;
2. Matching the generalized designations for reactants in the
rate law (reactant A, reactant B, ...) with the real ones in the
system being studied;
3. Asking the user for corrections if there is a problem;
4. Obtaining starting values of parameters, as from information
on analogous enzymes;
5. Fitting the rate law to the data and obtaining the optimal
parameters;
6. Making appropriate modifications to the rate law.
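
Operations 1 and 2 above can be illustrated with a small sketch. It assumes a library of two textbook rate laws written in terms of generalized reactants A and B; the dictionary `RATE_LAW_LIBRARY`, the function `bind_rate_law`, and the reactant names in the usage lines are hypothetical and are not taken from KINAL, PROKINAL, or PENNZYME.

```python
def michaelis_menten(conc, p):
    """One-substrate rate law: v = Vmax*A / (Km + A)."""
    return p["Vmax"] * conc["A"] / (p["Km"] + conc["A"])


def ping_pong_bi_bi(conc, p):
    """Two-substrate ping-pong rate law without product terms:
    v = Vmax*A*B / (KmB*A + KmA*B + A*B)."""
    a, b = conc["A"], conc["B"]
    return p["Vmax"] * a * b / (p["KmB"] * a + p["KmA"] * b + a * b)


# Operation 1: an existing library of rate laws, keyed by mechanism name.
RATE_LAW_LIBRARY = {
    "michaelis-menten": michaelis_menten,
    "ping-pong bi-bi": ping_pong_bi_bi,
}


def bind_rate_law(name, reactant_map):
    """Operation 2: match generalized reactants (A, B, ...) to real ones.

    reactant_map maps each generalized designation to the name used in
    the system under study.
    """
    law = RATE_LAW_LIBRARY[name]

    def bound(real_conc, params):
        generic = {g: real_conc[real] for g, real in reactant_map.items()}
        return law(generic, params)

    return bound


velocity = bind_rate_law("ping-pong bi-bi",
                         {"A": "aspartate", "B": "2-oxoglutarate"})
print(velocity({"aspartate": 1.0, "2-oxoglutarate": 1.0},
               {"Vmax": 3.0, "KmA": 1.0, "KmB": 1.0}))
```

A missing reactant name raises a `KeyError` here, which is the point at which a supervising system would ask the user for corrections (operation 3).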

Fitting of models to data. Fitting rate laws representing models to
the experimental data is the lowest-level and most frequent operation
this expert system would do. PENNZYME does a two-step optimization,
first using the simplex method, which is robust and independent of
starting guesses, and then the more accurate Fletcher-Powell method,
which requires better starting estimates. Examples of heuristic
rules on how to operate PENNZYME (problem-solving knowledge) are:
1. Do at least one simplex optimization before doing a
Fletcher-Powell optimization.
2. Always get an optimization report. If the percentage reduction
of the least-squares error is 0.00%, do not repeat the last type of
optimization performed.
3. If a simplex optimization has not converged for a model with
at least two parameters after many iterations, and the least-squares
error reduction is 0.00%, then something is wrong with the rate law
equation or the rate law file.
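
The three heuristics above can be read as a decision procedure over past optimization reports. The sketch below is one possible reading, not PENNZYME's or EXPERT's actual logic: the report fields, the threshold `many_iterations`, and the choices made when the rules are silent (what to do after a successful step, and stopping after a stalled Fletcher-Powell run) are assumptions.

```python
def next_operation(history, n_params, many_iterations=500):
    """Choose the next operation from a list of optimization reports.

    history is ordered oldest first; each report is a dict with keys
    "method" ("simplex" or "fletcher-powell"), "pct_reduction",
    "iterations", and "converged". many_iterations is a hypothetical
    threshold for the "many iterations" of rule 3.
    """
    # Rule 1: at least one simplex run before any Fletcher-Powell run.
    if not any(r["method"] == "simplex" for r in history):
        return "simplex"

    last = history[-1]
    # A report showing 0.00% means the reduction rounds to zero.
    stalled = round(last["pct_reduction"], 2) == 0.0

    # Rule 3: a stalled, unconverged simplex run points to a bad rate law.
    if (last["method"] == "simplex" and not last["converged"]
            and n_params >= 2 and stalled
            and last["iterations"] >= many_iterations):
        return "check rate law"

    # Rule 2: never repeat the type of optimization that just stalled.
    if stalled:
        return "fletcher-powell" if last["method"] == "simplex" else "stop"

    # Assumed policy when progress was made: follow the two-step scheme.
    if last["method"] == "simplex":
        return "fletcher-powell"
    return "stop" if last["converged"] else "fletcher-powell"


report = {"method": "simplex", "pct_reduction": 0.0,
          "iterations": 1000, "converged": False}
print(next_operation([], n_params=2))
print(next_operation([report], n_params=2))
```

Keeping the rules in one function makes the record of decisions easy to log, which matches the stated wish that EXPERT document its actions and reasoning.
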
Having EXPERT operate PENNZYME under particularly favorable
conditions is expected to be straightforward to the point of being
uninteresting. Obtaining the percentage reduction resulting from a
given optimization, and deciding from its value and its place in the
pattern of operations and resulting values what operation to ask for
next, would be conceptually similar to the usual one-query-at-a-time
dialogue directed to a human user of the usual expert system.
Determining what is wrong with a rate law file (which is not a common
problem) would normally require user intervention. At the other
extreme, this program combination will not be able to extract from a
poor data set information that is not there to begin with. The most
useful application is to the intermediate situation, where there are
useful but limited or noisy data, or where the experimental design is
not optimal.

Experimental Design. It is now possible, but inconvenient, to use
PENNZYME in an inverse mode, by determining the parameters in a rate
law and then manipulating the chemical concentrations so as to find
the point in concentration space that maximizes a given effect. An
immediate application is to maximize the difference between two rate
laws by means of a discrimination function (2). This amounts to
designing a critical experiment to distinguish between them. The user
who has an appropriate set of experimental data of sufficiently good
quality and two alternative rate laws that might fit it could have
the EXPERT-PENNZYME combination:
(a) Find the optimal parameters for these rate laws;
(b) Determine the point(s) or region(s) in concentration space
where they differ most (following (2));
(c) Recommend one or more experimental measurements at those
points.
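
Step (b) can be sketched as a grid search over candidate concentrations. The example assumes two standard two-substrate rate laws (ordered sequential and ping-pong, both without product terms) whose parameters have already been fitted, and it uses the squared difference in predicted velocity as the discrimination function. That choice is a stand-in: the discrimination function of reference 2 is not reproduced in the text. All parameter values are made up for illustration.

```python
from itertools import product


def sequential(a, b, p):
    """Ordered sequential bi-bi rate law without product terms."""
    denom = p["Kia"] * p["KmB"] + p["KmB"] * a + p["KmA"] * b + a * b
    return p["Vmax"] * a * b / denom


def ping_pong(a, b, p):
    """Ping-pong bi-bi rate law without product terms."""
    return p["Vmax"] * a * b / (p["KmB"] * a + p["KmA"] * b + a * b)


def most_discriminating_point(law1, p1, law2, p2, a_levels, b_levels):
    """Grid point (a, b) at which the two fitted rate laws differ most."""
    def discrimination(point):
        a, b = point
        # Assumed discrimination function: squared velocity difference.
        return (law1(a, b, p1) - law2(a, b, p2)) ** 2

    return max(product(a_levels, b_levels), key=discrimination)


levels = [0.1, 0.3, 1.0, 3.0, 10.0]          # hypothetical candidate levels
p_seq = {"Vmax": 1.0, "KmA": 1.0, "KmB": 1.0, "Kia": 1.0}
p_pp = {"Vmax": 1.0, "KmA": 1.0, "KmB": 1.0}
print(most_discriminating_point(sequential, p_seq, ping_pong, p_pp,
                                levels, levels))
```

A finer grid or a proper optimizer would refine the recommended point, and step (c) would then weigh it against how practical the measurement is.
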
In designing sequential experimental measurements or groups of
them, other functions that might be performed with appropriate
calculations are:
(1) Minimize the confidence limit or variance of a given
parameter, such as a Michaelis constant. This requires picking a
point or points in concentration space where the value of the
parameter is maximally sensitive to the experimental result obtained.
For example, a kinetic constant basically representing the binding of
a small molecule is insensitive to measurements where its binding is
very small or very large, and more sensitive to measurements where
its binding is near half maximal.
(2) Maximize or minimize information or design matrices.
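
The sensitivity argument in item (1) can be checked numerically for the simplest case. The sketch assumes a one-substrate Michaelis-Menten rate law, v = Vmax*S/(Km + S), for which the derivative of v with respect to Km is -Vmax*S/(Km + S)^2; its magnitude is largest at S = Km, the half-maximal point. The function names and the candidate concentrations are hypothetical.

```python
def km_sensitivity(s, vmax, km):
    """Partial derivative of v = vmax*s/(km + s) with respect to km."""
    return -vmax * s / (km + s) ** 2


def most_informative_concentration(levels, vmax, km):
    """Candidate substrate level at which v responds most strongly to km."""
    return max(levels, key=lambda s: abs(km_sensitivity(s, vmax, km)))


levels = [0.1, 0.5, 1.0, 2.0, 10.0]   # hypothetical candidate levels
print(most_informative_concentration(levels, vmax=1.0, km=2.0))
```

Because the best point depends on Km itself, the calculation has to start from a current estimate and be repeated as the estimate improves, which is why the text speaks of sequential measurements.
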
While performing the Fletcher-Powell optimization, PENNZYME
calculates the variance-covariance matrix of the parameters. This
can be used to test model acceptability: the parameters of a good
model should be relatively (although never completely) independent
of each other; if they are not, there is something questionable
about the model. More important, this matrix is also usable for
design calculations. Endrenyi (12) points out "optimal designs aim at
minimizing the volume of the joint confidence region of the
parameters. In the linear least-square approximation, this is
proportional to the determinant of the parameter variance-covariance
matrix V . . ." The important D-optimality criterion maximizes the
determinant of the information matrix, which is proportional to the
inverse of V. (Other optimality criteria may be more robust or better
in special situations.) The necessary matrix manipulations can be
coordinated with the PENNZYME program using existing matrix
manipulation software packages. Appropriate expert rules to use such
computations to design experiments would then have to be derived.
These would have to consider the probable accuracy or difficulty of
a given measurement. A small net signal above a large background
noise will probably be inaccurate. An experimenter might prefer two
measurements under convenient conditions to one measurement under
inconvenient (or scarce-material consuming) ones. Considerations of
minimizing experimenter's effort, number of animals used, etc. can
be handled either by including them in a body of rules or by adding
some kind of penalty function to the calculations.
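
The D-optimality criterion can be illustrated for a two-parameter Michaelis-Menten model. The sketch below builds the information matrix from the sensitivities of v to (Vmax, Km), assuming equal and constant measurement error, and picks the pair of candidate concentrations with the largest determinant. It is an illustration only: the function names and candidate levels are hypothetical, no penalty for inconvenient measurements is included, and nothing here is taken from PENNZYME.

```python
from itertools import combinations


def sensitivities(s, vmax, km):
    """Derivatives of v = vmax*s/(km + s) with respect to (vmax, km)."""
    return (s / (km + s), -vmax * s / (km + s) ** 2)


def information_determinant(design, vmax, km):
    """Determinant of the 2x2 information matrix, sum of f*f^T over the
    design points, assuming equal, constant error variance."""
    m11 = m12 = m22 = 0.0
    for s in design:
        f1, f2 = sensitivities(s, vmax, km)
        m11 += f1 * f1
        m12 += f1 * f2
        m22 += f2 * f2
    return m11 * m22 - m12 * m12


def d_optimal_pair(candidates, vmax, km):
    """Pair of candidate concentrations maximizing the determinant."""
    return max(combinations(candidates, 2),
               key=lambda design: information_determinant(design, vmax, km))


candidates = [0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0]   # hypothetical levels
print(d_optimal_pair(candidates, vmax=1.0, km=1.0))
```

A penalty function of the kind mentioned above could be subtracted from the determinant inside the `key` function, so that costly or unreliable points are selected only when they add enough information.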

Special consideration of metal ions. The effects of metal ions such
as magnesium ion could be calculated by effectively incorporating
into this system the software described previously. Qualitative
considerations could then be included by assembling a set of
knowledge rules applicable to magnesium ion behavior with regard to
enzymes, e.g.:
Magnesium ion is usually involved (for "charge neutralization")
where "high-energy phosphate" is moved from one molecule to another
by an enzyme, i.e., the metabolically active form of ATP is usually
the magnesium chelate.
The ensemble of EXPERT plus data and knowledge bases and calculation
routines would then be used to solve problems such as determining a
change in enzyme activity on changing metal ion level, or determining
whether there is an effective change in mechanism as well.

Pharmacokinetics and Drug Dosage Regimen Design: A Possible
Application Requiring Construction and Manipulation of a Complex
Model and Data Base with an Expert System

A major part of the slow and expensive drug development process
consists of testing to determine that a given potential drug is both
safe and effective. The number of drug (or cosmetic) toxicity tests
performed annually in the United States is very large, involving
perhaps 15 million animals and considerably more dollars. The
expense of testing and qualification may be prohibitive for drugs
intended for use in animals: it may cost more to prove a drug safe
and effective in a given species than could ever be earned by sales
for use in that species. The techniques used, especially to test
toxicity, are now being strongly criticized, particularly because of
the large number of animal-based experiments. Computer-based methods
of predicting toxicity from the chemical structure are being
developed in response to this problem.

A suitable expert system which can manage pharmacokinetic simulation
could substantially improve the speed and efficiency of this process.
Such a system would contain information about drugs, drug metabolism,
excretion, etc. and the relevant physiological parameters. It would
supervise construction of models from quantitative measurements of
the behavior of the drug under test in animals. The expert system
would be needed because large-scale biological modeling has thus far
been slow. Also, pharmacokinetic modeling has emphasized simple
systems and given little attention to qualitative data or to
extrapolation from one species to another.

Modeling Considerations. Drugs for internal use must enter the body
in some way, reach the blood stream, be transported to the relevant
organs and active sites, exert their action, perhaps be metabolized
or modified, and subsequently leave the body. These processes involve
the action of enzymes and of kinetically similar transport
mechanisms, so the techniques and software described above (an expert
system involving EXPERT and PENNZYME) are applicable here. The major
variable which this type of analysis would try to predict and manage
is the (free) plasma level of a drug. This level is likely to be
identical to the drug level at the site of action. It has been shown
to be directly related to therapeutic effect for many drugs, but less
closely related to the dose administered. Important theoretically
predictable perturbing factors here include disease conditions such
as renal failure, old age, and physiological factors; an important
but unpredictable one is failure to take a drug as directed. The
effects of such factors on the behavior and apparent toxicity of a
given drug would require systematic exploration with appropriate
models, which is best supervised by an expert system because it would
otherwise take too long.

Compartments between which drugs do not mix, or mix only slowly,
commonly exist in the body. Metabolism within them is carried out and
controlled by enzymes in the usual way. These compartments can be
detected by time-curve analysis of the blood levels of drugs.
Compartments determined in this way have the limitations that:
1. They are difficult to predict a priori;
2. Their structure may depend on particular numerical values
associated with a system under study as well as its structure or
organization;
3. Sometimes different competent workers disagree as to the
compartmental structure of the same system.
An important methodology of extrapolating pharmacokinetic or drug
properties from one species to another which is relatively
independent of such compartmental modeling has been developed by
Bischoff and collaborators (13). It is instead based on known
anatomical and physiological functions, such as blood flow to organs
which either metabolize drugs or are affected by them, the size and
metabolic rate of the animal, etc. To some extent this approach
("physiological pharmacokinetics") is a chemical engineer's
formulation of the pharmacokinetic problem. The rate at which a given
drug is delivered to a metabolizing or target organ by the plasma
(with its level of drug) is calculated along with the rates of
metabolism or detoxification by such organs, as well as the rate of
removal of the drug (or its metabolites) from the body. From this
information the total and free (after binding to proteins, etc.)
organ content of the drug and the level at the active site are
calculated. This method is based on the orderly change of many
anatomical and physiological properties with body weight. Anatomical
dimensions increase nearly linearly with weight, while physiological
rates vary as the 0.7 to 0.8 power (14). Physiological processes are
therefore slower in larger animals; the cardiac output of a mouse per
unit body weight is about an order of magnitude higher than that of a
man. This trend is coherent: the disposition half-life of
hexobarbital approximates 1,680 gut-beats in a wide variety of
mammals (14). Dedrick (15) has described a formalism for animal
scale-up. General application of this method would require assembly
of a data base with sizes of and blood flows to the most important
organs, excretory capacity and renal functions, etc., and even
prediction of potential compartmental sizes where possible.
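
The scaling rule quoted above can be written out in a few lines. The sketch assumes a simple power law, rate proportional to (body weight)^b with b between 0.7 and 0.8, so that the rate per unit body weight scales as (body weight)^(b - 1). The function names and the body weights used (0.02 kg for a mouse, 70 kg for a man) are illustrative assumptions, not values from the text or from Dedrick's formalism.

```python
def scaled_rate(rate_ref, weight_ref_kg, weight_kg, exponent=0.75):
    """Scale a whole-body physiological rate from a reference animal
    to another body weight using a power law."""
    return rate_ref * (weight_kg / weight_ref_kg) ** exponent


def per_weight_ratio(weight_small_kg, weight_large_kg, exponent=0.75):
    """Ratio of weight-specific rates, smaller animal over larger one."""
    return (weight_small_kg / weight_large_kg) ** (exponent - 1.0)


# Assumed body weights: 0.02 kg mouse, 70 kg man.
for b in (0.7, 0.75, 0.8):
    print(b, round(per_weight_ratio(0.02, 70.0, exponent=b), 1))
```

With these assumed weights the ratio comes out at roughly 5 to 12 over the stated range of exponents, in line with the "about an order of magnitude" quoted for cardiac output.
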
The comparative behavior of a few drugs has been thoroughly studied
by these workers, with the largest effort directed to methotrexate
(16). This drug constitutes a good test case because its mechanism
of action is well-known and simple, the amount of information about
it is very large, and it now appears applicable to two unrelated
therapeutic situations requiring different dosage levels. Bischoff
et al. were able to fit substantially the same model to data for
mouse, rat, dog (including dogs of different sizes), monkey, and man.
They were then able to successfully extrapolate from these mammalian
studies all the way to the stingray, which is zoologically a close
relative of the sharks (17).
A second level of sophistication is possible here. To quote from
Bischoff (18), "Williams notes that foreign organic compounds tend to
be metabolized in two phases. Phase one reactions lead to
oxidation-reduction and hydrolysis products. Phase two reactions lead
to synthetic or conjugation products that are relatively polar and
are thus more easily excreted". Species variations of phase one
reactions are very common but hard to predict; phase two reactions
are much more limited in number and more predictable. A more powerful
expert system could probably make useful predictions of the
quantitative behavior of toxic metabolites of drugs and perhaps help
get an indication of what presently unknown species-dependent
toxicities might be. For this purpose the admittedly incomplete
information on which pathways of detoxification and other metabolism
are present in which organ, and how active they are, would have to be
collected (this includes heuristics as well as hard data, including
the types of information mentioned above). Some of the
unpredictabilities as to which toxic products might be formed by the
liver of what species might be compensated for by appropriately
designed experiments with such livers (or tissue cultures derived
from them).
One could set up an expert system by interfacing a suitable
simulation program with EXPERT. Good optimization capabilities and
the ability to handle design optimality problems like those mentioned
above are important in the simulation program, in addition to the
good data-base and explanation capabilities of EXPERT. Such an expert
system could then build multi-species pharmacokinetic models by the
method of Bischoff and Dedrick. After repeating their work as the
test case, this expert system could be used for the other drugs whose
kinetics have been sufficiently studied (including sampling in
several tissues) as required for such analysis. Subsequent extension
to include additional methodologies is possible (e.g., detailed
representation of enzyme kinetics). Model construction with only part
of the original data could then be repeated to determine the need for
completeness of (experimentally determined) information, i.e., which
and how many animal experiments are really necessary. Such
considerations are important in drug testing, and an expert system
would help by doing the modeling both faster than a human and more
systematically.
A well-established specialized expert system with which the
proposed expert system could be compared is the digitalis advisor of
Szolovitz and Long (19), which represents a well-understood clinical
situation. It performs clinical functions beyond the scope of this
proposed system, but it does do some things, like maintaining the
blood level of the drug involved and monitoring its toxicity, that
this proposed system is concerned with and should perform adequately.

Starting with appropriate knowledge of the behavior of a prospective
drug in one species, one could then extrapolate to other species,
ultimately including humans. This capability could be used in testing
a proposed drug to determine the proper dosage and regimen, the
conditions under which it (and possibly its metabolites) is toxic,
and how sensitive its behavior might be to perturbing conditions.
These determinations presently have to be repeated for each species
involved by empirically and heuristically guided experiments. It is
reasonable to hope for significantly improved efficiency in
performing these expensive operations.

Conclusion

We have described a set of applications of a conventional expert
system which extend the usual functions of such systems from
primarily logical reasoning and solution of classification problems
to include supervision of calculations and of modeling, i.e., systems
management. A hierarchy of applications arising from biochemical
research has been discussed. These follow biological systems in being
primarily chemical at the lowest level but acquiring more biological
character at the higher levels. At the lowest level, these permit the
convenient performance of calculations which are not being done, or
not being done properly. At the intermediate level, they provide a
better research tool, especially for experimental design. At the most
complex level, they would permit a complex, slow, and expensive
process to be carried out with less resource expenditure (calendar
time, money, and animal experiments).

Acknowledgments

Supported by NIH grants HL15622, AM33016, RR643, and RR2230.

Literature Cited

1. Weiss, S.; Kulikowski, C. Proc. 6th International Joint
Conference on Artificial Intelligence, 1979, p. 942.
2. Mannervik, B. "Kinetic Data Analysis"; Endrenyi, L., Ed.; Plenum
Press: New York, 1981; p. 235.
3. Garfinkel, L.; Garfinkel, D. Biochemistry 1984, 23, 3547.
4. Morrison, J. F. Methods in Enzymology 1979, 63, 257.
5. Storer, A. C.; Cornish-Bowden, A. Biochem. J. 1977, 165, 61.
6. Garfinkel, L.; Garfinkel, D. Magnesium 1985, 4, 60.
7. Waser, M. R.; Garfinkel, L.; Kohn, M. C.; Garfinkel, D.
J. Theoret. Biol. 1983, 103, 295.
8. Kohn, M. C.; Menten, L. E.; Garfinkel, D. Comput. Biomed. Res.
1979, 12, 461.
9. Weiss, S.; Apte, C. IEEE Transactions on Pattern Analysis and
Machine Intelligence 1985, PAMI-7, 586.
10. Weiss, S.; Kulikowski, C.; Apte, C.; Uschold, M.; Patchett, J.;
Briggham, R. M.; Spitzer, B. Proc. 2nd Annual Nat'l. Conf. on
Artificial Intelligence, Pittsburgh, PA, 1982, p. 322.
11. Cornish-Bowden, A. Biochem. J. 1977, 165, 55.
12. Endrenyi, L. "Kinetic Data Analysis"; Endrenyi, L., Ed.;
Plenum Press: New York, 1981; p. 137.
13. Bischoff, K. B. Cancer Chemotherapy Reports 1975, 59, Part 1,
p. 777.
14. Adolph, E. F. Science 1949, 109, 579.
15. Dedrick, R. L. J. Pharmacokinet. Biopharm. 1978, 1, 435.
16. Bischoff, K. B.; Dedrick, R. L.; Zaharko, D. S.; Longstreth, J.
A. J. Pharmaceutical Sciences 1971, 60, 1128.
17. Zaharko, D. S.; Dedrick, R. L.; Oliverio, V. T. Comp. Biochem.
Physiol. 1974, 42A, 183.
18. Bischoff, K. B. Fed. Proc. 1980, 39, 2456.
19. Szolovitz, P.; Long, W. J. "Artificial Intelligence in
Medicine"; Szolovitz, P., Ed.; AAAS Selected Symposium 51;
Westover Press: Boulder, Colo., 1982; p. 79.

RECEIVED December 17, 1985

7
An Expert System for the Formulation
of Agricultural Chemicals

Bruce A. Hohne and Richard D. Houghton

Rohm and Haas Company, Spring House, PA 19477

An expert system has been written which helps the agricultural
chemist develop formulations for new biologically active chemicals.
The decision making process is segmented into two parts. The first is
which type of formulation to use. The second is how to make a
formulation of that type with the chemical of interest. The knowledge
base currently contains rules to determine which formulation type to
try and how to make an emulsifiable concentrate. The next phase will
add rules on how to make other types of formulations. The program
also interfaces to several FORTRAN programs which perform
calculations such as solubilities.

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch007

What Is An Agricultural Formulation

An essential part of the development of a new pesticide is
establishing a good, dependable formulation. The product's active
ingredient and physical properties should remain acceptable for two
years or more. These formulations are often subjected to storage
conditions of extreme heat, cold, and humidity. Once sold to the
applicator, the concentrated formulation should dilute easily to
field strength and pass freely through conventional spray equipment.

Agricultural (Ag) formulations that are commonly diluted and
applied by means of spray equipment include water soluble liquids,
emulsifiable concentrates, wettable powders, and flowable
suspensions. The choice of which formulation to develop normally
depends upon the solubility properties of the technical pesticide.
Scientists often must also consider manufacturing costs, field
efficacy and product toxicity.

A water soluble liquid formulation (WSL) is prepared from
pesticides that are highly water soluble. This is, by far, the
simplest type of formulation. One distinct advantage of WSLs over
other formulations is that the field spray dilutions are infinitely
stable as true solutions. Pesticides that are hydrophilic and
ionic, such as inorganic or organic metallic salts, often fall into
this category. Unfortunately, only a small portion of all
pesticides are adequately soluble in water.

0097-6156/86/0306-O087$06.00/0
© 1986 American Chemical Society

An emulsifiable concentrate is prepared from pesticides that
are soluble in common organic solvents, such as xylene and kerosene.
Using emulsifiers in the composition causes the formulation to
disperse into small particles, called an emulsion, when diluted in
water.

Pesticides that are not soluble or have limited solubility in
common solvents are formulated as wettable powders (WP) or flowable
concentrates (flowables). A wettable powder has the capacity for
high active ingredient content, often between fifty and eighty
percent by weight, and is made by blending and grinding dry
ingredients. Wettable powders are best prepared from pesticides
that are high melting, friable solids. Diluents, such as natural
clays and synthetic silicates, are used to improve the powder's
physical properties. The disadvantages of a WP are: messy handling
properties; potential dust inhalation hazard for field personnel;
and the need to measure the powder on a weight basis. In some cases
these problems can be overcome by formulating the pesticide into a
suspension. Water and other ingredients are added to the
composition to suspend and disperse the active compound into a
flowable.

Regardless of what type of formulation is employed in the
field, the formulation must wet, disperse, and remain homogeneous in
the application spray equipment. Careful selection of formulating
agents, commonly called inerts, is extremely important. These
ingredients have no biological activity of their own, but combined,
they function as the delivery system for the pesticide.

In addition to solvents and diluents, formulations may contain
emulsifiers, dispersants, chelating agents, thickeners, defoamers,
and more. The large number and variety of each type makes selecting
the components for a formulation difficult and time consuming.

Why Is This a Good Area for an Expert System?

The process of choosing application areas for expert system
development has been detailed elsewhere, both for the general case
and the corporate environment [1]. There are several specific
advantages in the formulations application. Experts on one type of
formulation are not necessarily experts on other formulation types.
Expertise in Ag formulations tends to be in the form of 'rules of
thumb', based on experiences with similar chemical systems.
Incremental growth, like this, is ideal for expert system
development. Formulation scientists are also likely to be more
tolerant of the program's mistakes because their skill is measured
by how few bad formulations they make before they make a good one.

Multilevel expert systems offer additional advantages over
traditional expert systems. Multilevel expert systems draw on
computational computer programs to solve parts of the problem. The
Ag formulation expert system does this in the areas of computational
chemistry, bookkeeping, and communication.
There are numerous computational programs available to chemists
today. These programs are algorithmic by nature, and solve problems
that do not lend themselves to expert systems. However, a great
deal of expertise may be needed by the chemist to decide which
program to use and how to actually use it. Most chemists do not
have, and are not willing to gain, this computer expertise. Some
would rather use traditional, noncomputational methods than
navigate the maze of available computer programs and users' manuals.
Expert systems can be extremely valuable in providing this expertise
to chemists.

The Ag formulations expert system has the ability to execute
the appropriate computational programs, giving it an advantage over
the formulation chemist. Bookkeeping tasks are generally handled
better by a computer than a chemist. For example, time tables must
be met for long term storage studies, toxicology data, and
government registrations. These tasks are easily handled by the
computer.

The expert system fills several potential communication gaps.
Molecular modeling calculations which are performed by the synthetic
chemists, outside the formulation area, can be accessed by the
expert system. Through this interface, the expert system can
extract useful structural information directly. Also, if a
structure has not been entered, the formulation chemist can use the
modeling program to enter the structure into the computer. In
addition, the system safeguards against communication gaps between
the chemist and management/marketing by including marketing and
production considerations in the rule base. In this way, management
can determine which new formulations are possible, and what
characteristics they will sacrifice with a particular formulation.

Structure of the Problem

The problem of developing a new formulation is highly structured.
The structure tends to be hierarchical, although this hierarchy does
not resemble a traditional decision tree. Each branch point may
have any number of branches. The decision about which 'branch' to
take at each level can be viewed as an independent expert system.
The ability to break the overall problem into smaller, simpler
subproblems is desirable for expert systems.

Many of the facts in the system are shared by several
subproblems, and subproblems must be developed by starting at the
top of the hierarchy and working down. Other than these
stipulations, they are independent problems. Each branch of the
tree can be used independently, and need not be complete to be
useful in the formulation study. The expert system's competence on
each subproblem can be judged independently. In many cases
different experts are used to develop the knowledge bases for
different subproblems. Figure 1 shows the structure of the problem,
tracing one branch from each level.

Structure of the Expert System

The program was written on an Apollo computer in LISP. Apollo's
Domain LISP, a version of Portable Standard LISP, was the dialect
available.

The expert system has been written to follow the natural
structure of the Ag formulation problem. Figure 2 shows the overall
structure of the expert system. One nice feature of the program is
that at each branch point the user can override the computer's
choice, and can also select as many branches to pursue as desired.

Figure 1. Structure of the Problem
(Diagram: Emulsifiable Concentrate -> Determine Solvent ->
Solvent 1, 3, 4, 5 -> Determine Emulsifier -> Emulsifier 1, 2)

Figure 2. Structure of the Program
(Flowchart: Load Relevant Rules and Hypotheses -> Collect Background
User Information -> Forward Chain -> Sort Hypotheses -> Reverse Chain
on Best Hypothesis -> Collect Additional User Information ->
User Choose Hypothesis)


The logical deduction portion of the program is based on
IF-THEN rules. FACTS, acquired both as the result of logical
deductions and by querying the user, are stored in similar data
structures. Because the branch points in the problem are also
logical deductions, they are stored in a data structure similar to
the FACTS. The branch points contain additional flow of control
information that relates to the hierarchy of the problem. The
difference between FACTS and branch points is transparent to the
logical deduction portion of the program.

The top level in the structure of FACTS is the fact name, e.g.,
ACTIVE_INGREDIENT. Under each fact are various properties relevant
to that fact, e.g., H2O_SOLUBILITY. For each property, several
pieces of information are stored (see Figure 3). All properties
contain a VALUE, which is initialized to a null or missing value.
They also contain the method to obtain the VALUE. Currently
supported methods are ASKIT, PROVEIT, and CALL.

Fact name
    Property 1
        Value
        Where to find it (Ask, prove, calculate)
        Prompt (How to ask user)
        Allowable response (Checks user's response)
        Explanation (For prove and calculate)

    Property 2
        Value

Figure 3. Structure of Facts
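
The fact structure in Figure 3 can be sketched as a small record type. The following Python sketch is illustrative only: the original system was written in LISP, and the class name, field names, prompt wording, and helper functions shown here are assumptions rather than the authors' code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


def percent_ok(text: str) -> bool:
    """Hypothetical response checker: accept an integer from 1 to 100."""
    try:
        return 1 <= int(text) <= 100
    except ValueError:
        return False


@dataclass
class Property:
    """One property of a fact, with the slots listed in Figure 3."""
    method: str                                    # "ASKIT", "PROVEIT", or "CALL"
    value: Any = None                              # None stands for the null/missing value
    prompt: Optional[str] = None                   # how to ask the user (ASKIT)
    check: Optional[Callable[[str], bool]] = None  # allowable-response checker (ASKIT)
    explanation: Optional[str] = None              # text for proved or calculated values


# FACTS: fact name -> property name -> Property record.
FACTS = {
    "ACTIVE_INGREDIENT": {
        "H2O_SOLUBILITY": Property(
            method="ASKIT",
            prompt="Water solubility of the active ingredient (percent)?",  # assumed wording
            check=percent_ok,
        ),
    },
}


def is_known(fact: str, prop: str) -> bool:
    """True once a value has been asked for, proved, or calculated."""
    return FACTS[fact][prop].value is not None
```

Keeping the acquisition method next to the value lets an inference routine decide what to do with a missing value by inspecting the record alone.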


If the method for acquiring a VALUE is ASKIT, then a user
PROMPT is stored. In order to guarantee a valid response to the
question, a LISP function to check the answer is included with the
FACT. Table I lists the currently implemented response checking
functions. Whenever the inference engine reaches one of these
facts, searching is stopped and the user is prompted for a value.

Table I. User Input

Function        Allowed Response

PercentP        Positive integer between 1 and 100
Yes_NoP         Yes, No, Y, N
Any_Of          Any number of members of the listed possibilities
One_Of          One member of the listed possibilities
PositiveP       Any positive number
ImportanceP     High, Med, Low, H, M, L
IntegerP        Any positive or negative integer
Integer_listP   A list of integers separated by spaces
Minus1_to_One   Any number between -1 and +1
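
A few of the response checkers in Table I might look as follows. This is a hedged Python sketch, not the original LISP: the function names are adapted from the table, while the case-insensitive matching and the `ask` helper are assumptions made for illustration.

```python
def percent_p(text):
    """PercentP: an integer between 1 and 100."""
    try:
        return 1 <= int(text) <= 100
    except ValueError:
        return False


def yes_no_p(text):
    """Yes_NoP: Yes, No, Y, or N (case-insensitive here by assumption)."""
    return text.strip().upper() in {"YES", "NO", "Y", "N"}


def one_of(choices):
    """One_Of: build a checker that accepts exactly one listed possibility."""
    allowed = {c.upper() for c in choices}
    return lambda text: text.strip().upper() in allowed


def minus1_to_one(text):
    """Minus1_to_One: any number between -1 and +1."""
    try:
        return -1.0 <= float(text) <= 1.0
    except ValueError:
        return False


def ask(prompt, check, read=input):
    """Repeat the prompt until the checker accepts the answer."""
    while True:
        answer = read(prompt + " ")
        if check(answer):
            return answer
```

Passing the reader function as a parameter makes the prompt loop easy to exercise without a terminal.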

For values which must be deduced, a TEXT explanation is saved.
This TEXT is used in the various explanation and tracing facilities.
Whenever the inference engine reaches one of these FACTS it either
continues its search, if possible, or proceeds another level deeper
in the reverse search and tries to prove that FACT.

The CALL facility allows the expert system to access software
external to the LISP program. Included with the CALL is the name of
a LISP function which handles the outside software. In the case of
the fact CHEMICAL_NAME, the LISP function executes a FORTRAN program
which allows the user to either retrieve the structure of a
previously entered compound or enter a new one. The program also
breaks the chemical structure into its functional groups. When the
FORTRAN program terminates, the LISP function updates the list of
facts, and inserts the name into CHEMICAL_NAME and the functional
groups into FUNCT_GROUPS. These FACTS are then available to the
expert system. In this way, access to outside software is
completely data driven.
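
The data-driven CALL mechanism can be sketched as a table of handlers keyed by fact name. In this Python sketch the handler name, the `run_external` parameter, and the returned dictionary layout are all hypothetical; `run_external` merely stands in for launching the FORTRAN structure program.

```python
def chemical_name_handler(facts, run_external):
    """Hypothetical CALL handler: run the outside program, then update FACTS."""
    result = run_external()                  # stand-in for the FORTRAN program
    facts["CHEMICAL_NAME"] = result["name"]
    facts["FUNCT_GROUPS"] = result["groups"]


# Facts whose acquisition method is CALL, mapped to their handler functions.
CALL_HANDLERS = {"CHEMICAL_NAME": chemical_name_handler}


def lookup(facts, name, run_external):
    """Return a fact's value, invoking its CALL handler only if the value is missing."""
    if facts.get(name) is None and name in CALL_HANDLERS:
        CALL_HANDLERS[name](facts, run_external)
    return facts.get(name)
```

Because the handler runs only when the inference process actually needs the missing value, the external program is never started for problems that do not require it.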

The structure of the branch points is the same as that of those
FACTS which must be deduced, except for the additional control
information. The properties correspond to the different branches in
the hierarchy at that point. Figure 4 shows the data structure of
branch points. For each branch point (property), there is a list of
rules which apply. By only considering rules applicable to the
specific subproblem, the time required for searching is drastically
reduced. A list of FACT-PROPERTY pairs, which are useful background
information for the subproblem, is also saved. This background
information is collected at the beginning of each subproblem and
used in a forward-chaining function. This approach can prevent the
reverse-chaining portion of the system from appearing as though it
is "wandering" at the beginning of each subproblem. The final piece
of control information is the name of the next subproblem, which
corresponds to the FACT name. These names are stored for each
branch point.

Conclusion Name
    Branch Point 1
        Value
        Where to find it (prove)
        Explanation (Tell the user if it is true)
        Next level name
        Background facts (Questions always asked)
        Rule names (List of relevant rules)

    Branch Point 2
        Value

Figure 4. Structure of Conclusions


Rules in the expert system are structured to allow flexibility
and future expansion. For speed of execution, the IF-THEN clauses
are actually executable LISP code. Tables II and III contain
examples of how rules are structured. The IF clauses contain
functions, called predicates. Predicates have a value of either
true or false when evaluated. If all the IF clauses are true, then
the THEN clauses are executed. The THEN clauses contain ACTIONS
which change the VALUEs of other FACTs. The PREDICATES and ACTIONS
are the basic building blocks for all the rules in the system.
There is no limit to the number of IF or THEN clauses which a rule
can contain. As more powerful rules are required, additional
building blocks can easily be added by writing new PREDICATES or
ACTIONS.

Table II. Structure of Rules

AgRule_1
    If-1     (Isequal Active_Ingredient Desired_Level Value >40)
    Then-1   (Suggest Form_Type EC -.5)
    Then-2   (Suggest Form_Type WSL -.5)
    Then-3   (Suggest Form_Type Flowable -.5)
    Why      EC's, WSL's and Flowables rarely have that high an AI level
    Date     11/14/83
    Author   Houghton

Agrule1
IF
    1. The value of the active ingredient's desired concentration
       is >40%
THEN
    1. There is suggestive evidence (-0.5) that the
       formulation type should not be emulsifiable concentrate
    2. There is suggestive evidence (-0.5) that the
       formulation type should not be water soluble liquid
    3. There is suggestive evidence (-0.5) that the
       formulation type should not be flowable concentrate
BECAUSE:
    EC's, WSL's and Flowables rarely have that high an AI level

Table III. Structure of Rules

AgRule_13
    If-1     (Isequal Solvent Req_EPA_Clear Value C)
    Then-1   (Avoid NotEqual EC_Solvent EPA_Clear C -1)
    Why      It's the law
    Date     12/20/83
    Author   Hohne

Agrule13
IF
    1. The value of the solvent's required EPA clearance
       is C
THEN
    1. Avoid (-1) emulsifiable concentrate solvents where EPA clearance
       is not equal to C
BECAUSE
    It's the law
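
Since the IF and THEN clauses are described as executable code, a rule such as AgRule_1 can be sketched with closures. The Python below is only an illustration of that idea: the `bigger` and `suggest` builders, the dictionary layout, and the use of a greater-than test with threshold 40 in place of the ">40" entry in Table II are assumptions, and the evidence is merely recorded rather than combined.

```python
def bigger(fact, prop, threshold):
    """Build an IF clause: true when facts[fact][prop] exceeds threshold."""
    return lambda facts: facts[fact][prop] > threshold


def suggest(fact, value, cf):
    """Build a THEN clause: record confidence factor cf for fact = value."""
    def action(evidence):
        evidence.setdefault((fact, value), []).append(cf)
    return action


# AgRule_1 from Table II, expressed with the hypothetical builders above.
AG_RULE_1 = {
    "if": [bigger("Active_Ingredient", "Desired_Level", 40)],
    "then": [suggest("Form_Type", "EC", -0.5),
             suggest("Form_Type", "WSL", -0.5),
             suggest("Form_Type", "Flowable", -0.5)],
    "why": "EC's, WSL's and Flowables rarely have that high an AI level",
}


def fire(rule, facts, evidence):
    """Run every THEN clause if all IF clauses are true; report whether the rule fired."""
    if all(clause(facts) for clause in rule["if"]):
        for action in rule["then"]:
            action(evidence)
        return True
    return False
```

New kinds of clauses can then be added by writing another builder function, without changing `fire`.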

The rule structure allows simple Boolean functions to be
performed. Multiple numbered IF clauses are logically ANDed
together. Multiple clauses which are part of the same numbered IF
are logically ORed. The logical NOT does not exist, but can be
simulated using predicates with the opposite meaning in the IF
clause (i.e., BIGGER is equivalent to NOT SMALLER). Table IV lists
the currently available predicates for IF clauses.

Table IV. Relationships (predicates)

Predicate   Meaning

BIGGER      Bigger than
SMALLER     Smaller than
MEMB        Member of the list
NOTMEMB     Not a member of the list
ISEQUAL     Is equal to
NOTEQUAL    Not equal to
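
The AND/OR convention for IF clauses can be sketched as follows. This Python fragment is an assumption-laden illustration: the predicate names come from Table IV, but the tuple layout, the `holds` function, and the example fact names and values are hypothetical.

```python
import operator

# Predicates from Table IV, mapped to Python comparisons.
PREDICATES = {
    "BIGGER": operator.gt,
    "SMALLER": operator.lt,
    "ISEQUAL": operator.eq,
    "NOTEQUAL": operator.ne,
    "MEMB": lambda value, items: value in items,
    "NOTMEMB": lambda value, items: value not in items,
}


def holds(if_groups, values):
    """Evaluate IF clauses.

    if_groups is a list of numbered IFs; each numbered IF is a list of
    (predicate, key, target) alternatives. Alternatives within one
    numbered IF are ORed, and the numbered IFs are ANDed together.
    """
    return all(
        any(PREDICATES[pred](values[key], target) for pred, key, target in group)
        for group in if_groups
    )


# Hypothetical example: IF-1 has one clause, IF-2 has two ORed clauses.
example = [
    [("BIGGER", "melting_point", 100)],
    [("MEMB", "solvent", ["xylene", "kerosene"]), ("ISEQUAL", "solvent", "water")],
]
```

With `values = {"melting_point": 120, "solvent": "water"}` the example holds because IF-1 is true and the second alternative of IF-2 is true.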

The ACTIONS available to the THEN clauses are listed in
Table V. These ACTIONS give rise to two types of THEN clauses. The
first type affects the VALUE of only one property. The THEN clauses
in Table II show the construction of one-property THEN clauses. The
second type of THEN clause deals with all of the current branch
points. Table III shows the construction of this type of THEN
clause.

Table V. Actions

Action      Meaning

SUGGEST     Adjust the property's value using the listed
            confidence factor
SET_EQUAL   Set the property's value equal to the listed value
ORDER_BY    Order the hypotheses by the value of the listed
            property
AVOID       Avoid conclusions where the requirement listed

The inference engine was designed to use multivalued logic,
i.e., it handles inexact reasoning. Confidence factors (CF) are
contained in the THEN clauses of each rule. The equation for
combining positive confidences is:

    CF = old_value + new_value - (old_value × new_value)

The equation for negative confidences is:

    CF = old_value + new_value + (old_value × new_value)

For mixed positive and negative confidences, a simple sum is
used. The advantage to these functions is they are bounded by -1
and +1.
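
The three combination cases translate directly into a short function. The Python below is a sketch of the equations as stated in the text; the function name and the treatment of a zero confidence (handled by whichever same-sign branch applies) are assumptions.

```python
def combine_cf(old_value, new_value):
    """Combine two confidence factors, each in the range -1 to +1."""
    if old_value >= 0 and new_value >= 0:
        # Both positive: evidence accumulates toward +1.
        return old_value + new_value - old_value * new_value
    if old_value <= 0 and new_value <= 0:
        # Both negative: evidence accumulates toward -1.
        return old_value + new_value + old_value * new_value
    # Mixed signs: simple sum.
    return old_value + new_value


# Two rules each suggesting -0.5 against the same hypothesis.
cf = combine_cf(0.0, -0.5)   # -0.5
cf = combine_cf(cf, -0.5)    # -0.75
```

Two pieces of suggestive evidence at -0.5 therefore yield -0.75 rather than -1, so no amount of merely suggestive evidence reaches certainty.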
The program also handles exact reasoning through the SET_EQUAL
ACTION in the THEN clause. This ACTION can be used to set a
confidence value to +1 (true) or -1 (false), regardless of
previously compiled confidences. SET_EQUAL can also be used to set
FACT values equal to nonnumeric values, where required.

The natural language interpretation of the rules given at the
bottom of Tables II and III was generated by the program. The
natural language generator uses synonyms for FACT names and
properties. The synonyms are simply substituted into one of several
templates to generate a sentence. The template used is determined
by the value of the confidence factor and the combination of ACTIONS
and PREDICATES.
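
Template substitution of this kind can be sketched in a few lines. In the Python below, the synonym table is taken from the wording in Table II, the negative template reproduces the sentence form shown there, and the positive template and the selection rule (sign of the confidence factor only) are assumptions.

```python
# Synonyms for FACT names and values, taken from the wording in Table II.
SYNONYMS = {
    "Form_Type": "formulation type",
    "EC": "emulsifiable concentrate",
    "WSL": "water soluble liquid",
    "Flowable": "flowable concentrate",
}

TEMPLATES = {
    "negative": "There is suggestive evidence ({cf}) that the {fact} should not be {value}",
    "positive": "There is suggestive evidence ({cf}) that the {fact} should be {value}",  # assumed
}


def describe_suggest(fact, value, cf):
    """Render one SUGGEST clause in English, choosing the template by the sign of cf."""
    key = "negative" if cf < 0 else "positive"
    return TEMPLATES[key].format(
        cf=cf,
        fact=SYNONYMS.get(fact, fact),
        value=SYNONYMS.get(value, value),
    )
```

For example, `describe_suggest("Form_Type", "EC", -0.5)` produces the first THEN sentence of Table II.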

Current Status of the Project

The project is still in the prototype stage. It is being used, but
not widely. Presently, the knowledge base for the system has fewer
than 100 rules. This number is misleading because none of the work
performed by the FORTRAN programs is counted in the number of
rules. These programs give the system far more knowledge than would
be expected from the 'small' knowledge base.

The system can help scientists reliably determine what type of
formulation to make. However, the only branch of the decision tree
which has rules is the emulsifiable concentrates (EC) branch. The
system can determine which solvents to try to make an EC. Its
decision relies heavily on rules and solubility calculations. Work
is just beginning on the rules to determine which emulsifiers to
use.
The program has been interfaced to two FORTRAN programs. The
first, MOLY, is a locally developed product for chemical structure
entry, display, and molecular modeling [2]. The expert system only
takes advantage of the chemical structure handling portion of the
program. The other program, UNIFAC [3], performs solubility
calculations for the active ingredient in a group of solvents of
interest to formulation chemists.

The inference engine performs both forward and
reverse-chaining. The reverse-chain algorithm is a depth-first
search. Using this algorithm, questions asked by the system are
grouped by subject, making the program appear more logical to the
user. The program handles exact and inexact logic calculations and
explains, in English, why a question was asked and how a conclusion
was reached. The program also allows the scientist to change
answers in case of mistakes, or to investigate "what if" scenarios.

Directions for Future Development

Future developments fall into two classes: additions to the
knowledge base and enhancements to the program. As the program is
used by more people, fine tuning of the rules to select which type
of formulation to try will be needed. Work, from that point, will
continue on the emulsifiable concentrate branch. The solvent
selection portion will require some fine tuning, but the major work
is in adding to the list of solvents. The emulsifier selection
portion of the knowledge base will definitely require additional
rules, to be followed by considerable tuning as it is used. The
remaining four formulation types have yet to be started. They will
require different experts and can be developed concurrently with the
EC portion.

The first major enhancement to the program will be the ability
to stop sessions at any point and restart at the same point at a
later time. This capability will be more than just a convenience;
it will be necessary to make the laboratory results requested by the
program useful. After this addition, the next major enhancement
will be to develop a method of using the rules to trouble-shoot
field problems. This enhancement will involve adding some rules,
but most of the knowledge should already be in the knowledge base.
As the program becomes widely used, the ability to generate reports
and data sheets for laboratory results will be a valuable addition.
The added ability to remind the scientist about certain deadlines
for a project may be easily included, but will not be useful until
scientists use the Apollo computer regularly.

The expert system currently has no rule entry or maintenance
facilities. Rules are entered and modified using the Apollo
computer text editor. This is acceptable for a prototype, but not
for a production system. Before these facilities are added, the
system's cost and capabilities will need to be compared to those of
commercial expert systems.

Literature Cited

1. Prerau, D.S., "Selection of an Appropriate Domain for an Expert
   System", AI Magazine, 6(2), 1985
2. Dyott, T., Stuper, A.J., Zander, G.S., "MOLY, an Interactive
   System for Molecular Analysis", J. Chem. Inf. Comp. Sci.,
   20(28), 1980
3. Fredenslund, A., Jones, R.L., Prausnitz, J.M.,
   "Group-Contribution Estimation of Activity Coefficients in
   Nonideal Liquid Mixtures", AIChE Journal, 21(6), 1975

RECEIVED December 17, 1985

8
Computer Algebra: Capabilities and Applications
to Problems in Engineering and the Sciences

Richard Pavelle

MACSYMA Group, Symbolics, Inc., Cambridge, MA 02142


Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch008

MACSYMA is a large, interactive computer system
designed to assist engineers, scientists, and
mathematicians in solving mathematical problems. A user
supplies symbolic inputs and MACSYMA yields symbolic,
numeric or graphic results. This paper provides an
introduction to MACSYMA and provides the motivation for
using the system. Many examples are given of MACSYMA's
capabilities with actual computer input and output.

My purpose in this paper is to provide a broad introduction to the
capabilities of MACSYMA. It is my hope that this information will
create new users of Computer Algebra systems by showing what one
might expect to gain by using them and what one will lose by not
using them.

MACSYMA output is used and CPU times are often given. In some
cases I have modified the output slightly to make it more
presentable. The CPU times correspond to a Symbolics 3600 and to the
MACSYMA Consortium machine (MIT-MC), which is a Digital Equipment
KL10. These are about equal in speed and about twice as fast as a
Digital Equipment VAX 11/780 for MACSYMA computations. When CPU
times are not given one may assume the calculation requires at most
10 CPU seconds.

0097-6156/86/0306-0100$06.00/0
© 1986 American Chemical Society

What is MACSYMA. The development of the Computer Algebra system,
MACSYMA, began at MIT in the late 60s, and its history has been
described elsewhere (1). A few facts worth repeating are that a
great deal of effort and expense went into MACSYMA. There are
estimates that 100 man-years of developing and debugging have gone
into the program. While this is a large number, let us consider the
even larger number of man-years using and testing MACSYMA. At MIT,
between 1972 and 1982, we had about 1000 MACSYMA users. If we had 50
serious users using MACSYMA for 50% of their time, 250 casual users
at 10% and 700 infrequent users at 2% then the total is over 600
man-years. MACSYMA has been at 50 sites for four years and is at
400 sites today. Well, we could conclude that at least 1000
man-years have been spent in using MACSYMA. MACSYMA is now very
large and consists of about 3000 lisp subroutines or about 300,000
lines of compiled lisp code joined together in one giant package for
performing symbolic mathematics.
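
The MIT usage estimate above can be checked with a few lines of arithmetic. The Python sketch below uses only the figures quoted in the paragraph; treating 1972 to 1982 as ten years and the variable names are assumptions.

```python
YEARS = 10  # 1972 to 1982, taken as ten years

# (number of users, percent of their time spent using MACSYMA)
USER_GROUPS = [(50, 50), (250, 10), (700, 2)]

total_users = sum(n for n, _ in USER_GROUPS)
# Equivalent number of full-time users in any one year.
full_time_users = sum(n * pct for n, pct in USER_GROUPS) / 100
man_years = full_time_users * YEARS

print(total_users, full_time_users, man_years)  # 1000 64.0 640.0
```

The three groups contribute 25, 25, and 14 full-time equivalents per year, so ten years give 640 man-years, consistent with "over 600".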

While this paper is directed towards MACSYMA, the development
of MACSYMA and other Computer Algebra systems has really been the
result of an international effort (2). There are many systems,
world-wide, of various sizes and designs which have been developed
over the past fifteen to twenty years (3, 4). Research related to
the development of these systems has led to many new results in
mathematics and the construction of new algorithms. These results in
turn helped the development of MACSYMA as well as other systems.
These systems are now being recognized as important tools allowing
researchers to make significant discoveries in many fields of
interest (5).

Why MACSYMA is Useful or Necessary. Here are some of the more
important reasons for using MACSYMA:

1. The answers one obtains are exact and can often be checked by
independent procedures. For example, one can compute an indefinite
integral and check the answer by differentiating; the
differentiation algorithm is independent of the integration
algorithm. Since exact answers are given, the statistical error
analysis associated with numerical computation is unnecessary. One
obtains answers that are reliable to a high level of confidence.

2. The user can generate FORTRAN expressions that allow numeric
computers to run faster and more efficiently. This saves CPU cycles
and makes computing more economical. The user can generate FORTRAN
expressions from MACSYMA expressions. The FORTRAN capability is an
extremely important feature combining symbolic and numeric
capabilities. The trend is clear, and in a few years we will have
powerful, inexpensive desktop or notebook computers that merge the
symbolic, numeric and graphic capabilities in a scientific
workstation.

3. The user can explore extremely complex problems that cannot be
solved in any other manner. This capability is often thought of as
the major use of Computer Algebra systems. However, one should not
lose sight of the fact that MACSYMA is often used as an advanced
calculator to perform everyday symbolic and numeric problems. It
also complements conventional tools such as reference tables or
numeric processors.

4. A great deal of knowledge has gone into the MACSYMA knowledge
base. Therefore the user has access to mathematical techniques that
are not available from any other resources, and the user can solve
problems even though he may not know or understand the techniques
that the system uses to arrive at an answer.

5. A user can test mathematical conjectures easily and painlessly.
One frequently encounters mathematical results in the literature and
questions their validity. Often MACSYMA can be used to check these
results using algebraic or numeric techniques or a combination of
these. Similarly one can use the system to show that some problems
do not have a solution.

6. MACSYMA is easy to use. Individuals without prior computing
experience can learn to solve fairly difficult problems with MACSYMA
in a few hours or less. While MACSYMA is written in a dialect of
LISP, the user need never see this base language. MACSYMA itself is
a full programming language, almost mathematical in nature, whose
syntax resembles ALGOL.

There are two additional reasons for using MACSYMA that are
more important than the others.

7. One can concentrate on the intellectual content of a problem
leaving computational details to the computer. This often results
in accidental discoveries and, owing to the power of the program,
these occur at a far greater rate than when calculations are done by
hand.

8. But the most important reason is that, to quote R.W. Hamming,
"The purpose of computing is insight, not numbers." This exemplifies
the major benefit of using MACSYMA, and I will demonstrate the
validity of this statement by showing not only how one gains insight
but also how one uses MACSYMA for theory building. However, a second
quotation reputed to be by Hamming is correct as well, namely that
"The purpose of computing is not yet insight."

Capabilities and Uses of MACSYMA

Capabilities. It is not possible to fully indicate the capabilities
of MACSYMA in a few lines since the reference manual itself occupies
more than 500 pages (6). However, some of the more important
capabilities include (in addition to the basic arithmetical
operations) facilities to provide analytical tools for

Limits                               Taylor Series (Several Variables)
Derivatives                          Poisson Series
Indefinite Integration               Laplace Transformations
Definite Integration                 Indefinite Summation
Ordinary Differential Equations      Matrix Manipulation
Systems of Equations (Non-Linear)    Vector Manipulation
Simplification                       Tensor Manipulation
Factorization                        Fortran Generation

There are other routines for calculations in number theory,
combinatorics, continued fractions, set theory and complex
arithmetic. There is also a share library currently containing
about 80 subroutines. Some of these perform computations such as
asymptotic analysis and optimization while others manipulate many of
the higher transcendental functions. In addition one can evaluate
expressions numerically at most stages of a computation. MACSYMA
also provides extensive graphic capabilities to the user.

To put the capabilities of MACSYMA in perspective we could say
that MACSYMA knows a large percentage of the mathematical techniques
used in engineering and the sciences. I do not mean to imply that
MACSYMA can do everything. It is easy to come up with examples that
MACSYMA cannot handle, and I will present some of these. Perhaps the
following quotation will add the necessary balance. It is an exit
message from some MIT computers that often flashes on our screens
when logging out. It states: "I am a computer. I am dumber than any
human and smarter than any administrator." MACSYMA is remarkable in
both the questions it can and cannot answer. It will be many years
before it evolves into a system that rivals the human in more than a
few areas. But until then, it is the most useful tool that any
engineer or scientist can have at his disposal.

Uses. It is difficult to list the application fields of MACSYMA
because users often do not state the tools that helped them perform
their research. However, from Computer Algebra conferences (7, 8,
9) we do know that MACSYMA has been used in the following fields:

Acoustics                    Fluid Dynamics
Algebraic Geometry           General Relativity
Antenna Theory               Number Theory
Celestial Mechanics          Numerical Analysis
Computer-Aided Design        Particle Physics
Control Theory               Plasma Physics
Deformation Analysis         Solid-State Physics
Econometrics                 Structural Mechanics
Experimental Mathematics     Thermodynamics

Researchers have reported using MACSYMA to explore problems in:

Airfoil Design                       Nuclear Magnetic Resonance
Atomic Scattering Cross Sections     Optimal Control Theory
Ballistic Missile Defense Systems    Polymer Modeling
Decision Analysis in Medicine        Propeller Design
Electron Microscope Design           Robotics
Emulsion Chemistry                   Ship Hull Design
Finite Element Analysis              Spectral Analysis
Helicopter Blade Motion              Underwater Shock Waves
Maximum Likelihood Estimation
Genetic Studies of Family Resemblance
Large Scale Integrated Circuit Design
Resolving Closely Spaced Optical Targets

Examples of MACSYMA

Polynomial Equations. Here is an elementary example that
demonstrates the ability of MACSYMA to solve equations. In MACSYMA,
as with most systems, one has user input lines and computer output
lines. Below, in the input line (C1), we have written an expression
in an ALGOL-like syntax and terminated it with a semicolon, and in
(D1) the computer echoes the expression by displaying it in a
two-dimensional format similar to hand notation. Terminating
an input string with $ inhibits the display of the D lines.

(C1) X^3+B*X^2+A^2*X^2-9*A*X^2+A^2*B*X-2*A*B*X-
     9*A^3*X+14*A^2*X-2*A^3*B+14*A^4=0;

(D1) \[ X^3 + B X^2 + A^2 X^2 - 9 A X^2 + A^2 B X - 2 A B X - 9 A^3 X + 14 A^2 X - 2 A^3 B + 14 A^4 = 0 \]

In (C2) we now ask MACSYMA to solve the expression (D1) for X
and the three roots appear in a list in (D2).

(C2) SOLVE(D1,X);

(D2) \[ [\, X = 7A - B,\; X = -A^2,\; X = 2A \,] \]

Notice that MACSYMA has obtained the roots analytically and
that numeric approximations have not been made. This demonstrates a
fundamental difference between a Computer Algebra system and an
ordinary numeric equation solver, namely the ability to obtain a
solution without approximations. I could have given MACSYMA a
"numeric" cubic equation in X by specifying numeric values for A and
B. MACSYMA then would have solved the equation and given the numeric
roots approximately or exactly depending upon the specified command.
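The roots in (D2) can be checked independently by substituting them back into (D1). The short Python sketch below does this with exact integer arithmetic over a small grid of A and B values; the function names are illustrative and are not part of MACSYMA.

```python
from itertools import product


def cubic(x, a, b):
    """Left-hand side of (D1), evaluated with exact integer arithmetic."""
    return (x**3 + b*x**2 + a**2*x**2 - 9*a*x**2 + a**2*b*x - 2*a*b*x
            - 9*a**3*x + 14*a**2*x - 2*a**3*b + 14*a**4)


def roots(a, b):
    """The three symbolic roots reported in (D2)."""
    return [7*a - b, -a**2, 2*a]


def roots_check(values=range(-3, 4)):
    """True if every root zeroes the cubic for all sampled integer A, B."""
    return all(cubic(x, a, b) == 0
               for a, b in product(values, repeat=2)
               for x in roots(a, b))


print(roots_check())
```

A spot check of this kind samples only a few points and does not replace the symbolic result, but it catches transcription errors in either the equation or the roots.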

MACSYMA can also solve quadratic, cubic and quartic equations
as well as some classes of higher degree equations. However, it
obviously cannot solve equations analytically in closed form when
methods are not known, e.g. a general fifth degree (or higher)
equation.

Differential Calculus. MACSYMA knows about calculus. In (D1) we
have an exponentiated function that is often used as an example in a
first course in differential calculus.

(D1) \[ X^{X^{X}} \]

We now ask MACSYMA to differentiate (D1) with respect to X to
obtain this classic textbook result of differentiation. Notice how
fast, 3/100 CPU seconds, MACSYMA computes this derivative.

(C2) DIFF(D1,X);
Time= 30 msec.

(D2) \[ X^{X^{X}} \left( X^{X} \log(X)\,(\log(X) + 1) + X^{X-1} \right) \]
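As a rough numerical cross-check of (D2), the closed-form derivative can be compared with a central finite difference. This is a sketch only; the step size `h` and the sample points are arbitrary choices.

```python
import math


def f(x):
    """The function in (D1): x raised to the power x**x."""
    return x ** (x ** x)


def df_closed(x):
    """(D2): x^(x^x) * (x^x * log(x) * (log(x) + 1) + x^(x - 1))."""
    return f(x) * (x**x * math.log(x) * (math.log(x) + 1) + x**(x - 1))


def df_numeric(x, h=1e-6):
    """Central finite-difference estimate of the derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (0.5, 1.0, 1.5, 2.0):
    print(x, df_closed(x), df_numeric(x))
```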

Below is a more complicated function, the error function of the
tangent of the arc-cosine of the natural logarithm of X. Notice that
MACSYMA does not display the identical input. This is because the
input in (C1) passes through MACSYMA's simplifier. MACSYMA
recognizes that the tangent of the arc-cosine of a function
satisfies a trigonometric identity, namely TAN(ACOS(X)) =
SQRT(1-X^2)/X. It takes this into account before displaying (D1).

(C1) ERF(TAN(ACOS(LOG(X))));

(D1) \[ \operatorname{erf}\!\left( \frac{\sqrt{1 - \log^2(X)}}{\log(X)} \right) \]

Now when MACSYMA is asked to differentiate (D1) with respect to
X, it does so in a straightforward manner and simplifies the result
using the rational canonical simplifier RATSIMP. This command puts
the expression in a numerator-over-denominator form canceling any
common divisors. In (D2) the symbols %E and %PI are MACSYMA's
representations for the base of the natural logarithms and pi,
respectively.

(C2) DIFF(D1,X),RATSIMP;
Time= 1585 msec.

(D2) \[ -\frac{2\,\%E^{\,1 - 1/\log^2(X)}}{\sqrt{\%PI}\; X \log^2(X)\, \sqrt{1 - \log^2(X)}} \]

Factorization

MACSYMA can factor expressions. Below is a multivariate polynomial
in four variables.

(D1)
\[
\begin{aligned}
&-36 W^2 X^7 Y^4 Z^8 + 3 W^2 X^6 Y^3 Z^8 - 24 W^3 X^7 Y^4 Z^6 + 2 W^3 X^6 Y^3 Z^6 \\
&+ 96 W^2 X^8 Y^6 Z^5 - 168 W^4 X^7 Y^6 Z^5 + 12 W^2 X^7 Y^6 Z^5 - 216 W^2 X^{10} Y^5 Z^5 \\
&- 8 W^2 X^7 Y^5 Z^5 + 9 X^7 Y^5 Z^5 + 14 W^4 X^6 Y^5 Z^5 - W^2 X^6 Y^5 Z^5 \\
&+ 18 W^2 X^9 Y^4 Z^5 + 87 X^7 Y^3 Z^5 - 3 W^2 X^6 Y^3 Z^5 + 6 W X^7 Y^5 Z^3 \\
&+ 58 W X^7 Y^3 Z^3 - 2 W^3 X^6 Y^3 Z^3 - 24 X^8 Y^7 Z^2 + 42 W^2 X^7 Y^7 Z^2 \\
&- 3 X^7 Y^7 Z^2 + 54 X^{10} Y^6 Z^2 - 232 X^8 Y^5 Z^2 + 414 W^2 X^7 Y^5 Z^2 \\
&- 29 X^7 Y^5 Z^2 - 14 W^4 X^6 Y^5 Z^2 + W^2 X^6 Y^5 Z^2 + 522 X^{10} Y^4 Z^2 \\
&- 18 W^2 X^9 Y^4 Z^2
\end{aligned}
\]

We now call the function FACTOR on (D1):

(C2) FACTOR(D1);
Time= 111998 msec.

(D2) \[ -X^6 Y^3 Z^2 \left( 3 Z^3 + 2 W Z - 8 X Y^2 + 14 W^2 Y^2 - Y^2 + 18 X^3 Y \right) \left( 12 W^2 X Y Z^3 - W^2 Z^3 - 3 X Y^2 - 29 X + W^2 \right) \]

MACSYMA factors this massive expression in about two CPU
minutes. One can also extend the field of factorization to the
Gaussian integers or other algebraic fields (10).
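Going in the opposite direction, from the factors in (D2) back to the expanded polynomial, is purely mechanical. The sketch below multiplies the three factors using a sparse dictionary representation, in which exponent tuples (w, x, y, z) map to integer coefficients. The representation and the helper `poly_mul` are illustrative choices and say nothing about how MACSYMA stores polynomials.

```python
from collections import defaultdict


def poly_mul(p, q):
    """Multiply two sparse polynomials keyed by (w, x, y, z) exponents."""
    out = defaultdict(int)
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            exps = tuple(i + j for i, j in zip(e1, e2))
            out[exps] += c1 * c2
    # drop any terms that cancelled exactly
    return {e: c for e, c in out.items() if c != 0}


# The three factors of (D2); exponent order is (w, x, y, z).
prefactor = {(0, 6, 3, 2): -1}
factor1 = {(0, 0, 0, 3): 3, (1, 0, 0, 1): 2, (0, 1, 2, 0): -8,
           (2, 0, 2, 0): 14, (0, 0, 2, 0): -1, (0, 3, 1, 0): 18}
factor2 = {(2, 1, 1, 3): 12, (2, 0, 0, 3): -1, (0, 1, 2, 0): -3,
           (0, 1, 0, 0): -29, (2, 0, 0, 0): 1}

expanded = poly_mul(prefactor, poly_mul(factor1, factor2))
print(len(expanded), "terms")
```

Expansion takes a fraction of a second, whereas recovering the factors from the expanded form is the hard direction, which is why FACTOR needs far more time than the other examples.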

Simplification. A very important feature of MACSYMA is its ability
to simplify expressions. When I studied plane-wave metrics for a
new gravitation theory (11, 12), one particular calculation produced
an expression with several hundred thousand terms. From geometrical
arguments I knew the expression must simplify and indeed, using
MACSYMA, the expression collapsed to a small number of pages of
output. The following expression occurred repeatedly in the course
of the calculation and caused the collapse of the larger expression
during simplification.


(D1) \[ \frac{\left(\sqrt{R^2 + A^2} + A\right)\left(\sqrt{R^2 + B^2} + B\right)}{R^2} - \frac{\sqrt{R^2 + B^2} + \sqrt{R^2 + A^2} + B + A}{\sqrt{R^2 + B^2} + \sqrt{R^2 + A^2} - B - A} \]

(C2) RATSIMP(D1);
Time= 138 msec.
(D2) 0

When the canonical simplifier RATSIMP is called on (D1) above
it returns zero. At first I did not believe that (D1) is zero, and
I spent 14 minutes verifying it by hand (almost exceeding my 15
minute limit). It is not easy to prove. Combining the expressions
over a common denominator results in a numerator that contains 20
terms when fully expanded, and one must be very careful to assure
cancellation. Try it by hand!
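Short of the hand calculation, a quick floating-point evaluation at a few arbitrary points makes the result plausible. The sketch below evaluates (D1) as the difference of its two fractions; the sample values of R, A and B are arbitrary.

```python
import math


def d1(r, a, b):
    """(D1) evaluated in floating point as first fraction minus second."""
    sa = math.sqrt(r * r + a * a)
    sb = math.sqrt(r * r + b * b)
    first = (sa + a) * (sb + b) / (r * r)
    second = (sb + sa + b + a) / (sb + sa - b - a)
    return first - second


for point in [(1.0, 2.0, 3.0), (2.5, 0.3, 1.7), (3.0, 1.0, 1.0)]:
    print(point, d1(*point))
```

The printed values are zero to within rounding error. Unlike RATSIMP, this check is not a proof; it only shows that the identity is consistent at the sampled points.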

Indefinite Integration. MACSYMA can handle integrals involving
rational functions and combinations of rational, algebraic
functions, and the elementary transcendental functions. It also has
knowledge about error functions and some of the higher
transcendental functions.

Below is an integral that is quite difficult to do by hand. It
is not found in standard tables in its given form although it may
transform to a recognized case. It is especially difficult to do by
hand unless one notices a trick that involves performing a partial
fraction decomposition of the integrand with respect to LOG(X).
However, MACSYMA handles it readily.

(D1) \[ \int \frac{\log(X) - 1}{\log^2(X) - X^2}\, dX \]

(C2) INTEGRATE(D1,X);
Time= 744 msec.

(D2) \[ \frac{\log(\log(X) + X)}{2} - \frac{\log(\log(X) - X)}{2} \]
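The partial fraction trick mentioned above can be written out in a few lines. This is a reconstruction for illustration, with L standing for log X, and is not taken from MACSYMA's internal derivation.

```latex
\begin{aligned}
\frac{L - 1}{L^2 - X^2}
  &= \frac{X + 1}{2X\,(L + X)} + \frac{X - 1}{2X\,(L - X)},
  \qquad L = \log X, \\[4pt]
\frac{d}{dX}(L + X) &= \frac{X + 1}{X},
  \qquad
\frac{d}{dX}(L - X) = -\frac{X - 1}{X}, \\[4pt]
\int \frac{L - 1}{L^2 - X^2}\, dX
  &= \frac{1}{2}\int \frac{(L + X)'}{L + X}\, dX
   - \frac{1}{2}\int \frac{(L - X)'}{L - X}\, dX
   = \frac{\log(L + X)}{2} - \frac{\log(L - X)}{2}.
\end{aligned}
```

Each partial fraction is a logarithmic derivative, which is why the answer consists of two logarithms and agrees with (D2).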

Definite Integration. Definite integration is far more difficult to
code than indefinite integration because the number of known
techniques is much larger. One has the added complication of taking
limits at the endpoints of the integral. MACSYMA has impressive
capabilities for definite integration. Here is an example of a
function whose definite integral does not appear to be tabulated:

(D1) \[ X^2\, \%E^{-U X^2} \log(X) \]

(C2) INTEGRATE(D1,X,0,INF),FACTOR;
Time= 138442 msec.

(D2) \[ -\frac{\sqrt{\%PI}\,\left(\log(U) + 2\log(2) + \%GAMMA - 2\right)}{8\, U^{3/2}} \]

In (C2) above we have asked MACSYMA to integrate (D1) with
respect to X from 0 to infinity. In the answer, %GAMMA is the
MACSYMA syntax for the Euler-Mascheroni constant 0.577215664...

In addition to definite integration, MACSYMA can perform
numeric integration using the Romberg numeric integration procedure.
There are a number of other numeric techniques available. One also
has the ability to evaluate expressions numerically to arbitrary
precision.
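To illustrate the Romberg procedure, the sketch below implements the textbook scheme (trapezoid rule with repeated Richardson extrapolation) in Python and applies it to the integrand of the definite integration example. The upper limit of 8 stands in for infinity, and the number of levels is an arbitrary choice. The closed form used for comparison is -sqrt(pi) (log U + 2 log 2 + gamma - 2) / (8 U^(3/2)), which follows from differentiating the gamma-function integral with respect to its exponent. None of this reflects how MACSYMA's own ROMBERG routine is coded.

```python
import math


def romberg(f, a, b, levels=12):
    """Romberg integration: trapezoid rule plus Richardson extrapolation."""
    h = b - a
    row = [0.5 * h * (f(a) + f(b))]
    for k in range(1, levels + 1):
        h *= 0.5
        # refine the trapezoid estimate using only the new midpoints
        mid_sum = sum(f(a + (2 * i + 1) * h) for i in range(2 ** (k - 1)))
        new = [0.5 * row[0] + h * mid_sum]
        for j in range(1, k + 1):
            new.append(new[j - 1] + (new[j - 1] - row[j - 1]) / (4 ** j - 1))
        row = new
    return row[-1]


def integrand(x, u=1.0):
    """x^2 exp(-u x^2) log(x), with the limiting value 0 at x = 0."""
    if x == 0.0:
        return 0.0
    return x * x * math.exp(-u * x * x) * math.log(x)


def closed_form(u):
    """Closed form of the integral from 0 to infinity."""
    euler_gamma = 0.5772156649015329
    return (-math.sqrt(math.pi)
            * (math.log(u) + 2 * math.log(2) + euler_gamma - 2)
            / (8 * u ** 1.5))


for u in (1.0, 2.5):
    print(u, romberg(lambda x: integrand(x, u), 0.0, 8.0), closed_form(u))
```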

Taylor/Laurent Series. The Taylor (Laurent) series capability is
very impressive. Below we ask for the series of (D1) about the
point X = 0, through terms of order 15. Notice that MACSYMA
computes this expression in less than 1/2 CPU second.

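(D1) \[ A \sin(X^3) + \frac{B \log(X^2 - X + 1)}{X^5} \]

(C2) TAYLOR(D1,X,0,15);
Time= 365 msec.

(D2)/T/
\[
\begin{aligned}
&-\frac{B}{X^4} + \frac{B}{2X^3} + \frac{2B}{3X^2} + \frac{B}{4X} - \frac{B}{5} - \frac{B X}{3} - \frac{B X^2}{7} + \frac{(B + 8A) X^3}{8} \\
&+ \frac{2B X^4}{9} + \frac{B X^5}{10} - \frac{B X^6}{11} - \frac{B X^7}{6} - \frac{B X^8}{13} + \frac{(3B - 7A) X^9}{42} \\
&+ \frac{2B X^{10}}{15} + \frac{B X^{11}}{16} - \frac{B X^{12}}{17} - \frac{B X^{13}}{9} - \frac{B X^{14}}{19} + \frac{(6B + A) X^{15}}{120} + \cdots
\end{aligned}
\]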

The program can also compute Taylor (Laurent) series in several
variables.
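The coefficients of the single-variable series above can be reproduced with exact rational arithmetic. The sketch below builds the series of log(1 - X + X^2) from the identity P L' = P' with P = 1 - X + X^2, shifts it by X^5, and adds the known series of sin(X^3). The helper names are illustrative, and this is a hand-rolled recurrence rather than MACSYMA's TAYLOR algorithm.

```python
from fractions import Fraction
from math import factorial


def log_series(n_max):
    """Coefficients l[0..n_max] of log(1 - x + x^2) = sum l[n] x^n.

    With P = 1 - x + x^2 and L = log(P), P*L' = P' gives, for
    c[n] = (n + 1) * l[n + 1]:  c[n] - c[n-1] + c[n-2] = P'[n].
    """
    dp = {0: -1, 1: 2}  # coefficients of P' = -1 + 2x
    c = []
    for n in range(n_max):
        val = dp.get(n, 0)
        if n >= 1:
            val += c[n - 1]
        if n >= 2:
            val -= c[n - 2]
        c.append(val)
    return [Fraction(0)] + [Fraction(c[n - 1], n) for n in range(1, n_max + 1)]


def sin_cube_coeff(k):
    """Coefficient of x^k in sin(x^3)."""
    if k < 3 or k % 3 or (k // 3) % 2 == 0:
        return Fraction(0)
    m = (k // 3 - 1) // 2
    return Fraction((-1) ** m, factorial(2 * m + 1))


def laurent_coeff(k, a, b, logs=log_series(20)):
    """Coefficient of x^k (-5 <= k <= 15) in a*sin(x^3) + b*log(x^2-x+1)/x^5."""
    return a * sin_cube_coeff(k) + b * logs[k + 5]


for k in (-4, 3, 9, 15):
    print(k, laurent_coeff(k, 1, 1))
```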

Ordinary Differential Equations. Another powerful feature is the
MACSYMA program ODE. ODE is a collection of algorithms for solving
ordinary differential equations. It was built over several years by
E.L. Lafferty, J.P. Golden, R.A. Bogen and B. Kuipers, and its
capabilities are described in the MACSYMA Reference Manual (6) in
V2-4-14.

In (C1), we first declare that Y is a function of X. This
assures that the derivative (2nd) of Y with respect to X will not
vanish when (C2) is evaluated.

(C1) DEPENDS(Y,X)$

(C2) (1+X^2)*DIFF(Y,X,2)-2*Y=0;

(D2) \[ (X^2 + 1)\, \frac{d^2 Y}{dX^2} - 2 Y = 0 \]

We now ask the system to solve (D2) for Y as a function of X
using the ODE command. The general solution with the two
integration constants, %K1 and %K2, is given in (D3) in about two CPU
seconds. The program can also find power series solutions for some
differential equations when it can solve the recurrence relation.
It does this in (D4). MACSYMA can be used to check the answer (D3).
In (C5) we tell the system to substitute (D3) into (D2),
differentiate the result and simplify.

(C3) ODE(D2,Y,X);
Time= 2068 msec.

(D3) \[ Y = \%K2\,(X^2 + 1)\left( \frac{\arctan(X)}{2} + \frac{X}{2X^2 + 2} \right) + \%K1\,(X^2 + 1) \]

(C4) ODE(D2,Y,X,SERIES);
Time= 8766 msec.

(D4) \[ Y = \%K1\,(X^2 + 1) - \%K2\, X \sum_{I=0}^{\mathrm{INF}} \frac{(-1)^I X^{2I}}{\left(I - \frac{1}{2}\right)\left(I + \frac{1}{2}\right)} \]

(C5) D2,D3,DIFF,RATSIMP;
Time= 2051 msec.
(D5) 0=0
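The series form (D4) can be cross-checked against the recurrence that the differential equation imposes on power-series coefficients. Substituting Y = sum a[n] X^n into (D2) gives (n+2)(n+1) a[n+2] + (n(n-1) - 2) a[n] = 0, that is, a[n+2] = -(n-2)/(n+2) a[n]. The Python sketch below compares this recurrence with the closed-form coefficients read off (D4); the function names are illustrative.

```python
from fractions import Fraction


def odd_series_closed(i):
    """Coefficient of x^(2i+1) in -x * sum (-1)^i x^(2i) / ((i-1/2)(i+1/2))."""
    half = Fraction(1, 2)
    return -Fraction((-1) ** i) / ((i - half) * (i + half))


def series_by_recurrence(a0, a1, n_max):
    """Coefficients a[0..n_max] from a[n+2] = -(n-2)/(n+2) * a[n]."""
    a = [Fraction(a0), Fraction(a1)]
    for n in range(n_max - 1):
        a.append(-Fraction(n - 2, n + 2) * a[n])
    return a


# a0 = 1 reproduces the %K1 part (x^2 + 1); a1 = 4 matches the I = 0 term.
coeffs = series_by_recurrence(1, 4, 21)
print(coeffs[:8])
```

The even coefficients terminate after X^2, which reproduces the polynomial solution X^2 + 1, while the odd coefficients follow the infinite sum in (D4).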

MACSYMA is a computer system which performs many highly
sophisticated computations that will amaze people who use
mathematical tools. For many types of calculations MACSYMA offers
enormous advantages over numeric systems. In this paper I have
shown but a few of the capabilities of MACSYMA. It is difficult to
present many capabilities in a few pages. References (5, 13) provide
many more examples as well as motivating the use of MACSYMA in
several fields of research and development.

Literature Cited

1. Moses, J. MACSYMA - the fifth year. Proceedings Eurosam 74
   Conference. Aug. 1974, Stockholm.
2. Pavelle, R.; Rothstein, M.; Fitch, J.P. Computer Algebra.
   Scientific American 1981, 245.
3. van Hulzen, J.A.; Calmet, J. Computer Algebra Systems. In
   "Computer Algebra, Symbolic and Algebraic Manipulation";
   Buchberger, B.; Collins, G.E.; Loos, R., Eds.; Springer-Verlag:
   Wien - New York, 1983, p. 220.
4. Yun, D.Y.Y.; Stoutemyer, D. Symbolic Mathematical Computation.
   In "Encyclopedia of Computer Science and Technology" 15; Belzer,
   J.; Holzman, A.G., Eds.; Marcel Dekker: New York - Basel, 1980,
   p. 235.
5. Pavelle, R., Ed., "Applications of Computer Algebra"; Kluwer:
   Boston, 1985.
6. "The MACSYMA Reference Manual (Version 10)"; Massachusetts
   Institute of Technology and Symbolics, Inc.: Cambridge, MA, Dec.
   1984.
7. Proceedings of the 1977 MACSYMA Users' Conference. R.J.
   Fateman, Ed., Berkeley, CA, July 1977. NASA: CP-2012,
   Washington, D.C.
8. Proceedings of the 1979 MACSYMA Users' Conference. V.E. Lewis,
   Ed., June 1979, Washington, D.C.
9. Proceedings of the 1984 MACSYMA Users' Conference. V.E.
   Golden, Ed., July 1984, General Electric Corporate Research and
   Development, Schenectady, NY.
10. Wang, P.S. Math. Comp. 1978, 32, 1215.
11. Mansouri, F.; Chang, L.N. Phys. Rev. D 1976, 13, 3192.
12. Pavelle, R. Phys. Rev. Lett. 1978, 40, 267.
13. Pavelle, R.; Wang, P.S. J. Symbolic Computation 1985, 1,
    69-100.

RECEIVED January 24, 1986

9
A Rule-Based Declarative Language
for Scientific Equation Solving

Allan L. Smith

Chemistry Department, Drexel University, Philadelphia, PA 19104

Procedural languages for scientific computation are briefly reviewed
and contrasted with declarative languages. The capabilities of
TK!Solver are explained, and two examples of its use in chemical
computations are given.

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch009

Most of the applications of artificial intelligence in chemistry so far have not involved
numerical computation as a primary goal. Yet there are aspects of the AI approach to
problem-solving which have relevance to computation. In scientific computation, one
could view the knowledge base as the set of equations, input variable values, and unit
conversions relevant to the problem, and the inference engine as the numerical method
used to solve the equations. This paper describes such a software system,
TK!Solver.

Brief Review of Software for Scientific Computation

Since the beginning of electronic computing, one of the major incentives for
developing computer languages has been to improve the ease of solving mathematical
problems arising in science and engineering. Many such problems can be reduced to
the solution of a set of Ν algebraic equations - not necessarily linear - in Ν
unknowns. The earliest ways of doing this involved direct hand coding in
hexadecimal machine language or in assembly language mnemonics, specifying in
excruciating detail the procedures needed to transform input data into results. My
first experience with computers (1) was on a Bendix laboratory computer, generating
three-component polymer-copolymer phase diagrams in assembly language. After a
summer of this I became quickly convinced that there must be a better way.
In the early 1960's the first compiled procedural programming language for
scientific computation, FORTRAN, became widely used in the US, with a parallel
development of the use of ALGOL in Europe. Later in the decade, the interpretive
procedural language BASIC emerged, followed by the powerful algebraic notational
language APL. The first structured, procedural language developed to teach the
concepts of programming, Pascal, appeared in 1971, followed later in the decade by
the C language.
In all of these procedural languages (also called imperative languages (2)), one of
the basic elements of syntax is the assignment statement, in which an algebraic


expression is evaluated and stored in a named storage location called a variable.


Although both BASIC and FORTRAN use the equality symbol " = " for the
assignment statement, Pascal emphasizes the procedural nature of the assignment
statement by using the symbol " := " , thus distinguishing it from an algebraic
equation. Another characteristic of procedural languages is that they specify in detail
the procedures and flow of control needed to solve a problem, using such structures
as conditionals and loops.
Parallel to, but largely independent of, this development of procedural
computational languages was the evolution of non-procedural or declarative
languages used for symbolic processing. Eisenbach and Sadler (2) have reviewed the
evolution of declarative languages, which began with LISP in 1960 and includes such
recent languages as Prolog. One of the characteristics of declarative languages is that
problems are defined in terms of logical or mathematical relationships, rather than
assignment statements and flow of control, and that the language itself then decides
how best to solve the problem posed and in what order to use the information
provided. Declarative languages have not so far been widely used in scientific
computation because of their computational inefficiency.


There have also been important developments in the past decade in scientific
applications software as scientists and engineers have looked for other ways of
solving problems than by writing a program in FORTRAN or another procedural
language. Libraries of mathematical procedures commonly used in science and
engineering became available for those who wanted to write their own procedural
software but needed robust numerical algorithms in an easily used form. One of the
first full scientific software packages which freed the user from writing in a compiled
or interpreted procedural language was RS/1, which evolved as a part of the Prophet
Network established by NIH for its research grantees in the 1970's and now runs as
a separate package on DEC superminis and personal computers (3). Another was the
electronic spreadsheet, first embodied in its simple tabular format in Visicalc but now
enhanced with plotting and sorting capabilities in Lotus 1-2-3 and several other
packages. A third example is statistical software such as SPSS (4) or Minitab (5).
Symbolic processing languages such as LISP led to the development of symbolic
mathematics packages such as MACSYMA; their use in chemistry has been reviewed
by Johnson (6). A recent ACS Symposium on symbolic algebraic manipulation
contains a full description of MACSYMA among other systems, and a variety of
applications in chemistry (7).
Scientific applications software packages are often characterized by close
attention to the design of the user interface, sometimes at the expense of program size
or execution time. By far the dominant computational idiom in these packages,
however, is procedural. For example, RS/1 has an internal language called RPL,
modelled after the procedural language PL/1, in which specialized procedures and
functions not available in the package may be written by the user. In spreadsheets, the
cell is the basic storage location for either data or formulae. Cells are provided with
data by an assignment process, and formulae reference other cell locations as
variables.

TK!Solver
TK!Solver (8) is a high-level computer language for solving sets of algebraic
equations and tabulating or plotting their results. In TK!Solver, equations are viewed
as relationships or rules, not as assignment statements, and in that sense it may be
viewed as a declarative language. The basic computational approach taken by
TK!Solver grew out of the research of textile engineer Milos Konopasek in the
1970's. It was realized early on by Konopasek and Papaconstadopoulos (9) that a
high level computational language need not be procedural but could be declarative;

this point has been recently amplified by Konopasek and Jayaraman (10), who also
make the case for TK!Solver's being an expert system for equation solving.
To produce TK!Solver, the problem-solving methodology implemented by
Konopasek in his Question Answering System (9) was combined with the experience
in designing full-screen user interfaces of Software Arts, Inc. (the originators of the
electronic spreadsheet). The goal of the language was to obviate three of the
time-consuming stages of procedural program development (11): (1) algebraic
transformations necessary for formulating assignment statements; (2) sequencing
assignment statements to secure desired flow of information through the program;
and (3) setting up input and output statements. The capabilities of TK!Solver, which
runs on a number of different personal computers, are as follows (10, 11):
(1) It parses entered algebraic equations and generates a list of variables.
(2) It solves sets of equations using a consecutive substitution procedure (the
direct solver).
(3) It solves sets of simultaneous (non-linear) algebraic equations by a modified
Newton-Raphson iterative procedure when consecutive substitution fails
(the iterative solver).
(4) It searches through tables of data and evaluates either unknown function
values or arguments when required in solving.
(5) It performs unit conversions with definable conversion factors.
(6) It detects inconsistencies in problem formulation and domain errors.
(7) It generates series of solutions for lists of input data and displays results in
tabular or graphical form.
To see how such a language can speed up the process of equation-solving,
consider the steps needed to solve a set of algebraic equations when using a
procedural language. First, you must identify the variable or variables for which you
need to solve. Next, you must use algebraic substitution methods to express the
variables to be solved for in terms of the known variables using assignment
statements. Finally, you must write a program to input values for the known
variables, evaluate the unknown variables, and output the results. There are several
disadvantages to this method. If a different combination of variables serves as input
for another similar problem based on the same set of equations, the algebra must be
reworked to solve for those new variables. In many cases it may not be possible to
obtain analytic expressions usable in assignment statements, so you must find some
numerical approximation algorithm suitable for the problem at hand and either obtain
or write the code based on that algorithm.
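By contrast, a declarative solver works from the rules themselves. The Python sketch below is a much-simplified illustration of consecutive substitution: each rule is stored as a residual that vanishes when the rule holds, and any rule with exactly one unknown is solved for that unknown by a secant iteration. The two rules are written in the form of the compressibility-factor and number-density relations (z = PV/(nRT) and nd = n/V); the data layout and the helper names `solve_one` and `direct_solve` are assumptions made for illustration and do not describe TK!Solver's internals.

```python
def solve_one(residual, values, unknown, guess=1.0):
    """Secant iteration for the single unknown appearing in one rule."""
    def f(x):
        return residual({**values, unknown: x})

    x0, x1 = guess, guess + 0.5
    f0, f1 = f(x0), f(x1)
    for _ in range(100):
        if f1 == f0:
            break
        x0, f0, x1 = x1, f1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f1 = f(x1)
        if abs(x1 - x0) <= 1e-13 * max(1.0, abs(x1)):
            break
    return x1


# Each rule: (variables it mentions, residual that is zero when it holds).
RULES = [
    (("z", "P", "V", "n", "R", "T"),
     lambda v: v["z"] * v["n"] * v["R"] * v["T"] - v["P"] * v["V"]),
    (("nd", "n", "V"),
     lambda v: v["nd"] * v["V"] - v["n"]),
]


def direct_solve(rules, known):
    """Keep applying any rule that has exactly one unknown left."""
    values = dict(known)
    progress = True
    while progress:
        progress = False
        for names, residual in rules:
            missing = [name for name in names if name not in values]
            if len(missing) == 1:
                values[missing[0]] = solve_one(residual, values, missing[0])
                progress = True
    return values


inputs = {"P": 2.0, "V": 10.0, "T": 300.0, "R": 0.0820568, "z": 1.0}
print(direct_solve(RULES, inputs))
```

The same rule list serves when a different set of variables is known, for example n, V, T and z with P unknown; no algebra has to be reworked, which is the point of the declarative style.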

A Chemical Example: The van der Waals Gas

Take, for example (12), the problem of solving for the P-V-T properties of a real gas
obeying the van der Waals equation of state,

\[ P = \frac{nRT}{V - nb} - \frac{n^2 a}{V^2} \qquad (1) \]

where a and b are coefficients characteristic of a given gas. Solving for P, given n,
V, and T is a simple assignment statement, but solving for n given P, V, and T
requires considerable algebraic manipulation, followed either by applying the formula
for the roots of a cubic equation or by using a numerical technique for determining
roots (the latter usually requires more mathematical analysis - for example, finding
first derivatives when using the Newton-Raphson method).
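To make the procedural route concrete, the sketch below solves Equation 1 for n by Newton-Raphson in Python, starting from the ideal-gas estimate. The derivative dP/dn = RTV/(V - nb)^2 - 2na/V^2 has to be worked out by hand, which is exactly the extra analysis referred to above. The coefficient values in the usage lines are illustrative placeholders, not data from the REALGAS.TK model.

```python
R = 0.0820568  # gas constant in L atm / (mol K)


def pressure(n, V, T, a, b):
    """van der Waals equation of state, Equation 1."""
    return n * R * T / (V - n * b) - n * n * a / (V * V)


def solve_moles(P, V, T, a, b, tol=1e-12, max_iter=50):
    """Newton-Raphson for n, starting from the ideal-gas value (z = 1)."""
    n = P * V / (R * T)
    for _ in range(max_iter):
        f = pressure(n, V, T, a, b) - P
        # hand-derived derivative of Equation 1 with respect to n
        dfdn = R * T * V / (V - n * b) ** 2 - 2 * n * a / (V * V)
        step = f / dfdn
        n -= step
        if abs(step) <= tol * max(1.0, abs(n)):
            break
    return n


# Illustrative coefficients (assumed values, in L^2 atm/mol^2 and L/mol).
A_COEFF, B_COEFF = 4.39, 0.0514
P_test = pressure(5.0, 10.0, 350.0, A_COEFF, B_COEFF)
n_found = solve_moles(P_test, 10.0, 350.0, A_COEFF, B_COEFF)
z = P_test * 10.0 / (n_found * R * 350.0)
print(n_found, z)
```

The round trip generates a pressure from a known n and then recovers n from it. Starting from the ideal-gas guess selects the gas-like root of the cubic; other starting points may converge to a different root below the critical temperature.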
Figure 1 shows the Rule Sheet for a TK!Solver model REALGAS.TK (12).
The first rule is the van der Waals equation of state. The second defines the gas
constant, and the third rule defines the number density. The fourth defines the
compressibility factor z, a dimensionless variable which measures the amount of
In Artificial Intelligence Applications in Chemistry; Pierce, T., el al.;


ACS Symposium Series; American Chemical Society: Washington, DC, 1986.
114 ARTIFICIAL INTELLIGENCE APPLICATIONS IN CHEMISTRY

departure of a real gas from ideality. The next three rules give the critical pressure,
molar volume, and temperature of a van der Waals gas in terms of the coefficients a and
b. The van der Waals equation can be recast in a form which uses only reduced,
dimensionless variables; these are defined in the next three rules. The last two rules
provide values for the van der Waals coefficients a and b when the name of the gas
is given (user-defined functions with symbolic domain elements and numerical range
elements can be used in any model which requires reference to built-in data tables).
S Rule

  "Equation of State of a van der Waals Gas. Chap. 4. Model name: REALGAS.TK
* R = 0.0820568            "Value of gas constant
* nd = n/V                 "Number density
* z = P*V/(n*R*T)          "Compressibility factor
* Pc = a/(27*b^2)          "Critical Pressure
* Vc = 3*b                 "Critical Molar Volume
* Tc = 8*a/(27*b*R)        "Critical Temperature
* Pred = P/Pc              "Reduced pressure
* Vred = V/Vc              "Reduced volume
* Tred = T/Tc              "Reduced temperature
* a = acoeff(gasname)      "Function for van der Waals a coefficient
* b = bcoeff(gasname)      "Function for van der Waals b coefficient

Figure 1. Rule Sheet for Model REALGAS.TK

A typical use for this model would be to solve for the number of moles of a gas,
given its identity, pressure, volume, and temperature. The iterative solver is used for
this purpose. You must decide which variable to choose for iteration and what a
reasonable initial guess is. Real gases approach ideal behavior at low pressure and
moderate temperatures. Since the compressibility factor z is 1 for an ideal gas, and
since knowing z along with P, V, and T allows a calculation of n, we choose z as
the iteration variable and 1.0 as the initial guess.
The Variable Sheet with the solution to such a problem is shown in Figure 2.
Unit conversions from psi to atmospheres, from cubic feet to liters, and from
degrees Fahrenheit to kelvins have been built into the model via the Units Sheet. For
input values of 100 cubic feet of acetylene at 300 psi and 66°F, there are 728.9 moles
of acetylene, and the value of z of 0.874 indicates that the deviation from ideality is
12.6%.
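
The iteration on z described above can be sketched outside TK!Solver. The following Python fragment is only an illustration under stated assumptions: the secant update, the function names, and the second starting point (0.95) are choices made here, not a description of TK!Solver's internal algorithm. The inputs are the already-converted values implied by the reduced variables in Figure 2 (P = Pred*Pc, V = Vred*Vc, T = Tred*Tc), in atm, liters, and kelvins.

```python
# Gas constant as given in the Rule Sheet, in l*atm/(mol*K).
R = 0.0820568

def vdw_pressure(n, V, T, a, b):
    """Pressure (atm) from the van der Waals equation of state, Eq. (1)."""
    return n * R * T / (V - n * b) - n**2 * a / V**2

def solve_moles(P, V, T, a, b, z0=1.0, tol=1e-10, max_iter=100):
    """Find n by secant iteration on the compressibility factor z.

    z0 = 1.0 is the ideal-gas guess suggested in the text; the secant
    method itself is an assumption made for this sketch.
    """
    def residual(z):
        n = P * V / (z * R * T)              # rule z = P*V/(n*R*T), solved for n
        return vdw_pressure(n, V, T, a, b) - P

    z_old, z = z0, 0.95 * z0                 # two starting points near ideality
    f_old, f = residual(z_old), residual(z)
    for _ in range(max_iter):
        if abs(f) < tol or f == f_old:       # converged, or no usable slope
            break
        z_new = z - f * (z - z_old) / (f - f_old)
        z_old, f_old = z, f
        z, f = z_new, residual(z_new)
    return P * V / (z * R * T), z

# Acetylene coefficients a and b are the values listed in Figure 2.
n, z = solve_moles(P=20.4138, V=748.048, T=292.049, a=4.39, b=0.05136)
```

With these inputs the sketch should reproduce the Figure 2 entries for n and z to within rounding; a different set of known variables would need a different residual function, which is the inconvenience the declarative approach avoids.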

Another Example: Acid Rain

Problems in chemical equilibrium with many reactions involving many species
often generate mathematical models containing large sets of simultaneous, nonlinear
equations which must be solved by numerical means. TK!Solver is a good tool for
solving such problems. For example, consider the acid-base chemistry of a raindrop.
Vong and Charlson (13) have developed an equilibrium model which predicts the pH
of cloud water, assuming an atmosphere with realistic levels of three soluble,
hydrolyzable gases: SO2, CO2, and NH3. Also included is the effect of acidic dry
aerosols, particles of sub-micron diameter containing high concentrations of sulfuric
and nitric acid.


St  Input      Name     Output      Unit        Comment

    300        P                    psi         pressure
    100        V                    ft^3        volume
               n        728.92419   mol         number of moles
               R        .0820568    l*atm/(mo   gas constant
    66.000000  T                    oF          temperature
    'acetylen  gasname              'text       name of the gas
               z        .87417992   decimal     compressibility factor
               nd       .97443505   mol/l       molar density

               a        4.39        atm*l^2/m   van der Waals a coefficient
               b        .05136      l/mol       van der Waals b coefficient
               Pc       61.638310   atm         critical pressure
               Vc       .15408      l           critical molar volume
               Tc       308.63925   K           critical temperature

               Pred     .33118688   decimal     reduced pressure
               Vred     4854.9325   decimal     reduced volume
               Tred     .94624676   decimal     reduced temperature

Figure 2. Variable Sheet for REALGAS.TK with Solution

There are five laws of chemical equilibrium relevant to the Charlson-Vong model:
(1) the ideal gas law, relating gas species density to its temperature and partial
pressure; (2) Henry's law, relating the partial pressure to the concentration of
dissolved gas; (3) the law of mass action, giving equilibrium constant expressions for
the hydrolysis reactions of the dissolved gases; (4) conservation of mass for species
containing sulfur(IV), sulfur(VI), carbon(IV), nitrogen(V), and nitrogen(-III); and
(5) conservation of charge. Applying these laws, Vong and Charlson were able to
calculate the pH of a raindrop by solving a set of 17 equations in 29 variables (cloud
water content, temperature, partial pressures, and species concentrations) and 9
parameters (Henry's law constants, equilibrium constants, and the gas constant).
They wrote a FORTRAN program which solved all equations but one, that of charge
conservation. The pH at electrical neutrality was determined by a graphical method, in
which the total positive and negative charge concentrations were calculated and
plotted for a series of assumed pH's and the crossing point found.
A TK!Solver model called RAINDROP.TK has been developed to incorporate
the full Charlson-Vong model of cloud water equilibrium (12), including the
temperature dependence of all equilibrium constants. The iterative solver makes it
possible to compute the pH at charge neutrality without having to make plots of
intermediate results. The Rule Sheet is shown in Figure 3.
The Unit Sheet contains a number of conversions necessary to accommodate the
variety of units used in experimental atmospheric chemistry. The Variable Sheet is
arranged so that the variables at the top are the ones normally chosen as input
variables. Since the usual goal of running the model is to determine the pH of the
raindrop, the variable pH is chosen as the one on which to iterate.
The following problem, taken to match the conditions in Figure 2 of reference
13, is typical of those solved in less than one minute on an IBM PC with this model:
"A cloud at 278 K contains 0.5 grams of liquid water per cubic meter of air. The
atmosphere of the cloud contains 5 ppb sulfur dioxide, 340 ppm carbon dioxide, 0.29
μg/m^3 of nitrogen base, 3 μg/m^3 of sulfate aerosol, and no nitrate aerosol. What is
the pH of the cloud water?" Figure 4 shows the Variable Sheet after solution.


Rule

Equilibrium pH of a Raindrop, Charlson-Vong model. Chap. 8. Model name: RAINDROP.TK

PSO2 = NSO2 * R * T                          "Ideal gas law for SO2
PNH3 = NNH3 * R * T                          "Ideal gas law for NH3
PCO2 = NCO2 * R * T                          "Ideal gas law for CO2

PSO2 = CSO2 * KHS                            "Henry's law for SO2
PNH3 = CNH3 * KHN                            "Henry's law for NH3
PCO2 = CCO2 * KHC                            "Henry's law for CO2

K1S = CHSO3m * CHp / CSO2                    "Mass action law for SO2 - HSO3m
K2S = CSO32m * CHp / CHSO3m                  "Mass action law for HSO3m - SO32m
KB = CNH4p * COHm / CNH3                     "Mass action law for NH3 - NH4p
K1C = CHCO3m * CHp / CCO2                    "Mass action law for CO2 - HCO3m
K2C = CCO32m * CHp / CHCO3m                  "Mass action law for HCO3m - CO32m
KW = CHp * COHm                              "Mass action law for water

NTS4 = NSO2 + L * (CSO2 + CHSO3m + CSO32m)   "Mass balance for sulfur(IV)
NTN3m = NNH3 + L * (CNH3 + CNH4p)            "Mass balance for nitrogen(-III)
NTC4 = NCO2 + L * (CCO2 + CHCO3m + CCO32m)   "Mass balance for carbon(IV)
NTN5 = L * CNO3m                             "Mass balance for nitrogen(V)
NTS6 = L * CSO42m                            "Mass balance for sulfur(VI)

CNH4p + CHp = CHSO3m + 2*CSO32m + COHm + CHCO3m + 2*CCO32m + CNO3m + 2*CSO42m
                                             "Charge balance

pH = -log(CHp)                               "Definition of pH

"Temperature-dependent equilibrium constants

KHS = 0.379 * exp(-3145.99 * (278 - T) / (278 * T))
KHC = 16.6 * exp(-2367.65 * (278 - T) / (278 * T))
KHN = 7.11E-3 * exp(-3730.87 * (278 - T) / (278 * T))
K1S = 2.06E-2 * exp(2003.54 * (278 - T) / (278 * T))
K2S = 8.88E-8 * exp(1461.46 * (278 - T) / (278 * T))
K1C = 2.94E-7 * exp(-1716.92 * (278 - T) / (278 * T))
K2C = 2.74E-11 * exp(-2217.49 * (278 - T) / (278 * T))
KB = 1.5E-5 * exp(-685.59 * (278 - T) / (278 * T))
KW = 1.82E-15 * exp(-7057.27 * (278 - T) / (278 * T))
R = 0.0820565                                "Ideal gas constant
Figure 3: Rule Sheet for Model RAINDROP.TK
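
As a rough cross-check of the rules in Figure 3, the charge balance can be solved for pH in a few lines of Python. This is a sketch under stated assumptions rather than the RAINDROP.TK model itself: the temperature is fixed at 278 K so that every exponential factor equals 1, bisection stands in for the TK!Solver iterative solver, and the inputs are pre-converted to mol per liter of air using values implied by the solved Variable Sheet (NTS6 = L*CSO42m, NTN3m = NNH3 + L*(CNH3 + CNH4p), and L = 5E-7 liters of water per liter of air for 0.5 g/m^3). The function names are hypothetical.

```python
R, T = 0.0820565, 278.0                       # l*atm/(mol*K), K
KHS, KHC, KHN = 0.379, 16.6, 7.11e-3          # Henry's law constants at 278 K
K1S, K2S = 2.06e-2, 8.88e-8                   # sulfur(IV) equilibria
K1C, K2C = 2.94e-7, 2.74e-11                  # carbon(IV) equilibria
KB, KW = 1.5e-5, 1.82e-15                     # ammonia and water

def charge_imbalance(pH, L, PSO2, PCO2, NTN3m, NTS6, NTN5):
    """Positive minus negative charge concentration at an assumed pH."""
    CHp = 10.0 ** (-pH)
    COHm = KW / CHp
    CSO2 = PSO2 / KHS                          # Henry's law
    CHSO3m = K1S * CSO2 / CHp                  # mass action
    CSO32m = K2S * CHSO3m / CHp
    CCO2 = PCO2 / KHC
    CHCO3m = K1C * CCO2 / CHp
    CCO32m = K2C * CHCO3m / CHp
    # Nitrogen(-III) mass balance solved for dissolved NH3:
    # NTN3m = CNH3*KHN/(R*T) + L*(CNH3 + KB*CNH3/COHm)
    CNH3 = NTN3m / (KHN / (R * T) + L * (1.0 + KB / COHm))
    CNH4p = KB * CNH3 / COHm
    CNO3m, CSO42m = NTN5 / L, NTS6 / L
    positive = CNH4p + CHp
    negative = (CHSO3m + 2 * CSO32m + COHm + CHCO3m
                + 2 * CCO32m + CNO3m + 2 * CSO42m)
    return positive - negative

def solve_pH(lo=1.0, hi=10.0, tol=1e-9, **inputs):
    """Bisection on pH; the imbalance falls steadily as pH rises."""
    f_lo = charge_imbalance(lo, **inputs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = charge_imbalance(mid, **inputs)
        if f_lo * f_mid > 0:
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

pH = solve_pH(L=5e-7, PSO2=5e-9, PCO2=340e-6,
              NTN3m=1.6111e-11, NTS6=3.125e-11, NTN5=0.0)
```

Replacing the graphical crossing-point search with a root finder is the same step the iterative solver takes; the sketch should land close to the pH reported in Figure 4.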
Summary of Other Chemical Applications

In addition to the two examples above, I have developed TK!Solver models for the
ideal gas, for two-component mixture concentrations, for acid-base chemistry
(including the generation of titration curves), for transition metal complex equilibria,
for general gaseous and solution equilibria, and for linear regression (12).
Drexel undergraduate students in both the lecture and the laboratory of physical
chemistry have been using TK!Solver for such calculations as least squares fitting of
experimental data, van der Waals gas calculations, and quantum mechanical
computations (plotting particle-in-a-box wavefunctions, atomic orbital electron
densities, etc.). I use TK!Solver in lectures (on a Macintosh with video output to a
25" monitor) to solve simple equations and plot functions of chemical interest.


TK!Solver has also had heavy use in the material balance course in chemical
engineering, and in a mathematical methods course in materials engineering. Graduate
students in chemistry are using it in research projects in spectroscopy and kinetics.
In the teaching of quantum mechanics, TK!Solver has proved especially useful.
For example, Berry, Rice, and Ross (14) give several problems on the regions of

Input  Name     Output      Unit        Comment

278    T                    K           temperature
.5     L                    g/m^3       liquid water content of the cloud

5      PSO2                 ppb         partial pressure of SO2
340    PCO2                 ppm         partial pressure of CO2
.29    NTN3m                ug(N)/m^3   total nitrogen base concentration
3      NTS6                 ug(SO4)/m   sulfate aerosol concentration
0      NTN5                 ug(NO3)/m   nitrate aerosol concentration

       pH       4.0190385   decimal     pH of water in the cloud

       PNH3     2.9020E-4   ppb         partial pressure of NH3

       CNO3m    0           mol/lw      concentration of NO3 anion
       CSO42m   .0000625    mol/lw      concentration of SO4 anion

       NSO2     2.192E-10   mol/la      concentration of gaseous SO2
       CSO2     1.3193E-8   mol/lw      concentration of dissolved SO2
       CHSO3m   2.8395E-6   mol/lw      concentration of HSO3 anion
       CSO32m   2.6344E-9   mol/lw      concentration of SO3 anion

       NCO2     1.4905E-5   mol/la      concentration of gaseous CO2
       CCO2     2.0482E-5   mol/lw      concentration of dissolved CO2
       CHCO3m   6.2915E-8   mol/lw      concentration of HCO3 anion
       CCO32m   1.801E-14   mol/lw      concentration of CO3 anion

       NNH3     1.272E-14   mol/la      concentration of gaseous NH3
       CNH3     4.082E-11   mol/lw      concentration of dissolved NH3
       CNH4p    3.2197E-5   mol/lw      concentration of NH4 cation

       CHp      9.5711E-5   mol/lw      concentration of hydrogen ion
       COHm     1.902E-11   mol/lw      concentration of hydroxide ion

       NTS4     2.206E-10   mol/la      total concentration of sulfur(IV)
       NTC4     1.4905E-5   mol/la      total concentration of carbon(IV)

       KHS      .379        atm*la/mo   Henry's law constant for SO2
       KHC      16.6        atm*la/mo   Henry's law constant for CO2
       KHN      .00711      atm*la/mo   Henry's law constant for NH3

       K1S      .0206       decimal     equilibrium constant for SO2 - HSO3
       K2S      8.88E-8     decimal     equilibrium constant for HSO3 - SO3
       K1C      2.94E-7     decimal     equilibrium constant for CO2 - HCO3
       K2C      2.74E-11    decimal     equilibrium constant for HCO3 - CO3
       KB       .000015     decimal     equilibrium constant for NH3 - NH4
       KW       1.82E-15    decimal     ionization constant for water
       R        .0820565    la*atm/(m   ideal gas constant

Figure 4 : Variable Sheet for Solution to Sample Problem


bonding and anti-bonding in diatomics, one of which requires the calculation and
plotting of contours of constant bonding force. They suggest calculation of the
bonding force on a large grid of points and then connecting points of constant force,
but with TK!Solver it is possible to solve directly the set of parametric equations in r
and theta and to plot the resulting contours.
In summary, the rule-based, declarative approach to solving sets of algebraic
equations presented by TK!Solver has proved to be a fruitful medium for chemical
computations.
Literature Cited

1. S. Krause, A. L. Smith, and M. G. Duden, J. Chem. Phys. 1965, 43, 2144-45.
2. S. Eisenbach and C. Sadler, Byte 1985, 10, 181-87; see also other articles on
declarative languages in the same issue of Byte.
3. "RS/1 Command Language Guide"; BBN Research Systems, Cambridge, Mass.,
1982.
4. "SPSS X User's Guide"; McGraw-Hill, New York, 1983.
5. T. A. Ryan, Jr., B. L. Joiner, and B. F. Ryan, "Minitab Student Handbook";
Duxbury Press, Boston, 1976.
6. C. S. Johnson, J. Chem. Inf. Comp. Sci. 1983, 23, 151-7.
7. R. Pavelle, ed. "Applications of Computer Algebra"; Kluwer Academic
Publishers, Hingham, Mass., 1985.
8. TK!Solver is a software product developed by Software Arts, of Wellesley,
Mass., and introduced in late 1982. At the present time (November, 1985), the
rights to distribute TK!Solver belong to the Lotus Development Corporation.
9. M. Konopasek and C. Papaconstadopoulos, Computer Languages 1978, 3,
145-155.
10. M. Konopasek and S. Jayaraman, Byte 1984, 9, 137-145. See also M.
Konopasek and S. Jayaraman, "The TK!Solver Book"; Osborne/McGraw-Hill,
1984.
11. M. Konopasek, "Software Arts' TK!Solver: A Message to Educators";
unpublished manuscript, October, 1984.
12. The examples in this paper are from A. L. Smith, "TK!Solver Pack in Chemical
Equilibrium and Chemical Analysis"; McGraw-Hill, to be published.
13. R. J. Vong and R. J. Charlson, J. Chem. Ed. 1985, 62, 141-3.
14. R. S. Berry, S. A. Rice, and J. Ross, "Physical Chemistry"; John Wiley, 1980;
Chap. 6.

RECEIVED January 16, 1986

10
A Chemical-Reaction Interpreter for Simulation of Complex Kinetics

David Edelson¹

AT&T Bell Laboratories, Murray Hill, NJ 07974

Simulation of the kinetics of complex chemical systems is finding ever
increasing use for analysis of reaction mechanism as well as for process
prediction and control. Software for the solution of the large number of
coupled mass-action differential equations is now readily available, as are
reaction rate data banks for many systems of interest. However, application
of the technique is discouraged by the tedious, error-prone task of manually
formulating the differential equation set and coding it for the computer.
Several chemical reaction interpreters which can do this have been written
over the years; this report describes our most recent version which uses
modern operating systems and programming techniques to implement an
interactive, user-friendly program. Portability was a prime consideration in
its design so that it could be interfaced with any differential equation
solving program. Although it was written in C for use on machines having a
UNIX operating system, the subroutines that it produces for the equation
solver are in FORTRAN, so that they can be ported to other machines, and are
compatible with most simulation packages in use today. Special features
include free form input, batch or interactive operation, full ASCII
capability, and dynamic storage allocation. The extensive use of the
"structure" data type in the C source code makes it easy to modify or enhance
the interpreter to suit the needs of the current application or computing
environment.

Computer simulation of chemical reaction or reaction-transport systems has long
been used in chemical engineering process design, and has more recently moved into
the chemical research area, where it has become a tool for the elucidation of
chemical mechanism (1). The technique has also found application in the prediction
of the behavior of large complex chemical systems, such as atmospheric and
environmental systems (2,3), especially in the study of the effect of pollutants
and strategies for the minimization of their effects (4).

¹Current address: Department of Chemistry, Florida State University, Tallahassee,
FL 32306

0097-6156/86/0306-0119$06.00/0
© 1986 American Chemical Society

The mathematical problem posed is the solution of the simultaneous differential
equations which arise from the mass-action treatment of the chemistry. For the
homogeneous, well-mixed reactor, this becomes a set of ordinary, non-linear,
first-order differential equations. For systems that are not spatially uniform and
involve material and energy transport, the chemical terms are coupled with the
fluid mechanics and heat transfer to give sets of partial differential equations.
Numerical techniques for solving these systems have been extensively developed
(5,6), but regardless of the strategy used to solve the set of ordinary
differential equations, or the spatial discretization methods employed for partial
differential equations (7,8), the final task is that of solving very large matrix
systems of algebraic equations. This has traditionally been the forte of the large
main-frame computer, but the rapidly expanding capability of minis and micros has
enabled them to handle the solution of modest problems. At the other end of the
scale, the expanding scope of application of these simulation methods, especially
to two- and three-dimensional systems, has vastly increased the number of equations
to be solved, and so has entered the realm of the supercomputer.
From the chemist's point of view as a user, these simulation techniques require
him to provide computer code for the time derivative of each chemical species in
the mechanism. According to the principle of mass-action, the derivative of the
x-th species concentration in a mechanism of M reactions involving P chemical
species is given by
$$\frac{d[N_x]}{dt} = \sum_{i=1}^{M} k_i \prod_{j=1}^{P} [N_j]^{\nu_{ij}}$$

where k_i is the rate constant of the i-th reaction, and ν_ij is the
stoichiometric coefficient of the j-th species in the i-th reaction. Since the
differential equations are usually handled by methods appropriate to stiff
equations, the partial derivatives of each of the above expressions with respect
to all the P species (Jacobian matrix) are needed as well. While each term in the
summation above rarely has more than three species in the product (i.e., most of
the ν_ij's are zero), the algebra involved in collecting all the sums and products
is so large and the labor (and the possibility of error) in the coding so great,
that this task is unlikely to be undertaken manually for any but small chemical
mechanisms. However, a mechanism for atmospheric or combustion chemistry may
easily run to several hundred reactions and species. Furthermore, in research
applications it is common to test several alternate models for the system under
study, and the amount of code to be written escalates greatly. Clearly a machine
aid is required to make the technique simple to use so that its exploitation is
encouraged.
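
To make the bookkeeping concrete, the following Python sketch evaluates mass-action derivatives and the analytic Jacobian directly from a small reaction table. It is an illustration only: the table layout, the function names, and the two-reaction mechanism are hypothetical, and the sign convention (reactants consumed, products formed, each weighted by its stoichiometric coefficient) is an assumption made for the example rather than a transcription of the interpreter's generated code.

```python
from math import prod

def rate_derivatives(reactions, conc):
    """Net d[species]/dt for every species.

    reactions: list of (k, reactants, products), where each side is a
    dict mapping a species name to its stoichiometric coefficient.
    """
    dcdt = {s: 0.0 for s in conc}
    for k, reactants, products in reactions:
        # Mass-action rate: k times the product of reactant concentrations.
        rate = k * prod(conc[s] ** nu for s, nu in reactants.items())
        for s, nu in reactants.items():
            dcdt[s] -= nu * rate          # removal terms
        for s, nu in products.items():
            dcdt[s] += nu * rate          # formation terms
    return dcdt

def jacobian(reactions, conc):
    """Partial derivatives d(dcdt[x])/d(conc[y]) as a nested dict jac[x][y]."""
    jac = {x: {y: 0.0 for y in conc} for x in conc}
    for k, reactants, products in reactions:
        for y, nu_y in reactants.items():
            # d(rate)/d[y]: lower the exponent of y by one, keep other factors.
            d_rate = k * nu_y * conc[y] ** (nu_y - 1) * prod(
                conc[s] ** nu for s, nu in reactants.items() if s != y)
            for s, nu in reactants.items():
                jac[s][y] -= nu * d_rate
            for s, nu in products.items():
                jac[s][y] += nu * d_rate
    return jac

# Hypothetical mechanism: A + B -> C (k = 2.0) and 2 C -> A (k = 0.5).
mechanism = [(2.0, {"A": 1, "B": 1}, {"C": 1}),
             (0.5, {"C": 2}, {"A": 1})]
state = {"A": 1.0, "B": 2.0, "C": 3.0}
dcdt = rate_derivatives(mechanism, state)
jac = jacobian(mechanism, state)
```

Even for two reactions the Jacobian has nine entries to get right, which gives some feel for why hand coding several hundred reactions is impractical.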
Over the years, various approaches to this problem have been taken. In one of the
earliest, each chemical species was assigned an identification number, chemical
equations rewritten in these terms, and the computer constructed a symbolic
reaction table which would subsequently be used in a lookup procedure to guide the
computation. In those years, the computation was slow and cumbersome, and the
additional overhead of the table lookup at each step of the iterative solution
process greatly increased the costs of the simulation. The next step was to use
this table lookup just once to write a routine which would be compiled as part of
the simulation program. Fortran was the major high-level language available; it
was necessary to use assembly language to write the Fortran code for the
simulation package (9).


The next advance was scanning and interpretation of the text of the original
reaction set, written in as close an approximation to real chemical notation as
the straight-line, upper-case only format of the 72-column card would permit (10).
Since character and string manipulation through Fortran was at that time
cumbersome and inefficient, these parts of the interpretation program were written
in assembly language. The interpreter used Fortran output statements to generate
the simulation code in assembly language, making the simulation more efficient,
but unfortunately non-portable.
The passage of time has brought vastly improved facilities for string and
character handling by high-level languages. Smaller machines have come into vogue,
and interactive operation has taken preference over batch systems. The growth in
problem size, however, has kept the mainframe machines in the picture, and has
even brought in the supercomputers, which are, at this writing, geared to batch
mode operation and mostly Fortran programming. Front-end machines, however, offer
a variety of interactive environments and programming languages. This suggests
that a better approach would be to separate the problem interpretation and solving
functions of a simulation system. This paper describes a new implementation of our
previous simulation package, in which each part is done on a machine and with a
language which are the most effective and appropriate.
The Bell Laboratories Central Computer Service supports a Cray-1 with Fortran as
the primary high-level language for compute-bound problems. This is accessed
through a number of front-end machines, mostly Vaxes operating under UNIX, which
support several high-level languages for interactive use. Because of the number of
different tasks the chemical interpreter is required to perform in addition to
elementary string manipulation, we chose C as the language in which to write it.
This offers a degree of portability, as C or C-like compilers are to be found on a
large number of machines and operating systems.
Input Language
Chemical notation is mostly the result of historic precedent, and was certainly
never intended to be interpreted by a computer. However, in order to maintain the
greatest ease of operation by a chemist, the input language should be designed to
be as close as possible to the normal notation for reaction equations. The basic
input record is a chemical equation; reactants on the left are separated from
products on the right by an arrow (→, read 'yields'), and are in turn separated
from each other by plus (+) signs. Substituting the equal sign (=) for the arrow
and the ampersand (&) for the plus results in a minimal sacrifice of readability
for the chemist, but eliminates ambiguities for the machine. Subsidiary fields for
identification numbers are separated from the reaction expression by tab
characters. Input is in free form, with embedded spaces ignored (except in text
expressions, see below).
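
The record layout just described can be illustrated with a short Python sketch. It is not the interpreter's C code: the function names are hypothetical, double quotes are assumed as the text delimiter, and the sketch assumes that "=" and "&" do not occur inside quoted text.

```python
def strip_spaces(text):
    """Remove blanks outside double-quoted text, which is kept as typed."""
    out, in_quotes = [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch != " " or in_quotes:
            out.append(ch)
    return "".join(out)

def parse_reaction(record):
    """Split one input record into (reactants, products, id_fields).

    The reaction expression comes first; any tab-separated fields after
    it are returned untouched as identification fields.
    """
    expression, *id_fields = record.split("\t")
    expression = strip_spaces(expression)
    if expression.count("=") != 1:
        raise ValueError("expected exactly one '=' in " + repr(expression))
    left, right = expression.split("=")
    reactants = [name for name in left.split("&") if name]
    products = [name for name in right.split("&") if name]
    if not reactants or not products:
        raise ValueError("reaction needs both reactants and products")
    return reactants, products, [field.strip() for field in id_fields]

# Because '+' and '-' are reserved for charges, ions cause no ambiguity.
parsed = parse_reaction("H+ & OH- = H2O\t12")
```

The example shows why the substitution of "&" for "+" matters: with a plus sign as separator, "H+ + OH-" could not be split without knowing the chemistry.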
Compounds are expressed by their symbolic formulas. The use of the full ASCII
character set allows the elements to be expressed by their usual one or two
character names, with the upper or lower case context providing the character
count. More elaborate designations for atoms (including superscripts denoting
isotope number, for example) are accommodated by enclosing the expression in
appropriate quotes. Upshift and downshift metacharacters can be used here to
denote appropriate character placement on output devices (such as plotters and
typesetters) which allow for partial line spacings; line printers would ignore
them.
Unfortunately, terminal input does not allow for the subscripts and superscripts
used by chemists. A rigid format is therefore enforced to distinguish subscripts
(indicating number of atoms) from superscripts (indicating valence state or
charge) by preceding the latter by a + or - sign. Charge indication can be by
appropriate repetition of the sign, or by a single sign followed by numerical
indication. Parenthesized expressions are accommodated and expanded in the usual
way, and nesting to several levels is allowed.
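
A formula scanner obeying these conventions might look like the Python sketch below. The grammar is an assumption reconstructed from the prose (element symbols of one upper-case letter plus an optional lower-case letter, digits as atom counts, a trailing run of signs or a sign followed by digits as the charge, nested parentheses); quoted atom designations and shift metacharacters are left out, and the function name is hypothetical.

```python
import re
from collections import Counter

# One token: an element with optional count, "(", or ")" with optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")
# Trailing charge: repeated sign ("--") or one sign plus digits ("+3").
CHARGE = re.compile(r"(\++|-+|[+-]\d+)$")

def parse_formula(formula):
    """Return (atom_counts, charge), e.g. 'SO4--' -> ({'S': 1, 'O': 4}, -2)."""
    charge = 0
    match = CHARGE.search(formula)
    if match:
        text = match.group(1)
        sign = 1 if text[0] == "+" else -1
        magnitude = int(text[1:]) if text[1:].isdigit() else len(text)
        charge = sign * magnitude
        formula = formula[:match.start()]

    stack = [Counter()]                   # one counter per open parenthesis level
    pos = 0
    while pos < len(formula):
        token = TOKEN.match(formula, pos)
        if token is None:
            raise ValueError(f"unexpected character at {pos} in {formula!r}")
        element, count, opening, closing, group_count = token.groups()
        if element:
            stack[-1][element] += int(count or 1)
        elif opening:
            stack.append(Counter())
        else:                             # closing parenthesis: fold group upward
            if len(stack) == 1:
                raise ValueError("unbalanced ')' in " + repr(formula))
            group = stack.pop()
            for atom, k in group.items():
                stack[-1][atom] += k * int(group_count or 1)
        pos = token.end()
    if len(stack) != 1:
        raise ValueError("unbalanced '(' in " + repr(formula))
    return dict(stack[0]), charge
```

The upper/lower case rule is what lets the scanner tell CO (carbon and oxygen) from Co (cobalt) without a table of element names.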
Where it is felt that clarity (for the chemist) is better served by using compound
names rather than formulas, text input is accepted by surrounding it with
quotation marks. This text is not subject to lexical analysis; subsidiary tasks
such as syntax checking cannot be performed in this case. Quoted text can also be
attached to a compound expressed by formula; the formula is interpreted and the
text passed through unchanged.
Syntax Analysis
As each reaction equation is entered, several checks are performed to catch errors
in formulation or typing: the correct number of tabs, equals, balanced quotes or
parentheses, and conformance to the syntax rules which allow the equation to be
separated into reactants and products, and these in turn to be decomposed into
atoms. Each time a new chemical species is encountered it is reported to the user,
who can determine whether a valid name has been entered. The equation is checked
for balances in atomic elements and charges, and discrepancies listed for
corrective action. Species names may be either formulas or text; in the latter
case the balance checking feature must be turned off to avoid false errors.
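
The balance check can be sketched as a comparison of atom and charge totals on the two sides of an equation. In this Python illustration each species is assumed to have been decomposed already into a pair (atom counts, charge); the function names and the form of the discrepancy report are hypothetical.

```python
from collections import Counter

def side_totals(species):
    """Sum atoms and charge over a list of (atom_counts, charge) pairs."""
    atoms, charge = Counter(), 0
    for counts, q in species:
        atoms.update(counts)              # Counter.update adds the counts
        charge += q
    return atoms, charge

def balance_discrepancies(reactants, products):
    """Return {item: products minus reactants} for each unbalanced item.

    An empty dict means the equation balances in every element and in
    charge; the key "charge" reports a net charge mismatch.
    """
    left_atoms, left_q = side_totals(reactants)
    right_atoms, right_q = side_totals(products)
    report = {el: right_atoms[el] - left_atoms[el]
              for el in sorted(set(left_atoms) | set(right_atoms))
              if right_atoms[el] != left_atoms[el]}
    if right_q != left_q:
        report["charge"] = right_q - left_q
    return report

# H+ & OH- = H2O balances; O3 & NO = NO2 has lost two oxygen atoms.
ok = balance_discrepancies([({"H": 1}, 1), ({"O": 1, "H": 1}, -1)],
                           [({"H": 2, "O": 1}, 0)])
bad = balance_discrepancies([({"O": 3}, 0), ({"N": 1, "O": 1}, 0)],
                            [({"N": 1, "O": 2}, 0)])
```

A species entered as quoted text has no composition to sum, which is why the check has to be switched off when names are used in place of formulas.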
When the input list is finished, the interpreter checks the equations against each
other, to ascertain that a reaction has not inadvertently been entered more than
once, even with permutations of the reactants or products. Finally, if all input
has been error-free, the interpreter continues by lexically sorting the atomic
elements and species names, assigning final identification numbers, and providing
a list for the user.
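
One way to catch repeated reactions regardless of the order in which participants were typed is to compare a canonical, sorted form of each equation. The Python sketch below shows that idea together with identification numbers assigned from a lexical sort; the function names are hypothetical, and whether the interpreter uses this exact technique is not stated in the text.

```python
def canonical(reactants, products):
    """Order-independent key: sorted reactants and sorted products."""
    return tuple(sorted(reactants)), tuple(sorted(products))

def find_duplicates(reactions):
    """Return (first_index, repeat_index) pairs for repeated reactions."""
    seen, duplicates = {}, []
    for index, (reactants, products) in enumerate(reactions):
        key = canonical(reactants, products)
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates

def assign_ids(reactions):
    """Number species 1, 2, ... in lexical (character code) order."""
    names = {name for reactants, products in reactions
             for name in reactants + products}
    return {name: number for number, name in enumerate(sorted(names), start=1)}

reaction_list = [(["O3", "NO"], ["NO2", "O2"]),
                 (["NO2"], ["NO", "O"]),
                 (["NO", "O3"], ["O2", "NO2"])]   # same as the first, permuted
duplicates = find_duplicates(reaction_list)
species_ids = assign_ids(reaction_list)
```

Sorting keeps repeated species in the key, so A & A = B is not confused with A = B.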
Should an error be encountered during input, the interpreter will not complete its
task. However, it does copy all input to a file from which it may be retrieved,
edited and resubmitted in batch mode, making it unnecessary to retype all the
equations.
Data Structures
Scanning of the reaction input leads to the generation of three types of
data structure (in the C sense (11)), one dealing with reactions, one with
chemical species, and the last with chemical elements. Each reaction
structure contains the appropriate identification numbers, and a symbolic
representation of the reaction itself in terms of pointers to the chemical
species structures of the reactants and products. The chemical species
structures in turn contain pointers to chemical element structures, as
well as text strings to be used in printed, plotted, or typeset output.
The chemical element structures similarly contain identifying numerical
and text information.

Storage for these structures is allocated as needed, and they are chained
to each other by pointers. As each participant in a reaction is examined,
the database is searched and a structure created for any newly encountered
species. Since the number and size of these searches may become quite
large for extensive reaction mechanisms, the species structures are
ordered lexically in a binary tree to keep the search time to a minimum.
After input is complete, final identification numbers are assigned to the
elements and species according to a lexical sort.
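
As a rough illustration of the lexically ordered binary tree, the following Python sketch stores species records in an unbalanced search tree and numbers them by an in-order traversal once input is complete. The class and field names are hypothetical, and a plain dictionary of atom counts stands in for the pointers to element structures that the C program uses.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Species:
    name: str
    composition: Dict[str, int]          # element symbol -> atom count
    ident: int = 0                       # final number, set after input ends
    left: Optional["Species"] = None     # subtree of lexically smaller names
    right: Optional["Species"] = None    # subtree of lexically larger names


class SpeciesTree:
    def __init__(self):
        self.root = None

    def find_or_add(self, name, composition):
        """Return the record for `name`, creating it if it is new."""
        if self.root is None:
            self.root = Species(name, composition)
            return self.root
        node = self.root
        while True:
            if name == node.name:
                return node
            branch = "left" if name < node.name else "right"
            child = getattr(node, branch)
            if child is None:
                child = Species(name, composition)
                setattr(node, branch, child)
                return child
            node = child

    def in_order(self):
        """Yield the records in lexical order of their names."""
        stack, node = [], self.root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node
            node = node.right

    def assign_idents(self):
        """Give each species its final number according to the lexical sort."""
        for number, species in enumerate(self.in_order(), start=1):
            species.ident = number


tree = SpeciesTree()
names = ["OH", "H2", "O2", "H", "O"]
nodes = [tree.find_or_add(name, {}) for name in names]
tree.assign_idents()
print([(s.ident, s.name) for s in tree.in_order()])
```

The tree is not rebalanced here, so the search time is only logarithmic when species arrive in a reasonably mixed order.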

Program Output
The principal task of the interpreter is to provide two subroutines for
use by the simulation program, which is a solver for ordinary or partial
differential equations. Since chemical systems are for the most part
"stiff" as a result of negative feedback (12), the interpreter expects the
simulation package to use an implicit differential equation solver
requiring calculation of both the function (i.e. the mass-action
expression of the net rate of change of the species) and its partial
derivatives with respect to all species (Jacobian matrix). The former is
computed stepwise: first the rate of each reaction is calculated; then
these terms are combined to give individual formation and removal terms
for each species, and finally these are algebraically added to give the
derivatives. This strategy makes available to the user additional
information that is often helpful in interpreting the mechanism. The
Jacobian terms are calculated in one step. Fortran code for these
subroutines is written using direct addresses for each member of the
appropriate arrays, it being assumed that these are stored in the same
order as that provided by the lexical sort above. The Fortran compiler is
thus burdened with the task of calculating the variable addresses,
relieving the simulation program of this task and so improving the
run-time economy.
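
The stepwise evaluation of the function and the one-step evaluation of the Jacobian can be sketched in Python as follows. This is only an illustration under stated assumptions: the interpreter itself emits Fortran with direct array addresses, whereas this sketch uses dictionaries keyed by species name, and the two-reaction mechanism, rate constants, and concentrations are invented for the example.

```python
def net_rates(reactions, conc):
    """reactions: list of (k, reactants, products); conc: {species: value}."""
    # Step 1: mass-action rate of each reaction.
    rates = []
    for k, reactants, _ in reactions:
        rate = k
        for s in reactants:
            rate *= conc[s]
        rates.append(rate)
    # Step 2: separate formation and removal terms for each species.
    formation = {s: 0.0 for s in conc}
    removal = {s: 0.0 for s in conc}
    for rate, (_, reactants, products) in zip(rates, reactions):
        for s in reactants:
            removal[s] += rate
        for s in products:
            formation[s] += rate
    # Step 3: algebraic sum gives the time derivative of each species.
    derivative = {s: formation[s] - removal[s] for s in conc}
    return rates, formation, removal, derivative


def jacobian(reactions, conc, species):
    """jac[i][j] = d(derivative of species i) / d(concentration of species j)."""
    index = {s: i for i, s in enumerate(species)}
    jac = [[0.0] * len(species) for _ in species]
    for k, reactants, products in reactions:
        for pos, s_j in enumerate(reactants):
            # Product rule: differentiate with respect to one occurrence of s_j.
            partial = k
            for other, s in enumerate(reactants):
                if other != pos:
                    partial *= conc[s]
            for s in reactants:
                jac[index[s]][index[s_j]] -= partial
            for s in products:
                jac[index[s]][index[s_j]] += partial
    return jac


# Hypothetical mechanism: A + B -> C (k = 2.0) and C -> A + B (k = 0.5).
reactions = [(2.0, ["A", "B"], ["C"]), (0.5, ["C"], ["A", "B"])]
conc = {"A": 1.0, "B": 3.0, "C": 4.0}
rates, formation, removal, derivative = net_rates(reactions, conc)
jac = jacobian(reactions, conc, ["A", "B", "C"])
print(rates, derivative)
```

Keeping the formation and removal terms separate, rather than accumulating the derivative directly, is what makes the extra diagnostic information available to the user.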
It is also possible to use the information which has been stored to write
programs for other tasks. A useful one, for example, keeps track of the
stoichiometry (i.e. total atom counts) of the system. For a closed system,
stoichiometry should be automatically maintained by linear
predictor-corrector solvers, and the stoichiometry program provides a
diagnostic of numerical errors (and others) which have accumulated. In
other than closed systems, it gives an independent check on the sources
and sinks which are being modeled.
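
A stoichiometry check of this kind might look like the following Python sketch. The species compositions and concentrations are hypothetical, and the function names are not taken from the original program; the point is only that the total count of each element is a weighted sum over species and should not drift in a closed system.

```python
def atom_totals(composition, conc):
    """Total amount of each element: sum of atoms per species times concentration."""
    totals = {}
    for species, atoms in composition.items():
        for element, count in atoms.items():
            totals[element] = totals.get(element, 0.0) + count * conc[species]
    return totals


def max_drift(composition, start, later):
    """Largest absolute change in any element total between two states."""
    before = atom_totals(composition, start)
    after = atom_totals(composition, later)
    return max(abs(after[element] - before[element]) for element in before)


# Hypothetical closed system: 2 H2 + O2 -> 2 H2O, sampled at two times.
composition = {"H2": {"H": 2}, "O2": {"O": 2}, "H2O": {"H": 2, "O": 1}}
start = {"H2": 2.0, "O2": 1.0, "H2O": 0.0}
later = {"H2": 0.5, "O2": 0.25, "H2O": 1.5}
print(atom_totals(composition, start), max_drift(composition, start, later))
```

A nonzero drift in a closed system points to accumulated numerical error or to a mistake in the mechanism.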
Various databases can also be output by the interpreter, e.g. lists of
element and species names, text files for labeling printed and plotted
output, and a symbolic reaction matrix. This information is distributed to
individual ASCII files, from which it may be read by subsequent parts of
the simulation package for use in the appropriate task.
Adaptability
The interpreter was designed to be independent of the simulation program
which it eventually will serve. A structured programming language such as
C is therefore ideal for coding it, since it is simple to add code to
perform additional tasks (as for example writing the variable dimension
specifications) which might be specific to the application. The use of
structure data types also allows the expansion of the type of
supplementary information carried along with each variable with little
additional coding effort and with no danger of breaking the already
existing code. Communication of information from the interpreter to the
simulation program is through individual files of information, which can
be input to subsequent programs and stored to be used as needed. The
simulation system is thus freed from dependencies on the operating system
environment.
Conclusion
The interactive interpretation of chemical equations in conjunction with
the simulation of chemical reaction systems has been implemented by a
C-language program which can be run on a small machine that is independent
of the machine on which the simulation program resides. The flexible
string and character manipulation capabilities of this environment enable
the chemist to use an input language similar to the natural language of
chemical kinetics, and the program checks the syntax and consistency of
the input as well. The interpreter provides verified code for any
simulation program using standard differential equation solvers, and also
facilitates the display of the results in chemical notation. These
interpreters have been used successfully for many years and have fostered
the growth of simulation techniques in many areas of chemistry and
chemical engineering.


Literature Cited

1. D. Edelson, Science, 214, 981 (1981).
2. H. G. Booker et al., Environmental Impact of Stratospheric Flight,
   National Academy of Sciences, Washington, D.C., 1975.
3. H. S. Gutowsky et al., Halocarbons: Effects on Stratospheric Ozone,
   National Academy of Sciences, Washington, D.C., 1976.
4. J. H. Seinfeld, Air Pollution; Physical and Chemical Fundamentals,
   McGraw-Hill, New York, 1975.
5. D. D. Warner, J. Phys. Chem., 81, 2329 (1977).
6. N. L. Schryer, J. Phys. Chem., 81, 2335 (1977); and references cited
   therein.
7. D. Edelson and N. L. Schryer, Computers and Chemistry, 2, 71 (1978).
8. G. A. Nikolakopoulou, D. Edelson and N. L. Schryer, Computers and
   Chemistry, 6, 93 (1982).
9. D. McIntyre, in A Technique for Solving the General Reaction-Rate
   Equations in the Atmosphere, Appendix B (T. J. Keneshea, author);
   AFCRL-67-0221, April 1967.
10. D. Edelson, Computers and Chemistry, 1, 29 (1976).
11. B. W. Kernighan and D. M. Ritchie, The C Programming Language,
    Chapt. 6; Prentice-Hall, Englewood Cliffs, N.J., 1978.
12. C. F. Curtiss and J. O. Hirschfelder, Proc. Natl. Acad. Sci. U.S.A.,
    38, 235 (1952).

RECEIVED December 17, 1985

11
Applying the Techniques of Artificial Intelligence
to Chemistry Education
Richard Cornelius¹, Daniel Cabrol², and Claude Cachet²
¹Department of Chemistry, Lebanon Valley College, Annville, PA 17003
²Department of Chemistry, Université de Nice, 06034 Nice, France
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch011

The computer program called GEORGE is a "problem-solving partner" for
introductory chemistry students. The program has no problems to present
to students; students give problems to GEORGE and he solves the
problems. He explains the solution using ordinary English and then
sketches a diagram to show how data are combined and relations are
applied to give the solution. GEORGE operates on problems involving
three fundamental quantities, mass, volume, and number of moles, and
other quantities that can be expressed as ratios of these fundamental
quantities.

The power of the computer holds the promise for far-reaching changes in
education, but that promise remains unrealized. Most of the applications
of computers in chemical education have been adaptations of teaching
strategies used in other media; there are many tasks that have been done
better or faster on the computer but little really new has been developed.
There is a quote that summarizes the situation in which we find ourselves
today: "After years of growing wildly the field of [educational] computing
is finally approaching its infancy." This quote is nearly twenty years
old, having been taken from the report of the 1967 President's Science
Advisory Commission (1). The quote, however, is as true today as it was
nearly twenty years ago. We stand on the threshold of exciting new
applications for computers both within the field of education and
elsewhere. The subject of this paper is a computer program which
represents one totally different approach for the use of computers in
chemical education. We hope that it is only one new approach out of many
that we will see in the future.

One of the most important advantages of computers in education is the
capacity of software to adjust the pace or nature of activities on the
basis of input from the student. Tutorial or drill and practice programs
available today do in fact make some adjustments based upon student
responses. These programs are limited, however, by the ingenuity of the
person who wrote the software in considering all possible student
responses and in designing appropriate action on the part of the software.
A student cannot explore areas which the author of the software failed to
consider. Thus, these programs are "instructor-driven." The author of the
software serves as a surrogate instructor, creating a particular sequence
of activities for the student. However sophisticated the branching in the
program may be, the student cannot take the initiative; initiative is
exercised only by the author of the software.

0097-6156/86/0306-0125$06.00/0
© 1986 American Chemical Society

It is useful to identify two software categories distinguished by the
identity of the person in charge of the educational activities that the
software supports. The first category is Computer-Assisted Instruction
(CAI). In CAI the role of the software is to decide which activities the
student should pursue. Most existing software for chemical education falls
into this category. We may also, however, consider a category of software
that could be labeled Computer-Assisted Learning (CAL). In such software,
the student makes decisions about what he or she will investigate while
using the software. Simulations fall into this category. Professor John
Gelder's ideal gas law program (2) is a classic example of using
simulations in chemical education. In using that program the student has
control over the parameters, and by exploring the model could potentially
learn aspects of the behavior of ideal gases unknown to the author of the
program. Other simulations may also fall into the category of
computer-assisted learning. Apart from simulations, examples of software
with which the student is in control are difficult to find.

This paper describes an example of a different style of program which is
under the control of the student. The project began in the fall of 1983
when Dick Cornelius spent part of a sabbatical at the Université de Nice
working with Daniel Cabrol and Claude Cachet. The first task there was to
write a chapter on microcomputers in chemical education for a book on
computers in chemistry. During the course of writing this chapter we
described programs available in the different software styles: page
turners, drill and practice, tutorial dialogs, simulation, pre-laboratory
activities, and problem-solving. In the area of problem-solving, however,
there was little that we could discuss. Some software could be used for
problem-solving, but there were no examples of programs written for the
primary purpose of helping students learn general problem-solving
techniques. It was to this area, then, that we turned our programming
attention. The result was a program that we called GEORGE (3) that runs on
the Apple II series of computers. GEORGE differs very much from most
programs available for chemical education: GEORGE asks no questions of
students. Instead, students take problems to GEORGE. GEORGE solves the
problems that students provide and, most importantly, explains the
solutions using both text and diagrams. If insufficient or contradictory
information is available, GEORGE can provide diagnostic comments to help
the student.

The domain in which GEORGE operates is a small but important one for
introductory chemistry. He works with problems involving the fundamental
quantities mass, volume, and number of moles. He can also work with
derived quantities such as density, molar mass, molar concentration, etc.
The specific quantities with which GEORGE can work are presented in
Figure 1. That figure is taken from the on-screen documentation and also
gives the abbreviations that students may use as shorthand to identify the
quantities to GEORGE. GEORGE works with the units g, L, and mol. For
derived quantities he understands the ratios of these units such as g/L
for density. He also understands the numerical prefixes p, n, μ, m, c, d,
and k. He can work with these prefixes in ratios of units such as g/mL or
nmol/mL, and he can also accept dm³ or cm³ for volume.
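
The handling of prefixes and unit ratios can be sketched as follows. This is a minimal Python illustration, not GEORGE's Apple II code: the table and function names are hypothetical, the cubic units are written "dm3" and "cm3" for ease of typing, and only a simple unit or one ratio of two units is parsed. The error message for inconsistent units is the one the chapter quotes from GEORGE.

```python
PREFIXES = {"": 1.0, "p": 1e-12, "n": 1e-9, "μ": 1e-6, "m": 1e-3,
            "c": 1e-2, "d": 1e-1, "k": 1e3}
BASES = ("mol", "g", "L")              # the internal units named in the text
CUBIC = {"dm3": 1.0, "cm3": 1e-3}      # cubic lengths expressed in litres


def simple_unit(unit):
    """Return (factor to the internal unit, internal unit) for 'mg', 'nmol', ..."""
    if unit in CUBIC:
        return CUBIC[unit], "L"
    for base in BASES:
        if unit.endswith(base) and unit[:-len(base)] in PREFIXES:
            return PREFIXES[unit[:-len(base)]], base
    raise ValueError("Unknown unit: " + unit)


def parse_unit(unit):
    """Return (factor, dimension) for a simple unit or a ratio such as 'g/mL'."""
    factor, bases = 1.0, []
    for position, part in enumerate(unit.split("/")):
        part_factor, base = simple_unit(part)
        factor = factor * part_factor if position == 0 else factor / part_factor
        bases.append(base)
    return factor, "/".join(bases)


def convert(value, from_unit, to_unit):
    """Convert a value between two units of the same dimension."""
    from_factor, from_dim = parse_unit(from_unit)
    to_factor, to_dim = parse_unit(to_unit)
    if from_dim != to_dim:
        raise ValueError("Unit does not agree with quantity.")
    return value * from_factor / to_factor


print(round(convert(0.0034, "g", "mg"), 6))      # mass of a hair, g to mg
print(round(convert(1.022, "g/mL", "g/L"), 6))   # a density, g/mL to g/L
```

Converting every input to g, L, and mol first means that the later search only ever has to compare quantities in one set of units.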

The Logic

The basic approach that GEORGE uses to solve problems is dimensional
analysis, the same technique that many of us use in our own classrooms to
teach students how to solve problems. Instead of having numerous formulas
for different kinds of problems, GEORGE simply contains a set of heuristic
rules which he follows to search for a solution. One result of using these
heuristic rules is that he can solve problems never worked by the authors
of the program. Another result is that GEORGE may be able to make progress
toward a solution even if incomplete information is available. In such an
instance, GEORGE may be able to respond with a statement such as "If you
could give me the density of alcohol, then I could solve the problem."

The rules are very simple in concept. First GEORGE examines the various
pieces of data available. He examines all possible pairs of data to see
whether any pair can be multiplied or divided to give immediately the
solution. If he cannot find a solution in that way, he checks to see
whether he can apply a relation to generate a new piece of data. If GEORGE
cannot apply a relation, he searches for intermediate results that might
represent a step toward the solution. GEORGE can search for two types of
intermediates. The preferred type is the result of units cancelling to
yield a fundamental quantity. Thus dividing the mass of a substance by its
molar mass is a preferred method to form an intermediate result. Less
desirable is the formation of an intermediate result which is not a
fundamental quantity but which represents information expressed in a
manner not represented by other data or intermediates. Each time GEORGE
calculates a new quantity, he begins again to look for an immediate
solution. These are all the rules that GEORGE needs to find solutions to
millions of different problem statements. The result is usually a solution
approached in the same manner that a teacher might use for an explanation.
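
A much simplified version of this search can be written in a few lines of Python. The sketch below is an assumption-laden illustration, not GEORGE's actual rule set: it represents each quantity as a value plus a dictionary of exponents keyed by (unit, label), it implements only the pair test and the preferred type of intermediate (one that is a fundamental quantity), and it omits relations and the less desirable intermediates. The NaCl data are invented for the example, and all values are assumed to be nonzero.

```python
from itertools import permutations


def combine(a, b, sign):
    """Dimensions of a*b (sign=+1) or a/b (sign=-1); cancelled units are dropped."""
    dims = dict(a)
    for key, power in b.items():
        dims[key] = dims.get(key, 0) + sign * power
        if dims[key] == 0:
            del dims[key]
    return dims


def solve(data, target):
    """data: {name: (value, dims)}. Returns (value, steps), or None if stuck."""
    known = dict(data)
    steps = []
    while True:
        candidate = None
        pairs = permutations(known.items(), 2)
        for (name_a, (val_a, dim_a)), (name_b, (val_b, dim_b)) in pairs:
            for sign, symbol in ((1, "*"), (-1, "/")):
                dims = combine(dim_a, dim_b, sign)
                value = val_a * val_b if sign == 1 else val_a / val_b
                step = f"({name_a} {symbol} {name_b})"
                if dims == target:               # immediate solution found
                    return value, steps + [step]
                is_new = all(dims != d for _, d in known.values())
                # Preferred intermediate: units cancel to one fundamental quantity.
                if candidate is None and list(dims.values()) == [1] and is_new:
                    candidate = (step, value, dims)
        if candidate is None:
            return None                          # no progress is possible
        step, value, dims = candidate
        known[step] = (value, dims)              # then look again for a solution
        steps.append(step)


# Hypothetical problem: molarity of NaCl from a mass, a molar mass, and a volume.
data = {
    "mass of NaCl": (5.844, {("g", "NaCl"): 1}),
    "molar mass of NaCl": (58.44, {("g", "NaCl"): 1, ("mol", "NaCl"): -1}),
    "volume of solution": (0.500, {("L", "solution"): 1}),
}
target = {("mol", "NaCl"): 1, ("L", "solution"): -1}
answer, steps = solve(data, target)
print(round(answer, 4), steps)
```

Because the labels are part of the dimension keys, grams of one substance never cancel against grams of another, which is the role the chapter assigns to labels in the dimensional analysis.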

The Program

Instructions (page 1)

George understands 11 different
quantities. Each of these quantities
has a symbol and a name:

Symbol   Name
m        mass
n        no. of moles
p        no. of particles
v        volume
d        density
c        molarity
mc       mass conc.
M        molar mass
mr       mass ratio
nr       molar ratio
vr       volume ratio

Press the space bar to continue.

Figure 1. Quantities with which GEORGE can work. (Reproduced with
permission from Ref. 3. Copyright 1985 COMPress.)

Available Options

D) Enter or Modify Data
R) Enter a Relation
L) Load a Problem from Disk
C) Calculate Molar Mass
?) See Instructions

Press the key of your choice.

Figure 2. The primary menu screen from GEORGE. (Reproduced with
permission from Ref. 3. Copyright 1985 COMPress.)

The primary menu for GEORGE is shown in Figure 2. This is the way that the
menu appears when no information has been given to GEORGE; more options
are available after a problem has been defined. To understand how GEORGE
operates we will first consider an option that is outside the primary
thrust of GEORGE, namely, option C, Calculate Molar Mass. The student sees
a screen which says "Type the formula:" There a student may type a formula
as simple as NaCl or more complex such as Mg(ClO4)2·6H2O. GEORGE
calculates the molar mass, explaining to the student how the calculation
is done as shown in Figure 3. The ability of GEORGE to explain what he has
done is the primary reason for his existence. Other programs, such as TK!
Solver (4), have a much larger domain but fail to explain the logic that
leads to the answer. The emphasis within GEORGE is on the solution as a
process rather than upon the answer as a number.
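
The molar-mass option can be imitated with a short Python sketch. The parser below is a hypothetical illustration, not GEORGE's code: it accepts element symbols, integer subscripts, one level or more of parentheses, and hydrate parts separated by "·". The atomic masses for H, O, Mg, and Cl are the ones that appear on the screen in Figure 3; the value for Na is an assumed standard value added so that NaCl can be tried.

```python
import re

# g/mol; H, O, Mg, Cl as shown in Figure 3, Na is an assumed value.
ATOMIC_MASS = {"H": 1.0079, "O": 15.9994, "Mg": 24.305, "Cl": 35.453,
               "Na": 22.98977}


def count_atoms(formula):
    """Count atoms in a formula with parentheses, e.g. 'Mg(ClO4)2'."""
    tokens = re.findall(r"[A-Z][a-z]?|\(|\)|\d+", formula)
    stack = [{}]
    i = 0
    while i < len(tokens):
        token = tokens[i]
        i += 1
        if token == "(":
            stack.append({})            # start collecting a parenthesized group
            continue
        group = stack.pop() if token == ")" else {token: 1}
        multiplier = 1
        if i < len(tokens) and tokens[i].isdigit():
            multiplier = int(tokens[i])  # subscript after an element or group
            i += 1
        for element, n in group.items():
            stack[-1][element] = stack[-1].get(element, 0) + n * multiplier
    return stack[0]


def molar_mass(formula):
    """Molar mass in g/mol; hydrate parts are separated by '·' (e.g. '·6H2O')."""
    total = 0.0
    for part in formula.split("·"):
        coefficient, body = re.match(r"(\d*)(.*)", part).groups()
        atoms = count_atoms(body)
        subtotal = sum(n * ATOMIC_MASS[element] for element, n in atoms.items())
        total += int(coefficient or 1) * subtotal
    return total


print(round(molar_mass("Mg(ClO4)2·6H2O"), 3))
print(round(molar_mass("NaCl"), 3))
```

GEORGE additionally reports each subunit total so that the student can follow the arithmetic; this sketch returns only the final sum.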
Let us consider next the data page which is used to define a problem. The
first task for the student is to identify the desired quantity. This
action is a desirable first step for a student solving a problem with or
without the aid of a computer. GEORGE understands his domain. If a student
identifies "time" as the desired quantity, GEORGE will respond "Unknown
quantity." Each quantity needs a label. So, for example, a student may
tell GEORGE to find the mass of a hair. Here "hair" is the label and is
used by GEORGE in the dimensional analysis to determine which data can be
used together. Students must specify a consistent unit. If mol is given as
the unit for the mass of a hair, GEORGE will reply "Unit does not agree
with quantity."

The simplest kind of problem that GEORGE can work and explain is a metric
unit conversion. For example, if GEORGE is asked to find the mass of a
hair in milligrams, the student could supply on data line A the mass of
that hair in grams. The numerical values may be entered in decimal or
exponential notation with the exponential notation appearing with a
superscript just as one would write it on paper. When GEORGE is asked to
solve this problem he states that the answer was supplied in data line A.
This statement is true, but a student working a unit conversion problem
needs to have a better explanation. GEORGE displays the worked arithmetic
showing the unit conversion. An example of such a display is shown in
Figure 4.

As an example of a slightly more difficult problem, consider a question
which asks for the density of ethanol in g/mL. A student might provide
GEORGE with the mass of a particular sample of ethanol. The mass could be,
for example, 25 grams. If a student tells GEORGE to solve the problem with
only this piece of information, he will quickly reply that he cannot solve
the problem without some information related to the volume of ethanol. If
the student then supplies the volume of ethanol, GEORGE explains in plain
English how to get the answer: "Solution found by dividing the mass of
ethanol by the volume of ethanol to give the density of ethanol." GEORGE
works internally with the units g, L, and mol. Thus, after he completes
the calculation it is necessary for him to convert the answer to the units
requested when the desired quantity was specified. Although GEORGE has
provided a textual explanation of the solution process, it may be helpful
for the student to see a diagram of how the pieces of information fit
together. The diagram for the problem involving the density of ethanol is
shown in Figure 5. The symbols A and B in this diagram refer to the lines
on the data page and are further identified to the student when the
letters A and B are pressed on the keyboard.

Calculate Molar Mass

The molar mass is 331.297 g/mol.

Mg(ClO4)2·6H2O

Cl:       1 × 35.453
O:        4 × 15.9994
Subunit:  99.4506 × 2  =  198.9012
H:        2 × 1.0079
O:        1 × 15.9994
Subunit:  18.0152 × 6  =  108.0912
Mg:       1 × 24.305   =   24.305

Total:                    331.297

Press the space bar to continue.

Figure 3. Screen explaining the calculation of molar mass. (Reproduced
with permission from Ref. 3. Copyright 1985 COMPress.)

Network

You supplied the result on data line A.
You need only make the proper conversion
of units:

.0034 g × (1000 mg / 1 g) = 3.40 mg

Press the space bar to continue.

Figure 4. The display that GEORGE uses to show the solution to a unit
conversion problem. (Reproduced with permission from Ref. 3. Copyright
1985 COMPress.)

A student can save problems on the disk for later use and the disk is
initially supplied with a set of complete problems. One example which
comes on the disk involves calculating the molarity of aniline in
solution. The available data are:

A. Volume of C6H5NH2        3.00 mL
B. Volume of solution       0.100 L
C. Density of C6H5NH2       1.022 g/mL
D. Molar Mass of C6H5NH2

The student does not need to supply the numerical value for the molar mass
of aniline if the formula of aniline is used as a label. GEORGE can do the
calculation of the molar mass from the formula, but the student must
specify that the molar mass is a piece of information to be used in the
problem. When GEORGE is told to solve this problem, he first finds an
intermediate result by multiplying the density of aniline by the volume of
aniline to give the mass of aniline. He then divides the mass of aniline
by the molar mass to give the number of moles of aniline in solution.
Finally, he divides the number of moles of aniline by the volume of
solution to get the desired quantity, the molarity of aniline. The diagram
he draws to explain this solution is shown in Figure 6. In this figure the
letters represent information found on the data page, while the numbers
represent intermediates calculated while finding the solution. Pressing
one of the letters or numbers shown brings to the top of the screen an
identification of that particular quantity.
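
The three steps of the aniline solution can be followed numerically with the short Python sketch below. The data values are those listed on lines A to C; the molar mass is not given in the text, so it is computed here from assumed standard atomic masses for C, H, and N, and the function name is hypothetical.

```python
# Atomic masses in g/mol; assumed standard values, not taken from the chapter.
ATOMIC_MASS = {"C": 12.011, "H": 1.0079, "N": 14.0067}


def aniline_molarity(volume_aniline_mL, density_g_per_mL, volume_solution_L):
    """Follow the three steps described in the text for C6H5NH2."""
    molar_mass = (6 * ATOMIC_MASS["C"] + 7 * ATOMIC_MASS["H"]
                  + ATOMIC_MASS["N"])               # line D, from the formula
    mass = density_g_per_mL * volume_aniline_mL     # step 1: C times A, in g
    moles = mass / molar_mass                       # step 2: divide by D, in mol
    molarity = moles / volume_solution_L            # step 3: divide by B, in mol/L
    return mass, moles, molarity


mass, moles, molarity = aniline_molarity(3.00, 1.022, 0.100)
print(f"mass = {mass:.3f} g, moles = {moles:.4f} mol, "
      f"molarity = {molarity:.3f} mol/L")
```

With these inputs the intermediate mass is about 3.07 g and the molarity comes out near 0.33 mol/L.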
The d a t a page i s not the o n l y way t o p r o v i d e GEORGE w i t h
information. C o n s i d e r f o r example a s i m p l e a c i d - b a s e t i t r a t i o n . On
the d a t a page a s t u d e n t c o u l d s p e c i f y the d e s i r e d q u a n t i t y as the
m o l a r i t y o f HC1 i n a c i d s o l u t i o n and g i v e as a v a i l a b l e d a t a the
m o l a r i t y o f NaOH i n base s o l u t i o n , the volume o f base s o l u t i o n and
the volume o f a c i d s o l u t i o n . T h i s i n f o r m a t i o n i s i n s u f f i c i e n t t o
s u p p o r t a s o l u t i o n t o the p r o b l e m . The s t u d e n t must a l s o s p e c i f y the
s t o i c h i o m e t r i c r e l a t i o n between the number o f moles o f HC1 and the
number o f moles o f NaOH. An example o f a r e l a t i o n page showing t h i s
d e f i n i t i o n i s shown i n F i g u r e 7 . Once t h i s i n f o r m a t i o n i s a v a i l a b l e ,
GEORGE can s o l v e the problem, e x p l a i n i n g as he works what i n f o r m a t i o n
i s combined t o f i n d an i n t e r m e d i a t e r e s u l t and a t what p o i n t the
r e l a t i o n i s used. The use o f r e l a t i o n s g r e a t l y i n c r e a s e s the number
o f d i f f e r e n t k i n d s o f problems t h a t GEORGE can h a n d l e .

As an example of a more complex problem, consider one used by
Johnstone (5) in an article discussing problem-solving published
last year in the Journal of Chemical Education: "What volume of
1.0 M hydrochloric acid would react with exactly 10.0 g of chalk?"
The desired quantity is the volume of solution. The available data
are the molarity of HCl in solution, the mass of chalk, and the molar
mass of CaCO3. In addition, two relations are required. One
identifies chalk as calcium carbonate by stating that the mass of
chalk equals the mass of CaCO3. Another gives the stoichiometry
by saying that two times the number of moles of CaCO3 equals the
number of moles of HCl. The diagram showing the solution in this
case occupies several screens; a separate screen is used to show each
application of a relation.
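
The chain of reasoning for the chalk problem can be written out step by step. This is a sketch in Python under one stated assumption: the text lists the molar mass of CaCO3 as available data but does not give its value, so the standard value of 100.09 g/mol is used here. The variable names are illustrative.

```python
mass_chalk = 10.0           # g, from the problem statement
molarity_hcl = 1.0          # mol/L, from the problem statement
molar_mass_caco3 = 100.09   # g/mol, standard value (assumed; not given in the text)

mass_caco3 = mass_chalk                       # relation 1: chalk is calcium carbonate
moles_caco3 = mass_caco3 / molar_mass_caco3   # intermediate result
moles_hcl = 2 * moles_caco3                   # relation 2: 2 x n(CaCO3) = n(HCl)
volume_hcl = moles_hcl / molarity_hcl         # desired quantity, in litres

print(f"volume of HCl solution = {volume_hcl:.3f} L")
```

Each use of a relation corresponds to one of the separate screens mentioned above.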


[Screen display titled "Network": "Here is a diagram of how I used the
various pieces of information to reach a solution. For details type
the relevant letter or number. ESC displays Menu."]

Figure 5. Diagram showing that the density of ethanol is obtained
by dividing the mass of ethanol by the volume of ethanol. (Reproduced
with permission from Ref. 3. Copyright 1985 COMPress.)

[Screen display titled "Network": "Here is a diagram of how I used the
various pieces of information to reach a solution. For details type
the relevant letter or number. ESC displays Menu."]

Figure 6. Diagram showing how pieces of data are used together to
find the solution to the problem involving the molarity of
aniline. (Reproduced with permission from Ref. 3. Copyright 1985
COMPress.)

Relation

Coef.   Quantity
        no. of moles of HCl
   =    no. of moles of NaOH

Press the space bar to continue.

Figure 7. A sample of how a relation can be defined. (Reproduced
with permission from Ref. 3. Copyright 1985 COMPress.)


Extending the Domain

Treating a greater variety of problems requires releasing GEORGE
from the constraint of using only dimensional analysis to solve
problems. For this extension we are shifting to a system of logic
that freely permits the definition of both quantitative and
qualitative relations. In addition, the heuristic rules can be
represented within the same formalism. The programming language
Prolog (6) has been used in creating a prototype of a more capable
problem solver. Currently the prototype handles all of the problems
that GEORGE can handle, and in addition can deal with gas law
problems and physical transformations.
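
The idea of holding quantitative relations as rules and chaining them toward a desired quantity can be sketched briefly. The prototype itself is written in Prolog; the following is only an illustrative Python analogue, not the authors' program. The rule table, the quantity names, the `solve` function and the numerical inputs are all hypothetical; the gas constant is the standard value in L atm / (mol K).

```python
R = 0.082057  # L atm / (mol K), standard gas constant

# Each rule: (quantity produced, quantities required, function of known values).
RULES = [
    ("moles", ("pressure", "volume", "temperature"),
     lambda k: k["pressure"] * k["volume"] / (R * k["temperature"])),  # ideal gas law
    ("mass", ("moles", "molar_mass"),
     lambda k: k["moles"] * k["molar_mass"]),
]


def solve(goal, known):
    """Fire rules until the goal is derived or no further rule applies."""
    known = dict(known)
    trace = []          # record of which data were combined at each step
    progress = True
    while goal not in known and progress:
        progress = False
        for produced, needs, fn in RULES:
            if produced not in known and all(n in known for n in needs):
                known[produced] = fn(known)
                trace.append(f"{produced} from {', '.join(needs)}")
                progress = True
    return known.get(goal), trace


mass, steps = solve("mass", {"pressure": 1.0, "volume": 22.4,
                             "temperature": 273.15, "molar_mass": 32.0})
for step in steps:
    print(step)
```

The trace plays the role of the explanation: it lists, in order, which pieces of information were combined to reach each intermediate.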

Conclusion

We see three levels of use for GEORGE. At the first level, students
benefit from the rigor required even to supply information to GEORGE.
Students must identify the desired quantity, label the quantity, and
supply the unit. For each piece of available data the student must
be just as rigorous. This rigor should help students develop good
habits for approaching problems. At the second level, GEORGE acts in
much the same way that a roommate might act when helping a fellow
student with a problem. We can imagine a roommate saying "Yes, I
will show you how to do this problem, but you do the next one by
yourself," or "You work the problem first, and then I will show you
how I would have done it." In this sense GEORGE acts as a
"problem-solving partner" (7). At the third level of use, a
student is proficient at working the kinds of problems that GEORGE
can solve. GEORGE then becomes a tool for solving even larger
problems. The program frees the student from the tedium of
working through the arithmetic and lets the student concentrate on
the chemistry of the larger problem. As teachers, we seek ways to
help students with those larger problems. A program such as GEORGE
is one approach that we could use to help students expand their
problem-solving capability to deal with problems that could be
tedious indeed with only a calculator as a tool.

Most other software for chemical education is of the kind that
an instructor would select for a class. Implicitly or explicitly the
instructor says "Student, go use this program." GEORGE may lie in a
very different realm in which the students rather than the
instructors are the ones who choose the program. The difference
could be one small step toward fulfilling the promise that computers
hold for far-reaching changes in education.

Literature Cited

1. The Pierce Report, "Computers in Higher Education"; A report of
   the President's Science Advisory Committee, February 1967.
2. Gelder, J.; Snelling, R. "Chem Lab Simulation 2"; High Technology
   Inc., Tulsa, OK, 1979.
3. Cornelius, R.; Cabrol, D.; Cachet, C. "GEORGE - A Problem-Solver
   for Chemistry Students"; COMPress, Wentworth, NH, 1985.
4. TK! Solver, Software Arts, Wellesley, MA, 1983.
5. Johnstone, A. H. J. Chem. Educ. 1984, 61, 847.
6. Colmerauer, A.; Kanoui, H.; Caneghem, M. "Prolog, bases
   théoriques et développements actuels"; Technique et Science
   Informatique 1983, 4, 271.
7. Cabrol, D.; Cachet, C.; Cornelius, R. "De nouveaux outils pour
   apprendre: les partenaires de résolution de problèmes: GEORGE et
   sa descendance"; Méthodes Informatiques dans l'Enseignement de la
   Chimie, September 17, 1985, Lille, France.

RECEIVED December 17, 1985


12
Analogy and Intelligence in Model Building

W. Todd Wipke and Mathew A. Hahn

Department of Chemistry, University of California, Santa Cruz, CA 95064

This paper describes a new approach to building
molecular models using methods of expert systems. We are
applying symbolic reasoning to a problem previously only
approached numerically. The goals of this project were
to develop a rapid model builder that mimicked the manual
process used by chemists and to provide a justification
for the model as a chemist would justify a particular
conformation. The AIMB algorithm reported here is
extremely fast and has a complexity that increases
linearly with the number of atoms in the model.

This paper describes the first application of analogy and
intelligence to molecular model building. It represents a departure
from previous methods, a new approach aimed at rapid, automatic,
accurate molecular model building.

Background

Current approaches to molecular model building involve either manual
construction or energy minimization. Manual construction of models
is performed using programs like COORD.(1) The user specifies
internal coordinates (bond length, bond angle, and dihedral angle)
and the program converts the internal coordinates into Cartesian
coordinates. Programs like COORD are frequently used to generate
initial Cartesian coordinates for refinement by other computational
programs. The cumbersome data entry of COORD can be simplified by
having the program automatically select bond distances and bond
angles based on the atom types and bond types specified.(2) Using
manual construction from internal coordinates, constructing chains is
easy, but constructing closed rings is difficult. One must know
exactly the correct set of bond lengths, angles, and dihedral angles
to force the chain to close as a perfect ring.
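
The conversion from internal to Cartesian coordinates can be illustrated with a short sketch. This is not the COORD program, whose internals the text does not describe; it is a standard geometric construction written in Python, with hypothetical function names, that places a fourth atom D given three already placed atoms A, B, C, the C-D bond length, the B-C-D bond angle and the A-B-C-D dihedral angle. The coordinates of the first three atoms are illustrative values for a carbon chain.

```python
import math


def sub(p, q):
    return tuple(pi - qi for pi, qi in zip(p, q))


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def unit(p):
    length = math.sqrt(dot(p, p))
    return tuple(pi / length for pi in p)


def place_atom(a, b, c, bond, angle_deg, dihedral_deg):
    """Cartesian position of D from |CD|, angle B-C-D and dihedral A-B-C-D."""
    theta = math.radians(angle_deg)
    phi = math.radians(dihedral_deg)
    u = unit(sub(c, b))             # axis along the B->C bond
    n = unit(cross(sub(b, a), u))   # normal to the A-B-C plane
    m = cross(n, u)                 # completes the orthonormal local frame
    local = (-bond * math.cos(theta),
             bond * math.sin(theta) * math.cos(phi),
             bond * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + local[0] * u[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))


# Illustrative chain: three carbons in the xy plane, fourth added anti (180 degrees).
a = (0.0, 0.0, 0.0)
b = (1.54, 0.0, 0.0)
c = (2.054, 1.452, 0.0)
d = place_atom(a, b, c, bond=1.54, angle_deg=109.5, dihedral_deg=180.0)
print(d)
```

Building a chain is a matter of repeating this step atom by atom. Closing a ring is harder because the last atom placed must also land at the correct distance and angle from the first, which the construction does not guarantee.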

An extension of the atom construction method allows adding
groups or predefined templates rather than just atoms. Chemlab
II,(3) MOLBUILD,(4) MMSX,(5) Sybyl, and Chemgraph all use this
method. Preformed rings can be added as templates, thus avoiding the
ring closure difficulty. These methods result directly in a
three-dimensional model, but are not automatic. Building a model
that has the right stereochemistry is an additional problem since
specification of stereochemistry in these systems is not simple.

The first automatic model builder was PRXBLD,(6) originally a module
of the SECS synthesis planning program(7) and later distributed as a
stand-alone program. PRXBLD takes a two-dimensional structural
diagram with stereochemistry and minimizes the 2-D structure to a 3-D
structure. It has been incorporated in the PROPHET, DENDRAL, and
ADAPT systems and is distributed by Molecular Design Limited. PRXBLD
was the first molecular modeling program to integrate symbolic
intelligence and heuristics with numerical methods. Some of the
heuristics are: 1) Ignore hydrogen atoms; expand carbon to include
the volume that the hydrogens should occupy. 2) Ignore low energy
terms and avoid expressions with large exponents when the structure
is badly distorted. 3) Use four stages of refinement; change stages
by the strain energy per atom. 4) Include a pseudo potential to
force minimization to the stereochemistry specified. 5) Use analogy
to select parameters for force constants that are not available.
PRXBLD never balked for lack of a parameter, and thus always gave an
answer. Using PRXBLD, the chemist could, for the first time, obtain
a three-dimensional model by simply drawing the two-dimensional
structural diagram. Although it was the fastest model builder of its
time, certain types of structures still required considerable
computation because PRXBLD used numerical minimization.

More recently, the SCRIPT program by Cohen(8) also takes a
drawing as input and uses a limited library of ring conformations to
generate approximate geometry for minimization. Dolata, using PROLOG
and predicate calculus methods (analogous to those used in our QED(9)
work) developed an expert system called WIZARD (10) to select a
reasonable set of internal coordinates for an acyclic molecule. From
these internal coordinates Cartesian coordinates are derived which
are then given to MM2 for refinement. WIZARD has not yet handled
cyclic systems.

There is a need for quick 3-D model generation. Models are
required where knowledge of molecular shape is essential to the
understanding of structure-activity and structure-reactivity
relationships. Most certainly there will be programs in the future
that hypothesize structures; these programs will need rapid model
generation in order to evaluate 3-D constraints. For these
applications, the models must be created automatically, without
interactive intervention. We also envision that the vast libraries of
molecular structures stored in chemical databases will need to be
converted to 3-D geometry libraries in order to use these databases
in designing new 3-D structures.

Goals of AIMB

The goals of AIMB are listed in Figure 1. We wish to generate models
rapidly and symbolically. Chemists have over the years acquired a
great deal of knowledge about the structure of molecules; we plan to
give AIMB the advantage of access to that knowledge. Similarly,
chemists have used various methods to reason by analogy:
iso-electronic structures, the valence electron model, the periodic
table, etc. We wish to incorporate such knowledge in AIMB. The
avoidance of molecular mechanics is a negative goal; perhaps it is
more correct to say our goal is to demonstrate that it is possible to
build good models without molecular mechanics. After all, chemists
build very good models manually and mentally without minimization.

1. Build 3-D Model RAPIDLY, SYMBOLICALLY
2. Use Knowledge and Analogy like Chemist
3. Avoid molecular mechanics
4. Provide support for results:
   a) Literature precedent
   b) Causes of uncertainty
   c) Quality assessment
   d) Next best models
5. Extendible to conformational search

Figure 1. Goals of AIMB.

A significant goal is to have our model builder explain and
justify its answer. Computing methodology now enables us to show how
an answer is derived; the most notable example of explanation
capabilities is MYCIN, a medical diagnosis program.(11) Chemists
have the same need for explanation of computing results that doctors
have. Users of model building programs are frequently non-experts.
They need to know whether the program that someone else wrote is
applicable to their problem, and to what degree the program can
handle their problem. They should receive, for example, indications
of the literature precedent showing that the method can handle that
case well, or an example of an evaluated answer for a "similar"
problem. Every answer in science carries an uncertainty, but current
numerical programs do not reveal this uncertainty to the user. The
model builder should explain what the causes for the uncertainty are
as well as the probable magnitude. It is desirable to obtain an
overall quality assessment of the model and some indication of which
parts of the model are most strongly supported and which parts are
most tenuous. Thus the quality assessment must also apply to the
individual components of the model when possible. Another excellent
way to justify the "answer" is by presenting for comparison the "next
best" models.

If we succeed with these objectives, the final goal is to apply
the same methods with minor modification to generate all "reasonable"
conformers of a compound, i.e., to develop a symbolic conformational
search capability.

Components of AIMB

We envisioned in our design of AIMB that we could use a library of
X-ray crystal structures as our "experience" or knowledge of
three-dimensional models of molecular structures. It could, of
course, be a library of computed structures, but we favor having an
experimental basis for our inferences. In our reasoning we need to
duplicate the learned chemical rules of analogy; in this case, those
analogies that preserve the three-dimensional structure of the
molecule. A module to analyze the problem to be solved, and a module
to construct the final three-dimensional model seemed obvious.
Graphical input of the problem and output of the result seemed
obligatory. In fact, we even rely heavily on graphics for observing
the operation of the program for debugging purposes. A model or
three-dimensional inference evaluator is also important. Lastly, to
construct an explanation, the program needs to extract from its
inferences and knowledge base the trail of logic and supporting data
for the resulting three-dimensional model. It also needs to retain
runner-up models and the supporting data for them.

1 Create library of models
2 Enter structural diagram of target
3 Perceive target
4 Target or analogs in library?
5 No, Divide into subproblems, solve each
6 Assemble solved subproblem parts
7 Compute degree of fit of subparts
8 Prepare supporting data
9 Display completed model

Figure 2. AIMB algorithm.

AIMB Procedure

The sequence of events in AIMB is summarized in Figure 2. First the
library of experience (known structures) is processed into a form for
rapid retrieval. In this paper we used a 2000 compound library from
the Cambridge Crystal File. Each of these structures represents an
experimental result with the precision of the crystal structure
refinement and literature reference to the original paper. These are
not "averaged" templates, although nothing in our design precludes
using a library of averaged templates or theoretically calculated
structures. The library is processed once. New experimental results
can be entered incrementally by processing just the new structures.

Model building begins with the chemist drawing the
two-dimensional structural diagram of the desired structure with
stereochemistry, now a well-established practice.(12),(13) The target
is perceived to identify rings,(14) chains, aromaticity, and
stereochemistry,(15) but currently not functional groups. The
canonical SEMA name is also generated.(16)

AIMB first determines if the target or close analogs are
contained in the knowledge base. This access to the knowledge base
is essentially instantaneous because we use hash coding methods.(17)
If something is found, it selects the most relevant experience (known
model from the knowledge base) and uses that geometry for the
problem. We will discuss how AIMB evaluates "closeness" of analogies
in a moment. If there is no close analog to the whole target
compound, AIMB uses a "divide and conquer" strategy. The problem is
divided into subproblems, each of which is treated as a new problem.
As in general systems analysis, the best subdivisions of a system are
those that minimize connections (interactions) between subsystems.
In our case we select ring assemblies(14) and chains as subdivisions.
When the component is carved out, we also retain information about
the context in which the component resides.
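
The combination of a hashed lookup with a "divide and conquer" fallback can be sketched as follows. This is an illustrative Python outline, not the AIMB implementation: the library keys, the stored geometry labels, the decomposition table and the `build` function are all hypothetical stand-ins for canonical names, crystal structure records and the ring and chain perception step.

```python
# Hypothetical knowledge base: canonical name -> stored geometry record.
LIBRARY = {
    "norbornanone": "xray:norbornanone-analog",
    "benzyl-chain": "xray:benzyl-chain-analog",
}

# Hypothetical decomposition table standing in for ring/chain perception.
COMPONENTS = {
    "7-benzyl-2-norbornanone": ["norbornanone", "benzyl-chain"],
}


def build(name, trail=None):
    """Return the reasoning trail: (component, geometry source) pairs."""
    trail = [] if trail is None else trail
    if name in LIBRARY:                      # dictionary lookup is a hash probe
        trail.append((name, LIBRARY[name]))
        return trail
    parts = COMPONENTS.get(name)
    if not parts:
        raise LookupError(f"no analog and no subdivision for {name!r}")
    for part in parts:                       # each part is treated as a new problem
        build(part, trail)
    return trail


for component, source in build("7-benzyl-2-norbornanone"):
    print(component, "<-", source)
```

Because a dictionary lookup takes roughly constant time on average, checking the whole target first costs little even when the answer is that no match exists.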

These components or subproblems are prioritized to solve the
largest, most rigid ring assemblies first. This will force an early
failure, if fail we must. AIMB seeks the closest analogies to this
subproblem present in the knowledge base. First it looks for
components in an identical environment with identical structure, next
with atom type analogies, then with different environment and
different atom types, relaxing the matching constraints until one or
more analogs are found.
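
The progressive relaxation of matching constraints can be sketched as an ordered list of tests tried from strictest to loosest. This is an illustrative Python outline under stated assumptions, not AIMB's matching code: the `Component` record, the single halogen analogy class and the three level predicates are hypothetical simplifications of the levels named above.

```python
from typing import NamedTuple


class Component(NamedTuple):
    skeleton: str       # connectivity label (hypothetical)
    atom_types: tuple   # element symbols in a fixed order
    environment: str    # label for the neighbouring context


HALOGENS = {"F", "Cl", "Br", "I"}


def analogous(x, y):
    return x == y or (x in HALOGENS and y in HALOGENS)


def types_analogous(t, c):
    return len(t.atom_types) == len(c.atom_types) and all(
        analogous(x, y) for x, y in zip(t.atom_types, c.atom_types))


# Levels are ordered from the strictest match to the most relaxed.
LEVELS = [
    ("identical structure and environment",
     lambda t, c: t.atom_types == c.atom_types and t.environment == c.environment),
    ("atom type analogies",
     lambda t, c: types_analogous(t, c) and t.environment == c.environment),
    ("different environment and atom types",
     lambda t, c: True),
]


def find_analogs(target, library):
    """Return the first level that yields hits, together with those hits."""
    same_skeleton = [c for c in library if c.skeleton == target.skeleton]
    for label, accept in LEVELS:
        hits = [c for c in same_skeleton if accept(target, c)]
        if hits:
            return label, hits
    return None, []


target = Component("bicyclo[2.2.1]heptane", ("C", "C", "Cl"), "chain")
library = [Component("cyclohexane", ("C", "C", "Cl"), "chain"),
           Component("bicyclo[2.2.1]heptane", ("C", "C", "Br"), "chain")]
print(find_analogs(target, library))
```

Stopping at the first level that returns any hit keeps the closest analogs from being diluted by looser ones.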

The solved subparts are then assembled by the constructor. The
degree of fit is computed and retained for later evaluation. The
process continues until all subproblems are solved. Then the
explanation module prepares the supporting explanation for the model
from the reasoning trail and evaluation results. The final model is
then displayed and the explanation presented.

Our system architecture is illustrated in Figure 3 with the
input screen shown in Figure 4. Graphical input is handled primarily
by the PS300 system with occasional messages passing between the
PS300 and the VAX. The small window is a rotatable three-dimensional
view of the developing model.

We will take 7-benzyl-2-norbornanone as an example for the model
building process. Figure 4 shows the compound as it has been drawn
in two dimensions by the user. The program does not find an exact
match or an analog match for the compound. Therefore, the compound
is divided into subcomponents with the norbornyl group becoming the
first subproblem. We can in Figure 5 observe AIMB evaluating the
relevance of a bromocamphor compound as an analog for the norbornyl
group. Debug switches have been set to show internal scoring and
conclusions. This structure ends up scoring the best.

Now we turn to the scoring function. The function for
evaluating the closeness of analogy involves a sensitivity parameter
for the atom class, s, times the dissimilarity parameter, d:

    \mathrm{score} = \sum_{i=1}^{N_A} \sum_{k=1}^{M} s_i \, d_{ik}(a_i, a'_j)

where a_i ∈ target, a'_j ∈ analog, and j = map(i).

Summation occurs over all atoms and all attributes. There are three
atom classes: normal atoms, origin atoms (where components join) and
dummy atoms (partial context of neighboring component environment).
Attributes include atom type, bond type, stereochemistry, etc. For
atom type dissimilarity, there is a series of atom analogy classes of
varying degree of closeness. For example, if atom types are equal,
d = 0; if a_i and a'_j belong to {Cl, Br, I}, d = 3; if a_i and a'_j
belong to {F, Cl, Br, I}, d = 5; and if a_i and a'_j are not members
of any analogy classes then d = 10. Aromatic bonds are considered
analogous to double bonds. Since we are still exploring the analogy
heuristics, we should take these only as examples of the approach.
Returning to our example, Figure 6 shows the scores computed for two
analogous bicyclo[2.2.1]heptanes. The bromo compound at 1500 is
better than the dimer at 1540 (lower number indicates smaller
dissimilarity). Note that AIMB recognizes enantiomers and is able to
use a reflection of the model. In this case the carbonyl is on the
wrong side of the molecule.
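
The atom type part of the scoring function can be sketched directly from the example values above (d = 0, 3, 5, 10). This Python outline is an illustration, not the AIMB scoring code: only the atom type attribute is included, the sensitivity weights for the three atom classes are invented for the example because the text gives no values, and the function names are hypothetical.

```python
# Analogy classes from the text, ordered from closest to loosest.
ANALOGY_CLASSES = [({"Cl", "Br", "I"}, 3), ({"F", "Cl", "Br", "I"}, 5)]

# Hypothetical sensitivity weights per atom class; the text gives no values.
SENSITIVITY = {"normal": 1.0, "origin": 2.0, "dummy": 0.5}


def atom_type_dissimilarity(a, b):
    """d for the atom type attribute: 0, 3, 5 or 10 as in the text."""
    if a == b:
        return 0
    for members, d in ANALOGY_CLASSES:
        if a in members and b in members:
            return d
    return 10


def dissimilarity(target, analog, mapping):
    """Sum s * d over target atoms; mapping sends target index i to analog index j."""
    total = 0.0
    for i, (element, atom_class) in enumerate(target):
        element_j, _ = analog[mapping[i]]
        total += SENSITIVITY[atom_class] * atom_type_dissimilarity(element, element_j)
    return total


target = [("C", "normal"), ("Cl", "normal"), ("C", "origin")]
analog = [("C", "normal"), ("Br", "normal"), ("O", "origin")]
score = dissimilarity(target, analog, {0: 0, 1: 1, 2: 2})
print(score)  # lower means a closer analogy
```

Weighting origin atoms more heavily than normal atoms, as assumed here, would make mismatches at the points where components join count for more; whether AIMB does so is not stated in the text.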

The acyclic component is found in two compounds (Figure 6), one
having an oxygen in place of the central carbon (540), the other
having a carbon (220). In both cases the compounds have an aromatic
ring joined by a one-atom chain to an aliphatic five-membered ring
system. Finally for our example, AIMB assembles the components in
three dimensions. The explanation module is under development, but
the reasoning trace and quality of model evaluation are implemented.
In this example the fit of the phenyl group to the chain is within
0.01 Å, and the fit of the norbornane skeleton to the chain is 0.11
Å (this reflects the fact that the analogy for the chain was
connected to a less constrained five-membered ring). The actual
accuracy of the resulting model is much better than these values
indicate because the discrepancy in fit is only an indication of the
similarity of the analogy and is not incorporated into the model.
The chain analogy comes from Cryst. Struct. Commun. 8, 553 (1979);
the norbornane skeleton from Acta Crystallogr., Sect. B 31, 903 (1975).
The reported precision of the experiment for the crystal structures
is also available.

Figure 4. Input on E&S PS330 of molecule to be modeled by AIMB.

Figure 6. Dissimilarity scores for analogies relevant to
problem. The lower score is the better analogy.

An ORTEP plot of the AIMB-built model is shown in Figure 7.
Allinger's molecular mechanics program MM2(18) was then used to
refine the model constructed by AIMB. Superposition of the AIMB
model with that refined by MM2 is shown in Figure 8. MM2 did not
change the dihedral angles of the benzyl group from those proposed by
AIMB. Let's recall that AIMB has no internal knowledge of structural
chemistry, but only knows how to use analogies and a knowledge base
of known models. AIMB does not currently know about any kind of
non-bonded interatomic interactions, yet AIMB built a correct model
of the example target compound because the knowledge of interactions
and how to minimize them is embedded in the knowledge base of known
models. Thus AIMB built a minimum energy model (verified separately
by MM2), yet AIMB did this symbolically by reasoning rather than
minimization.

Table I. Speed of building model of 7-benzyl-2-norbornanone

Method          Time (seconds)

AIMB                  40
Human being          118
PRXBLD               644
MM2                 4436

Several significant points can now be made. First, as Table I shows,
on the VAX 11/750, AIMB took only 40 seconds to construct the model.
A chemist took 145 seconds to assemble a Fieser model of the compound
and, when completed, the chemist did not know the dihedral angles of
the benzyl group. PRXBLD took 644 seconds to build the model. This
molecule is difficult for PRXBLD because adjustment of the chain
angles requires movement of the two large groups of atoms, but PRXBLD
does not recognize that the groups can be moved as a unit.
Allinger's MM2 took 4436 seconds to converge to the default CHEMLAB
criteria and that was when given a very good input structure (PRXBLD
model).


Figure 7. ORTEP plot of final AIMB model of target molecule.

Figure 8. Superposition of AIMB model and result of MM2 refinement.


Conclusion

We have shown that analogy and intelligence applied to model building
leads to a fast, accurate algorithm. Using prior knowledge is
efficient. This method is applicable to complex functionality where
the forces or interactions may not be well understood, e.g.,
inorganics and organometallics, but where there are many known
crystal structures. While we based our knowledge on crystal data,
one could also use computed structures separately or in conjunction
with crystal data. The process we described is easy for any chemist
to understand. AIMB does not involve force fields or complicated
mathematics. The models AIMB generates are supported by experimental
data and highly justified. Finally, while the time required by energy
minimization methods increases exponentially as the number of atoms in
the problem increases, the time required by the AIMB algorithm
increases linearly with the number of atoms.

Although AIMB is a working prototype, we have many questions
remaining to be answered. We would like to explore the detailed
heuristics and study the effect of changing these on the final models
constructed. We are interested in seeing how the size of the
knowledge base is related to the quality of results and speed of
operation. Finally, we would like to explore its application in
areas where conventional methods simply cannot be used.

Acknowledgments

This work was supported by PRXBLD users and in part by a Faculty
Research Grant from the University of California.

Literature Cited

1. Mueller, K. "COORD: Interconversion of Cartesian and Internal
    Coordinates (QCPE 419)". QCPE Bull. 1981, 1, 37.
2. Program MDCORD does this. Personal Communication, Douglas
    Hounshell.
3. Potenzone, R., Jr.; Cauicchi, E.; Hopfinger, A. J.; Weintraub,
    H. J. R. "Molecular Mechanics and the CAMSEQ Processor".
    Computers and Chemistry 1977, 1, 187.
4. Liljefors, T. "MOLBUILD: An Interactive Computer Graphics
    Interface to Molecular Mechanics". J. Mol. Graphics 1983, 1
    (4), 111.
5. Humbolt, C. "MMS-X Modeling System User's Guide"; Technical Memo
    7, Washington Univ., St. Louis, MO, Jan. 1980.
6. Wipke, W. T.; Verbalis, J.; Dyott, T. "Three-Dimensional
    Interactive Model Building". Presented at the 162nd National
    Meeting of the American Chemical Society, Los Angeles, August
    1972.
7. Wipke, W. T. "Computer-Assisted Three-Dimensional Synthetic
    Analysis". In Computer Representation and Manipulation of
    Chemical Information; Wipke, W. T.; Heller, S. R.; Feldmann, R.
    J.; Hyde, E., Eds.; John Wiley and Sons, Inc.: 1974, pp 147-174.
8. Cohen, N. C.; Colin, P.; Lemoine, G. "SCRIPT: Interactive
    Molecular Geometrical Treatments on the Basis of Computer-Drawn
    Chemical Formula". Tetrahedron 1981, 37, 1711-1721.
9. Dolata, D. P. QED: Automated Inference in Planning Organic
    Synthesis, PhD dissertation, University of California, Santa
    Cruz, 1984.
10. Dolata, D. P. "WIZARD - Artificial Intelligence in
    Conformational Analysis". Presented at the Drug Information
    Workshop, Feb. 4-6, 1985.
11. Shortliffe, E. H. "Computer Based Medical Consultations:
    MYCIN"; American Elsevier: New York, 1976.
12. Corey, E. J.; Wipke, W. T. "Computer-Assisted Design of Complex
    Molecular Syntheses". Science 1969, 166, 178.
13. Corey, E. J.; Wipke, W. T.; Cramer, R. D.; Hower, W. J. J. Am.
    Chem. Soc. 1972, 94, 421.
14. Wipke, W. T.; Dyott, T. M. "Use of Ring Assemblies in a Ring
    Perception Algorithm". J. Chem. Inf. and Comput. Sci. 1975, 15,
    140.
15. Wipke, W. T.; Dyott, T. M. "Simulation and Evaluation of
    Chemical Synthesis. Computer Representation of
    Stereochemistry". J. Am. Chem. Soc. 1974, 96, 4825, 4834.
16. Wipke, W. T.; Dyott, T. M. "Stereochemically Unique Naming
    Algorithm". J. Am. Chem. Soc. 1974, 96, 4834.
17. Wipke, W. T.; Krishnan, S.; Ouchi, G. "Hash Functions for Rapid
    Storage and Retrieval of Chemical Structures". J. Chem. Inf.
    and Comput. Sci. 1978, 18, 32.
18. Burkert, U.; Allinger, N. L. "Molecular Mechanics"; American
    Chemical Society: ACS Monograph, Vol. 177, 1982.

RECEIVED January 24, 1986

13
Computer-Assisted Drug Receptor Mapping Analysis
Teri E. Klein¹, Conrad Huang¹, Thomas E. Ferrin¹, Robert Langridge¹, and Corwin Hansch²
¹Computer Graphics Laboratory, University of California, San Francisco, CA 94143
²Department of Chemistry, Pomona College, Claremont, CA 91711

KARMA is an interactive computer-assisted drug design tool that incorporates
quantitative structure-activity relationships (QSAR), conformational analysis,
and three-dimensional graphics. It represents a novel approach to receptor
mapping analysis when the x-ray structure of the receptor site is not known.
KARMA utilizes real-time interactive three-dimensional color computer graphics
combined with numerical computations and symbolic manipulation techniques from
the field of artificial intelligence.

Many problems in chemistry may benefit from developments in the field of Artificial Intelligence
(AI), particularly the area now known as knowledge engineering. Knowledge can be described as
that which includes both empirical material and that "which is derived by inference or
interpretation". (1) It may consist of descriptions, relationships, and procedures in some domain
of interest. (2) We are now incorporating methods from knowledge engineering research in computer assisted
drug design.
Molecular modeling with interactive color computer graphics in real time is a powerful
method for studying molecular structures and their interactions. Display and manipulation of
computer generated skeletal and surface models provide efficient methods for the chemist to
examine steric interactions of many ligands with the binding sites in their receptors. We have
combined x-ray crystallographic results, quantitative structure-activity relationships (QSAR), and
interactive three-dimensional graphics in earlier attempts to design better ligands for enzyme
binding. (3,4) We are applying knowledge engineering techniques provided by the software KEE
(Knowledge Engineering Environment) (5) to the development of rational drug design methods
without having x-ray crystallographic results in hand.
Our integrated system, KARMA, KEE Assisted Receptor Mapping Analysis, uses knowledge
sources, including QSAR and conformational analysis, in a rule-based system to create an anno-
tated visualization of the receptor site. This is then used in an iterative manner to guide the inves-
tigator in generating rules, hypotheses, and new candidate structures for drug design. This
approach to receptor mapping and drug design differs from the traditional approach used by
chemists in two significant ways. Classically, in computerized drug design, one superimposes a
set of structurally related molecules (congeners) so that their bioactive functional groups coincide,
yielding a pharmacophore. A surface is then derived based on the composite molecule supposedly
yielding a complementary shape of the receptor. (6) This approach has met with limited success
because compounds that act as substrates or inhibitors of certain receptors do not necessarily bind
similarly. It is our belief that the commonality of the binding mode must be established. The
other shortcoming of the traditional approach is that it provides little information on the qualitative
character of the enzyme surface. The classical lock and key concept of ligand-receptor
emphasizes structural geometry and may neglect the importance of interactions such as hydropho-
bicity. Processing of the binding data using QSAR prior to receptor mapping analysis yields
information not only about the hydrophobic and polar nature of the surface model, but also about
the steric and electronic properties of the data. (7)

System Design
KARMA is a set of programs residing on several machines connected by a high bandwidth network
(see Figures 1 and 2). The main program resides on the Lisp machine and controls all processing.
The controlling program on the Lisp machine is implemented on top of KEE which embodies many
knowledge engineering techniques. KEE provides a set of software tools that allows for very rapid
software prototyping, evaluation, debugging, and modification. Specifically, KARMA takes
advantage of KEE's capabilities that include frame-based knowledge representation with inheritance, a
rule-based inference system, a graphical interface for debugging and displaying knowledge bases,
and a flexible interface that allows for the integration of outside methods. (5)
Input to the controlling program consists of congener sets and their related QSAR
equations. A satellite program, based on the Pomona MedChem Software SMILES (Simplified
Molecular Input Line Editor System), is used for input of the structures. (8) SMILES creates a unique
identifying code for each chemical structure which is useful for searching for structures and
physiochemical parameters, and minimizing duplication of structural information. These structures
are passed to satellite programs, including distance geometry (9) and energy minimization (10), to
generate multiple conformations that are then displayed so that users may select those of interest.
These structures, which constitute the basis set, are used to define the receptor model.
The receptor model is represented graphically by a set of surfaces. These surfaces are
defined by a set of control points which are calculated on the compute server. Control points,
which are based on minimized structures, are then manipulated by KARMA's rules system. These
rules provide detail to the receptor surface model. During this process, KEE provides a graphical
interface showing which rules and derivations are being accepted as true. The user can also
interact with KARMA's rule system during this time. The surface model is displayed using the con­
trol points to form bicubic patches on the graphics workstation. The user can then manipulate the
surface as well as modify the structure. These modifications are sent back to the controlling pro­
gram for refinement by the rules. This iterative process continues until the user is satisfied with
KARMA's results.
As seen in Figure 1, our hardware is connected by an Ethernet. (11) The control server is a
Symbolics 3600 Lisp Machine and the compute server is a DEC VAX 8600. The three-dimensional
graphics workstations include the Silicon Graphics IRIS 2400 and the Evans and Sutherland
PS350. Electronic communication with collaborating scientists at other institutions is available
through the VAX 750 via several networks including the ARPAnet and CSnet.

System Implementation
Input to the controlling program is achieved through a series of "pop-up" menus in the Karma
Window (see Figure 3a). For example, if the user is interested in entering a set of congeners, the
user would select the molecule editor. KARMA will then display the molecule editor layout in the
current window. Users can then enter the chemical structures by selecting Structure from the
molecule editor menu (see Figure 3b). Structures are currently entered using the tree structure of
SMILES (see Figure 3c). (The molecule editor will be expanded to allow for graphical input in
the future.) KARMA then displays the two-dimensional structure for user verification (see Figure
3d). Coordinates for the three-dimensional structures are saved in a knowledge base in KEE. The
three-dimensional structures are based on x-ray crystallographic data, standard bond angles, and
bond lengths. All congener data, including physiochemical parameters such as log Ρ or MR (cal­
culated or experimental), can easily be entered and revised in the molecule editor (see Figure 4).
Three-dimensional coordinates for the congener set are passed to the distance geometry and
minimization programs. These satellite programs provide efficient methods for searching
conformational space. The distance geometry programs include subroutines for controlling ring planarity of
aromatic rings and orientation of the molecules based on a common group of atoms. (12)


[Figure 1 diagram: an Ethernet links the compute server (DEC VAX 8600), two graphics workbenches
(Silicon Graphics IRIS 2400 and Evans & Sutherland PS 350), the control server (Symbolics 3600
Lisp Machine), and the communication server (DEC VAX 750), which connects to the outside world
(e.g., ARPAnet, CSnet, etc.).]

Figure 1. Hardware Configuration.


Copyright © 1985, Regents of the University of California/Computer Graphics Lab.

[Figure 2 diagram: Structure Generation (control server), Distance Geometry and Minimization
(compute server), and Select Feasible Models (graphics workbench), followed by Display Model and
User Modification (graphics workbench), Rules for Characterization of the Surface (control
server), and Surface Generation: 1. Control Points, 2. Bicubic Patches (compute server).]

Figure 2. System Architecture.


Copyright © 1985, Regents of the University of California/Computer Graphics Lab.


[Figure 3 panels: (a) Karma Window with the Editors menu (Molecule, Equation, Rule, Graphics);
(b) Molecule Editor with the Edit menu (Structure, Parameters, Name, Parent); (c) Molecule Editor
with the structure entered as SMILES, c1ccccc1Cc2c(N)nc(N)nc2; (d) Molecule Editor displaying the
two-dimensional structure with its SMILES string and Name field.]

Figure 3. Editing Sample.


Copyright © 1985, Regents of the University of California/Computer Graphics Lab.

[Figure 4 panel: Molecule Editor showing the parent structure (SMILES: c1ccccc1Cc2c(N)nc(N)nc2)
with Calculated Parameters (Parent: clog P = 2.025, cMR = 5.979; Substituent), Experimental
Parameters (Parent: log P = 1.58; Substituent: π = 0.000), and the commands Revise, Save, and
Abort.]

Figure 4. Editing Sample (continued).


Copyright © 1985, Regents of the University of California/Computer Graphics Lab.


The output of the distance geometry and minimization programs is passed to the graphics
program EDGE (Easy Distance Geometry Editor). The structures are displayed three-
dimensionally so users may select structures to represent conformational space. Models are easily
selected by pointing at the desired structure (see Figure 5). X, Y, and Z rotations and translations,
depth cueing, color, and labeling have been incorporated in EDGE. EDGE also provides an RMS
matching routine for N arbitrary atoms designated by the user. The selected models are then used
for surface generation.
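
The RMS match over user-designated atoms can be illustrated with a short Python sketch. This is not EDGE's routine: it assumes the two conformers are already in a common orientation and simply reports the root-mean-square distance over the chosen atoms. The function name and the coordinates are hypothetical.

```python
import math

def rms_deviation(coords_a, coords_b, atom_indices):
    """RMS distance between two conformers over user-designated atoms.

    coords_a, coords_b: lists of (x, y, z) tuples in the same atom order.
    atom_indices: the N atoms chosen by the user for matching.
    No superposition is performed here (an assumption of this sketch).
    """
    if not atom_indices:
        raise ValueError("at least one atom must be designated")
    total = 0.0
    for i in atom_indices:
        total += math.dist(coords_a[i], coords_b[i]) ** 2
    return math.sqrt(total / len(atom_indices))

# Hypothetical three-atom fragments; the second is shifted 1 angstrom along x.
conf_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
conf_b = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (2.5, 1.5, 0.0)]
print(rms_deviation(conf_a, conf_b, [0, 1, 2]))
```

A full matching routine would first find the rotation and translation that minimize this quantity; the sketch covers only the scoring step.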
Surface generation is based on a set of points derived from the outcome of distance
geometry programs applied to the basis set of structures. The basis set of points, P, is defined as:

where P_i is the uniformly distributed set of points over a sphere corresponding to atom i, and
g(P_i P_j) is the overlap of the two sets of points. The density of points/angstrom² can be
arbitrarily set by the user. If the density is relatively high, a large number of bicubic patches with
small area are generated; to address each bicubic patch at a high density would be time-consuming
and difficult at best. If the density of points is low, the patches become too large and don't yield
enough detailed information about the surface model.
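
One way to picture the point set is the following Python sketch. It is an assumption-laden illustration, not KARMA's definition of P: points are spread roughly uniformly over each atomic sphere with a Fibonacci spiral (our choice), the number of points follows a user-set density per square angstrom, and points that fall inside a neighboring sphere are treated as the overlap and discarded. The radii and coordinates are hypothetical.

```python
import math

def sphere_points(center, radius, density):
    """Roughly uniform points on a sphere (Fibonacci spiral).

    density is in points per square angstrom, set by the user.
    """
    n = max(1, round(density * 4.0 * math.pi * radius ** 2))
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n      # evenly spaced heights in (-1, 1)
        ring = math.sqrt(1.0 - z * z)      # radius of the circle at height z
        phi = golden_angle * k
        points.append((center[0] + radius * ring * math.cos(phi),
                       center[1] + radius * ring * math.sin(phi),
                       center[2] + radius * z))
    return points

def basis_points(atoms, density):
    """Union of the per-atom point sets minus points buried in other atoms."""
    kept = []
    for i, (center_i, radius_i) in enumerate(atoms):
        for p in sphere_points(center_i, radius_i, density):
            buried = any(j != i and math.dist(p, center_j) < radius_j
                         for j, (center_j, radius_j) in enumerate(atoms))
            if not buried:
                kept.append(p)
    return kept

# Two hypothetical overlapping atoms: (center, radius in angstroms).
atoms = [((0.0, 0.0, 0.0), 1.7), ((1.5, 0.0, 0.0), 1.7)]
surface = basis_points(atoms, density=2.0)
print(len(surface))
```

Raising the density increases the number of points, and hence the number of small patches, in line with the trade-off described in the text.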
The control points are defined by the basis set of points P. These control points define the
parametric bicubic patches which form the surface model. Advantages of the parametric bicubic
surface include continuity of position, slope, and curvature at the points where two patches meet.
All the points on a bicubic surface are defined by cubic equations of two parameters s and t, where
s and t vary from 0 to 1. The equation for x(s,t) is:

$$
\begin{aligned}
x(s,t) ={} & a_{11}s^3t^3 + a_{12}s^3t^2 + a_{13}s^3t + a_{14}s^3 \\
 & + a_{21}s^2t^3 + a_{22}s^2t^2 + a_{23}s^2t + a_{24}s^2 \\
 & + a_{31}st^3 + a_{32}st^2 + a_{33}st + a_{34}s \\
 & + a_{41}t^3 + a_{42}t^2 + a_{43}t + a_{44}
\end{aligned}
$$

Equations for y and z are similar. (13) Either cardinal spline or B-spline bicubic patches can be
used as they differ only by the starting coefficients. (14) Overlapping sets of control points allow
for the joining of patches. Sixteen points define a bicubic patch. To determine which points
define which patches, an initial triangle is formed from three nearest neighbors. The next triangle
shares one side of the initial triangle and is connected to its next nearest neighbor. This process is
continued iteratively until all points are accounted for. The internal edge of two triangles is then
dropped to form a quadrilateral. Each internal edge is used only once. Nine quadrilaterals define a
single patch. These patches are combined to form the surface model and are manipulated by both
KARMA's rule system and the investigator at the graphics station.
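
The bicubic form for x(s,t) given above, sixteen coefficients multiplying powers of s and t up to the third, can be evaluated directly. The Python sketch below assumes the coefficients are stored as a 4 x 4 nested list in which entry [i][j] (zero-based) multiplies s^(3-i) t^(3-j); the function name and the sample coefficients are ours.

```python
def bicubic(coeffs, s, t):
    """Evaluate x(s, t) = sum over i, j of a[i][j] * s**(3 - i) * t**(3 - j).

    coeffs: 4 x 4 nested list; coeffs[0][0] multiplies s^3 t^3 and
    coeffs[3][3] is the constant term. s and t run from 0 to 1.
    """
    total = 0.0
    for i in range(4):
        for j in range(4):
            total += coeffs[i][j] * s ** (3 - i) * t ** (3 - j)
    return total

# Hypothetical coefficients: only the constant term and the s*t term are set.
a = [[0.0] * 4 for _ in range(4)]
a[3][3] = 5.0   # constant term
a[2][2] = 2.0   # coefficient of s * t
print(bicubic(a, 0.0, 0.0), bicubic(a, 1.0, 1.0), bicubic(a, 0.5, 0.5))
```

The same function serves for y(s,t) and z(s,t) with their own coefficient sets, so one patch needs three such 4 x 4 arrays.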

System Core
The information contained in KARMA's knowledge bases is based upon quantitative structure-
activity relationships (QSAR), kinetic data, and structural chemistry. The combination of QSAR
and kinetic data allows for the study of enzyme-ligand interactions. The Hansch approach to
QSAR, based on a set of congeners, states:
Biological Activity = f(physiochemical parameters)
Physiochemical parameters are used to model the effects of structural changes on the electronic,
hydrophobic, and steric effects for organic molecules. (15) Examples of physiochemical parame­
ters include, among others:


σ, an electronic constant based on the Hammett equation for the ionization of substituted
benzoic acids;
π, the hydrophobic parameter for a chemical substituent based on the octanol-water par­
tition coefficient log P;
MR, the molar refractivity, which parameterizes polarizability and steric effects; and
Verloop's parameters, which are steric substituent values calculated from bond angles
and distances.
Using multivariable linear regression, a set of equations can be derived from the parameterized
data. Statistical analysis yields the "best" equations to fit the empirical data. This mathematical
model forms a basis to correlate the biological activity to the chemical structures.
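
As a minimal illustration of the regression step, the Python sketch below fits the one-parameter special case, activity = a log P + b, by ordinary least squares. It is not the authors' statistics package. The data are synthetic and noise-free, generated from the coefficients 0.89 and 3.56 that appear in one of the horse ADH equations quoted later in this chapter, so the fit simply recovers them.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Returns (slope, intercept, r), where r is the correlation coefficient.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Synthetic congener set: hypothetical log P values, activities generated
# exactly from 0.89 * log P + 3.56 (no experimental noise).
log_p = [0.5, 1.0, 1.5, 2.0, 2.5]
activity = [0.89 * x + 3.56 for x in log_p]
slope, intercept, r = fit_line(log_p, activity)
print(round(slope, 2), round(intercept, 2), round(r, 3))
```

With several parameters (for example log P and σ together) the same idea extends to solving the normal equations for all coefficients at once.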
KARMA describes the interactions for enzyme-ligand binding using QSAR equations and
parameters, and the structural information of the congener data. These interactions, with illustra­
tive examples, are shown below:
Interaction                              Example

enzyme -> specific enzyme                (DHFR* -> chicken DHFR)
congener -> specific enzyme              (benzylpyrimidines -> chicken DHFR)
congener -> specific congener            (inhibitors** -> benzylpyrimidines)
substituents -> specific congener        (3,4,5 OMe -> benzylpyrimidines)
equations -> congeners                   (equation -> benzylpyrimidines)
variable -> substituents                 (4-Cl -> 4-Br)
specific enzyme -> specific enzyme       (chicken DHFR -> L. casei DHFR)

*DHFR - Dihydrofolate Reductase
**inhibitors - triazines, benzylpyrimidines, etc.
The data used for the above interactions is contained in KARMA's knowledge bases, ChemData
and KarmaData. These knowledge bases contain information about classes of objects or
about the objects themselves. Objects and their attributes are represented as individual
"knowledge frames" which are linked together to form a hierarchical structure. Consistency
among the objects in both knowledge bases is obtained through inheritance rules.
ChemData is one of several data bases available in KARMA. This data base contains chemi­
cal information pertaining to chemical elements and molecular substituents. Elemental data
includes atom type, atomic radii, hybridization, molecular weight, etc. Substituent data consists of
unique identifying codes, physiochemical parameter data, and x-ray crystallographic data. For
each substituent, where known, there are values for the hydrophobic parameter, i.e., π, an elec­
tronic parameter, i.e., σ, and a steric parameter, i.e., MR. The associated x-ray crystallographic
data is used for building the small molecules in the congener set. This data is also used for
specifying constraints used in the distance geometry calculations.
KarmaData contains information which the user enters, e.g., QSAR equations, congener set,
as well as information about previously studied enzyme-ligand binding complexes. KarmaData
contains several classes and subclasses. For example, in KarmaData, there is a class called pro­
teins, a subclass in proteins called dehydrogenase, a particular member of dehydrogenase called
DHFR, and a specific instance of DHFR called chicken (vide infra). Chicken DHFR contains
those attributes which are specific to itself, and inherits properties from units DHFR, dehydro­
genase, and proteins.
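
The class, subclass, member, and instance chain described above can be sketched in Python as frames that look up a missing slot in their parent. This is only an illustration of frame inheritance, not KEE's unit system; the `Frame` class and the slot names and values are hypothetical.

```python
class Frame:
    """A knowledge frame with named slots and single-parent inheritance."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        """Return the slot value, searching this frame and then its ancestors."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)

    def lineage(self):
        """Names from this frame up to the root of the hierarchy."""
        names = []
        frame = self
        while frame is not None:
            names.append(frame.name)
            frame = frame.parent
        return names

# Hierarchy from the text; the slot contents are illustrative placeholders.
proteins = Frame("proteins", polymer_of="amino acids")
dehydrogenase = Frame("dehydrogenase", proteins, reaction_type="oxidation-reduction")
dhfr = Frame("DHFR", dehydrogenase, full_name="dihydrofolate reductase")
chicken = Frame("chicken", dhfr, source_organism="chicken")

print(chicken.lineage())
print(chicken.get("full_name"), "/", chicken.get("polymer_of"))
```

Here the chicken frame stores only what is specific to itself and obtains everything else from DHFR, dehydrogenase, and proteins, which mirrors the inheritance described in the text.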


[Diagram: hierarchy in KarmaData. Proteins branches to CAC, DHFR, and ADH; DHFR branches to
E. coli, L. casei, and Chicken.]

KEE provides many different mechanisms for inheritance. KEE has the ability to constrain the type
and number of values assigned to attributes for consistency and description in the knowledge base.
Currently, KARMA's rules are formulated in an if-then format. A rule may have multiple
conditions, conclusions, and actions. KARMA takes advantage of both the forward and backward
chainers for derivation of the three-dimensional receptor model. For example, two types of rules,
generic and specific, can be defined empirically from the results of QSAR as well as from molecu­
lar structure.
Generic rules are based on the QSAR equations and their coefficients. Forward chaining
using these rules yields basic characteristics for the receptor site model. For instance, an
abstracted generic rule may take the form:
If the coefficient of the hydrophobic parameter is approximately equal to one, then ex­
pect complete desolvation about substituent X of the ligand.
This rule was derived empirically from some recent work on several species of alcohol dehydro­
genase (ADH). (16) The following equations were found:
[Compound structure drawings not reproduced.]

Enzyme         Equation

Horse ADH      log 1/K_i = 0.89 log P + 3.56
               n = 11, r = 0.960, s = 0.197

Horse ADH      log 1/K_i = 0.98 log P - 0.83 σ + 3.69
               n = 14, r = 0.937, s = 0.280

Human ADH      log 1/K_i = 0.87 log P - 2.06 σ_meta + 4.60
               n = 13, r = 0.977, s = 0.303

Rat ADH        log 1/K_i = 1.22 log P - 1.80 σ_meta + 4.87
               n = 14, r = 0.985, s = 0.316

Horse ADH      log 1/K_i = 0.96 log P + 5.70
               n = 5, r = 0.990, s = 0.207

where X is the substituent and log Ρ is based on the octanol-water partition.


The average of the coefficients of the hydrophobic term is approximately equal to one (average =
0.97) suggesting complete desolvation about substituent X. Figure 6 shows complete desolvation
by the enzyme ADH (hydrophobic space - red; polar space - blue) around substituent X of the
pyrazole (green).
Another example of a rule dealing with hydrophobicity may take the form:
If the coefficient of the hydrophobic parameter is greater than 0.5 and less than 1.0, then
expect a concave surface about substituent X of the ligand.
This type of rule is empirically based on the enzyme-ligand binding such as that of carbonic anhy-
drase c (CAC) and sulfonamides. (4) The following equation was found:
[Compound structure drawing not reproduced.]

log K = 1.55 σ + 0.64 log P - 2.07 I_1 - 3.28 I_2 + 6.94
n = 29, r = 0.993, s = 0.190


Figure 7 shows how the solvent accessible surface of the enzyme CAC (hydrophobic space - red;
polar space - blue) is slightly concave about the substituent X of the sulfonamide (green). Similar
rules exist for the coefficients which describe other aspects of hydrophobicity, as well as polar
space, which help define the basic shape, i.e., cleft or hole, of the surface receptor model.
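
The two generic hydrophobicity rules quoted above can be written as a toy forward chainer in Python. This is a sketch under stated assumptions, not KARMA's KEE rule base: "approximately equal to one" is taken to mean within 0.15 of 1.0 (our threshold), the concave rule is applied only below that band so the two rules do not both fire, and the third rule is a purely illustrative addition that shows one conclusion triggering another.

```python
TOLERANCE = 0.15  # assumed meaning of "approximately equal to one"

# Each rule: (condition on the fact base, slot to set, value to assert).
RULES = [
    (lambda f: abs(f["hydrophobic_coefficient"] - 1.0) <= TOLERANCE,
     "desolvation", "complete"),
    (lambda f: 0.5 < f["hydrophobic_coefficient"] < 1.0 - TOLERANCE,
     "surface_shape", "concave"),
    # Illustrative chained rule (not from the text): fires on a derived fact.
    (lambda f: f.get("desolvation") == "complete",
     "pocket_character", "hydrophobic"),
]

def forward_chain(facts, rules):
    """Fire rules repeatedly until no rule adds a new fact."""
    facts = dict(facts)
    changed = True
    while changed:
        changed = False
        for condition, slot, value in rules:
            if slot not in facts and condition(facts):
                facts[slot] = value
                changed = True
    return facts

# Coefficients of the log P term taken from the equations quoted in the text.
adh_site = forward_chain({"hydrophobic_coefficient": 0.98}, RULES)
cac_site = forward_chain({"hydrophobic_coefficient": 0.64}, RULES)
print(adh_site)
print(cac_site)
```

Starting from the QSAR coefficient alone, the ADH example ends with a desolvation conclusion and the CAC example with a concave-surface conclusion, in line with the two rules as stated.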
Specific rules are based on the attributes of congeners, including the physiochemical param­
eters used to determine the QSAR equation, the biological activity, and the molecular structure.
Backward chaining, using these rules with specific instances of substituents, yields detailed shape
and character for the receptor model. For instance, an abstracted specific rule may take the form:
If the biological activity of compound y is less in enzyme A than that of related enzyme
B, expect possible steric hindrance about substituent X.
One possible interpretation of this type of rule is the enzyme ligand binding of trimethoprim with
bacterial DHFR and chicken liver DHFR. (17,18)
DHFR Species      Binding Affinity (log 1/K_i)

L. casei          8.87
E. coli           6.88
chicken           3.98

This data shows a noticeable drop in binding affinity for trimethoprim and chicken liver DHFR.
Figure 8 illustrates steric interaction between the 5-OMe of trimethoprim (green) with the
sidechain of Tyr 31 of native chicken liver DHFR (red). There is no steric interaction seen
between the 5-OMe of trimethoprim (green) and the sidechain of Phe 30 of L. casei DHFR (red).
(Right view: chicken liver DHFR; left view: L. casei DHFR.) It is known from x-ray
crystallographic results that the sidechain of Tyr 31 of chicken liver DHFR rotates to accommodate
trimethoprim. (18)
A specific rule can also be based upon comparisons of bond lengths and van der Waals
radii, and biological activities. For instance,
If the biological activity of substituent X2 is less than the biological activity of
substituent X1 and X2 is atomically larger than X1, then expect possible steric hindrance
with the receptor wall about X2, provided that other factors are equal.

This rule can be exemplified by two compounds that differ by the type of the substituent, i.e., a
chlorine and a bromine atom. If the binding affinity for the bromine compound was lower (and

Figure 5. Output of EDGE.


Copyright © 1985, Regents of the University of California/Computer Graphics Lab.

Figure 6. Enzyme-ligand Complex for Alcohol Dehydrogenase and a substituted pyrazole.


Copyright © 1985, Regents of the University of California/Computer Graphics Lab.


Figure 7. Enzyme-ligand Complex for Carbonic Anhydrase C and a substituted sulfonamide.


Copyright © 1985, Regents of the University of California/Computer Graphics Lab.

Figure 8. Enzyme-ligand Complex for Dihydrofolate Reductase and trimethoprim
(L. casei: left, chicken liver: right).
Copyright © 1985, Regents of the University of California/Computer Graphics Lab.


possibly even lower for the iodine compound), it would suggest that the wall of the receptor model
is contacted by the ligand at the bond distance of the chlorine atom and its related van der Waals
radius. Therefore, one could assume that the larger bromine atom represents an intrusion into the
receptor wall.
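
The chlorine and bromine comparison can be expressed as a small Python check. It is a sketch of the rule as stated, not KARMA's implementation. The van der Waals radii are the commonly tabulated Bondi values; the binding affinities are hypothetical numbers chosen only to exercise the rule.

```python
# Van der Waals radii in angstroms (Bondi values).
VDW_RADIUS = {"Cl": 1.75, "Br": 1.85, "I": 1.98}

def steric_hindrance_suspected(sub1, activity1, sub2, activity2):
    """Rule: if X2 is less active than X1 and X2 is larger than X1,
    expect possible steric hindrance with the receptor wall about X2
    (other factors assumed equal).
    """
    less_active = activity2 < activity1
    larger = VDW_RADIUS[sub2] > VDW_RADIUS[sub1]
    return less_active and larger

# Hypothetical log 1/K values for a chloro and a bromo congener.
print(steric_hindrance_suspected("Cl", 7.2, "Br", 6.1))   # bromo is weaker and larger
print(steric_hindrance_suspected("Cl", 7.2, "Br", 7.5))   # bromo binds better: rule does not fire
```

A fuller version would also compare bond lengths, as the text notes, so that the position of the wall is estimated from the bond distance plus the van der Waals radius.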
The above examples used to illustrate the specific rules for backward chaining are similar to
other attempts at receptor mapping. (6) However, these other methods do not account for interac-
tions that may be based on a combination of effects such as hydrophobicity and ligand potency.
For instance, a rule that might apply to a compound with a substituted phenyl ring may take the
form: (19)
If a meta disubstituted compound is symmetrical, and the biological activities differ
between hydrophobic and polar substituents, then expect possible ring rotation to
maximize hydrophobic and polar interactions between the ring substituents and the
hydrophobic and polar surface.
Many rules can be derived from the molecular structures and biological activities, as seen from the
above examples, which add both shape and character to the surface model.

Graphics Interface
KARMA presents the results from the rule system on a three-dimensional graphics workstation.
The bicubic patches of the surface model are displayed graphically and may be manipulated by
the user. The user may also modify the model and return to the control server for another iteration
in the rule system if the results are not satisfactory.
The bicubic patches are characterized with different colors, intensities and line textures to
show attributes such as hydrophobicity and steric properties. Only one attribute may be displayed
at a time, with color and intensity representing the value of the attribute, and line texture
representing KARMA's confidence level in the information. For example, when displaying
hydrophobicity, red patches are hydrophobic space while blue patches are polar space. Patches
drawn with solid lines represent areas which are well explored while patches with short dashes
contain little information. Displaying information using multiple cues allows the user to examine
various aspects of the surface model without having to deal with large amounts of numerical data.
The graphics interface is also the appropriate place to alter the model since it lets the user
look at an overall picture of the model as it is modified. The graphics interface provides
user-friendly tools for this purpose, including a pointing device for selecting the modification
site and a hierarchical menu system to guide the user through the actual process of making
changes. Thus, the user may select a control point on one of the bicubic patches with the pointing
device; pop up a menu of permitted modifications; and select an operation, e.g., move the control
point outwards along the surface normal. After the control point data has been modified, the
graphics interface will recalculate and redraw the bicubic patches of the surface model based on
the new data. After modifying the model to the desired state, the user may simply return to the
control server and initiate the rule system for further refinement.
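
The edit-and-redraw cycle described above can be illustrated with a minimal sketch. It assumes a
Bezier basis for the bicubic patch; the chapter does not say which basis KARMA uses, and the
function names bernstein3, patch_point and move_control_point are hypothetical, not KARMA routines.

```python
# Hypothetical sketch: a bicubic Bezier patch with a 4x4 grid of control points.
# The Bezier basis is an assumption; the text only says "bicubic patches".

def bernstein3(t):
    """Return the four cubic Bernstein basis values at parameter t."""
    s = 1.0 - t
    return (s * s * s, 3 * s * s * t, 3 * s * t * t, t * t * t)

def patch_point(ctrl, u, v):
    """Evaluate the patch at (u, v); ctrl is a 4x4 grid of (x, y, z) tuples."""
    bu, bv = bernstein3(u), bernstein3(v)
    point = [0.0, 0.0, 0.0]
    for i in range(4):
        for j in range(4):
            weight = bu[i] * bv[j]
            for k in range(3):
                point[k] += weight * ctrl[i][j][k]
    return tuple(point)

def move_control_point(ctrl, i, j, normal, step):
    """Return a new grid with control point (i, j) moved `step` along `normal`."""
    new = [list(row) for row in ctrl]
    x, y, z = ctrl[i][j]
    new[i][j] = (x + step * normal[0], y + step * normal[1], z + step * normal[2])
    return new

# A flat patch in the z = 0 plane, then one interior control point pushed outwards.
flat = [[(float(i), float(j), 0.0) for j in range(4)] for i in range(4)]
bumped = move_control_point(flat, 1, 1, (0.0, 0.0, 1.0), 1.0)
```

Re-evaluating patch_point on the modified grid gives the redrawn surface; the corners of the patch
stay fixed because an interior control point has zero weight there.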

Conclusion
Currently, KARMA is in the prototyping phase. Although the hardware is connected via the high
bandwidth network, it is necessary to implement the servers for data communications.
Additionally, a completely new graphics package is in development for KARMA. The next two steps
in terms of development are the turnkey and production versions of KARMA.
Current methods in computer-assisted drug design are most successful if the structure of the
receptor is known. Our goal is to aid the investigator in those situations where the structure of the
receptor may or may not be known. KARMA emphasizes two critical factors. First,
three-dimensional graphics presents the results from the rule-based system in a manageable
format. Second, KARMA provides a means for the user to inject knowledge about the model.
KARMA is designed as a tool to aid the chemist, and the ability to incorporate ideas from the user
is a very important aspect. It is our goal to successfully look at computer-assisted drug design
from a new perspective using KARMA.


Acknowledgments
This work was supported in part by NIH RR-1081, DAAG29-83-G-0080, Evans and Sutherland,
Silicon Graphics and IntelliCorp. We also wish to thank Dennis Miller, I.D. Kuntz, Don Kneller,
Greg Couch, Ken Arnold, and Willa Crowell for help and discussion.

Literature Cited
(1) Morris, W., Ed. In "The American Heritage Dictionary of the English Language";
American Heritage and Houghton Mifflin: New York, 1969; p. 725.
(2) Hayes-Roth, F.; Waterman, D.A.; Lenat, D.B., Eds. "Building Expert Systems";
Addison-Wesley: USA, 1983.
(3) Blaney, J.M.; Jorgensen, E.C.; Connolly, M.L.; Ferrin, T.E.; Langridge, R.; Oatley, S.J.;
Burridge, J.M.; Blake, C.C.F. J. Med. Chem. 1982, 25, 785-790.
(4) Hansch, C.; McClarin, J.; Klein, T.; Langridge, R. Molec. Pharm. 1985, 27, 493-498.
(5) KEE User's Manual. 707 Laurel Street, Menlo Park, California, 94025. KEE is a registered
trademark of IntelliCorp.
(6) Marshall, G.R. "Computer Aided Drug Design". First European Seminar and Exhibition on
Computer-Aided Molecular Design. October 18-19, 1984.
(7) Blaney, J.M.; Hansch, C.; Silipo, C.; Vittoria, A. Chem. Rev. 1984, 84, 333.
(8) We wish to thank Dr. David Weininger and Dr. Albert Leo at the MedChem Project,
Department of Chemistry, Pomona College for providing us with the SMILES software.
(9) Crippen, G.M. "Distance Geometry and Conformational Calculations"; Research Studies:
New York, 1981.
(10) Weiner, P.K.; Kollman, P.A. J. Comp. Chem. 1981, 2, 287-303.
(11) Metcalfe, R.M.; Boggs, D.R. Comm of the ACM 1976, 9. Ethernet is a registered
trademark of Xerox Corporation.
(12) We wish to thank Dr. Gordon Crippen from Texas A & M University and Dr. Jeffrey M.
Blaney from E.I. DuPont de Nemours & Company for providing us with the Distance
Geometry software.
(13) Foley, J.D.; Van Dam, A. "Fundamentals of Interactive Computer Graphics";
Addison-Wesley: 1982; Chap. 13.
(14) Clark, J. "Parametric Curves, Surfaces, and Volumes in Computer Graphics and Computer
Aided Geometric Design," Technical Report No. 221, Computer Systems Laboratory,
Stanford University, 1981.
(15) Hansch, C.; Leo, A. "Substituent Constants for Correlation Analysis in Chemistry and
Biology"; Wiley-Interscience: 1979.
(16) Hansch, C.; Klein, T.; McClarin, J.; Langridge, R.; Cornell, N. J. Med. Chem. (in press).
(17) Hansch, C.; Li, R.; Blaney, J.; Langridge, R. J. Med. Chem. 1982, 25, 777-784.
(18) Selassie, C.; Fang, Z.; Li, R.; Klein, T.; Langridge, R.; Kaufman, B. J. Med. Chem. (in
press).
(19) Smith, R.N.; Hansch, C.; Kim, K.I.; Omiya, B.; Fukumura, G.; Selassie, C.D.; Jow, P.Y.C.;
Blaney, J.M.; Langridge, R. Arch. Biochem. Biophys. 1982, 215, 319-328.

RECEIVED December 17, 1985

14
An Intelligent Sketch Pad as Input
to Molecular Structure Programs

Carl Trindle

Chemistry Department, University of Virginia, Charlottesville, VA 22901

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch014

The programming and manipulation of chemical graphs is awkward in most
familiar programming languages. LISP, the Esperanto of artificial
intelligence research, makes possible a representation of chemical
structural formulas which is much more nearly analogous to the chemist's
view of such graphs. This is a considerable computational advantage as
well as a convenience for the user.
We have designed a "functional fragment" representation of structural
formulas, applicable to any molecule, which will resolve a crude sketch
of a chemical structure into a list of fundamental fragments. Exploiting
the PROPERTY feature of LISP and the distance geometry algorithms of
Crippen we can recover Cartesian coordinates for each atom, suitable for
input to molecular mechanics programs, or to ab initio electronic
structure packages.
Besides local geometries, the intelligent sketchpad can contain any
local properties, including bond types and strengths, chromophore
optical spectra, and nuclear magnetic resonance and infrared spectra
characteristic of a local chemical environment.

Computational chemists have developed several remarkably powerful and
reliable computer codes, capable of describing the relative stability of
various conformations of macromolecules, and details of the electronic
structure of molecules of more modest size (1). The properties of
molecules which can be obtained by use of these programs correlate with
important features of chemical reactivity and the properties of
materials. Molecular design, in pharmaceuticals, photochemistry, and
general materials science can be made much more efficient by the routine
use of these computational systems. However, their use is at present not
widespread; it is limited to a few large chemical companies.

0097-6156/86/0306-0159$06.00/0
© 1986 American Chemical Society


One of the obstacles to wider use of the well-tested and powerful
programs such as Allinger's molecular mechanics (2) and Pople's GAUSS80
(3) is that the programs require such elaborate and awkward input. Users
must ordinarily prepare a list of Cartesian coordinates of each atom.
This is cumbersome for molecules of even moderate size. But more
significantly, chemists' powerful sense of three-dimensional molecular
structure is never expressed in Cartesian coordinates. Instead chemists
think more naturally of "internal coordinates," that is bond lengths,
primary valence angles, and local dihedral angles. Of course a full set
of internal coordinates defines in principle the set of Cartesian
coordinates (4). Unfortunately, the usual algorithms for generating
Cartesian coordinates from internal coordinates are sensitive to small
errors. These errors accumulate and can perpetrate enormities such as
leaving rings unclosed, or forcing unrealistically short separations
between nonbonded atoms. In the chemist's conceptual picture, realistic
bond distances for rings are maintained, even if distortions in normal
valence angles are required.
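
The ring-closure problem can be made concrete with a small sketch. It is an illustration only,
assuming a planar six-membered ring built atom by atom from one bond length and a list of
turning angles; the function names chain_from_internal and closure_gap are hypothetical and do
not come from any of the programs cited.

```python
import math

def chain_from_internal(bond, turn_angles_deg):
    """Place atoms in a plane from one bond length and successive turning angles.

    The turning angle is 180 degrees minus the interior valence angle.
    """
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for turn in turn_angles_deg:
        x += bond * math.cos(heading)
        y += bond * math.sin(heading)
        points.append((x, y))
        heading += math.radians(turn)
    return points

def closure_gap(points):
    """Distance between the first and last atom; zero for a closed ring."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return math.hypot(x1 - x0, y1 - y0)

exact = chain_from_internal(1.40, [60.0] * 6)    # interior angles of 120 degrees
sloppy = chain_from_internal(1.40, [58.0] * 6)   # each interior angle off by 2 degrees

print(closure_gap(exact), closure_gap(sloppy))
```

With exact angles the last atom lands back on the first. With each angle off by only two degrees
the ring fails to close by roughly 0.3 A, which is a sizable fraction of a bond length.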
The problem is to transform the pictorial view of molecules which is the
daily companion of the chemist, to the numerical form required by
programs, WITHOUT FORCING THE USER TO EFFECT THE TRANSLATION. We must
not ask the chemist to do much more than identify the atoms, their
connectivity, and some gross features of the stereochemistry. The
structural formula is the medium by which such simple yet richly
evocative information is conveyed. The structural formula does after all
suffice for the chemist's work day to day. It should be adequate to
convey the essential information to useful computer programs.
There will be two major stages to the translation of information from
the chemist's pictorial image to the rigidly formatted input file
required by molecular mechanics or molecular orbital programs. First the
sketch is impressed on a digitizing tablet (perhaps as simple as a Koala
Pad (R), or a more accurate digitizing tablet). Then the graph must be
interpreted and a trial geometry generated.

Accepting the Sketch. The (computationally) most convenient way to enter
a structural diagram is to use a digitizing tablet with a mouse or
stylus. Our experience has been with the Houston Instruments HIPAD (R).
The software accompanying this (and most ordinary) digitizing tablet
accepts and stores local coordinates of particular points, and a set of
pointers designating which vertices are to be connected (5). In this way
the molecular topology can be specified with no novel analysis or
programming.
It would be more interesting from the point of view of Artificial
Intelligence research to interpret a sketch already on paper, by the
analysis of dark and light elements (6). We have made only small
progress in this task, but some preliminary remarks can make the
difficulties clear. The field of view is resolved into picture elements,
and an optical scanner would assign a numerical value corresponding to
the darkness of the sketch at that location. Heavy lines would be easy
to recognize, by the sequence of adjacent dark spots detected by the
scanner. Intersections might be harder to recognize if the grid is
coarse, but knowledge of the existence of lines could guide the search,
by estimates of the intersections by extrapolation. A planar graph (with
no crossing lines) would seem to present few difficulties. Vertices
representing generalized atoms (that is, "Me" in place of a fully
detailed methyl group) would have to be more carefully specified. The
chemist would use Berzelius-notation capital letters for labels, which
would have to be interpreted. This is a hard task, as the post office
has learned. It would be essential for the system to realize when
identification of a vertex is impossible or ambiguous, and request
guidance from the user. Figure 1 shows how vertices are specified.
It will be necessary to distinguish the strokes which identify single or
multiple bonds from the strokes denoting lone pairs, and it will be
required to supply missing hydrogens and lone pairs which are often
omitted from casual sketches. This latter problem will also be
encountered if the sketch is input directly by the digitizing tablet. We
return to that line of approach.

Preliminary Processing of the Sketch. Even at this early stage, before
different atoms are distinguished and hydrogens are fully expressed, we
have much of the information needed for some kinds of analysis. All of
the graph-theoretic analysis of pi systems (7), which may be considered
to be based on the Hückel model, uses no more than the connectivity
between equivalent centers. However powerful the graph theory has been,
it cannot be denied that it suppresses much of the detail expressed in
the structural diagram. Therefore we will not be content to stop at this
stage.
It will be necessary at minimum to define the type of atom present at
each vertex. We reduce the labor necessary for this specification by (a)
suppressing hydrogens in the preliminary sketch; and (b) assuming as a
default that each vertex represents a carbon atom, requiring an
amendment only for other heavy atoms. Our software is responsible for
filling in hydrogens. This process is frequently ambiguous, given only
the skeleton of heavy atoms. Therefore the computer system will
sometimes interrogate the user for the number of hydrogen atoms at each
vertex. With this information the task of completing a Lewis structure
is left to the software, which is at least as capable of this task as
the average first-year student. This is the first task that requires
anything resembling Artificial Intelligence, so a few remarks on the
design may not be out of place.
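
The carbon default and the ambiguity of the hydrogen count can be shown with a short sketch. It
is not the author's LISP code: the table TYPICAL_VALENCE and the functions label_vertices and
guess_hydrogens are hypothetical, and the guess assumes that every sketched link is a single bond.

```python
# Hypothetical, deliberately tiny table of typical valences.
TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2}

def label_vertices(count, non_carbon):
    """Every vertex is carbon unless the user has amended it."""
    return {v: non_carbon.get(v, "C") for v in range(1, count + 1)}

def guess_hydrogens(elements, neighbors):
    """Guess hydrogen counts, assuming every sketched link is a single bond."""
    return {v: max(TYPICAL_VALENCE[symbol] - len(neighbors[v]), 0)
            for v, symbol in elements.items()}

# Heavy-atom skeleton C1-C2-O3; only vertex 3 needs an amendment.
elements = label_vertices(3, {3: "O"})
neighbors = {1: [2], 2: [1, 3], 3: [2]}
guess = guess_hydrogens(elements, neighbors)
```

For this skeleton the guess corresponds to ethanol (3, 2 and 1 hydrogens). Acetaldehyde has the
same heavy-atom skeleton with 3, 1 and 0 hydrogens, so the skeleton alone cannot decide between
them and the user must be asked.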

A Routine to Assign Lewis Structures. The procedure for assigning Lewis
structures is familiar (8). Given the set of atoms, one must sum the
valence electrons. In our LISP system, each ATOM can be assigned
PROPERTIES which may include the number of valence electrons it
contributes to the molecule, and equally important, its set of NEIGHBORS
by which the skeleton of the molecule is specified. Each such link is
assigned a pair of valence electrons, and a census is kept of electron
pairs in the vicinity of each atom. Among the PROPERTIES of each atom is
an estimate of its electronegativity, and the program assigns electron
pairs to fill octets using the electronegativity to set priority. The
last step is most "difficult." For each of those atoms which lack a full
octet, the system must look among the NEIGHBORS for atom(s) possessing a
lone pair which it might share. Of all those potential donors, one
chooses the atom with the most negative formal charge. The multiple bond
is represented by the appearance of the donor several times in the
(revised) NEIGHBOR list. When the Lewis structure routine finds an
ambiguity which we would represent by a set of resonance structures, it
reports that fact and chooses the first legal structure for further
processing. Figure 2 shows the procedure in practice.

A Dialogue Accompanying the Entry
of a Molecule of Moderate Complexity

SPECIFY NON-CARBON VERTICES:

NUMBER: 1 TYPE: nitrogen
NUMBER: 8 TYPE: oxygen
NUMBER: 10 TYPE: oxygen
NUMBER: 12 TYPE: oxygen
NUMBER: 0

SPECIFY NET CHARGE: +1

HYDROGENS AT VERTEX
HYDROGENS AT VERTEX
HYDROGENS AT VERTEX
HYDROGENS AT VERTEX
HYDROGENS AT VERTEX
HYDROGENS AT VERTEX
HYDROGENS AT VERTEX 8
HYDROGENS AT VERTEX 9
HYDROGENS AT VERTEX 11 :
HYDROGENS AT VERTEX 12
HYDROGENS AT VERTEX 13

Figure 1. All vertices are first assumed to be CARBON. The system
requests that the user specify non-CARBON vertices; it will build a set
of user's abbreviations.
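
The steps just listed can be sketched in a few lines. This is a simplified illustration in
Python, not the LISP routine of the text: hydrogens are carried as counts rather than as atoms,
only C, N and O are tabulated, the electronegativities are rounded Pauling-scale values, and the
name lewis_structure and its arguments are hypothetical.

```python
VALENCE_ELECTRONS = {"C": 4, "N": 5, "O": 6}
ELECTRONEGATIVITY = {"C": 2.5, "N": 3.0, "O": 3.5}   # rounded Pauling-scale values

def lewis_structure(elements, neighbors, hydrogens, charge=0):
    """Return (revised neighbor lists, lone pairs, formal charges)."""
    nbrs = {v: list(ns) for v, ns in neighbors.items()}
    lone = {v: 0 for v in elements}

    def pairs_around(v):
        return len(nbrs[v]) + hydrogens[v] + lone[v]

    def formal_charge(v):
        return (VALENCE_ELECTRONS[elements[v]] - 2 * lone[v]
                - len(nbrs[v]) - hydrogens[v])

    # Sum the valence electrons and give one pair to every link.
    electrons = sum(VALENCE_ELECTRONS[e] for e in elements.values())
    electrons += sum(hydrogens.values()) - charge
    links = sum(len(ns) for ns in nbrs.values()) // 2
    spare = electrons // 2 - links - sum(hydrogens.values())

    # Fill octets with lone pairs, most electronegative atoms first.
    for v in sorted(elements, key=lambda a: -ELECTRONEGATIVITY[elements[a]]):
        while spare > 0 and pairs_around(v) < 4:
            lone[v] += 1
            spare -= 1

    # An atom still short of an octet borrows a lone pair from the neighbor
    # with the most negative formal charge; the donor is repeated in the list.
    for v in elements:
        while pairs_around(v) < 4:
            donors = [n for n in sorted(set(nbrs[v])) if lone[n] > 0]
            if not donors:
                break
            donor = min(donors, key=formal_charge)
            lone[donor] -= 1
            nbrs[v].append(donor)
            nbrs[donor].append(v)

    return nbrs, lone, {v: formal_charge(v) for v in elements}

# Formaldehyde: vertex 1 is carbon with two hydrogens, vertex 2 is oxygen.
bonds, lone_pairs, charges = lewis_structure(
    {1: "C", 2: "O"}, {1: [2], 2: [1]}, {1: 2, 2: 0})
```

For formaldehyde the oxygen appears twice in the carbon's revised neighbor list, which is how
the double bond is recorded. Ties between equivalent donors are broken by label order, which
corresponds to choosing the first legal resonance structure.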

Representation of the Molecule in LISP. We have used the chemist's
sketch, or its Lewis structure equivalent, as the model of a data
structure in LISP (9). This language has the flexibility needed to
express an essentially non-numerical object, in terms of lists. LISP
will permit us to organize molecular structure information in a way that
mimics the human expert's knowledge. To accomplish this representation,
we must develop a clear idea how the chemist assimilates the information
provided directly and explicitly by the sketch, and how the properties
of the molecule are recalled to the chemist's awareness.
The structural formula at minimum identifies the atoms and their
connectivity. This hardly seems to be adequate in complexity to express
much molecular information. This apparent paradox is resolved when we
recognize that the chemist brings much of his experience to the task of
interpreting the sketch, and much of the information is evoked rather
than transmitted by means of the structural formula. The atoms' names
(carbon, nitrogen) call up a flood of associations which (although they
are almost never written explicitly in the chemist's sketch) are
nonetheless part of the information it can summon. Among this data are
the atomic mass, typical valencies, local geometry, perhaps a van der
Waals radius, and a guide to chemical behavior, its "electronegativity."
The connectivity can define some aspects of the geometry in a useful
semiquantitative way. The chemist has a very reliable idea of the range
of bond lengths; CC(single), 1.54 Å; CC(double), 1.33 Å, etc. By
counting connections and recognizing the atoms being connected, one can
assign good estimates of the distances between directly bonded atoms.
The chemist's knowledge of molecular geometry extends beyond typical
values of bond distances. He will also be able to predict many bond
angles fairly accurately. This is equivalent to specifying a 1-3
nonbonded interatomic distance. The chemist's sketch portrays cis and
trans isomerization, syn and anti, and gauche conformations which
specify either torsion angles, or indirectly, a 1-4 nonbonded distance.
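
The equivalence between angles and nonbonded distances is simple trigonometry, sketched below
for illustration. The function names distance_13 and distance_14 are hypothetical; the torsion
convention assumed here is 0 degrees for cis (eclipsed) and 180 degrees for anti, and the
109.5 degree angle is the ideal tetrahedral value rather than a figure from the text.

```python
import math

def distance_13(d12, d23, angle_deg):
    """1-3 distance from two bond lengths and the valence angle (law of cosines)."""
    t = math.radians(angle_deg)
    return math.sqrt(d12 * d12 + d23 * d23 - 2.0 * d12 * d23 * math.cos(t))

def distance_14(d12, d23, d34, angle_123_deg, angle_234_deg, torsion_deg):
    """1-4 distance from three bond lengths, two valence angles and a torsion."""
    t1, t2, phi = (math.radians(x)
                   for x in (angle_123_deg, angle_234_deg, torsion_deg))
    # Atom 2 sits at the origin and atom 3 on the x axis at distance d23.
    atom1 = (d12 * math.cos(t1), d12 * math.sin(t1), 0.0)
    atom4 = (d23 - d34 * math.cos(t2),
             d34 * math.sin(t2) * math.cos(phi),
             d34 * math.sin(t2) * math.sin(phi))
    return math.dist(atom1, atom4)

cc = 1.54                                            # CC(single) from the text
d13 = distance_13(cc, cc, 109.5)
anti = distance_14(cc, cc, cc, 109.5, 109.5, 180.0)
gauche = distance_14(cc, cc, cc, 109.5, 109.5, 60.0)
cis = distance_14(cc, cc, cc, 109.5, 109.5, 0.0)
```

For a saturated carbon chain this gives a 1-3 distance near 2.5 A, and a 1-4 distance that
shrinks from anti through gauche to cis, so a conformational label on the sketch translates
directly into a distance estimate.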
Besides primary bond distances and angles, and some special cases of
torsional and dihedral angles, the chemist knows more global features of
molecular geometry. However, such knowledge becomes more and more
fragmentary; the longest distances in a molecule are most poorly
defined.

A LISP Structural Recognizer. A molecule is represented in our LISP
program first as a list of atoms. A numbering scheme assigns a unique
label to each atom. Each atom has a collection of PROPERTIES; foremost
among them is its generic NAME. The name CARBON carries with it a van
der Waals RADIUS and a VALENCE. Other properties can be added as
desired.
The major feature of a molecular sketch is the topology or connectivity
of the molecule. This is expressed as the property NEIGHBOR for each
atom. This is just the set of labels of other atoms connected to a
particular atom. The NEIGHBOR property is a compact way to store the
adjacency matrix used in graph theory.
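
A rough analogue of this representation, using Python dictionaries in place of LISP property
lists, is sketched below. The layout and the helper names get_property and adjacency_matrix are
assumptions made for illustration; the radii are Bondi van der Waals values, not numbers from
the chapter.

```python
# Properties carried by the generic NAME (illustrative values only).
ELEMENT_PROPERTIES = {
    "CARBON": {"RADIUS": 1.70, "VALENCE": 4},
    "OXYGEN": {"RADIUS": 1.52, "VALENCE": 2},
}

# A three-atom skeleton C1-C2-O3; each label maps to that atom's properties.
molecule = {
    1: {"NAME": "CARBON", "NEIGHBOR": [2]},
    2: {"NAME": "CARBON", "NEIGHBOR": [1, 3]},
    3: {"NAME": "OXYGEN", "NEIGHBOR": [2]},
}

def get_property(molecule, label, key):
    """Look on the atom first, then on the properties implied by its NAME."""
    atom = molecule[label]
    if key in atom:
        return atom[key]
    return ELEMENT_PROPERTIES[atom["NAME"]][key]

def adjacency_matrix(molecule):
    """Expand the NEIGHBOR lists into a full matrix; repeats count as bond order."""
    labels = sorted(molecule)
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for label in labels:
        for other in molecule[label]["NEIGHBOR"]:
            matrix[index[label]][index[other]] += 1
    return matrix
```

The neighbor lists hold one entry per bond, while the matrix holds an entry for every pair of
atoms, most of them zero. That is the sense in which the NEIGHBOR property is the more compact
form.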
The chemist's sketch, processed into the list representation just
described, is not yet very valuable; the system at the moment is very
ignorant of the structure of the molecule in question. But the chemist
knows much of the molecule from little more than the diagram. How does
the chemist "see" a complex molecular diagram? In our judgement a
chemist knows so much about a molecule because he recognizes recurrent
fragments of moderate size. Rings of varying atomic composition,
structure, and size ranging from carbonyl groups to steroid systems, are
recognized at a glance. Many stereochemically well-defined fragments,
such as spiro and norbornyl systems, are part of the chemist's
conceptual toolkit. Our programming task is to assure that our system
recognizes such fragments, with all the associated information on their
structure and properties, with ease.
Somehow we must discern the presence of meaningful, familiar fragments
in the molecular list. We mimic this stock of informative portions of
molecules in our LISP system by lists called FRAGMENTS. The FRAGMENTS,
permanent members of a growing data base, each contain a set of ATOMS
and a NEIGHBOR list for each atom identifying the connectivity. Besides
this topological information, the fragments contain as PROPERTIES a
stock of attributes of the fragments. The first collection of PROPERTIES
we gathered were interatomic distances gleaned from crystal structures.
All interatomic distances are defined within a fragment. The system can
now assign many (though not all) interatomic distances in an arbitrary
molecule if fragments can be discerned within the sketch.
We have developed a search technique which will scan the MOLECULE and
locate all fragments. Design of this recognition algorithm is difficult.
The search routine shares some of the features of the "knapsack
problem," a classic difficulty in computer science. We expect that we
will be able to speed this step considerably. At present we scan all
stored fragments, though that is not the way an expert would proceed. We
screen out many fragments by a superficial test that the atoms in the
fragment must be a subset of the atoms in the molecule. The fragments
are subjected to more and more thorough tests, until recognition is
complete. These tests are essentially recursive applications of the
requirement that if a fragment is to be identified in a molecule, the
environment of each atom in the fragment must be found in the molecule
for the corresponding atom. More detail on the search condition may be
found in a previous article (10). Figure 3 shows a typical fragment
representation. In this first formulation we have already established
that it is most effective to scan the largest candidate fragments first.
It is desirable to recognize overlapping fragments; more distances are
determined. However, it is inevitably the case that a substantial number
of distances will be left undefined, particularly the longest distances
which would not be incorporated into a fragment.
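
The screening test and the recursive environment test can be sketched as a small backtracking
search. This is a generic subgraph match written for illustration, not the algorithm of the
previous article (10): it ignores bond order, returns only the first match, and the names
prescreen, find_fragment and recognize are hypothetical.

```python
from collections import Counter

def prescreen(fragment, molecule):
    """Superficial test: the fragment's atoms must all be available in the molecule."""
    need = Counter(atom["NAME"] for atom in fragment.values())
    have = Counter(atom["NAME"] for atom in molecule.values())
    return all(have[name] >= n for name, n in need.items())

def find_fragment(fragment, molecule):
    """Return one mapping of fragment labels onto molecule labels, or None."""
    if not prescreen(fragment, molecule):
        return None
    order = list(fragment)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        f = order[len(mapping)]
        for m in molecule:
            if m in mapping.values() or molecule[m]["NAME"] != fragment[f]["NAME"]:
                continue
            # Every fragment neighbor placed so far must also be a neighbor here.
            if all(mapping[g] in molecule[m]["NEIGHBOR"]
                   for g in fragment[f]["NEIGHBOR"] if g in mapping):
                mapping[f] = m
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[f]
        return None

    return extend({})

def recognize(fragments, molecule):
    """Try the largest fragments first; return the names of those found."""
    by_size = sorted(fragments, key=lambda k: len(fragments[k]), reverse=True)
    return [name for name in by_size
            if find_fragment(fragments[name], molecule) is not None]

molecule = {1: {"NAME": "CARBON", "NEIGHBOR": [2]},
            2: {"NAME": "CARBON", "NEIGHBOR": [1, 3]},
            3: {"NAME": "OXYGEN", "NEIGHBOR": [2]}}
carbonyl = {"a": {"NAME": "CARBON", "NEIGHBOR": ["b"]},
            "b": {"NAME": "OXYGEN", "NEIGHBOR": ["a"]}}
```

Here find_fragment(carbonyl, molecule) first tries carbon 1, fails to find an oxygen neighbor,
backtracks, and succeeds with carbon 2 and oxygen 3. A fragment containing nitrogen would be
rejected by the prescreen without any search.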

Distance Geometry Changes Distances to Cartesian Coordinates. Most
experimental measures of molecular geometry provide quantities which may
be most directly interpreted as defining interatomic distances.

Assignment of a Lewis Structure

FORMULA: HI Π Ν 03 (+)

COMPUTED VALENCE ELECTRONS: Si

23 PAIRS ASSIGNED TO LINKS

VERTEX 3 ASSIGNED 3 PAIR(S)
VERTEX 10 ASSIGNED 2 PAIR(S)
VERTEX 12 ASSIGNED 3 PAIR(S)
VERTEX 1 ASSIGNED 1 PAIR(S)
VERTEX 2 ASSIGNED 1 PAIR(S)
VERTEX 3 ASSIGNED 1 PAIR(S)

VERTEX 4, 5, 6, 7, 11 UNSATISFIED
SHARING BETWEEN VERTICES 4 AND 3
VERTEX 5 UNSATISFIED
SHARING BETWEEN VERTICES 5 AND 4
VERTEX 3 UNSATISFIED
SHARING BETWEEN VERTICES 3 AND 2
SHARING BETWEEN VERTICES 6 AND 1
SHARING BETWEEN VERTICES 8 AND 7
SHARING BETWEEN VERTICES 12 AND 11
NONZERO FORMAL CHARGES:
VERTEX 1: +1

Figure 2. The Lewis structure routine will draw on the PROPERTY VALENCY,
which is the number of electrons each vertex contributes to the Lewis
structure. It assigns a pair of electrons to each LINK, and satisfies
the octet requirement. In case of resonance, it will choose one of the
set of equivalent structures arbitrarily.

Recognition of a Set of Known Fragments in a Molecule

SIX MEMBERED RING RECOGNIZED (BENZENOID)
BENZENE DISTANCES ASSUMED
REVISED DISTANCE 1-2
REVISED DISTANCE 1-6
REVISED DISTANCE 1-10

ACETYL GROUP RECOGNIZED

ACETYL GROUP RECOGNIZED

N-O-C SATURATED LINK RECOGNIZED

C-C SATURATED LINK RECOGNIZED

Figure 3. A typical fragment decomposition for a molecule of moderate
complexity. Roughly half of the interatomic distances can be specified
in this case by the fragment data. The remaining distances are estimated
by the distance geometry algorithm of Crippen (11).


(In fact, usually the interpretation requires the presumption of a rigid
framework so that interatomic distances are persistent.) There are
N(N-1)/2 distinct distances in a cluster of N atoms, disregarding
symmetry-dictated equivalencies. This set of distances is of course
redundant; 3N-6 Cartesian coordinates are sufficient to determine
molecular geometry, apart from the position of the center of mass and
the orientation of the principal moments of inertia. The larger the
system, the more redundant is the full set of distances.
Of course it is almost never the case that we have anything resembling a
full set of interatomic distances from experimental data. Crippen has
shown how one may pass not only from Cartesian coordinates to
interatomic distances, but from distances to Cartesian coordinates (11).
More significant, he has shown that an incomplete set of interatomic
distances, together with even very crude estimates of unmeasured
distances, can produce helpful estimates of Cartesian coordinates. The
estimates of missing distances can be provided by "triangle conditions"
which express that a 1-3 distance must be in the range from the
(absolute value of the) difference of the 1-2 and 2-3 distances to the
sum of the 1-2 and 2-3 distances. By a factor analysis of the matrix of
vector dot products one obtains the best three-dimensional "imbedding"
of the geometry.
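
Both ideas, the triangle limits and the embedding from the matrix of dot products, can be
sketched in plain Python. This is an illustration under stated assumptions and not Crippen's
program: the distance matrix is taken to be complete, the "factor analysis" is done by simple
power iteration with deflation, and the names triangle_limits, metric_matrix and embed are
hypothetical.

```python
import math

def triangle_limits(d12, d23):
    """Lower and upper limits on the 1-3 distance implied by 1-2 and 2-3."""
    return abs(d12 - d23), d12 + d23

def metric_matrix(dist):
    """Dot products of the atom position vectors measured from the centroid."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]

def embed(dist, dims=3, sweeps=500):
    """Coordinates from the `dims` largest eigenpairs of the metric matrix."""
    n = len(dist)
    g = metric_matrix(dist)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        vec = [math.sin(1.0 + i * (k + 1)) for i in range(n)]   # arbitrary start
        value = 0.0
        for _ in range(sweeps):                                  # power iteration
            new = [sum(g[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in new))
            if norm < 1e-12:
                value = 0.0
                break
            vec, value = [x / norm for x in new], norm
        for i in range(n):
            coords[i][k] = math.sqrt(value) * vec[i]
        for i in range(n):                                       # deflate
            for j in range(n):
                g[i][j] -= value * vec[i] * vec[j]
    return coords

# Four atoms with known positions, used only to manufacture a distance matrix.
atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0)]
dist = [[math.dist(p, q) for q in atoms] for p in atoms]
xyz = embed(dist)
```

The recovered coordinates differ from the originals by a rotation, reflection and shift of
origin, but they reproduce every interatomic distance. When some distances are only crude
estimates, the same procedure returns the three-dimensional structure that fits them best in
the least-squares sense of the retained eigenvectors.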
By the methods of Crippen we can use our well-known distances within
identified fragments, with crude estimates of distances between atoms in
disjoint fragments, to estimate the geometry of the entire molecule. The
factor analysis produces "statistically best" estimates of every
distance. Of course we cannot evaluate the quality of the estimates of
the missing longer distances. But the statistically best estimates of
the shortest distances (influenced indirectly as they are by the poorly
known longer distances) depart substantially from the known fragment
distances. One can improve the overall estimates by replacing the
estimates of the well-known distances by accurate values and iterating
the factor analysis.
The structure produced by distance geometry is not necessarily the
optimum energy form. It is merely a legal three-dimensional structure
reproducing the short-range structure. Of course if some of the longer
distances are known, further constraints are possible. In our
experience, however, short-range fragment properties determine much of
the global form even of rather large systems. This is particularly
striking in (say, carborane) clusters and (even very large) rings, both
of which are inconvenient to describe by other methods.

Extensions of the Functional Fragment Data Structure. In principle,
any molecular property which may be represented as a sum of
contributions from fragments is natural to incorporate into the functional
fragment representation. Maksic has recently reviewed such group
additivity relations, concentrating on atoms as the fundamental
fragment (12). Magnetic and electric properties are remarkably
well represented by such methods, if a suitable hybridized
electronic state is chosen for the atom in the molecular environment.
Such atomic additivity relations are the simplest form of a fragment
additivity scheme for representation of molecular properties. If
we choose slightly larger fragments, molecular properties can be
better represented. Ascending the scale, we can adapt the
functional fragment data structure to help us perform bond energy
calculations (13). Benson shows us how to estimate thermodynamic
properties given values for fragments (14). Often it is possible
to estimate spectra by summing chromophore properties, so long as
the absorbing centers are only weakly coupled (15). The same
statement applies to chemical reactivity, so long as the functional
groups interact weakly (16).
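
A fragment-additivity estimate of this kind reduces to a weighted sum, as the following Python sketch suggests. The fragment labels and the contribution values are placeholders in arbitrary units, invented for illustration; they are not measured group values.

```python
from collections import Counter


def additive_property(fragments, contributions):
    """Sum per-fragment contributions over a list of fragment labels."""
    counts = Counter(fragments)
    return sum(contributions[name] * n for name, n in counts.items())


# Placeholder contributions in arbitrary units; NOT measured values.
contributions = {"CH3": 1.0, "CH2": 0.5, "OH": 2.0}

# 1-Butanol written as a list of hypothetical fragment labels.
butanol_fragments = ["CH3", "CH2", "CH2", "CH2", "OH"]
estimate = additive_property(butanol_fragments, contributions)
```

Storing the contributions with each fragment record would let the same lookup serve any additive property.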

Interacting-Fragments Modeling Schemes may be Incorporated. It is
not required that fragments be nearly independent parts of a
molecule, and the molecular property be considered a simple sum of
fragment properties. Consider for example the possibility of
incorporating the quantitative perturbation-molecular-orbital method
of describing the electronic distribution in molecules, which begins
with MOs for fragments. The perturbation theory provides a
systematic way to account for fragment interactions, and reproduces a
wide variety of interesting electronic behavior at little
computational cost (17). This effort still lies before us.

Summary. The everyday reasoning of the chemist is primarily
pictorial and qualitative; it is analogic. The chemist can make
astounding predictions of the chemical, thermodynamic, and
spectroscopic properties of a substance given only an image, the structural
formula. This process rests heavily on knowledge of the behavior
of similar systems. We have devised a stratagem whereby important
molecular structure programs can be supplied the Cartesian
coordinates they require, without forcing the chemist to provide much
more than the structural diagram, which is a more natural language.
The system interprets a sketch impressed on a digitizing tablet,
and scans the structure for familiar fragments. Stored properties
of each known fragment include intra-fragment interatomic distances.
From these known distances, a legal three-dimensional structure can
be constructed by the methods of Crippen, and supplied in the form
of Cartesian coordinates to molecular structure programs.

Literature Cited

1. Quantum Chemistry Program Exchange Catalog, Chemistry
Department, Indiana University, Bloomington, IN 47001.
2. Burkert, U.; Allinger, N. L. "Molecular Mechanics"; ACS
Monographs: Washington, D.C., 1982.
3. Clark, T. "A Handbook of Computational Chemistry"; Wiley:
New York, 1985.
4. Fitts, D. D. "Vector Analysis in Chemistry"; McGraw-Hill
Book Co.: New York, 1974.
5. HIPAD software copyright by Houston Instruments, Inc.
6. Cohen, P. R.; Feigenbaum, E. A. "The Handbook of Artificial
Intelligence"; W. Kaufmann, Inc.: Los Altos, CA, 1982;
Vol. III, p. 125. Raphael, B. "The Thinking Computer: Mind
Inside Matter"; W. H. Freeman: San Francisco, 1976.
7. Graovac, A.; Gutman, I.; Trinajstic, N. "Topological
Approach to the Chemistry of Conjugated Molecules"; Springer:
New York, 1977.
8. Lewis, G. N. J. Am. Chem. Soc. 1916, 38, 762.
9. Winston, P.; Horn, B. "LISP"; Addison-Wesley Publ. Co.:
Reading, MA, 1981. Johnson, C. S. J. Chem. Inf. Comput. Sci.
1983, 23, 151.
10. Trindle, C.; Givan, R. In "Chemical Applications of Graph
Theory and Topology"; King, R. B., Ed.; Elsevier: New York,
1983. Trindle, C. Croatica Chem. Acta 1984, 57, 1231.
11. Crippen, G. M. "Distance Geometry and Conformational
Calculations"; Chemometric Research Studies Press of Wiley
Publ. Co.: New York, 1981.
12. Maksic, Z. B.; Eckert-Maksic, M.; Rupnik, K. Croatica
Chem. Acta 1984, 57, 1295.
13. Benson, S. W., et al. Chem. Rev. 1969, 69, 279; Int. J. Chem.
Kinet. 1974, 6, 813.
14. Benson, S. W. "The Foundations of Chemical Kinetics";
McGraw-Hill: New York, 1960; p. 665.
15. NMR and vibrational spectra of organic molecules are well
described by group-additivity ideas; optical spectra require
corrections to the spectra of chromophores. Cf. discussion
of spectra by Gordon, A. J. and Ford, R. A., "The Chemist's
Companion: A Handbook of Practical Data, Techniques and
References"; Wiley-Interscience: New York, 1972.
16. Almost every elementary textbook of organic chemistry provides
a systematic description of properties of functional groups
and their characteristic reactivity; for example,
Fessenden, R. J. and Fessenden, J. S. "Organic Chemistry";
Willard Grant Press: Boston, 1979.
17. Dewar, M. J. S. "The Molecular Orbital Theory of Organic
Chemistry"; McGraw-Hill: New York, 1969. Albright, T. A.;
Burdett, J. K.; Whangbo, M. H. "Orbital Interactions in
Chemistry"; Wiley-Interscience: New York, 1985.

RECEIVED December 17, 1985
15

The Similarity of Graphs and Molecules

Steven H. Bertz¹ and William C. Herndon²
¹AT&T Bell Laboratories, Murray Hill, NJ 07974
²University of Texas at El Paso, El Paso, TX 79968-0509

A new definition of molecular similarity is presented, based upon the
similarity of the corresponding molecular graphs. First, all of the
subgraphs of the molecular graph are listed, and then various similarity
indices are derived from the numbers of subgraphs. One of these
compares favorably with the standard distance measures of sequence
comparison. Measurement of similarity provides a new way to measure
molecular complexity, as long as the most (or least) complex member of
a set of molecules can be identified.

The concept of the similarity of molecules has important ramifications for physical,
chemical, and biological systems. Grunwald (1) has recently pointed out the
constraints of molecular similarity on linear free energy relations and observed that
"Their accuracy depends upon the quality of the molecular similarity." The use of
quantitative structure-activity relationships (2-6) is based on the assumption that
similar molecules have similar properties. Herein we present a general and rigorous
definition of molecular structural similarity. Previous research in this field has usually
been concerned with sequence comparisons of macromolecules, primarily proteins and
nucleic acids (7-9). In addition, there have appeared a number of ad hoc definitions of
molecular similarity (10-15), many of which are subsumed in the present work.
Difficulties associated with attempting to obtain precise numerical indices for
qualitative molecular structural concepts have already been extensively discussed in the
literature and will not be reviewed here.

Results and Discussion

We begin with the way chemists perceive similarity between two molecules. This
process involves, consciously or unconsciously, comparing several types of structural
features present in the molecules. For example, considering the five aliphatic alcohols
(represented by their H-suppressed molecular graphs) in Figure 1, we note both
similarities and differences: they are all four-carbon alcohols; a, b, c and d are acyclic,
whereas e has a ring; a and b are primary alcohols, c and e are secondary alcohols and
d is a tertiary alcohol; b and c have the same skeleton, but for the labeling of points
(atoms), while the other skeletons are distinct; etc.

0097-6156/86/0306-0169$06.00/0
© 1986 American Chemical Society


The first step in quantifying the concept of similarity is to list all subgraphs of
the given molecular graphs, e.g. a-e, which has been done in the first column of
Table I. The subgraphs include the vertices (atoms), all connected subgraphs, and the
full molecular graphs themselves, since it can be seen that the molecular graphs for a
and c are both subgraphs of e. Next, the number of each subgraph contained in the
molecular graphs must be counted. Row 1 lists the number of C atoms, row 2 the
number of O atoms, row 3 the number of C-C bonds, row 4 the number of C-O bonds,
etc. Gordon and Kennedy (16) defined $N_{ij}$ as the number of subgraphs of graph j
isomorphic with graph i, and more colloquially as "the number of distinct ways in
which skeleton i can be cut out of skeleton j." The entries in Table I are the number
of ways the subgraphs can be cut out of the molecular graphs (the number of
subgraphs of the molecular graphs isomorphic with the subgraphs in the first column).
In terms of the numbers of C or O atoms, a-e are equally complex. In terms of
C-C bonds (ethane subgraphs) a-d are 3/4 as complex as e; however, in terms of
propane subgraphs (row 5) a and c are 1/2 as complex as e. A simple algorithm that
takes account of all the subgraphs involves comparison of two columns at a time,
examining them row by row and dividing the smaller of the numbers by the larger. A
similarity index (SI) can then be calculated by taking the average of the quotients. Of
course, for two identical molecular graphs, SI = 1. Inclusion of the molecular graphs in
the list of subgraphs ensures that two different molecules which have the same number
of each proper subgraph will not have SI = 1. The values of SI(1) for a-e are
summarized in the form of a similarity matrix SM(1) in Figure 2.
A simpler similarity index can be calculated by dividing the sum of the lesser of
the two numbers in each row by the sum of the greater. (Only two columns of Table I
are considered at a time, of course.) The values of SI(2) for a-e are summarized in
SM(2), also in Figure 2. According to both SI(1) and SI(2), 1-butanol (a) and
2-butanol (c) are the most similar, whereas t-butanol (d) and cyclobutanol (e) are the
least similar pair. In between these extremes there are a significant number of
disagreements between these indices. For example, based on SI(1), c and e are more
similar than c and d; however, c and d are more similar than c and e based on SI(2).
There are seven such pairs (out of 45 possible pairs), and each index has one
"degeneracy". By considering standard measures of "distance," SI(2) would appear to
be the superior index (vide infra).
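
Both indices can be sketched in a few lines of Python. The column data below are the counts of Table I; the function names are hypothetical. One convention is an assumption: rows in which both graphs have a count of zero are left out of the SI(1) average, since that choice reproduces the tabulated values.

```python
# Columns of Table I: counts of each of the 16 subgraphs in graphs a-e.
COUNTS = {
    "a": [4, 1, 3, 1, 2, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    "b": [4, 1, 3, 1, 3, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0],
    "c": [4, 1, 3, 1, 2, 2, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
    "d": [4, 1, 3, 1, 3, 3, 0, 0, 1, 3, 0, 0, 0, 0, 1, 0],
    "e": [4, 1, 4, 1, 4, 2, 4, 2, 0, 1, 1, 2, 0, 2, 0, 1],
}


def si1(m, n):
    """Average of the row-wise quotients smaller/larger (0/0 rows skipped)."""
    quotients = [min(x, y) / max(x, y) for x, y in zip(m, n) if max(x, y) > 0]
    return sum(quotients) / len(quotients)


def si2(m, n):
    """Sum of the row-wise minima divided by the sum of the row-wise maxima."""
    return sum(min(x, y) for x, y in zip(m, n)) / sum(max(x, y) for x, y in zip(m, n))


si1_ab = si1(COUNTS["a"], COUNTS["b"])
si2_ab = si2(COUNTS["a"], COUNTS["b"])
```

Looping over all ten pairs of columns fills the upper triangles of SM(1) and SM(2).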
The calculations of similarity indices can also be done with labeled subgraphs of
a labeled molecular graph. The points can be labeled according to the valency of the
corresponding atoms (i.e. whether they are primary, secondary, tertiary, etc.), labeled
with stereochemical descriptors, or labeled to reflect isotopic composition to cite but a
few examples. Furthermore, the number of similarity indices can be doubled by
relaxing the stricture that only connected subgraphs be considered. We have
concentrated on connected subgraphs, as they are more intuitively meaningful to the
average chemist; nevertheless, for some applications the inclusion of disconnected
subgraphs may be desirable or even necessary.

Similarity and Distance. Two sequences of subgraphs m and n such as those in
Table I have the property that there is a built-in one-to-one correspondence between
the elements of one sequence ($m_i$) and those of the other ($n_i$). Accordingly, it is
straightforward to calculate various well-known (17) measures of the distance d
between the sequences, e.g. Euclidean distance $[\sum_i (m_i - n_i)^2]^{1/2}$, "city block" distance


Figure 1. Selected four-carbon alcohols, abstracted as their H-suppressed molecular
graphs: a 1-butanol, b isobutanol, c 2-butanol, d t-butanol, e cyclobutanol.

              a      b      c      d      e

        a   1.000  0.561  0.682  0.417  0.462
        b          1.000  0.472  0.576  0.400
SM(1) = c                 1.000  0.472  0.577
        d                        1.000  0.367
        e                               1.000

              a      b      c      d      e

        a   1.000  0.684  0.778  0.522  0.517
        b          1.000  0.619  0.609  0.484
SM(2) = c                 1.000  0.609  0.586
        d                        1.000  0.441
        e                               1.000

Figure 2. Similarity matrices SM(1) and SM(2) for the graphs in Figure 1.


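Table I. Subgraph Enumeration for Some Four-carbon Alcohols.

                                       NUMBER IN GRAPH
Row  SUBGRAPH                        a    b    c    d    e

 1   C atom                          4    4    4    4    4
 2   O atom                          1    1    1    1    1
 3   C-C (ethane)                    3    3    3    3    4
 4   C-O                             1    1    1    1    1
 5   C-C-C (propane)                 2    3    2    3    4
 6   C-C-O                           1    1    2    3    2
 7   C-C-C-C (butane)                1    0    1    0    4
 8   C-C-C-O                         1    2    1    0    2
 9   C(C)(C)C (isobutane)            0    1    0    1    0
10   C-C(O)-C                        0    0    1    3    1
11   four-membered carbon ring       0    0    0    0    1
12   graph a (1-butanol)             1    0    0    0    2
13   graph b (isobutanol)            0    1    0    0    0
14   graph c (2-butanol)             0    0    1    0    2
15   graph d (t-butanol)             0    0    0    1    0
16   graph e (cyclobutanol)          0    0    0    0    1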


$\sum_i |m_i - n_i|$, or Hamming distance, which counts the number of positions in which the
corresponding elements are unequal. It may be noted that these are measures of
dissimilarity; of course, it is easy to draw conclusions about similarity from them (e.g.
by taking their inverse). Table II contains the distances calculated according to each
of the definitions discussed above as applied to molecular graphs a-e. The three
distance functions parallel each other quite closely: there are only two disagreements
between Hamming distance and Euclidean distance, and there are no disagreements
between city-block distance and Euclidean distance. There is a two-fold degeneracy
within city-block distance and Euclidean distance (the same as SI(1) and SI(2)) and a
four-fold one within Hamming distance, which is the crudest measure. Both city-block
and Euclidean distance have only a single disagreement with SI(2), but many with
SI(1); therefore, it is recommended that SI(2) or one of the distance measures that
parallel it be used to index similarity.
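
The three distance measures are simple to compute from two columns of subgraph counts. The Python sketch below uses the Table I columns for graphs a and d; the function names are hypothetical.

```python
import math

# Table I columns for 1-butanol (a) and t-butanol (d).
A = [4, 1, 3, 1, 2, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
D = [4, 1, 3, 1, 3, 3, 0, 0, 1, 3, 0, 0, 0, 0, 1, 0]


def hamming(m, n):
    """Number of positions at which the two sequences differ."""
    return sum(1 for x, y in zip(m, n) if x != y)


def city_block(m, n):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(m, n))


def euclidean(m, n):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(m, n)))


d_ham, d_city, d_euc = hamming(A, D), city_block(A, D), euclidean(A, D)
```

Taking the reciprocal of the city-block or Euclidean value gives the corresponding inverse-distance entry.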

Table II. Distance Measures

           Hamming  City-bl.  Euclid.  1/City-bl.  1/Euclid.  SI(1)   SI(2)

d(a,b) =      6        6      2.449      0.167       0.408    0.561   0.684
d(a,c) =      4        4      2.000      0.250       0.500    0.682   0.778
d(a,d) =      8       11      4.359      0.091       0.229    0.417   0.522
d(a,e) =     10       14      4.899      0.071       0.204    0.462   0.517
d(b,c) =      8        8      2.828      0.125       0.354    0.472   0.619
d(b,d) =      5        9      4.359      0.111       0.229    0.576   0.609
d(b,e) =     11       16      5.657      0.062       0.177    0.400   0.484
d(c,d) =      8        9      3.317      0.111       0.301    0.472   0.609
d(c,e) =      8       12      4.690      0.083       0.213    0.577   0.586
d(d,e) =     12       19      6.245      0.053       0.160    0.367   0.441

Similarity and Complexity. On account of the variety of features that contribute to
the complexity of a molecule (e.g. rings, double bonds, branching, heteroatoms, etc.),
two molecules can have the same complexity and yet be quite dissimilar, depending on
the weights given to the features (18). In contrast, two molecules which are very
similar must have nearly equal complexities. Therefore, once the most complex
member of a family of molecules has been identified somehow, the others can be
ranked in order of complexity by calculating their similarity to it. For example, taking
tetrahedrane as the most complex member of the family butane ($P_4$), cyclobutane
($C_4$), bicyclobutane ($K_4 - x$), tetrahedrane ($K_4$), SI(2) confirms that this is the order
of increasing complexity. The same order is obtained by considering the total number
of subgraphs or by counting only the number of propane subgraphs (19), η (Table III).
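
The subgraph totals for this family can be checked by brute force, as in the following Python sketch. It assumes that a subgraph is either a single vertex or a connected, non-empty set of edges, and that each distinct edge set is counted once. The names are hypothetical, and the exhaustive search is practical only for very small graphs.

```python
from itertools import combinations


def is_connected(edges):
    """True if the given edge set forms a single connected piece."""
    vertices = {v for e in edges for v in e}
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for e in edges if v in e for w in e if w != v)
    return seen == vertices


def count_connected_subgraphs(n_vertices, edges):
    """Single vertices plus every connected, non-empty subset of edges."""
    total = n_vertices
    for size in range(1, len(edges) + 1):
        total += sum(1 for subset in combinations(edges, size) if is_connected(subset))
    return total


GRAPHS = {
    "P4": [(1, 2), (2, 3), (3, 4)],
    "C4": [(1, 2), (2, 3), (3, 4), (4, 1)],
    "K4-x": [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)],
    "K4": [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3), (2, 4)],
}

totals = {name: count_connected_subgraphs(4, edges) for name, edges in GRAPHS.items()}
# Propane subgraphs are the connected pairs of edges.
eta = {name: sum(1 for pair in combinations(edges, 2) if is_connected(pair))
       for name, edges in GRAPHS.items()}
# Every graph here is contained in K4, so each of its counts is the
# row-wise minimum and SI(2) against K4 reduces to a ratio of totals.
ratio_to_k4 = {name: totals[name] / totals["K4"] for name in GRAPHS}
```

Under these assumptions the ratios agree with the SI(G,K4) column of Table III to three decimal places.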
Subgraph Enumeration. The total number of subgraphs increases rapidly with the
number of atoms, making hand calculations of SI impractical for large molecules.
Therefore a computer program was written. Our program is based on the fact that the
entries in the nth power of the adjacency matrix of a graph count paths of length n,
which includes retraced pathways and, therefore, branched chains and cycles. A
molecular graph is represented by the string adjacency matrix A$(I,J), where the
I,J-entry is a string of characters describing a bond (I ≠ J) or an atom (I = J).
Matrix multiplication is defined as string concatenation. The concatenated strings are
alphabetized, processed to eliminate duplicates, sorted by number of bonds, and stored
for future use. (A copy of this program can be obtained by writing to WCH.)
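
The counting property of adjacency-matrix powers can be seen in a small numerical Python sketch. This is not the string-based program described above: it uses ordinary integer multiplication, and the function names are hypothetical.

```python
def mat_mult(x, y):
    """Product of two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walks(adjacency, length):
    """Entry (i, j) counts the walks of `length` steps from i to j."""
    n = len(adjacency)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(length):
        result = mat_mult(result, adjacency)
    return result


# Carbon skeleton of butane: atoms 0-1-2-3 in a chain.
BUTANE = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

two_step = walks(BUTANE, 2)
three_step = walks(BUTANE, 3)
```

The diagonal of the second power holds the retraced walks (out and back along one bond), which is why duplicates have to be removed before distinct subgraphs can be counted.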


Table III. Complexity Measures

G          SI(G,K4)   Subgraphs    η

P4          0.156        10        2
C4          0.266        17        4
K4 - x      0.516        33        8
K4          1.000        64       12

                 P4      C4    K4 - x    K4

        P4      1.000   0.588   0.303   0.156
SM(2) = C4              1.000   0.515   0.266
        K4 - x                  1.000   0.516
        K4                              1.000

Potential Applications. Quantitative structure-activity relations have been formulated
on the basis of common substructures (2,14) and similarity indexing (5,10). For
example, Carbó et al. (10) related pheromone activity to "an electron density measure
of similarity between two molecular structures." Randić et al. (14) have related
pharmacological activity to the numbers of paths in the molecular graph. The
extension from this one kind of subgraph to all possible subgraphs should improve the
statistical correlation of properties with substructures; but, even more importantly, it
will make the results easier to visualize in a way that is meaningful to a chemist.
Gordon and Kennedy (16) observe that a physical measurable can be expressed as a
linear combination of graph-theoretical invariants ($N_{ij}$, see above). By using all
possible subgraphs in such an analysis and optimizing the coefficients, the most
important ones might be found.
Another important subject for similarity considerations is the planning of organic
syntheses. Wipke and Rogers (20) point out that "chemists do not always work
systematically backward but sometimes make an 'intuitive leap' to a specific starting
material from a target without consideration of reactions needed for interconversion.
This intuitive leap probably involves a Gestalt pattern recognition based on the
chemist's knowledge of available starting materials and similarity between the starting
material structure and the target structure." Our method should allow not only the
overall similarity of target and potential starting material to be assessed, but also the
similarity of portions (substructures) of the target and all or part of a starting material.

Acknowledgment. WCH is grateful to the Robert A. Welch Foundation of Houston,
Texas for financial support.

Literature Cited

1. Grunwald, E. Chemtech 1984, 14, 698.
2. Crandell, C. W.; Smith, D. H. J. Chem. Inf. Comput. Sci. 1983, 23, 186.
3. Bawden, D. Ibid. 1983, 23, 14.
4. Hansch, C.; Leo, A. J. "Substituent Constants for Correlation Analysis in
Chemistry and Biology"; Wiley: New York, 1979.
5. Kier, L. B.; Hall, L. H. "Molecular Connectivity in Chemistry and Drug
Research"; Academic Press: New York, 1976.
6. Kubinyi, H.; Kehrhahn, O. J. Med. Chem. 1976, 19, 1040.
7. Waterman, M. S. in "Mathematical and Computational Problems in the Analysis
of Molecular Sequences" (Bull. Math. Biol. 1984, 46); Pergamon: Oxford, 1984;
p. 473. Cf. other articles in this volume.
8. Lipman, D. J.; Pearson, W. R. Science 1985, 227, 1435.
9. Sellers, P. H. SIAM J. Appl. Math. 1974, 26, 787.
10. Carbó, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185.
11. Cone, M. M.; Venkataraghavan, R.; McLafferty, F. W. J. Am. Chem. Soc.
1977, 99, 7668.
12. Bersohn, M. J. C. S. Perkin I 1982, 631.
13. Armitage, J. E.; Lynch, M. F. J. Chem. Soc. (C) 1967, 521.
14. Randić, M.; Kraus, G. A.; Džonova-Jerman-Blažić, B. in "Chemical Applications
of Topology and Graph Theory"; King, R. B., Ed.; Elsevier: Amsterdam, 1983;
p. 192.
15. Seybold, P. G. Int. J. Quantum Chem., Quantum Biol. Symp. 1983, 10, 95, 103.
16. Gordon, M.; Kennedy, J. W. J. C. S. Faraday Trans. II 1973, 69, 484.
17. Kruskal, J. in "Time Warps, String Edits, and Macromolecules"; Sankoff, D.;
Kruskal, J., Eds.; Addison-Wesley: Reading, MA, 1983; p. 1. Cf. other articles
in this volume.
18. Bertz, S. H. in "Chemical Applications of Topology and Graph Theory"; King,
R. B., Ed.; Elsevier: Amsterdam, 1983; p. 206.
19. Bertz, S. H. J. C. S. Chem. Commun. 1981, 818.
20. Wipke, W. T.; Rogers, D. J. Chem. Inf. Comput. Sci. 1984, 24, 71.

RECEIVED December 17, 1985

16

Symbolic Computer Programs Applied to Group Theory

Gordon D. Renkes
Chemistry Department, Ohio Northern University, Ada, OH 45810

Applications of symbolic computer programming to group
theory will be discussed. These programs, which are
written in Common Lisp, perform the symbolic
manipulations involved in the generation of
multiplication tables, finding the classes, taking
products of groups, establishing the correlations
between subgroups and supergroups, etc. This software
should prove very useful for applications of group
theory to the spectroscopy of non-rigid molecules, for
which the molecular symmetry groups are often large,
not standard point groups, and very tedious to
manipulate by hand.

The symposium one year ago on symbolic computing in chemistry, and
this symposium on uses of artificial intelligence in chemistry
demonstrate that symbolic computation is now becoming recognized as
a useful tool for chemists. Just as computer "number crunching" is
now fully accepted and implemented to assist the solving of many
chemical questions, it appears that eventually computer "symbol
crunching" will fulfill an equally important role to assist the
chemist with his thinking.

Why Symbolic Computing for Group Theory?

This paper addresses the application of symbolic programming to the
symbolic manipulations of group theory. Chemists are already
familiar with the standard applications of group theory as explained
in the standard texts. For many applications, the useful
information such as character tables and correlation tables is in
their appendices. However, in certain areas of current research,
such as the interpretation of the spectra of non-rigid molecules,
unfamiliar and sometimes large groups which are not included in the
standard tables are employed (1-4). A variety of formulations have
been developed to approach the analysis of the symmetries of such
species (e.g. molecular symmetry group and the isometric group,
etc.). They all share the common hazard of many elements and

0097-6156/86/0306-0176$06.00/0
© 1986 American Chemical Society

tedious manipulations for many molecules of interest. When
confronted with this situation, the investigator must generate the
required tables himself, if he is not lucky enough to find them
published somewhere. A common clause in many papers reads "The
character table for this group has already been published....". The
reader can hear the author's sigh of relief that he didn't have to
work it out himself. Many labor saving techniques have been devised
to speed up this process, e.g. (4), but these have to be learned and
executed with care. And, the amount of paper work involved can still
be considerable, especially when complicated situations are
considered. For example, to evaluate the classes and character table
for the molecule boron trimethyl (of order 324) required 18.5 pages
and 15 intermediate tables even when efficient algorithms were used
(4). Upon surveying this situation, one can appreciate the
convenience of computer programs which would handle the tedious
details. (This would be analogous to the application of computers
to numerical computations. Before computers, tedious arithmetic was
minimized by use of log tables, perturbation theory, algebraic
approximations, etc. With computers, computations can be executed
with far fewer approximations and applied to more extensive and
realistic situations.) Such programs would be useful tools, because
they would free one to spend more time thinking about the problem at
hand, and to quickly test out ideas without having to decide, "Is it
worth the effort?"

Lisp as a Language for Implementation

Given that such programs would be useful, we must next decide which
language would be most appropriate for implementation. At least
three reasons justify the symbolic language Lisp.

First, Lisp is designed to be used interactively at a
computer terminal. This would be very convenient for the
investigator in the midst of thinking about a particular problem.
Suppose a question arises which requires the use of group theory
tables. Rather than digging through appendices or searching in the
library, the computer programs would be employed to supply results
on the spot, even if no one has ever done it before.

Second, the fundamental data structure of the Lisp language
is a list of symbols. Two examples of legitimate lists are,

(1 2 3) and ((1 2 3) (4 5 6) (7 8 9))

The first is a simple list of three integers, and the second is a
list of lists, each of which is a simple list of three integers.
The parentheses serve to enclose the lists. Such lists ideally
match the permutation notation for group operations which are
employed in these programs. For example, to represent the
permutation of the integers 1 and 2 in a list of integers (1 2 3)
one writes,

operator * operand = result

(1 2) * (1 2 3) = (2 1 3)


And, the product of two operators, which is equal to another single
permutation operator is written, for example,

operator * operator = operator

(1 2 3) * (1 2 3) = (3 2 1)

(This matches the notation used by many spectroscopists who study
non-rigid molecules (1-3). At present, a user of this software is
confined to this notation, although it could be possible to expand
the capability of reading and displaying standard point group
notation at the terminal.)

Finally, the language provides built in devices for
conveniently manipulating, categorizing, storing and recalling all
the information which pertains to a group, such as the
multiplication table, classes, characters of the irreducible
representation, correlations between subgroups and product groups,
etc. Much of the rest of this paper will summarize the details of
how this is done.

Implementation Discussion

Basic Functions. The fundamental symbolic operation which is used
performs the permutations on lists of numbers. The Common Lisp
supplied function ROTATEF is designed to do just this. An arbitrary
number of arguments can be supplied to it, and it returns a list in
which the first argument is at the end, and the others have been
shifted one space to the left. An example of the application of
this function, and the result displayed on the screen is,

(ROTATEF '1 '2 '3)

(2 3 1)

A user function CYCLOPERATE was written to employ this Lisp
function, using the operator list as a recipe for how ROTATEF should
rearrange the numbers in the operand list. In this example, the
list (1 2 3) is the operator, and (1 2 3 4 5) the operand. The
permuted list is returned as the result of the function.

(CYCLOPERATE '(1 2 3) '(1 2 3 4 5))

(2 3 1 4 5)

Another user function, PERMUTE, applies CYCLOPERATE repeatedly when
a list of permutation operators is applied successively to an
operand list, to return a list of permuted numbers,

(PERMUTE '((1 2 3) (2 3 4) (3 4 5)) '(1 2 3 4 5))

(2 1 3 5 4)
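
For readers without a Lisp system at hand, the behavior of the two functions can be imitated in Python. This is a sketch inferred from the printed examples, not a translation of the author's code. It assumes that the numbers in an operator name 1-based positions in the operand list and that a list of operators is applied leftmost first; both assumptions reproduce the results shown above.

```python
def cycloperate(cycle, operand):
    """Rotate the entries sitting at the 1-based positions named in `cycle`."""
    result = list(operand)
    for k, position in enumerate(cycle):
        source = cycle[(k + 1) % len(cycle)]
        result[position - 1] = operand[source - 1]
    return result


def permute(cycles, operand):
    """Apply the cycles one after another, leftmost first."""
    result = list(operand)
    for cycle in cycles:
        result = cycloperate(cycle, result)
    return result


single = cycloperate([1, 2, 3], [1, 2, 3, 4, 5])
successive = permute([[1, 2, 3], [2, 3, 4], [3, 4, 5]], [1, 2, 3, 4, 5])
```

An empty operator leaves the operand unchanged, which matches the role of the identity.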

These two functions, and one more which can reconstruct a
permutation operator from a permuted list, serve as the workhorses
for the procedure of setting up a group multiplication table.

Some of the information pertaining to a group is stored in
property lists. Table I exemplifies how this looks for the simple
case of the cyclic group of order three. (This would be isomorphic
to the rotational subgroup of a molecule such as methyl fluoride.
The operators (1 2 3) and (1 3 2) would correspond to the
permutations of the three hydrogen nuclei numbered 1, 2 and 3.
NIL, the language's symbol for the empty list, serves as the
identity.)

Table I. Property lists for cyclic group, order 3.

(  #:GRP-1                 #:GRP-2                 #:GRP-3  )

PERMOP (NIL)            PERMOP ((1 2 3))        PERMOP ((1 3 2))
RESULTLIST (1 2 3)      RESULTLIST (2 3 1)      RESULTLIST (3 1 2)
INVPERMOP (NIL)         INVPERMOP ((1 3 2))     INVPERMOP ((1 2 3))
INVERSE #:GRP-1         INVERSE #:GRP-3         INVERSE #:GRP-2
CLASS #:CLS-1           CLASS #:CLS-2           CLASS #:CLS-3
#:GRP-1 #:GRP-1         #:GRP-1 #:GRP-2         #:GRP-1 #:GRP-3
#:GRP-2 #:GRP-2         #:GRP-2 #:GRP-3         #:GRP-2 #:GRP-1
#:GRP-3 #:GRP-3         #:GRP-3 #:GRP-1         #:GRP-3 #:GRP-2

The three operators are represented by the three Gensym symbols in
the list (#:GRP-1 #:GRP-2 #:GRP-3) which is stretched out across the
top of the table to make room for the property lists underneath.
These symbols by themselves mean nothing. The useful information is
contained in the property lists, which are displayed underneath in
vertical tabular format for readability. The property list is a
list of pairs of symbols. The first symbol of each pair is the
property indicator, which allows access to the second symbol, the
property value, by execution of an access function. For example,
the PERMOP property is the permutation operator for each group
element. If we want the permutation operator for a particular group
element, we use the access function GET, to get from the appropriate
Gensym symbol the PERMOP property.

(GET '#:GRP-2 'PERMOP)

((1 2 3))

The RESULTLIST property is the result of operating with that
operator on an initially ordered operand list. INVPERMOP and
INVERSE are the inverse operator list, and the Gensym symbol for it,
respectively. The CLASS property value is another Gensym atom which
has as its value a list of all of the operators in that class. (In
this simple case, the value of #:CLS-1 is the list (#:GRP-1), etc.)
The remaining pairs in each property list represent the group
multiplication table. For any particular group multiplication, an
element of the group list at the top of Table I pertains to the
right operator, the property indicator pertains to the left
operator, and the property value pertains to the product. For
example, for the product of the permutation operator (1 2 3) with
itself,

(GET '#:GRP-2 '#:GRP-2)

#:GRP-3

And to obtain the operator itself,

(GET (GET '#:GRP-2 '#:GRP-2) 'PERMOP)

((1 3 2))
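
The same lookups can be illustrated with nested dictionaries standing
in for the property lists. This is only a sketch in Python: the table
name PROPERTY_LISTS and the function get are hypothetical, the "#:"
prefix of the Gensym names is dropped, and only some of the properties
of Table I are copied in.

```python
# Property lists of Table I. The outer key is the group element (right
# operator); the inner key is the property indicator.
PROPERTY_LISTS = {
    "GRP-1": {"PERMOP": [[]], "RESULTLIST": [1, 2, 3], "INVERSE": "GRP-1",
              "GRP-1": "GRP-1", "GRP-2": "GRP-2", "GRP-3": "GRP-3"},
    "GRP-2": {"PERMOP": [[1, 2, 3]], "RESULTLIST": [2, 3, 1], "INVERSE": "GRP-3",
              "GRP-1": "GRP-2", "GRP-2": "GRP-3", "GRP-3": "GRP-1"},
    "GRP-3": {"PERMOP": [[1, 3, 2]], "RESULTLIST": [3, 1, 2], "INVERSE": "GRP-2",
              "GRP-1": "GRP-3", "GRP-2": "GRP-1", "GRP-3": "GRP-2"},
}


def get(symbol, indicator):
    """Return the property value stored under `indicator` for `symbol`."""
    return PROPERTY_LISTS[symbol][indicator]


# Product of (1 2 3) with itself, then the operator of that product.
product_symbol = get("GRP-2", "GRP-2")
product_operator = get(product_symbol, "PERMOP")
```

Once the table is filled in, every product is a single lookup, which
is the point made in the text about only having to GET the results.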

Much of the computational labor expended is used to set up these
property lists, but once that is accomplished, other manipulations
which need the information stored in them only have to GET the
results which are stored in these property lists.

The multiplication table is set up by a function which accepts a
list of operators that form a set of generators for the group. All
possible products between the generators fill in a portion of the
table, and usually produce new operators. Further multiplication
using these new operators fills in more of the table, and may
produce more new operators. This process is repeated exhaustively
until no new operators are produced, at which point the closure
property of groups is satisfied, and the table is complete.
Following this, another function uses this table to find the
conjugacy classes by application of the definition.
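
The closure strategy can be sketched as follows. This is a minimal
Python illustration under stated assumptions, not the author's
implementation: each operator is represented by its result list on the
operand (0, 1, ..., n-1), the product "left times right" is assumed to
mean that the right operator acts first, and the function names are
hypothetical.

```python
from itertools import product


def compose(left, right):
    """Result list obtained by applying `right` first, then `left`."""
    return tuple(right[i] for i in left)


def close_under_products(generators):
    """Multiply known operators until no new operator appears."""
    elements = list(dict.fromkeys(generators))
    table = {}
    grew = True
    while grew:
        grew = False
        for left, right in product(tuple(elements), repeat=2):
            if (left, right) in table:
                continue
            result = compose(left, right)
            table[(left, right)] = result
            if result not in elements:
                elements.append(result)
                grew = True
    return elements, table


def conjugacy_classes(elements, table):
    """Group the operators g into classes {h g h^-1 for all h}."""
    identity = tuple(range(len(elements[0])))
    inverse = {a: b for (a, b), c in table.items() if c == identity}
    classes, assigned = [], set()
    for g in elements:
        if g in assigned:
            continue
        members = {table[(table[(h, g)], inverse[h])] for h in elements}
        classes.append(members)
        assigned |= members
    return classes


# Generators of S3 as result lists: the 3-cycle (1 2 3) and the swap (1 2).
s3_elements, s3_table = close_under_products([(1, 2, 0), (1, 0, 2)])
s3_classes = conjugacy_classes(s3_elements, s3_table)
```

For S3 this produces six operators in three classes of sizes 1, 2 and
3, in agreement with the display in Table II.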

Terminal Display and Practical Usage. Once the tables have been
calculated, other user functions can extract the desired information
from the internally stored representation and display it on the
terminal or print it on a line printer. Table II shows the list of
operators by classes for S3, the permutation group of degree three,
as displayed on the terminal. (This is isomorphic to the point group
C3v.) The multiplication table and character table can also be
displayed in appropriate formats, although the multiplication table
is readable for only the smallest groups, and probably would not
normally be displayed anyway. (At the present stage of development,
the character table must be entered from the terminal. A function
which sets it up from scratch will be written in the near future.
Also at present, the classes are simply numbered.)

Table II. Terminal Display of Classes

For the group S3

Operators by Classes

1 is (NIL).
2 is ((1 3 2)), ((1 2 3)).
3 is ((2 3)), ((1 3)), ((1 2)).

Other typical group manipulations can be performed once all
the aforementioned information has been found. For example, direct
products can be taken between two groups, and the correlations
established between the representations of the subgroups and the
product group. Consider the direct product of the permutation
groups of degree three and degree two, represented by the names S3
and S2, to produce the product named S3-DP-S2. The character table
of the product group is exhibited in Table III. The default labels
for the representations in the product group are formed by
concatenating the labels of the representations in the subgroups (A
and B for S2, and A1, A2 and E for S3).

Table III. Terminal display of character table of
direct product of S3 with S2

For the group S3-DP-S2

Group character table.

CLASS    1    2    3    4    5    6

A1A      1    1    1    1    1    1
A1B      1    1    1   -1   -1   -1
A2A      1    1   -1    1    1   -1
A2B      1    1   -1   -1   -1    1
EA       2   -1    0    2   -1    0
EB       2   -1    0   -2    1    0
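
The entries of Table III can be reproduced by multiplying subgroup
characters class by class. The Python sketch below assumes the
standard character tables of S3 and S2 (they are not printed in this
section) and assumes that the six product classes are ordered with the
S3 class varying fastest. The function name is hypothetical.

```python
# Assumed subgroup character tables (classes in the order of Table II
# for S3; identity then transposition for S2).
S3_CHARACTERS = {"A1": [1, 1, 1], "A2": [1, 1, -1], "E": [2, -1, 0]}
S2_CHARACTERS = {"A": [1, 1], "B": [1, -1]}


def direct_product_characters(first, second):
    """Characters of the product group, labeled by concatenated names."""
    table = {}
    for name1, chars1 in first.items():
        for name2, chars2 in second.items():
            # The S3 class varies fastest, the S2 class slowest.
            table[name1 + name2] = [c1 * c2 for c2 in chars2 for c1 in chars1]
    return table


s3_dp_s2 = direct_product_characters(S3_CHARACTERS, S2_CHARACTERS)
```

With this ordering the rows agree with the six rows displayed above.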

A record of the correlations between the representations is
constructed with association lists while taking the product, and
these can be used to display character correlation tables in both
the forward sense, from a subgroup to the product group, and in the
reverse sense, from product group to a subgroup. Table IV shows the
terminal display of both forward correlations.

Table IV. Terminal Display of Character Correlation Tables

Character correlation table.

SUBGROUP    PRODUCT-GROUP
S3          S3-DP-S2

A1          A1B A1A
A2          A2B A2A
E           EB EA

Character correlation table.

SUBGROUP    PRODUCT-GROUP
S2          S3-DP-S2

A           EA A2A A1A
B           EB A2B A1B

Products between the irreducible representation characters
within a group will produce representations which are often
reducible. A simple calculation can decompose this product to a sum
of the irreducible representation characters, as is demonstrated in
Table V for two representations from the S3-DP-S2 group.

Table V. Terminal display of the decomposition of the product
of two representations of S3-DP-S2

Within the group S3-DP-S2

EA         2   -1    0    2   -1    0
EB         2   -1    0   -2    1    0
EAxEB      4    1    0   -4   -1    0

The decomposition is

EAxEB    1 EB    1 A2B    1 A1B
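
The "simple calculation" is presumably the usual character
orthogonality sum; a Python sketch under that assumption follows. The
class sizes (1, 2, 3, 1, 2, 3) are not printed in the tables; they are
inferred here from the class sizes of S3 (1, 2, 3) and S2 (1, 1). All
characters are real, so no complex conjugation is needed. The names
CLASS_SIZES, CHARACTERS and decompose are hypothetical.

```python
from fractions import Fraction

CLASS_SIZES = [1, 2, 3, 1, 2, 3]  # inferred, order of the group is 12
CHARACTERS = {
    "A1A": [1, 1, 1, 1, 1, 1],
    "A1B": [1, 1, 1, -1, -1, -1],
    "A2A": [1, 1, -1, 1, 1, -1],
    "A2B": [1, 1, -1, -1, -1, 1],
    "EA": [2, -1, 0, 2, -1, 0],
    "EB": [2, -1, 0, -2, 1, 0],
}


def decompose(reducible):
    """Multiplicity of each irreducible character in `reducible`."""
    order = sum(CLASS_SIZES)
    multiplicities = {}
    for name, irreducible in CHARACTERS.items():
        total = sum(size * x * y for size, x, y
                    in zip(CLASS_SIZES, reducible, irreducible))
        count = Fraction(total, order)
        if count:
            multiplicities[name] = int(count)
    return multiplicities


# Class-by-class product of the EA and EB characters, then its reduction.
ea_x_eb = [a * b for a, b in zip(CHARACTERS["EA"], CHARACTERS["EB"])]
reduction = decompose(ea_x_eb)
```

The product characters and the three multiplicities of one agree with
the display in Table V.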


Other Implementation Details. All of the information resulting from
the computations described above is stored in a named record
structure which is defined using the Common Lisp DEFSTRUCT facility.
An example of what this looks like is shown in Table VI for the
group S2, which we used earlier.

Table VI. Record structure which stores all the
information pertaining to the group S2

ORDER             2
OPERAND-LIST      (4 5)
GROUP-LIST        (#:GRP-10 #:GRP-11)
CLASS-LIST        (#:CLS-7 #:CLS-8)
CHARACTER-LIST    (#:CHR-8 #:CHR-9)
SUBGROUPS         NIL
SUPERGROUPS       (S3-DP-S2)

The left element on each line is a name for the field (or record)
which is shown in the right element. The ORDER of this group is 2.
The OPERAND list used for this example was the numbers 4 and 5. The
Gensym symbols for the two group elements are stored in the
GROUP-LIST field. As was explained earlier, property lists were
attached to each of these which contained the multiplication table
and other information. The elements of the CLASS-LIST and
CHARACTER-LIST fields contain the information indicated by their
names. In the above examples, we did not work with the subgroup of
S2, so NIL is stored there; but the direct product name S3-DP-S2 is
stored in the field SUPERGROUPS. Attached to this name (not shown)
is the association list for the correlation between the
representations, which was used for the construction of Table IV.

The examples used above to illustrate the features of the
software were kept deliberately simple. The utility of the symbolic
software becomes appreciated when larger problems are attacked. For
example, the direct product of S3 (order 6) and S4 (isomorphic to
the tetrahedral point group) is of order 144, and has 15 classes and
representations. The list of classes and the character table each
require nearly a full page of line printer printout. When asked for,
the correlation tables and decomposition of products of
representations are evaluated and displayed on the screen within one
or two seconds. Table VII shows the results of decomposing the
products of two pairs of representations in this product group.

Table VII. Display of the decomposition of products of representations within the
direct product group S3-DP-S4.

Within the group S3-DP-S4

EA1        2 -1 0 2 2 -1 -1 0 0 0 0
EA1        2 -1 0 2 2 -1 -1 0 0 0 0
EA1xEA1    4 1 0 4 4 1 1 0 0 0 0

The decomposition is

EA1xEA1    1 EA1  1 A2A1  1 A1A1

Within the group S3-DP-S4

ET2        6 -3 0 -2 0 2 -2 1 0 -1 1 0 0 0 0
ET2        6 -3 0 -2 0 2 -2 1 0 -1 1 0 0 0 0
ET2xET2    36 9 0 4 0 4 4 1 0 1 1 0 0 0 0

The decomposition is

ET2xET2    1 ET2  1 ET1  1 EE  1 EA1  1 A2T2  1 A2T1  1 A2E  1 A2A1  1 A1T2
           1 A1T1  1 A1E  1 A1A1

These programs have been coded in Common Lisp (5) which is
being promoted as a standardized dialect which should be easily
transportable between different computers. (It has not been
determined if it could operate on any microcomputer implementations
of Common Lisp.)

Other Group Theory Software

Other published reports of computer programs applied to group theory
include the following. J. J. Cannon (University of Sydney) is a
mathematician who has led the writing of a large set of Fortran
programs to generate and study groups from a mathematician's point
of view (6). C. Trindle (Univ. of Virginia) has written programs
in Basic which execute on an Apple microcomputer (7). These
programs are also intended to be used for academic instruction in
group theory as well as for research work. K. Balasubramanian
(Arizona State) has written programs which use the wreath product
formalism to generate the permutation operators for non-rigid
molecules (8), and compute nuclear spin statistical weights (9).

Future Plans

Features planned for future developments of these programs include:
the evaluation of semi-direct products between
groups, the direct evaluation of the character tables from scratch,
and storing in files the record structures which contain the
information about the larger groups. The first two are necessary
for useful work to be accomplished for non-rigid molecule groups,
since their construction usually includes the semi-direct product
combination of subgroups. The third feature is intended to avoid
repeating long computations which occur for large groups, even on a
computer.

Literature Cited

1. Bunker, P.R. "Molecular Symmetry and Spectroscopy"; Academic
Press: New York, 1979.
2. "Symmetries and Properties of Non-rigid Molecules"; Maruani, J.,
Serre, J., Ed.; Elsevier: Amsterdam, 1983.
3. Ezra, G.S. "Symmetry Properties of Molecules"; Springer-Verlag:
Berlin, 1982.
4. Altmann, S.L. "Induced Representations in Crystals and
Molecules"; Academic Press: N.Y., 1977.
5. Steele, G.L., Jr. "Common Lisp"; Digital Press: Burlington, MA,
1984.
6. Cannon, J.J. in "Computational Group Theory"; Academic Press:
London, 1984: pp 145-83.
7. Trindle, C. J. Computat. Chem. 1984, 5, 162-9.
8. Balasubramanian, K. J. Computat. Chem. 1983, 3, 302-7.
9. Balasubramanian, K. J. Computat. Chem. 1982, 1, 69-74, 75-88.

RECEIVED December 17, 1985

17

A Multivalued Logic Predicate Calculus Approach
to Synthesis Planning

W. Todd Wipke and Daniel P. Dolata


Department of Chemistry, University of California, Santa Cruz, CA 95064

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch017

Stereochemical principles of synthesis planning have
been axiomatized using first-order predicate calculus
with a multi-valued logic as implemented in the QED
system. Given the definition of a synthetic target
molecule as a set of axioms, QED is able to infer a
synthesis plan in high-level terms without reference to
reactions. Key benefits of this approach are clarity of
expression and transparency of the system: all chemical
knowledge used is explicit in the axioms.

The purpose of this research was to explore the representation,
manipulation, and utilization of strategic knowledge in organic
synthesis planning. The method we decided to explore was to create
an axiomatic theory to replace our intuitive theory about chemical
synthesis. This formal method of reasoning is very powerful in that
it completely eliminates any questions about the method used to reach
a conclusion. Since any conclusions reached would be theorems of the
axiomatic theory, the acceptability of the conclusions rests
completely on the acceptability of the postulates and not upon the
method of reasoning. We are then free to focus on the chemical
principles which are provided as postulates.

In this paper we describe the need for planning, and then
develop the predicate calculus we used and the choice of multi-valued
logic. Finally we briefly describe the QED program, a few rules, and
an example analysis. Other papers in the QED series will cover the
program and chemical results in detail.

Need For Planning

Synthesis planning programs such as SECS,(1)(2)(3) LHASA,(4)(5) or
SYNCHEM(6)(7) work backward from the target molecule to be
synthesized toward starting materials. By applying applicable
chemical transforms (inverse chemical reactions) to the target, the
first set of chemical precursors is generated. Each of these in turn
can be considered a new target and processed in like manner
recursively. This process develops a "synthesis tree", where nodes
in the tree correspond to chemical structures and edges to chemical
transforms. The fundamental problem is that there are many possible
chemical transforms that can be applied and typical syntheses require
several steps. If each molecule in the tree has ten precursors, by
the time we reach the sixth level, one million precursors must be
evaluated! If our program can process 10 precursors per second, this
will require more than a day. This example actually underestimates
the size of the problem, because the typical branching factor is
between 100 and 200 rather than ten, and many syntheses require more
than six or seven steps.
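
The arithmetic behind this estimate is easy to check. The short Python
sketch below uses hypothetical function names and only the figures
quoted in the text (a branching factor of ten, six levels, ten
precursors evaluated per second).

```python
def precursors_at_level(branching_factor, level):
    """Number of precursors at a given depth of the synthesis tree."""
    return branching_factor ** level


def hours_to_evaluate(count, rate_per_second):
    """Time needed to evaluate `count` precursors, in hours."""
    return count / rate_per_second / 3600


sixth_level = precursors_at_level(10, 6)
hours_needed = hours_to_evaluate(sixth_level, 10)

# The same depth with the more typical branching factor of 100.
sixth_level_typical = precursors_at_level(100, 6)
```

At ten precursors per second the sixth level alone takes roughly 28
hours, and raising the branching factor to 100 multiplies the count by
a further factor of one million.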

Approaches to Large Search Spaces

- Heuristics
- Macro Operators
- Abstraction
- Planning

A heuristic is a rule of thumb that may lead to a shortcut in the
solution of a problem. If such a heuristic removes 90% of the routes
at each level, then it will eliminate 99% of the possible routes by
the second level, 99.9% by the third, etc. Even a simple heuristic
can make the problem far more tractable. Gelernter's SYNCHEM II(7)
is an example of a program that has focused on heuristic evaluation
functions for reducing the search space that must be explored. Wang
used macro operators to establish "planning islands" that can then
serve as near-term objectives.(8) The macro operators make bigger
jumps in the search space, thus eliminating much branching and
combinatorics. Abstraction can be used to simplify chemistry so
there are fewer kinds of functional groups, and fewer chemical
transforms, thus a reduced search space.(9) Planning provides
direction in the search space, thus permits pruning of pathways which
are headed in the wrong direction and permits focusing resources in a
particular direction.
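
The pruning percentages quoted for a heuristic follow from compounding
the surviving fraction at each level. A small Python sketch, with a
hypothetical function name and exact fractions to avoid rounding
noise:

```python
from fractions import Fraction


def fraction_eliminated(removed_per_level, levels):
    """Share of all routes eliminated after pruning at each of `levels`."""
    surviving = (1 - Fraction(removed_per_level)) ** levels
    return 1 - surviving


after_two = fraction_eliminated("0.9", 2)
after_three = fraction_eliminated("0.9", 3)
```

Removing 90% of the routes at each level leaves one route in ten, so
with a branching factor of ten the number of surviving routes stays
constant from level to level instead of growing tenfold.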

Providing a Sense of Purpose. Planning is more than just a
heuristic evaluation function that measures complexity. The ultimate
goal of synthesis is to prepare the complex from things simple. Thus
in Figure 1 our goal is to find syntheses that lead downward, but
some excellent syntheses may require increasing the complexity of a
precursor in order to ultimately lead to a very simple one as the
path T -> P" -> P~ in Figure 1 illustrates. Going uphill in
complexity is acceptable if there is some purpose and a plan provides
that sense of purpose.

Plan Representation in SECS. The SECS (Simulation and Evaluation of
Chemical Synthesis) program explicitly represents its plans(2) as a
list structure of goal instructions with logical connectives. A goal
instruction can specify one of the following:

- Introduce a functional group at a position
- Change a functional group at a position
- Make or Break a bond
- Use an atom, bond, or group


The logic instruction can include AND, OR, NOT, or XOR, and also
includes an action to take if the goals beneath it are not achieved.
The actions generally modify the evaluation score for a synthesis
pathway or completely terminate consideration of a synthesis
pathway.
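
A goal list of this kind can be pictured as a nested structure. The
Python sketch below is purely illustrative and does not reflect the
actual SECS data format: the tuple layout, the bond names, the reading
of XOR as "exactly one", and the fixed score penalty are all
assumptions made for the example.

```python
def satisfied(goal, achieved):
    """Evaluate a nested goal against the set of achieved bond breaks."""
    kind, *parts = goal
    if kind == "BREAK":
        return parts[0] in achieved
    results = [satisfied(part, achieved) for part in parts]
    if kind == "AND":
        return all(results)
    if kind == "OR":
        return any(results)
    if kind == "NOT":
        return not results[0]
    if kind == "XOR":
        return sum(results) == 1  # assumed meaning: exactly one holds
    raise ValueError(f"unknown goal instruction: {kind}")


def score_pathway(goal, achieved, score, penalty):
    """Assumed action: lower the pathway score when the goal is not met."""
    return score if satisfied(goal, achieved) else score - penalty


# Two alternative goals, each asking for two bonds to be broken.
plan = ("OR",
        ("AND", ("BREAK", "bond-a"), ("BREAK", "bond-b")),
        ("AND", ("BREAK", "bond-c"), ("BREAK", "bond-d")))

met = satisfied(plan, {"bond-a", "bond-b"})
unmet = satisfied(plan, {"bond-a", "bond-c"})
penalized = score_pathway(plan, {"bond-a"}, 100, 25)
```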

SECS uses the goal list to select transforms that appear to have
the potential for satisfying the goals, based on the character of the
transform. The character of a transform specifies the types of
architectural changes the transform may effect. The goals specify
desired architectural changes in the molecule. It is the
responsibility of the SECS program to find transforms that can
achieve the goals. The strategy module of SECS creates a plan and
writes it on the goal list. The chemist may modify those goals or
add new ones.

The Strategic Basis. The strategic basis for designing a synthesis
plan rests on general principles of molecular architecture
construction, and is independent of reaction knowledge. Examples
include symmetry of the target molecule, potential symmetry of the
target molecule, the relative reactivity of functional groups in the
target, consideration of potential starting materials, the
connectivity of the structure, and the control of stereochemistry. A
symmetry-based strategy for β-carotene is shown in Figure 2. The
reaction-independent principle is to construct the molecule from
identical pieces to take advantage of the symmetry of the structure.
The resulting goal structure is a set of three alternative goals,
each of which specifies two bonds that should be broken in the
analytical direction.

Since we were interested in studying these strategies, we wanted
a means for explicitly representing the principles that enabled us to
easily modify them and to be able to easily understand exactly what
principles the program was using. For this reason the QED project
was initiated.(10) QED was to use statements of the principles
together with a definition of the molecule and then infer a
reasonable set of strategies for the synthesis of the molecule and
write these to the SECS goal list.

First Order Predicate Calculus

We chose the first order predicate calculus (PC) as our language for
representing synthetic principles. The PC is a "formal" system of
logic.(11)(12)(13) In this context, formal means that it is the form
of the arguments that is important, not the actual content. The term
"calculus" comes from the meaning "a method of calculation", and does
not refer to Newton's differential calculus.

To give an example, the following form of argument is known as
modus ponens: "If statement A implies conclusion B, and we know A to
be true, then we may conclude B" and may be represented as follows:

A => B
A
:. B

Figure 2. Symmetry-based strategies for β-carotene.

It is possible to substitute various meanings for the statements A
and B, and as long as this form is followed, the conclusion is said
to follow from the premises. A could be the statement "A ketone is
present" and B could be "A carbonyl is present". If A is true, then
B follows from the premises.

If the conclusion seems to be in error, and the chain of
reasoning is valid, we have then demonstrated that one (or more) of
our initial assumptions must be erroneous. For example, if one of
the premises was the implication "If a ketone is present then the
compound is an alkane", then the conclusion that would follow from
the observation "a ketone is present" would be obviously false.
Since the PC is logically correct (in fact defines "logically
correct") any erroneous conclusions must arise from an erroneous
axiom, and cannot be the fault of the calculating procedure used.
This allows us to focus our attention on the assumptions and frees us
from having to worry about the procedure.
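
The point can be made concrete with a few lines of Python that apply
modus ponens repeatedly to statements treated as opaque strings. This
is only a propositional sketch with hypothetical names, not the QED
inference procedure, which works with predicates and variables.

```python
def forward_chain(facts, implications):
    """Apply modus ponens until no new statement can be concluded."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            # From "premise => conclusion" and "premise", conclude "conclusion".
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


sound_axioms = [("a ketone is present", "a carbonyl is present")]
faulty_axioms = [("a ketone is present", "the compound is an alkane")]

sound = forward_chain({"a ketone is present"}, sound_axioms)
faulty = forward_chain({"a ketone is present"}, faulty_axioms)
```

The procedure is identical in both runs; the false conclusion in the
second run comes entirely from the erroneous implication that was
supplied as a premise.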

A Working Definition of the Predicate Calculus. In a formal
theory(13) the statements are written in a specially constructed
symbolic language, and are manipulated in accordance with specified
rules which make no appeal to any possible meaning of the symbols. A
string is a finite sequence of these formal symbols. There exists a
grammar for deciding if a string is a statement. There is a method
for determining if a statement is an axiom. This method involves
pattern matching, and perhaps rearrangement and rewriting. Given a
finite sequence S_1, ..., S_k of statements, there is a procedure for
deciding if S follows from one or more of S_1, ..., S_k by the rule
of inference. The formal symbols defined are shown in Table I.

Table I. The Formal Symbols of the PC used by QED.

Name              Normal            QED
                  Representation    Representation

For All           ∀                 $All
There Exists      ∃                 $Exists
And               ∧                 .and.
Or                ∨                 .or.
Not               ~                 .not.
If/Then           =>                if ... then
If and Only If    <->               if ... only-if
Therefore         :.                (not used)
Parenthesis       ( )               ( )
Brackets          [ ]               [ ]
Curly bracket     { }               { }
Comma             ,                 ,
Predicate         P,Q,R...          Initial capital letter
Identifier        i,x,z...          All small letters

To decide if a formula is a well formed formula (wff) of the
PC, it must conform to the following definition:

1. There exists a set of variable symbols, a ... z, which can
hold places. We can also define a set of objects, o_1 ... o_n,
also known as constants. This set of objects is known as the
domain.
2. For each of n = 0, 1, ... there is a possibly empty set of
n-place predicates. These are denoted by P(x_1, ..., x_n).
This predicate may represent some property of an object, or a
relationship between objects.
3. An "atomic formula" of the PC is formed by taking an n-place
predicate symbol P(x_1, ..., x_n) and substituting variables or
constants for any of the x_i.
4. A formula of the PC is formed by combining atomic formulae
using the connective symbols ∨, ∧, ~, => or <->.
5. A formula of the PC may be preceded by a quantifier, ∀ or ∃.
6. Parentheses are used to avoid confusion when the order of
evaluation or the scope of the arguments is important.
Parentheses, square brackets and curly brackets are all
equivalent, and may be used interchangeably, as long as the same
type is used to open and close the same term.

If A, B and C represent atomic formulae, then the following are
examples of formulae:

A                     ∀ A
A ∨ B                 ∃ x B
A ∧ B                 A ∧ ( B ∨ C )
A => B                ( ( A => B ) => C )
A ∨ ( B => A )

An object may be any "thing". This can include tangible items
such as atoms, or bonds, or can include non-tangible things such as
goals, or plans. In the first order PC, an object may not be a
predicate.

Predicates are properties of objects, or relationships between
objects. A predicate is derived from the predicate clause of a
sentence. In the sentence "atom 5 is a carbon", the predicate clause
is "is a carbon". This would be represented as Is-carbon(atom5). To
avoid confusion, names of predicates and objects are written without
embedded blanks or spaces.

Translating Chemical Statements into Predicate Logic. In the
following examples, we use the QED representation for connectives and
quantifiers.

an atom which is a ring atom

Ring-atom (x)                                        (1)

atom5 is a ring atom

Ring-atom (atom5)                                    (2)

atom4 and atom5 are ring atoms

Ring-atom (atom4) .and. Ring-atom (atom5)            (3)

atom4 and atom5 are alpha

Alpha (atom4, atom5)                                 (4)

Note the difference between equations 3 and 4, where an ellipsis
makes the two phrases look similar. However, if 3 is rewritten as
shown in 5 the difference is obvious. In equation 5, the English
word "and" is used in the logical sense, whereas in equation 4, "and"
indicates items which will be related by a predicate phrase.

atom4 is a ring atom and atom5 is a ring atom        (5)

atom4 and atom5 or atom6 are beta

Beta (atom4, atom5) .or. Beta (atom4, atom6)         (6)

bond12 is not an appendage bond

.not. Appendage-bond (bond12)                        (7)

$All (x) $All (y) $All (z)
[ If Atom (x) .and. Atom (y) .and. Atom (z) .and.
Alpha (x,y) .and. Alpha (y,z) .and.
.not. Identity (x,z) then Beta (x,z) ]               (8)

Postulate 8 defines the beta relationship from the alpha
predicate (Alpha (x,y) is true if Bond (x,y) is true). The term
".not. Identity (x,z)" must be included in axiom 8 to prevent an atom
from being beta to itself (x=z).
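
The effect of postulate 8 can be seen by enumerating it over a small
domain. The Python sketch below is a brute-force illustration, not the
QED theorem prover: the three-atom chain, the symmetric treatment of
Alpha, and the function name infer_beta are assumptions made for the
example.

```python
from itertools import product


def infer_beta(atoms, alpha, exclude_identity=True):
    """All Beta (x,z) pairs that postulate 8 yields over `atoms`."""
    beta = set()
    for x, y, z in product(atoms, repeat=3):
        if (x, y) in alpha and (y, z) in alpha:
            if exclude_identity and x == z:
                continue  # the ".not. Identity (x,z)" condition
            beta.add((x, z))
    return beta


# A chain atom4 - atom5 - atom6, with Alpha assumed symmetric.
atoms = ["atom4", "atom5", "atom6"]
alpha = {("atom4", "atom5"), ("atom5", "atom4"),
         ("atom5", "atom6"), ("atom6", "atom5")}

beta = infer_beta(atoms, alpha)
beta_without_identity_test = infer_beta(atoms, alpha, exclude_identity=False)
```

Dropping the identity condition makes every atom beta to itself, since
any bond can be traversed out and back, which is exactly the case the
postulate is written to exclude.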

Definition of Axiomatic Theories. An axiomatic theory is an
attempt to formalize an intuitive theory. Geometry was intuitive
before Euclid wrote "The Elements". An intuitive theory is defined
as a body of knowledge which attempts to express relationships and
causality between objects, but is not formal. Most modern science is
still intuitive, even though it may represent many of its findings
in exact mathematical formulae. As long as the entire corpus of
knowledge is not expressed in a single formal system, it will remain
intuitive.

The following steps are necessary to create an axiomatic theory:

1. Provide a set of symbols, and associated definitions. This set
of symbols will include both objects, and predicate relations.
Together with the set of symbols defined by the associated
logic, these will be the only allowed symbols.
2. Provide a set of axioms (postulates). If one is attempting to
create an axiomatic theory which mirrors experimental reality,
then these axioms should express some fundamental properties of
the system you are trying to model.
3. Choose a type of logic to associate with the symbols and
definitions. This will be used to deduce all further
statements of the theory. This logic will provide at least one
rule of inference, and rules of combination of truth values.
4. Any new statement created by the rule of inference upon the
postulates is known as a theorem of the theory and may be
referred to as "valid" or "proved" within the axiomatic
theory.

An axiom is defined as a principle which holds across all
domains of knowledge, such as "if two objects are equal to a third
object, then they are equal to each other". Postulates are
statements which are given without proof, based only on the
definitions provided before. Euclid called postulates "self-evident
truths", and he felt that they mirrored some fundamental principle of
the universe. The idea of postulating an obviously false statement
such as "two parallel lines may intersect at some point" seemed
useless. However, Bolyai-Lobachevsky geometry does just this and has
met with success in space-time physics, where large bodies can create
bends in space such that parallel lines can intersect.
In modern axiomatic theory, postulates and axioms are defined
simply as given statements. By the definition of an axiomatic theory,
the concept of truth is not considered relevant to its construction.
If we can derive a theory which seems to mirror reality as reported
by our current experimental knowledge, then we consider the
postulates to be "successful" in some sense of the word. If the
theory derived from the postulates clashes drastically with our
observations, the postulates can be thrown away as "non-relevant".
If the differences are slight, or if the theory predicts new
experiments which should show differences from what the intuitive
theory would predict, we can even call the axiomatic theory
"interesting".
Why More Systems Haven't Been Axiomatized. Geometry is unique in
that it can be expressed in a simple logic, the results are either
true or false, and the actual "experiments" were capable of
being done with thought alone. In chemistry there was not sufficient
knowledge to enumerate the basic definitions and postulates. The
recent explosion of knowledge in chemistry has made it feasible to
begin the process of axiomatization of chemical theories.
Geometry is also special in that most examples we wish to reason
about consist of a few objects, with a limited number of
relationships. Thus it is feasible to use a simpler sentential
calculus form of logic which cannot reason with variables.
Sentential calculus can only be used when the total space of all
statements is easily enumerable, since all properties of objects
and relationships must be stated in separate explicit sentences.
But in chemistry, where a typical molecule will have 20-30
atoms, as many bonds, several rings, stereocenters, heteroatoms,
etc., a theory expressed in the sentential calculus would require
thousands of statements. Thus chemistry had to await development of
the predicate calculus (14) to axiomatize the theory.
Finally, while geometry was axiomatized with bimodal logic,
chemistry required creation of a new (within the last few decades)
branch of logic rich enough in expressive power to manipulate
uncertain statements without over- or understating their value. The
next section introduces these new logics.

Bimodal Logic

There is a principle of logic called the principle of the "Excluded
Middle", dating back to Plato. The principle states that "a
statement is either True or False". The Heisenberg uncertainty
principle, however, forces us to recognize uncertainty as a reality.
Logics capable of expressing statements containing uncertainty are
now available (15). These new logics can be viewed as extensions of
bimodal logic.

Redefining True, False, and the Value of Connectives. We begin by
replacing T with 1, and F with 0, mapping atomic formulae onto the
range {0, 1}. Then the truth value of a formula A is either 0 or 1:

Definition 1: v(A) = 1 "true"

Definition 2: v(A) = 0 "false"

For any wff F, there exists a function v(F) which will map F
onto the range {0, 1}. We determine v(F) by the following technique:

1. If F is an atomic formula, P(x1, ... xn), then we can say that
P maps the tuple (x1, ... xn) onto the range {0, 1}. This
mapping is generally done by explicit statement, i.e.

Socrates is Mortal.
Mortal(socrates) maps to 1
v(Mortal(socrates)) = 1

2. If F is a composite formula, we define functions which extend
the values of the terms to the formula.

Definition 3: v(A ∧ B) = min ( v(A), v(B) )

Definition 4: v(A ∨ B) = max ( v(A), v(B) )

Definition 5: v(A => B) = min ( 1, (1-v(A))+v(B) )

Definition 6: v(A <-> B) = min ( v(A => B), v(B => A) )

Definition 7: v(~A) = 1 - v(A)

We will examine implication (Definition 5) in detail because it
is less obvious and quite important to QED. We start by enumerating
the four possible cases for implication (Table II):

Table II. Truth table from Definition 5 for A => B.

A      v(A)  B      v(B)  min (1, (1-v(A))+v(B))

True   1     True   1     1                       case1
True   1     False  0     0                       case2
False  0     True   1     1                       case3
False  0     False  0     1                       case4

Only case 2 has a zero value for the result of the combination
formula. It is only when the antecedent A is true but the
conclusion is false that the rule is considered to be bad. For
example, given the rule shown in equation 9, only if the
molecule is a protein but contains no amide is the rule shown to
be faulty. If the molecule is not a protein, the truth or
falsity of the assertion that it contains an amide does not affect
the truth value of the implication. An implication rule sets no
limitation upon the value of B if A is not true, so case3 and case4
are valid.

IF Is-protein(mol) then Contains-amide(mol) (9)

In summary, the value of an implication rule is an inverse
measure of how often the antecedent has a higher value than the
consequent. This is very important in multi-valued logics where the
truth value ranges over many numbers rather than just 0 and 1. The
more "valuable" the rule, the more often it implies the correct
consequence.
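
The two-valued connectives are easy to check mechanically. The following sketch is an illustration rather than QED code. It writes Definitions 3, 4, 5, and 7 as Python functions (the function names are assumptions) and rebuilds the four rows of Table II.

```python
def v_and(a, b):
    return min(a, b)            # Definition 3


def v_or(a, b):
    return max(a, b)            # Definition 4


def v_implies(a, b):
    return min(1, (1 - a) + b)  # Definition 5


def v_not(a):
    return 1 - a                # Definition 7


# Rebuild Table II: one row per combination of v(A) and v(B).
table_ii = [(a, b, v_implies(a, b)) for a in (1, 0) for b in (1, 0)]
for a, b, value in table_ii:
    print(f"v(A)={a}  v(B)={b}  v(A => B)={value}")
```

Only the row with v(A) = 1 and v(B) = 0 evaluates to 0, which is case 2 of the table.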

Lukasiewicz-Tarski Multi-Valued Logic

In chemistry, uncertainty may arise because we are ignorant of some
of the underlying principles, or just have not done enough
experiments yet. The normal two-valued logic is not satisfactory for
many complex problems, e.g., principles of chemical synthesis. The
solution is to add additional values to our logic so that we can
represent "I don't know." One of the more popular multi-valued
logics (MVLs) was created by the Polish logicians Lukasiewicz and
Tarski (LT) about 50 years ago (15). This logic, called LT, is
further identified as LT_n, where n represents the number of discrete
values that are covered. LT_2 is the same as the bimodal PC.

Only small changes are necessary to convert our definition of
the bimodal logic into the LT logic. For a given atomic formula,
P(x1, ... xn), we say that P maps the tuple (x1, ... xn) onto a
range {0, 1, ..., n-1}, where n is the order of the LT. Every
appearance of the number "1" in our previous definitions (1-7) is
replaced by "n-1", where n is the order of the LT. For example,
Definition 5 is changed from min (1, (1-v(A))+v(B)) to min ((n-1),
((n-1)-v(A))+v(B)). When n=2, the formula is the same.
We will illustrate how the LT logic works using the simplest
multi-valued case, LT_3. The range of LT_3 is 0, 1, 2. These numbers
can be thought of as expressing the English terms False/0, Maybe/1,
and True/2. The rule for explicit evaluation would be used to assign
values to the following examples:

There is an atom alpha to atom5.

v( $Exists x Alpha(x, atom5) ) = 2 (10)

There are no atoms which are alpha to themselves.

v( .NOT. $Exists x Alpha(x,x) ) = 2 (11)

Having assigned values to the atomic formulae, the modified
formulae of definitions 3 through 7 are used to assign values to
wffs.

Atom5 is on a ring.
Atom6 might be on a ring.
v( On-ring(atom5) ) = 2
v( On-ring(atom6) ) = 1 (12)

It might be true that both atom5 and atom6 are on a ring.

v( On-ring(atom5) .AND. On-ring(atom6) ) =
min ( v(A), v(B) ) = min ( 2, 1 ) = 1 (13)

It is certainly true that at least one of atom5
or atom6 is on a ring.
v( On-ring(atom5) .OR. On-ring(atom6) ) =
max ( v(A), v(B) ) = max ( 2, 1 ) = 2 (14)

The value of the rule "If atom5 is on a ring then
atom6 is also on a ring" cannot be established from
the data at hand.
v( $All x $All y ( If On-ring(x) then On-ring(y) ) ) =
min ( n-1, (n-1)-v(A)+v(B) ) = min ( 2, (2-2)+1 ) = 1 (15)
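
The arithmetic of equations 12 through 15 can be reproduced with a short sketch. The class name LT and its method names are assumptions made for illustration. The implication follows Definition 5 with every "1" replaced by n-1, as described above.

```python
class LT:
    """Connectives of an n-valued LT logic with truth values 0 .. n-1."""

    def __init__(self, n):
        if n < 2:
            raise ValueError("an LT logic needs at least two truth values")
        self.top = n - 1  # the value that plays the role of True

    def conj(self, a, b):
        return min(a, b)

    def disj(self, a, b):
        return max(a, b)

    def implies(self, a, b):
        return min(self.top, (self.top - a) + b)

    def neg(self, a):
        return self.top - a


lt3 = LT(3)
on_ring = {"atom5": 2, "atom6": 1}                      # equation 12

both = lt3.conj(on_ring["atom5"], on_ring["atom6"])     # equation 13
either = lt3.disj(on_ring["atom5"], on_ring["atom6"])   # equation 14
rule = lt3.implies(on_ring["atom5"], on_ring["atom6"])  # equation 15
print(both, either, rule)
```

Constructing LT(2) gives back the two-valued connectives, so the same class covers the bimodal case.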

Let's focus on the implication rule in equation 15. If the rule
is considered to be "good" in LT_3, it will have a value of 2. Thus it
has to be the case that the second term in the valuation formula
(Definition 5), (n-1)-v(A)+v(B), must be greater than or equal to 2.
Table III shows the values of B allowed for given values of A as a
function of the "goodness" of the rule. If the rule has value 0,
then any value of B is acceptable no matter what the value of A is.

Table III. Allowed values of B for A => B in LT_3

v(A)   v(A => B) = 2   v(A => B) = 1   v(A => B) = 0

2      2               1, 2            0, 1, 2
1      1, 2            0, 1, 2         0, 1, 2
0      0, 1, 2         0, 1, 2         0, 1, 2

In assigning a value to B from the possible range of values for
B, we must select the lowest value, since only this value is
supported by the implication; the other values are only possibilities
that cannot be ruled out. Other implications may infer the
consequent B with a higher value.
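
Table III can be regenerated by enumeration. The sketch below rests on one interpretive assumption: each column is read as "the rule is at least this good", so a value of B is allowed when the valuation of A => B is greater than or equal to the stated rule value. Under that reading the enumeration reproduces the table, and the lowest allowed value is the one assigned to B. The function names are illustrative.

```python
def implies(a, b, n):
    """Definition 5 generalized to LT_n."""
    top = n - 1
    return min(top, (top - a) + b)


def allowed_consequents(v_a, v_rule, n):
    """Values of B for which A => B is at least as good as v_rule."""
    return [b for b in range(n) if implies(v_a, b, n) >= v_rule]


def inferred_value(v_a, v_rule, n):
    """Lowest allowed value, the only one the implication supports."""
    return min(allowed_consequents(v_a, v_rule, n))


n = 3
for v_a in (2, 1, 0):
    row = [allowed_consequents(v_a, v_rule, n) for v_rule in (2, 1, 0)]
    print(v_a, row)
```

With this reading the inferred value has the closed form max(0, v(A) + v(A => B) - (n-1)), which is convenient when rules are chained.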

Cumulative Evidence. LT does not mirror human intuition about
accumulation of evidence. Generally, if one has more pieces of
orthogonal evidence which support a deduction, then there is more
reason to believe that the deduction is true. However, this is not
the case with LT logic. For example, consider the following.

(A1 => B) v=1 (imp1)

(A2 => B) v=2 (imp2)

...

(A9 => B) v=9 (imp9)

Assume that we are using LT_11, and that A1 ... A9 all are True
with v=10. In this case, B will be inferred from each of these rules
with v=1 to v=9 respectively. Using an LT combining function for two
values, v1 and v2, we will choose v(B) as max(v1, v2). Thus, only one
piece of evidence (imp9) will actually matter, and A1 through A8
could assume any value from 0 to 10 and not affect the value of B.
The reader is referred to the book by Ackermann (15) for the
complete exposition of why LT logic uses this combining function.
Basically, LT logic was designed to avoid some of the many
difficulties that arise when applying MVL to mathematical domains.
Unlike chemistry, mathematics often deals with infinite domains and
infinite axiom sets. If we allow the fact that two axioms infer the
same conclusion to increase the truth value of that conclusion, we
must choose some increment that reflects the importance of each
individual axiom. If there are an infinite number of such axioms,
then each axiom becomes infinitesimally important. Thus LT logic
chooses to err on the side of conservatism, assuring that the
conclusions will be valid, though perhaps less strong than they could
actually be.
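
The behavior of the LT combining function in the imp1 ... imp9 example can be sketched as follows. The logic is taken to have eleven values, 0 through 10, so that True is 10. The value each rule infers for B is taken as the lowest value allowed by the implication; the closed form used for it is an assumption consistent with Table III. The names are illustrative.

```python
N = 11       # eleven truth values, 0 .. 10
TOP = N - 1  # the value that plays the role of True


def inferred_value(v_a, v_rule):
    """Lowest consequent value supported by a rule (cf. Table III)."""
    return max(0, v_a + v_rule - TOP)


def combine_lt(values):
    """LT combining function: only the strongest inference counts."""
    return max(values, default=0)


rule_values = list(range(1, 10))  # imp1 .. imp9
all_true = [TOP] * 9              # A1 .. A9 all True

inferences = [inferred_value(a, r) for a, r in zip(all_true, rule_values)]
print(combine_lt(inferences))

# Setting A1 .. A8 to False leaves the combined value unchanged.
weaker = [0] * 8 + [TOP]
print(combine_lt([inferred_value(a, r) for a, r in zip(weaker, rule_values)]))
```

Both calls give the same combined value, because only imp9 contributes to the maximum.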
Obviously we should not allow multiple iterations of the same
rule to increase the value of the consequent. If this were
allowed, then one could obtain any final value by simply reiterating
the same rule sufficiently many times. But redundancies in rules
arise in subtle ways, e.g., B => A and C => A where B <-> C, i.e., B
is another name for C. Finally, it can be shown that even if the
chain of relation between B and C contains logical connectives other
than <->, then allowing two successive inferences to increase the
value of the consequence above that inferred by the strongest alone
can lead to problems.
In chemistry, where the axioms are generated by formalizing
finite human experience, it is reasonable to allow evidence to
accumulate, and we did so in QED. It is probable that not all of our
axioms will be orthogonal. Simple perception concepts are a part of
the antecedents in many different rules, hence there is some
commonality. Assuring that our rules are orthogonal in all of the
many possible combinations is a difficult task.

Incremental Multi-Valued Logic (IMVL)

We utilized in QED a type of MVL which is similar to a very popular
form of MVL known as Bayesian logic. There are several unfortunate
problems with Bayesian logic, including the fact that Russell showed
that this logic incorporates several fatal paradoxes. Fortunately,
these paradoxes only manifest themselves in infinite systems. There
are still problems with finite systems, such as the ability to assign
unwarranted values to conclusions if the data base is poorly
constructed. But there are significant advantages to this logic.
Once again the logic maps a set of statements onto a range. In
this case the range will be the rational numbers from -m to +m, where
-m is equivalent to False, and +m is equivalent to True, with
complete ignorance at 0.
However, the value of an atomic formula is comprised of three
parts: the confirmation value, the disconfirmation value, and the
combined truth value:

confirmation value CV: 0 <= CV <= +m
disconfirmation value DV: 0 <= DV <= +m
truth value TV = CV - DV: -m <= TV <= +m

An advantage of carrying CV and DV is that one can recognize
from the magnitude of CV and DV the amount of concurrence or conflict
in support of a given inference.
For any explicit assignment of an atomic formula, only the CV or
DV is assigned. The TV is then calculated from that. For the value
of wffs comprised of assigned terms, the following formulas are used:

Definition 8: CV (A ∧ B) = min ( CV (A), CV (B) )

Definition 9: DV (A ∧ B) = max ( DV (A), DV (B) )

Definition 10: TV (A ∧ B) = CV (A ∧ B) - DV (A ∧ B)

Definition 11: CV (A ∨ B) = max ( CV (A), CV (B) )

Definition 12: DV (A ∨ B) = min ( DV (A), DV (B) )

Definition 13: TV (A ∨ B) = CV (A ∨ B) - DV (A ∨ B)

Definition 14: CV (~A) = DV (A)

Definition 15: DV (~A) = CV (A)

Definition 16: TV (~A) = CV (~A) - DV (~A)

Definition 17: CV (A => B) = max ( DV (A), CV (B) )

Definition 18: DV (A => B) = min ( CV (A), DV (B) )

Definition 19: TV (A => B) = CV (A => B) - DV (A => B)

While these formulae are more complex than those of the LT, they
give the same results as do those of LT logic. Two features of the
IMVL are, however, quite different from the LT: 1) the rules for
incrementally acquiring evidence; and 2) the rules for computing the
value of a consequent of an implication when the antecedent is not
fully True.
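
The CV/DV bookkeeping of Definitions 8 through 19 can be sketched with a small value type. This is an illustration, not QED code. The scale m = 100 is an assumption (it matches the "TV 100" used for certain facts in the sample dialog), the names are hypothetical, and the disconfirmation value of an implication is taken here as min(CV(A), DV(B)), which is what Definitions 12, 14, and 15 give when A => B is treated as ~A ∨ B.

```python
from dataclasses import dataclass

M = 100  # assumed value of m


@dataclass(frozen=True)
class Value:
    cv: float = 0.0  # confirmation value
    dv: float = 0.0  # disconfirmation value

    def __post_init__(self):
        if not (0 <= self.cv <= M and 0 <= self.dv <= M):
            raise ValueError("CV and DV must lie between 0 and m")

    @property
    def tv(self):
        return self.cv - self.dv  # truth value


def conj(a, b):
    return Value(min(a.cv, b.cv), max(a.dv, b.dv))  # Definitions 8, 9


def disj(a, b):
    return Value(max(a.cv, b.cv), min(a.dv, b.dv))  # Definitions 11, 12


def neg(a):
    return Value(a.dv, a.cv)                        # Definitions 14, 15


def implies(a, b):
    return disj(neg(a), b)  # assumption: A => B treated as ~A v B


stereo = Value(cv=100)        # a fact known to be completely true
alpha = Value(cv=60, dv=20)   # a fact with conflicting evidence

print(conj(stereo, alpha).tv, disj(stereo, alpha).tv, neg(alpha).tv)
print(implies(stereo, alpha))
```

Keeping CV and DV separate means that a truth value of 40 built from 60 against 20 can be told apart from one built from 40 against 0, which is the advantage noted above.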

Incrementally Acquiring Evidence. Unlike LT logic, the IMVL allows
successive inferences about a fact to increase the truth value of
that fact. One way of viewing how the IMVL deals with
inferences is to say that an inference in support of a theorem
decreases our ignorance about that theorem. Thus, when the theorem
is first proposed, the ignorance is maximal and the values for CV,
DV, and TV are all 0. The amount of ignorance about the CV (or DV)
could be said to be m.
The first inference of a theorem with value v (v <= m) then
reduces our ignorance by v. If the value v was in confirmation of
the theorem, then the values become DV = 0, CV = v, and TV = v. We
have reduced our ignorance about the CV to m-v. Further confirmatory
evidence for the theorem is applied to the remaining measure of
ignorance. The truth value is calculated from CV and DV as normal.
The formulae for this are:

Definition 20: CV(A, given A1, A2) = CV(A1) + CV(A2)*[m - CV(A1)]/m

Definition 21: DV(A, given A1, A2) = DV(A1) + DV(A2)*[m - DV(A1)]/m

Definition 22: TV(A, given A1, A2) = CV(A, given A1, A2) -
DV(A, given A1, A2)

It can be shown that CV and DV approach the maximal value m
asymptotically and that the order of acquiring truth values does not
matter.
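
Both properties are easy to see numerically. The sketch below applies Definition 20 to a list of confirmation values; the same function serves for Definition 21. The scale m = 100 and the function names are assumptions for illustration.

```python
from functools import reduce


def accumulate(current, new, m=100.0):
    """Apply new evidence to the remaining ignorance (Definition 20)."""
    return current + new * (m - current) / m


def combined(values, m=100.0):
    """Fold a sequence of evidence values, starting from total ignorance."""
    return reduce(lambda acc, v: accumulate(acc, v, m), values, 0.0)


print(combined([60, 50]))    # two pieces of evidence
print(combined([50, 60]))    # same evidence, opposite order
print(combined([50] * 10))   # repeated evidence approaches but never reaches m
```

Expanding the definition gives CV(A1) + CV(A2) - CV(A1)*CV(A2)/m, which is symmetric in the two values; this is why the order does not matter.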

Implication in IMVL. The second major difference is in how the value
of the consequent of an implication is calculated. The method is
multiplicative as compared to the minimum function of LT MVL. The
formulae involved are shown below, where TV'(A) = TV(A) if TV(A) > 0;
otherwise TV'(A) = 0.

Definition 23: CV (B, when A => B) = TV'(A) * TV(A => B)/m

Definition 24: DV (B, when A => B) = unchanged

Definition 25: CV (B, when A => ~B) = unchanged

Definition 26: DV (B, when A => ~B) = TV'(A) * TV(A => ~B)/m
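
A minimal sketch of the multiplicative rule follows. It assumes m = 100 and, as a further assumption, that a rule written with a certainty factor such as CF 0.8 carries a truth value of 80 on that scale. The function names are illustrative; a negative or zero truth value of the antecedent contributes nothing.

```python
M = 100.0  # assumed value of m


def positive_part(tv):
    """The truth value of the antecedent if positive, otherwise 0."""
    return tv if tv > 0 else 0.0


def consequent_cv(tv_antecedent, tv_rule, m=M):
    """Definition 23: CV contributed to B by a rule A => B."""
    return positive_part(tv_antecedent) * tv_rule / m


def consequent_dv(tv_antecedent, tv_rule, m=M):
    """Definition 26: DV contributed to B by a rule A => ~B."""
    return positive_part(tv_antecedent) * tv_rule / m


print(consequent_cv(100, 80))  # fully true antecedent
print(consequent_cv(50, 80))   # partly true antecedent
print(consequent_cv(-40, 80))  # disconfirmed antecedent
```

A half-true antecedent passes on half of the rule's value, whereas a minimum function would pass on the smaller of the two values unchanged.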

With this background on the theoretical representation issues,
we will now briefly describe the QED program that implemented this
IMVL PC logic.

The QED Program

QED was implemented on the SUMEX-AIM DEC 2060 TOPS-20 system. The
source code consists of about 18000 lines of FORTRAN code and 1500
lines of macro code. A block diagram of the program modules is shown
in Figure 3. QED itself contains no chemical information. The
chemical knowledge is stored as postulates in a formal first order
predicate calculus language. The grammar for this language is also
explicitly described in the BNF notation. The PARSER interprets the
postulates and interaction with the user, both for entering questions
and for entering new rules interactively. The QED EXEC handles
opening of files, entry of a molecule, and debugging aids. The
AGENDA EXEC creates, prioritizes, selects, performs, and deletes
tasks. The INFER EXEC selects rules, examines the data base,
instantiates predicates, and interprets the logic. All information,
including postulates, rules, dictionary, instantiations, tasks, etc.,
is stored in an associative relational data base. The ANSWER
EXTRACTOR and FORMATTER communicate the answer to a question in a
form the chemist can understand and that SECS can understand. The
design of the system is very much like the Japanese 5th Generation
Computer System design, which is also based on logic.

[Block diagram. Modules shown, from top to bottom: USER; Token
Scanner and Output Formatter; PARSER, QED EXECUTIVE, AGENDA
EXECUTIVE, INFER EXECUTIVE, TRANSLATOR, and ANSWER EXTRACTION;
ASSOCIATIVE DATA BASE MANAGER; ASSOCIATIVE MEMORY, which holds the
DICTIONARY, POSTULATES, AGENDA ITEMS, and INFERENCES.]

Figure 3. Block diagram of the QED system.

ASCII rule --> PARSER --> Parse tree --> SIMPLIFIER --> Simple tree
--> SEMANTIC ANALYZER --> Proper tree --> OBJECT SYNTHESIS
--> rule in relational form

[The figure also shows the grammar and the dictionary as inputs to
the process.]

Figure 4. Compilation process for rules.

QED Rule Parsing

Since FORTRAN (unlike LISP) cannot easily accept ASCII
representations of rules and use them directly, they must be read,
parsed, analyzed, and translated to the form QED can interpret. The
general flow of the compiler is shown in Figure 4. As an example,
let's follow the processing of the rule "ALPHA-TO-SC" that defines
sites where stereochemical induction may occur:

Rule ALPHA-TO-SC
$All Atom(x) $All Atom(y)
[IF Stereocenter(x) .AND. Alpha(x,y)
THEN Alpha-anisotropic(x)] CF 0.7

The rule is parsed in a top-down fashion (16) using a BNF-driven
parser according to the explicit grammar, a portion of which is shown
in Figure 5. This is a fairly simple context-free grammar, written in
the BNF (Backus Normal Form) style (17). A condensed version of the
parse tree for the sample rule is shown in Figure 6. Semantic
analysis checks for:

- Recursive rules with identical bindings
- Unbound variables
- Variables that are improperly scoped
- Predicates and functions having the incorrect number of
arguments
- Predicates and functions having improper types of arguments
- Quantifiers incorrectly scoped
- Predicates and functions incorrectly defined

Once this has been done, the translation to internal form can be
performed. Finally the rule is added to the axiom data base.
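
To make the read, parse, and analyze sequence concrete, here is a deliberately simplified sketch. It is not the BNF-driven top-down parser used in QED: it swaps in regular expressions, handles only rules whose antecedent is a conjunction of atomic formulae, and performs just one of the semantic checks listed above (unbound variables). All function names and the dictionary layout are assumptions.

```python
import re

RULE_TEXT = """Rule ALPHA-TO-SC
$All Atom(x) $All Atom(y)
[IF Stereocenter(x) .AND. Alpha(x,y)
THEN Alpha-anisotropic(x)] CF 0.7"""

RULE = re.compile(
    r"\s*rule\s+(?P<id>[\w-]+)\s+(?P<quants>.*?)"
    r"\[\s*if\s+(?P<ante>.*?)\s+then\s+(?P<cons>.*?)\]"
    r"\s*cf\s+(?P<cf>[\d.]+)\s*;?\s*",
    flags=re.IGNORECASE | re.DOTALL,
)
QUANT = re.compile(r"\$(\w+)\s+([A-Za-z][\w-]*)\s*\(\s*(\w+)\s*\)")
ATOMIC = re.compile(r"([A-Za-z][\w-]*)\s*\(([^)]*)\)")


def parse_atomic(text):
    """Split 'Name(a,b)' into a predicate name and its argument list."""
    match = ATOMIC.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an atomic formula: {text!r}")
    name, args = match.groups()
    return {"predicate": name, "args": [a.strip() for a in args.split(",")]}


def parse_rule(text):
    """Parse one rule into a plain dictionary (simplified syntax only)."""
    match = RULE.fullmatch(text)
    if match is None:
        raise ValueError("text does not look like a rule")
    conjuncts = re.split(r"\s+\.AND\.\s+", match["ante"], flags=re.IGNORECASE)
    return {
        "id": match["id"],
        "quantifiers": QUANT.findall(match["quants"]),
        "antecedent": [parse_atomic(c) for c in conjuncts],
        "consequent": parse_atomic(match["cons"]),
        "cf": float(match["cf"]),
    }


def unbound_variables(rule):
    """Semantic check: variables used but never introduced by a quantifier."""
    bound = {variable for _, _, variable in rule["quantifiers"]}
    used = set(rule["consequent"]["args"])
    for formula in rule["antecedent"]:
        used.update(formula["args"])
    return used - bound


parsed = parse_rule(RULE_TEXT)
print(parsed["id"], parsed["cf"], unbound_variables(parsed))
```

A grammar-driven parser has the advantage that the language can be extended by editing the grammar alone; the regular expressions here would have to be rewritten for nested formulae or disjunctions.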

<z> ::= <rule> ';' ;
<rule> ::= <ruleid> <quants> <implication> <certfact> ;
<ruleid> ::= ( 'Rule' ! 'rule' ! 'RULE' ) <identifier> ;
<quants> ::= <quanpair> [ <quants> ] ;
<implication> ::= '[' <antecedent> <impsymbol> <consequent> ']' ;
<impsymbol> ::= 'then' ! 'Then' ! 'THEN' ;
<antecedent> ::= [ 'If' ! 'IF' ] <formula> ;
<formula> ::= <and-node> ! <or-node> ! <atomicform> !
              <quanpair> '[' <formula> ']' ;
<and-node> ::= ( <atomicform> ! '(' <or-node> ')' ) <and-op>
               ( <and-node> ! <atomicform> ! '(' <or-node> ')' ) ;

Figure 5. A portion of the BNF grammar for QED's language.

<rule>
    <quants>          $All Atom x
    <quants>          $All Atom y
    <implication>
        <antecedent>
            <conjunction>   Stereocenter y  .AND.  Alpha x y
        <impsymbol>   THEN
        <consequent>  Alpha-anisotropic x
    <CF value>        CF 0.7

Figure 6. Simplified parse tree of the ALPHA-TO-SC rule.

Associative Relational Data Base

Everything in QED is stored in "triples" like in the LEAP
language (18). Each triple consists of an index, an attribute, and a
value. QED maintains pointer lists to entries that have the same
index, attribute, or value so that it can quickly retrieve relations
given any combination of I, A, or V. The triples are stored in QED's
software-implemented virtual memory that is mapped to disk. The
internal form of the ALPHA-TO-SC rule is shown in Table IV.

Table IV. The internal form of the ALPHA-TO-SC rule.

Index  Attribute       Value

1      isa             rule
1      Rule-id         "Alpha-to-SC"
1      son             index # 2
2      isa             quant-description
2      quantifier      $All
2      variable        "x"
2      parent          index # 1
2      son             index # 3
3      isa             quant-description
3      quantifier      $All
3      variable        "y"
3      parent          index # 2
3      son             index # 4
4      isa             inference
4      antecedent-son  index # 5
4      consequent-son  index # 6
4      CF              value 0.7
4      parent          index # 3
5      isa             conjunction
5      formula-son     index # 7
5      formula-son     index # 8
5      parent          index # 4
7      isa             atomic-formula
7      Predicate       "Stereocenter"
7      variable-1      "x"
7      parent          index # 5
8      isa             atomic-formula
8      Predicate       "Alpha"
8      variable-1      "x"
8      variable-2      "y"
8      parent          index # 5
6      isa             atomic-formula
6      Predicate       "alpha-anisotropic"
6      variable-1      "x"
6      parent          index # 4
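
The idea of a triple store with one pointer list per field can be sketched in a few lines. This is an in-memory illustration only; QED's store lives in software-implemented virtual memory, and the class and method names below are assumptions. A handful of triples from Table IV are loaded, with the CF value simplified to a bare number.

```python
from collections import defaultdict


class TripleStore:
    """Index/attribute/value triples with one pointer list per field."""

    FIELDS = ("index", "attribute", "value")

    def __init__(self):
        self.triples = []
        self.pointers = {field: defaultdict(list) for field in self.FIELDS}

    def add(self, index, attribute, value):
        position = len(self.triples)
        self.triples.append((index, attribute, value))
        for field, key in zip(self.FIELDS, (index, attribute, value)):
            self.pointers[field][key].append(position)

    def lookup(self, index=None, attribute=None, value=None):
        """Return triples matching every field that is not None."""
        wanted = dict(zip(self.FIELDS, (index, attribute, value)))
        given = {f: k for f, k in wanted.items() if k is not None}
        if given:
            # Start from the pointer list of one given field, then filter.
            field, key = next(iter(given.items()))
            positions = self.pointers[field].get(key, [])
        else:
            positions = range(len(self.triples))
        return [
            self.triples[p]
            for p in positions
            if all(self.triples[p][self.FIELDS.index(f)] == k
                   for f, k in given.items())
        ]


store = TripleStore()
for triple in [
    (1, "isa", "rule"),
    (1, "Rule-id", "Alpha-to-SC"),
    (4, "isa", "inference"),
    (4, "CF", 0.7),
    (7, "isa", "atomic-formula"),
    (7, "Predicate", "Stereocenter"),
    (8, "isa", "atomic-formula"),
    (8, "Predicate", "Alpha"),
]:
    store.add(*triple)

print(store.lookup(attribute="isa", value="atomic-formula"))
print(store.lookup(index=4, attribute="CF"))
```

Because every field has its own pointer list, a query never has to scan the whole store unless no field is given at all.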

Agenda List Control

QED puts problems to be solved on the Agenda List. Initially the top
goal is the task to find a plan for a molecule. Vital information
such as task type, task pointer, parent, priority, clock tick, item
depth, and the current unification set with truth values is stored as
part of the task, again as associative relations. QED uses
heuristics to prioritize the tasks by examining the number of terms
in a rule and the types of connectives and quantifiers and estimating
the amount of work required to complete the task. Easy tasks and
tasks which may fail early are chosen first in order to truncate
extensive search. We use the agenda list to provide "best first"
control.
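
A best-first agenda of this kind is commonly built on a priority queue. The sketch below is an illustration, not QED's implementation: the Agenda class, the estimate_work heuristic, and its weights are all hypothetical, and the task labels simply reuse rule names from the chapter.

```python
import heapq
import itertools


class Agenda:
    """Best-first agenda: the task with the lowest estimated work is next."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def add(self, task, estimated_work):
        heapq.heappush(self._heap, (estimated_work, next(self._order), task))

    def pop(self):
        _, _, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)


def estimate_work(n_terms, n_quantifiers):
    """Hypothetical heuristic: quantifiers cost more than plain terms."""
    return n_terms + 5 * n_quantifiers


agenda = Agenda()
agenda.add("Connect-apps-for-control", estimate_work(n_terms=3, n_quantifiers=3))
agenda.add("Suggest-control-sc", estimate_work(n_terms=1, n_quantifiers=1))
agenda.add("Connect-to-control", estimate_work(n_terms=2, n_quantifiers=2))

order = [agenda.pop() for _ in range(len(agenda))]
print(order)
```

The counter in each heap entry keeps tasks with equal estimates in the order they were added and avoids comparing the task objects themselves.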

Example Rules

Before we can consider an example application of QED, we need to
present some of the rules we have developed for stereochemical
control in chemical synthesis and then see how they are used in
developing a plan.

Rule Suggest-control-sc
$All Atom (x)
[IF Stereocenter (x) THEN Control-sc (x) ] CF 0.8 ;

Rule Connect-to-control
$All Atom (x) $All Atom (y)
[IF Control-sc (x) .and. Anisotropic (y)
THEN Connect (x,y) ] CF 0.8 ;

Rule Connect-apps-for-control
$All Atom (z) $All Appendage (y) $All Ring (r)
[IF Root-of-appendage (z,y) .and.
Control-sc (z) .and. Atom-of-ring (z,r)
THEN Reconnect-app (y,r) ] CF 0.8 ;

Rule Suggest-control-sc says simply that if there is a stereocenter
at an atom, then it is important to control stereochemistry there.
Connect-to-control gives one way of controlling stereochemistry at a
center, namely to connect that center to another center that is
sterically differentiated. Connect-apps-for-control states that if
the center to be controlled is on a ring and is the root of an
appendage, then it might be a good idea to reconnect the appendage to
the ring to form a new ring. Currently, QED has rules for
reconnection of appendages, removal of stereocenters, making
transannular bonds, breaking appendage bonds, increasing steric
hindrance, and using functional groups.

Example of Analysis

A very simple dialog with QED will be presented for the target
molecule shown below:

[Structure diagram of the target molecule.]

Appendage1 consists of atoms 1-4, Appendage2 consists of atoms
9-12, and Appendage3 consists of atoms 5 and 14-16. The molecule is
initially represented as a standard SECS MOLFILE, i.e., a connection
table. In the QED dialog on this problem in Figure 7, the user's
typing follows each prompt. The molin command reads the connection
table 3app.mol and then converts the molecule to QED logical
predicates, e.g.,
Atom(atom3) TV 100; Atom(atom4) TV 100; Bond(atom3, atom4) TV 100;
Stereocenter(atom4) TV 100; Root-of-appendage(atom4, appendage1) TV
100; Appendage(appendage1) TV 100; etc. Thus the molecule is
represented as a set of premises within QED which are known to be
completely true. The user then asks QED to infer a plan for all
atoms x. Connecting two atoms, Connect(x,y), is one possible part of
a plan.

@QED
- QED -
For commands type HELP

QED: molin 3app.mol
QED: infer 3app-example $All Atom (x) Plan (x)
5 User Infer Request 3app-example Plan
To Be Instantiated by rule "Plan1"

599 Rule "Plan1" - Plan

END OF AGENDA LIST
QED: Lookup Reconnect-app-app
appendage1, appendage2, 75
appendage2, appendage1, 75
QED: Lookup Reconnect-app
appendage1, atom10, 67
appendage1, atom5, 56
QED: Lookup Connect
atom2, atom11, 60
atom1, atom5, 50

QED: What Rule Infers Connect
Rule "where-to-reconn"
QED: Show rule where-to-reconn
(rule is printed out)
QED: Lookup Stereo-center
I don't know the word "Stereo-center"
Please choose one of the following
0) None of the following
1) Stereocenter
:: 1
atom9, 100
atom4, 100
QED: Quit
Thus it is shown.

Figure 7. Sample QED dialog.

To summarize the example, in all cases the non-stereo
appendage, appendage3, was not utilized for reconnection. The truth
value for reconnecting appendage1 to appendage2 was 75, and because
of symmetry of reconnection, appendage2 to appendage1 is also 75.
For reconnection of the appendage to ring atoms: for connection to a
stereocenter, TV = 72; for connection alpha to a stereocenter,
TV = 56; for connection beta to a stereocenter, TV = 50. We will
present the chemical significance of some QED analyses separately
elsewhere.

Conclusion

The multi-valued predicate calculus logic as implemented in QED has
been demonstrated to be suitable for cleanly representing strategic
axioms of chemical synthesis. QED is a powerful tool for exploring
inference in the planning of synthesis strategies. QED helped us
elucidate key strategic concepts and their interdependence and
enabled us to create a consistent rule base. The clarity of the QED
PC language allows anyone to easily read and understand the strategic
principles and may encourage further axiomatization work by others.

Acknowledgment

This research was supported in part by NIH Grants RR01059 and
ES02845, with computational support from SUMEX-AIM, NIH Grant
RR-00785, and support from the users of the SECS synthesis program.

Literature Cited

1. Wipke, W. T. "Computer-Assisted Three-Dimensional Synthetic


Analysis". In Computer Representation and Manipulation of
Chemical Information; Wipke, W. T.; Heller, S. R.; Feldmann, R.
J.; Hyde, E., Eds.; John Wiley and Sons, Inc.: 1974, pp 147-174.
2. Wipke, W. T.; Braun, H.; Smith, G.; Choplin, F.; Sieber, W.
"SECS — Simulation and Evaluation of Chemical Synthesis:
Strategy and Planning"; American Chemical Society: Vol. 61,
1977.
3. Wipke, W. T.; Ouchi, G. I.; Krishnan, S. "SECS: An Application
of Artificial Intelligence Techniques". Artificial Intelligence
1978, 9, 173-193.
4. Corey, E. J.; Wipke, W. T. "Computer-Assisted Design of Complex
Molecular Syntheses". Science 1969, 166, 178.
5. Corey, E. J.; Wipke, W. T.; Cramer, R. D.; Howe, W. J. J. Am.
Chem. Soc. 1972, 94, 421, and adjacent papers
6. Gelernter, H.; Sridharan, N. S.; Hart, H. J.; Yen, S. C.;
Fowler, F. W.; Shue, H. J. "The Discovery of Organic Synthetic
Routes by Computer". Topics Curr. Chem. 1973, 41, 113.
7. Gelernter, H. L.; Sanders, A. F.; Larsen, D. L.; Agarwal, K. K.;
Bovie, R. H.; Spritzer, G. A.; Searlemen, J. E. Science 1977,
197, 1041.
8. Wang, T.; Burnstein, T.; Ehrlich, S.; Evens, M.; Gough, A.;
Johnson, P. Y. "Using a Theorem Prover in the Design of Organic
Synthesis". In Applications of Artificial Intelligence in
Chemistry; Hohne, B.; Pierce, T., Eds.; American Chemical
Society: Washington, D.C., 1986.
9. Wipke, W. T.; Rogers, D. "Artificial Intelligence in Organic
Synthesis. SST: Starting Material Selection Strategies. An
Application of Superstructure Search". J. Chem. Inf. Comput.
Sci. 1984, 24, 71-81.


10. Dolata, D. P. QED: Automated Inference in Planning Organic
Synthesis, PhD dissertation, University of California, Santa
Cruz, 1984.
11. Stoll, R. R. "Set Theory and Logic"; Dover Publications:
1963.
12. Margaris, A. "First Order Mathematical Logic"; Blaisdell
Publishing Company: 1967.
13. Curry, H. B. "Foundations of Mathematical Logic"; Dover
Publications: 1977.
14. Frege, G. "Die Grundlagen der Arithmetik, Eine logisch
mathematische Untersuchung über den Begriff der Zahl";
Marcus, Breslau: 1884.
15. Ackermann, R. "Introduction to Many Valued Logics"; Dover, New
York: 1967.
16. Gries, D. "Compiler Construction for Digital Computers"; John
Wiley and Sons, New York: 1971.


17. Cleaveland, J. C.; Uzgalis, R. C. "Grammars for Programming
Languages"; Elsevier North Holland, New York: 1977.
18. Rovner, P. D.; Feldman, J. A. Massachusetts Institute of
Technology, Lincoln Laboratory, The LEAP Language and Data
Structure, 1968.

RECEIVED December 17, 1985

18

A Self-Organized Knowledge Base


for Recall, Design, and Discovery in Organic Chemistry

Craig S. Wilcox¹ and Robert A. Levinson²,³
¹Department of Chemistry, University of Texas at Austin, Austin, TX 78712
²Department of Computer Science, University of Texas at Austin, Austin, TX 78712
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch018

The design and operation of a system which forms
generalizations about organic chemical reactions and
structures and uses these generalizations to organize the
reactions and structures for efficient retrieval and to
generate precursors to a given target molecule is
presented. Approaches to computer based classificatory
concept formation and organization are discussed. A new
linear notation for organic reactions is described.

The complex professional tasks accomplished by organic chemists are
an intriguing example of intelligent human activity. Organic
chemists organize and recall a vast amount of information. In
ascending order of complexity, the knowledge created and used by the
organic chemist consists of individual observations, conceptual
schemes and generalizations which organize this factual knowledge
base, and, most importantly, procedures which describe how to use
these facts and conceptual schemes to solve a given problem. We are
interested in the ways in which information is organized and used for
problem solving.
Our objective is to design machines which will encode reactions
and structures, will automatically create generalizations based on
these data, and will use these generalizations to organize the data
and to solve the problem of precursor generation. Organic chemists
often use structural features to classify reactions. The capacity to
conceptualize is an indispensable aspect of intelligence. We wished
to determine whether a computer, given a large number of structures
or reactions and a small set of rules, can create useful
generalizations. In designing such a program, we have faced a number
of interesting issues concerning conceptualization in organic
chemistry.
Organic chemistry is a unique theater for AI research because
over the past 150 years organic chemists have created a powerful
graphical knowledge representation scheme. This representation
method is the second language of all organic chemists and supports
tasks ranging from mundane recall of a specific datum to the
generation of highly complex, creative proposals for multi-step
syntheses of previously unknown molecules. Computer science and
organic chemistry have been in comfortable collaboration for the past
25 years.(1-7) A number of important programs have been developed in
that time. The DENDRAL project influenced AI research in
far-reaching ways.(8) Organic chemistry is an enticing arena for AI
research because to a limited but important extent, in the microworld
of the organic chemist, the problem of how to represent knowledge has
been solved. The graphical language shared by all organic chemists
for over a century is a remarkably sophisticated knowledge
representation scheme which is easily adapted to contemporary
techniques in computer science. The organic chemist does use many
concepts (electronegativity, insights from quantum theory, and
spatial relationships between molecular components) which are absent
or are indirectly encoded in his graphical notation. Nevertheless, a
substantial amount of knowledge at the factual level, and a useful
number of higher level concepts, can be expressed as connected
graphs.

³Current address: Board of Studies in Computer Science, University of California, Santa
Cruz, CA 95064
0097-6156/86/0306-0209$06.25/0
© 1986 American Chemical Society
Consider the following list: functional groups, the aldol
reaction, the Paternò-Büchi reaction, carbon-carbon bond formation,
ene reaction, esters, alkenes, elimination, enamines, Claisen
rearrangement, allylic alcohols, halogenation. These words describe
just a few general categories used by chemists to classify reactions
or structures. These categories, some in use for over 100 years, can
be described using organic structural formulae and find daily use in
classifying chemical facts. Computer systems have used such
generalizations (provided by chemists) to guide data organization,
recall, and planning.
The benefits of original machine-calculated generalizations will
be realized when capable conceptualizing programs are available. It
will be shown here that, given structures and reactions and a simple
set of instructions, a computer can indeed discover generalizations,
some of which are equal to the categorizations used by chemists.
While the fact that some discoveries are very similar to known
categories is interesting, it is more important that the computer can
also discover patterns previously unknown to chemists.
In this program the generalizations about reactions and
structures which are discovered by the system are used very much as
man-made generalizations have been used. They organize the data,
they are used during the recall procedure, and they are used to
generate precursors to target structures. We hypothesize that
because only a few chemistry-specific heuristics are used in the
generalization algorithm, this system will have more creative
potential than systems which are more rigidly constructed from many
special rules based on detailed chemical knowledge. In the current
system the answers provided to the precursor generation problem are
naive because we have not yet incorporated a heuristic-based module
to guide precursor selection. Here, as in a child, however, this
naivety is accompanied by the potential to suggest fresh approaches
to solving a problem. The answers are not directed to conform to a
consensus view of correctness. We seek a system of answering
questions, but not a system which provides only expected answers.
The first part of this paper provides an overview of what the
program does and how it works. We present an approach to
representing both structures and reactions as single connected
graphs. We will refer to all such labelled graphs as structural
concepts or simply as concepts. Structural concepts range from the
very general (carbon-carbon single bonds, carbon-oxygen double bonds)
through intermediate size and generality (the aldol reaction, the
pyran ring) up to the most complex real-world instances of molecules
or reactions. By virtue of the graph representation scheme,
reactions and structures, both real and abstract, may be stored in a
single data base.
This system differs in several ways from other approaches to
organic chemistry data base organization. The data organization of
this system is based on machine generated structural concepts rather
than pre-determined screens. The rules which guide the
generalization process will be detailed. The data is hierarchically
self-organized, in a partial ordering proceeding from the smallest,
most general structural concepts (primitives) to the largest and most
specific structures or reactions. The generalizations that are
created aid retrieval and are used in precursor generation.
The idea of a hierarchical organization of knowledge has a history
far predating computer science.(9) (Consider for example the arbor
porphyriana, a "tree of concepts" proposed by Porphyry in the 3rd
century A.D.) We recognize that the hierarchical organization and
manipulation of graphs is a general approach to knowledge processing
and should find application outside of organic chemistry.
In the second part of the paper examples of the system in action
will be given. We feel that because our system uses clearly defined
rules for creating generalizations, it may offer fresh insights and
solutions to problems. Rules for generalization can be
systematically modified. The question of how such modifications
affect the problem solving capabilities of the system is unanswered.
An appendix details the new techniques used in this program. An
efficient algorithm based on a partial ordering allows the recall of
subgraphs, supergraphs and close-matches for any query graph. Some
comparisons will be made of this algorithm with previously used
screen approaches for graph retrieval.

Overview of the System

Reaction Representation. From the outset, this project was shaped by
the graphical form of traditional organic reaction representations:

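[Reaction scheme: an aldol-type reaction drawn in the conventional
way, with the starting materials on the left of the arrow and the
product on the right.]
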
Reactions are invariably written this way, and obviously have a left
hand side and a right hand side. To the beginning organic student,
this format naturally suggests a "before and after" or "cause and
effect" perception of reactions. "If the starting material is
treated in this way, then the product will result." This perception
has influenced the design of some computer programs. Reactions have
been represented either as two related structures or as one structure
and a set of changes required to produce the other structure.


To simplify comparisons between reactions, we sought to describe
entire reactions as a single labeled graph. Just as cause and effect
can be considered either as two separate events or as a unified
process, changing with time, so a reaction can be perceived as two
structures, as shown above, or as a single assembly of nuclei
connected by bonds which change with time. The aldol-type reaction
just illustrated can be rewritten as follows:

[Reaction graph: the aldol-type reaction redrawn as a single labeled
graph with time-variant bonds.]

Note that bonds which are invariant with time are represented in
the usual way. The dotted lines represent bonds which change over
the time course of the reaction event. Each changed bond is labeled
to indicate its bond order before and after the reaction. Obviously,
the unchanging bonds can also be labeled in an identical fashion.
(For example, "1:1" would represent an unchanged single bond.) A
second example of this representation is illustrated in Figure 1.
These formulae are unorthodox only because they contain unusual
types of bonds, bonds which change with time. It is this same
feature which makes the formulae very useful. The single formulae
represent entire reactions and can be stored or manipulated using any
of the methods already devised for the storage of static structures.
We have chosen to use a bond-centered approach to encoding these
graphs. The smallest structural unit is the atom-bond-atom fragment,
and will be referred to as a primitive. Connected networks of these
atom-bond-atom fragments define a molecule or a reaction. These
networks of primitives are node labeled connected graphs and can be
represented as adjacency tables wherein the nodes are labeled with
numbers corresponding to primitives. Finally, these adjacency tables
are stored in files as LISP lists and reside in core as arrays.
Steps followed in the translation of a reaction into a LISP list
structure are illustrated in Figure 1.
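
The bond-centered encoding described above can be sketched in a few
lines. The following Python fragment is only an illustration and is not
the authors' Franz Lisp code: the function names, the tuple layout for
bonds, and the label format (atom pair followed by the bond order
before and after the reaction, as in C-C01) are assumptions modeled on
the labels that appear in Figure 1. Nodes carry label strings here
rather than concept numbers.

```python
from itertools import combinations


def primitive_label(elem_a, elem_b, before, after):
    """Name an atom-bond-atom primitive, e.g. ('O', 'C', 2, 2) -> 'C-O22'."""
    first, second = sorted((elem_a, elem_b))  # assumed canonical atom order
    return f"{first}-{second}{before}{after}"


def encode_reaction(atoms, bonds):
    """Turn a reaction into a node-labeled graph of primitives.

    atoms: dict mapping atom id -> element symbol
    bonds: list of (atom_i, atom_j, order_before, order_after)
    Returns (labels, adjacency), both keyed by node number 1..n.
    Two nodes are adjacent when their bonds share an atom.
    """
    labels = {}
    adjacency = {}
    for node, (i, j, before, after) in enumerate(bonds, start=1):
        labels[node] = primitive_label(atoms[i], atoms[j], before, after)
        adjacency[node] = []
    numbered = list(enumerate(bonds, start=1))
    for (p, bond_p), (q, bond_q) in combinations(numbered, 2):
        if {bond_p[0], bond_p[1]} & {bond_q[0], bond_q[1]}:
            adjacency[p].append(q)
            adjacency[q].append(p)
    return labels, adjacency


def as_nested_list(labels, adjacency):
    """Mimic the list form of Figure 1(e): [n, [label, adj...], ...]."""
    return [len(labels)] + [[labels[n]] + adjacency[n] for n in sorted(labels)]


# Unsubstituted Diels-Alder reaction: butadiene (atoms 1-4) plus ethylene (5, 6).
atoms = {n: "C" for n in range(1, 7)}
bonds = [
    (1, 2, 2, 1),  # double bond becomes single
    (2, 3, 1, 2),  # single bond becomes double
    (3, 4, 2, 1),
    (4, 5, 0, 1),  # new bond
    (5, 6, 2, 1),
    (6, 1, 0, 1),  # new bond
]
labels, adjacency = encode_reaction(atoms, bonds)
print(as_nested_list(labels, adjacency))
```

Because every bond becomes a node, a molecule and a reaction have the
same shape of record, which is what lets both be stored in one data
base.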

Reaction Generalizations Based on Specific Observations. Organic
chemists have long sought to organize their observations. Reactions
represented as connected graphs can be formed into groups on the
basis of common substructures (subgraphs) shared by all the members
of the group. These substructures (subgraphs) are structural
concepts which are more general than the specific reactions from
which they are derived. These structural concepts help to organize
the large numbers of known reactions.
Structural concepts derived from examples of real-world
reactions may have the form of a normal reaction but are not
necessarily good reactions as formulated. For example, most organic
chemists would recognize the following as the generic form of the
Diels-Alder reaction but few chemists expect this exact reaction to
afford a high yield.


(c=:-c1-:»c:-c2=:-c3:-),c2-c4-o-c5-c3,c4»o,c5«o,c1-i.    (b)

NODE #   CONCEPT #     ADJACENCIES

1        11 (C-F11)    2,12
2        6 (C-C12)     1,3,12
3        5 (C-C21)     2,4
4        2 (C-C01)     5,13
5        1 (C-C11)     4,6,7,13
6        9 (C-O22)     5,7
7        16 (C-O11)    5,6,8
8        16            7,9,10
9        9             8,10
10       1             8,9,11,13
11       2             10,12,13
12       5             1,2,11
13       5             4,5,10,11

LISP list structure:

(13 (11 2 12) (6 1 3 12) (5 2 4) (2 5 13) (1 4 6 7 13) (9 5 7)
(16 5 6 8) (16 7 9 10) (9 8 10) (1 8 9 11 13) (2 10 12 13)
(5 1 2 11) (5 4 5 10 11))    (e)

Figure 1. Five representations of the same chemical information.
The canonical chemical reaction graph (a) can be represented in
linear notation (b, see Appendix) or as a bond-centered labeled
graph (c) by using time-variant bonds. The labeled graph affords
an adjacency table (d) and a LISP list representation (e).


To imitate this traditional and human process of generalization,
we use substructure discovery to create general concepts which
organize our data base. This mechanical generalization occurs
whenever a new reaction is entered into the data base and is
accomplished in two stages.
First, for each new reaction (R) added, two generalizations, one
very general and one very specific, are calculated. These
generalizations (subgraphs) of R will be referred to as the minimum
reaction concept (MXC(R)) and the complete reaction concept (CXC(R)),
respectively, and are defined as follows:

MXC(R): A graph which is equal to the smallest
connected subgraph of reaction R which contains
all the changed bonds in that reaction.

CXC(R): A graph made by initializing a set C equal to all
bonds in the MXC(R), adding to C all bonds
adjacent to C, and then continuing to add to C
all bonds which are not carbon-carbon single
bonds and which are adjacent to C until this is
no longer possible.

The relationship between a reaction and its MXC and CXC is more
clearly illustrated in Figure 2. The MXC is a very general statement
about a specific reaction. The CXC is a very specific
"generalization" of that reaction. The value of the MXC is that it
helps to organize the data base; it is used later during retrieval
and comparison of reactions and in the precursor generation
algorithm. The MXC will not contain everything that is necessary for
the reaction to proceed. The value of the CXC is that it will very
likely contain everything required for a successful reaction. The
expected yield of the reaction represented by the CXC is likely to
approach or even exceed the yield of the original reaction.
Obviously the CXC contains much more than is necessary for the
reaction. An organic chemist, if asked to define what was essential
to the success of the original reaction, would probably define a
subgraph larger than the MXC and smaller than the CXC.
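
The two definitions above can be read as a short fixed-point
computation. The sketch below is a hedged Python illustration, not the
system's own code. The bond tuples and function names are hypothetical,
and it assumes that the changed bonds of the reaction are already
connected to one another, so the MXC is simply the set of changed bonds
(in general, unchanged linking bonds would have to be added to make the
subgraph connected).

```python
def shares_atom(bond_a, bond_b):
    """True when two bonds (i, j, before, after) have an atom in common."""
    return bool({bond_a[0], bond_a[1]} & {bond_b[0], bond_b[1]})


def is_cc_single(atoms, bond):
    """True for an unchanged carbon-carbon single bond."""
    i, j, before, after = bond
    return atoms[i] == "C" and atoms[j] == "C" and before == 1 and after == 1


def mxc(bonds):
    """Changed bonds only.

    ASSUMPTION: the changed bonds already form one connected piece, so no
    unchanged linking bonds have to be added to connect them.
    """
    return [bond for bond in bonds if bond[2] != bond[3]]


def cxc(atoms, bonds):
    core = mxc(bonds)
    # Step 1: add every bond adjacent to the MXC, whatever its type.
    grown = core + [
        bond for bond in bonds
        if bond not in core and any(shares_atom(bond, m) for m in core)
    ]
    # Step 2: keep adding adjacent bonds that are not C-C single bonds
    # until a full pass adds nothing.
    changed = True
    while changed:
        changed = False
        for bond in bonds:
            if (bond not in grown and not is_cc_single(atoms, bond)
                    and any(shares_atom(bond, g) for g in grown)):
                grown.append(bond)
                changed = True
    return grown


# Hypothetical toy reaction: one C-C bond is formed between atoms 1 and 2.
atoms = {1: "C", 2: "C", 3: "O", 4: "C", 5: "C", 6: "O", 7: "O"}
bonds = [
    (1, 2, 0, 1),  # changed bond: the MXC
    (2, 3, 2, 2),  # C=O next to the MXC: added in step 1
    (1, 4, 1, 1),  # C-C single next to the MXC: added in step 1
    (4, 5, 1, 1),  # C-C single one bond further out: never added
    (4, 6, 1, 1),  # C-O single reached through atom 4: added in step 2
    (5, 7, 2, 2),  # C=O beyond the excluded C-C bond: not reached
]
print(mxc(bonds))
print(cxc(atoms, bonds))
```

In this toy case the carbon-carbon single bond at atom 4 acts as the
boundary of the CXC, so the carbonyl group beyond it is left out.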
The first stage of generalization begins, then, with calculating
the MXC and CXC of a reaction and adding those graphs to the data
base. A very simple heuristic used here is that generalizations
about reactions will be subgraphs of reactions and will contain all
the changed bonds of the reactions. The reaction itself is next
added, and during that process previously known reactions which are
similar to the new reaction are identified.
The second stage of generalization begins with this list of
similar reactions. If a reaction (RR) on this list contains MXC(R)
(that is, it has the same MXC as the original reaction, R), we
calculate the largest common subgraph of R and RR which contains
MXC(R). This new graph is a specific plausible generalization formed
by comparing R and RR. This process results in identifying
interesting reaction subgraphs of a size larger than an MXC and
smaller than a CXC. An example of the effects of these algorithms
for creating generalizations is illustrated in Figure 3.

Figure 2. MXC(R) and CXC(R) are general statements about the
reaction R and are readily derived by graph manipulations of the
system graph representation.

Figure 3. When a reaction (1) is added to the data base its MXC
(2) and CXC (3) are also added. The reaction is then compared
with other reactions (4) and a maximum common subgraph (5) is
added.

Calculating Generalization Validities. The knowledge base consists
of a large number of structures and reactions from the real world and
an even larger number of structures or transforms which are
calculated based on these original data. These calculated graphs are
generalizations based on the known reactions.
If these generalizations are to be used for problem solving,
then their validity is an important issue. By validity we mean the
probability that the exact reaction represented by the generalization
would work. If the generalization was considered as a reaction, what
would be the yield of that reaction?
Measuring the validity of these generalizations is important
because they are machine generated. In systems which use human
generalizations about reactions to generate precursors, estimated
yields or reliability factors are provided for each generalization.
Our system seeks to automate this approach to machine intelligence,
and a calculational approach to the reliability of generalizations is
required.
The estimation of one type of validity is a task faced by
organic chemists every day. In the process of reviewing research
grants, experts must predict whether proposed reactions, hitherto
unknown, will succeed. To make this judgement, the expert relies in
part on precedent. Previously observed reactions similar to the
proposed reaction lend credence to the proposal. If many reactions
(very similar to the proposed reaction) are known to proceed in high
yield, the validity or likely yield of the new reaction is high. If
similar reactions are known to give low yields, then the proposed
transform is of low validity.
Before precedent can be used to estimate validity, the meaning
of "similar" (as it is applied to reactions) must be defined. It is
not surprising that problems of conceptualization and similarity
arise in the same project. Philosophers have long recognized the
complexity and interdependence of comparison and concept formation.
What makes one reaction a better precedent than another? Can
similarity be quantified and if so can the similarity of a reaction
and a proposed transform be quantified? The ways in which reactions
are similar or dissimilar and the prediction of yields based on
precedent are important questions which deserve further study.
At present, we calculate transform validities (estimated yields)
for a generalization or an unknown reaction as follows.

Let TV(i) = transform validity of i.
Let A(i) = chemical reactivity of i.

Currently chemical reactivity is equal to the number of bonds
which are not carbon-carbon single bonds. This is a crude approach
to estimating the potential reactivity of i. We wish to calculate
TV(r) for a newly discovered transform r based on reactions of
precedent. Let IS(r) = the set of known transforms upon which the
validity of r is to be based, then:


$$
TV(r) = \frac{\sum_{i \in IS(r)} CW(i)^{2}}{\sum_{i \in IS(r)} CW(i)} \qquad (2)
$$

CW(i) is the closeness (similarity) weighted validity of
transform i with respect to the new transform r. If the denominator
in Equation 2 is 0, TV(r) is 0. The constant a determines the
magnitude of the effect of closeness A(r)/A(i) on the calculated
transform validity.
Equations 1 and 2 were intended to produce the following
results. If there are a large number of reactive bonds in the
precedent not in the proposed transform, the closeness weighted
validity of the precedent is small. If there is the same number of
reactive bonds in both reactions, the closeness weighted validity of
the precedent is equal to its yield or known validity. This is an
attempt to encompass the idea that if the precedent has reactivity
similar to the proposed transform, the proposed transform is likely
to work as well as the precedent. If the precedent and the proposed
transform are very different, the precedent is not helpful. A
weighted average of the closeness weighted validities of the
precedents provides the estimated yield for the new transform. The
weighting procedure favors close precedents of high yield. This
follows from the chemist's usual optimism: if there are several
equivalent good precedents, some of high yield and some of low yield,
the proposed transform is judged to have a good chance.
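
The behavior described above can be checked with a small numerical
sketch. In the Python fragment below, the self-weighted average follows
Equation 2 as it is described in the text. The form used for CW(i),
namely TV(i) multiplied by (A(r)/A(i)) raised to the power a, is an
assumption chosen only because it reproduces the two limiting cases
stated above; it should not be read as the authors' Equation 1. All
function and parameter names are hypothetical.

```python
def closeness_weighted_validity(tv_i, a_r, a_i, a=1.0):
    """CW(i) under an ASSUMED form: TV(i) scaled by (A(r) / A(i)) ** a."""
    if a_i == 0:
        return 0.0
    return tv_i * (a_r / a_i) ** a


def transform_validity(precedents, a_r, a=1.0):
    """Sum of CW(i)^2 divided by sum of CW(i); 0 when that sum is 0.

    precedents: list of (TV(i), A(i)) pairs for the transforms in IS(r)
    a_r: reactivity A(r) of the new transform
    """
    cws = [closeness_weighted_validity(tv_i, a_r, a_i, a)
           for tv_i, a_i in precedents]
    denominator = sum(cws)
    if denominator == 0:
        return 0.0
    return sum(cw * cw for cw in cws) / denominator


# Two equally close precedents, one of high yield and one of low yield.
print(transform_validity([(90, 4), (30, 4)], a_r=4))
# A distant precedent (four times as many reactive bonds) counts for less.
print(closeness_weighted_validity(80, a_r=2, a_i=8))
```

With two equally close precedents of 90 and 30, the estimate is 75
rather than the plain mean of 60, which is the optimism the text
describes: weighting each CW(i) by itself pulls the result toward the
high-yield precedent.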
The comparison of "numbers of reactive bonds" crudely measures
similarity. A more appropriate but complex approach would evaluate
similarity in terms of known functional groups or discovered reactive
substructures shared or not shared by two reactions. Both these
approaches to validity estimation are limited because they are
entirely based on structure. The expert will use other factors,
including theoretical considerations, to refine validity.
Validity aids the precursor generation task in a unique way.
Validity can be used to identify situations in which a particular
reaction is not applicable. (Most structures have validity = 100,
but some, like Bredt's rule violators, would have a lower validity.)
Reactions of very low (predicted or known) yield or impractical
structures are called "negative instances". Mitchell uses negative
instances in the learning process to rule out otherwise plausible
generalizations.(12) We use validity to define a continuum from the
most positive to the most negative instances. The mechanism for
precursor generation then automatically uses these negative instances
(structures or reactions of low validity) to block the use of good
generalizations in specific invalidating situations.(13)
All the generalizations calculated from a set of known reactions
are assigned a validity (reliability factor) based on (i) how much
these subgraphs deviate from the known reactions from which they are
derived and (ii) the known yields of these known reactions. This
primitive method of predicting yields based on precedent serves to
illustrate challenges to be met if machines are to acquire reliable
chemical judgement independent of, but consistent with, an expert's
evaluations. The validities calculated here are used to guide the
precursor generation task and provide a means of evaluating proposed
precursors.


The System in Action

Interactive Sessions. The system has been implemented in LISP (Franz
Lisp) and is running on a Digital Equipment Corporation VAX 11/780
at the University of Texas. Interactive sessions with the system are
illustrated in Figures 4-7. (During the development stages of this
project a linear notation was created for reaction input and output.
A brief description of this notation is provided in the Appendix.)
The figures are annotated and little additional comment is
required. Figure 4 illustrates retrieval of a structure and its
supergraphs and subgraphs. Figure 5 illustrates reaction retrieval.
The system is able to use its knowledge to generate precursors
to a target molecule. Two examples are shown (Figures 6 and 7). At
present, the program compares known reactions and generalizations
based on known reactions to the target and chooses to apply reactions
which have the most reactive bonds in common with the target. The
result is that precursors are suggested with little sophistication.
In fairness, it should be emphasized that the data base was generated
from only about 230 reactions, and no generalizing concepts were
provided by the operators. We look forward to testing the system
when it has acquired more knowledge.
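
The selection rule just described, preferring reactions that share the
most reactive bonds with the target, can be illustrated with a
deliberately crude Python sketch. It counts shared bond types with a
multiset intersection and ignores connectivity, so it is a
simplification of the graph comparison the system performs. The data
layout, the function names, and the two toy reactions are hypothetical.

```python
from collections import Counter


def is_cc_single(elem_a, elem_b, order):
    return elem_a == "C" and elem_b == "C" and order == 1


def reactive_bonds(bonds):
    """Count (atom, atom, order) triples that are not C-C single bonds."""
    return Counter(
        (min(a, b), max(a, b), order)
        for a, b, order in bonds
        if order > 0 and not is_cc_single(a, b, order)
    )


def product_side(reaction):
    """Bonds as they exist after the reaction: (atom, atom, order_after)."""
    return [(a, b, after) for a, b, before, after in reaction]


def rank_reactions(target, reactions):
    """Order reactions by the number of reactive bonds shared with the target."""
    wanted = reactive_bonds(target)
    scores = {
        name: sum((wanted & reactive_bonds(product_side(bonds))).values())
        for name, bonds in reactions.items()
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [(name, scores[name]) for name in ranked if scores[name] > 0]


# Toy target with one C=O and one C-O bond (a beta-hydroxy ketone skeleton).
target = [("C", "C", 1), ("C", "C", 1), ("C", "O", 2), ("C", "O", 1)]
reactions = {
    "aldol": [("C", "C", 0, 1), ("C", "O", 2, 1),
              ("C", "O", 2, 2), ("C", "C", 1, 1)],
    "diels_alder": [("C", "C", 2, 1), ("C", "C", 1, 2), ("C", "C", 2, 1),
                    ("C", "C", 0, 1), ("C", "C", 2, 1), ("C", "C", 0, 1)],
}
print(rank_reactions(target, reactions))
```

A ranking of this kind has no notion of whether the shared bonds sit in
the right arrangement, which is consistent with the remark that
precursors are suggested with little sophistication.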

Conclusions

The system described in this paper stores and retrieves reactions and
structures, creates generalizations which further organize the
knowledge base, estimates the validity of these generalizations, and
uses both specific reactions and machine derived generalizations to
generate precursors. We have shown that the representation of
reactions as single labeled graphs is possible based on the idea of a
bond which changes during a reaction, and that this graph
representation simplifies the machine driven act of induction.
Concepts are generated automatically and these concepts organize the
data base, aid in retrieval, and support the precursor-generation
capabilities of the system. A method for calculating the validity
of a given generalization has been devised and methods of refining
these calculations have been identified.
This study examined some unexplored aspects of conceptualization
in organic chemistry. How are classificatory concepts created? Can
the value of a generalization be quantified? Although here these
questions are presented in relation to organic chemistry, they are in
fact basic questions of epistemology and go beyond organic
chemistry.(9)
This program makes generalizations about real-world reactions
and uses these generalizations to generate precursors. M i t c h e l l ' s
approach to conceptualization requires an "instance language" to
represent observations, a "generalization language" to create
concepts, and a "matching predicate" to associate observations with
generalizations.(12,23) Our approach to generalization i n organic
chemistry r e l i e s on a bond-centered labeled graph representations of
reactions and structures (observations). In this language
"more-general-than" i s defined as equivalent to "subgraph-of". We
take advantage of the fact that i n organic chemistry the instance
language and the generalization language are i d e n t i c a l , and matching
predicates are based on graph comparisons.


The system is now ready to process your requests.

There are 1230 concepts.                          [comments]

What would you like to do?                        [several options are available.]

1 = Change the database.
2 = Ask a question.
3 = Go to lisp level input.                       [the user is interested in asking
4 = Save changes.                                  a question about the database.]
5 = Quit
2
Which of the following do you need help with?
1 = Structure or reaction retrieval.
2 = Finding a precursor.
3 = Finding a postcursor.
4 = A multistep synthesis.
5 = Return to previous menu.
1                                                 [structure retrieval or reaction
                                                   retrieval are possible.]
Initiating a query...

Please enter the list of classes:                 [the system asks for a list of
                                                   data types which will restrict
(s)                                                the search. We choose to
                                                   search only structures.]
Please enter the legal substitutions:
nil                                               [no substitutions are allowed]

Type in the structure please:                     [the user enters the following
(c1-c-c-c-c-c-),c1=o.                              structure using a mnemonic system;
                                                   graphic input is not yet available.]
Searching data-base of graphs...

Exact matches: (282)                              [the query structure is known.]
Subgraphs: (4 7 16 49 63 69 86 88 102 137 282 306 539)

Supergraphs: (196 282 296 432 436 484 509 510 515 526 668 669 670 677 678 682 683 684 766
815 816 817 819 828 829 830 831 987 989 991 1164 1183 1192 1193 1194 1225 1226 1227)

Close matches: nil                                [by convention, since supergraphs were
                                                   found, close matches are not sought.]
Number of concepts searched: 16                   [16 concepts were examined to
                                                   find the 51 matches shown above.]
Number of complete node-by-node searches required: 15
                                                  [a complete subgraph isomorphism
                                                   test was required on 13 concepts.]

Going to lisp level input.

To return to this menu type (hi)

-> (show 539)                                     [the user asks to view two subgraphs.
C1-C2-C3..                                         Eventually, graphical output
-> (show 306)                                      will be possible.]
C1-C2-C3-C4-C5-C6.C3=O7..
-> (show 484)                                     [a supergraph is viewed.]
(C1-C2-C3-C4-C5-C6-),C7-C3-C2-C1-O12.C8-C2-C9-C10-C11..

Figure 4. Structural retrieval. Responses provided by the user
are in italics. Annotations are inserted on the right.


What would you like to do?

1 = Change the database.
2 = Ask a question.
3 = Go to lisp level input.
4 = Save changes
5 = Quit

2
Initiating a query...
Please enter the list of classes.                 [this time we are interested in
                                                   only reactions.]
(r)

Please enter the legal substitutions:

nil                                               [no substitutions are allowed]

Type in the structure please:                     [the user wants to see reactions
                                                   which form c-c bonds at the alpha
(c1-c2-c-c-c-c-),c1=o,c2:-c.                       carbon in cyclohexanone.]

Searching data-base of graphs...

Exact matches: nil                                [the exact reaction is not in the
                                                   data base.]
Subgraphs: nil                                    [no known subgraphs.]

Supergraphs: (667 681 826 1224)                   [four known reactions are supergraphs
                                                   of the query.]
Close matches: ((508 7) (814 7) (676 7) (136 4) (1063 3) (105 3) (1057 2) (359 2))
                                                  [concept 508, for example, has a 7 bond
Number of concepts searched: 21                    subgraph in common with the query.]

Number of complete node-by-node searches required: 19

Would you like to add the structure as a new concept? (y-yes)

no                                                [this is one way in which the system
                                                   can learn new concepts.]
Going to lisp level input.
To return to this menu type (hi)
-> (show 826)                                     [the user now examines some
                                                   supergraphs of the query.]
(C1-C2-C3-C4-C5-C6-).(C12+C13+C14+C15+C16+C17+),O7-C1-C6-C8.C6:-C9=:-C10-C11-C12.C11=O18..

-> (com 826)                                      [comments include bibliographic
                                                   information. Yields are stored
(House, H. O. "Modern Synthetic Methods", pp 595-6)
                                                   in a separate file.]

-> (show 1224)

(C1-C2-C3-C4-C5-C6-),(C2-C3-C10-C11-N12-C13:-).O7=C1-C2:-C13:-N12-C15.C1-C6-C8.C6-C9,C13=:O14..

-> (com 1224)

(Corey, E. J., et al. J. Amer. Chem. Soc. 1974, 96, 6516)

Figure 5. Reaction retrieval.


What is the target molecule?

Type in the structure please:                     [the target]

(c1-c2-c3-c-c-c-c7-c-),c1=o,c2-c-c-c,c3-c7.

Adding concept...                                 [the system temporarily adds the target
Searching data-base of graphs...                   to the data base. In this way known
The concept is 1231.                               subgraphs of the target are found together
                                                   with the reactions that will produce them.
The following precursors are suggested:            These reactions are then used to generate
     reaction   validity   size                    precursors.]

1    1216       83         13                     [the table gives five precursors. The
2    11         61         13                      concept used to generate the precursor is
3    236        56         14                      shown with the transform validity of this
4    193        56         13                      application (see text). The last column
5    27         A}         13                      gives the number of bonds in a precursor.]

The precursors are on list 'pre'.                 [the user now views the first three]

-> (view pre 1)

(C1<2-C3-C7-œ-)XC3<4<5-C6-C7-),C3-C2-012-C11-C1CX9..

-> (view pre 2)

(C1 ^2-<^7<cVWC3^4-C5-C6-C7-).C3-C2-C\< 12.C2-C9-C 1(K 11

-> (view pre 3)

(C1^^3^7<8-).(C3<4<5^6^7-)/3^
-> (show 1216)                                    [this is the concept used
(C 1-M3-:-C3:-C6-:-C5MM-:).C5-O7.                  to suggest the first precursor.]
-> (up 1216)
(1219)
-> (up 1219)
(1210)
-> (com 1210)
(Dsot)enW6J Org Chem 1972 37 1212)                [the reference and a reaction from which
-> (show 1210)                                     ... can easily be found.]
<C1-<2<3-:<4^<W1<HXC5<6<7-Ce>^
5-C16..

Figure 6. Precursor generation. Note that overall transforms
may be encoded and applied without restrictions as to the actual
mechanism.


What is the target molecule?

Type in the structure please:

(c1-c2-c3-c-c-c-),c1=o,c3=o,c2-c-c-c*n.

Adding concept...                                 [as in Figure 6, the target is first added
Searching data-base of graphs...                   to the data base, subgraphs of the target
The new concept is added: 1232                     are identified and reactions known to
                                                   generate such subgraphs are applied in a
The following precursors are suggested:            retrosynthetic sense to the target.]
     reaction   validity   size

1    441        77         14
2    11         76         11
3    425        36         13
4    300        20         13
The precursors are on list 'pre'.                 [the user now views three of the
                                                   precursors.]
Going to lisp level input.

-> (view pre 1)

07<3-C4-CT-C6-C2<1-<»-C9-C10»N11.C2-N13-<:i2.C3-ai5>l13-C14..

-> (view pre 2)

(C1-C2-C3-).(C1-C2-C3-C4-C5-C6-).C2-C7-C8-C9*N10..     [a very ...]

-> (view pre 3)

(C1-C2^3-MC1^^3-C4-(^6-^

Figure 7. The capacity to generalize from specific facts is
revealed by the system's ability to provide these precursors.


Following the seminal work of Corey and Wipke, elegant and
powerful programs have been developed to aid the synthetic organic
chemist. These programs use man-made generalizations and special
heuristics to guide the computer to the solution of complex problems.
This project complements these earlier and ongoing efforts. The
limits and utility of machine-made generalizations are our central
interest.

Acknowledgments

Enlightening conversations with Dr. Elaine Rich (Department of
Computer Science, University of Texas) are gratefully acknowledged.
Mr. James Wells wrote the Pascal programs which allow input and
output via mnemonic strings of characters. This research was
sponsored in part by the Robert A. Welch Foundation, Research
Corporation, and NSF (MCS-8122039). Additional support was provided
by a National Science Foundation Graduate Fellowship to RAL.

APPENDIX

Data Organization and Retrieval

A new data base organization for storing and retrieving organic
structures was created for this project. Although this retrieval
system is applied here to chemistry, it is written in a general
manner and is applicable to other graph-based domains. The
organization is based on a partial ordering of graphs by the ordering
relation "subgraph-of". A simple yet powerful retrieval algorithm
has been developed to accompany the partial ordering. These methods
offer an alternative to the scheme used by most retrieval systems,
the screen approach.

The Partial Ordering. Labeled graphs stored in this data base will
be referred to as concepts because they represent structural features
that are useful to consider when reasoning about molecules and
reactions. Both molecules and reactions are represented as labeled
graphs. Those graphs that represent known molecules and reactions
sit near the top of the partial ordering. Primitives (the single
node graphs that represent bonds) form the lowest level of the
partial ordering. As the system evolves, intermediate concepts are
created. These concepts usually represent partial structures (such
as functional groups) or reaction generalizations. The intermediate
concepts are discovered (constructed) by the system to improve
retrieval efficiency and precursor generation. Figure 8 shows a
simple partial ordering. Notice that the concepts in the partial
ordering can be viewed as forming a continuum from general concepts
to more specific concepts.
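
The organization described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' LISP implementation: concepts are modelled as frozensets of bond labels, the subset test stands in for the far costlier subgraph test, and the names build_partial_order, ip, and isucc are hypothetical. Only immediate links are kept, in line with the appendix's remark that transitive edges are not stored.

```python
def build_partial_order(concepts, is_subgraph):
    """Return (ip, isucc): immediate predecessors and successors of each concept.

    concepts    -- dict mapping a concept id to its graph (assumed pairwise distinct)
    is_subgraph -- predicate is_subgraph(a, b), true when a is a subgraph of b
    """
    ids = list(concepts)
    # below[y]: every known concept that is a proper subgraph of y.
    below = {
        y: {x for x in ids if x != y and is_subgraph(concepts[x], concepts[y])}
        for y in ids
    }
    # Keep x as an immediate predecessor of y only if no z lies between them.
    ip = {
        y: {x for x in below[y] if not any(x in below[z] for z in below[y])}
        for y in ids
    }
    isucc = {x: set() for x in ids}
    for y, preds in ip.items():
        for x in preds:
            isucc[x].add(y)
    return ip, isucc


# Toy data (an assumption): frozensets stand in for labeled graphs.
concepts = {
    1: frozenset("a"), 2: frozenset("b"), 3: frozenset("c"),  # primitives
    4: frozenset("ab"), 6: frozenset("bc"),                   # intermediate concepts
    5: frozenset("abc"),                                      # a top-level structure
}
ip, isucc = build_partial_order(concepts, lambda a, b: a <= b)
```

In this toy ordering the top-level concept 5 has immediate predecessors 4 and 6 only; the primitives reach it through those intermediate concepts.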

The Retrieval Algorithm. The retrieval algorithm efficiently tells
the system user how a new concept relates to all other known
concepts. The algorithm solves the following basic problem: Given
an element G and a partial ordering, return the following four sets:

1. The set of elements that are the same as G (exact matches).
2. The set of elements that are predecessors of G (subgraphs).
3. The set of elements that are successors of G (supergraphs).
4. The set of elements that have predecessors in common
   with G (close matches).

Figure 8. A simplified view of the partial ordering. A typical
upward chain is shown.

The algorithm does something more powerful: It finds the immediate
predecessors to G (the largest known subgraphs of G), and the
immediate successors to G (the smallest known supergraphs of G).
This is the key to the algorithm. By finding where G fits in the
partial ordering we find the four desired sets. The algorithm must
minimize the number of comparison operations required to find the
four desired sets. This minimization of comparison operations is
very important in a system that uses complex objects like graphs
since the complexity of these comparisons varies exponentially with
size.(14) The algorithm is easy to implement and searches nodes in a
logical bottom-up order. This may be useful in domains where, for
example, one may wish to apply general concepts or rules to a
situation before more specific ones are found to be applicable.

Details of the Algorithm. The algorithm has two phases. In Phase 1
the immediate predecessors (largest known subgraphs) are found and in
Phase 2 the immediate successors (smallest known supergraphs) are
found. These two phases are enough to answer all four parts of the
query. To understand the algorithm, note that transitive edges
between concepts in the partial ordering are not stored: if a ≤ b (a
is a subgraph of b) and b ≤ c, an edge from a to c is not stored.
IP(y) is the set of immediate predecessors of the data element y and
IS(y) is the set of its immediate successors. These sets are stored
in files as LISP lists, one line per concept. Phase 1 determines
IP(G) where G is the query object.

    S := { }
    While there is an unmarked element y in the
    database such that each member of IP(y) is marked
    T and y has fewer nodes than G:
        If y ≤ G (graph comparison needed)
        Then mark y as T
             S := {S - IP(y)} ∪ {y}
        Else mark y as F.

It can be shown that when this process terminates S = IP(G), the
set of largest known subgraphs of the query graph. When Phase 1
begins, all objects at the bottom of the partial ordering (the
primitives) are compared to G since they have no immediate
predecessors. This process is fast because the bottom of the partial
ordering contains single node graphs for which the comparison
operation is trivial.
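
Phase 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original LISP code: frozensets stand in for graphs, the subset test stands in for the subgraph comparison, and the names phase1, marks, and comparisons are hypothetical. One deliberate deviation is labeled in the code: the size filter admits concepts with the same number of nodes as G, so that a query already stored in the database can be found as its own largest subgraph.

```python
def phase1(query, concepts, ip, is_subgraph, size):
    """Return (S, marks, comparisons); S is the set of largest known subgraphs.

    concepts -- dict: concept id -> graph
    ip       -- dict: concept id -> set of immediate predecessor ids
    """
    marks = {}        # concept id -> True ("T") or False ("F")
    s = set()
    comparisons = 0   # number of graph comparisons actually performed
    progress = True
    while progress:
        progress = False
        for y, graph in concepts.items():
            # Assumption: "no more nodes than G" rather than "fewer nodes than G".
            if y in marks or size(graph) > size(query):
                continue
            # y is eligible only once every immediate predecessor is marked T.
            if not all(marks.get(p) is True for p in ip[y]):
                continue
            comparisons += 1
            if is_subgraph(graph, query):
                marks[y] = True
                s = (s - ip[y]) | {y}   # y supersedes its own predecessors
            else:
                marks[y] = False
            progress = True
    return s, marks, comparisons


# Toy partial ordering (an assumption): frozensets in place of labeled graphs.
concepts = {
    1: frozenset("a"), 2: frozenset("b"), 3: frozenset("c"),
    4: frozenset("ab"), 6: frozenset("bc"), 5: frozenset("abc"),
}
ip = {1: set(), 2: set(), 3: set(), 4: {1, 2}, 6: {2, 3}, 5: {4, 6}}
subset = lambda a, b: a <= b

s, marks, comparisons = phase1(frozenset("abd"), concepts, ip, subset, len)
```

For the toy query only four of the six concepts are ever compared: concepts 5 and 6 are never eligible because one of their predecessors was marked F or left unmarked.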
Phase 2 may be informally described as follows. The goal of
Phase 2 is to calculate IS(G), the immediate successors (smallest
known supergraphs) of G:

    Sequence through the elements of IP(G) in any
    order, chaining up the partial ordering for each
    element. Beginning with the last element of
    IP(G) a breadth-first search is required and if
    an unmarked element y is encountered which has
    been reached from all other elements of IP(G),
    execute:
        If G ≤ y (comparison needed)
        Then S := S ∪ {y}
             (mark all concepts chaining up from y as
             successors without further comparison)
        Else continue breadth-first search from y.
When Phase 2 terminates S = IS(G). All supergraphs of G have
been identified by chaining up from each element of IS(G) as these
are found.
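
A simplified Python sketch of Phase 2 follows. It is not the procedure exactly as described: instead of one interleaved breadth-first search, it intersects the upward chains of the members of IP(G) to obtain the concepts reached from all of them, then tests those candidates from smallest to largest. Frozensets again stand in for graphs, and the names upward_chain and phase2 are hypothetical.

```python
from collections import deque


def upward_chain(start, isucc):
    """All concepts reachable from start via immediate-successor links (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for z in isucc[x]:
            if z not in seen:
                seen.add(z)
                queue.append(z)
    return seen


def phase2(query, concepts, isucc, ip_g, is_subgraph, size):
    """Return (S, supergraphs): smallest known supergraphs and all known supergraphs."""
    # A supergraph of G must lie above every member of IP(G).
    candidates = set(concepts)
    for p in ip_g:
        candidates &= upward_chain(p, isucc)
    s, supergraphs = set(), set()
    for y in sorted(candidates, key=lambda c: size(concepts[c])):
        if y in supergraphs or size(concepts[y]) < size(query):
            continue                      # already inferred, or too small to contain G
        if is_subgraph(query, concepts[y]):
            s.add(y)
            # Everything above y is a supergraph; no further comparison needed.
            supergraphs |= upward_chain(y, isucc)
    return s, supergraphs


# Toy partial ordering (an assumption): frozensets in place of labeled graphs.
concepts = {
    1: frozenset("a"), 2: frozenset("b"), 3: frozenset("c"),
    4: frozenset("ab"), 6: frozenset("bc"), 5: frozenset("abc"),
}
isucc = {1: {4}, 2: {4, 6}, 3: {6}, 4: {5}, 6: {5}, 5: set()}
subset = lambda a, b: a <= b

# The query {a, c} has IP(G) = {1, 3}; only concept 5 lies above both.
result = phase2(frozenset("ac"), concepts, isucc, {1, 3}, subset, len)
```

Processing candidates in order of increasing size guarantees that any concept lying above an already accepted successor is skipped, so only minimal supergraphs enter S.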
Phase 1 and Phase 2 answer parts 1-4 of the query as follows:

1. Exact match: If G already exists in the database,
   then IP(G) = IS(G). G is the single element contained
   in these sets.
2. Subgraphs: The subgraphs are simply all nodes that
   were marked T in Phase 1.
3. Supergraphs: The supergraphs are marked in Phase 2.
   They are the union of the upward chains from each
   member of IS(G).
4. Close matches: The close matches are the union of the
   upward chains from each member of IP(G) (not including
   supergraphs). In the most obvious implementation of
   Phase 2, a hash table is used to manage the breadth-first
   search. It contains information about which
   nodes have been visited and which upward chains they
   are on. The desired union can be found simply by
   collecting elements of the hash table.
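
The four answers can be assembled from the outputs of the two phases roughly as follows. This is a hedged sketch: the function names and the dictionary layout are hypothetical, and removing the subgraphs themselves from the close matches is an assumption made here for tidiness (the text only excludes supergraphs).

```python
from collections import deque


def upward_chain(start, isucc):
    """All concepts reachable from start via immediate-successor links (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for z in isucc[x]:
            if z not in seen:
                seen.add(z)
                queue.append(z)
    return seen


def answer_query(ip_g, is_g, marks, isucc):
    """Combine IP(G), IS(G), and the Phase 1 marks into the four result sets."""
    subgraphs = {c for c, marked_t in marks.items() if marked_t}
    supergraphs = set()
    for s in is_g:
        supergraphs |= upward_chain(s, isucc)
    # G is already known exactly when both phases return the same single concept.
    exact = set(ip_g) if ip_g == is_g and len(ip_g) == 1 else set()
    close = set()
    for p in ip_g:
        close |= upward_chain(p, isucc)
    close -= subgraphs | supergraphs      # assumption: subgraphs are dropped too
    return {"exact": exact, "subgraphs": subgraphs,
            "supergraphs": supergraphs, "close": close}


# Toy links for concepts 1..6 (an assumption; see the set-based ordering sketch).
isucc = {1: {4}, 2: {4, 6}, 3: {6}, 4: {5}, 6: {5}, 5: set()}

# A query whose largest known subgraphs are 1 and 3 and whose only supergraph is 5.
answer = answer_query({1, 3}, {5}, {1: True, 2: False, 3: True}, isucc)
```

When the query is already stored, the exact match appears in the subgraph and supergraph sets as well, which agrees with concept 282 appearing in all three lists in Figure 4.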

Other Chemical Structure Search Systems. Many efficient systems have
been designed to identify graphs in a file that contain a given
substructure. One system is the Cambridge Crystallographic Data
Base.(15) In the Cambridge system the query structure is compared to
every molecule of the database. This means that retrieval time for a
query goes up linearly with the size of the database. Other search
systems alleviate this problem. These systems use a screen
approach.(16-22) The screen approach is an indexing scheme that
includes, associated with each smaller concept of the database, a
list of data items that contain the smaller concept (a list of upward
pointers).
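
The screen idea can be illustrated with a small Python sketch. It is a generic two-level index written for this illustration, not the code of any of the cited systems: the names build_screen_index and screen_search are hypothetical, and frozensets with a subset test stand in for structures and substructure matching.

```python
def build_screen_index(items, screens, contains):
    """Map each screen (small fragment) to the ids of the data items containing it."""
    return {
        name: {i for i, item in items.items() if contains(item, fragment)}
        for name, fragment in screens.items()
    }


def screen_search(query, items, screens, index, contains):
    """Return the ids of all items that contain the query substructure."""
    # Only items carrying every screen present in the query can contain it.
    candidates = set(items)
    for name, fragment in screens.items():
        if contains(query, fragment):
            candidates &= index[name]
    # The expensive full comparison runs on the surviving candidates only.
    return {i for i in candidates if contains(items[i], query)}


# Toy data (an assumption): frozensets in place of molecules and fragments.
contains = lambda big, small: small <= big
items = {10: frozenset("abc"), 11: frozenset("ab"), 12: frozenset("cd")}
screens = {"a": frozenset("a"), "c": frozenset("c"), "d": frozenset("d")}
index = build_screen_index(items, screens, contains)

hits = screen_search(frozenset("ac"), items, screens, index, contains)
```

Here the index narrows three items to one before any full comparison is made; a query that contains none of the screens falls back to comparing against every item.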

Comparisons with the Screen Approach. The algorithm used by screen
systems is a special case of our algorithm; the difference between
the screen approach and this approach is in the number of levels
allowed in the database organization and not in the retrieval
algorithm.
Which organization supports more efficient retrieval in terms of
number of concept comparisons? No absolute conclusion can be
reached, but there are reasons to believe that in general a
multilevel approach may be cheaper overall.(13) First, our approach
tends to search smaller concepts than does the screen system. These
searches will be much cheaper. Second, in Phase 2 of
our system we are able to infer that some graphs are supergraphs of
the query without doing further searching. Finding all subgraphs and
all supergraphs of a query, with precision, is beyond the
capabilities of most screen systems. Finally, experimental evidence
supports our system.

To compare the performance of the multilevel organization
against a two-level one we ran our retrieval algorithm on two data
bases. The first contained molecular structures, discovered molecule
concepts, and primitives, and had 630 concepts altogether. The
second was a version of the first in which all intermediate levels
between primitives and top-level structures had been removed,
leaving just two levels. This database had 521 concepts in all. The
algorithm ran more than twice as fast on the multilevel database,
even though the two-level database contained fewer concepts. The
algorithm produced 33% more answers (subgraphs and supergraphs) when
running on database 1 than on database 2.

Linear Notation for Reactions and Structures. To assist in the
development of this program a new linear grammar was developed to
describe reactions and structures (Figure 1). A program written by
Mr. James Wells at the University of Texas accepts alphanumeric
strings created by the chemist. From these strings, which represent
structures or reactions, the Pascal program generates a connectivity
table of the sort used in the Cambridge Crystallographic database.
The connectivity files are transferred to the main LISP program which
creates the LISP structure lists shown in Figure 1.

The grammar for reactions and structures is easily mastered by
the organic chemist. The following symbols are used:

- ; single bond
= ; double bond
* ; t r i p l e bond
+ ; delocalized double bond

Other than these symbols, the chemist needs to remember only two
rules: (i) rings are encoded in parentheses wherein the last atom is
followed by a bond which connects it to the first atom in the
parenthetical expression, and (ii) atoms at branching points must be
numbered. Linear or cyclic strings are separated by commas.
Hydrogens are ordinarily ignored. Thus cyclopentane is encoded as
(c-c-c-c-c-) and sec-butanol as c-c1-c-c,c1-o. A menu is available
which contains commonly used structures which can be used in an
abbreviated form to define molecules. The t-butyldimethylsilyl ether
derived from n-propanol can be represented as *tbs*-o-c-c-c. Further
examples of representations based on this system are shown in Figures
4-7.

The chemist can encode a structure in many ways and, provided
the representation follows the above rules, each alphanumeric string
will generate a proper connectivity file. For example,
"(c-c-c-c1-c-c-c-c2-),c1-c2" or "(c1-c2-c-c-c-),c1-c-c-c-c2" are both
proper representations of 3.3.0-bicyclooctane. IUPAC numbering can
be followed or the numbering can be arbitrary.
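
The two rules above are enough to turn a string into a connectivity table. The Python sketch below is an illustration written for this purpose, not Mr. Wells's Pascal program: the name parse_structure and the output layout are hypothetical, only lower-case element symbols are handled, and menu abbreviations such as *tbs* are not supported.

```python
import re

BOND_TYPE = {"-": "single", "=": "double", "*": "triple", "+": "delocalized"}
# An atom (element letters plus optional number), a bond symbol, or punctuation.
TOKEN = re.compile(r"([a-z]+)(\d*)|([-=*+])|([(),])")


def parse_structure(text):
    """Return (atoms, bonds) for a structure string such as 'c-c1-c-c,c1-o'."""
    text = text.replace(" ", "").lower()
    atoms = []         # element symbol of each atom; list index = atom id
    labels = {}        # numbered atom such as 'c1' -> atom id
    bonds = []         # (atom id, atom id, bond type)
    prev = pending = ring_first = None
    in_ring = False
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        pos = m.end()
        element, number, bond, punct = m.groups()
        if element:
            key = element + number
            if number and key in labels:
                idx = labels[key]              # a numbered atom seen before
            else:
                idx = len(atoms)
                atoms.append(element)
                if number:
                    labels[key] = idx
            if prev is not None and pending is not None:
                bonds.append((prev, idx, BOND_TYPE[pending]))
            if in_ring and ring_first is None:
                ring_first = idx               # remember where the ring starts
            prev, pending = idx, None
        elif bond:
            pending = bond
        elif punct == "(":
            in_ring, ring_first, prev, pending = True, None, None, None
        elif punct == ")":
            # Rule (i): the trailing bond closes the ring onto its first atom.
            if pending is not None and prev is not None and ring_first is not None:
                bonds.append((prev, ring_first, BOND_TYPE[pending]))
            in_ring, ring_first, prev, pending = False, None, None, None
        else:                                  # a comma starts a new string
            prev, pending = None, None
    return atoms, bonds


atoms, bonds = parse_structure("c-c1-c-c,c1-o")   # sec-butanol
```

Both encodings of the bicyclooctane given in the text parse to eight atoms and nine bonds, although the atoms are numbered differently.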


Reaction graphs are encoded in the same way as static
structures. Bonds which change during the reaction are coded as
"x:y" where x is the bond type before the reaction and y is the bond
type after the reaction. Thus "c-c=:-c" represents the reduction of
propene to propane and "(c-o:-c1-c-c-),c1-:i" represents the
formation of tetrahydrofuran and an iodine atom from
4-iodobutan-1-ol.
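
The "x:y" convention can be made concrete with a short Python sketch. It is an illustration only: the helper names split_bond and changed_bonds are hypothetical, an empty string is used here to mean "no bond", and the second function handles unbranched, acyclic strings only.

```python
import re


def split_bond(token):
    """Split a bond token into (before, after); '' means no bond is present."""
    if ":" in token:
        before, after = token.split(":")
        return before, after
    return token, token            # an unchanged bond is the same on both sides


def changed_bonds(chain):
    """Return (atoms, changes) for an unbranched string such as 'c-c=:-c'.

    Each change is (left atom index, right atom index, bond before, bond after).
    """
    parts = re.findall(r"[a-z]+\d*|[-=*+:]+", chain.replace(" ", ""))
    atoms, bonds = parts[0::2], parts[1::2]
    changes = []
    for i, token in enumerate(bonds):
        before, after = split_bond(token)
        if before != after:
            changes.append((i, i + 1, before, after))
    return atoms, changes


atoms, changes = changed_bonds("c-c=:-c")   # reduction of propene to propane
```

The only changed bond in the propene example joins the second and third carbons and goes from double to single; a token such as ":-" denotes a bond that is formed and "-:" one that is broken.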
A second program accomplishes the reverse process and will
generate from a connectivity file an alphanumeric representation of
molecules or reactions based on this linear notation. While we
recognize the need for a graphical interface for the main AI program,
we are enthusiastic about the efficiency of this linear grammar.
This linear notation should be adaptable to use in any application
dealing with connected graphs.

Literature Cited

1. Lederberg, J.; Sutherland, G. L.; Buchanan, B. G.; Feigenbaum,
E. A.; Robertson, A. V.; Duffield, A. M.; Djerassi, C. J. Amer.
Chem. Soc. 1969, 91, 2973.
2. Corey, E. J.; Wipke, W. T. Science 1969, 166, 178.
3. Brandt, J.; Friedrich, J.; Gasteiger, J.; Jochum, C.; Schubert,
W.; Ugi, I. In "Computer Assisted Organic Synthesis"; Wipke, W.
T.; Howe, W. J., Eds.; ACS Symposium Series No. 61, American
Chemical Society: Washington, D.C., 1977; pp. 33-59.
4. Wipke, W. T.; Braun, H.; Smith, G.; Choplin, F.; Sieber, W. In
"Computer Assisted Organic Synthesis"; Wipke, W. T.; Howe, W.
J., Eds.; ACS Symposium Series No. 61, American Chemical
Society: Washington, D.C., 1977; pp. 97-125.
5. Hendrickson, J. B. J. Amer. Chem. Soc. 1971, 6844-6862.
6. "Computer Assisted Organic Synthesis"; Wipke, W. T.; Howe, W.
J., Eds.; ACS Symposium Series No. 61, American Chemical
Society: Washington, D.C., 1977.
7. Wipke, W. T.; Rogers, D. J. Chem. Info. Comp. Sci. 1984, 24,
71-80.
8. Lindsay, R. K.; Buchanan, B. G.; Feigenbaum, E. A.; Lederberg,
J. "Applications of Artificial Intelligence for Organic
Chemistry"; McGraw-Hill: New York, 1980.
9. Reidl, R. In "Biology of Knowledge"; John Wiley and Sons: New
York, 1984.
10. "Computer Representation and Manipulation of Chemical
Information"; Wipke, W. T., Ed.; John Wiley and Sons: New York,
1974.
11. Michalski, R. S.; Stepp, R. E. "Learning from Observation:
Conceptual Clustering" In "Machine Learning: An Artificial
Intelligence Approach"; Michalski, R. S.; Carbonell, J. G.;
Mitchell, T. M., Eds.; Tioga Press, 1983.
12. Mitchell, T. M.; Utgoff, P. E.; Banerji, R. "Learning by
Experimentation: Acquiring and Refining Problem Solving
Heuristics" In "Machine Learning: An Artificial Intelligence
Approach"; Michalski, R. S.; Carbonell, J. G.; Mitchell, T. M.,
Eds.; Tioga Press, 1983.
13. Levinson, R. A. Ph.D. Thesis, University of Texas at Austin,
Austin, 1985.
14. Tarjan, R. E. In "Algorithms for Chemical Computations";
Christoffersen, R. E., Ed.; American Chemical Society:
Washington, D.C., 1977; pp. 1-20.
15. Allen, F. H.; Bellard, S.; Brice, M. D.; Cartwright, B. A.;
Doubleday, A.; Higgs, H.; Hummelink, T.; Kennard, O.;
Motherwell, W. D. S.; Rodgers, J. R.; Watson, D. G. Appl. Cryst.
1979, 35, 2331-2339.
16. Adamson, G. W.; Cowell, J.; Lynch, M. F.; McLure, H. W.; Town,
W. G.; Yapp, M. A. J. Chem. Doc. 1973, 13, 153-157.
17. Bawden, D. J. Chem. Inf. Comp. Sci. 1983, 23, 14-22.
18. Dittmar, P. G.; Farmer, N. A.; Fisanick, W.; Haines, R. C.;
Mockus, J. ibid. 1983, 23, 93-102.
19. Feldman, A.; Hodes, L. ibid. 1975, 15, 147-151.
20. Fugmann, R.; Kusemann, G.; Winter, J. H. Info. Process. Mgmt.
1979, 15, 303-323.
21. O'Korn, L. J. In "Algorithms for Chemical Computations";
Christoffersen, R. E., Ed.; American Chemical Society:
Washington, D.C., 1977; pp. 122-148.
22. Willett, P. J. Chem. Inf. Comp. Sci. 1980, 20, 93-96.
23. Mitchell, T. M. Artificial Intelligence 1982, 18, 203-226.

RECEIVED December 17, 1985

19

Expert-System Rules for Diels-Alder Reactions

C. Warren Moseley, William D. LaRoe, and Charles T. Hemphill


Texas Instruments Inc., Dallas, TX 75265

Expert systems of today are powerful when used in the proper domains.
Unfortunately, the most difficult part of applying these systems is the
structuring of knowledge into rule format. This paper describes methods
developed which allow the capture of Diels-Alder reaction knowledge into
simple and elegant expert system rule format. Essential components of the
system include: a grammar for matching the input molecular structure expressed
in Wiswesser Line Notation (WLN), the unification of many reactions into a
single generalized mechanism using synthon template patterns, use of WLN
rules to produce valid synthons, and use of frontier molecular orbital theory
(FMO) to verify the disconnection. This system is implemented in Prolog,
whose natural backtracking and generation capabilities easily express and
produce the many structural combinations possible.

There have been attempts to apply formal methods to the representation of organic
compounds [1],[2], some attempts to apply artificial intelligence to organic synthesis
[3],[4], and numerous attempts to use molecular orbital calculations to
verify the validity of compounds in the synthesis route. This effort was a
modest attempt to examine the representation issues involved in writing production
rules for Diels-Alder disconnections.
The disconnection approach [5] is adopted in this work because it is amenable
to backward chaining systems. The starting point is the target compound, which is, in
this case, a Diels-Alder product. The target compound is broken or disconnected into
two distinct parts called synthons. The synthons are the ideal representations of the
actual reactants used to produce the target compound. Synthons embody the physical
properties of the actual compounds they represent.
As an initial implementation approach, rules could consist of specific targets and a
list of their synthons. No one uses this method because the naive approach of expressing
every possible chemical disconnection is impracticable: the number of rules needed to
express even trivial synthetic routes grows exponentially. Any expert system solution to
the synthesis problem must attack two fundamental problems: the variety of functional
groups which may participate in a given reaction and the symmetry involved between
functional groups in a reactant (intra-synthon and inter-synthon functional group
interaction, respectively). The thrust of this research has been to capture the reaction
routes for a chemical disconnection in a clear, symbolic notation which accommodates
qualitative reasoning with functional groups and which comprehends the symmetry of this
problem.

0097-6156/86/0306-0231$06.00/0
© 1986 American Chemical Society

Ideally, an implementation language would support symbolic and linguistic
approaches to representation and manipulation, a qualitative approach to verification, and
a deductive approach to disconnection. Prolog [6] is a symbolic language which directly
supports backward chaining deduction. Viewed as a declarative language it naturally
supports elegant grammar formalisms and its procedural aspects support qualitative
reasoning. For these reasons, Prolog was chosen as the implementation language for
this project.

In summary, the following research goals are addressed in this effort:

1. A linguistic approach to the representation of chemical information.

2. Use of molecular orbital theory to qualitatively validate derived synthons.

3. Unification of synthetic disconnections into a general form.

4. Use of symbolic structure rearrangement in WLN.

2 Grammar Rules for Structure Recognition


The Definite Clause Grammar (DCG) formalism [7] is utilized throughout this project.
Grammar rules are used in the expert system rules to recognize the general class of
the parent molecule in the disconnection (e.g., cyclohexene). The class determines the
patterns used to construct the resultant synthons (discussed in Section 4).

2.1 Background for WLN and DCG

Many researchers have recognized the importance of having an unambiguous grammar
for chemical notation, but they have mainly applied WLN [8] to on-line compound
search [9] and structural summary (identification of common structural features) [10].
Johns and Clare point out that it is a linguistic rather than merely a symbolic notation.
This means that the symbols are represented and manipulated in well defined structures.
This section relies on the unambiguousness of WLN to recognize parent molecules while
Section 5 relies on the WLN rules to actually manipulate symbol structures.
The DCG formalism is based on first order predicate logic and provides a clear
and powerful method for describing languages. The formalism generalizes the Context
Free Grammar (CFG) formalism and DCG grammars may be efficiently executed. DCG
is most often implemented through a translation process from the DCG notation to a
top-down, left-to-right, backtracking Prolog program. This program becomes a parser
for the language specified by the DCG.
The required amount of work at each step in a backtracking parser is exponential
in the number of constituents already found, just for recognition. This occurs because
intermediate effort, which could become useful later, is not saved. Of course, classes of
grammars exist for which this behavior does not occur. Most programming language
grammars are carefully written to avoid exponential behavior. However, parsing
algorithms exist (e.g., the active chart parser [11]) where the worst case parsing time is
O(n^3) for any CFG grammar and O(n^2) when the grammar is unambiguous (n is the
sentence length). Nevertheless, Prolog provides an adequate DCG grammar parsing
mechanism for the purposes of this work.

2.2 Grammar for Diels-Alder Reactions

This section examines grammars used to recognize parent molecules (carbocyclic rings
for example).
The following regular expression [12] recognizes cyclohexene:

L6UTJ [Aσ_A] [Bσ_B] [Cσ_C] [Dσ_D] [Eσ_E] [Fσ_F]

where if r is any regular expression, [r] is an abbreviation for (ε + r) (in other words,
r is optional), ε is a regular expression that denotes the empty string, and '+' is the
union operator for the languages represented by the regular expression arguments. The
symbol 'σ' represents an arbitrary substituent, with the subscripts indicating to which
ring locant the substituent belongs.
Using DCG, the more general class of carbocyclic rings can be recognized. The
grammar rule
carbocyclic(Substituents, Number) --> "L", number(Number), "U", "T", "J",
    substituents(Substituents, Number).

achieves the desired result. Within this rule the logical variables are denoted by a
leading capital letter. This declaratively states that carbocyclic rewrites into the letter
"L", followed by a number (which in turn is recognized by DCG grammar rules), followed
by the letters "UTJ", followed by the substituents. The substituents rule recognizes
the Substituents at each ring locant and uses the instantiated value for Number to
verify that the ring locant values are within the proper range. Subsequent steps in the
disconnection process utilize the variables mentioned in the head of the rule.
Finally, using the grammar rule described above (and related rules not presented),
the goal

carbocyclic(S, N, "L6UTJ A1 BNW F3", [])

rewrites the string "L6UTJ A1 BNW F3" into the empty list [] (meaning that the entire
string is recognized) and produces the result

S = [[A,1],[B,N,W],[F,3]], N = 6.

S is a list of ring locants and the corresponding substituents used in subsequent
disconnection stages. N represents the number of ring locants.
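The effect of this goal can be imitated outside Prolog. The following Python sketch is an illustration only and is not the authors' DCG: the function name parse_carbocyclic is hypothetical, and the sketch assumes that substituents are separated by single spaces and that each one begins with its ring locant letter.

```python
import re

def parse_carbocyclic(wln):
    """Recognize 'L<n>UTJ' followed by space-separated locant substituents.

    Returns (substituents, ring_size), or None when the string is rejected.
    """
    match = re.fullmatch(r"L(\d+)UTJ((?: [A-Z][A-Z0-9&]*)*)", wln)
    if match is None:
        return None
    ring_size = int(match.group(1))
    substituents = []
    for token in match.group(2).split():
        locant, symbols = token[0], token[1:]
        # A locant must lie on the ring: A..F for a six-membered ring.
        if ord(locant) - ord("A") >= ring_size:
            return None
        substituents.append([locant] + list(symbols))
    return substituents, ring_size

print(parse_carbocyclic("L6UTJ A1 BNW F3"))
```

The returned pair plays the role of the S and N bindings above, with each substituent stored as a list headed by its locant. Unlike the Prolog parser, this sketch does no backtracking.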

2.3 Application to Other Reactions

The general grammars and the mixture of declarative and procedural Prolog code allow
easy grammar rule writing for other reactions. As an additional example, consider
heterocyclic rings. The grammar rule

heterocyclic(Substituents, Number, Heteroatom) --> "T", number(Number),
    heteroatom(Heteroatom), "J", substituents(Substituents, Number).

recognizes this class of molecules.


The following grammar rule recognizes the heteroatom:

heteroatom(Heteroatom) --> [Heteroatom], {member(Heteroatom, "NOS")}.

Curly braces allow direct inclusion of Prolog terms within DCG grammars (the terms are
not translated). In this case, the member predicate tests the value of the Heteroatom
variable for membership in a list of heteroatoms.

3 The Reaction Check


This system covers concerted reactions of the π electron systems on two reactants to form
new σ bonds yielding carbocyclic rings with a single unsaturation. If the reaction follows
the rule of maximum orbital overlap, then it is a suprafacial, suprafacial process and is
termed a [π4s + π2s] reaction. By the Woodward-Hoffmann rules this is a symmetry-allowed
thermal reaction [13].


The theoretical underpinnings used in this program are derived from those used
by Jorgensen et al. in the CAMEO system [14], [15], with the exception that our system
works backwards: starting from a product, it either finds the reactants which form it or
issues a statement informing the user that a disconnection is not possible.

3.1 Basic Frontier Molecular Orbital Theory

It is known from molecular orbital theory that molecules possess sets of individual
molecular orbitals (as long as the molecules are sufficiently far apart from each other).
These are the basic unperturbed molecular orbitals used in the evaluation of the reaction.
As the molecules move more closely together, their orbitals begin to overlap. This
interaction between the orbitals on the different molecules results in the mixing of the
orbitals on each molecule [13].
According to frontier molecular orbital theory, the strongest interactions are between
those orbitals that have coefficients with similar magnitudes relative to the unperturbed
molecules, i.e., the interaction is between the small coefficient on the dienophile
and the small coefficient on the diene [16], [17].
If both of the molecular orbitals involved in the bonding are filled, the resulting
orbital is not significantly reduced in energy [18]. The greatest reduction in energy
arises in the interaction between a filled molecular orbital and an empty one. Since
the interaction is strongest between the orbitals of like energy, the ideal combination
of orbitals is between the highest occupied molecular orbital (HOMO) on one molecule
and the lowest unoccupied molecular orbital (LUMO) on the other.
Although Diels-Alder reactions can occur in the unsubstituted case, the reaction
is most successful when the diene and the dienophile contain substituents which exert
a favorable electronic influence [19]. In the normal electron demand case, the most
favorable interactions are between dienes with electron-donating groups and dienophiles
with electron-withdrawing groups. Cases have been reported in which inverse electron
demand occurs and the electronic natures of the diene and dienophile are reversed [20],
[21], [22]. This case of inverse electron demand is accounted for in the system.


3.2 Structural Constraints on Reactants

It became necessary early on in the project to develop a method for quickly checking the
reactants for structural features which would make them unsuitable for the Diels-Alder
reaction. The constraints are integrated into the notation package, since they are most
easily recognized in terms of the notation patterns resulting from the disconnection. The
synthons produced by a Diels-Alder disconnection are checked for proper configuration.
All synthons are checked before the FMO algorithm begins; if any synthon violates a
constraint, program execution fails and a "no" is returned to indicate no reaction. This
ensures that synthons produced by the rules are actually reactive.
The following structural features of diene-synthons are considered unreactive in
[π4s + π2s] cycloadditions:

1. Any diene-synthon unable to have an s-cis conformation.

2. Diene-synthons in which an exocyclic double bond is conjugated to a double bond
in the ring (e.g., a double bonded substituent on the diene).

3. Diene-synthons in large (greater than 7-membered) rings.

4. Acyclic compounds that have bulky substituents at the central positions on the
diene-synthon. The substituents at these positions are relatively close to each
other, and bulk leads to steric hindrance.

5. Substitution at both terminal diene-synthon positions is allowed only if the
substituent is a primary atom or a triply bonded functional group (such as a cyano
group).

All double bonds are perceived as possible dienophile synthons by the notation
package. The screening involves only the elimination of all double bonds in aromatic
compounds (WLN symbol "R").

3.3 Basic HOMO-LUMO Calculations

From work performed in 1983 by Burnier and Jorgensen [15], the following ab initio
calculations for the HOMO and LUMO energies of the synthons were developed. The
function n(x, parent) returns the number of atoms of type x in the parent. This
function is abbreviated below as simply n(x) where the parent is understood. The
symbols UU, O, N, S represent triple bonds, oxygen, nitrogen, and sulfur, respectively.
The subscripts 'c' and 't' denote central and terminal locations respectively in the
parent for the elements which they modify. For brevity, the terms diene-synthon and
dienophile-synthon will be replaced with diene and dienophile respectively.
For Dienes:

E_{HOMO} = -2n(O) - n(UU) - 0.2n(N_c) - 0.5n(S_t) - n(S_c) - 9.0    (1)

E_{LUMO} = -n(O) - 0.5n(N) - 2n(S_t) + 1.5n(S_c) + 0.6    (2)

For Dienophiles:

E_{HOMO} = -n(UU) - 4n(O) - 2n(N) - n(S) - 10.5    (3)

E_{LUMO} = n(UU) - n(O) - 0.5n(N) - 4n(S) + 1.8    (4)
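Equations 1 through 4 are simple linear counts, so they can be written down directly. The Python sketch below is illustrative and is not taken from the authors' Prolog system: the function and parameter names are hypothetical, the energies are assumed to be in eV, and Equation 1 is read with the subscripts N_c, S_t and S_c and Equation 2 with S_t and S_c.

```python
def diene_homo(n_O=0, n_UU=0, n_N_central=0, n_S_terminal=0, n_S_central=0):
    """Equation (1): HOMO energy of an unsubstituted diene parent."""
    return (-2.0 * n_O - n_UU - 0.2 * n_N_central
            - 0.5 * n_S_terminal - n_S_central - 9.0)

def diene_lumo(n_O=0, n_N=0, n_S_terminal=0, n_S_central=0):
    """Equation (2): LUMO energy of an unsubstituted diene parent."""
    return -n_O - 0.5 * n_N - 2.0 * n_S_terminal + 1.5 * n_S_central + 0.6

def dienophile_homo(n_UU=0, n_O=0, n_N=0, n_S=0):
    """Equation (3): HOMO energy of an unsubstituted dienophile parent."""
    return -n_UU - 4.0 * n_O - 2.0 * n_N - n_S - 10.5

def dienophile_lumo(n_UU=0, n_O=0, n_N=0, n_S=0):
    """Equation (4): LUMO energy of an unsubstituted dienophile parent."""
    return n_UU - n_O - 0.5 * n_N - 4.0 * n_S + 1.8

# All-carbon parents fall back to the constant terms of the equations.
normal_gap = dienophile_lumo() - diene_homo()    # diene HOMO with dienophile LUMO
inverse_gap = diene_lumo() - dienophile_homo()   # dienophile HOMO with diene LUMO
print(normal_gap, inverse_gap)
```

For the all-carbon parents the normal-demand gap (about 10.8) is smaller than the inverse-demand gap (about 11.1), which matches the text's description of normal electron demand as the usual case.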


In the carbocyclic ring case, the HOMO-LUMO values default to the constants at
the end of the equations. The formulas above are used to compute the orbital energies
(both HOMO and LUMO) of the unsubstituted parent compounds. In the case of
substituted compounds, additional formulas account for the electronic effects of the
substituents.
The explanation of the regiospecificity of Diels-Alder reactions requires knowledge
of the effect of substituents on the coefficients of the HOMO and LUMO orbitals. In
the case of normal electron demand, the important orbitals are the HOMO on the
diene and the LUMO on the dienophile. It has been shown that the reaction occurs
in a way which bonds together the terminal atoms with the coefficients of greatest
magnitude and those with the coefficients of smaller magnitude [18]. The additions
are almost exclusively cis and, with only a few exceptions, the relative configurations of
substituents in the components are retained in the products [19].
It is known that the effects of substituent groups on a diene or dienophile vary
between different types of parents [23]. A function, τ(Y), has been determined for several
functional groups, with Y corresponding to their electron donating or withdrawing
capability such that a reasonable estimate of the HOMO energy could be obtained by
use of the equation [15]:

E_{HOMO} = γ(P) τ(Y) + E_{HOMO}(P)    (5)

This equation yields a value for the substituted molecule where γ(P) is the sensitivity
of the parent P. Some initial values, called τ values, which describe the electronic
effects of functional groups have been found and developed by Jorgensen et al. Hydrogen
was assigned a τ of 0.0 eV so that electron withdrawing substituents have negative
τ values and electron donating groups have positive τ values. The values for τ were
chosen so that a 0.5 eV change in the substituent gives a change of 10 in the τ value.
This algorithm, when combined with the notation rules, yields useful results for many
functional groups and gives reasonable estimates of the values for those not known. The
factor γ(P) for an ethylene analog is given by:

γ(P) = 0.01n(UU) + 0.06n(O) + 0.03n(N) + 0.03n(S) + 0.05    (6)

For any given diene the value for γ(P) can be adequately represented by the value
0.03 eV. This provides the proper value for correction in the calculation due to the
sensitivity of the parent compound towards different types of functionality.
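A small numeric sketch may help fix the roles of the two parameters. It assumes that Equation 5 is read as the product of the parent sensitivity and the substituent value, added to the parent energy; this reading is an assumption, chosen because it reproduces the stated scale (a τ change of 10 gives 0.5 eV for ethylene, whose sensitivity from Equation 6 is 0.05). The function names are hypothetical, and the amino value of 36 is the entry listed in Table 1.

```python
def gamma_ethylene_analog(n_UU=0, n_O=0, n_N=0, n_S=0):
    """Equation (6): sensitivity of an ethylene-analog parent."""
    return 0.01 * n_UU + 0.06 * n_O + 0.03 * n_N + 0.03 * n_S + 0.05

# The text gives a single representative sensitivity for any diene.
GAMMA_DIENE = 0.03

def substituted_homo(parent_homo, gamma, tau):
    """Equation (5), read here as gamma(P) * tau(Y) + E_HOMO(P)."""
    return gamma * tau + parent_homo

# Ethylene parent (E_HOMO = -10.5 from Equation 3) with an amino group (tau = 36).
ethylene_amino = substituted_homo(-10.5, gamma_ethylene_analog(), 36)
# Butadiene parent (E_HOMO = -9.0 from Equation 1) with the same group.
butadiene_amino = substituted_homo(-9.0, GAMMA_DIENE, 36)
print(ethylene_amino, butadiene_amino)
```

Under this reading the electron-donating amino group raises the ethylene HOMO by about 1.8 eV and the butadiene HOMO by about 1.1 eV.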

3.4 Determination of Substituent Effects

To determine substituent effects, substituent groups are built from primary recognized
atoms and functional groups. A functional group is scanned one Wiswesser symbol
at a time. A Wiswesser symbol can represent either an individual atom (e.g., "G"
for chlorine) or a functional group (e.g., "Z" for the amino group). This allows us to
adapt the "layer" method of Jorgensen to the scanning of the functional groups on
the rings. These groups are provided as Prolog sublists as outlined in the previous
section. Once the functional group elements have been compared with the known
values, τ is calculated by the following method. The formula for the
numeric calculation is:

τ_{total} = τ_{max} + τ_{sum} / (1 + N_{FG})    (7)


Table 1. Example τ Table Entries

name WLN τ
tau_entry(p-methoxyaryl, "R DO1", 51).
tau_entry(trimethylamino, "N1&1", 44).
tau_entry(aryl, "R", 42).
tau_entry('methyl sulfate', "S1", 38).
tau_entry(amino, "Z", 36).
tau_entry(olefinic, "1U2", 36).
tau_entry(sulfate, "SH", 32).

The legend for this equation is:

• τ_{max} - the largest calculated reference value of τ in either the positive or negative
direction.

• τ_{sum} - the sum of the remaining τ values in the functional group.

• N_{FG} - the number of functional groups attached to the parent system.

The above is based on the calculation of a collective τ for the whole molecule. This
value changes the HOMO of either the diene or dienophile, as is necessary. This equation
is accurate to about 0.5 eV on either side of the "known" values [15]. The value of τ_{total}
is inserted into the HOMO-LUMO calculation as the parameter τ(Y). Note that in its
pure form, this equation only yields values for the HOMO orbitals. Corrections are used
for the calculation of the LUMO values. Table 1 contains examples of the Wiswesser
Line Notation and the raw τ values used in the computation of orbital energies.
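The collective τ can be sketched in a few lines of Python. The formula is only partly legible in the printed equation, so the form used here (the largest-magnitude τ plus the sum of the remaining values divided by one plus the number of functional groups) is an assumption inferred from the legend. The function name is hypothetical.

```python
def collective_tau(taus, n_functional_groups):
    """Assumed form: tau_max + tau_sum / (1 + N_FG).

    taus: raw tau values found while scanning the functional group.
    n_functional_groups: number of functional groups on the parent system.
    """
    if not taus:
        return 0.0
    # "Largest in either the positive or negative direction": compare magnitudes.
    largest = max(taus, key=abs)
    remaining = list(taus)
    remaining.remove(largest)
    return largest + sum(remaining) / (1 + n_functional_groups)

# Two Table 1 entries (p-methoxyaryl 51, amino 36) with two groups on the parent.
print(collective_tau([51, 36], 2))
```

Dividing the remaining contributions by 1 + N_FG damps the effect of every value except the dominant one, so the largest τ always enters at full weight.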

3.5 Determination of Permutated LUMO Coefficients

The following rules were used for the determination of the LUMO orbital coefficients
from the values determined for the HOMO coefficients [15].

1. An electron donating functional group raises the energy of the HOMO orbital of
a system about twice as much as it raises the LUMO.

2. In contrast, an electron withdrawing functional group lowers the HOMO energy
about one third as much as it lowers the LUMO.

3. Groups which add conjugation, such as olefinic, acetylenic and aromatic groups,
lower the LUMO orbital energy by one third to one half as much as they raise the
HOMO energy.

The same equations are used to determine both the HOMO and LUMO values.
This is consistent with the fact that the HOMO and LUMO orbitals are calculated from
the same parent system, and that the difference between the orbital energies can be
adequately covered by the two parameters γ(P), which represents the sensitivity of the
parent to substitution, and τ(Y), which represents the electronic effect exerted by the
functional group acting as a substituent.


To implement the rules mentioned above, only the τ(Y) values for the functional
groups are changed. Thus, the τ(Y) values for the calculation of the LUMO orbitals on
both the diene and dienophile are changed following these rules:

1. Positive τ values except those for conjugated hydrocarbons are divided by a factor
of 2.

2. Negative τ values are multiplied by 3.

3. τ values for conjugated hydrocarbons are divided by a factor of 3 and their signs
are reversed.
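The three rules above amount to a small case analysis, sketched here in Python for illustration. The function name and the boolean flag are hypothetical; how the original system decides that a group is a conjugated hydrocarbon is not shown in the text, so the caller supplies that information.

```python
def lumo_tau(tau, conjugated_hydrocarbon=False):
    """Convert a HOMO tau value into the value used for the LUMO calculation."""
    if conjugated_hydrocarbon:
        # Rule 3: one third of the magnitude, with the sign reversed.
        return -tau / 3.0
    if tau > 0:
        # Rule 1: electron donors affect the LUMO half as much as the HOMO.
        return tau / 2.0
    # Rule 2: electron acceptors affect the LUMO three times as much.
    return tau * 3.0

print(lumo_tau(36))          # amino, an electron donor
print(lumo_tau(-10))         # a hypothetical electron acceptor
print(lumo_tau(42, True))    # aryl, treated as a conjugated hydrocarbon
```

The conjugation test comes first because rule 1 explicitly excludes conjugated hydrocarbons, whose τ values in Table 1 are positive.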

This method covers many combinations of functional groups that influence the
orbital energies. A feature of this method is that it uses the same functional group
τ values as in the HOMO energy calculation. The algorithm described above is used
for the calculation of both the HOMO and LUMO atomic coefficients. The τ values of
the substituents are permutated to give the proper values for the LUMO orbitals. The
following steps are required:

1. τ values on terminal positions are taken from the list previously described.

2. Resultant τ values on the central diene positions are divided by a factor of two
to accommodate the fact that the orbital coefficients at these positions are very
small.

3.6 Algorithm for Regiochemical Selection

Any functional group attached to a terminal carbon on either a diene or dienophile
increases the magnitude of the coefficient on the opposite terminal. Any functional
group attached to a central position on the diene (there is no analogous case for the
dienophile) increases the magnitude of the coefficient on the terminal farthest from the
substituted position. For cyclohexene, the central locants are the A and B positions
on the Diels-Alder adduct. Thus, if a functional group is on position A the magnitude
of the coefficient at terminal C increases. One of the remarkable aspects of the
Diels-Alder reaction is the specificity of the bonding between the carbon atoms [13]. The
orientation of the addition can be accurately predicted by an extended form of the
frontier molecular orbital theory as developed by Fukui and Fujimoto et al. [16]. For
dienes the coefficients are determined as follows: if the sum of the absolute values of
τ on positions F and B is greater than the sum of τ on positions A and C, then the
coefficient on position C has the greater magnitude, otherwise the coefficient of position
F has the greater magnitude. On dienophiles, if the sum of the absolute values of τ is
greater on position D than on position E, then E has the greater magnitude.

4 Reaction Unification Using a General Form


This section examines the notion of a general form for representing the possible synthons
in a reaction. Derivation of this form is illustrated and examples of the general form
are presented. Symmetry and the encoding of optional notation are discussed and some
examples of the naive approach are presented.


Table 2. The Naive Approach

Parent Synthons
discon("L6UTJ A1 B1", ["1UY1&Y1&U1", "1U1"]).
discon("L6UTJ D1Q", ["1U2U1", "Q2U1"]).
discon("L6UTJ A1 B1 DOV1", ["1UY1&Y1&U1", "1VO1U1"]).
discon("L6UTJ A1 B1 D1 ENW", ["1UY1&Y1&U1", "WN1U2"]).

4.1 Motivation: the Naive Approach

In the naive approach, disconnections are simply listed as facts with the molecule to
disconnect as the first argument and a list of the synthons as the second. Table 2
contains some examples. This approach suffers in many ways; primarily, the number
of rules would become unmanageable (very large even for cyclohexene), slowing the
inferencing speed of the expert system.
A sample inference mechanism using these facts (given the natural backward chaining
of Prolog) might be

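% Disconnect a parent using a listed fact, then disconnect its synthons.
disconnect(Parent, Given_Synthons) :-
    discon(Parent, Synthons),
    disconnect(Synthons, Given_Synthons).
% An available (given) compound needs no further disconnection.
disconnect(Parent, [Parent]) :-
    given(Parent).
% Disconnect each member of a list of synthons in turn.
disconnect([First|Rest], [First_Disc|Rest_Disc]) :-
    disconnect(First, First_Disc),
    disconnect(Rest, Rest_Disc).
disconnect([], []).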

This procedure recursively disconnects synthons until the final synthons for the original
parent are all available (or given) compounds. Upon successful completion, the
variable Given_Synthons contains a tree (in list notation) which denotes the synthon
combination order to reproduce the parent compound.
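The recursion can also be followed in a language without built-in backtracking. The Python sketch below is an illustration under stated assumptions and is not the authors' code: the fact table holds two entries in the style of Table 2, the set of available compounds is invented for the example, and only the first listed disconnection of each parent is tried.

```python
# Disconnection facts: parent WLN -> list of synthons.
DISCON = {
    "L6UTJ A1 B1": ["1UY1&Y1&U1", "1U1"],
    "L6UTJ D1Q": ["1U2U1", "Q2U1"],
}

# Hypothetical set of available ("given") compounds.
GIVEN = {"1UY1&Y1&U1", "1U1", "1U2U1", "Q2U1"}

def disconnect(parent):
    """Return a nested list of available synthons for parent, or None."""
    if parent in DISCON:
        subtrees = [disconnect(synthon) for synthon in DISCON[parent]]
        if all(tree is not None for tree in subtrees):
            return subtrees
    if parent in GIVEN:
        return [parent]
    return None

print(disconnect("L6UTJ A1 B1"))
```

The nested list has the same shape as the tree bound to the second argument of the Prolog procedure: each available compound appears as a one-element list, and each disconnected parent as the list of its synthons' trees.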

4.2 Derivation of the General Form

Consider the domain of a six-membered ring with single unsaturation. Table 3 expresses
the synthetic route with one substituent. Again, the symbol 'σ' represents an arbitrary
substituent. Square brackets surrounding a set of symbols indicate optionality of those
symbols (as in regular expression notation). For example, the string 'σ[&]' may reduce
to the string 'σ' or 'σ&' depending on whether the substituent represented by σ ends
in a terminal symbol or not (following the rules of WLN).
Symmetry in the patterns, however, hides many details in the diene and dienophile
patterns. Table 4, with combinations of symmetric substituents, reveals more of the
details. The order of the symmetric substituents may be chosen arbitrarily. Alphabetical
ordering was chosen here for consistency.
Finally, for a full cyclohexene molecule, the patterns become

σ_C 1UYσ_A[&]Yσ_B[&]U1 σ_F + σ_D 1U1 σ_E    (8)


Table 3. Patterns for a Six-Membered Ring with One Substituent

Substituent Position    Diene                  Dienophile
A or B                  1UYσ[&]U1              1U1
C or F                  σ1U2U1                 1U1
D or E                  1U2U1                  σ1U1

Table 4. Patterns for a Six-Membered Ring with Two Substituents

Substituent Position    Diene                  Dienophile
A and B                 1UYσ_A[&]Yσ_B[&]U1     1U1
C and F                 σ_C 1U2U1 σ_F          1U1
D and E                 1U2U1                  σ_D 1U1 σ_E

It should be clear that this notation applies to many different classes of reactions.
Use and manipulation of this general form will be discussed in the next section. The
following discussion outlines its use in expert system rules.

4.3 Use of the Mechanism in Rule Formation

Given the general form, it is possible to capture many disconnections of a given class
with a single rule. The following example illustrates the approach advocated in this
paper for cyclohexene.

discon(WLN, [Diene, Dienophile]) :-
    carbocyclic(Substituents, 6, WLN, []),
    collect_substs(Substituents, "CABF", Dn_substs),
    collect_substs(Substituents, "DE", Dl_substs),
    fmo(Dn_substs, Dl_substs),
    make_synthon(Dn_substs, "*1UY*&Y*&U1*", Diene),
    make_synthon(Dl_substs, "*1U1*", Dienophile).

This rule declaratively states that the compound represented by WLN disconnects to
the Diene and Dienophile pair if the WLN matches the carbocyclic grammar rule
with six ring locants, the collected substituents for the Diene and Dienophile pass the
fmo test, and the respective constituents may be successfully incorporated into the
general form for the cyclohexene Diene and Dienophile.
The goal make_synthon instantiates the general form and rewrites the instantiated
general form into a pseudo-WLN form. The pseudo-WLN form has adjacent
number values combined and redundant ampersands eliminated, but the branch ordering
does not necessarily follow all the WLN rules. The symbol '*' in the second
argument represents a general substituent, 'σ', where the subscript is determined by
the order mentioned in the collect_substs predicate (e.g., "CABF" and "DE").
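The paper does not list the substituent-collecting predicate itself, so the following Python sketch only illustrates the behaviour the text implies: given the parsed substituent list and a string of locants, return the substituents in that locant order, with a placeholder where a position is unsubstituted. The function name mirrors the Prolog predicate, but the body and the use of None as the placeholder are assumptions.

```python
def collect_substs(substituents, locants):
    """Pick out the substituent symbols for each locant, in the order given.

    substituents: list such as [["A", "1"], ["B", "N", "W"], ["F", "3"]]
    locants: string of locant letters, e.g. "CABF" for the diene slots
    """
    by_locant = {entry[0]: "".join(entry[1:]) for entry in substituents}
    # Unsubstituted positions are reported as None (the '[]' of the text).
    return [by_locant.get(locant) for locant in locants]

parsed = [["A", "1"], ["B", "N", "W"], ["F", "3"]]
print(collect_substs(parsed, "CABF"))
print(collect_substs(parsed, "DE"))
```

Ordering the result by the locant string is what lets the slots of the general form be filled positionally, first to last.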
The following grammar rewrites the instantiated general form to the pseudo-WLN
notation. The unit symbol '[]' in the following grammar represents the NIL symbol (or
empty symbol) and arises when a substituent is not present in a particular position. This
grammar captures the following conditions: the '[]' symbol next to a number disappears,
adjacent numbers are summed (for a longer carbon chain), a three way branch reduces
to a carbon when one of the branches is empty, optional ampersands are eliminated, and
required ampersands are retained. The rules must be applied to the string repeatedly
until no changes to the string occur.

N[] --> N.
[]N --> N.
σN --> {number(σ), NN is σ + N}, NN.
Nσ --> {number(σ), NN is N + σ}, NN.
N1N2 --> {NN is N1 + N2}, NN.

Y[]& --> 1.
Yσ& --> {not(number(σ)), ends_in_terminal(σ)}, Yσ.
Yσ& --> {not(number(σ)), not(ends_in_terminal(σ))}, Yσ&.

For example, performing these transformations with an empty cyclohexene (σ_A =
[] ... σ_F = []) yields the diene "1U2U1" and the dienophile "1U1". Once the synthons
are in pseudo-WLN form, they are rearranged to conform to the standard WLN form
(described in Section 5).
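The repeated application of the rewrite rules can be traced with a short Python sketch. It is a simplified illustration, not the authors' grammar: the general forms are assumed to be the strings "*1UY*&Y*&U1*" and "*1U1*" with '*' marking a substituent slot, substituents are restricted to alkyl chains given as integers or to None for an empty position, and the two ampersand rules for non-numeric substituents (which need WLN terminal-symbol knowledge) are not modelled. All function names are hypothetical.

```python
NIL = "[]"

def instantiate(pattern, substituents):
    """Fill each '*' slot of the general form with a substituent token."""
    slots = iter(substituents)
    tokens = []
    for ch in pattern:
        if ch == "*":
            value = next(slots)
            tokens.append(NIL if value is None else value)
        elif ch.isdigit():
            tokens.append(int(ch))
        else:
            tokens.append(ch)
    return tokens

def rewrite_once(tokens):
    """Apply the first applicable rule; return None when no rule applies."""
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Y[]& --> 1 : a branch with an empty arm collapses to one carbon.
        if tokens[i:i + 3] == ["Y", NIL, "&"]:
            return tokens[:i] + [1] + tokens[i + 3:]
        # N[] --> N and []N --> N : NIL next to a number disappears.
        if isinstance(tok, int) and nxt == NIL:
            return tokens[:i + 1] + tokens[i + 2:]
        if tok == NIL and isinstance(nxt, int):
            return tokens[:i] + tokens[i + 1:]
        # Adjacent numbers are summed into one longer carbon chain.
        if isinstance(tok, int) and isinstance(nxt, int):
            return tokens[:i] + [tok + nxt] + tokens[i + 2:]
    return None

def to_pseudo_wln(pattern, substituents):
    """Rewrite until no rule applies, then join the tokens into a string."""
    tokens = instantiate(pattern, substituents)
    while True:
        rewritten = rewrite_once(tokens)
        if rewritten is None:
            return "".join(str(token) for token in tokens)
        tokens = rewritten

print(to_pseudo_wln("*1UY*&Y*&U1*", [None, None, None, None]))
print(to_pseudo_wln("*1UY*&Y*&U1*", [None, 1, 1, None]))
```

With every slot empty the sketch reduces the two forms to the diene "1U2U1" and the dienophile "1U1", and with methyl groups at A and B it yields "1UY1&Y1&U1", the diene listed for "L6UTJ A1 B1" in Table 2.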

4.4 Application to Other Reactions

General forms are easily developed for other reactions. The machinery introduced in
this section can then be utilized to write disconnection rules for other reactions. For
example, consider the Diels-Alder adduct bicyclo[2.2.1]hept-2-ene. Using the regular
expression notation described previously, the line notation for these types of compounds
can be represented as

L55 CU ATJ [Aσ_A] [Bσ_B] [Cσ_C] [Dσ_D] [Eσ_E] [Fσ_F] [Gσ_G] [-A&(F+G)] [-B&(F+G)]

The information following the hyphens describes the orientation of the substituents at
locants where stereoisomerism can occur. F and G are the locants where the
stereochemistry may occur.
This compound can be disconnected into a cyclopentadiene synthon and a dienophile
synthon similar to the one previously described. The general form for the
disconnection is then given in the notation by

L5 AHJ Aσ_A Bσ_B Cσ_C Dσ_D Eσ_E + σ_F 1U1 σ_G    (9)

Additional pseudo-WLN rewrite rules would eliminate ring locant symbols which are
followed by an empty substituent.

5 Notation Rearrangement
The previous section illustrated the formation of dienes and dienophiles and noted that
the intermediate notation did not necessarily obey the WLN rules. This section
describes the transformation from pseudo-WLN form to legal WLN notation.
A predicate called wln_order occurs within the make_synthon predicate. This
predicate builds a graph from the pseudo-WLN (using WLN Rule 8(a)) and possibly
reorders the graph as described below. The following Prolog code describes this
manipulation:


wln_order(Pseudo_WLN, WLN) :-
    notation_graph(Pseudo_WLN),
    rule6(Chain),          % uses graph in database
    rule7and8(Chain, WLN).

Construction of the notation graph requires general knowledge about terminal
symbols and their interaction with branch symbols. The pseudo-WLN is parsed using
this knowledge. Vertices are created when branch symbols are encountered and the edges
are labeled with the notation which occurs between the branch vertices. An undirected
graph results from this process and all vertices with degree one are considered roots.
Rule 6 orients the molecule, collecting the vertices and edges in the proper order.
To accomplish this, all root nodes are collected. Starting from each root, the primary
chain of the notation is chosen using the longest path of notation symbols, breaking any
tie by choosing the chain which ends in the latest notation symbol (Rule 2).
Next, Rule 7 orients branch choices along the primary chain chosen above. This
rule orders branches using the branches with the lowest branching factor and with the
fewest notation symbols. Ties are again broken by Rule 2. Rule 8 guides the reassembly
of the molecule in proper WLN form. It reintroduces ampersands and inserts hyphens
where necessary. All of this was easily implemented in Prolog, using DCG to parse the
pseudo-WLN form and the Prolog database to represent the graph.
Many additional rules are required for other reactions. Probably the entire complement
of WLN rules must be implemented for even moderately sophisticated chemistry.
It may be desirable at this point, however, to design a notation which encompasses
WLN's strong points, but is more computationally oriented.

6 Conclusions
Other systems have developed FMO reaction checks and used WLN for cataloging, but
this system has relied heavily on a symbolic approach to chemistry, including application
of grammar techniques to WLN strings. We feel that our system is very successful in
the domain to which it has been applied, eliminating hundreds of naive expert system rules.
We also feel that our techniques are applicable to many other reactions.
This paper has primarily stressed concepts rather than implementation details. A
prototype system based on these concepts has been implemented, with concentration in
the cyclohexene domain. The entire system, including grammars, the FMO verification,
and WLN manipulation, required only 12 pages of Prolog code. Although execution
speed was never considered a factor at this stage, the system performs the disconnection

L6UTJ A1 B1 D1 ENW => 1UY1&Y1&U1 + WN1U2    (10)

in four seconds with a 1K Logical Inferences Per Second (LIPS) interpreter.


There are several future research directions for this project. First, results from
the FMO reaction check are not infallible due to the qualitative nature of this check. A
more precise, yet computationally feasible model may be possible. Second, more work
remains in the WLN rearranger; a full system based on our concepts would require
knowledge of the entire complement of WLN rules. It may also be desirable to adopt
or develop another, more computationally tractable line notation for the purpose of
synthetic analysis. Finally, we would like to extend our work to more reaction classes
to examine its potential in more detail.


Acknowledgments

We wish to express our appreciation for the Texas Instruments IDEA program which
sponsored the majority of this research. This is a unique program within a large
company which provides excellent research opportunities. Texas Instruments' unsurpassed
computing facilities also deserve acknowledgment.

Literature Cited

[1] Blower, P. E., Jr., An Application of Artificial Intelligence to Organic Synthesis,
PhD Thesis, University of Wisconsin, 1975.
[2] Gordon, John E., "Chemical Inference. 2 Formalization of Organic Chemistry:
Generic Systematic Nomenclature," J. Chem. Inf. Comput. Sci., 24, (1984), pp.
81-92.
[3] Rodgers, David and W. T. Wipke, "Artificial Intelligence in Organic Synthesis.
SST: Starting Material Selection Strategies. An Application of Superstructure
Search," J. Chem. Inf. Comput. Sci., 24, (1984), pp. 71-81.
[4] Sridharan, N. S., PhD Thesis, State University of New York at Stony Brook, 1971.
[5] Warren, Stuart, Organic Synthesis: the Disconnection Approach, John Wiley &
Sons, New York, 1982.
[6] Clocksin, W. F. and C. S. Mellish, Programming in Prolog, Springer-Verlag, Berlin,
1981.
[7] Pereira, F.C.N., D.H.D. Warren, "Definite Clause Grammars for Language
Analysis - a Survey of the Formalism and a Comparison with Augmented Transition
Networks," Artificial Intelligence, 13, (1980), pp. 231-278.
[8] Smith, Elbert G., ed., The Wiswesser Line-Formula Chemical Notation,
McGraw-Hill, New York, 1968.
[9] Fritts, Lois E., Margaret Mary Schwind, "Using the Wiswesser Line Notation
(WLN) for Online, Interactive Searching of Chemical Structures," J. Chem. Inf.
Comput. Sci., 22, (1982), pp. 106-109.
[10] Johns, Trisha M., Michael Clare, "Wiswesser Line Notation as a Structural
Summary Medium," J. Chem. Inf. Comput. Sci., 22, (1982), pp. 109-113.
[11] Winograd, Terry, Language as a Cognitive Process, Volume 1, Addison-Wesley,
Reading, 1983.
[12] Hopcroft, John E., Jeffrey D. Ullman, Introduction to Automata Theory, Languages,
and Computation, Addison-Wesley, 1979.
[13] Woodward, R. B. and R. Hoffmann, The Conservation of Orbital Symmetry,
Academic Press, New York, 1970.
[14] Jorgensen, W. L. and Timothy D. Salatin, J. Org. Chem., 45, 2043, (1980).
[15] Jorgensen, W. L. and Julia Schmidt Burnier, J. Org. Chem., 48, 3923, (1983).
[16] Fukui, K., Top. Curr. Chem., 15, 1, (1970).
[17] Herndon, W. C., Chem. Rev., 72, 157, (1972).
[18] Lowry, T. H., K. S. Richardson, Mechanism and Theory in Organic Chemistry,
2nd ed., Harper & Row, New York, 1981.
[19] Onishchenko, A. S., Diene Synthesis, Daniel Davey and Co., New York, 1964.
[20] Sustmann, R., Tetrahedron Lett., 2717, (1971).
[21] Sustmann, R., Tetrahedron Lett., 2721, (1971).
[22] Sustmann, R. and H. Trill, Angew. Chem. Int. Ed., 11, 838, (1972).
[23] Fleming, I., Frontier Molecular Orbitals and Organic Chemical Reactions, Wiley,
London, 1976, Chapter 4.

RECEIVED December 17, 1985

20

Using a Theorem Prover in the Design
of Organic Syntheses

Tunghwa Wang, Ilene Burnstein, Michael Corbett, Steven Ehrlich, Martha Evens,
Alice Gough, and Peter Johnson
Illinois Institute of Technology, Chicago, IL 60616
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch020

This paper describes an expert system for
organic synthesis that uses a resolution-
based theorem prover as its reasoning
component. This reasoning component is
built upon LMA (Logic Machine
Architecture), a collection of Pascal
subroutines written by the theorem proving
group at Argonne National Laboratory. The
SYNLMA system (SYNthesis with LMA)
represents the target compound as a theorem
to be proved, while the starting materials
and reaction rules become axioms. The main
advantages of SYNLMA stem from the
independence of the database from the
inferencing mechanism. This separation
makes it possible to experiment with
different representations of knowledge and
different databases, like the large
chemical databases made available by ISI
and Chemical Abstracts, without
reprogramming.

Using LMA (Logic Machine Architecture), a collection of Pascal
programs written by the theorem proving group at Argonne National
Laboratory (1-2), we have developed SYNLMA (SYNthesis with LMA), an
expert system for organic synthesis that uses a resolution-based
theorem prover as the reasoning component. The major advantages of
SYNLMA stem from the independence of the database and the
inferencing. First, the database can be modified or an entirely
different one used without reprogramming the decision making unit of
the system. This conversion involves modifying a short program that
translates a database representation for molecules into a molecular
representation the theorem prover recognizes; SYNLMA is not changed
at all. Second, the scheme for representing
a molecule can be changed without changing SYNLMA. Once again SYNLMA
remains the same; only the interface between the database and SYNLMA
will have to be altered. This flexibility makes SYNLMA an attractive
alternative to other organic synthesis programs.

SYNLMA performs a retrosynthetic analysis using a special purpose
theorem prover built from LMA components. The compound to be
synthesized becomes a theorem to be proved. The reaction rules and
starting materials become axioms. The choice of a knowledge
representation has been one of our greatest problems.

Data for the theorem prover has to be translated into clauses, the
only form the theorem prover recognizes. A clause is the "OR" of one
or more literals, where a literal is a predicate and its arguments. A
predicate is a property or relationship that is true or false. Its
arguments can encompass any number of functions. A function returns
true, false or some other value. The statement "x + y > y + z" can be
written as a clause using the function "Sum" and the predicate
"GreaterThan." The resulting one-literal clause looks like this:

GreaterThan(Sum(x,y),Sum(y,z))

(See reference 3 for a formal definition of a clause.)
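
The definitions above can be made concrete with a small data structure. The following is a minimal Python sketch, not part of LMA (which is written in Pascal); the class names Var, Func, Literal and Clause are hypothetical and chosen only for illustration. The "-" sign for a negated literal and the "|" separator for OR follow the clause notation used in this chapter.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    """A variable such as x, y or z."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Func:
    """A function applied to argument terms, e.g. Sum(x,y)."""
    name: str
    args: Tuple["Term", ...]

    def __str__(self) -> str:
        return f"{self.name}({','.join(str(a) for a in self.args)})"


Term = Union[Var, Func]


@dataclass(frozen=True)
class Literal:
    """A predicate with its arguments; positive=False marks negation."""
    predicate: str
    args: Tuple[Term, ...]
    positive: bool = True

    def __str__(self) -> str:
        sign = "" if self.positive else "-"
        return f"{sign}{self.predicate}({','.join(str(a) for a in self.args)})"


@dataclass(frozen=True)
class Clause:
    """The OR of one or more literals."""
    literals: Tuple[Literal, ...]

    def __str__(self) -> str:
        return "|".join(str(lit) for lit in self.literals)


# The statement "x + y > y + z" as a one-literal clause.
x, y, z = Var("x"), Var("y"), Var("z")
greater = Literal("GreaterThan", (Func("Sum", (x, y)), Func("Sum", (y, z))))
clause = Clause((greater,))
print(clause)
```

Frozen dataclasses are used so that terms are immutable and hashable, which is convenient if clauses are later stored in sets or used as dictionary keys.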

Molecular Representations

The representation of molecular structure in clause form is crucial
to this research as it is a major determinant of the theorem prover's
efficiency. The clause representation affects the time it takes to
retrieve reaction rules and starting materials and the time necessary
to make comparisons between structures. The importance of the
relationship between efficiency and the clause representation is
illustrated by the difference in the run times between proving our
first clauses and current ones. Our first representation scheme was a
simple one with one predicate for each atom except hydrogen and one
for each bond (Figure 1a). Using this clause form, a molecule with
ten atoms took several hours to prove on an IBM mainframe. For SYNLMA
to be a viable system for organic synthesis the "proving time" has to
be reasonable, and one key to this is the clause representation. By
using a single predicate to describe each atom and its "bond
environment," the proof of a molecule has been reduced to a few
seconds. We will continue to experiment with the representation for
molecules, trying to find the right balance between the number of
clauses and their length. We currently represent starting materials
and compounds that we want to synthesize (targets) by a clause list
(Figure 1b). In this scheme:

1. A molecule is represented by a list of clauses, where each clause
   corresponds to one atom and describes its environment (i.e., its
   bonds, charge, etc.).
2. The number of atoms in a molecule does not correspond to the
   number of clauses in the clause list. An atom generates a clause
   only if it is bonded to two or more atoms; otherwise the atom will
   be ignored as all its information will be contained in a clause
   generated by another atom.
3. Each clause consists of the predicate called Fragment, a Bond
   function (Brr1, B211, B1111, etc.) listing the types of bonds,
   such as aromatic, resonant, triple, double, single, for the atom
   being described and an Atom function for this central atom of
   reference and for each atom bonded to it. A clause is terminated
   with a semicolon.
4. The arguments for the Atom function are: the chemical symbol for
   the element, a number assigned by our numbering scheme, the charge
   on the atom (-1, 0, +1, +2 etc.), a stereochemistry flag and a
   ring flag indicating whether or not the atom is a member of a
   ring. Default values for the last three arguments are zero.

 H(7)          O(3)             C(1);
     \        //                C(2);
 H(6)-C(2)-C(1)                 O(3);
     /        \                 O(5);
 O(5)          H(4)             Double(1,3);
 /                              Bond(2,1);
H(8)                            Bond(2,5);

Figure 1a. Our First Clause Representation for a Simple Molecule.
The numbers following the element symbols in the diagram are used to
identify atoms in the clauses.

Fragment(B211(Atom(C,1,0,0,0),Atom(O,3,0,0,0),
              Atom(C,2,0,0,0),Atom(H,4,0,0,0)));
Fragment(B1111(Atom(C,2,0,0,0),Atom(C,1,0,0,0),
               Atom(O,5,0,0,0),Atom(H,6,0,0,0),
               Atom(H,7,0,0,0)));
Fragment(B11(Atom(O,5,0,0,0),Atom(C,2,0,0,0),
             Atom(H,8,0,0,0)));

Figure 1b. Our Current Clause Representation for the Same Molecule

Figure 1b is a simple example of a clause list and the rules for
constructing it. In actuality, there are no spaces between characters
in a clause. They are included to make it easier to grasp the clause
notation. Note that although there are eight atoms in the molecule,
only three generated clauses. For example, O(3) does not generate a
clause since it would be redundant. The clause for O(3) would be
"Fragment(B2(Atom(O,3,0,0,0),Atom(C,1,0,0,0)))" and all this
information is contained in the clause generated by C(1). The first
Fragment predicate in Figure 1b is:

Fragment(B211(Atom(C,1,0,0,0),Atom(O,3,0,0,0),
              Atom(C,2,0,0,0),Atom(H,4,0,0,0)));

The Bond function describes a central atom of reference and the atoms
bonded to it. B211 states that there is a central atom of reference
bonded to one atom by a double bond (2) and to two other atoms by
single bonds (1). The order of the Bond function arguments
corresponds to this Bond function notation. These arguments are not
simple atomic symbols, but Atom functions that can relate
considerable information about the atom. In this Bond function,
Atom(C,1,0,0,0) is the central atom. The next three arguments are
atoms that are bonded to this central atom: the first,
Atom(O,3,0,0,0), by a double bond; the next two, Atom(C,2,0,0,0) and
Atom(H,4,0,0,0), by single bonds.

The first two arguments in the Atom function for a particular atom
never change, as they identify the atom. Atom(O,3,0,0,0) describes
the oxygen atom numbered 3, as opposed to the oxygen atom numbered 5,
in the drawing in Figure 1a. The number does not indicate position.
If some reaction resulted in the "O3" bond to "C1" being broken and
"O3" was replaced by some other atom, "O3" remains "O3"; the new atom
will have a new number. Suppose "O3" were to become charged; then the
function describing it would become Atom(O,3,-1,0,0), reflecting
the change.
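
The construction rules for a clause list can be sketched in a few lines of Python. This is an illustrative sketch only, not the SYNLMA translation program: the names ATOMS, BONDS, atom_term and clause_list are hypothetical, only single, double and triple bonds (types 1 to 3) are handled, and the ordering of neighbours (higher bond types first, then ascending atom number) is an assumption that happens to reproduce Figure 1b.

```python
from typing import Dict, List, Tuple

# The molecule of Figure 1a: atom number -> element symbol.
ATOMS: Dict[int, str] = {
    1: "C", 2: "C", 3: "O", 4: "H", 5: "O", 6: "H", 7: "H", 8: "H",
}

# Bonds as (atom number, atom number, bond type); 1 = single, 2 = double.
BONDS: List[Tuple[int, int, int]] = [
    (1, 3, 2), (1, 2, 1), (1, 4, 1),
    (2, 5, 1), (2, 6, 1), (2, 7, 1),
    (5, 8, 1),
]


def atom_term(atoms, number, charge=0, stereo=0, ring=0):
    """Atom(symbol, number, charge, stereo flag, ring flag); defaults are zero."""
    return f"Atom({atoms[number]},{number},{charge},{stereo},{ring})"


def clause_list(atoms, bonds):
    """Return one Fragment clause per atom bonded to two or more atoms."""
    neighbours = {number: [] for number in atoms}
    for a, b, bond_type in bonds:
        neighbours[a].append((bond_type, b))
        neighbours[b].append((bond_type, a))

    clauses = []
    for centre in sorted(atoms):
        # Assumed ordering: higher bond types first, then by atom number.
        bonded = sorted(neighbours[centre], key=lambda tb: (-tb[0], tb[1]))
        if len(bonded) < 2:
            continue  # information already held in a neighbour's clause
        bond_function = "B" + "".join(str(t) for t, _ in bonded)
        args = [atom_term(atoms, centre)]
        args += [atom_term(atoms, n) for _, n in bonded]
        clauses.append(f"Fragment({bond_function}({','.join(args)}));")
    return clauses


for clause in clause_list(ATOMS, BONDS):
    print(clause)
```

For the eight-atom molecule of Figure 1a this yields three clauses, for C(1), C(2) and O(5); the other five atoms each have a single neighbour and are skipped.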

Reaction Rule Database

Our present reaction rule database is made up of approximately one
hundred rules adapted from a microfiche generously sent to us by
Gelernter (4). For a given reaction, a rule specifies the reactants
(subgoal) and the product(s) (goal) in connection table format, and
any constraints on their composition (Figure 2a). The rules are
identified by chapter and schema numbers. The connection tables are
organized as follows:

1. A reaction rule connection table includes all the atoms in both
   the goal and subgoal molecules. The atoms are numbered uniquely
   and the numbering of the atoms (see the drawing of the molecules)
   corresponds to the row numbers in the tables. The same atom
   appearing in both a goal and subgoal keeps the same number. If an
   atom in the goal does not appear in the subgoal, the subgoal
   connection table will still include the atom as a row atom but all
   values to the right will be zero.
2. The symbols in the first column (to the right of the row number)
   identify the atom or variable described
   by the row. It will be referred to as the row atom. There are
   three types of row atom symbols: periodic table notation for
   elements; the variable #J, which represents a halide; and a
   variable composed of a dollar sign followed by a number
   ($1, $2...). The "$/even numbered" variables can represent any
   substructure or any atom. The "$/odd numbered" variables can
   represent any substructure or any atom except hydrogen. The
   following four structures could be equivalent. The structures
   range from the very specific on the left, where the atom pointed
   to is defined as a chlorine atom, to the very general, where the
   atom or substructure can be anything.

     O             O             O             O
     ||            ||            ||            ||
     C             C             C             C
    / \           / \           / \           / \
  $1   Cl       $1   #J       $1   $1       $1   $2

  specific -----------------------------------> general

3. The next twelve columns (six pairs: up, down, left, right, in,
   out) describe the bonds of the atom in column one. The first
   number in each pair is the row index, identifying the atom bonded
   to the row atom. The second number is one of five bond types
   (1: single, 2: double, 3: triple, 16: resonant bond, 8: single
   bond between an atom and a resonant structure). If the row atom
   does not appear in the goal or subgoal structures the default
   values are zero.
4. The last sixteen columns contain symmetry information.

Figure 2a is the Gelernter reaction rule for the "reaction of
magnesium with alkyl bromides". The number (six) and type of row
atoms (Mg, Br, C, $2, $4, $6) are identical for both the goal and
subgoal connection tables and are a composite of all atoms in both
the product and reactants. Differences between goal and subgoal
structures are indicated by the numbers to the right of the row atoms
and not by their presence or absence in the tables. For example, in
the goal table Row Atom 1, magnesium, is bonded to Row Atom 2 by a
single bond (index:bond = 2:1) and to Row Atom 3 by a single bond
(index:bond = 3:1). While magnesium does not appear in the subgoal
structure, it is still the first row atom in the subgoal's table. But
the values for bond indexes and bond types are now zero; that is,
Mg(1) is not bonded to other atoms in the table. An example of an
atom that appears in both the goal and subgoal structures is Row
Atom 3. One of the atoms that C(3) is bonded to changes (Br to Mg)
but C(3) is considered the same throughout the reaction and keeps the
same index.

$2(4)   $4(5)              $2(4)   $4(5)
     \ /                        \ /
     C(3)                       C(3)
     / \                        / \
Br(2)   $6(6)              Mg(1)   $6(6)
                           /
                      Br(2)

Diagram illustrating the following reaction rule
Schema 2

Schema name is reaction of magnesium with alkyl bromides. The
starting values for ease, yield and confidence are: 90, 95, 100. The
reagent class for this schema is: 0. This is a single application
schema. The maximum no. of nonidentical subgoal molecules allowed for
this schema is 9.

The Transformation Patterns:

Goal TSD

no  elem  up   down  left  right  in   out  symmetry
 1  Mg    2:1  3:1   0:0   0:0    0:0  0:0  0000000000000000
 2  Br    1:1  0:0   0:0   0:0    0:0  0:0  0000000000000000
 3  C     1:1  4:1   5:1   6:1    0:0  0:0  0000000000000000
 4  $2    3:1  0:0   0:0   0:0    0:0  0:0  0000000000000000
 5  $4    3:1  0:0   0:0   0:0    0:0  0:0  0010000000000000
 6  $6    3:1  0:0   0:0   0:0    0:0  0:0  0010000000000000

Subgoal TSD

no  elem  up   down  left  right  in   out  symmetry
 1  Mg    0:0  0:0   0:0   0:0    0:0  0:0  0000000000000000
 2  Br    3:1  0:0   0:0   0:0    0:0  0:0  0000000000000000
 3  C     2:1  4:1   5:1   6:1    0:0  0:0  0000000000000000
 4  $2    3:1  0:0   0:0   0:0    0:0  0:0  0000000000000000
 5  $4    3:1  0:0   0:0   0:0    0:0  0:0  0000000000000000
 6  $6    3:1  0:0   0:0   0:0    0:0  0:0  0000000000000000

Schema Tests: Can't have any of the following attributes:

136 Thiol
126 Oxime
122 Diazoketone
(and others)

Figure 2a. Gelernter reaction rule.
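
The way a goal and subgoal table encode a transformation can be checked mechanically. The sketch below is a hypothetical Python reading of the two tables in Figure 2a, not the authors' program: the names parse_tsd and label are invented for illustration, the row layout (row number, symbol, six index:type pairs) is taken from the description above, and the symmetry column is simply omitted.

```python
from typing import Dict, FrozenSet, Tuple

# Rows of Figure 2a without the symmetry column:
# row number, symbol, then six index:type pairs (up, down, left, right, in, out).
GOAL_TSD = """
1 Mg 2:1 3:1 0:0 0:0 0:0 0:0
2 Br 1:1 0:0 0:0 0:0 0:0 0:0
3 C  1:1 4:1 5:1 6:1 0:0 0:0
4 $2 3:1 0:0 0:0 0:0 0:0 0:0
5 $4 3:1 0:0 0:0 0:0 0:0 0:0
6 $6 3:1 0:0 0:0 0:0 0:0 0:0
"""

SUBGOAL_TSD = """
1 Mg 0:0 0:0 0:0 0:0 0:0 0:0
2 Br 3:1 0:0 0:0 0:0 0:0 0:0
3 C  2:1 4:1 5:1 6:1 0:0 0:0
4 $2 3:1 0:0 0:0 0:0 0:0 0:0
5 $4 3:1 0:0 0:0 0:0 0:0 0:0
6 $6 3:1 0:0 0:0 0:0 0:0 0:0
"""


def parse_tsd(text: str) -> Tuple[Dict[int, str], Dict[FrozenSet[int], int]]:
    """Return (row -> symbol, {row pair} -> bond type) for one table."""
    symbols: Dict[int, str] = {}
    bonds: Dict[FrozenSet[int], int] = {}
    for line in text.strip().splitlines():
        fields = line.split()
        row, symbol = int(fields[0]), fields[1]
        symbols[row] = symbol
        for pair in fields[2:8]:
            index, bond_type = (int(part) for part in pair.split(":"))
            if index != 0:  # 0:0 marks an unused bond slot
                # Each bond appears in both rows; the frozenset merges them.
                bonds[frozenset((row, index))] = bond_type
    return symbols, bonds


def label(symbols: Dict[int, str], bond: FrozenSet[int]) -> str:
    return "-".join(f"{symbols[i]}({i})" for i in sorted(bond))


symbols, goal_bonds = parse_tsd(GOAL_TSD)
_, subgoal_bonds = parse_tsd(SUBGOAL_TSD)

# Bonds present only in the goal are formed by the reaction;
# bonds present only in the subgoal are broken by it.
formed = sorted(label(symbols, b) for b in goal_bonds.keys() - subgoal_bonds.keys())
broken = sorted(label(symbols, b) for b in subgoal_bonds.keys() - goal_bonds.keys())
print("formed:", formed)
print("broken:", broken)
```

Comparing the two bond sets recovers the change described in the text: the C(3)-Br(2) bond of the alkyl bromide is replaced by C(3)-Mg(1) and Mg(1)-Br(2) bonds, while the three bonds from C(3) to the variables are unchanged.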

The constraints listed under the schema tests give limitations on the
possible values of the variables in column one, etc. The reaction
rules can be characterized as single or multistep, where a multistep
reaction is defined as a rule that can be rewritten as a series of
single step reactions. An example of a single and multistep reaction
rule for a malonic ester synthesis follows.

Multistep:

      O
      ||            1) NaOET
 ET-O-C             2) RX                   O       CH3
       \            3) OH-, H2O             ||     /
        CH2         --------------->    H-O-C-CH2-C-H
       /            4) H+                          \
 ET-O-C                                             CH3
      ||
      O

Single Step Equivalent:

      O                                O
      ||                               ||
 ET-O-C                           ET-O-C
       \                                \
        CH2   +   NaOET   ----->         CH-
       /                                /
 ET-O-C                           ET-O-C
      ||                               ||
      O                                O

      O                                     O
      ||                                    ||
 ET-O-C         H   CH3                ET-O-C      CH3
       \   -     \ /                         \    /
        CH   +    C       ----->              CH-C-H
       /         / \                         /    \
 ET-O-C        Br   CH3                ET-O-C      CH3
      ||                                    ||
      O                                     O

      O                                    O
      ||                                   ||
 ET-O-C      CH3                      H-O-C      CH3
       \    /       OH-, H2O               \    /
        CH-C-H      --------->              CH-C-H
       /    \                              /    \
 ET-O-C      CH3                      H-O-C      CH3
      ||                                   ||
      O                                    O

     O
     ||
 H-O-C      CH3     1) H+                  O       CH3
      \    /        2) -CO2                ||     /
       CH-C-H       --------->         H-O-C-CH2-C-H
      /    \                                      \
 H-O-C      CH3                                    CH3
     ||
     O

In this example a four step synthesis is also expressed as a very
general one step reaction.

We have written a program that translates the connection tables into
clauses, a form that the theorem prover can process, and stores them
in files organized first by goal or subgoal and then by the
functional groups in the molecule. The constraints are in another set
of files. SYNLMA uses these files; it does not use the files of
Gelernter formatted rules. In addition to the reaction rule database,
we have functional group and starting material databases (also in
clause form).

The Translation of Reaction Rules into Clauses

Each atom in a target or starting material molecule is defined. This
is not true for a reaction rule or functional group molecule, where
parts of the molecule are represented by variables ($1, #J, etc.).
SYNLMA treats a reaction rule or functional group structure as a
molecule, even though some of its atoms are unknown, and represents
it in essentially the same form as known molecules (Figure 2b). A
molecule with a variable substructure differs from a known molecule
in the following:

1. The predicates are ORed for a molecule with variables (one clause
   per molecule) instead of ANDed (one list of clauses for each
   molecule).
2. The sign of the predicate is negative instead of positive.
3. Variable atoms or substructures are represented by the letter "y"
   followed by a number (y1, y2) or the letter "j" (yj). "yj"
   represents a halide; the "y/even numbered" variables can represent
   any substructure or atom; and the "y/odd numbered" can represent
   any substructure or atom except hydrogen.
4. The Atom functions have variables for arguments, not constants.
5. Each goal or subgoal clause is terminated with the predicate
   Rxnrule whose first argument is a reaction rule identification
   number. After this number, the predicate uses the function LL (for
   linked list) to list all the atoms in the connection table.
   Functional group clauses are terminated with the
   similar function Funcgr. The main difference between these two
   functions is that Funcgr only lists atoms or substructures that
   are in the molecule being described, while the reaction rule lists
   all atoms in the connection table regardless of whether they
   appear in the structure(s) being described.

Reaction Rule Chapter 20, Schema 2: GOAL

-Fragment(B11(Atom(Mg,x1,s1,t1,u1),Atom(Br,x2,s2,t2,u2),
              Atom(C,x3,s3,t3,u3)))|
-Fragment(B1111(Atom(C,x3,s3,t3,u3),
                Atom(Mg,x1,s1,t1,u1),y2,y4,y6))|
Rxnrule(202,LL(Atom(Mg,x1,s1,t1,u1),
            LL(Atom(Br,x2,s2,t2,u2),
            LL(Atom(C,x3,s3,t3,u3),
            LL(y2,LL(y4,LL(y6,NIL)))))));

Reaction Rule Chapter 20, Schema 2: SUBGOAL

-Fragment(B1111(Atom(C,x3,s3,t3,u3),
                Atom(Br,x2,s2,t2,u2),y2,y4,y6))|
Rxnrule(202,LL(Atom(Mg,x1,s1,t1,u1),
            LL(Atom(Br,x2,s2,t2,u2),
            LL(Atom(C,x3,s3,t3,u3),
            LL(y2,LL(y4,LL(y6,NIL)))))));

Figure 2b. Clause representation of the goal and subgoal
in reaction rule Chapter 20, Schema 2.

A comparison between the connection tables in Figure 2a and their
clause representations in Figure 2b illustrates the conversion rules
and some of the differences between a known molecule's clause and a
reaction rule clause. Two row atoms, Mg(1) and C(3), in the goal and
only C(3) in the subgoal are bonded to two or more atoms and
therefore generate predicates. Unlike the clause list (Figure 1b),
these predicates are not separated by semicolons (implicitly ANDed
one-predicate clauses) but are joined by a vertical bar, the symbol
for OR. The predicate, Fragment, is constructed in the same way as
for a known molecule with the exception that some of the Atom
function's arguments are variables (e.g. x1, s1, t1, etc.). Variables
are not written using Atom functions (they are unknowns) but are
simply listed in the proper order in the bond function. The clause is
terminated with an identifying Rxnrule predicate that lists the
reaction rule chapter and schema (chapter number * 1000 + schema
number) and every row atom in the connection table. Note that the
Rxnrule predicate is identical for the goal and subgoal, linking the
two clauses together.
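
The conversion rules just described can be sketched as follows. This is a hypothetical Python illustration rather than the translation program itself: the names term and rule_clause are invented, the rule number 202 is copied from Figure 2b rather than computed, only integer bond types are handled, and the neighbour ordering (higher bond types first, then row number) is an assumption consistent with the figure.

```python
from typing import Dict, List, Tuple


def term(row: int, symbol: str) -> str:
    """Row atom -> clause term; $n and #J variables become y-variables."""
    if symbol.startswith("$"):
        return "y" + symbol[1:]
    if symbol == "#J":
        return "yj"
    return f"Atom({symbol},x{row},s{row},t{row},u{row})"


def rule_clause(rule_id: int, symbols: Dict[int, str],
                bonds: List[Tuple[int, int, int]]) -> str:
    """Build one reaction rule clause: ORed negative Fragments plus Rxnrule."""
    neighbours = {row: [] for row in symbols}
    for a, b, bond_type in bonds:
        neighbours[a].append((bond_type, b))
        neighbours[b].append((bond_type, a))

    literals = []
    for row in sorted(symbols):
        symbol = symbols[row]
        if symbol.startswith("$") or symbol == "#J":
            continue  # variables are listed as arguments, never as centres
        bonded = sorted(neighbours[row], key=lambda tb: (-tb[0], tb[1]))
        if len(bonded) < 2:
            continue
        bond_function = "B" + "".join(str(t) for t, _ in bonded)
        args = [term(row, symbol)] + [term(n, symbols[n]) for _, n in bonded]
        literals.append(f"-Fragment({bond_function}({','.join(args)}))")

    # LL(...) lists every row atom, whether or not it is bonded here.
    linked = "NIL"
    for row in sorted(symbols, reverse=True):
        linked = f"LL({term(row, symbols[row])},{linked})"
    literals.append(f"Rxnrule({rule_id},{linked})")
    return "|".join(literals) + ";"


SYMBOLS = {1: "Mg", 2: "Br", 3: "C", 4: "$2", 5: "$4", 6: "$6"}
GOAL_BONDS = [(1, 2, 1), (1, 3, 1), (3, 4, 1), (3, 5, 1), (3, 6, 1)]
SUBGOAL_BONDS = [(2, 3, 1), (3, 4, 1), (3, 5, 1), (3, 6, 1)]

goal = rule_clause(202, SYMBOLS, GOAL_BONDS)
subgoal = rule_clause(202, SYMBOLS, SUBGOAL_BONDS)
print(goal)
print(subgoal)
```

Because the Rxnrule literal is built from the full row list and not from the bonds, it comes out identical for the goal and subgoal clauses, which is what links the two together.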

The Synthetic Design Process

SYNLMA i s c u r r e n t l y c a p a b l e o f h a n d l i n g t h e s y n t h e t i c
d e s i g n f o r compounds t h e s i z e o f t h e a n a l a g e s i c D a r v o n
u s i n g an i n - c o r e database o f a p p r o x i m a t e l y a hundred
reactions. The s y n t h e s i s p r o c e s s s t a r t s w i t h t h e i n p u t
o f t h e s t r u c t u r e o f t h e compound ( i n c l a u s e form) t h a t we
are t r y i n g t o synthesize. Next, an i n t e r n a l
r e p r e s e n t a t i o n o f t h e compound i s g e n e r a t e d . This
becomes t h e t a r g e t ( t h e t h e o r e m t o be p r o v e d ) . The
t h e o r e m p r o v e r b e g i n s by i d e n t i f y i n g t h e t a r g e t ' s m a j o r
f u n c t i o n a l g r o u p s a n d u s e s them a s k e y s i n t o t h e
database. As t h e s e a r c h b e g i n s f o r r e a c t i o n s a n d
compounds f r o m w h i c h t h e t a r g e t c a n be s y n t h e s i z e d , t h e
theorem p r o v e r o n l y s e a r c h e s t h e g o a l f u n c t i o n a l group
f i l e s c o r r e s p o n d i n g t o t h e f u n c t i o n a l groups i t has
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch020

a l r e a d y found i n the t a r g e t . F o r example, i f t h e t a r g e t


c o n t a i n e d an a c i d and a benzene r i n g , t h e theorem p r o v e r
would s e a r c h t h e g o a l f i l e s c o n t a i n i n g a c i d s and benzene
r i n g s f o r a matching molecular s t r u c t u r e . When a
matching s t r u c t u r e i s found, i t s c o r r e s p o n d i n g subgoal
becomes t h e new g o a l a n d i s t r a n s l a t e d i n t o t h e i n t e r n a l
representation f o r a molecule. Then t h e f u n c t i o n a l g r o u p
i d e n t i f i c a t i o n a n d t h e s e a r c h a n d match p r o c e s s i s
repeated. This process of examining a l t e r n a t i v e r e a c t i o n
p a t h s a n d s e t t i n g up i n t e r m e d i a t e compounds a s new g o a l s
i s r e p e a t e d u n t i l a l l t h e p o s s i b l e r e a c t i o n s c a n be
p e r f o r m e d u s i n g t h e a v a i l a b l e compounds.
T h i s p r o c e s s o f working backward from t h e a v a i l a b l e
s t a r t i n g m a t e r i a l s , c a l l e d " r e t r o s y n t h e t i c a n a l y s i s " by
the o r g a n i c chemist, i s immediately r e c o g n i z e d as an
example o f b a c k w a r d c h a i n i n g by w o r k e r s i n a r t i f i c i a l
intelligence. T h i s backward c h a i n i n g p r o c e s s c r e a t e s a
l a r g e problem s o l v i n g t r e e i n which g o a l s o r nodes
c o r r e s p o n d t o compounds w h i l e t h e b r a n c h e s c o r r e s p o n d t o
p o s s i b l e r e a c t i o n p a t h w a y s . (A more d e t a i l e d d e s c r i p t i o n
o f how SYNLMA h a n d l e s t h i s p r o c e s s c a n be f o u n d i n 5-6.)
An example of a problem solving tree for the synthesis of Darvon appears in Figure 3. The tree contains both AND nodes and OR nodes (7). The AND branches, connected by double arcs, indicate that both compounds are required to make the compound above them. The OR branches (there are three OR paths to make compound II) indicate different routes for making the compound. The terminal nodes corresponding to starting materials are enclosed in boxes. At present, a branch is terminated when the number of clauses in the clause list, the internal representation of the goal, is less than or equal to six or the clause list matches the clause list of a starting material molecule.
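
The backward-chaining search over an AND/OR tree described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule table `RULES`, the compound names, and the set `STARTING_MATERIALS` are hypothetical placeholders, and the sketch does not reproduce SYNLMA's clause representation, its theorem prover, or its clause-count termination test.

```python
# Hypothetical retrosynthetic rules (placeholders, not SYNLMA's database).
# Each target maps to a list of alternative routes (OR branches); each
# route lists the precursors that are all required (AND branch).
RULES = {
    "target": [["A", "B"], ["C"]],
    "A": [["s1"]],
    "B": [["s2", "s3"]],
    "C": [["D"]],  # D has no rule and is not available: a dead end
}
STARTING_MATERIALS = {"s1", "s2", "s3"}


def solve(goal, depth=0, max_depth=10):
    """Return a synthesis tree for `goal`, or None if no route is found.

    A tree is ("start", goal) for an available compound, or
    ("make", goal, subtrees) for the first AND branch that succeeds.
    """
    if goal in STARTING_MATERIALS:  # terminal node
        return ("start", goal)
    if depth >= max_depth:  # guard against runaway recursion
        return None
    for precursors in RULES.get(goal, []):  # OR: try each alternative
        subtrees = [solve(p, depth + 1, max_depth) for p in precursors]
        if all(t is not None for t in subtrees):  # AND: all must be solved
            return ("make", goal, subtrees)
    return None  # every branch was a dead end


def leaves(tree):
    """Collect the starting materials used by a solved tree."""
    if tree[0] == "start":
        return [tree[1]]
    return [leaf for sub in tree[2] for leaf in leaves(sub)]


tree = solve("target")
```

In this toy rule set the route through A and B succeeds, while the alternative through C fails because its only precursor can be neither made nor bought.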
[Figure 3. Problem solving tree for the synthesis of Darvon (figure not reproduced).]

Currently, SYNLMA generates one problem solving tree for each molecule that it synthesizes. Some of the tree's paths are viable synthetic routes; others are dead ends. Unfortunately, good and bad paths are pursued with the same intensity, and a lot of time is spent pursuing dead-end paths. The program has the potential to be more clever in its approach. It can generate a number of trees of varying specificity. First, SYNLMA could generate a tree of multistep reaction rules. A tree built from multistep reaction rules would be quicker to build than one where each step is specified. Then a second, more specific tree could be generated using the knowledge gained from the first. For example, some synthetic pathways could be ruled out on the basis of the multistep rules. The more pathways that can be eliminated on the basis of one multistep rule as opposed to a series of single step rules, the faster the system can work. For paths that appear promising, the products and reactants in the first tree form pairs of targets and starting materials that will direct the growth of the second tree. A synthetic path that works in the more general tree does not necessarily work when SYNLMA tries to fill in the gaps between nodes with single step rules. Some other condition, like a substructure constraint, may block the pathway. So a combination of the two approaches, general and specific, is necessary. Two databases, one of single step rules, the other multistep, are necessary to implement the two tree system. Since our database is a mixture of these two types of rules, a two tree system is not yet possible. It will have to wait until we can separate our database into two parts.
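
The proposed two tree system can be illustrated with a small sketch. Everything here is an assumption made for illustration: the two rule dictionaries `MULTISTEP` and `SINGLE_STEP`, the compound labels, and the depth limit are hypothetical, and each rule is reduced to a single precursor for brevity. A coarse pass finds candidate routes with multistep rules; a fine pass then tries to fill each gap with single step rules and discards routes whose gaps cannot be filled.

```python
# Hypothetical rule sets (placeholders). A multistep rule jumps from a
# product to a distant precursor; a single step rule gives an immediate one.
MULTISTEP = {"T": ["M1", "M2"], "M1": ["S1"], "M2": ["S2"]}
SINGLE_STEP = {"T": ["x"], "x": ["M1"], "M1": ["y"], "y": ["S1"]}
STARTING = {"S1", "S2"}


def coarse_paths(goal, rules, path=()):
    """Enumerate paths from `goal` down to a starting material."""
    path = path + (goal,)
    if goal in STARTING:
        yield path
        return
    for precursor in rules.get(goal, []):
        if precursor not in path:  # avoid cycles
            yield from coarse_paths(precursor, rules, path)


def fill_gap(product, precursor, rules, limit=4):
    """Depth-limited search for single steps linking the two compounds."""
    if product == precursor:
        return [product]
    if limit == 0:
        return None
    for nxt in rules.get(product, []):
        rest = fill_gap(nxt, precursor, rules, limit - 1)
        if rest is not None:
            return [product] + rest
    return None


def refine(path):
    """Expand a coarse path into single steps; None if a gap is blocked."""
    detailed = [path[0]]
    for product, precursor in zip(path, path[1:]):
        segment = fill_gap(product, precursor, SINGLE_STEP)
        if segment is None:
            return None
        detailed.extend(segment[1:])
    return detailed


routes = [refine(p) for p in coarse_paths("T", MULTISTEP)]
```

With these placeholder rules the coarse route through M1 can be refined into single steps, whereas the route through M2 is rejected in the second pass, mirroring the remark that a path valid in the general tree need not survive in the specific one.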

Future Directions

After the two tree system is functioning, we would like to add a third tree definition layer that precedes the others and determines an overall synthetic strategy. The focus during this stage is on the recognition of cogent substructures; thus it requires a database of about 200 compounds instead of reaction rules. The target will be compared to these compounds rather than reaction rules and "matches" one of these compounds when a large substructure in the target is identified in a compound. This matching compound now becomes the new target and the process is repeated, resulting in a much more abstract problem solving tree. Then the two-tree system is applied to this tree to define targets and starting materials. The system moves from the general to the specific, using the information from the first tree to build the second tree and information from the second tree to build the third. The third and final tree describes the specific steps in the synthetic pathway.
If an organic synthesis system is to be of practical use to chemists, it must be set up to interface with large chemical databases such as the databases made available by ISI (the Institute for Scientific Information) and by Chemical Abstracts. We have started to convert our database to the CAS connection table format to simplify database interfaces. Fortunately, this does not require changing SYNLMA. We only need to write a new program to translate connection tables into clauses, but this is a short program independent of SYNLMA.
From the user's point of view the most important step in improving SYNLMA is to rewrite and expand the user interface. Currently, the compound we want to synthesize is entered in clause form and the program is run in batch mode. This means that the user cannot affect SYNLMA's behavior once the system starts working on a synthesis. We plan to develop an interactive system where the user enters the initial target molecule by drawing it on the screen using a graphics package and is able to monitor the progress of the theorem prover. The user will be able to control its actions by removing intermediate targets and suggesting starting materials.

Summary

The success of SYNLMA shows that it is possible to base an expert system on a theorem prover. The advantage of using a theorem prover as the deductive component is that it allows us to experiment with a number of different representations for chemical information. The same flexibility makes it easy to add new starting materials and reaction rules from large commercial online databases.

Acknowledgments

This research was partially supported by the National Science Foundation under Grant MCS 82-16432.

Literature Cited

1. Lusk, E.; McCune, W.; Overbeek, R. Proc. Sixth International Conference on Automated Reasoning, Loveland, D., Ed.; Computer Science Lecture Notes #138, Springer-Verlag: New York, 1982, pp. 85-108.
2. Lusk, E.; McCune, W.; Overbeek, R. Proc. Sixth International Conference on Automated Reasoning, Loveland, D., Ed.; Computer Science Lecture Notes #138, Springer-Verlag: New York, 1982, pp. 70-84.
3. Wos, L.; Overbeek, R.; Lusk, E.; Boyle, J. "Automated Reasoning"; Prentice-Hall: Englewood Cliffs, New Jersey, 1984.
4. Gelernter, H.; Sanders, A.; Larsen, D.; Agarwal, K.; Boivie, R.; Spritzer, G.; Searleman, J. Science, 1977, 197, 1041.
5. Wang, T.; Ehrlich, S.; Evens, M.; Gough, A.; Johnson, P. Proc. Conference on Intelligent Systems and Machines, 1984, pp. 176-181.
6. Wang, T.; Burnstein, I.; Ehrlich, S.; Evens, M.; Gough, A.; Johnson, P. Proc. 1985 Conference on Intelligent Systems and Machines, 1985.
7. Nilsson, N. "Principles of Artificial Intelligence"; Tioga: Palo Alto, California, 1980.
RECEIVED December 17, 1985

21

Acquisition and Representation of Knowledge for Expert Systems in Organic Chemistry

J. Gasteiger, M. G. Hutchings¹, P. Löw, and H. Saller
Institute of Organic Chemistry, Technical University Munich, D-8046 Garching,
West Germany

Many of the models used by the organic chemist to explain his observations provide a good basis for representing chemical knowledge in an expert system. Such knowledge can be acquired by developing algorithms for these models and parameterizing them with the aid of physical or chemical data. This is demonstrated for concepts such as electronegativity, polarizability, or the inductive and resonance effects. Combination of these models permits construction of systems which make predictions worthy of an experienced chemist. This is exemplified by EROS, a system that can predict the course of chemical reactions or can design organic syntheses.

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch021

¹Current address: Organics Division, Imperial Chemical Industries plc, Blackley, Manchester M9 3DA, England
0097-6156/86/0306-0258$06.00/0 © 1986 American Chemical Society

Chemistry, as a scientific and technological discipline, has some unique characteristics. In contrast to physics, where most of the underlying laws can be given in explicit and sometimes simple mathematical form, many of the laws governing chemical phenomena are either not explicitly known, or else have a mathematical form that still eludes an exact solution. Still, chemistry does provide, and rests on, quantitative data of physical or chemical properties of high numerical precision. A search for quantitative relationships is thus suggested, despite the lack of a tractable theoretical basis. Chemists have accumulated over the last two centuries an enormous amount of information on compounds and reactions. However, this information appears largely as a collection of individual facts devoid of any comprehensive structure or organization. This is most painfully felt by the novice studying chemistry. However, the more he progresses in his scientific discipline, the more concepts and rules emerge that allow him to bring order into his knowledge. These concepts include partial atomic charges, electronegativity, inductive, resonance, or steric effects, which have all been coined by the chemist to derive models for the principles governing chemical observations. The design of these models has involved the reduction of collections of individual observations to general principles.
Throughout this paper we use the term model. It will refer to concepts of varying degrees of sophistication and specification. A model can be a notion developed by the chemist to classify an observation, it can be an explicit procedure for the calculation of a value for a physico-chemical concept, or it can refer to a mathematical equation for the prediction of an observation. We intentionally do not distinguish between these different uses in order to stress the point that the development of a model to further understanding is quite a common approach in science.
The huge amount of information available in chemistry early on invited the use of the computer for storing and retrieving information. Documentation systems have been developed, and are being maintained, that contain a sizeable amount of the known chemical information. Thus, they have gained importance as a knowledge base for assisting the chemist in solving his problems. Clearly, the construction of a large chemical information retrieval system is an enormous endeavor. Furthermore, the work will never be complete as new information is constantly being gathered and should be incorporated into the system. Beyond that, pure retrieval can only give access to known information. Without appropriate structuring of information no predictions can be made of new information.
Thus, some of the most important and interesting problems of a chemist could not be tackled. These are:
1. What will be the properties of an unknown compound?
2. What is the structure of a new compound?
3. How can a compound with a new structure be synthesized?
These questions fall into the domains of structure-activity relationships, structure elucidation, and synthesis design, respectively. They all ask for new information not yet known explicitly. That is, they require predictions.
It would be highly desirable to reduce the individual facts in an information retrieval system to general principles just as the chemist has done in devising his empirical concepts mentioned previously. Such a reduction of information to its essential contents calls for insight in order to transform information into knowledge.
We have not attempted to make the computer do the job of automatically finding the fundamental laws of chemistry from a compilation of individual facts. Rather, we have explicitly built into the computer specific models that we believe can represent the structure of chemical information. We were guided in this endeavor by concepts derived by the chemist and have tried to develop models and procedures that quantify these concepts. In doing so we have put more emphasis on the acquisition and representation of knowledge than on problem-solving techniques. In any expert system the quality of the knowledge base is of primary and decisive importance.
We are mainly concerned with the development of EROS (Elaboration of Reactions for Organic Synthesis), a program system for the prediction of chemical reactions and the design of organic syntheses (1-3). This system does not rely on a database of known reactions. Instead, reactions are generated in a formal manner by breaking and making bonds and shifting electrons. In Figure 1 one of the reaction schemes contained in the program is shown. This scheme, breaking two bonds and making two new ones, is quite important; many rather diverse reactions follow that scheme. Such a scheme can be applied in both a forward search (reaction prediction; 1a and 1b) as well as in a retrosynthetic search (synthesis design; 1c).
Clearly, not all reactions obtained by such a formal scheme can be realistic ones. In fact, many have no chemical reality (cf. 1d). A major task in program development is therefore to find ways of automatically extracting the chemically feasible reactions from amongst the formally conceivable ones. To this end a modelling of chemical reactivity seems indispensable.
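
The formal scheme of breaking two bonds and making two new ones can be sketched as a purely combinatorial operation on a bond list. The sketch below is an illustration only and makes several assumptions: molecules are reduced to sets of bonds between labelled fragments, groups such as CH3 and OH are treated as single pseudo-atoms, and the function name `apply_scheme` is hypothetical. It does not reproduce the internal representation used by EROS.

```python
from itertools import combinations


def apply_scheme(bonds):
    """Yield bond sets obtained by breaking I-J and K-L and making
    I-K and J-L (or I-L and J-K) for every pair of disjoint bonds."""
    bond_set = {frozenset(b) for b in bonds}
    ordered = sorted(bond_set, key=sorted)  # deterministic enumeration
    for b1, b2 in combinations(ordered, 2):
        if b1 & b2:  # bonds sharing an atom are skipped in this sketch
            continue
        i, j = sorted(b1)
        k, l = sorted(b2)
        for new1, new2 in (((i, k), (j, l)), ((i, l), (j, k))):
            new_bonds = {frozenset(new1), frozenset(new2)}
            if new_bonds & bond_set:  # bond already present
                continue
            yield (bond_set - {b1, b2}) | new_bonds


# CH3-Br + H-OH, with CH3 and OH treated as pseudo-atoms (an assumption).
products = list(apply_scheme([("CH3", "Br"), ("H", "OH")]))
```

For this input the generator returns two formal outcomes: CH3-OH with H-Br, and CH3-H with Br-OH. Both are formally conceivable, which is exactly why a separate evaluation of chemical reactivity is needed to decide which outcome is realistic.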

Finding the Pieces

The high quality numerical data on physical and chemical properties of atoms, molecules, and compounds present a good starting point for the development of a knowledge base. The task is to condense the information contained in a series of individual data into a quantitative parametric model which will reproduce the primary data with a certain accuracy. If this is successful, the model can be used to predict new, as yet unknown data for which the same kind of accuracy can be expected. Furthermore, the parameters could also be of use in other models which in turn give new types of data.
In developing models for treating chemical reactivity we have been guided by the concepts used by the organic chemist in discussing the causes of organic reactions and their mechanisms. Examples of the more prominent effects are shown in Figure 2.
Our intention has been to derive models that can quantify these various effects and thereby build a basis for a quantitative treatment of chemical reactivity. The following simple models that enable calculations to be performed rapidly on large molecules and big data sets have been developed.

Heats of Reaction and Bond Dissociation Energies. The simplest form of a model is an additivity scheme that derives a molecular property through summation over increments assigned to atoms, bonds or groups (4). We have explored such an approach by assuming that heats of formation can be estimated from values assigned to direct (1,2) and next nearest (1,3) atom-atom interactions (5). Values for these parameters have been derived from experimental heats of formation through multi-linear regression analyses (6). As an example, the heats of formation of 49 alkanes have been condensed into four fundamental parameters that reproduce the data with a standard error of 0.87 kcal/mol (6).
This amounts to a sizeable reduction of the information that has to be stored, while conserving a rather good accuracy in the data. With these four parameters unknown heats of formation of alkanes can be estimated by the additivity scheme with a similarly high accuracy. This approach has been extended to other series of compounds.
Using these parameters for the estimation of the heats of formation of starting materials and products of a reaction and then taking the difference in these two numbers provides values for reaction enthalpies. Only parameters of those substructures that are changed in a reaction need be considered.
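
The remark that only changed substructures need be considered follows directly from the additive form, as the following sketch shows. It rests on loud assumptions: the increments in `INCREMENT` are placeholder numbers chosen only to make the example run (they are not the fitted parameters of refs 5 and 6), and plain bond counts stand in for the 1,2 and 1,3 atom-atom interaction terms of the actual model.

```python
from collections import Counter

# Placeholder increments in kcal/mol; illustrative values only, NOT the
# parameters derived in the cited work.
INCREMENT = {"C-H": -3.5, "C-C": 2.5, "C=C": 20.0, "H-H": 0.0}


def heat_of_formation(counts):
    """Additivity estimate: sum of increment times number of occurrences."""
    return sum(INCREMENT[kind] * n for kind, n in counts.items())


def reaction_enthalpy(reactants, products):
    """Heat of reaction from the full substructure counts of both sides."""
    return (sum(map(heat_of_formation, products))
            - sum(map(heat_of_formation, reactants)))


def reaction_enthalpy_changed_only(reactants, products):
    """Same quantity using only the substructures whose count changes."""
    before = sum((Counter(c) for c in reactants), Counter())
    after = sum((Counter(c) for c in products), Counter())
    kinds = set(before) | set(after)
    return sum(INCREMENT[k] * (after[k] - before[k])
               for k in kinds if after[k] != before[k])


# Hydrogenation of ethene to ethane, described by bond counts only.
ethene = {"C=C": 1, "C-H": 4}
hydrogen = {"H-H": 1}
ethane = {"C-C": 1, "C-H": 6}

full = reaction_enthalpy([ethene, hydrogen], [ethane])
short = reaction_enthalpy_changed_only([ethene, hydrogen], [ethane])
```

Both routes give the same number because every unchanged increment cancels in the difference; the numerical value itself carries no chemical meaning here, since the increments are placeholders.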


[Figure 1. Formal reaction scheme with examples. General scheme: I-J + K-L -> I-K + J-L; panels a) to d) show example reactions (figure not reproduced).]


Furthermore, the effects of strained rings and of aromatic compounds must be considered (7), and algorithms that perform these tasks have been developed (8, 9). Values of bond dissociation energies can be calculated by extending the parametrization to radicals (10). Table I gives results obtained for methyl propionate; experimental values are from compounds containing similar structural situations around the bond being considered (11).

Table I. Comparison between calculated and experimental bond dissociation energies in methyl propionate (in kcal/mol)

Atom numbering: C1H3-C2H2-C3(=O4)-O5-C6H3

bond      BDE (calc)    BDE (exp.) (ref.)
C1-H         98.7       98.2 ± 1
C2-H         93.4       92.3 ± 1.4
C6-H         93.8       94 ± 2
C1-C2        85.0       86.4 ± 1
C2-C3        83.6       81.2 ± 1
C3-O4       123.4       -
C3-O5        96.9       95.5 ± 1.5
C6-O5        86.4       83.6 ± 1.5

An additivity scheme is a rather simple model, but despite this, such schemes can be applied to a variety of physical data of molecules. Benson and Buss have classified additivity rules into successive approximations and have given examples of their applicability (4). According to their terminology the zero-order approximation of a molecular property is given by additivity of atomic properties, the first-order approximation by additivity of bond properties, and the second-order approximation by additivity of group properties. More recent widespread use of additivity schemes is found in methods for estimating spectroscopic data, in particular those for deriving ¹H- or ¹³C-NMR chemical shifts of organic molecules.

Polarizability Effects. The next model demonstrates that an additivity scheme can be combined with other forms of mathematical relations to extract the fundamental parameters of a model from primary information. Furthermore, it shows that an additivity scheme useful for the estimation of a global molecular property can be modified to obtain a local, site-specific property.
Miller and Savchik (12) have given Equation 1 for estimating the mean polarizability, $\bar{\alpha}$, of a molecule, where $N$ is the total number of electrons in the molecule, and $\tau_i$ is a polarizability contribution for each atom $i$, characteristic of the atom type and its hybridization state.


$$\bar{\alpha} = \frac{4}{N}\Bigl(\sum_i \tau_i\Bigr)^{2} \qquad (1)$$
Mean molecular polarizability can be calculated through the Lorenz-Lorentz equation from refractive index, $n_D$, molecular weight, MW, and density, $d$, of a compound, demonstrating that the parameters $\tau_i$ can be derived from these elementary molecular properties (Figure 3).
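
As an illustration of the first step in this chain, the Lorenz-Lorentz relation in its usual textbook form, $\bar{\alpha} = \frac{3}{4\pi N_A}\,\frac{n_D^2-1}{n_D^2+2}\,\frac{MW}{d}$, can be evaluated directly. The sketch below assumes this standard form and units of g/mol and g/cm^3; the function name `mean_polarizability` is introduced here for illustration and is not taken from the source.

```python
from math import pi

AVOGADRO = 6.02214076e23  # particles per mole


def mean_polarizability(n_d, mol_weight, density):
    """Mean molecular polarizability in cubic angstroms.

    n_d: refractive index; mol_weight in g/mol; density in g/cm^3.
    """
    # Molar refraction in cm^3/mol.
    molar_refraction = (n_d**2 - 1) / (n_d**2 + 2) * mol_weight / density
    alpha_cm3 = 3.0 * molar_refraction / (4.0 * pi * AVOGADRO)
    return alpha_cm3 * 1e24  # 1 cm^3 equals 1e24 cubic angstroms


# Water at room temperature: n_D about 1.333, density about 0.997 g/cm^3.
alpha_water = mean_polarizability(1.333, 18.015, 0.997)  # about 1.47
```

Values obtained this way for a series of compounds are the kind of primary data from which atomic contributions $\tau_i$ could then be fitted.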
Polarizability is a measure of the relative ease of distortion of a dipolar system when exposed to an external field. The stabilization energy due to the interaction between an external charge and the induced dipole is highly distance-dependent and can be calculated through classical electrostatics. The situation is, however, less clearly defined when the charge resides within the molecule that is being polarized. To model the stabilization resulting from polarizability in these situations, we have modified Equation 1 by introducing a damping factor $d^{\,n_i-1}$, where $0 < d < 1$, and $n_i$ gives the smallest number of bonds between an atom $i$ and the charge center (Equation 2) (13).

$$\alpha_d = \frac{4}{N}\Bigl(\sum_i d^{\,n_i-1}\,\tau_i\Bigr)^{2} \qquad (2)$$
$\alpha_d$ is called effective polarizability, as the damping factor models the distance-dependent attenuation of the stabilization effect. Furthermore, this factor gives different values of $\alpha_d$ for the same molecule depending on where the charge center is located. An alternative additivity scheme (14) for estimating mean molecular polarizability can be similarly modified to obtain values of effective polarizability (15). The significance of these values has been demonstrated by correlation with physical data (13).
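
A minimal sketch of Equation 2 on a molecular graph may make the site dependence concrete. Several points are assumptions made for illustration: the $\tau_i$ values in `TAU` are placeholders rather than the published parameters of ref 12, the damping factor d = 0.75 is an arbitrary choice within 0 < d < 1, and the atom at the charge center itself is treated as undamped (n_i set to 1), which the text does not specify.

```python
from collections import deque

# Placeholder atomic contributions tau_i (illustrative, NOT the published
# parameters) and the number of electrons per element.
TAU = {"C": 1.0, "H": 0.3, "N": 0.9}
ELECTRONS = {"C": 6, "H": 1, "N": 7}


def bond_distances(adjacency, center):
    """Smallest number of bonds from `center` to every atom (BFS)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def effective_polarizability(elements, adjacency, center, d=0.75):
    """Equation 2: (4/N) * (sum_i d**(n_i - 1) * tau_i)**2."""
    n_electrons = sum(ELECTRONS[e] for e in elements.values())
    dist = bond_distances(adjacency, center)
    damped = sum(
        # Assumption: the center atom counts as n_i = 1 (no damping).
        d ** (max(dist[atom], 1) - 1) * TAU[element]
        for atom, element in elements.items()
    )
    return 4.0 / n_electrons * damped**2


# Methylamine, CH3-NH2, as a labelled graph.
elements = {"N": "N", "C": "C", "H1": "H", "H2": "H",
            "H3": "H", "H4": "H", "H5": "H"}
adjacency = {
    "N": ["C", "H1", "H2"],
    "C": ["N", "H3", "H4", "H5"],
    "H1": ["N"], "H2": ["N"],
    "H3": ["C"], "H4": ["C"], "H5": ["C"],
}
alpha_at_n = effective_polarizability(elements, adjacency, "N")
alpha_at_h = effective_polarizability(elements, adjacency, "H3")
```

Setting d = 1 recovers Equation 1, and moving the charge center from the nitrogen to a methyl hydrogen changes the result, which is the site dependence described in the text.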
Charge Distribution, Inductive and Resonance Effects. Until now, the discussion has been concerned with models based on additivity schemes and their modifications. However, we have also explored other types of models that can be put into algorithms that are fast, albeit less convenient for pencil and paper application.
This is true for our procedure for calculating partial atomic charges in σ-bonded molecules (16). The method starts from Mulliken's definition of electronegativity, $\chi$, derived from atomic ionization potentials, IP, and electron affinities, EA (Equation 3) (17).

$$\chi = 0.5\,(\mathrm{IP} + \mathrm{EA}) \qquad (3)$$

[Figure 2. Concepts used in discussing the causes of organic reactions: heat of reaction and bond dissociation energy, charge distribution, inductive effect, polarizability, and resonance effect, each illustrated by an example reaction (figure not reproduced).]

[Figure 3. Deriving values for effective polarizability, $\alpha_d$, from refractive index, $n_D$, molecular weight, MW, and density, $d$. Flow diagram with boxes for the Lorenz-Lorentz equation, the additivity scheme, and the attenuation model (figure not reproduced).]

Electronegativity was considered to be dependent both on orbital type and on the occupation number of an orbital (or, equivalently, the charge on an atom). On bond formation, negative charge is transferred from the less to the more electronegative atom. Because of the charge dependence, the electronegativities change in the sense that they tend to equalize. The problem of the mutual dependence of electronegativity on charge and of charge transfer on electronegativity was solved by an iterative procedure that takes explicit account of the molecular topology (16). This gives access to a self-consistent set of values of partial charges and associated residual electronegativity values for each atom of a molecule, reflecting atomic type as well as the influence of the molecular environment.
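
The mutual dependence of charge and electronegativity can be illustrated by a small iterative scheme over the bonds of a molecule. This is a sketch under stated assumptions and not the published procedure of ref 16: the linear charge dependence chi(q) = A + B*q, the parameter values in `PARAMS`, the normalisation constant `SCALE`, and the damping of successive iterations by a factor of 0.5 are all choices made here for illustration.

```python
# Hypothetical parameters for chi(q) = A + B*q; illustrative values only.
PARAMS = {"C": (8.0, 9.0), "H": (7.2, 6.2), "F": (14.7, 13.8)}
SCALE = 20.0  # hypothetical normalisation of the electronegativity gap


def partial_charges(elements, bonds, iterations=6, damping=0.5):
    """Iteratively shift charge along bonds toward the atom that is
    currently more electronegative, damping each successive step."""
    charges = {atom: 0.0 for atom in elements}
    for k in range(1, iterations + 1):
        # Electronegativities depend on the charges from the last step.
        chi = {a: PARAMS[e][0] + PARAMS[e][1] * charges[a]
               for a, e in elements.items()}
        delta = {atom: 0.0 for atom in elements}
        for a, b in bonds:  # topology enters through the bond list
            transfer = damping**k * (chi[b] - chi[a]) / SCALE
            delta[a] += transfer  # a loses electrons if b is more electronegative
            delta[b] -= transfer
        for atom in elements:
            charges[atom] += delta[atom]
    return charges


# Fluoromethane, CH3F.
elements = {"C": "C", "F": "F", "H1": "H", "H2": "H", "H3": "H"}
bonds = [("C", "F"), ("C", "H1"), ("C", "H2"), ("C", "H3")]
charges = partial_charges(elements, bonds)
```

Each transfer is added to one atom and subtracted from its partner, so the total charge is conserved, and as charge accumulates the electronegativities of bonded atoms move toward each other, which is the partial equalization the paragraph describes.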
The charge values have been used to correlate or calculate a variety of physical data including dipole moments (18), ESCA chemical shifts (16), ¹H-NMR chemical shifts (19), and $^1J_{C-H}$ coupling constants (20), thereby relating these physical data to the fundamental values of IP and EA, in concert with a proper consideration of the network of bonds in molecules.
An extension of the method has been developed for conjugated π-systems which arrives at charge distribution in these systems by generating the various resonance structures and assigning weights to them (21, 22). Again, the significance of the charge values was established by reproducing physical data of molecules.
It was found that the residual electronegativity values calculated for σ-bonded molecules can be taken as a quantitative measure of the inductive effect (23). In a similar manner, the values of π-electronegativities can be used for quantifying the resonance effect.

Hyperconjugation. Empty or partially filled p-orbitals can be stabilized through overlap with adjacent C-H and C-C bonds of appropriate symmetry. Following a previous suggestion (24), we have taken the number of such bonds as a measure of this stabilization through hyperconjugation.

Putting the Pieces Together

The previous section has briefly presented methods that quantify the various effects used by the organic chemist to rationalize his observations on reactivity, reaction mechanisms, and the course of organic reactions. Physical data were chosen to demonstrate the significance of the calculated values.
But are the values calculated by the above methods also useful for understanding and predicting chemical reactivity data? Here, the situation is less well-defined than with physical properties. In many cases our knowledge of chemical reactivity is more of a semiquantitative nature. Furthermore, in many reactions the various effects operate simultaneously, and they do so to varying degrees.
Several statistical and pattern recognition techniques were used to unravel the relationships between chemical reactivity data and the previously described effects which influence them.

Multilinear Regression Analysis. As an entry to the problem we have selected simple gas phase reactions involving proton or hydride ion transfer which are influenced by only a few effects and for which reactivity data of high accuracy are available. In these situations, where a larger set of numerical data is available, multilinear regression analysis (MLRA) was applied. Thus, the simplest mathematical form, a linear equation, is chosen to describe the relationship between reactivity data and physicochemical factors. The number of parameters (factors) simultaneously applied was always kept to a minimum, and a particular parameter was only included in a MLRA study if a definite indication of its relevance existed.
The proton affinity (PA) of alkylamines can be described by only a single parameter, the effective polarizability $\alpha_d$ (13). For 49 unsubstituted alkylamines a method was found to calculate proton affinity from the refractive index, molecular weight, and density (cf. Figure 3). For alkylamines carrying heteroatom substitution a measure of the inductive effect had to be included. This could be achieved by using residual electronegativity values, $\chi_{12}$, in the two parameter Equation 4 (23).

PA = c_0 + c_1 α_d - c_2 χ_12    (4)

The signs of the coefficients in this equation are entirely
consistent with intuition: polarizability stabilizes, and
electronegativity destabilizes, the protonated form of the amine.
Similar equations could be developed for proton affinity data of
alcohols and ethers, as well as of thiols and thioethers (Figure 4b
and 4c) (25). Furthermore, α_d and χ_12 parameters were also
sufficient to describe quantitatively gas phase acidity data of
alcohols (Figure 4d) (25). In this case, the coefficients for the two
parameters had the same sign, as both effects provide sources of
stabilization for the alkoxide ion. Figure 5 shows the results
obtained by MLRA.
Simple linear equations could also be developed for the other
three systems of Figure 4: the PA of aldehydes and ketones (4e), and
their hydride ion affinities, both of the neutral (4f) and protonated
forms (4g). However, in addition to effective polarizability and
electronegativity, hyperconjugation also had to be used as a
parameter, as p-orbitals carrying a partial positive charge are
involved in reactions 4e to 4g (26).
Multiparameter equations, such as Equation 4, obtained through
MLRA are the simplest form of parallel connection of several models.
Each model has been parameterized from its own source of primary
data. Combined application can reproduce new types of data and lead
to new information and knowledge.
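The fitting step behind a two-parameter equation of this kind can be
sketched as an ordinary least-squares problem. The sketch below is
only an illustration, not the authors' program: the function names are
hypothetical, the descriptor values are synthetic, and the "proton
affinities" are generated from made-up coefficients so that the fit
can be checked against a known answer.

```python
# Minimal least-squares sketch for a two-parameter linear model of the form
# y = c0 + c1*x1 + c2*x2 (for example PA against alpha_d and chi_12).
# All numbers are synthetic placeholders, not data from the chapter.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_mlra(rows, targets):
    """Return least-squares coefficients [c0, c1, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic (alpha_d, chi_12) pairs; targets built from assumed coefficients.
params = [(3.0, 0.5), (4.0, 1.5), (5.0, 0.7), (6.0, 2.0), (7.0, 1.1)]
pa_values = [200.0 + 5.0 * a - 3.0 * x for a, x in params]
coefficients = fit_mlra(params, pa_values)
```

With noise-free synthetic targets the fit simply recovers the assumed
coefficients, including the negative sign on the electronegativity
term; with real data the residuals would indicate whether further
parameters are justified.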
The correlations with data on gas phase reactions have served
to establish that the parameters calculated by our methods are indeed
useful for the prediction of chemical reactivity data. Their
application is, however, not restricted to data obtained in the gas
phase. This has been shown through a correlation of pKa values (in
H2O) of alcohols with residual electronegativity and polarizability
parameters, by including a parameter that is interpreted to reflect
steric hindrance of solvation (27).

The Reactivity Space. In many reaction types the situation is not as
well defined as in the chemical reactions so far investigated. If
either fewer and less accurate reactivity data are available, or the
chemical system is under the influence of many effects, then MLRA is
no longer the appropriate analytical method.
For such situations we have developed a different approach. The
parameters calculated by our methods are taken as coordinates in a
space, the reactivity space. A bond of a molecule is represented in
such a space as a specific point, having characteristic values for
the parameters taken as coordinates. Figure 6 shows a
three-dimensional reactivity space spanned by bond polarity, bond
dissociation energy, and the value for the resonance effect as
coordinates.

Figure 4. Gas phase reactions for which linear equations have
been developed using polarizability, electronegativity, and
hyperconjugation parameters. Reaction a) ref. 13, 23; b)-d) ref.
25; e)-g) ref. 26.

Figure 5. Experimental gas phase acidity data of alcohols
plotted against values calculated from electronegativity and
polarizability parameters; horizontal axis: ΔH (calc.), kcal/mol.
(Reprinted from: Gasteiger, J.; Hutchings, M.G. J. Am. Chem. Soc.
1984, 106, 6489).


Figure 6. Reactivity space having bond polarity, Q_σ, bond
dissociation energy, BDE, and resonance effect parameter, R, as
coordinates.


In this space various bonds of 2-chlorobutyric acid are indicated. In
fact, in this case we have investigated heterolytic bond cleavages.
Thus, each bond will give rise to two points in this space, depending
on the direction in which the charges are shifted on heterolysis: in
the direction of bond polarity, or against it. For example, points 2
and 3 of Figure 6 both refer to the carbonyl double bond. In the case
of point 2, the heterolysis is against the preformed polarization of
that bond (Figure 7), and therefore the bond polarity parameter Q_σ
has a negative sign.
Points 2 and 3 are characterized by the same value for the
(homolytic) bond dissociation energy. However, resonance stabilization
of charges can occur only for the heterolysis represented by point 3.
Therefore in this case, the resonance parameter R has a high value,
whereas it is zero for the heterolysis represented by point 2.
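The idea that one bond contributes two points to the reactivity space
can be written down as a small data structure. This is a sketch under
stated assumptions: the class and function names are hypothetical, the
numbers are placeholders rather than values from Figure 6, and it is
assumed here that the two points differ in the sign of the polarity
coordinate and in the resonance value while sharing the same BDE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeterolysisPoint:
    """One heterolysis of one bond, seen as a point in the reactivity space."""
    bond: str
    polarity: float   # Q_sigma; negative when cleavage opposes the bond polarity
    bde: float        # homolytic bond dissociation energy (shared by both points)
    resonance: float  # R; stabilization available to the charges that are formed


def heterolysis_points(bond, polarity, bde, r_with, r_against):
    """Return the two points for a bond: cleavage with and against its polarity."""
    with_polarity = HeterolysisPoint(bond, abs(polarity), bde, r_with)
    against_polarity = HeterolysisPoint(bond, -abs(polarity), bde, r_against)
    return with_polarity, against_polarity


# Placeholder values for a carbonyl double bond (not taken from the chapter).
point_3, point_2 = heterolysis_points("C=O", 0.2, 170.0, r_with=10.0, r_against=0.0)
```

Here `point_3` plays the role of the cleavage along the bond polarity
and `point_2` the cleavage against it, mirroring the numbering used in
the text.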
Figure 6 shows an additional feature. The points are distinguished
according to whether the associated bond is considered reactive
(breakable; small cubes) or not (non-breakable; small pyramids).
Any chemist will agree that the most reactive bonds of
2-chlorobutyric acid are the O-H, the C=O, and the C-Cl bonds, where
the negative charge goes to the more electronegative atom (O or Cl) on
heterolysis. This heterolytic cleavage of the three bonds is
represented by points 7, 3, and 4, respectively. The other bonds are
considered as much less reactive, or non-breakable. It can be seen
that reactive and non-reactive bonds clearly separate. Thus, this
three-dimensional space already represents the ease of breaking a
bond, a chemical reactivity phenomenon, quite well.
Figure 6 shows a three-dimensional reactivity space. Whereas
this is the limit for pictorial representation, statistical
methods can deal with spaces of higher dimensionality. In a study
aimed at modelling the reactivity of single bonds in aliphatic
chemistry a data set of 28 molecules representing that field was
chosen. Table II gives this data set.
The entire set of molecules contained 782 bonds out of which 111
σ-bonds were selected. The parameters were calculated by our methods
to build a reactivity space with electronegativity difference,
resonance effect parameter, bond polarizability, bond polarity,
σ-charge distribution, and bond dissociation energy as six
coordinates.
First, unsupervised-learning pattern recognition methods were
applied. A principal component analysis showed that the
dimensionality of the space could be reduced without much loss of
information. With three factors, instead of six, 85.9% of the variance
of the data set could still be reproduced. The first factor can be
identified as containing the σ-electron distribution; the second
factor is highly loaded with the bond dissociation energy and bond
polarizability. The third factor contains a mixture of effects.
Cluster analysis was applied as a second unsupervised-learning
technique. In this case it was applied to a reactivity space of
reduced dimensionality using (for reasons that become clearer below)
the resonance effect, bond polarity, and bond dissociation energy as
coordinates. The results are shown as a dendrogram in Figure 8.
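How a dendrogram arises from points in such a space can be sketched
with a small agglomerative clustering routine. The chapter does not
say which clustering method or distance measure was used, so the
single-linkage rule and Euclidean distance below are assumptions, and
the coordinates are synthetic placeholders.

```python
import math


def single_linkage(points):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance).

    Read in order, the merge list is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Four synthetic bonds in a three-coordinate space (placeholder values).
bond_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 10.0, 10.0), (12.0, 10.0, 10.0)]
merges = single_linkage(bond_points)
```

In practice the coordinates would be scaled before clustering, since
resonance, polarity, and energy values are not on comparable scales.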
It is probably not too surprising that the same bond types
cluster together, as they are characterized by similar values for the
respective parameters. However, the interrelationships between
different bond types indicated by the overall structure of the
dendrogram, and some of the smaller details of the dendrogram, give
interesting information. To name just one: the C-C bonds of the three-
and four-membered carbocycles are found to be rather closely related
to the carbon-halogen bonds (in both cases carbenium ions can be
obtained, either directly as with the C-Hal bond, or after attack of
an electrophile (H+, Lewis acid) as both with halocarbons and with
cyclopropanes and cyclobutanes).

Figure 7. The two choices for heterolysis of the carbonyl double
bond, and their representation as points in Figure 6. Point 2:
C(-)-O(+), against the bond polarity; point 3: C(+)-O(-), in the
direction of the bond polarity.

Table II. List of compounds used in deriving a reactivity function

1) cyclopropane
2) cyclobutane
3) cyclopentene
4) cyclopentadiene
5) ethyl bromide
6) ethyl iodide
7) methylene chloride
8) allyl chloride
9) neopentyl chloride
10) 1-methyl-1-cyclopropyl-ethyl bromide
11) 1-methyl-1-cyclobutyl-ethyl iodide
12) 2,2,4,4-tetramethylcyclobutanol
13) acetaldehyde
14) acetone
15) trimethylacetaldehyde hydrate
16) chloral hydrate
17) aldol
18) methyl propionate
19) ethyl acetoacetate
20) α-chloropropionic acid
21) 5-hydroxy-nona-3,5,8-triene-2-one
22) 2-oxocyclopentane carboxylic acid
23) 5-hydroxy-5-methyl-butyrolactone
24) 1-dimethylamino-propene
25) 4-amino-2,4-dimethyl-2-pentanol
26) succinimide
27) α-picoline
28) 6-chloro-6-methoxy-bicyclo[3.1.0]hex-2-

Figure 8. Dendrogram of the relationship between the various
bonds on heterolysis as obtained by a cluster analysis (A = acceptor,
D = donor).
Next, supervised-learning pattern recognition methods were
applied to the data set. The 111 bonds from these 28 molecules were
classified as either breakable (36) or non-breakable (75), and a
stepwise discriminant analysis showed that three variables, out of the
six mentioned above, were particularly significant: resonance effect,
R, bond polarity, Q_σ, and bond dissociation energy, BDE. With these
three variables 97.3% of the non-breakable bonds, and 86.1% of the
breakable bonds, could be correctly classified. This says that
chemical reactivity as given by the ease of heterolysis of a bond is
well defined in the space determined by just those three parameters.
The same conclusion can be drawn from the results of a k-nearest
neighbor analysis: with k assuming any value between one and ten, 87
to 92% of the bonds could be correctly classified.
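A k-nearest neighbor classification of bonds in a three-coordinate
space can be sketched as follows. The validation scheme behind the
percentages quoted above is not stated in the chapter, so the
leave-one-out estimate used here is an assumption, as are the function
names and the synthetic coordinates.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Label `query` by majority vote among its k nearest samples (Euclidean)."""
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, labels, k):
    """Fraction of samples classified correctly when each is held out in turn."""
    hits = 0
    for i in range(len(samples)):
        rest = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_classify(samples[i], rest, rest_labels, k) == labels[i]
    return hits / len(samples)


# Synthetic, pre-scaled (R, Q_sigma, BDE) coordinates; not data from the chapter.
samples = [
    (1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.8, 1.1, 0.9), (1.1, 1.2, 1.0),
    (5.0, 5.0, 5.0), (5.2, 4.9, 5.1), (4.8, 5.1, 4.9), (5.1, 5.2, 5.0),
]
labels = ["breakable"] * 4 + ["non-breakable"] * 4
accuracy = leave_one_out_accuracy(samples, labels, k=3)
```

Because the distance is Euclidean, the three coordinates should be
brought to comparable scales before the method is applied to real
parameter values.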
One method that we have found particularly useful for our
purposes is logistic regression analysis (LoRA). In this method, a
binary classification is taken as a probability, P_0 (given the value
0 or 1), and modelled by the two coupled Equations 5 and 6.


P = 1 / (1 + e^{-f})    (5)
f = c_0 + c_1 x_1 + c_2 x_2 + ...    (6)

In the linear function f, the x_i are the parameters considered
relevant to the problem. The coefficients c_i are determined so that
the calculated probability P fits the initial classification P_0 as
closely as possible.

Applied to the problem of chemical reactivity, the method
translates into the following. A data set of molecules is chosen and
bonds in these molecules are selected and specified as either
breakable or non-breakable (P_0 = 0 or 1). Then, the physicochemical
parameters deemed important for the reactivity of the bonds under
investigation are calculated and used as variables x_i in Equation 6.
LoRA is applied to model the initial classification of bonds into
breakable or non-breakable classes.
In this process, a function f is obtained that can be used as a
numerical estimate for the ease of breaking of a bond. We therefore
call it a reactivity function. The all-important point is that
through LoRA the qualitative information of whether a bond is
breakable or not is used to construct a function that predicts
chemical reactivity quantitatively.
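The procedure just described, turning 0/1 labels into a continuous
function f, can be sketched with a small logistic regression fit. The
chapter does not specify the fitting algorithm; plain gradient ascent
on the log-likelihood is an assumption made here for brevity, and the
descriptors, labels, and function names are illustrative only.

```python
import math


def sigmoid(f):
    """Equation 5: P = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + math.exp(-f))


def linear_f(coeffs, x):
    """Equation 6: f = c0 + c1*x1 + c2*x2 + ..."""
    return coeffs[0] + sum(c * xi for c, xi in zip(coeffs[1:], x))


def fit_logistic(rows, classes, rate=0.1, steps=5000):
    """Fit the coefficients by gradient ascent on the log-likelihood."""
    coeffs = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(coeffs)
        for x, p0 in zip(rows, classes):
            err = p0 - sigmoid(linear_f(coeffs, x))  # label minus prediction
            grad[0] += err
            for i, xi in enumerate(x, start=1):
                grad[i] += err * xi
        coeffs = [c + rate * g / len(rows) for c, g in zip(coeffs, grad)]
    return coeffs


# Synthetic two-descriptor data; 1 = breakable, 0 = non-breakable (placeholders).
rows = [(0.2, 1.0), (0.4, 0.8), (0.5, 1.2), (0.8, 0.9),
        (0.9, 0.3), (1.1, 0.5), (1.3, 0.2), (0.7, 0.7)]
classes = [0, 0, 0, 0, 1, 1, 1, 1]
coeffs = fit_logistic(rows, classes)
p_reactive = sigmoid(linear_f(coeffs, (1.3, 0.2)))
p_inert = sigmoid(linear_f(coeffs, (0.2, 1.0)))
```

The value of `linear_f` for a new bond is then the quantitative
estimate, while the labels used for fitting were purely qualitative.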
A reactivity function (Equation 7) applicable to single bonds in
aliphatic species was obtained with the data set of 111 bonds from
the 28 molecules mentioned above.

f = 2.87 + 0.162 R + 32.9 Q_σ - 0.084 BDE    (7)
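Evaluating Equation 7 for a given bond is a one-line calculation, and
Equation 5 converts the result into a value between 0 and 1. The
coefficients below are those printed in Equation 7; the function names
and the input values are hypothetical, and the units of R, Q_σ, and
BDE are not stated in this section, so the inputs are placeholders
only.

```python
import math


def reactivity_function(resonance, polarity, bde):
    """Equation 7: f = 2.87 + 0.162*R + 32.9*Q_sigma - 0.084*BDE."""
    return 2.87 + 0.162 * resonance + 32.9 * polarity - 0.084 * bde


def breaking_probability(f):
    """Equation 5: P = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + math.exp(-f))


# Placeholder inputs, chosen only to show the arithmetic.
f_value = reactivity_function(resonance=10.0, polarity=0.1, bde=80.0)
p_value = breaking_probability(f_value)
```

A higher f means a bond that is easier to break heterolytically; the
signs of the coefficients show that resonance stabilization and bond
polarity raise f, while a larger bond dissociation energy lowers it.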
In a similar manner, a function quantifying the reactivity of
bonds in charged species was developed. These functions are of quite
general validity. The numerical values calculated with them permit
prediction of which bonds and combinations of bonds will react
preferentially. Inferences on the course of complex organic reactions
can be drawn from this information.
As an example: What will be the product of heating
1,2:1,4-diepoxy-p-menthane, 1 (Figure 9), with alumina in toluene? A
chemist would assume initial breaking of an epoxide ring. But which
one of the two? Or will both break? Furthermore, for each epoxide ring
there are two possible choices of C-O bonds.
Figure 10 shows the sequence of bond breaking obtained by
application of the reactivity function for neutral aliphatic molecules
and the one for charged species. The consecutive bond breakings that
are explored lead to the conclusion that the pattern of breaking and
making bonds as indicated in structure 2 should be the most favored
one. Thus, it is predicted that both oxirane rings are broken, one
even in the direction leading to the seemingly less stable carbenium
ion. Furthermore, even a bond in the saturated six-membered ring is
found to be breakable. The mechanistic pattern of structure 2 permits
the inference that compound 3 is the most likely product of this
reaction. This is indeed the observed course and product of the
rearrangement of 1 (28).
Examples of other cases of prediction of complex organic
reactions have been given elsewhere (3). Functions applicable to the
reactivity of multiple bonds and of aromatic systems have been
developed in an analogous manner.

Conclusion. It has been demonstrated that the methods developed for
the calculation of physicochemical effects can form the foundation
for a general quantitative treatment of chemical reactivity. Based on
the factors calculated with these various methods, reactivity
functions can be elaborated that are able to assign a numerical
reactivity to bonds and combinations of bonds in a molecule. In this
manner the course and outcome of organic reactions can be predicted. A
quantitative treatment of chemical reactivity is also an essential
component in synthesis design since it allows evaluation of the
feasibility of various synthetic reactions and pathways.
The knowledge base of that part of the EROS system that predicts
chemical reactivity consists of the procedures for calculating the
physicochemical effects and the way in which they are connected.
These methods can be part of a series connection (Figure 3) or of a
parallel connection (Equation 4). In other words, the knowledge base
consists of the chemical models that form the building blocks and the
statistical models that form the network of connections.
As the chemical models mentioned here refer to some fundamental
thermochemical and electronic effects of molecules, their application
is not restricted to the prediction of chemical reactivity data. In
fact, in the development of the models extensive comparisons were
made with physical data, and thus such data can also be predicted
from our models. Furthermore, some of the mechanisms responsible for
binding substrates to receptors are, naturally enough, founded on
electronic effects quite similar to those responsible for chemical
reactivity. This suggests the use of the models developed here to
calculate parameters for quantitative structure-activity relationships
(QSAR).


Figure 9. Example of a problem for reaction prediction.

Figure 10. Network of bond-breaking and -making patterns
explored by the reactivity functions leading to the correct
prediction of product 3 from 1.


In this sense, expert systems for the prediction of chemical
reactions, for the design of organic syntheses, for the prediction of
physical data, for structure elucidation, and for QSAR can be founded
on the knowledge base comprised by the models presented here.

Acknowledgments

Support of this work by the Deutsche Forschungsgemeinschaft and by
Imperial Chemical Industries, plc, United Kingdom, is gratefully
appreciated.

Literature Cited

1. Gasteiger, J.; Jochum, C. Topics Curr. Chem. 1978, 74, 93.
2. Gasteiger, J. Chim. Ind. (Milan) 1982, 64, 714.
3. Gasteiger, J.; Hutchings, M.G.; Christoph, B.; Gann, L.;
   Hiller, C.; Löw, P.; Marsili, M.; Saller, H.; Yuki, K. Topics
   Curr. Chem., submitted.
4. Benson, S.W.; Buss, J.H. J. Chem. Phys. 1958, 29, 546.
5. Allen, T.L. J. Chem. Phys. 1959, 31, 1039.
6. Gasteiger, J.; Jacob, P.; Strauss, U. Tetrahedron 1979, 35, 139.
7. Gasteiger, J.; Dammer, O. Tetrahedron 1978, 34, 2939.
8. Gasteiger, J. Tetrahedron 1979, 35, 1419.
9. Gasteiger, J. Comput. Chem. 1978, 2, 85.
10. Gann, L.; Löw, P.; Yuki, K.; Gasteiger, J., unpublished results.
11. a) McMillen, D.F.; Golden, D.M. Ann. Rev. Phys. Chem. 1982, 33,
    493.
    b) Egger, K.W.; Cocks, A.T. Helv. Chim. Acta 1973, 56, 1516.
12. Miller, K.J.; Savchik, J.A. J. Am. Chem. Soc. 1979, 101, 7206.
13. Gasteiger, J.; Hutchings, M.G. Tetrahedron Lett. 1983, 24, 2537;
    J. Chem. Soc., Perkin Trans. 2 1984, 559.
14. Kang, Y.K.; Jhon, M.S. Theor. Chim. Acta 1982, 61, 41.
15. Löw, P.; Gasteiger, J., unpublished results.
16. Gasteiger, J.; Marsili, M. Tetrahedron Lett. 1978, 3181;
    Tetrahedron 1980, 36, 3219.
17. Mulliken, R.S. J. Chem. Phys. 1934, 2, 782.
18. Gasteiger, J.; Guillen, M.D. J. Chem. Res. (S) 1983, 304; (M)
    1983, 2611.
19. Gasteiger, J.; Marsili, M. Org. Magn. Resonance 1981, 15, 353.
20. Guillen, M.D.; Gasteiger, J. Tetrahedron 1983, 39, 1331.
21. Marsili, M.; Gasteiger, J. Croat. Chem. Acta 1980, 53, 601.
22. Gasteiger, J.; Saller, H. Angew. Chem. 1985, 97, 699; Angew.
    Chem. Int. Ed. Engl. 1985, 24, 687.
23. Hutchings, M.G.; Gasteiger, J. Tetrahedron Lett. 1983, 24, 2541.
24. Kreevoy, M.M.; Taft, R.W. J. Am. Chem. Soc. 1955, 77, 5590.
25. Gasteiger, J.; Hutchings, M.G. J. Am. Chem. Soc. 1984, 106, 6489.
26. Hutchings, M.G.; Gasteiger, J. J. Chem. Soc., Perkin Trans. 2,
    in press.
27. Hutchings, M.G.; Gasteiger, J. J. Chem. Soc., Perkin Trans. 2,
    in press.
28. Ho, T.L.; Stark, C.J. Liebigs Ann. Chem. 1983, 1446.

RECEIVED December 17, 1985

22

An Expert System for High Performance Liquid
Chromatography Methods Development

René Bach (1), Joe Karnicky (1), and Seth Abbott (2)
(1) Varian Research Center, Varian Associates, Inc., Palo Alto, CA 94303
(2) Varian Instrument Group, Varian Associates, Inc., Walnut Creek, CA 94598
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch022

ECAT (Expert Chromatographic Assistance Team) is an
expert system being developed at Varian Associates.
The goal of our project is to create a computer
program that performs, at the human expert level, the
tasks of designing, analyzing, optimizing, and
trouble-shooting a high performance liquid
chromatography (HPLC) separation method. The program is
successfully reaching conclusions relating to a number
of probes that test the design and trouble-shooting
capabilities. This paper describes the development of
ECAT in terms of the overall strategy of the program,
the hardware and software used, and the development of
the knowledge bases. Current results and future plans
are discussed.

The goal of our current research is to apply Artificial
Intelligence (AI) techniques to the writing of an expert system for
High Performance Liquid Chromatography (HPLC) methods development;
that is, to produce a computer program capable of developing HPLC
separation methods in a manner comparable to that of an expert
chromatographer. The expert system program is named ECAT (an
acronym for Expert Chromatographic Assistance Team).
Creating a machine chromatographer is a highly ambitious
goal. Because it will involve a very large effort to complete the
ECAT program as envisioned, we are developing the system as a set
of (eventually interacting) modules whose functionality can be
separately specified and implemented.
Once one has built or acquired an expert system shell, an
expert system is usable and useful at an early stage of
development. Subsequent development consists of increasing and
refining the knowledge, expanding the functionality, and improving
the efficiency of the system.
The reader of this paper should be aware that the overall
design of ECAT (described in the sections on SYSTEM DESIGN and
FUTURE WORK) has only been implemented to the extent of the
running modules that are described under CURRENT STATUS.
0097-6156/ 86/0306-0278$06.00/0
© 1986 American Chemical Society


Background

The original design for ECAT was described in an article by Dessy
(1). The development of the system to date has deviated considerably
from this plan both in implementation methodology and rate of
progress. This is primarily attributable to the fact that the
skill and experience of our group in applying AI programming
techniques have grown with time.

Project Motivation. Chromatography in general, and methods
development in particular, exhibit characteristics which indicate
that writing an expert system is worthwhile: while chromatography
is used by a large and diverse technical group (e.g., biologists,
engineers), the number of skilled chromatographers is limited.
HPLC is characterized by a dynamic, expanding knowledge base,
which should benefit from a systematic reorganization of knowledge
in a common repository (the expert system). Writing an expert
system for HPLC would make available to users of chromatographic
techniques an automatic, reliable, and fast application of existing
chromatographic expertise. It could communicate this expertise in
an instructional manner, and provide for the convenient
construction and manipulation of data (rule) bases containing
structured representations of chromatographic knowledge.
AI research in the last decade has demonstrated that it is
possible to capture and apply the human expertise related to a
specialized field by means of an expert system computer program.

Limitations of Conventional Programming. It is clear that truly
intelligent and comprehensive methods development is sufficiently
complex to be beyond what a conventional computer program can
manage. Conventional programming methods are inadequate because
of the difficulty of writing, and subsequently debugging and
modifying, a procedural algorithm which could perform the complex
task of HPLC method development. In addition, conventional
programming methods do not efficiently support the ability to
represent and manipulate information which is non-numeric,
judgmental, uncertain, and incomplete. Research in AI over the last
two decades has yielded programming languages and programming methods
for writing expert systems which do not suffer from the above
limitations. We have applied some of these methods, described
below, to create ECAT.

Expert System Programming. Many of the concepts and terms which
will be used in the description of this work are unique to the
fields of AI and computer science. The reader should refer to the
article by Dessy (2) or to the introductory article of this
symposium for a more detailed description of these concepts.
There is some disagreement within the AI community as to what
qualifies a computer program to be called an "expert system". We
use the term to describe a program which has the following
characteristics: 1) The program performs some task (e.g., HPLC methods
design) which requires specialized human expertise. This human
expertise often takes the form of heuristics (empirical rules of
thumb); 2) The "domain knowledge" (i.e., the knowledge in the
program specific to the task at hand, for example, methods design)
is explicitly encoded in a computer-readable form and is segregated
from the mechanisms for its application (collectively called
the "inference engine"); 3) The system has the potential for
explaining its reasoning; 4) The amount of knowledge encoded in
the system is non-trivial (i.e., for a rule-based system there are
hundreds or thousands of rules).
Figure 1 illustrates the elements and individuals involved in
developing and using the ECAT expert system. The domain knowledge
(including heuristic knowledge) is elicited from the domain expert
by the knowledge engineer, who uses software tools to convert the
knowledge into computer-processable form (i.e., facts and rules in
knowledge bases). An individual uses the program by communicating
with the system via the user interface. In response to the user's
requests, the inference machinery makes logical deductions and
performs tasks by processing the appropriate knowledge base.
Results are communicated back to the user via the user interface.
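The separation between a knowledge base of facts and rules and a
generic inference mechanism can be illustrated with a very small
forward-chaining loop. This sketch is not the MRS inference engine
used by ECAT, and the rules and fact names below are hypothetical
placeholders invented for illustration, not rules from the ECAT
knowledge bases.

```python
def forward_chain(facts, rules):
    """Fire rules whose premises all hold until no new fact can be derived.

    `rules` is a list of (premises, conclusion) pairs; the engine knows
    nothing about chromatography, only how to apply such pairs.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical knowledge base (illustrative only).
rules = [
    (("suspect_pump", "air_in_solvent_line"), "advise_degas_and_prime"),
    (("pressure_fluctuates",), "suspect_pump"),
]
observed = {"pressure_fluctuates", "air_in_solvent_line"}
conclusions = forward_chain(observed, rules)
```

Because the rules are plain data, a knowledge engineer can add or
revise them without touching the loop that applies them, which is the
design point made in characteristic 2) above.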

Related Work

Algorithmic Methods Development. Recently developed
statistically-based HPLC solvent optimization computer programs (3-9)
have achieved useful behavior in experimental design by optimizing
separations with respect to specific performance criteria. However,
AI programming techniques were not applied in these programs.

Expert Systems for Chemistry. At this time, there are a very
large number of expert systems for chemistry in various stages of
development. (See, for example, some of the other papers in this
symposium.) Some of the more successful systems developed in the
past are listed here. The DENDRAL series of programs from Stanford
includes DENDRAL (started in 1965), CONGEN, and META-DENDRAL
(10-11). These programs elucidate chemical structures from
mass-spectral information. Similar programs have been used to
computerize C-13 NMR spectral analysis (12-13). Most recently, the
PROTEAN project aims at computing the three-dimensional structure
of proteins in solution using NMR data (14). The CRYSALIS program
interprets a three-dimensional image of the electron density map
obtained by X-ray crystallography of proteins (15). SYNCHEM and
SYNCHEM2 (16-17), LHASA (18), and SECS (19) are examples of
computerized or computer-assisted organic synthesis.

System Strategy

The goal of ECAT is to provide assistance to the user of a
chromatograph in the development of an HPLC method. To do this, one
must specify the tasks performed in developing an analytical
method. The computer performs these tasks by processing
information. In ECAT we are calling the collection of information
specific to a task a Module. The modules and information flow which
will be needed for the completely implemented ECAT are shown in
Figure 2.


Figure 1. Elements involved in development and application of
the ECAT expert system. (Diagram labels: domain expert, knowledge
engineer, user, user interface, knowledge base construction aids,
CMP knowledge base, COLDIAG knowledge base, MRS inference engine
with forward chaining, backward chaining and meta level, ECAT
shell.)


Figure 2. ECAT task modules: flow of information in the method
design process. (Surviving diagram labels: sample and matrix
information; decide on sample cleanup, Module 4; optimize the
separation, Module 5; diagnose hardware faults, Module 6;
optimized separation.)


The complete ECAT system as envisioned in Figure 2 will take
as input the user's specification of the sample to be analyzed
(analytes, matrix) and will ultimately produce a separation method
that satisfies the user's requirements for resolution and analysis
time.

The strategy adopted is that the system will first decide
whether Gas Chromatography (GC) or Liquid Chromatography (LC) is
the best separation method. If LC is the method of choice, and a
qualified separation is not found in the program's library, it
will design and optimize a separation, also specifying pre-column
sample treatment where applicable. The design of this separation
will ultimately include cycles of designing an initial separation,
performing the experiment, analyzing the results, and redesigning
until a satisfactory separation is achieved. During the
optimization step it may be necessary to diagnose column and
hardware failure.

The chemical information which the program will need will be
stored in data bases or input by the user. The details of these
modules are discussed in the sections on current results or future
plans.

To summarize, a complete methods development program must be
able to:
1. provide chemical information,
2. choose between GC and LC,
3. specify column, mobile phase constituents and detector,
4. decide on sample cleanup,
5. optimize (or redesign) the separation,
6. diagnose hardware problems.
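
The strategy above (choose GC or LC, consult the library, then
cycle through design, experiment, analysis and redesign) can be
sketched as a control loop. This is a minimal Python sketch, not
ECAT's implementation, which is written in Lisp on top of MRS.
Every function name, parameter and the toy resolution model below
is hypothetical and serves only to make the loop concrete.

```python
def develop_method(sample, choose_technique, lookup_library, design_initial,
                   run_experiment, is_satisfactory, redesign, max_cycles=10):
    """Run one method-development session and describe its outcome.

    All callables are supplied by the caller; they stand in for the
    task modules and are assumptions of this sketch.
    """
    # Decide first whether the sample is better handled by GC.
    if choose_technique(sample) == "GC":
        return {"technique": "GC", "method": None}

    # Prefer a qualified method from the library when one exists.
    method = lookup_library(sample)
    if method is not None:
        return {"technique": "LC", "method": method, "source": "library"}

    # Otherwise design, test, and redesign until satisfactory.
    method = design_initial(sample)
    for cycle in range(1, max_cycles + 1):
        result = run_experiment(method)
        if is_satisfactory(result):
            return {"technique": "LC", "method": method,
                    "source": "designed", "cycles": cycle}
        method = redesign(method, result)
    return {"technique": "LC", "method": None,
            "source": "unresolved", "cycles": max_cycles}


# Toy usage: resolution improves as the organic fraction is lowered.
outcome = develop_method(
    sample={"analytes": ["phenol"], "matrix": "river-water"},
    choose_technique=lambda s: "LC",
    lookup_library=lambda s: None,
    design_initial=lambda s: {"percent_organic": 60},
    run_experiment=lambda m: {"resolution": (70 - m["percent_organic"]) / 10},
    is_satisfactory=lambda r: r["resolution"] >= 1.5,
    redesign=lambda m, r: {"percent_organic": m["percent_organic"] - 5},
)
print(outcome)
```

The cap on the number of cycles is an assumption of the sketch;
the text only says that redesign continues until a satisfactory
separation is achieved.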

Implementation

Our program is being implemented as a knowledge-based system. The
knowledge about chromatography which is embedded in the program is
in the form of facts and rules. These facts and rules are
represented within the computer by statements in predicate logic.
In predicate logic a statement is represented by a list of symbols,
where the first symbol (the predicate) represents a relationship
among the objects which are represented by the other symbols in
the list. Complex facts are expressed using what are called
"logical connectives" (e.g., AND, OR, NOT, IF). We distinguish
statements starting with IF and call them rules. A rule is also
referred to as an IF-THEN statement. A rule asserts that the
statements in the left hand side imply the statements in the right
hand side. An inference engine is used to interpret those rules
to generate new facts or to answer questions.

Figure 3 shows examples of some ECAT facts and rules. A
rule's components are a name, a type declaration, an English
language description of the rule, an English language translation
of the rule, and the actual form that is processed by the program
during inferencing. Variables are bound during inferencing.
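
To make the idea of "a list of symbols whose first symbol is the
predicate" and of variable binding concrete, here is a minimal
Python sketch. It is an illustration only: it uses tuples in place
of Lisp lists, treats any string starting with "$" as a variable
(following the Figure 3 convention), and performs one-way matching
of a pattern against a ground fact rather than the full unification
an inference engine such as MRS would use. The function names are
hypothetical.

```python
def is_variable(term):
    """A term is a variable if it is a string beginning with '$'."""
    return isinstance(term, str) and term.startswith("$")


def match(pattern, fact, bindings=None):
    """Match a pattern against a ground fact.

    Returns a dict of variable bindings on success, or None on failure.
    The caller's bindings are never modified.
    """
    bindings = dict(bindings or {})
    if is_variable(pattern):
        if pattern in bindings:
            # A bound variable must agree with its earlier value.
            return bindings if bindings[pattern] == fact else None
        bindings[pattern] = fact
        return bindings
    if isinstance(pattern, tuple) and isinstance(fact, tuple):
        if len(pattern) != len(fact):
            return None
        for sub_pattern, sub_fact in zip(pattern, fact):
            bindings = match(sub_pattern, sub_fact, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == fact else None


# The fact "user1" of Figure 3 and the first premise of rule "cmpgen1".
fact = ("largest-mw", 500, "daltons")
premise = ("largest-mw", "$mw", "daltons")
print(match(premise, fact))
print(match(("analyte-class", "protein"), ("analyte-class", "phenols")))
```

The first call binds $mw to 500; the second fails because the
constant symbols differ.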

Development Environment. Hardware: the hardware currently
consists of a Symbolics 3670 workstation, a Symbolics 3640
workstation, and a VAX 750, all connected by Chaosnet (an Ethernet
protocol). Software: the Symbolics machines run Zetalisp, and we
are using Franzlisp in Eunice (a UNIX emulator running under VMS)
on the VAX (see Figure 4).

    (user1
      (type fact)
      (pform (largest-mw 500 daltons)))

    (user2
      (type fact)
      (pform (analyte-class phenols)))

    (user3
      (type fact)
      (pform (asked (analyte-class $class))))

    (cmpgen1
      (descr nil)
      (type rule)
      (text (if sample molecular-weight is > 100 then there are more
             than three carbons in the molecule))
      (pform (if (and (largest-mw $mw daltons)
                      (> $mw 100))
              then (more-than-three-carbons))))

    (cmpgen7
      (type rule)
      (text (if the analyte class is not a protein and not a peptide,
             then use the specified analyte class for further
             inferencing))
      (pform (if (and (analyte-class $class)
                      (asked (analyte-class $class))
                      (unknown (analyte-class protein))
                      (unknown (analyte-class peptide)))
              then (consider (analyte-class $class)))))

    (cmp1
      (descr (a default rule for selecting separation mode))
      (type rule)
      (text (if the chemical class of the analyte is not a protein,
             and the analyte has more than three carbons, and the
             analyte does not belong to a class for which straight
             phase is recommended, then use a reverse phase
             separation mode))
      (pform (if (and (consider (analyte-class $class))
                      (more-than-three-carbons)
                      (unknown (consider (analyte-class protein)))
                      (unknown (straight-phase-packing $class $x $y)))
              then (separation-mode reverse-phase))))

Figure 3. Examples of facts and rules in ECAT. $... are
variables.
To develop the expert system we are using a first-order logic
programming system called MRS (20). It is a general inference
engine providing for forward chaining, backward chaining and
control of the inferencing by a meta-level reasoning system.
Reasoning at the meta-level refers to reasoning about reasoning,
that is, reasoning about what needs to be done next, or what is
the best way to solve the problem at hand.

Forward chaining is reasoning from known facts via rules to
conclusions. For example, if a user asserted the three facts
listed at the top of Figure 3 the program would conclude, by
forward chaining, that the separation mode should be reverse
phase. We use forward chaining to process the Column and Mobile
Phase (CMP) design knowledge base. Backward chaining proves given
hypotheses by testing whether the "if" parts of relevant rules are
known or provable using other rules. For example, if the program
were asked the equivalent of "What separation mode should I use?"
it could use backward chaining through the rules in Figure 3 to
infer that it should ask the user about molecular weight and
analyte classes to provide the answer to the question. We use
backward chaining for the column diagnosis. MRS runs in Zetalisp,
Maclisp and Franzlisp. We have made some modifications to the MRS
inferencing capability and provided a better user interface.
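
The forward-chaining example in the text (three user facts leading
to a reverse phase recommendation) can be reproduced with a small
Python sketch. It is not MRS and makes several simplifying
assumptions that are not in the original: facts are ground tuples,
so the fact "user3" is entered for the class phenols rather than
with the variable $class; "unknown" is treated as "no matching fact
is currently stored"; and rules are applied repeatedly in a fixed
order until nothing new is derived. All function names are
hypothetical.

```python
def match(pattern, fact, env):
    """One-way match of a pattern against a ground fact; bindings or None."""
    if isinstance(pattern, str) and pattern.startswith("$"):
        if pattern in env:
            return env if env[pattern] == fact else None
        return {**env, pattern: fact}
    if (isinstance(pattern, tuple) and isinstance(fact, tuple)
            and len(pattern) == len(fact)):
        for p, f in zip(pattern, fact):
            env = match(p, f, env)
            if env is None:
                return None
        return env
    return env if pattern == fact else None


def substitute(term, env):
    """Replace bound variables in a term by their values."""
    if isinstance(term, tuple):
        return tuple(substitute(t, env) for t in term)
    return env.get(term, term) if isinstance(term, str) else term


def satisfy(conditions, facts, env):
    """Yield each binding environment that satisfies all conditions."""
    if not conditions:
        yield env
        return
    first, rest = conditions[0], conditions[1:]
    if first[0] == "unknown":      # holds when no stored fact matches
        if not any(match(first[1], f, env) is not None for f in facts):
            yield from satisfy(rest, facts, env)
    elif first[0] == ">":          # numeric test on a bound variable
        if substitute(first[1], env) > first[2]:
            yield from satisfy(rest, facts, env)
    else:
        for f in facts:
            new_env = match(first, f, env)
            if new_env is not None:
                yield from satisfy(rest, facts, new_env)


def forward_chain(rules, facts):
    """Apply rules until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            derived = {substitute(conclusion, env)
                       for env in satisfy(conditions, facts, {})}
            if not derived <= facts:
                facts |= derived
                changed = True
    return facts


# Rules cmpgen1, cmpgen7 and cmp1 of Figure 3, transcribed as tuples.
RULES = [
    ([("largest-mw", "$mw", "daltons"), (">", "$mw", 100)],
     ("more-than-three-carbons",)),
    ([("analyte-class", "$class"), ("asked", ("analyte-class", "$class")),
      ("unknown", ("analyte-class", "protein")),
      ("unknown", ("analyte-class", "peptide"))],
     ("consider", ("analyte-class", "$class"))),
    ([("consider", ("analyte-class", "$class")), ("more-than-three-carbons",),
      ("unknown", ("consider", ("analyte-class", "protein"))),
      ("unknown", ("straight-phase-packing", "$class", "$x", "$y"))],
     ("separation-mode", "reverse-phase")),
]

USER_FACTS = [
    ("largest-mw", 500, "daltons"),
    ("analyte-class", "phenols"),
    ("asked", ("analyte-class", "phenols")),   # ground form of user3
]

conclusions = forward_chain(RULES, USER_FACTS)
print(("separation-mode", "reverse-phase") in conclusions)
```

Because "unknown" depends on what has not yet been derived, the
result of such a loop can in general depend on rule order; for
these three rules the order does not matter.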
We selected MRS for the following reasons: The domain
expertise of the column troubleshooting and of the CMP design is
readily expressed in IF-THEN rules that MRS is designed to handle.
Previous users of MRS had indicated that it was a versatile tool
for reasoning with various forms of domain expertise and that the
meta level reasoning could be used to solve particularly difficult
problems. MRS doesn't require, although it runs well on,
specialized hardware such as a Lisp machine supporting high
resolution graphics. Because the source code is provided, it is
easy to write extensions to MRS directly in Lisp (such as the user
interface). Finally, since MRS is academic software, it is
inexpensive.
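
The backward-chaining behavior described earlier for MRS, in which
a question about the separation mode is traced back to the items
the user must supply, can be illustrated with a reduced Python
sketch. It is a deliberate simplification and not how MRS works
internally: each rule of Figure 3 is collapsed to its predicate
names only, variables and the "unknown" and "asked" conditions are
ignored, and any predicate that no rule concludes is assumed to be
something the user can be asked. The names RULES and questions_for
are hypothetical.

```python
# Conclusion predicate -> premise predicates, abstracted from Figure 3.
RULES = {
    "separation-mode": ["consider", "more-than-three-carbons"],  # cmp1
    "consider": ["analyte-class"],                               # cmpgen7
    "more-than-three-carbons": ["largest-mw"],                   # cmpgen1
}


def questions_for(goal, rules, seen=None):
    """Work backward from a goal to the predicates the user must supply."""
    seen = set() if seen is None else seen
    if goal in seen:               # already explored; avoid loops
        return []
    seen.add(goal)
    if goal not in rules:          # no rule concludes it: ask the user
        return [goal]
    needed = []
    for premise in rules[goal]:
        for question in questions_for(premise, rules, seen):
            if question not in needed:
                needed.append(question)
    return needed


print(questions_for("separation-mode", RULES))
```

For the goal "separation-mode" the sketch arrives at the analyte
class and the largest molecular weight, which are the two items the
text says the program would ask about.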

Results

Development of ECAT Knowledge Bases. The extent of an expert's
domain knowledge typically exceeds that which he or she realizes,
or is capable of immediately articulating. Our experience has
shown that an expert asked to begin with a "tabula rasa" and
perform an instantaneous brain dump of domain knowledge will yield
only a small portion of that knowledge. The technique we are
using to facilitate transfer of human expertise to the expert
system program involves an iterative process which incrementally
improves program functionality. Incorrect or incomplete
conclusions reached by the program are presented to the human
expert who is asked to provide the information necessary for the
program to yield the expert's recommended solution. This process is
repeated, expanding the knowledge base and hence the frequency with
which successful problem solving occurs.


This strategy is illustrated by our construction of the
knowledge base of Module 3 of the ECAT program. Module 3
specifies the HPLC analytical column, mobile phase constituents, and
detector to be used. The design problem given to the program is
termed a "sample probe". A sample probe consists of a
specification of a user's sample (input) and the recommendations
which the program SHOULD compute (output). Sample probes are
prepared by colleagues outside the program (i.e., by
chromatographers other than the domain expert) by selection from new
separations appearing in refereed chromatographic journals, and
from standard, qualified HPLC methods.

First Rules. The first probe tested was the trace analysis of
phenols in wastewater. At this point, the knowledge base
contained no rules and thus no answer was given as to column, mobile
phase or detector specification. The expert stated that the
separation should be run in a reverse phase mode on a C18-silica
column, with a water-acetonitrile mobile phase containing 0.1%
acetic acid as a competing acid additive (to reduce peak tailing
of weakly acidic phenols). At this point, the expert was asked to
explicate his reasoning as a series of rules which concluded the
correct design recommendation. Nine rules were specified.

It should be noted here that in specifying the rules for the
first probe (phenols), it became clear that rules for choosing the
column and mobile phase interact significantly with detector
rules. 0.1% acetic acid works well as a competing acid additive
in terms of chromatography of the phenols. However, carboxylate
ions are known to quench the fluorescence of phenols. Thus, if
one were to use a fluorescence detector for trace phenol
detection, an alternative competing acid, such as 0.1% phosphoric
acid, should be substituted. It was decided that mobile
phase/detector interaction rules would be the first detector rules
to be added to the knowledge base.

More Rules. Figure 5 tracks the number of IF/THEN rules added to
the knowledge base to specify column and mobile phase
constituents. Detector rules other than those relating to mobile
phase compatibility were not entered at this time. As the knowledge
base expanded, subsequent probes of similar molecular structure
(and hence similar chromatographic properties) were solved with
addition of few or no rules. Solution without requiring
incrementing of the knowledge base is termed a "direct hit." Spikes
in the graph of Figure 5 occur for new sample probes having major
structural differences from those already tested. For example,
the sample probe "LDH isoenzymes" required special rules regarding
protein chromatography.
Figure 4. Hardware and software for the development of ECAT.
(Surviving diagram labels: Chaosnet, telephone lines.)

Figure 5. Development of knowledge base rules to select column
and mobile phase constituents. * indicates "direct hit". (Plot
against probe number for the 15 sample probes listed in Table I;
annotation: average of rules used to solve a probe is ca. 15.)

It should also be noted that new sample probes can generate
additions to the sample information queries asked of the user at
the beginning of the "probe session." Thus, protein probes
required the addition of queries regarding molecular weight,
isoelectric point and whether biological activity is to be preserved
in the chromatographic step. These questions are triggered only
if the user specifies the sample as a peptide or protein in answer
to the initial sample questions. Also, once the sample is
specified as a protein, the question as to the pKa or pKb of the
sample will not be asked, and the isoelectric point is requested
instead. Thus the initial query session is "intelligent" in the
sense that the questions asked are specific to the sample probe
itself.
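
The branching of the query session can be pictured with a very
small Python sketch. It covers only the switch described in the
text; the question strings are placeholders, and treating peptides
exactly like proteins when skipping the pKa/pKb question is an
assumption of the sketch, since the text states that rule only for
proteins.

```python
def follow_up_questions(analyte_class):
    """Return the follow-up questions for one analyte class.

    Only the protein/peptide branch from the text is modeled.
    """
    if analyte_class in ("protein", "peptide"):
        return ["molecular weight",
                "isoelectric point",
                "preserve biological activity?"]
    return ["pKa or pKb"]


print(follow_up_questions("protein"))
print(follow_up_questions("phenols"))
```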
The next step in the development of this knowledge base will
be to subject it to probe input by chromatographers in the Varian
HPLC Applications laboratory. For each external probe which is
answered with an incorrect or incomplete answer by the ECAT
program, an interrogation of the probe creator by the ECAT domain
expert will generate additional rules to be conveyed to the
knowledge engineers. Thus the knowledge base will be incremented.

Automatic Testing. As the knowledge base expands, the need to
check each new rule for consistency with the existing rule set
becomes critical. This is done automatically. A program subjects
the file of previous sample probes to the expanded knowledge base
and checks to see if previous solutions are unaffected. If
previous solutions have been affected, one must proceed to debug the
new additions to the knowledge base. This often requires
rewriting some rules. Sometimes it provokes rethinking and
reformulating part of the knowledge base.
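
The automatic check amounts to re-running every stored probe and
comparing the new answer with the accepted one. A minimal Python
sketch follows; the probe contents, the expected recommendations
and the two toy "knowledge base" functions are hypothetical and are
not taken from ECAT's probe file.

```python
def regression_failures(probes, solve):
    """Return the names of stored probes whose recommendation changed."""
    failures = []
    for name, (sample, expected) in probes.items():
        if solve(sample) != expected:
            failures.append(name)
    return failures


# Hypothetical probe file: name -> (sample description, accepted answer).
PROBES = {
    "phenols": ({"analyte-class": "phenols"}, "reverse-phase"),
    "LDH isoenzymes": ({"analyte-class": "protein"}, "ion-exchange"),
}


def solve_v1(sample):
    """Toy knowledge base that still handles the protein case."""
    if sample["analyte-class"] == "protein":
        return "ion-exchange"
    return "reverse-phase"


def solve_v2(sample):
    """Toy revision in which the protein case was lost by mistake."""
    return "reverse-phase"


print(regression_failures(PROBES, solve_v1))
print(regression_failures(PROBES, solve_v2))
```

An empty list means the expanded knowledge base left all previous
solutions unaffected; any names returned point to the probes that
need debugging.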

Current Performance - Modules

The performance of ECAT is primarily determined by the correctness
and extent of the knowledge and data bases, that is, the modules
shown in Figure 2.

Module 3, Column and Mobile Phase Design (CMP). This is the core
module for ECAT. It can currently specify i) analytical column
and mobile phase constituents for reverse phase chromatography of
common classes of organic molecules; ii) reverse phase, ion
exchange phase and hydrophobic interaction chromatography of
proteins and peptides; iii) a limited set of specialty classes
of molecules best treated by straight phase chromatography (e.g.,
mono- and disaccharides). The rules for selection of the HPLC
detector are under development within Module 3. Some of the rules
for detector mobile phase compatibility are already encoded. A
set of rules for detector selection is ready but not yet encoded.

The program infers design parameters using data base
information from Module 1 and user-supplied information, along with
an extensive knowledge base of chromatography heuristics. Module 3
currently contains ca. 160 rules, generated to cover 15 sample
probes which represent some commonly separated classes of
compounds (see Table I). Figure 6 shows an example of the
application of ECAT to a design problem. The items in Figure 6 are
the user inputs and system recommendations in the form in which
they are actually processed and generated by the program.
Figure 7 shows part of the user consultation that elicited
the inputs listed in Figure 6. The current user interface
provides on-line help as well as a menu of numbered valid
responses. The user may either type in the number or the listed
item. In answer to the user typing "?", the system rephrases the
question, redisplays acceptable values, and specifies what other
characters are recognized. If this is not enough information, the


Table I. Sample probes used to develop knowledge base of Module 3
for specification of analytical HPLC column, and mobile phase
constituents.

Probe 1.  phenols                   moderately polar, weakly acidic
                                    molecules
Probe 2.  opium alkaloids           polar, basic nitrogen
                                    heterocycles, typical of many drugs
Probe 3.  acid extract of urine     carboxylic acids
Probe 4.  tetracyclines             molecules with significant
                                    metal-complexation character
Probe 5.  SCH 28191                 same as opium alkaloids (Probe 2)
          (experimental drug)
Probe 6.  beta-carotene             non-polar, neutral molecules
Probe 7.  LDH isoenzymes            proteins
Probe 8.  HGH tryptic digest        peptide fragments
Probe 9.  urea, thiourea            small, polar molecules
Probe 10. tricyclic                 same as opium alkaloids (Probe 2)
          antidepressants
Probe 11. avermectins               moderately polar, neutral
                                    molecules
Probe 12. cardiac drugs             same as opium alkaloids (Probe 2)
Probe 13. ibuprofen                 moderately polar carboxylic acid
Probe 14. chlorophenols,            non-fluorescent phenols
          nitrophenols              (see Probe 1)
Probe 15. testosterone steroids     complex mixture of compounds
                                    sharing same hydrocarbon backbone
                                    and differing in functional group


USER ENTRIES

    (analyte-class phenols)
    (specific-analyte phenol)
    (pka-of phenol 11)
    (largest-mw 400 daltons)
    (detector-type fluorescence)
    (smallest-analyte-amount 10 ng)
    (class-of sample-matrix river-water)

RECOMMENDATIONS

Guard column
    (additional-column guard-column)
    (packing-of guard-column pellicular)
    (packing-of guard-column silica-based)
    (diameter-of pellicular 25 micron)

Analytical column
    (separation-mode reverse-phase)
    (restrict (diameter-of particle $value micron)
              (<= $value 5))
    (packing-of $column silica-based)
    (prefer (bonded-phase $column C18)
            (bonded-phase $column C8) 0.2)

Mobile phase
    (prefer (liquid-of solventb acetonitrile)
            (liquid-of solventb methanol) 0.4)
    (liquid-of solventb methanol)
    (liquid-of solventb acetonitrile)
    (liquid-of solventa water)
    (additive-of solventb competing-acid phosphoric-acid)
    (additive-of solventa competing-acid phosphoric-acid)
    (restrict (ph-of $3 $4) (>= $4 2) (<= $4 7.5))
    (concentration-of phosphoric-acid solventb 0.1%)
    (concentration-of phosphoric-acid solventa 0.1%)

Figure 6. Example of user inputs and system recommendations for
CMP probe. $... are variables.
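
The entries of Figure 6 are parenthesized lists, so they can be
read mechanically. The following Python sketch shows one way to
turn such a line into nested tuples. It is an illustration only:
ECAT itself runs in Lisp, where the reader does this directly. The
sketch assumes well-formed input with balanced parentheses, and the
function names parse and atom are hypothetical.

```python
import re

# A token is a single parenthesis or a run of non-space, non-paren text.
TOKEN = re.compile(r"[()]|[^\s()]+")


def atom(token):
    """Convert a token to int or float when possible, else keep the text."""
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return token


def parse(text):
    """Parse one parenthesized expression into nested tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1                      # skip the closing parenthesis
            return tuple(items)
        if token == ")":
            raise ValueError("unexpected ')'")
        return atom(token)

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing input after expression")
    return result


print(parse("(pka-of phenol 11)"))
print(parse("(restrict (ph-of $3 $4) (>= $4 2) (<= $4 7.5))"))
```

Symbols such as "0.1%" are left as text because they are not valid
numbers.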


You are running the Column and Mobile Phase Selection module.

You should type ? or H any time you require help.
Some prompts require an additional <CR> to terminate input
acquisition.
Be careful not to type ahead.

Valid values:
 1. amino-acid-hydrolysate             17. oligonucleotides
 2. amino-acid-physiological-fluids    18. oligosaccharides
 3. citric-acid-cycle-acids            19. oligosaccharides
 4. diastereomers                      20. peptide
 5. carboxylic-acid                    21. phenols
 6. disaccharides                      22. phospholipids
 7. glucosamines                       23. porphyrins
 8. glycolipids                        24. porphyrins
 9. glycosphingolipids                 25. prostaglandins
10. hydroxyvitaminsd2+d3               26. protein
11. lipids                             27. sphingolipids
12. methylated-nucleosides             28. stereoisomers
13. monosaccharides                    29. steroids
14. monosaccharides!                   30. sugar-alcohols
15. nucleosides+nucleotides            31. tricarboxylic-acids
16. nucleotides                        32. other
Analyte class: phenols <CR>
Analyte class: <CR>

Valid value is a number.
phenol pKas: ? <CR>
Enter the pKa values for phenol.

Valid value is a number.
phenol pKas: 11 <CR>

Valid value is a number (unit: daltons).
Largest molecular weight: H <CR>
You are asked to enter the molecular weight of the largest
molecule you are interested in analyzing. Typical values are ranging
from the low hundreds to a few hundred thousand (in the case of
proteins).

Any of the following is a valid response: <number> unknown
Largest molecular weight: 400 <CR>

Are you using a fluorescence detector ? [y]: <CR>

Valid value is a number (unit: Nanograms)
Smallest analyte amount: 10 <CR>

Figure 7. Excerpts from a user/expert-system consultation.
Underlined items are user input. <CR> indicates user typed a
carriage return.


user can type "h" for help. The help text is not yet completely
written, but is readily extensible. "How" and "why" queries are
not yet recognized.
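
The input handling described for the user interface (a number or
the listed item is accepted, "?" rephrases the question, and "h"
asks for help) can be sketched in Python as follows. The function
name, the returned tags and the short menu are hypothetical; the
actual interface is a Lisp extension of MRS.

```python
def resolve_response(text, valid_values):
    """Classify one line of user input against a numbered menu.

    Returns a (kind, value) pair where kind is one of
    "value", "rephrase", "help" or "invalid".
    """
    text = text.strip()
    if text == "?":
        return ("rephrase", None)
    if text.lower() == "h":
        return ("help", None)
    if text.isdecimal():
        index = int(text)
        if 1 <= index <= len(valid_values):
            return ("value", valid_values[index - 1])
        return ("invalid", None)
    if text in valid_values:
        return ("value", text)
    return ("invalid", None)


MENU = ["peptide", "phenols", "protein", "other"]   # shortened menu
print(resolve_response("2", MENU))
print(resolve_response("phenols", MENU))
print(resolve_response("?", MENU))
```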

Module 6, Column Diagnosis (COLDIAG). This module uses
chromatographic parameters such as efficiency, asymmetry, retention
time, selectivity and operating pressure, to detect failures of
the column or other chromatographic hardware. Table II lists the
types of column failure which the module can currently handle.
Note that the module will also correctly diagnose some problems
which are NOT column malfunctions but which might be interpreted
as such by a user.

Table II. Types of Column Failure Diagnosed and Treated by
Module 6

Column Failures       plugged column bed or frits
                      dissolution of column bed at high pH
                      physical compression of column bed
                      hydrolytic cleavage of bonded phase
                      chemical alteration of CN bonded phase
                      reaction of NH2 bonded phase with C=O
                      deactivation of Si adsorption sites by
                        trace H2O
                      loss of packing material from column
                      irreversible adsorption of sample matrix
                        components
Non-column Failures   too large an increase in injection volume
                      inadvertent change to strongly eluting
                        injection solvent
                      inadvertent overloading of column

Module 1, Determination of Chemical and Structural Information on
the Sample. The task of Module 1 is to provide
non-chromatographic data for analytes prior to specification of the
chromatographic method. Data bases have been developed for pK
values of organic molecules, isoelectric points of proteins, and
fluorescence spectral properties of organic molecules.

Other Modules. Modules 2, 4 and 5 are currently in a design
stage.

Future Plans

Module (Knowledge Base) Development. Future development of the
ECAT system will involve, in chronological order:

1) Incrementing Module 3 (CMP) according to sample probe
testing by chromatographers not directly associated with the
project and adding detector selection rules.

2) Development of the knowledge base for Module 5
(optimization of mobile phase composition and program). The column
and mobile phase constituents having been specified by Module 3, the
knowledge base of Module 5 will be used along with inputs of
required analysis time, and desired degree of resolution to guide
optimization of the mobile phase composition and program (if
gradient elution is required). This module will require a great
deal of effort as one must write knowledge base rules to determine
which parameters (pH, ionic strength, ion-pair reagent
concentration, organic modifier concentration, etc.) to optimize by
either simplex or factorial design techniques. One must also develop
algorithms to analyze the quality of chromatographic separations
with respect to user input requirements.

3) Integration of Module 6 (diagnosis of hardware problems)
into Module 5. It is sometimes necessary to detect and
troubleshoot hardware and column failures during the optimization
step. When abnormal changes in separation parameters occur during an
optimization, this detection can halt the series of optimization
experiments and notify the user of the system failure, preventing
a useless optimization of a "broken" HPLC system. An example of
this coupling of troubleshooting to optimization would be a
situation in which one is optimizing a reverse phase separation of
opium alkaloids, with respect to the parameters pH and %
acetonitrile in the mobile phase. Suppose one has run a series of
experiments varying the pH between 2 and 3 and the concentration of
acetonitrile between 40% and 50%. Assume the efficiency has
remained between 6000-7000 plates and asymmetry between 1.2-1.3.
On the next experiment, with pH 3 and 55% acetonitrile, the
retention time decreases as expected, but the peak efficiency drops
to 500 plates, and asymmetry increases to 9.0. Simultaneously, a
slight pressure increase occurs. The column troubleshooting
module would flag the abnormal change in chromatographic
parameters occurring for a very slight change in mobile phase
character. It would then go back and repeat a previous experiment
such as pH 3 and 50% acetonitrile. If the previous efficiency cannot
be reproduced, it is certain that a malfunction has occurred. The
module could then halt the optimization, troubleshoot the fault
(collapse of column bed with formation of a void at head of
column), and recommend corrective action to the user.

4) Development of Module 4 (knowledge base for sorbent
cartridge-based sample cleanup prior to the analytical
chromatography step). The column and mobile phase constituents
specified by Module 3 will be fed into Module 4 in order to determine
the procedure for sample cleanup, isolation and elution steps
prior to the analytical chromatography step. In developing
Module 4, we will use the recently developed techniques based on
sorbent cleanup and isolation of sample analytes rather than the
classical liquid-liquid extraction techniques. This decision was
based on the ability to automate the sorbent technique by using
short chromatographic sorbent cartridges and on technical
advantages discussed in detail elsewhere (21).

5) Expansion of Module 3 to include rules for selection of
detectors and detector parameters. The rules will handle optical
absorbance and fluorescence (including pre- and post-column
derivatization) and electrochemical detection.

6) Expansion of the data bases in Module 1 to include
spectroscopic and electrochemical data to be used by the detector
selection rules of Module 3. (This would include UV absorbance
spectral properties of organic molecules, fluorescence quenching
and activating properties of solvent environments, and
electrochemical activity of organic molecules.) Additional data
which will eventually be needed in ECAT includes solubility
characteristics, boiling points, and melting points of organic
molecules.

7) Development of Module 2: Knowledge base for screening out
samples which are best done by GC, and development of a library of
standard, qualified HPLC and GC methods. The system will decide
whether the analytical demands of the separation are best served
by gas chromatography or by liquid chromatography. Information
available from Module 1 is needed (boiling and melting point data,
molecular weight), along with information on analyte levels,
matrix properties, sample complexity and resolution
requirements. As the ECAT program evolves, one might eventually
consider adding decision capability regarding other important
separation techniques, such as gel electrophoresis, to the knowledge
base of step 2. It will be quite useful here to include a "library"
of standard methods used in GC and LC. The experts must specify
those known analytes that are best separated by GC. For example,
trace analysis of volatile pesticides at sub-picogram levels is
best performed by GC, and the user should have access to the
recommended GC method before considering an LC development. On
the other hand, analysis of relatively non-volatile ionic drugs is
best done by HPLC, and here the user should be provided with a
standard, qualified HPLC method if such a method exists. Thus,
Module 2 will include both a knowledge base guiding the decision
as to GC versus LC and will provide a library of standard,
qualified chromatographic methods.
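
The coupling of troubleshooting to optimization described in
item 3 above can be made concrete with a small Python sketch built
on the numbers of that example (efficiency of 6000-7000 plates and
asymmetry of 1.2-1.3, followed by a run at 500 plates and asymmetry
9.0). The text does not say how "abnormal" is decided; the
fractional tolerance of 0.5 and both function names are assumptions
of the sketch, not ECAT's diagnostic rules.

```python
def is_abnormal(history, new_run, tolerance=0.5):
    """Flag a run that falls far outside the range seen so far.

    A run is flagged if its plate count is below the historical
    minimum by more than the tolerance fraction, or its asymmetry
    is above the historical maximum by more than that fraction.
    """
    min_plates = min(run["plates"] for run in history)
    max_asymmetry = max(run["asymmetry"] for run in history)
    return (new_run["plates"] < min_plates * (1 - tolerance)
            or new_run["asymmetry"] > max_asymmetry * (1 + tolerance))


def confirms_malfunction(reference_run, repeat_run, tolerance=0.5):
    """True if repeating an earlier experiment fails to reproduce it."""
    return repeat_run["plates"] < reference_run["plates"] * (1 - tolerance)


history = [
    {"ph": 2.0, "acetonitrile": 40, "plates": 6800, "asymmetry": 1.2},
    {"ph": 2.5, "acetonitrile": 45, "plates": 6400, "asymmetry": 1.3},
    {"ph": 3.0, "acetonitrile": 50, "plates": 6100, "asymmetry": 1.3},
]
suspect = {"ph": 3.0, "acetonitrile": 55, "plates": 500, "asymmetry": 9.0}

if is_abnormal(history, suspect):
    # Repeat the pH 3, 50% acetonitrile experiment (hypothetical result).
    repeat = {"ph": 3.0, "acetonitrile": 50, "plates": 550, "asymmetry": 8.5}
    halt = confirms_malfunction(history[-1], repeat)
    print("halt optimization:", halt)
```

In the sketch a confirmed malfunction only sets a flag; in the
planned system this is the point at which the optimization would be
halted and the column diagnosis would take over.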

Knowledge Representation

We are currently investigating the expansion of the ECAT capability
to represent and process knowledge by including a representation
scheme based on hierarchically structured descriptions of
object properties (so-called "frame-based" representation). It is
sometimes awkward to express subtle or indirect knowledge in the
simple form of forward-chaining reasoning we are currently using.
A specialized planning software architecture such as SPEX (22)
might be required to handle the full-fledged design module.
We are looking into ways of expressing uncertainty. For
example, uncertainty occurs in ECAT when there are alternate
suspected causes of a separation malfunction or alternate choices of
bonded phase for some sample classes. In ECAT, representation of
uncertain information within causal reasoning is currently handled
by predicates such as "prefer" or "consider". There is a
long-standing discussion of reasoning about uncertain, inexact, or
unreliable information (23). Certainty factors (24), statistics,
fuzzy sets (25), and explicit reasoning are methods that can be
applied to solve this problem. MYCIN-type certainty factor
handling can be revised to fit entirely into the realm of statistics
(26). Gordon and Shortliffe have proposed a computable method for
using the Dempster-Shafer theory of evidence (27). Cohen and
Grinberg (28) have argued that it is best to reason explicitly
about uncertainty in ways similar to human thought processing.
However, the latter method requires extensive computing, which we
feel is hardly justifiable in our case. We will thus investigate
further application of simple statistics and Gordon and
Shortliffe's proposal. In addition, we will experiment with a
simplified, domain-restricted form of explicit reasoning about
uncertainty.
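
As an illustration of the certainty factor approach (24), the following Python sketch shows the commonly cited MYCIN/EMYCIN rule for combining two certainty factors that bear on the same hypothesis. It is a sketch only: ECAT currently uses the "prefer" and "consider" predicates, and the function name `combine_cf` is an assumption made for this example.

```python
def combine_cf(a: float, b: float) -> float:
    """Combine two certainty factors in [-1, 1] for the same hypothesis."""
    if not (-1.0 <= a <= 1.0 and -1.0 <= b <= 1.0):
        raise ValueError("certainty factors must lie in [-1, 1]")
    if a >= 0 and b >= 0:
        # Two confirming pieces of evidence reinforce each other.
        return a + b * (1 - a)
    if a < 0 and b < 0:
        # Two disconfirming pieces of evidence reinforce each other.
        return a + b * (1 + a)
    # Conflicting evidence: the weaker factor discounts the stronger one.
    denominator = 1 - min(abs(a), abs(b))
    if denominator == 0:
        raise ValueError("combination of +1 and -1 is undefined")
    return (a + b) / denominator
```

For instance, two rules that each suggest the same cause of a separation malfunction with factors 0.6 and 0.4 would combine to 0.76, higher than either alone but still below certainty.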

User Interface

This, of course, is a very important part of the program. We are
developing it on an as-needed basis in response to feedback from
users. In particular, we still have not implemented a complete
explanation facility. The user interface currently provides
online help and a menu-based selection of valid responses whenever
applicable.

Conclusion

We have presented the development of an expert system in HPLC.
The overall project goals are fairly ambitious and will require
continuous work for many years to come. However, the CMP design
module (the key module of the project) and the COLDIAG module are
beyond a prototype stage of development. Since those two modules
cover the two modes of reasoning we consider using, we will be
able to complete the entire ECAT program using the approach
described in this paper.

Acknowledgments

The authors wish to thank Steve Rosenblum for critical reviews of
the manuscript and June Shelley for help in the preparation of the
manuscript.

Literature Cited

1. Expert Systems Part 2, Dessy, R. E., Ed.; Anal. Chem. 1984; 56,
1312A.
2. Expert Systems Part 1, Dessy, R. E., Ed.; Anal. Chem. 1984; 56,
1200A.
3. Glajch, J. L.; et al. J. Chrom. 1980; 199, 57.
4. Glajch, J. L.; et al. J. Chrom. 1982; 238, 269.
5. Kirkland, J. J.; Glajch, J. L. J. Chrom. 1983; 255, 27.
6. Debets, H. G.; et al. Anal. Chim. Acta 1983; 150, 259.
7. Schoenmakers, P. J.; Drouen, A. C. J. H.; Billiet, H. A. H.;
de Galan, L. Chromatographia 1982; 15, 688.
8. Haddad, P. R.; Drouen, A. C. J. H.; Billiet, H. A. H.;
de Galan, L. J. Chrom. 1983; 282, 71.
9. Billiet, H. A. H.; Drouen, A. C. J. H.; de Galan, L.
J. Chrom. 1984; 316, 231.
10. Barr, A.; Feigenbaum, E. "Handbook of AI"; William Kaufmann
Inc., 1982; Vol. II, Chap. VIIB.
11. Lindsay, R.; Buchanan, B. G.; Feigenbaum, E. A.; Lederberg,
J. "DENDRAL"; McGraw Hill: New York, 1980.
12. Crandell, C. W.; Gray, N. A. B.; Smith, D. H. J. Chem. Inf.
and Comp. Sci. 1982; 22, 48.
13. Gray, N. A. B. Artificial Intelligence 1984; 22, 1-21.
14. Jardetzky, O. Proc. Int. Conf. on Frontiers of Biochemistry
and Molecular Biology 1984.
15. Terry, A. Stanford Heuristic Programming Project, Report
No. HPP-83-19, May 1983.
16. Agarwal, K. K.; Larsen, D. L.; Gelernter, H. L. Computers in
Chemistry 1978; 2, 75.
17. Gelernter, H. L.; et al. Science 1977; 197, 1041.
18. Corey, E. J.; Long, A. K.; Rubenstein, S. D. Science 1985;
228, 408-418.
19. Wipke, W. T.; Ouchi, G. I.; Krishnan, S. Artificial
Intelligence 1978; 11, 173.
20. Russell, S. Stanford Knowledge Systems Laboratory, Report
No. KSL-85-12, 1985.
21. Van Horne, K.; Good, T. American Laboratory 1983; 15, 116.
22. Bach, R.; Iwasaki, Y.; Friedland, P. Nucleic Acids Research
1984; 12, 11-29.
23. Panel on Reasoning with Uncertainty for Expert Systems,
International Joint Conference on Artificial Intelligence,
Los Angeles, California, 1985.
24. Buchanan, B. G.; Shortliffe, E. H. "Rule-Based Expert
Systems"; Addison-Wesley, 1984; Chap. 10.
25. Zadeh, L. A. In "Machine Intelligence"; Hayes, J.;
Michie, D.; Mikulich, L. I., Eds.; John Wiley and Sons:
New York, 1979; pp. 149-194.
26. Heckerman, D. Proceedings of the Workshop on Uncertainty and
Probability in Artificial Intelligence; American Association
for Artificial Intelligence, 1985; pp. 9-20.
27. Gordon, J.; Shortliffe, E. H. Artificial Intelligence
1985; 26, 323-357.
28. Cohen, R.; Grinberg, M. R. AI Magazine 1983; 4, 17-24.

RECEIVED January 16, 1986

23
An Expert System for Optimizing
Ultracentrifugation Runs

Philip R. Martz, Matt Heffron, and Owen Mitch Griffith

Beckman Instruments, Inc., Fullerton, CA 92634

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch023

The SpinPro Ultracentrifugation Expert System is a
computer program that designs optimal ultracentrifugation
procedures to satisfy the investigator's
research requirements. SpinPro runs on the IBM PC/XT.
Ultracentrifugation is a common method in the separation
of biological materials. Its capabilities,
however, are too often under-utilized. SpinPro
addresses this problem by employing Artificial
Intelligence (AI) techniques to design efficient and
accurate ultracentrifugation procedures. To use
SpinPro, the investigator describes the centrifugation
problem in a question and answer dialogue. SpinPro
then offers detailed advice on optimal and alternative
procedures for performing the run. This advice
results in cleaner and faster separations and improves
the efficiency of the ultracentrifugation laboratory.

Ultracentrifugation is a common and powerful method in the
separation of biological materials. Despite its widespread use, however,
few investigators fully exploit its capabilities. As a result, run
times are unnecessarily long and separations are indistinct. In the
long run, the efficiency and performance of the laboratory suffer.

The fundamental cause of this situation is the increasing
complexity of the ultracentrifugation environment; the investigator
must select the run parameters from a growing list of rotors,
gradient materials, and literature references. Knowing which rotor
to use and at what run speed and run time is a difficult matter.
Furthermore, the selection of one parameter limits, in complex
ways, the available choices for the remaining parameters.

Reliance on procedures reported in the literature has compounded
the problem. Often these procedures, perhaps initiated by
investigators with a limited set of rotors, are inefficient by
today's standards: the rotor is inappropriate, the run speed is too
slow, or the run time is too long. A new investigator applying such
a procedure does not take full advantage of the potential of
ultracentrifugation.

0097-6156/86/0306-0297$06.00/0


© 1986 American Chemical Society


One solution to the problem is to provide the investigator with
technical advice. Good advice should yield several immediate
benefits: 1) Reliance on inappropriate or outdated techniques can
be eliminated. 2) Better use can be made of the available
equipment; shorter run times and improved separations will result.
3) The advice can be specific to the research requirements of the
investigator. 4) The time usually wasted in performing
standardization runs, designing an ultracentrifuge procedure, or researching
ultracentrifugation techniques can be minimized. In general, good
advice will improve the procedures and thereby improve the
efficiency of most laboratories.

Designing efficient ultracentrifugation procedures and providing
good advice, however, is a complex task; the knowledge and
experience of an ultracentrifugation expert are often required. In
this paper we describe a computer program, the SpinPro
Ultracentrifugation Expert System, that designs ultracentrifugation procedures
in response to the requirements of the investigator. SpinPro runs
on the IBM PC/XT. The program is based on techniques from the field
of Artificial Intelligence (AI) and expert systems: the powerful
capabilities of the Lisp programming language; an inferencing
procedure capable of drawing conclusions from a complex knowledge
base; and a knowledge base derived from the expertise of
ultracentrifugation experts. Indeed, SpinPro's use can be compared to
the advice any person might seek from an expert. The investigator
and SpinPro enter into a question and answer dialogue in which the
investigator describes the research goals and sample
characteristics. At the conclusion of the dialogue, SpinPro produces the
following reports:

1. The Design Inputs Report is a summary of the
SpinPro-investigator dialogue.
2. The Optimal Plan Report describes an optimal ultracentrifugation
procedure designed to solve the problem described in the
dialogue. It uses the most appropriate rotor from the
entire line of Beckman rotors.
3. The Lab Plan Report is similar to the Optimal Plan, but it
describes a procedure based exclusively on the ultracentrifuges
and rotors available in the investigator's laboratory.
4. The Plan Comparisons Report compares the Optimal Plan and Lab
Plan, identifying significant differences and trade-offs between
the two plans.

The reports constitute a complete set of recommendations for the
ultracentrifugation problem posed to SpinPro. Thus, SpinPro
performs the advisory role of an ultracentrifugation expert:
interviewing the investigator for the problem description, offering
expert advice on the most appropriate centrifugation procedure, and
finally, comparing alternative procedures.

Major Functions

SpinPro has four major functions: CONSULTATION, INFORMATION,
CALCULATION, and CONFIGURATION. The CONSULTATION function performs
the role of expert advisor. It is the main topic of this paper.
The INFORMATION function provides a database of ultracentrifugation
techniques, centrifuges, rotors, and literature references. The
CALCULATION function performs a variety of routine calculations
including rotor speed reductions, k factors, and pelleting time.
The CONFIGURATION function records the ultracentrifuges and rotors
in the investigator's laboratory. This information is used by the
CONSULTATION function when designing a run using the equipment from
the laboratory.
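
Two of the routine calculations named above, the k factor and the pelleting time, can be illustrated with the standard textbook relations k = 2.533 x 10^11 ln(rmax/rmin) / rpm^2 and t = k/s, with t in hours and s in Svedbergs. The Python sketch below assumes these relations; the paper does not state which formulas the CALCULATION function uses, and the function names and the radii in the usage note are hypothetical.

```python
import math


def k_factor(r_max: float, r_min: float, rpm: float) -> float:
    """Clearing (k) factor of a rotor; the radii may be in any common unit."""
    if not (r_max > r_min > 0) or rpm <= 0:
        raise ValueError("require r_max > r_min > 0 and rpm > 0")
    return 2.533e11 * math.log(r_max / r_min) / rpm ** 2


def pelleting_time_hours(k: float, s_svedbergs: float) -> float:
    """Estimated time in hours to pellet a particle of the given sedimentation coefficient."""
    if s_svedbergs <= 0:
        raise ValueError("sedimentation coefficient must be positive")
    return k / s_svedbergs
```

With hypothetical radii of 100 mm and 50 mm at 50000 rpm, `k_factor(100, 50, 50000)` is about 70.2, so a 16 S particle would pellet in roughly 4.4 hours. A smaller k factor means a more efficient rotor.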

User Interface

All user inputs are made by pointing at text on the computer screen
with a "mouse" controlled cursor. The mouse is a hand-held pointing
device which, when moved by the investigator over a flat surface,
controls the movement of a cursor or pointer on the computer screen.
To run the CONSULTATION function, the user points at the text
"CONSULTATION" on the screen and clicks the mouse button. The
keyboard is not required to use SpinPro. In our observations,
novice users of the program have been able to design
ultracentrifugation procedures within minutes of using the program.

The CONSULTATION Function

The primary goal of the CONSULTATION function is to provide the best
advice possible on precisely how to set up and run an
ultracentrifugation procedure that is specifically designed for the
investigator's research. SpinPro addresses virtually all problems in the
ultracentrifugation of biological samples excluding whole cells. To
this end, SpinPro is "knowledgeable" about differential, rate-zonal,
and isopycnic methods. It addresses the separation of proteins,
glycoproteins, proteoglycans, lipoproteins, subcellular fractions,
nucleic acids, and viruses. SpinPro's rotor knowledge includes
swinging bucket, fixed angle, vertical tube, zonal, and continuous
flow rotors.

Operation

The CONSULTATION function is run by using the mouse to select the
text "CONSULTATION" from the computer screen. The first question of
the dialogue, "Please enter the class of your sample of interest",
appears on the screen. The pop-up menu lists the sample types to
choose from. The investigator then uses the mouse to select the
appropriate response from the pop-up menu. This question and answer
procedure continues until SpinPro has enough information (typically
after 10 to 15 questions) to infer all of the relevant parameters.
The dialogue is directed by SpinPro in response to answers
to previous questions. Thus, if the sample is a protein, SpinPro
requests the sedimentation coefficient; if the sample is a nucleic
acid, SpinPro requests the type of nucleic acid. At the conclusion
of the dialogue, the reports are written to the disk. Using the
pop-up menu, the reports can be read or saved.
The dialogue includes capabilities to increase its flexibility.
First, the investigator can change an answer to a previous question
without disrupting the course of the dialogue. This capability is
useful when describing a problem that differs only slightly from a
previously described problem. Second, the investigator can ask why
the current question is being asked. The "Why?" function informs
the user what SpinPro is attempting to infer (i.e., the line of
reasoning) at any particular step, and it describes the effect that
different answers will have on the line of reasoning. Third, when
the answer to a question is not known, the investigator can answer
the question with "unknown". Depending on the question, SpinPro
responds either by asking a related question or by assuming a
reasonable answer and designing the procedures based on this
assumption. Any assumptions that have been made are noted in the
reports. Finally, for the experienced users of SpinPro, there is
the option to request that, during the dialogue, a short form of the
question be used.
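
The answer-directed dialogue described above can be pictured with a small Python sketch. It is an illustration only, not SpinPro's Lisp implementation: the question names, the `FOLLOW_UPS` and `DEFAULTS` tables, and both functions are hypothetical, and the "related question" response to "unknown" is omitted for brevity.

```python
from typing import Dict, List, Optional

# Hypothetical follow-up question for each sample class (two cases from the text).
FOLLOW_UPS = {
    "protein": "sedimentation coefficient",
    "nucleic acid": "type of nucleic acid",
}

# Hypothetical reasonable answers assumed when the reply is "unknown".
DEFAULTS = {"sample form": "liquid"}


def next_question(answers: Dict[str, str]) -> Optional[str]:
    """Choose the next question from the answers given so far."""
    if "sample class" not in answers:
        return "sample class"
    follow_up = FOLLOW_UPS.get(answers["sample class"])
    if follow_up is not None and follow_up not in answers:
        return follow_up
    return None  # enough information in this toy dialogue


def record_answer(answers: Dict[str, str], assumptions: List[str],
                  question: str, reply: str) -> None:
    """Store a reply; an "unknown" reply falls back to a default and is noted."""
    if reply == "unknown" and question in DEFAULTS:
        answers[question] = DEFAULTS[question]
        assumptions.append(question)  # assumptions are listed in the reports
    else:
        answers[question] = reply
```

Because `next_question` is recomputed from the stored answers, changing an earlier answer simply redirects the remainder of the dialogue.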

Optimization Criteria

Two of the dialogue questions are of unique importance and are
particularly representative of SpinPro's capabilities. The first is
a question of research requirements. Every ultracentrifugation
procedure should reflect the investigator's concern for purity of the
separation or short run time, goals that often run counter to each
other. Rarely does any procedure state this trade-off explicitly.
The optimization criteria question, "Select one of the following
optimizations:", not only identifies the trade-offs involved when
designing a procedure, but allows the investigator to control them.
The investigator can select the criterion which satisfies the
specialized requirements of the research. The criteria are: 1)
purity, 2) minimize run time, 3) minimize cumulative run time, 4)
minimize number of runs, 5) continuous flow rotor procedures, and 6)
procedures for processing many samples of small volume.
Based on the optimization criterion, SpinPro can select the
most appropriate rotor. For example, suppose the investigator has a
relatively large sample volume, all of which needs to be processed
as soon as possible. The "minimize cumulative run time" criterion
would be the appropriate choice. SpinPro would then initiate the
following rotor selection procedure: SpinPro determines the total
sample volume based on inputs of the sample volume, the current
concentration of the sample, and a correction for any pre-run
dilutions of the sample. Next, consideration is made for whether
tubes or bottles will be used. The program then evaluates rotors
for the number of tube positions and the amount of sample per tube.
At this point, SpinPro will have estimated for each rotor the number
of runs required to process the sample. SpinPro then estimates the
run time for each rotor to perform a single run. Based on these
estimates, SpinPro selects the rotor that will give the shortest
total run time when the run time is summed over the total number of
runs. Similarly, the investigator can select any of the
optimization criteria and initiate a variety of precise rotor
selection procedures.
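
The "minimize cumulative run time" procedure just described can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not SpinPro's actual procedure: the `Rotor` fields and function names are hypothetical, the total sample volume is taken as an input (the dilution correction and the tubes-versus-bottles choice are omitted), and the single-run time is supplied rather than estimated.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rotor:
    """Hypothetical rotor record; the values used with it are illustrative."""
    name: str
    tube_positions: int
    sample_ml_per_tube: float
    run_time_hours: float  # estimated time for a single run


def runs_required(rotor: Rotor, total_sample_ml: float) -> int:
    """Number of runs needed to process the whole sample in this rotor."""
    per_run_ml = rotor.tube_positions * rotor.sample_ml_per_tube
    # Rounding guards against floating-point noise before taking the ceiling.
    return math.ceil(round(total_sample_ml / per_run_ml, 9))


def cumulative_run_time(rotor: Rotor, total_sample_ml: float) -> float:
    """Single-run time summed over the number of runs required."""
    return runs_required(rotor, total_sample_ml) * rotor.run_time_hours


def select_rotor(rotors: List[Rotor], total_sample_ml: float) -> Rotor:
    """Pick the rotor with the shortest cumulative run time."""
    return min(rotors, key=lambda r: cumulative_run_time(r, total_sample_ml))
```

A rotor with a longer single-run time can still win if it holds enough sample to need fewer runs, which is the trade-off this criterion is meant to capture.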

Lab Rotors

The second question of unique importance concerns the investigator's
selection of a rotor for the Lab Plan. Whereas SpinPro selects the
rotor in the Optimal Plan, the investigator selects the rotor in the
Lab Plan. The investigator, however, is not required to
select the rotor blindly from those available in the lab. SpinPro
assists in the selection by assigning each of the rotors in the lab
to a category based on how well the rotor satisfies the requirements
of the problem. The categories are as follows:

1. Optimal rotors - the rotors that are both best suited to
perform the run and to achieve the stated optimization criterion.
2. Alternate rotors - other rotors that are not optimal but can
perform the run.
3. Not qualifying rotors - rotors that are not recommended for the
problem, usually because they are too large or too small for the
sample volume, or because they do not generate sufficiently
high centrifugal forces.
4. Not compatible rotors - rotors that are not classified, as part
of the rotor safety program, for running in the ultracentrifuge
chosen from the lab.

The investigator can select any rotor from categories 1 and 2 above.
This allows the investigator to experiment with the rotors in the
lab and to design procedures as variations on the theme established
in the Optimal Plan. Ultimately, the rotors selected in the Optimal
Plan by SpinPro and in the Lab Plan by the investigator are the
major source of difference in the run parameters, purity, and
overall effectiveness of the two plans.
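
The four rotor categories can be expressed as a short Python sketch. The tests and their order are assumptions made for illustration; the `LabRotor` fields are hypothetical, and the `best_for_criterion` flag stands in for the outcome of SpinPro's own rotor selection procedure, which is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Category(Enum):
    OPTIMAL = "optimal"
    ALTERNATE = "alternate"
    NOT_QUALIFYING = "not qualifying"
    NOT_COMPATIBLE = "not compatible"


@dataclass
class LabRotor:
    """Hypothetical description of one rotor in the investigator's lab."""
    name: str
    min_volume_ml: float
    max_volume_ml: float
    max_force_g: float
    classified_for_centrifuge: bool  # rotor safety program classification
    best_for_criterion: bool         # outcome of the selection procedure


def categorize(rotor: LabRotor, sample_ml: float, required_force_g: float) -> Category:
    """Assign a lab rotor to one of the four categories."""
    if not rotor.classified_for_centrifuge:
        return Category.NOT_COMPATIBLE
    volume_ok = rotor.min_volume_ml <= sample_ml <= rotor.max_volume_ml
    force_ok = rotor.max_force_g >= required_force_g
    if not (volume_ok and force_ok):
        return Category.NOT_QUALIFYING
    return Category.OPTIMAL if rotor.best_for_criterion else Category.ALTERNATE


def selectable(rotors: List[LabRotor], sample_ml: float, required_force_g: float) -> List[str]:
    """Names of rotors the investigator may choose (categories 1 and 2)."""
    allowed = {Category.OPTIMAL, Category.ALTERNATE}
    return [r.name for r in rotors
            if categorize(r, sample_ml, required_force_g) in allowed]
```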

The Design Inputs Report

As noted earlier, SpinPro writes four reports regarding the
recommended procedures. The Design Inputs Report summarizes the
questions posed by SpinPro and the answers provided by the investigator.
A Design Inputs Report is shown in Figure 1. The pop-up menu on the
right allows the user to switch between reports, print the reports,
or perform other functions. The report summarizes the problem that
is addressed by the Optimal Plan (Figure 2) and the Plan Comparisons
Report (Figure 3).
A summary of the report follows: The problem is to separate
proteins. Furthermore, SpinPro should pay particular attention to
the purity of the separation. The sample is not negatively affected
by sucrose, has a sedimentation coefficient of 16 Svedbergs, and is
in liquid form with a volume of 3 mL and a concentration of 1% w/w.
The protein of interest should be placed 45% from the top of the
gradient at the end of the run. Of the gradient concentrations
10-40% and 5-20%, the 10-40% is preferred by the investigator. There
are no solvents in the sample that are harmful to the tubes.
Finally, from the lab, SpinPro should use the L2-75B ultracentrifuge
and the SW 41 Ti rotor, which does not require a speed derating due
to its age.

The Optimal Plan Report

The Optimal Plan is SpinPro's recommendation of how best to perform
the run. Each line of the Optimal Plan of Figure 2 is quoted and
annotated below:


SpinPro Ultracentrifugation Expert System
Design Inputs

Experiment: SpinPro Consultation 11-Sept-1985 9:30:00

Particle class:                Protein
Separation vs Concentration:   Separation
Optimization criterion:        Purity
Assoc/Dissoc in sucrose:       No
Sedimentation coefficient:     16.0
10-40% or 5-20% gradient?:     10-40
Sample form:                   liquid/semi-solid
Total sample volume (mL):      3.0
Sample concentration % w/w:    1.0
Selected final location:       45.0
Solvents:                      No
Selected lab centrifuge:       L2-75B
Selected lab rotor:            SW 41 Ti
Rotor derated?:                No

Pop-up menu: Page Forward, Page Backward, Optimal Plan, Lab Plan,
Comparisons, Design Inputs, Change Answer, Save Reports,
SpinPro Top, Exit to DOS

Figure 1. The Design Inputs Report for the problem described to
SpinPro. The Optimal Plan and the Lab Plan are based on this
problem. The pop-up menu on the right allows switching to the
other reports or performing other functions.


SpinPro Ultracentrifugation Expert System
Optimal Plan

Experiment: SpinPro Consultation 11-Sept-1985 9:30:00

This is a complete plan for a protein sample separation
Optimization criterion: Purity

Method:                      Density gradient, Rate-zonal
Gradient:                    10-40% continuous sucrose
Rotor/run conditions:        SW 55 Ti rotor at 55000 rpm
                             for approximately 6 hours
Potential tube materials:    Polyallomer, Ultra-Clear
Centrifuge:                  L8-80M set at 4 degrees C
Omega-squared t:             7.132 x 10^11
Acceleration/deceleration:   fast/fast

Prior to the run prepare sample as follows:
No special sample preparation is required.
Load 0.3 mL of the Protein sample in full tubes at
the top position of the gradient.
At the end of the run the 16 S particles will be
approximately 45% from the top of the gradient.
To process the entire sample volume requires approximately
2 centrifuge run(s) with an estimated total run time of
12 hours, 5 minutes.

Pop-up menu: Page Forward, Page Backward, Optimal Plan, Lab Plan,
Comparisons, Design Inputs, Change Answer, Save Reports,
SpinPro Top, Exit to DOS

Figure 2. The Optimal Plan Report for the problem described in
the Design Inputs Report of Figure 1. The plan gives the
recommended procedure for doing the run.

SpinPro Ultracentrifugation Expert System
Plan Comparisons

Experiment: SpinPro Consultation 11-Sept-1985 9:30:00

Run summaries:
Optimal: SW 55 Ti at 55000 rpm for 6 hours per run
in 2 run(s). Requiring a total of approximately
12 hours, 5 minutes
Lab: SW 41 Ti at 41000 rpm for 15 hours,
45 minutes per run in 2 run(s). Requiring a total of
approximately 31 hours, 30 minutes

Comparisons:
The Optimal Plan requires 38% of the Lab Plan run
time for a single run. It requires 38% of the Lab
Plan run time when processing the entire sample.

Pop-up menu: Page Forward, Page Backward, Optimal Plan, Lab Plan,
Comparisons, Design Inputs, Change Answer, Save Reports,
SpinPro Top, Exit to DOS

Figure 3. The Plan Comparisons Report compares the Optimal and
Lab Plans. The comparison shows that, because the Lab Plan uses
the SW 41 Ti rotor, the run times are dramatically different.
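
The percentages in the Plan Comparisons Report of Figure 3 can be checked with a short Python calculation. The helper names are ours; the run times are those printed in the report, and rounding to the nearest whole percent is an assumption about how the report presents the ratio.

```python
def to_minutes(hours: int, minutes: int = 0) -> int:
    """Convert an hours/minutes pair to minutes."""
    return 60 * hours + minutes


def percent_of(optimal_minutes: int, lab_minutes: int) -> int:
    """Optimal Plan run time as a whole-number percentage of the Lab Plan run time."""
    return round(100 * optimal_minutes / lab_minutes)


# Single run: 6 hours (Optimal) versus 15 hours, 45 minutes (Lab).
single_run = percent_of(to_minutes(6), to_minutes(15, 45))

# Entire sample: 12 hours, 5 minutes (Optimal) versus 31 hours, 30 minutes (Lab).
entire_sample = percent_of(to_minutes(12, 5), to_minutes(31, 30))
```

Both ratios round to 38%, in agreement with the report.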


This is a complete plan for a protein sample separation. All of the
relevant parameters have been inferred in a "complete plan".
"Partial plans" indicate that one or more parameters could not be
determined.

Optimization criterion: Purity. The report restates the
optimization criterion chosen by the investigator.

Method: Density gradient, Rate-zonal. The rate-zonal method is one
of six addressed by SpinPro. The other methods are differential,
differential-flotation, discontinuous, isopycnic, and 2-step
isopycnic. These methods differ dramatically in their setup,
principles of operation, and expected results. The rate-zonal
method is described here briefly so that the recommendations to
follow can be appreciated. Prior to the run in a rate-zonal method,
a gradient material is introduced to the rotor tubes in steps of
increasing density from the top to the bottom of the tube. The
sample to be separated is layered, as a thin band, on the top of the
gradient. As the run begins, each component in the sample moves
toward the bottom of the tube. Some components sediment faster than
others. This fact is the basis for the separation. If the run
parameters are appropriate, the components will form separate bands
within the gradient. At the conclusion of the run, the band
representing the component of interest can be removed from the tube.

Gradient: 10-40% continuous sucrose. SpinPro usually selects the
gradient concentration and the gradient material. Here, SpinPro
narrowed the choices to the 5-20% or 10-40% gradient, noting in the
dialogue that a trade-off between purity and run time exists between
the 5-20% and the 10-40% gradient, but either will work. The
investigator selected the 10-40% gradient. The investigator could, if
desired, finish the plan based on the 10-40% gradient, and then
using the change answer function, try the 5-20% gradient to find out
how the recommendations differ. Sucrose is the gradient material of
choice here. SpinPro considers a wide variety of gradient materials
including cesium chloride, Nycodenz, Metrizamide, glycerol, and
potassium tartrate.

Rotor/run conditions: SW 55 Ti rotor at 55000 rpm for approximately
6 hours. These recommendations form the core of any procedure.
SpinPro usually considers more factors in the rotor selection
process than does the expert. In determining the run speed, SpinPro
considers every possible reason to reduce the run speed. If there
are none, the rotor is run at full speed. When there are reasons
(e.g., when using salt gradients, bottles, differential pelleting,
or discontinuous runs), the run speed may have to be reduced
dramatically, from 80,000 rpm to 40,000 rpm, for example. There are many
cases of rotors being run too slowly for the application or too fast
for safety. Accurate determination of the run time is a complex
problem based on the gradient characteristics, calculations,
interpolations from numerical tables, and experience. SpinPro employs
all of these methods in order to infer run times for many special
cases.


Potential tube materials: Polyallomer, Ultra-Clear. SpinPro checks
that all gradient materials, samples, and solvents are compatible
with the tube materials. The effects of acids, bases, oils, organic
solvents, and salts on the tube materials are considered.

Centrifuge: L8-80M set at 4 degrees C. The Optimal Plan recommends
the L8-80M ultracentrifuge. SpinPro selects a temperature that will
protect the integrity of the sample.

Omega-squared t: 7.132 x 10^11. SpinPro calculates this measure of
the total force applied to the gradient and sample during the run.
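
The reported value can be related to the run speed and run time through the usual definition of omega-squared t, the square of the angular velocity in radians per second multiplied by the run time in seconds. The Python sketch below assumes that definition and these units, which the report does not state explicitly; the function names are ours.

```python
import math


def omega_squared_t(rpm: float, hours: float) -> float:
    """Integrated centrifugal effect in rad^2/s for a constant-speed run."""
    omega = 2 * math.pi * rpm / 60  # angular velocity in rad/s
    return omega ** 2 * hours * 3600


def run_hours_from(w2t: float, rpm: float) -> float:
    """Run time in hours implied by an omega-squared t value at a given speed."""
    omega = 2 * math.pi * rpm / 60
    return w2t / omega ** 2 / 3600
```

At 55000 rpm, exactly 6 hours gives about 7.17 x 10^11, and the reported 7.132 x 10^11 corresponds to about 5.97 hours, consistent with the plan's "approximately 6 hours".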

Acceleration/deceleration: fast/fast. Many investigators overlook
the effect that improper acceleration or deceleration can have on
disrupting the separation, especially when re-orientation of the
gradient occurs in fixed angle or vertical tube rotors. SpinPro
addresses many special cases.

Prior to the run prepare sample as follows: No special sample
preparation is required. Proper sample preparation is important to
prevent "overloading" the gradient. A sample that is too
concentrated will drift through the gradient before the run is started.
If the sample is in a proper form, as it is here, then no
preparation will be recommended.

Load 0.3 mL of the Protein sample in full tubes at the top position
of the gradient. Applying the correct amount of sample is important
to prevent "overloading" the gradient. The rotor tubes can be run
full or half full, or bottles can be used in place of tubes.
SpinPro determines which option is most appropriate. A number of
parameters are affected by this option, including the run time.
Knowing where to load the sample is important. Samples can be
loaded at the top, middle, or bottom of gradients, or mixed
homogeneously with them.

At the end of the run the 16 S particles will be approximately 45%
from the top of the gradient. In the rate-zonal method, common
practice is to have the component of interest at the 50% position in
the gradient when the run is over. SpinPro allows the final
position to be specified, giving the investigator the opportunity to
adjust the procedure so that components not of interest are widely
separated from the component of interest.

To process the entire sample volume requires approximately 2
centrifuge run(s) with an estimated total run time of 12
hours, 5 minutes. SpinPro determines how many runs are required to
process the entire sample volume. The total run time is estimated.
When large sample volumes are involved, and thus many runs are
required, the investigator can change the optimization criterion to
"minimize number of runs" or "minimize cumulative run time" in order
to process the sample more efficiently. Since two runs are required
here, the investigator may want to select a larger rotor for use in
the Lab Plan.
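
The run count and cumulative run time can be illustrated with simple
arithmetic. The sketch below is only an assumption about how such an
estimate might be made: it divides the sample volume by the volume
loaded per run and rounds up. The function name, the number of tubes,
and the volumes and times are hypothetical; the source does not
describe SpinPro's actual procedure.

```python
import math


def plan_runs(total_sample_ml, load_per_tube_ml, tubes_per_run, hours_per_run):
    """Return (number of runs, cumulative run time in hours)."""
    capacity_per_run = load_per_tube_ml * tubes_per_run
    # Subtract a small tolerance so floating-point noise does not add a run.
    runs = math.ceil(total_sample_ml / capacity_per_run - 1e-9)
    return runs, runs * hours_per_run


# Hypothetical case: 3.0 mL of sample, 0.3 mL per tube, six tubes per run.
runs, total_hours = plan_runs(3.0, 0.3, 6, 6.0)
print(runs, total_hours)
```

A rotor with a larger capacity per run lowers the run count even if each
run is slower, which is the trade-off the two optimization criteria
address.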

The Lab Plan Report

The Lab Plan provides information similar to that of the Optimal
Plan except that there is the additional constraint of using only
the ultracentrifuges and rotors available in the laboratory. This
requirement can result in dramatic differences between the Optimal
Plan and the Lab Plan. The run times can differ by hours, for
example, or the purity of the separation can be significantly
affected. A completely different gradient can be recommended as a
function of the rotor selected from the lab. If there are no rotors
in the lab capable of doing the separation, SpinPro reports that the
run cannot be done with the available rotors.

The Plan Comparisons Report

The Plan Comparisons report summarizes the differences between the
plans in terms of run time and number of runs required to process
the sample (Figure 3). In the figure the Optimal Plan uses the
SW 55 Ti rotor and the Lab Plan uses the SW 41 Ti rotor. The
different run times resulting from these rotors are compared on a
percentage basis. A similar comparison is made for the total run
time required to process the entire sample. Each of the rotors
requires two runs to process the entire sample. The comparison of
the total run times can help in identifying the slower, but larger
capacity, rotors that are more efficient for handling large sample
volumes. If warranted, SpinPro makes qualitative comparisons
between the two plans.

Expert System Details

SpinPro is a typical backward chaining, rule-based expert system.
Rule-based systems are systems in which the expert's knowledge is
encoded primarily in the form of if-then rules, i.e., if a set of
conditions is found to be true then draw a conclusion or perform an
action. "Backward chaining" refers to the procedure for finding a
solution to a problem. In a backward chaining system, the inference
engine works backwards from a hypothesized solution to find facts
that support the hypothesis. Alternative hypotheses are tried until
one is found that is supported by the facts.
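
The backward chaining procedure described above can be sketched in a
few lines. This is a generic illustration, not the "MP" inference
engine: the rule contents, fact names, and function names are all
hypothetical, and features such as rule groups and fact tables are
omitted.

```python
# Toy knowledge base: each rule pairs a conclusion with the premises that
# must all hold. The rule contents are hypothetical, not SpinPro rules.
RULES = [
    ("use_rate_zonal", ["sample_is_protein", "separate_by_size"]),
    ("use_isopycnic", ["sample_is_dna", "separate_by_density"]),
    ("separate_by_size", ["particles_differ_in_s_value"]),
]


def prove(goal, facts, rules, visiting=frozenset()):
    """Return True if `goal` is a known fact or can be derived from rules."""
    if goal in facts:
        return True
    if goal in visiting:  # guard against circular rules
        return False
    visiting = visiting | {goal}
    for conclusion, premises in rules:
        if conclusion == goal and all(
            prove(p, facts, rules, visiting) for p in premises
        ):
            return True
    return False


def first_supported(hypotheses, facts, rules):
    """Try each hypothesis in turn; return the first one the facts support."""
    for hypothesis in hypotheses:
        if prove(hypothesis, facts, rules):
            return hypothesis
    return None


facts = {"sample_is_protein", "particles_differ_in_s_value"}
choice = first_supported(["use_isopycnic", "use_rate_zonal"], facts, RULES)
print(choice)
```
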
SpinPro's backward chaining inference engine is called "MP".
"MP" has been developed by Beckman to support the development of
expert systems. It has several features that have been designed
specifically in response to the requirements of the SpinPro project.
Two of these requirements are that SpinPro run on an IBM PC/XT and
that the program-user interface be advanced and easy to use. The
report generator and the pop-up menu/mouse interaction provide the
advanced user interface. To be able to run the program on the IBM
PC/XT and still address the ultracentrifugation problem required the
development of fact tables, "why responses", rule functions, rule
groups, and "constraints". Development of these features has
greatly improved the ability of "MP" to make complex inferences.

Some of these features are demonstrated in the rule example of
Figure 4. The rule, one of approximately 800 rules in SpinPro, is
assigned to the rule group 2-STEP.ISOPYCNIC.DNA.RULES. Only those
rules, identified by the rule group name and pertinent to the
solution of a particular problem, are applied to that problem. This
breakdown of rules into rule groups is one of the methods used to
facilitate putting a complex expert system on a microcomputer with
relatively limited memory and processing power.

The overall effect of the rule in Figure 4 is to select, from a
set of rotors, those rotors that are best for minimizing the run
time when using the 2-step isopycnic method to separate DNA. The
initial set of rotors is called USERS.MATCHED.ROTORS. The final set
of rotors is called the MINIMIZE.RUN.TIME.ROTORS. The body of the
rule applies tests to the initial set of rotors and concludes that
the rotors passing the tests are the MINIMIZE.RUN.TIME.ROTORS. In
greater detail, clause 1 of the rule tests the value of the
parameter VERTICAL.TUBE.ROTORS. The value of this parameter tells
SpinPro whether vertical tube rotors should be considered for the
run. Often this can be deduced by SpinPro, but when it can't, the
question "Do you want to consider using vertical tube rotors in this
run" is posed to the user. The parameter VERTICAL.TUBE.ROTORS has a
set of properties that define its characteristics, including the
prompt used to request the information, the "expect" property used
to specify the acceptable responses to the prompt, and the "Why
Response" property used in response to the investigator's input of
"Why?".

If the value of VERTICAL.TUBE.ROTORS is found to be true (or
"yes") then clause 2 of the rule is evaluated. The references to
"fact" in clause 2 cause the system to refer to a table that
contains the facts for particular rotors. References to the facts
ROTOR.DESIGN, TUBE.VOLUME, and K.FACTOR are applications of
particular constraints to the rotors. For example, two constraints
are that the rotor must have a tube volume greater than 1 mL and a k
factor of 50 or less. Clause 3 further pares the set of rotors on
the basis of k factor by taking only the best rotor and any rotor
with a k factor within 50% of the k factor of the best rotor.
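
The parameter properties and the three clauses of the rule can be
mirrored in a short sketch. It is an illustration only, written in
Python rather than in the rule language of "MP". The rotor names, fact
values, and "Why Response" text are hypothetical placeholders, and
"within 50%" is read here as a k factor no more than 1.5 times the
smallest one, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Parameter:
    """A parameter with the three properties named in the text."""
    name: str
    prompt: str
    expect: tuple       # acceptable responses to the prompt
    why_response: str   # shown when the investigator types "Why?"


def ask(param, read=input, write=print):
    """Prompt until an acceptable response is given; answer 'Why?' on request."""
    while True:
        answer = read(param.prompt + " ").strip().lower()
        if answer == "why?":
            write(param.why_response)
        elif answer in param.expect:
            return answer
        else:
            write("Please answer one of: " + ", ".join(param.expect))


VERTICAL_TUBE_ROTORS = Parameter(
    name="VERTICAL.TUBE.ROTORS",
    prompt="Do you want to consider using vertical tube rotors in this run?",
    expect=("yes", "no"),
    why_response="(hypothetical explanation of why the question is asked)",
)

ALLOWED_DESIGNS = {"SWINGING.BUCKET", "FIXED.ANGLE", "VERTICAL.TUBE"}

# Hypothetical fact table; rotor names and numbers are placeholders.
ROTOR_FACTS = {
    "ROTOR.A": {"ROTOR.DESIGN": "VERTICAL.TUBE", "TUBE.VOLUME": 5.0, "K.FACTOR": 8.0},
    "ROTOR.B": {"ROTOR.DESIGN": "FIXED.ANGLE", "TUBE.VOLUME": 13.5, "K.FACTOR": 11.0},
    "ROTOR.C": {"ROTOR.DESIGN": "SWINGING.BUCKET", "TUBE.VOLUME": 5.0, "K.FACTOR": 30.0},
    "ROTOR.D": {"ROTOR.DESIGN": "FIXED.ANGLE", "TUBE.VOLUME": 0.8, "K.FACTOR": 9.0},
    "ROTOR.E": {"ROTOR.DESIGN": "FIXED.ANGLE", "TUBE.VOLUME": 38.5, "K.FACTOR": 60.0},
}


def rule_2667(vertical_tube_rotors, users_matched_rotors, facts=ROTOR_FACTS):
    """Return MINIMIZE.RUN.TIME.ROTORS, or None if clause 1 fails."""
    if vertical_tube_rotors != "yes":  # clause 1
        return None
    collected = [  # clause 2: apply the fact-table constraints
        r for r in users_matched_rotors
        if facts[r]["ROTOR.DESIGN"] in ALLOWED_DESIGNS
        and facts[r]["TUBE.VOLUME"] > 1
        and facts[r]["K.FACTOR"] <= 50
    ]
    if not collected:
        return []
    best = min(facts[r]["K.FACTOR"] for r in collected)
    # Clause 3: keep the best rotor and any rotor within 50% of its k factor.
    return [r for r in collected if facts[r]["K.FACTOR"] <= 1.5 * best]


print(rule_2667("yes", list(ROTOR_FACTS)))
```

In an interactive session the first argument would come from
`ask(VERTICAL_TUBE_ROTORS)` whenever the value cannot be deduced.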

The Other Functions

SpinPro includes two other functions that enhance its role as an
expert advisor. This is in recognition that an expert provides more
than expert advice. An ultracentrifugation expert serves in many
roles: a teacher of centrifugation principles, a describer of
standard procedures, and a source of literature references.

The INFORMATION function contains an extensive database of
ultracentrifugation information organized in a hierarchical fashion
(Figure 5). The primary purpose of the INFORMATION function is to
provide an on-line reference to separation techniques, gradient
materials, rotors, tubes, and centrifuges. For example, INFORMATION
can be used to get information on the Type 70.1 Ti rotor, the
compatibility of polyallomer tubes with certain chemicals, a
description of rate-zonal separations, and references to isopycnic
methods. The subjects in the information hierarchy can be expanded
to give a more detailed breakdown of the subject. For example,
expanding the "Fixed Angle" subject yields a detailed breakdown of
the fixed angle rotors. The investigator could now select one of
the rotor names on the screen and get information about that rotor.
The INFORMATION function includes the subject "SpinPro", which is a
complete on-line manual of the SpinPro system.
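
A hierarchy of this kind can be modeled as a mapping from each subject
to its sub-subjects. The sketch below uses the subject names that
appear in Figure 5; treating the second column of the figure as the
expansion of "Ultracentrifuge Rotors" is an assumption, and the function
names are hypothetical.

```python
# Subject names are taken from Figure 5. Placing the rotor-related
# subjects under "Ultracentrifuge Rotors" is an assumption.
HIERARCHY = {
    "Information Top Level": [
        "SpinPro",
        "Sample Materials and Particles",
        "Separation Methods",
        "Tubes and Bottles",
        "Ultracentrifuge Rotors",
        "Ultracentrifuges",
        "Glossary",
    ],
    "Ultracentrifuge Rotors": [
        "Fixed Angle",
        "Vertical Tube",
        "Swinging Bucket",
        "Continuous Flow",
        "Zonal",
        "Table of Rotors by Use",
        "Accessories",
        "Rotor Maintenance",
        "Rotor Warranties",
    ],
}


def expand(subject):
    """Return the sub-subjects of a subject (empty if it has none recorded)."""
    return HIERARCHY.get(subject, [])


def path_to(subject, root="Information Top Level"):
    """Return the list of subjects from the root down to `subject`, or None."""
    if root == subject:
        return [root]
    for child in expand(root):
        tail = path_to(subject, child)
        if tail:
            return [root] + tail
    return None


print(" > ".join(path_to("Fixed Angle")))
```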

RULE 2667: (Rulegroup: 2-STEP.ISOPYCNIC.DNA.RULES)

If:   1) VERTICAL.TUBE.ROTORS, and
      2) Find all instances of THAT.ROTOR among the value of
         USERS.MATCHED.ROTORS such that:
           1) the ROTOR.DESIGN fact of THAT.ROTOR = one of:
              SWINGING.BUCKET, FIXED.ANGLE, or VERTICAL.TUBE, and
           2) the TUBE.VOLUME fact of THAT.ROTOR > 1, and
           3) the K.FACTOR fact of THAT.ROTOR <= 50
         (saving those in COLLECTED.ROTORS), and
      3) Find all instances of THAT.ROTOR among COLLECTED.ROTORS for
         which: the K.FACTOR fact of THAT.ROTOR is within 50% of the
         smallest value so computed (saving those in COLLECTED.ROTORS)

Then: 1) Conclude that MINIMIZE.RUN.TIME.ROTORS is each of
         COLLECTED.ROTORS.

Figure 4. A rule that selects rotors to minimize the run time
in a plasmid DNA separation. The rule examines a set of rotors
called USERS.MATCHED.ROTORS, selecting those rotors that satisfy
criteria based on the rotor design, tube volume, and k factor.

[Figure 5 screen display. Information Top Level: SpinPro; Sample
Materials and Particles; Separation Methods; Tubes and Bottles;
Ultracentrifuge Rotors; Ultracentrifuges; Glossary. Subjects shown
at the second level: Fixed Angle; Vertical Tube; Swinging Bucket;
Continuous Flow; Zonal; Table of Rotors by Use; Accessories; Rotor
Maintenance; Rotor Warranties. Screen prompt: "Point at an
Information Item and click any mouse button for Options Menu."]

Figure 5. The information hierarchy of SpinPro showing the
categories of information available. The positions in the hierarchy
can be expanded to give a more detailed breakdown of each subject.

The CALCULATION function provides a variety of routine calculations
performed in most ultracentrifugation laboratories. Included
are dilution calculations for sucrose, a pelleting time calculation,
and a calculation for determining rotor speed reductions for salt
gradients. As with the INFORMATION function, the CALCULATION
function is a support tool in the effort to design and carry
out a separation efficiently.
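
The three kinds of calculation can be illustrated with commonly quoted
textbook relations. These are assumptions about what such calculations
look like, not the formulas implemented in SpinPro: pelleting time is
taken as the rotor k factor divided by the sedimentation coefficient in
Svedbergs, the speed reduction assumes a rotor rated for a solution
density of 1.2 g/mL, and the dilution assumes weight/volume
concentrations. All function names and inputs are hypothetical.

```python
import math


def pelleting_time_hours(k_factor, s_svedbergs):
    """Approximate pelleting time: t (hours) = k / s."""
    return k_factor / s_svedbergs


def reduced_max_rpm(rated_rpm, solution_density, design_density=1.2):
    """Derated speed for dense solutions: rated * sqrt(design / actual)."""
    if solution_density <= design_density:
        return rated_rpm
    return rated_rpm * math.sqrt(design_density / solution_density)


def stock_volume_ml(stock_conc, final_conc, final_volume_ml):
    """Volume of stock needed, from C1 * V1 = C2 * V2 (w/v concentrations)."""
    return final_conc * final_volume_ml / stock_conc


# Hypothetical inputs.
print(pelleting_time_hours(k_factor=50, s_svedbergs=16))
print(round(reduced_max_rpm(60_000, solution_density=1.7)))
print(stock_volume_ml(stock_conc=60, final_conc=20, final_volume_ml=12))
```

The design density differs between rotors, so the default of 1.2 g/mL
should be replaced by the value given in the rotor's documentation.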

Development of SpinPro

There is much concern about the length of time required to develop
expert systems, particularly since so many have achieved various
stages of prototype, but few have been completed. Our experience
with SpinPro has led to many insights, more than can be fully
discussed here. Nevertheless, a few major points are worth mentioning.

It is not particularly clear to us why SpinPro has succeeded in
achieving product status and other expert systems have not, although
we suspect that an early decision to produce a product rather than
to do AI research has been important. The problem domain of
ultracentrifugation appears to have been a good choice. The domain has
proven to be fairly well bounded, even though the 800 rules required
have exceeded early estimates by a factor of four. When considering
the various stages of prototyping, debugging, and refinement, over
25,000 rules have been written and tossed out. Perseverance,
sustained by having a concrete goal of "completeness" rather than a
more indeterminate goal of "demonstrating feasibility" or
"prototyping", was crucial to the success of the project.

In some ways expert systems programming is little different
from more "traditional" programming. For example, similar to most
software programs, about 50% of the code in SpinPro is for the user
interface; debugging has been very time consuming; and
miscommunication was the source of a great deal of additional effort.
Since these problems are a part of traditional programming as well,
techniques designed to assist traditional programmers, such as
organization principles, specification, and effective communication,
also apply to expert systems.

In other ways expert systems programming is much different.
Traditional principles of specification and organization are tested,
in part, because the program undergoes evolutionary and sometimes
revolutionary revisions as an understanding of the problem domain
grows. Despite early detailed specification, the tendency of the
specification and the project to evolve toward its final definition
seems to be unavoidable.

From its inception to completion, the development of SpinPro
has taken about six person years. The development team has included
a manager, two knowledge engineers, one primary expert, four experts
for review, and two people responsible for the content of the
INFORMATION function. During this time, we have completed the
following major activities:

1. specification and prototyping
2. knowledge acquisition from the expert
3. knowledge coding into rules and debugging of rules
4. design and implementation of the "MP" inference engine
5. design and implementation of the user interface
6. collecting and writing the contents of the INFORMATION function
7. converting from Interlisp-D on the Xerox 1108 AI workstation to
   Gold Hill Common Lisp (GCLISP) on the IBM PC/XT

Of these activities, task 3 (knowledge coding) and task 4 (inference
engine) were the major efforts. Knowledge coding and debugging
required at least five times as much effort as task 2, the knowledge
acquisition from the expert. Task 7, converting from the development
environment to the product, proved to be one of the major
hurdles.

There are two notable AI enhancements that are not a part of
SpinPro. First, the "MP" inference engine does not include
uncertainty reasoning. The problem domain has only a limited use for
it, and where it is required, uncertainty is handled within the
capabilities of "MP". Second, "MP" does not include an ability to
explain its reasoning beyond the "Why?" function discussed earlier. An
explanation capability was not implemented because the usual form of
presenting a trace of the rules that have fired is inadequate and
potentially confusing to the user. Why? Because rules typically
encode "shallow" knowledge (the expert's experience and rules of
thumb) and, in a rule trace, are inadequate for communicating the
real, "deep" knowledge: the reasons for making a decision.

SpinPro and the Expert

How does SpinPro compare to the expert in solving ultracentrifugation
problems? For most problems, SpinPro designs procedures as
good as the expert's, if not better. The inherent capabilities of
computers are responsible for this achievement; they are consistent,
they don't forget, and they are precise. For example, SpinPro
contains a vast amount of knowledge that is not a part of the
expert's active memory. Many of the rules are an integration of the
expert's knowledge and procedures reported in the literature. Other
rules are derived from literature references only. This vast amount
of knowledge is immediately available to SpinPro, but not to the
expert. For the new problems, the ones never described to SpinPro,
the expert is far superior. The expert has intelligence,
creativity, common sense, and an understanding of the principles of
ultracentrifugation. These are human tools that the expert can
bring to bear on new problems. At this stage in AI applications,
and despite the goal of AI to recreate these human abilities,
SpinPro, like other expert systems, is lacking.

From the SpinPro project emerged a strong SpinPro-expert
relationship. Early in the project the expert was doubtful about
the prospects of capturing years of education and experience in a
software program. Also the expert felt threatened by the expectation
that his role would be subsumed by a computer. These problems
soon disappeared as the challenge of creating SpinPro became more
important. As the project neared completion, the expert took
personal responsibility for the accuracy of SpinPro and pride in its
level of achievement. SpinPro's future development remains closely
tied to the expert.

SpinPro required that the expert critically review the science
of ultracentrifugation and his knowledge of it. For example,
SpinPro sometimes designed a procedure using a rotor that was not
expected or recommended an exceptionally short run time that was
shorter than thought possible. These procedures required careful
review. Sometimes they were accepted as valid improvements to
existing procedures. Isopycnic runs are one example, where SpinPro
found that procedures typically requiring 12-16 hours could be run
for 7-9 hours with the same results. Thus, SpinPro is indirectly
responsible for advancing the expert's understanding of
ultracentrifugation and for improving ultracentrifugation techniques.
SpinPro promoted a degree of rigor that had never before been applied
to ultracentrifugation.

Updates to SpinPro continue as new rotors and new techniques
are developed or as inadequacies are found. New expert systems
techniques, such as the ability to incorporate the principles of a
problem domain, rather than just the experience of the expert,
should give SpinPro the ability to design procedures for novel
problems and to explain its reasoning. The updates ensure that
SpinPro will be a repository of knowledge about the current state of
ultracentrifugation; SpinPro's expertise should continue to improve.
Furthermore, the expert remains gainfully employed as a final
arbiter on the inclusion or exclusion of any new knowledge.

Conclusion

The SpinPro Ultracentrifugation Expert System provides an integrated
package of expert advice, information, and calculation functions.
Its purpose is to allow investigators to fully exploit the
capabilities of ultracentrifugation, thereby improving the efficiency
of the ultracentrifugation laboratory. It uses AI techniques to
provide the ability to advise on the best selection of run parameters
that satisfy the investigator's requirements. Our experience with
SpinPro has shown that it effectively performs the role of an expert
advisor: designing efficient ultracentrifugation procedures that
can reduce run times and improve the quality of separations.

Acknowledgments

For their contributions to the SpinPro Ultracentrifugation Expert
System, the authors thank Gertrude Burguieres, Mike Brown, Phyllis
Browning, Marsha Chase, Judy Cummings, Manny Gordon, Mary Jane
MacDwyer, Edna Podhayny and Bruce Wintrode.

RECEIVED January 14, 1986

24

Elucidation of Structural Fragments
by Computer-Assisted Interpretation of IR Spectra

Hugh B. Woodruff, Sterling A. Tomellini¹, and Graham M. Smith
Merck Sharp & Dohme Research Laboratories, Rahway, NJ 07065

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch024

Since its introduction to the scientific community in
late 1980, PAIRS (Program for the Analysis of IR Spectra)
has been used successfully by a large number of
researchers. Recent improvements to PAIRS have given
the package most of the aspects of expert
systems. The improvement highlighted in this paper is
the capability for scientists to inquire of the system
why a particular interpretation result was achieved.
This capability enhances the ability of scientists to
learn from the knowledge base of interpretation rules
present in PAIRS. It also simplifies the process by
which the PAIRS knowledge base can be refined through
incorporation of improved rules from expert
spectroscopists.

¹Current address: University of New Hampshire, Durham, NH 03824

0097-6156/86/0306-0312$06.00/0
© 1986 American Chemical Society

One of the more interesting areas available for development in
analytical spectroscopy is the generation of algorithms and software
capable of interpreting IR spectra. A number of papers have been
published recently on computerized interpretation of vibrational
spectra (1-22). The generation of such software requires the
analytical chemist to understand the interpretation process and be
able to translate the process into an algorithm which the computer can
perform. While generating the actual computer code is by no means
trivial, the chemical knowledge required to solve the interpretation
problem makes a chemist and not a computer scientist the likely
producer of such a program.

Among the most widely distributed of these interpretation
programs is a package called PAIRS (Program for the Analysis of IR
Spectra) which has been distributed by the authors and the Quantum
Chemistry Program Exchange to nearly 100 researchers. The program is
available in both IBM mainframe and DEC VAX versions. A simplified
schematic of the information flow in PAIRS is shown in Figure 1.
Spectral information in the form of a digitized IR spectrum including
peak location, width and intensity values may be entered either
interactively or from a file created previously, perhaps with the
aid of a digitizing tablet.

Since the introduction of PAIRS in 1980, considerable effort
has been expended on improving various aspects of the package to
make it more valuable to researchers. A version of PAIRS capable of
running on a Nicolet FTIR instrument-based minicomputer was
developed to eliminate the time required to digitize spectra and to
make the program available to the practicing analytical
spectroscopist (12). Recently, versions of PAIRS capable of running
on other FTIR systems have been reported (23, 24).

Generating Interpretation Rules

The generation of interpretation rules for PAIRS has proven to be a
time-consuming and often inexact process. Many man-years were
required to generate the first set of rules. Trulson and Munk (18)
emphasized the massive effort required for rule development in
their report on their promising work on a table-driven approach to
infrared spectral interpretation. Rule development and subsequent
testing are generally much more time consuming than either acquiring
test spectra or programming the interpretation routines.

One of the strengths of PAIRS is the accessibility of the
interpretation rules in a form that is easily understandable and
modifiable by the scientist. To accomplish this feat, a special
English-like language known as CONCISE (Computer Oriented Notation
Concerning Infrared Spectral Evaluation) was developed (19).
CONCISE has a very small (62 words) and well-defined vocabulary
which can be mastered by non-computer-oriented scientists. It
consists of if-then-else logic and begin-done blocking. Once the
vocabulary and structure of CONCISE are known, the scientist is free
to create or change interpretation rules at will.

In order to expand the usefulness of the PAIRS package, an
automated rule generation program has been developed. An advantage
of automated rule generation is that a more mathematical and uniform
method of determining expectation values can be developed and used.
(An expectation value is a measure of the likelihood that a
particular functionality is present in the unknown compound.) A
detailed description of the algorithms used for the automated rule
generator is presented elsewhere (21).

The simplicity and clarity of CONCISE have been retained in the
automated rule generator which creates CONCISE interpretation rules
for PAIRS based on a representative set of IR spectra. The rule
generator uses peak position, intensity, and width tables produced
by an automated peak picking routine. This method reduces the
dependency on published frequency correlation data and enhances the
usefulness of data already available. All work was done using the
version of PAIRS running on a Nicolet 1180 minicomputer and programs
generated have been optimized for this system.

CONCISE rules are generated based on the frequency of occurrence
of peaks in compounds in a spectral database. Good interpretation
rules have been created using a relatively small number of
spectra in the database. To recreate interpretation rules for the
168 classes of compounds currently addressed by PAIRS in an
automated manner would require a substantial effort and a better
spectral database than currently exists. However, the automated
rule generator provides the tools to accomplish this task and to
expand the current rule base.
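
The idea of deriving rules from the frequency of occurrence of peaks
can be sketched as a simple tally. This is not the published rule
generator (its algorithms are described in reference 21); it is a
minimal illustration that uses peak positions only and ignores
intensity and width. The bin width, threshold, function names, and the
peak lists are all hypothetical.

```python
from collections import Counter


def peak_occurrence(spectra, bin_width=50):
    """Fraction of spectra with at least one peak in each wavenumber bin.

    `spectra` is a list of peak-position lists (cm^-1), one per compound.
    """
    counts = Counter()
    for peaks in spectra:
        # A set ensures each spectrum counts once per bin.
        bins = {int(p // bin_width) * bin_width for p in peaks}
        counts.update(bins)
    total = len(spectra)
    return {b: c / total for b, c in sorted(counts.items())}


def candidate_regions(spectra, bin_width=50, min_fraction=0.8):
    """Bins occupied in at least `min_fraction` of the spectra."""
    occurrence = peak_occurrence(spectra, bin_width)
    return [(b, b + bin_width) for b, f in occurrence.items() if f >= min_fraction]


# Hypothetical peak tables for four compounds of one class.
class_spectra = [
    [1715, 2950, 1250],
    [1708, 2960],
    [1722, 1100, 2940],
    [1712, 3300],
]
print(candidate_regions(class_spectra))
```

Regions found this way would still need expectation values assigned and
would need to be checked against spectra of compounds outside the class.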

Tracing Interpretation Rules

The discussion thus far has centered on input to the interpreter;
however, the scientist is perhaps most interested in the information
returned by PAIRS. The results were previously limited to a
numerical indication of the likelihood that any particular
functionality or sub-functionality is present. While the rules upon
which interpretations are based are available in an English-like
language, CONCISE, it is normally a rather difficult process to
determine why a given functionality was assigned a given value. The
usefulness of PAIRS would be greatly enhanced, especially as a
research tool, if the program were able to provide the user with a
clear trace of the decision making process. Very recent efforts have
resulted in an improved version of PAIRS which not only allows the
user to question which functionalities may be present, but also why
they are thought to be present (22).

Major changes and additions were required to make PAIRS capable
of providing an easily understandable trace of the interpretation.
The interpreter required the vast majority of these modifications,
including the addition of a number of new subroutines. Full use was
made of the decompiling features already present in the interpreter.
Therefore, input-output and decision controlling routines make up
the majority of the subroutines added. The decision controlling
routines actually serve a dual purpose. Not only do these routines
decide which data should be printed during a trace, but they also
keep track of the progress of the interpreter as it makes its way
through the interpretation rules. Thus, the controlling routines
know at any given moment which rules have already been interpreted
and which rules remain to be interpreted. The rule compiler was
modified to create a file containing the "header" names, which are
the names of the major functionalities. The CONCISE interpretation
rules were not changed during this process. Now the user is
presented with three options for interpreting a spectrum: 1) trace
the decision making process for all functionalities; 2) trace the
decision making process for any of the major functionalities (e.g.,
acid) and its corresponding sub-functionalities (e.g.,
acid-saturated); or 3) interpret the spectrum without any tracing, as
was done prior to the modifications described in this paper. In any
case, an entire interpretation takes place and, therefore, a
numerical indication is available for the likelihood that each
functionality and sub-functionality is present.

It is important to remember that the if-then-else logic of the
CONCISE language forces the interpreter to follow one unique path
through the interpretation rules, a path dictated by the spectral
data entered. A very important consequence of being able to follow
only one path is that a trace of the decision making process can
give information about what decisions were made but cannot give any
information about what decisions might have been made had the
spectral data been different. Knowing what decisions were made can,
however, give a good indication why a given functionality might have
been reported at a lower value than expected.


The best way to demonstrate the added capability and increased
versatility of the interpreter due to the tracing feature is through
example. Since the interpreter generally places a good deal of
importance on peak intensity information, it is obvious that mixtures
and larger molecular weight compounds will often cause the interpreter
to return less than desirable results. In cases where the intensities
are lower than would normally be expected for a given functionality,
a valuable feature of the modified program is the ability to see
quickly what decisions have been made and why these decisions were
made.

The antibiotic actinospectacin, the structure of which is given
below, was chosen to demonstrate the improved interpreter. A spectrum
of actinospectacin (published in Volume 10 of the Bulletin of the
International Center of Information on Antibiotics) was digitized,
with the resulting peak data being presented in Table I. The peak
data in Table I were entered into the interpreter without any
empirical formula information. The sample state entered reflected the
fact that the spectrum was taken as a KBr pellet. Table II contains
the twenty functionalities and sub-functionalities which the
interpreter predicted as most likely to be present in the sample.
(The *1*, *2*, and *3* terminology indicates one, two, or three
occurrences, respectively, of alpha branching or unsaturation in the
alcohol.) Previously this information was essentially all that the
user could learn using PAIRS without investing the time necessary to
decipher the CONCISE rules for the functionalities in question. The
improved version of PAIRS, however, allows the user to ask, for
example, "Why was "sulfone" indicated with such a high expectation
value?". If the data in Table I are reinterpreted with the decision
process for the functionality "sulfone" being traced, the user learns
that the high likelihood for a "sulfone" is due to the presence of
the 1330 and 1351 cm⁻¹ bands of intensity 7 and 6, respectively, the
1121 and 1145 cm⁻¹ bands of intensity 7 and 9, respectively, and the
presence of more than two bands between 1090 and 1170 cm⁻¹ with
intensities greater than 7. The actual decision trace is given in
Figure 2. Should the user suspect that these bands are due to another
functionality, knowledge of how these bands were used in predicting
the presence of a "sulfone" may allow the interpreter's prediction of
a high likelihood of "sulfone" to be less highly regarded.

Conversely, one may suspect the presence of a particular
functionality but discover that the interpreter predicts that
functionality with a low expectation value. Knowing the structure of
actinospectacin, one would expect that "ketone" should be predicted
to be present with a fairly high expectation value. The interpreter,
however, returns a value of 0.01 for the likelihood of presence of
the "ketone". In this case, the user learns that the low expectation
value for "ketone" was based on the absence of any peak with
intensity 7 or greater in the carbonyl region between 1571 and
1800 cm⁻¹. In the case of an unknown compound, knowledge of the
interpreter's decisions can give the user added insights and ideas,
especially when the spectrum is not ideal for a given functionality.
The user, in any case, now has the ability to work with the program
to see if minor variations in the data would result in different and
possibly more reasonable interpretations.


[Figure 1: block diagram with boxes labeled "Digitized Spectrum",
"CONCISE Rules", "PAIRS (Interpreter)", and "Chemical Functionality
Predictions".]

Figure 1. Information flow in PAIRS.

[Structure drawing not reproduced.]

Structure of the antibiotic actinospectacin.


Table I. Digitized Actinospectacin Spectrum

Peak No.   Position (cm⁻¹)   Relative Intensity   Width

    1          3527                  9            Broad
    2          3401                 10            Broad
    3          3311                 10            Broad
    4          3254                 10            Broad
    5          3071                  9            Broad
    6          2962                  9            Average
    7          2796                  6            Broad
    8          2486                  2            Average
    9          1645                  5            Average
   10          1629                  5            Average
   11          1581                  4            Average
   12          1566                  4            Average
   13          1460                  8            Average
   14          1429                  6            Average
   15          1392                  8            Average
   16          1351                  6            Sharp
   17          1330                  7            Average
   18          1271                  2            Average
   19          1235                  3            Average
   20          1215                  3            Average
   21          1190                  6            Average
   22          1176                  7            Sharp
   23          1145                  9            Average
   24          1121                  7            Average
   25          1107                  8            Average
   26          1087                  9            Average
   27          1078                 10            Average
   28          1046                  9            Average
   29          1037                  9            Average
   30          1024                  9            Sharp
   31           999                  7            Sharp
   32           981                  3            Average
   33           952                  4            Sharp
   34           936                  4            Sharp
   35           923                  7            Average
   36           891                  2            Average
   37           875                  3            Average
   38           859                  5            Average
   39           814                  3            Average
   40           728                  5            Average


Table II. PAIRS Interpretation Results for Actinospectacin

      FUNCTIONALITY          EXPECTATION VALUE

  1   ALCOHOL                      0.99
  2   SULFONE                      0.85
  3   OLEFIN-(NON-AROM)            0.75
  4   OLEFIN-CHR=CH2               0.75
  5   ALCOHOL-PHENOL               0.75
  6   ALCOHOL-PRIM(*1*)            0.75
  7   ALCOHOL-PRIMARY              0.75
  8   ALCOHOL-SEC-(*1*)            0.75
  9   ALCOHOL-SEC-(*2*)            0.75
 10   ALCOHOL-SEC-RING             0.75
 11   ALCOHOL-SECONDARY            0.75
 12   ALCOHOL-TERT-(*1*)           0.75
 13   ALCOHOL-TERT-(*2*)           0.75
 14   ALCOHOL-TERT-(*3*)           0.75
 15   ALCOHOL-TERT-RING            0.75
 16   ALCOHOL-TERTIARY             0.75
 17   SULFONAMIDE                  0.75
 18   SULFONAMIDE-PRIM             0.75
 19   SULFONAMIDE-SEC              0.75
 20   SULFONAMIDE-TERT             0.75

FUNCTIONALITY SULFONE
PASSED INITIAL EMPIRICAL FORMULA TEST
PEAK QUERY
  ANY PEAK(S)           POSITION: 1290 - 1360
  INTENSITY: 7 - 10     WIDTH: SHARP TO AVERAGE
ANSWER YES
PEAK QUERY
  ANY PEAK(S)           POSITION: 1110 - 1170
  INTENSITY: 7 - 10     WIDTH: SHARP TO BROAD
ANSWER YES
ACTION SET SULFONE TO 0.500
CURRENT VALUE = 0.500
PEAK QUERY
  AT LEAST 2 PEAK(S)    POSITION: 1260 - 1360
  INTENSITY: 4 - 10     WIDTH: SHARP TO AVERAGE
ANSWER YES
ACTION ADD 0.100 TO SULFONE
CURRENT VALUE = 0.600
PEAK QUERY
  AT LEAST 2 PEAK(S)    POSITION: 1260 - 1360
  INTENSITY: 7 - 10     WIDTH: SHARP TO AVERAGE
ANSWER NO
PEAK QUERY
  AT LEAST 2 PEAK(S)    POSITION: 1065 - 1170
  INTENSITY: 4 - 10     WIDTH: SHARP TO AVERAGE
ANSWER YES
ACTION ADD 0.100 TO SULFONE
CURRENT VALUE = 0.700
PEAK QUERY
  AT LEAST 2 PEAK(S)    POSITION: 1065 - 1170
  INTENSITY: 7 - 10     WIDTH: SHARP TO AVERAGE
ANSWER YES
ACTION ADD 0.150 TO SULFONE
CURRENT VALUE = 0.850

Figure 2. Trace of sulfone functionality during interpretation of
actinospectacin.
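
The queries in the Figure 2 trace can be replayed against the Table I
peak data. The following Python sketch is only an illustration and is
not the PAIRS or CONCISE implementation: the function names, the
ordering of the width classes, and the flat sequence of checks are all
assumptions, and the true nesting of the CONCISE rules is not shown in
the trace. In particular, Figure 2 records only a NO answer for the
fourth query, so the sketch applies no action to a YES answer there.

```python
# Width classes ordered from narrowest to broadest (ordering assumed).
WIDTH_RANK = {"sharp": 0, "average": 1, "broad": 2}

# Peaks 15-27 of Table I: (position in cm-1, relative intensity, width).
TABLE_I_SUBSET = [
    (1392, 8, "average"), (1351, 6, "sharp"), (1330, 7, "average"),
    (1271, 2, "average"), (1235, 3, "average"), (1215, 3, "average"),
    (1190, 6, "average"), (1176, 7, "sharp"), (1145, 9, "average"),
    (1121, 7, "average"), (1107, 8, "average"), (1087, 9, "average"),
    (1078, 10, "average"),
]


def count_peaks(peaks, pos, inten, widths):
    """Count peaks whose position, intensity, and width all fall in range."""
    lo_w, hi_w = WIDTH_RANK[widths[0]], WIDTH_RANK[widths[1]]
    return sum(
        1
        for p, i, w in peaks
        if pos[0] <= p <= pos[1]
        and inten[0] <= i <= inten[1]
        and lo_w <= WIDTH_RANK[w] <= hi_w
    )


def trace_sulfone(peaks):
    """Replay the Figure 2 queries; return (expectation value, trace lines)."""
    lines = []
    value = 0.0

    def ask(minimum, pos, inten, widths):
        # Log every query that is actually asked, with its answer.
        answer = count_peaks(peaks, pos, inten, widths) >= minimum
        lines.append(
            f"at least {minimum} peak(s) in {pos[0]}-{pos[1]}, "
            f"intensity {inten[0]}-{inten[1]}: {'YES' if answer else 'NO'}"
        )
        return answer

    sharp_avg = ("sharp", "average")
    # Both initial queries must succeed before any value is assigned.
    if not (ask(1, (1290, 1360), (7, 10), sharp_avg)
            and ask(1, (1110, 1170), (7, 10), ("sharp", "broad"))):
        return value, lines
    value = 0.500
    if ask(2, (1260, 1360), (4, 10), sharp_avg):
        value += 0.100
    # Figure 2 shows only a NO answer here, so no action is modeled for YES.
    ask(2, (1260, 1360), (7, 10), sharp_avg)
    if ask(2, (1065, 1170), (4, 10), sharp_avg):
        value += 0.100
    if ask(2, (1065, 1170), (7, 10), sharp_avg):
        value += 0.150
    return round(value, 3), lines


value, lines = trace_sulfone(TABLE_I_SUBSET)
print("\n".join(lines))
print(f"SULFONE = {value:.3f}")
```

With the Table I subset, the answers come out YES, YES, YES, NO, YES,
YES and the final value is 0.850, in agreement with Figure 2 and with
the SULFONE entry in Table II.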


This point is illustrated by a second example. A vapor-phase
spectrum of propionitrile was obtained and its digitization is shown
in Table III. For the sake of example, assume the scientist entered
the 2246 cm⁻¹ peak as average rather than sharp. The interpretation
would result in likelihoods of 0.90 for isocyanate and 0.30 for
nitrile. Performing the interpretation with the tracing function
turned on would quickly show that the rules base the distinction
between isocyanate and propionitrile very heavily on the width of the
peak in the vicinity of 2260 cm⁻¹. Reinterpreting this spectrum
with the correct, sharp width entered for the 2246 cm⁻¹ peak results
in a nitrile likelihood of 0.50 and isocyanate of 0.40.

Table III. Digitized Propionitrile Spectrum

Peak No.   Position (cm⁻¹)   Relative Intensity   Width

    1          2246                 10            Average
    2          2996                  8            Average
    3          1461                  7            Average
    4          2950                  6            Average
    5          1431                  5            Average
    6          1074                  4            Average
    7           787                  3            Average
    8          2892                  3            Average
    9          1319                  2            Average
   10          1386                  1            Average
   11           546                  1            Average

Summary

Through the addition of automated spectrum input on instrument-based
computers, automated rule generation, and automatic tracing of
decision rules, PAIRS has been enhanced to be an even more valuable
tool for the spectroscopist. PAIRS is available for distribution from
the Quantum Chemistry Program Exchange, Indiana University,
Bloomington, IN 47405 (Catalog No. QCPE 497).

Literature Cited

1. Gray, N.A.B. Anal. Chem. 1975, 47, 2426.
2. Woodruff, H.B.; Munk, M.E. J. Org. Chem. 1977, 42, 1761.
3. Woodruff, H.B.; Munk, M.E. Anal. Chim. Acta 1977, 95, 13.
4. Zupan, J. Anal. Chim. Acta 1978, 103, 273.
5. Visser, T.; Van der Maas, J.H. J. Raman Spectros. 1978, 7, 125.
6. Visser, T.; Van der Maas, J.H. J. Raman Spectros. 1978, 7, 278.
7. Leupold, W-R; Domingo, C.; Niggemann, W.; Schrader, B. Fresenius'
   Z. Anal. Chem. 1980, 303, 337.
8. Woodruff, H.B.; Smith, G.M. Anal. Chem. 1980, 52, 2321.
9. Visser, T.; Van der Maas, J.H. Anal. Chim. Acta 1980, 122, 337.
10. Varmuza, K. "Pattern Recognition in Chemistry"; Springer-Verlag;
    New York, 1980, No. 2, Lecture Notes in Chemistry Series.
11. Woodruff, H.B.; Smith, G.M. Anal. Chim. Acta 1981, 133, 545.
12. Tomellini, S.A.; Saperstein, D.D.; Stevenson, J.M.; Smith, G.M.;
    Woodruff, H.B.; Seelig, P.F. Anal. Chem. 1981, 53, 2367.


13. Farkas, M.; Markos, J.; Szepesvary, P.; Bartha, I.; Szalontai,
    G.; Simon, Z. Anal. Chim. Acta/Computer Techniques and
    Optimization 1981, 133, 19.
14. Szalontai, G.; Simon, Z.; Csapo, Z.; Farkas, M.; Pfeifer, Gy.
    Anal. Chim. Acta/Computer Techniques and Optimization 1981, 133,
    303.
15. Debska, B.; Duliban, J.; Guzowska-Swider, B.; Hippe, Z. Anal.
    Chim. Acta/Computer Techniques and Optimization 1981, 133, 303.
16. Frank, I.E.; Kowalski, B.R. Anal. Chem. 1982, 54, 232R.
17. Zupan, J. Anal. Chim. Acta 1982, 139, 143.
18. Trulson, M.O.; Munk, M.E. Anal. Chem. 1983, 55, 2137.
19. Smith, G.M.; Woodruff, H.B. J. Chem. Inf. Comp. Sci. 1984, 24,
    33.
20. Tomellini, S.A.; Stevenson, J.M.; Woodruff, H.B. Anal. Chem.
    1984, 56, 67.
21. Tomellini, S.A.; Hartwick, R.A.; Stevenson, J.A.; Woodruff, H.B.
    Anal. Chim. Acta 1984, 162, 227.
22. Tomellini, S.A.; Hartwick, R.A.; Woodruff, H.B. Appl. Spectrosc.
    1985, 39, 331.
23. Saperstein, D.D.; "A Scheme For Optimized Infrared
    Interpretations", paper # 216, 1985. Pittsburgh Conference &
    Exposition on Analytical Chemistry and Applied Spectroscopy,
    Feb. 25-March 1, 1985.
24. DeHaseth, J.A.; Mir, K.A., "A Minicomputer Based Structure
    Elucidation Program", paper # 217, 1985. Pittsburgh Conference &
    Exposition on Analytical Chemistry and Applied Spectroscopy,
    Feb. 25-March 1, 1985.

RECEIVED December 17, 1985

25

Automation of Structure Elucidation
from Mass Spectrometry-Mass Spectrometry Data

K. P. Cross¹, P. T. Palmer, C. F. Beckner², A. B. Giordani³,
H. G. Gregg⁴, P. A. Hoffman⁵, and C. G. Enke

Department of Chemistry, Michigan State University, East Lansing, MI 48824


Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch025

A system has been designed to automate the extraction
of structural information from mass spectrometry/mass
spectrometry (MS/MS) spectra. Currently operational
elements in this system include data bases for MS/MS
spectra and molecular structures, spectrum matching
programs, and a structure generator. Individual
spectra within the complete set of MS/MS spectra are
related to the molecular substructures from which they
arise. The correlations between individual MS/MS
spectra and specific substructures can be determined by
identifying the compounds that have matching MS/MS
spectra, and then identifying the substructures they
have in common. These correlations can supply
identified substructures to a molecular structure
generator such as GENOA. This empirical scheme assumes
no knowledge of the fragmentation process, ion
structures, or rearrangements.

¹ Current address: Chemical Abstracts Service, Columbus, OH 43210
² Current address: Finnigan MAT, San Jose, CA 95134
³ Current address: Department of Psychiatry, Mt. Sinai School of
  Medicine and Bronx Veterans' Administration Medical Center, New
  York, NY 10029
⁴ Current address: Lawrence Livermore National Laboratory, University
  of California, Livermore, CA 94550
⁵ Current address: Lederle Laboratories, American Cyanamid
  Corporation, Pearl River, NY 10965

0097-6156/86/0306-0321 $06.00/0
© 1986 American Chemical Society

The development of mass spectrometry/mass spectrometry (MS/MS) has
provided the chemical analyst with a powerful tool for structure
elucidation. The primary goal of this project is to develop the
full capacity of triple quadrupole mass spectrometry (TQMS) as a
tool for routine structure determination. To accomplish this, we
have designed and developed computer data bases for spectra and
structures (1,2), programs for matching spectra (3), and procedures
for determining spectrum/substructure correlations. These tools
were designed for integration into a complete system for on-line
structure determination by MS/MS.

Structure analysis by MS/MS differs from normal MS in that each
of the fragment ions from the sample ionization process in the
source can be selected, one mass at a time, for further
fragmentation and subsequent mass analysis. The ion in the normal
mass spectrum selected for analysis is called a parent ion. The
fragments of that ion, generally produced by collision-induced
dissociation (CID), are called daughters. A mass spectrum of all the
daughters of a particular parent ion (called a daughter spectrum) is
obtained by holding the first mass analyzer constant at the mass of
the selected parent ion and scanning the second mass analyzer. A
complete MS/MS spectrum is a three-dimensional array in which there
is a daughter spectrum for every mass represented in the normal mass
spectrum.

MS/MS data are very explicit; daughter spectra may reveal
structural characteristics of isolated portions of the molecule (4),
and under certain conditions, all masses in a daughter spectrum are
single-event neutral losses from the parent ion. Thus, clear
substructure/property relationships can be obtained from MS/MS
spectra. These relationships can be used to identify substructures
in unknown compounds. Possible compound structures can then be
developed from the identified substructures. This approach should
facilitate the identification of unknown compounds not previously
studied by mass spectrometry.

Data from the TQMS instrument are used in two different ways:
1) to develop a library of spectrum/substructure correlations from
studies of known compounds and 2) to use the developed correlations
to determine the substructures and thence the overall structures of
unknown compounds. The data base required for this process is a
library of the spectral characteristics of many substructures,
rather than a library of the spectra of all known compounds. In
principle, millions of compounds could be identified using a library
of only a few thousand spectrum/substructure relationships.

A block diagram of our target system for the automatic
elucidation of molecular structure is shown in Figure 1 (5). While
the system is not yet complete, the three data bases and a spectrum
matching program have been developed and integrated into a
comprehensive system to acquire, store, match, and correlate the
MS/MS data. Descriptions of their structures and capabilities and
examples of their application are included in this paper. Also a
molecular structure generator, GENOA (6), has been acquired and
implemented, but is not yet integrated into the system. An example
of the determination of spectrum/substructure correlations and their
application in structure determination through GENOA is also given
here.

The flow of data through the system shown in Figure 1 depends
on whether the experimental data are from a reference compound for
the development of the library or from an unknown compound for
analysis. Reference compound spectra are collected in the
experimenter's data base and may be archived in the reference data
base. They can also be matched against other spectra from other
reference compounds by the spectrum matching program. When a match
is found indicating that the two compounds have produced an
identical ion structure, the molecular structures are compared by
the substructure searching function to determine the substructure(s)
they have in common. These common substructures are candidate
precursors of the common ion. Through rearrangements, it is
possible for more than one substructure to produce a particular ion.
Additional compounds with matching spectra or substructures are
studied until clear spectrum/substructure correlations are produced.
Once the correlations are made, the substructure(s) associated with
a particular spectrum are stored in the structure/substructure data
base, and are logically linked to that spectrum.

[Figure 1: block diagram. Labels: Reference Spectra; Experimental
Data; Reference Spectra Library Storage Data Base; Multi-Dimensional
Data Base; Ref Spectra; Test Spectra; Data Plots; Inverted Data Base;
Spectrum Matching; Match Lists; Test Structures; Structure/
Substructure Library Storage Data Base; Substructure Searching;
Matched Substructures; Identified Substructures; Molecular Structure
Generator; All Possible Structures.]

Figure 1. Overall system for determination of structure by MS/MS.

The spectra from an unknown compound are matched against the
reference spectra to produce a list of the substructures that are
related to the matched spectra. When this substructural information
has been extracted from the MS/MS spectra, it is entered into the
molecular structure generator called GENOA (6). GENOA, which is
constrained by heuristic chemical rules, uses all available
composition and structure information, including overlapping and
nonunique substructures, to postulate the number and identity of all
possible molecular structures of the unknown compound. If the
resolution of any remaining structural ambiguities is essential to
the experiment, additional information derived from MS/MS or other
sources is fed to GENOA to further reduce the number of output
structures. This structure elucidation scheme combines an
exhaustive and automatic algorithm for the evaluation of the
structural possibilities, the experimenter's chemical intuition, and
the knowledge base of the experimentally determined
spectrum/substructure correlations.

There are three data bases present in our MS/MS information
management system, one for immediate experimental data and two for
archival data. The experimenter's data base has been described
elsewhere (1). One archival data base manages the MS/MS spectra,
while the other manages the structures and substructures. The two
archival data bases are logically linked together so that all
information concerning a particular molecule or substructure is
associated with its spectra.

The MS/MS spectrum data base is capable of storing and
correlating all types of MS/MS spectral data including parent,
daughter, neutral loss, and conventional mass spectra (2). All
spectra are stored in an unabridged format and all spectra for each
compound are logically associated with that compound. Redundant
spectra such as those taken under different operating conditions are
all associated with a single compound registry number thereby
simplifying both the retrieval and maintenance of the data base
information.

The most important feature of the reference spectrum data base
is the provision to generate and store inverted data (data that are
presorted on various secondary elements of the record). The data
present in the spectrum data base may be inverted upon any specified
characteristic, such as m/z value, and then be retrieved using that
characteristic. For instance, a data file inverted about the
daughter m/z value will contain, for each m/z value, a list of
pointers to the reference daughter spectra that have a peak at that
mass. Hence the pointers to all reference spectra containing a
particular m/z value may be very quickly retrieved. When Boolean
algebra operations are performed on inverted data lists, the power
of the design increases dramatically. A prescreen for all reference
daughter spectra containing the major features of a test spectrum
such as peaks at 43.0 and 57.0 but not 119.0 reduces greatly the
number of reference spectra that need to be matched in greater
detail. In addition to a daughter m/z value, spectral data may be
inverted about molecular weight, empirical formula, and parent ion
m/z value. Over 30,000 primary spectra and other information are
currently stored in the spectrum data base as well as MS/MS spectra
corresponding to several specific classes of compounds.
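
The inverted-data prescreen described above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions: the reference spectra, their identifiers, and the
function names are hypothetical, spectra are held as in-memory
dictionaries rather than data base files, and m/z values are matched
exactly rather than within a tolerance.

```python
from collections import defaultdict

# Hypothetical reference daughter spectra: id -> {m/z: relative intensity}.
REFERENCE = {
    "ref-A": {43.0: 100, 57.0: 80, 71.0: 20},
    "ref-B": {43.0: 60, 57.0: 100, 119.0: 35},
    "ref-C": {41.0: 100, 57.0: 45},
}


def build_inverted_file(spectra):
    """Map each m/z value to the set of spectrum ids with a peak there."""
    inverted = defaultdict(set)
    for spectrum_id, peaks in spectra.items():
        for mz in peaks:
            inverted[mz].add(spectrum_id)
    return inverted


def prescreen(inverted, required, excluded=()):
    """Ids with a peak at every required m/z and at none of the excluded."""
    if not required:
        return set()
    # Boolean AND over the pointer lists of the required peaks.
    candidates = set.intersection(
        *(set(inverted.get(mz, ())) for mz in required)
    )
    # Boolean NOT for each excluded peak.
    for mz in excluded:
        candidates -= inverted.get(mz, set())
    return candidates


inverted = build_inverted_file(REFERENCE)
print(prescreen(inverted, required=[43.0, 57.0], excluded=[119.0]))
```

With these three hypothetical spectra, only "ref-A" has peaks at 43.0
and 57.0 and no peak at 119.0, so it is the only candidate left for
detailed matching.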

The structure data base was designed to contain both molecular
structures and substructures (7). The MS/MS instrument specifically
provides a substructure/property relationship where several daughter
spectra may correspond to a single substructure and any daughter
spectrum may correspond to more than one substructure. Even though
a simple 1:1 correspondence between daughter spectra and
substructures cannot be assumed, there is a basis for a logical link
between the MS/MS spectra in the spectral data base and the
respective substructures in the structure data base. This link
allows retrieval of structural information from the reference
daughter spectra best matching the unknown spectrum. Structures
present in the structure data base may be retrieved via substructure
number, Chemical Abstracts Service number, or spectrum data base
number, and then drawn.

The structures and substructures are stored unambiguously using
a modified version of the Morgan algorithm for encoding molecular
structures via connectivity tables. The version of the algorithm
implemented included the modifications described by Wipke and Dyott
(8) for the representation of stereochemical isomers. The notation
of the elements was expanded to include all known elements. Any
molecule up to 128 atoms in size (excluding hydrogens) may be
included in the data base. The structure data base contains over
30,000 structures corresponding to the spectra present in the MS/MS
spectrum library as well as substructures corresponding to various
reference daughter spectra.

Matching MS/MS Spectra

The MS/MS spectra matching program allows the chemist to match any
MS/MS spectrum against either MS or MS/MS spectra in the reference
spectrum data base (3). The program uses inverted data organized by
m/z value to logically eliminate inappropriate reference spectra.
The program first determines the data base frequency (length of the
pointer table) of each major peak in the experimental daughter
spectrum and then ranks the peaks in ascending order of frequency.
Inverted data lists of reference spectra containing peaks are
retrieved in this order and logically ANDed together until the
number of candidate reference spectra is sufficiently small.
Molecular weight, parent ion m/z value, and empirical formula may
also be invoked to further reduce the number of candidate spectra.
When matching daughter spectra, specifying the parent ion m/z value
alone usually produces a sufficiently small number of candidate
spectra. Abundance values are not considered and the reference data
base is not accessed until intensity-based matching is performed.
The short matching times achieved with this design make it practical
to work with unabridged spectra.
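
The frequency-ranked ANDing step can be illustrated with a short
Python sketch. It is an assumption-laden illustration rather than the
authors' program: the function name, the `small_enough` parameter, and
the toy pointer lists are hypothetical, and the inverted data are held
as a dictionary of sets.

```python
# Hypothetical inverted data: m/z value -> ids of reference spectra
# that contain a peak at that m/z.
INVERTED = {
    43: {1, 2, 3, 4, 5, 6},
    57: {2, 3, 4, 6},
    85: {3, 6},
}


def narrow_candidates(inverted, major_peaks, small_enough=100):
    """AND pointer lists, rarest peak first, until few candidates remain.

    Returns the candidate ids and the peaks whose lists were used.
    """
    # Rank peaks by data base frequency (length of the pointer list).
    ranked = sorted(major_peaks, key=lambda mz: len(inverted.get(mz, ())))
    candidates = None
    used = []
    for mz in ranked:
        pointers = inverted.get(mz, set())
        if candidates is None:
            candidates = set(pointers)
        else:
            candidates &= pointers
        used.append(mz)
        if len(candidates) <= small_enough:
            break  # few enough spectra left for intensity-based matching
    return (candidates or set()), used


candidates, used = narrow_candidates(INVERTED, [43, 57, 85], small_enough=2)
print(sorted(candidates), used)
```

Starting with the rarest peak keeps the working set small from the
first step; here the pointer list for m/z 85 alone already reduces
the candidates to spectra 3 and 6, so the longer lists are never read.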

Once the number of candidate reference spectra has been reduced
to reasonable size (25-100), intensity-based matching is performed
to characterize the correspondence between the experimental and
remaining candidate spectra. Several different factors indicating
the degree to which the spectra match in various respects are
determined. The values of these match factors are used to
distinguish spectra that arise from identical substructures from
those that arise from different substructures.

The various match factors calculated by the matching program
are listed in Table I. The overall match factor (PT) is a
combination of forward and reverse searching techniques. It takes
into account the deviations in intensity of the sample spectrum
peaks with respect to the candidate spectrum peaks and vice versa
for all peaks in both spectra. The pattern correspondence match
factor (PC) is a forward searching match factor which takes into
account the intensity deviations of sample spectrum peaks with
respect to the candidate spectrum peaks for peaks common to both
spectra. This factor detects structural similarities, such as
substructures, based on common spectral patterns. NC, NS, and NR
give an indication of the number of peaks upon which the match was
based and in which direction it was most successful. IS and IR
indicate the magnitude of the ion current unmatched in each
direction. These match factors are similar to those proposed by
Damen, Henneberg, and Wiemann (9).
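
The peak-count and unmatched-current factors can be sketched as
follows. This Python example is illustrative only and makes several
assumptions: the function name and spectra are hypothetical, peaks are
matched on exactly equal m/z values, both spectra are non-empty, and
the total ion current is taken as the sum of the listed peak
intensities. PT and PC are left out because they depend on the
abundance scaling defined in Table I.

```python
def count_factors(sample, reference):
    """Return NC, NS, NR, IS, and IR for two spectra.

    Each spectrum is a dict mapping m/z value to intensity.
    """
    common = sample.keys() & reference.keys()
    only_sample = sample.keys() - reference.keys()
    only_reference = reference.keys() - sample.keys()
    sample_total = sum(sample.values())
    reference_total = sum(reference.values())
    return {
        "NC": len(common),          # peaks present in both spectra
        "NS": len(only_sample),     # sample peaks left unmatched
        "NR": len(only_reference),  # reference peaks left unmatched
        # Percent of each spectrum's ion current carried by unmatched peaks.
        "IS": 100.0 * sum(sample[mz] for mz in only_sample) / sample_total,
        "IR": 100.0 * sum(reference[mz] for mz in only_reference)
              / reference_total,
    }


# Hypothetical sample and reference daughter spectra.
sample = {43: 50, 57: 30, 71: 20}
reference = {43: 60, 57: 20, 85: 20}
print(count_factors(sample, reference))
```

For these two spectra the peaks at m/z 43 and 57 are common, one peak
is unmatched in each direction, and each unmatched peak carries 20% of
its spectrum's ion current.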

Because instrument operating conditions can seriously affect
the relative intensities of ions in daughter spectra, there was a
need to know the range of conditions over which the daughter spectra
of identical parent ions could be distinguished from all other
daughter spectra. Daughter spectra were collected for several
compounds for every combination of a wide range of operating
parameters. An acceptable range of standard conditions was defined
as that over which the spectrum matching system would provide high
match factors for daughter spectra of the same compound.

Of the 32 instrumental parameters on our TQMS, only the
collision energy and collision cell pressure were found to
significantly affect MS/MS spectra. The acceptable range of
collision cell pressure was that found to yield first order
fragmentation regardless of the compound type. Since different
collision cell pressures are required to obtain first order
fragmentation for different compounds, brief kinetic studies are
used to determine the fragmentation order, and to ascertain the
pressure necessary to provide first order fragmentation. Similarly,
we have determined a useable operating range for the collision
energy of 15 to 25 eV.

The procedure for obtaining the spectrum/substructure relationships
is as follows. For a selected known compound, a daughter spectrum
is acquired for every mass value greater than 1% relative intensity
that appears in the primary spectrum of that compound. These
daughter spectra are then matched against a library of daughter
spectra from reference compounds.

Table I. Match Factor Definitions

PT   An overall match factor that indicates how well the
     intensities of all the peaks in the two spectra match.

     PT = Σ (Ys + Yr - 2 |Yr - Ys|) / (Σ Ys + Σ Yr) * 100

     where Yi = log2 (Intensity/Total Ion Count), and Ys and Yr
     correspond to the adjusted abundances at each mass in the
     sample and reference spectra respectively.

PC   A pattern correspondence factor that indicates how well the
     intensities of the peaks in common match.

     PC = Σ (Ys - |Yr - Ys|) / (Σ Ys) * 100

NC   The number of peaks common to both the candidate and unknown
     sample spectrum.

NS   The number of peaks remaining unmatched in the unknown
     sample spectrum.

NR   The number of peaks remaining unmatched in the reference
     spectrum.

IS   The percent total ion current of the sample spectrum that
     was unmatched in the comparison due to NS.

IR   The percent total ion current of the reference spectrum that
     was unmatched in the comparison due to NR.

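The match factors of Table I can be illustrated with a short Python sketch. It is not the authors' program. The function name `match_factors` is hypothetical, each spectrum is passed as a dictionary mapping mass to an already adjusted, nonnegative abundance Y, PT is assumed to treat a peak missing from one spectrum as Y = 0, and PC is assumed to sum only over the peaks in common. IS and IR are omitted because they require the raw ion currents.

```python
def match_factors(sample, reference):
    """Return PT, PC, NC, NS and NR for two spectra.

    Each spectrum is a dict {mass: adjusted abundance Y}. The Y values are
    assumed to be precomputed and nonnegative (an assumption of this sketch).
    """
    common = sample.keys() & reference.keys()
    all_masses = sample.keys() | reference.keys()

    # PT: every peak in either spectrum; a missing peak counts as Y = 0.
    pt_num = 0.0
    for mass in all_masses:
        ys = sample.get(mass, 0.0)
        yr = reference.get(mass, 0.0)
        pt_num += ys + yr - 2.0 * abs(yr - ys)
    pt = 100.0 * pt_num / (sum(sample.values()) + sum(reference.values()))

    # PC: only the peaks that the two spectra have in common.
    pc_num = sum(sample[m] - abs(reference[m] - sample[m]) for m in common)
    pc_den = sum(sample[m] for m in common)
    pc = 100.0 * pc_num / pc_den if pc_den else 0.0

    return {"PT": pt, "PC": pc, "NC": len(common),
            "NS": len(sample) - len(common),
            "NR": len(reference) - len(common)}


# Invented abundances for two daughter spectra with peaks at m/z 77 and 105.
unknown = {77: 5.0, 105: 8.0}
library_entry = {77: 4.0, 105: 8.0}
print(match_factors(unknown, library_entry))
```

Two identical spectra give PT = PC = 100 under these assumptions, which agrees with the self-match rows reported in the tables that follow.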
After the spectral matching process has been completed, the
compounds with the top matching daughter spectra are identified and
retrieved for each daughter spectrum of the reference compound. The
molecular structures of the compounds with the best matching spectra
are drawn and compared for common substructures. The common
substructures yield candidate spectrum/substructure correlations.
Additional compounds are then tested to confirm or modify each
correlation. Once the daughter spectrum is correlated with one or
more substructures, this daughter spectrum is stored in the spectrum
data base and is linked to the associated substructures stored in
the structure data base.
A heuristic program written by Shelley (10) has been adapted
for our computer system to display molecular structures and
substructures from connectivity tables. Since the molecular
structure and substructure representations are stored in a unique,
irredundant form, the structure drawings facilitate visual
comparison for commonalities.
An example of the spectrum/substructure determination process
is illustrated for the reference compound di-n-octylphthalate.
Daughter spectra were acquired for every major ion (above 1%
relative intensity) that appeared in the conventional mass spectrum
(Figure 2) of the reference compound. All the daughter spectra were
then matched against the reference daughter spectra of the same
parent mass (but from different compounds) in the data base. The
results of some of the matches are described below.
The match of the 105+ daughter spectrum of di-n-octylphthalate
against the reference library of m/z 105 daughter spectra is
presented in Table II. The top four matching spectra all correspond
to structure III in Figure 3. Some of the spectra used in this
match are shown in Figure 4. Note that the top four matching
daughter spectra are very similar; all four contain the same peaks,
and only the intensity patterns differ (NR, NS, IS, and IR for
all four are zero). There is a large difference in overall
match factor values (PT) between the daughter spectra representing
the correct substructure and that of the next best match.

Table II. Match of 105+ Daughter Spectra vs. Di-n-octylphthalate

 PT   PC   NC   NS   NR   IS   IR   Compound

100  100    2    0    0    0    0   Di-n-octylphthalate
 99   99    2    0    0    0    0   Di-n-pentylphthalate
 98   98    2    0    0    0    0   Di-n-butylphthalate
 98   98    2    0    0    0    0   Di-n-ethylphthalate
 66   93    2    0    2    0   31   4-t-butyl-1,2-benzenediol
 60   85    2    0    2    0   20   2-t-butyl-4-methylphenol
 38   50    1    1    3   42   29   p-t-butylbenzyl alcohol
 36   50    1    1    3   42   52   2-t-butyl-6-methylphenol

Figure 3. Substructures (I and III), ionic structure (II), and
molecular structure (IV) produced by structure drawing program.

Figure 4. Selected daughter spectra of parent mass 105 from
reference library. [Panels: di-n-octylphthalate,
di-n-pentylphthalate, di-n-butylphthalate, 2-t-butyl-6-methylphenol,
benzyl-t-butanol, p-t-butylbenzyl alcohol, 2-t-butyl-4-methylphenol,
and 4-t-butyl-1,2-benzenediol; horizontal axis m/z, 0 to 100.]

The results of the match of the m/z 149 daughter spectrum of
di-n-octylphthalate against m/z 149 daughter spectra from other
compounds in the reference library are given in Table III. The
daughter spectra utilized in the matching process are shown in
Figure 5. The top four matching spectra all correspond to the same
molecular substructure, namely the phthalate substructure (structure
I in Figure 3). At this point, it is necessary to make a
distinction between a substructure and an ionic structure. The
substructure correlated with the top four matching spectra is
structure I in Figure 3, whereas the ionic structure of the parent
ion m/z 149 is structure II in Figure 3. It is not necessary to
know the ionic structure for this empirical approach.

Table III. Match of 149+ Daughter Spectra vs. Di-n-octylphthalate

 PT   PC   NC   NS   NR   IS   IR   Compound

100  100    4    0    0    0    0   Di-n-octylphthalate
 96   96    4    0    0    0    0   Di-n-butylphthalate
 87   86    4    0    0    0    0   Di-n-pentylphthalate
 87   86    4    0    0    0    0   Di-n-ethylphthalate
 54   57    3    1    7    3    2   2-t-butyl-4-methylphenol
 44   56    3    1   10    9   15   p-t-butylbenzyl alcohol
 42   35    1    3    1   19   29   p-t-amylphenol
 35   61    3    1   10    3   26   2-t-butyl-6-methylphenol

The compounds yielding the top four daughter spectra are
di-n-octylphthalate, di-n-butylphthalate, di-n-pentylphthalate, and
diethylphthalate. Once again, only the relative intensities differ
between these daughter spectra. It is important that these spectra
are properly grouped by the spectrum matching program and that
there is a substantial difference between the overall match factors
of the matched spectra and those corresponding to unrelated
substructures. The difference between the overall match factor of
the unknown and that of the best matching daughter spectrum
corresponding to a different substructure is 46 (100 versus 54).
Since the overall match factor range is 0-100 and the variation
within the similar daughter spectra is 13 (100 down to 87), a value
of 46 represents a good separation. The next best matching daughter
spectrum outside of this group corresponds to a substructure of
2-t-butyl-4-methylphenol.
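The separation argument can be checked directly against the PT column of Table III. The short Python sketch below is only an illustration: the helper `group_separation` is a hypothetical name, and the grouping flag simply marks the four phthalate entries that the text identifies as sharing the substructure.

```python
# (compound, PT, shares the phthalate substructure) transcribed from Table III.
TABLE_III = [
    ("Di-n-octylphthalate", 100, True),
    ("Di-n-butylphthalate", 96, True),
    ("Di-n-pentylphthalate", 87, True),
    ("Di-n-ethylphthalate", 87, True),
    ("2-t-butyl-4-methylphenol", 54, False),
    ("p-t-butylbenzyl alcohol", 44, False),
    ("p-t-amylphenol", 42, False),
    ("2-t-butyl-6-methylphenol", 35, False),
]


def group_separation(rows):
    """Return (spread within the matching group, gap to the best outsider)."""
    inside = [pt for _, pt, in_group in rows if in_group]
    outside = [pt for _, pt, in_group in rows if not in_group]
    spread = max(inside) - min(inside)   # variation among similar spectra
    gap = max(inside) - max(outside)     # unknown vs. best unrelated match
    return spread, gap


spread, gap = group_separation(TABLE_III)
print(f"spread within group = {spread}, gap to next best match = {gap}")
```

A gap several times larger than the spread is what allows the matching spectra to be grouped without ambiguity.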
From the daughter spectra of di-n-octylphthalate, we were able
to determine two spectrum/substructure correlations: the 149+
daughter spectrum to structure I in Figure 3 and the 105+ daughter
spectrum to structure III in Figure 3. In order to obtain
spectrum/substructure relationships for the alkyl portions of the
reference molecule di-n-octylphthalate, we would then match other
portions of the complete MS/MS spectrum against those of compounds
containing alkyl substructures. However, this portion of the
reference library has not yet been developed. Thus, to complete the
structure elucidation we have used standard methods of spectral
interpretation (11). As will be shown, these methods can also lead
to useful spectrum/substructure relationships.
Figure 5. Selected daughter spectra of parent mass 149 from
reference library. [Panels: di-n-octylphthalate,
di-n-pentylphthalate, di-n-butylphthalate, 2-t-butyl-6-methylphenol,
benzyl-t-butanol, p-t-butylbenzyl alcohol, 2-t-butyl-4-methylphenol,
and p-t-amylphenol; horizontal axis m/z, 0 to 140.]

Since the substructure represented by the daughter spectrum of
the m/z 149 ion was the largest identifiable substructure in the
compound, the parent spectrum of m/z 149 was used to obtain data
related to the R groups attached to the phthalate substructure
(Figure 6). The largest ion (149+) associated with the phthalate
substructure was used since the neutral losses leading to its
formation correspond to the groups attached to the phthalate
substructure. This parent spectrum need not be acquired from the
TQMS directly, since it can be generated from the set of daughter
spectra for the unknown. The parent spectrum of m/z 149 has 4 major
(non-isotopic) peaks at m/z 167, 261, 279, and 391. These
correspond to neutral losses of 18 (167-149), 112 (261-149), 130
(279-149), and 242 (391-149). The neutral loss of 112 is C8H16, the
loss of 130 is C8H17OH (which may represent an alkyl group) and the
loss of 242 is C8H17OC8H17 (which is a rearrangement product). The
neutral losses in the m/z 149 parent spectrum are thus directly
related to the two C8H17 alkyl substructures in the reference
compound. The low mass ion series in the primary mass spectrum is
also related to the alkyl chains, and the unbroken sequence of ions
every 14 mass units is indicative of unbranched alkanes (11).
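The neutral-loss arithmetic for the m/z 149 parent spectrum can be verified with a few lines of Python. This is a sketch for checking the numbers only: the function names are hypothetical, nominal (integer) atomic masses are assumed, and the formula assignments are the ones stated in the text (the loss of 18 is left unassigned because the text does not assign it).

```python
NOMINAL_MASS = {"C": 12, "H": 1, "O": 16}


def nominal_mass(formula):
    """Nominal mass of a formula given as {element: count}."""
    return sum(NOMINAL_MASS[element] * count for element, count in formula.items())


def neutral_losses(daughter_mz, parent_peaks):
    """Neutral loss for each parent ion that fragments to the daughter ion."""
    return [mz - daughter_mz for mz in parent_peaks]


# Major peaks in the parent spectrum of m/z 149, as listed in the text.
PARENT_PEAKS_OF_149 = [167, 261, 279, 391]

# Assignments quoted in the text, written as element counts.
ASSIGNMENTS = {
    112: {"C": 8, "H": 16},             # C8H16
    130: {"C": 8, "H": 18, "O": 1},     # C8H17OH
    242: {"C": 16, "H": 34, "O": 1},    # C8H17OC8H17
}

losses = neutral_losses(149, PARENT_PEAKS_OF_149)
for loss in losses:
    formula = ASSIGNMENTS.get(loss)
    label = "unassigned" if formula is None else f"mass check {nominal_mass(formula)}"
    print(loss, label)
```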

The GENOA program is a constrained molecular structure generator
resulting from the Stanford Dendral project (12,13,14) and marketed
by Molecular Design Ltd (6). This program generates molecular
structures using the overlapping substructural information obtained
from the daughter spectrum/substructure relationships and the
empirical formula of the compound. Additional spectral and
non-spectral information from other sources may also be included.
Heuristic rules determine whether a particular generated structure
is chemically plausible, and whether or not it is retained. The
advantage of the GENOA program is its ability to exhaustively
produce all the plausible compounds given the generation
constraints. This capability eliminates the possibility that the
chemist might overlook any possible compounds. In many cases, the
number and types of different structures that are produced suggest
the nature of the missing structural data. The experiments needed
to acquire such data may then be determined from the known
spectrum/substructure correlations.
An essential piece of information required by GENOA is the
empirical formula of the unknown compound. We have developed
software that adapts the standard "molecular weight versus possible
empirical formulae" table. Utilizing all pertinent MS/MS data,
several constraints can be placed upon the empirical formula
generator, and it generates all possible empirical formulae
consistent with those constraints and the molecular weight. We have
been using M+1 daughter spectrum information instead of high
resolution mass spectrometry to aid in the determination of the
empirical formula of an unknown compound (4,15). The daughter
spectrum of the M+1 isotope ion contains peak pairs at adjacent
masses representing the distribution of the 13C atom between the
ionic and neutral fragments. The relative intensities of these
daughter pairs depend on the ratio of carbon atoms lost to carbon
atoms retained in the fragmentation of the M+1 ion to the observed
daughter ion. Hence the peak height or area ratios can be utilized
to obtain the number of carbon atoms present in the compound. Once
the number of carbon atoms is determined by this method, it is used
as a constraint to the empirical formula generator. The resulting
reasonable empirical formulae are given to GENOA.

Figure 6. Parent spectrum of mass 149 from di-n-octylphthalate.
[Logarithmic intensity axis, 0.1 to 100; horizontal axis m/z, 140 to
400.]

Figure 7. Daughter spectrum of the 13C-containing protonated
molecular ion of di-n-octylphthalate. [Intensity axis, 0 to 100;
horizontal axis m/z, 50 to 400.]

For example, to determine the empirical formula of
di-n-octylphthalate, the daughter spectrum of the 13C-containing
molecular ion (392) was obtained (Figure 7). The ratio of the peak
areas of the adjacent peak pair at m/z 149 and 150 is 2:1. This
indicates that the M+1 ion is twice as likely to lose a 13C atom as
to retain it. Thus the ratio of the number of carbon atoms lost to
those retained is 2:1. Since the identified phthalate substructure
contains 8 carbons, the unknown compound (di-n-octylphthalate) must
contain 24 carbon atoms. These data, along with the molecular
weight of 390 as determined from the conventional CI mass spectrum
of the unknown, were fed into the empirical formula generator, and
the output was one empirical formula: C24H38O4.
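The carbon count and the formula search in this example can be reproduced with a small Python sketch. It is not the authors' empirical formula generator, whose actual constraints are not listed here. The function names are hypothetical, only C, H and O are considered, nominal masses are used, and two illustrative constraints are assumed: at most 2C + 2 hydrogens, and at least 34 hydrogens contributed by the two C8H17 alkyl substructures.

```python
def total_carbons(retained_carbons, lost_to_retained_ratio):
    """Carbon count from the 13C lost:retained ratio of a daughter peak pair."""
    lost = retained_carbons * lost_to_retained_ratio
    return round(retained_carbons + lost)


def candidate_formulas(molecular_weight, carbons, min_hydrogens=0):
    """Enumerate (C, H, O) counts with the given nominal mass and carbon count."""
    candidates = []
    remainder = molecular_weight - 12 * carbons
    for oxygens in range(remainder // 16 + 1):
        hydrogens = remainder - 16 * oxygens
        # Assumed constraints: a lower bound taken from known substructures
        # and the saturation limit of 2C + 2 hydrogens.
        if min_hydrogens <= hydrogens <= 2 * carbons + 2:
            candidates.append((carbons, hydrogens, oxygens))
    return candidates


# Phthalate substructure retains 8 carbons; the 149:150 peak ratio is 2:1.
carbons = total_carbons(retained_carbons=8, lost_to_retained_ratio=2.0)

# Two C8H17 groups contribute at least 34 hydrogens (assumed constraint).
formulas = candidate_formulas(390, carbons, min_hydrogens=34)
print(carbons, formulas)
```

Without the lower bound on hydrogens the same search also returns the hydrogen-poor candidates C24H22O5 and C24H6O6, which shows why the additional MS/MS constraints matter.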
Given the phthalate substructure, the two alkyl substructures
and the empirical formula, GENOA can now be used to generate all
plausible molecular structures. The oxygen in the C8H17OH group is
allowed to overlap with either terminal phthalate oxygen. With this
information, GENOA constructs only one molecular structure
(structure IV of Figure 3), and it is di-n-octylphthalate. The
number of generated structures depends on the completeness of the
information provided. If the branching of the alkyl group is not
specified, 89 different structures are generated, which represent
all the isomeric permutations of the alkyl groups. The identities of
these generated structures, however, would provide clues as to
further needed information. In cases where MS/MS information cannot
determine a unique result, additional spectral and non-spectral
information may be given to GENOA as structural constraints.

Conclusions

The software tools for structure determination by MS/MS are now at a
stage where we can begin to apply them to real elucidation problems.
Nearly all of the software tools have been integrated into a
comprehensive, interactive system. The system has been successfully
used to develop daughter spectra/substructure correlations and
extend the MS/MS data bases. The elucidation process is totally
empirical and does not assume that structural integrity is
maintained in the ionization or fragmentation process. As a result,
the ion structures need not be identified. Preliminary results from
applying the system to structure determination problems have been
very encouraging.

Acknowledgment

This work was supported by NIH grant no. 2R01GM28254.


Literature Cited

1. Gregg, H.R., Hoffman, P.A., Enke, C.G., Crawford, R.W., Brand,
   H.R., Wong, C.M., Anal. Chem. 1984, 56, 1121.
2. Hoffman, P.A., Enke, C.G., presented at 31st Annual Conference
   on Mass Spectrometry and Allied Topics, Boston, MA (1983);
   bound p. 556.
3. Cross, K.P., Enke, C.G., Computers and Chemistry, in press.
4. Bozorgzadeh, M.H., Morgan, R.P., Beynon, J.H., Analyst 1978,
   103, 613.
5. Enke, C.G., presentation at "Applications of AI in Mass
   Spectrometry", Workshop at 33rd Annual Conference on Mass
   Spectrometry and Allied Topics, San Diego, CA, May 26-31,
   1985.
6. Molecular Design Ltd., 1122B Street, Hayward, CA 94541.
7. Cross, K.P., Beckner, C.F., Enke, C.G., in preparation for
   submission to Computers and Chemistry.
8. Wipke, W.T., Dyott, T.M., J. Am. Chem. Soc. 1974, 96, 4834.
9. Damen, H., Henneberg, D., Weimann, B., Anal. Chim. Acta 1978,
   103, 289.
10. Shelley, C.A., J. Chem. Inf. Comput. Sci. 1978, 23, 61.
11. McLafferty, F.W., "Interpretation of Mass Spectra", Univ.
    Science Books, Mill Valley, CA, 1980.
12. Lindsay, R.K., et al., "Applications of Artificial
    Intelligence to Organic Chemistry: The Dendral Project",
    McGraw Hill, New York, NY, 1980.
13. Barr, A., Feigenbaum, E.A., "Handbook of Artificial
    Intelligence, Vol. II", William Kaufman, Inc., Los Altos, CA,
    1981.
14. Carhart, R.E., et al., J. Org. Chem. 1981, 46, 1708.
15. Todd, P.J., Barbalas, M.P., F.W., Organic Mass
    Spec. 1982, 17, 79.

RECEIVED January 14, 1986

26

Artificial Intelligence, Logic Programming, and Statistics
in Magnetic Resonance Imaging and Spectroscopic Analysis

Teresa J. Harner, George C. Levy, Edward J. Dudewicz, Frank Delaglio, and Anil Kumar
National Institutes of Health Resource for Multi-Nuclei NMR and Data Processing,
Department of Chemistry, Syracuse University, Syracuse, NY 13210
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch026

Logic Programming in combination with expert directed
statistical analysis makes possible a unique approach to
new expert systems for NMR and other chemical analyses
as well as for medical applications of NMR. We have
used this approach to begin understanding the behavior
of T1, T2 and 1H density in magnetic resonance imaging
(MRI). Also, we are utilizing this technique
to develop intelligent behavior within our NMR1 and
NMR2 spectroscopic data reduction systems.
Sets of rules generate a solution space which may be
statistical, functional or symbolic (non-numerical).
Unlike other expert system environments, the statistical
foundations which govern many of the "macroscopic"
inferences are included, allowing for modification
to the underlying "implicit" statistical bases at
any time. The logic programming environment allows
modifications to the knowledge-base through automatic
and user-generated commands, and lends itself to the
development of easily understood natural language
interfaces.

Software for NMR applications is now in widespread use and it is
therefore important that such packages work well not only in more
traditional chemical shift, relaxation or resonance applications but
in the more recent context as potential pre-processors of imaging data.
In particular, two systems for NMR spectroscopic analysis,
NMR1 (1) (one-dimensional analysis) and NMR2 (2) (two-dimensional
analysis), are fully operational, while a third system for analysis of
magnetic resonance imaging (MRI) parameters is beginning to emerge.
While NMR1 and NMR2 are written in conventional numerically based code
(FORTRAN-77), the MRI system, MRLJÛGLESP, combines the use of FORTRAN
and the logic programming language, Prolog.
Our research on human tissue discrimination methods deriving
from MRI parameters is leading to the evolution of prescribed
statistical methods for data screening, normalization and
discrimination. These are driven by Prolog. In Prolog, sets of logical
inferences which carry information about application data types are
capable of distinguishing important features. Procedures pass the data
through those analyses which optimize classification.
In this paper we note some of the failures inherent in most
current computer-aided (and manual) NMR spectroscopic techniques and
reflect on possible solutions via Artificial Intelligence (AI)
techniques.
A description of current AI methods which lend themselves to
problems of this type is included, as well as a description of
applications to NMR spectroscopy and MR imaging. Lastly, there is a
brief description of MRLJiOCLESP in its current preliminary state.

0097-6156/86/0306-0337$06.00/0
© 1986 American Chemical Society

NMR1: Model Computer Software for Spectroscopic Analysis

NMR1 is a graphics-oriented software system containing over 100
options, each allowing the user a large degree of freedom to analyze
spectroscopic data in a single dimension. At the core of NMR1 is a
set of procedures for data reduction, estimation of initial parameters,
and the utilization of a set of convergence methods for baseline
conditioning, peak identification and curve fitting.
Curve fitting is currently accomplished using a non-linear
minimization (modified Levenberg-Marquardt) algorithm for
three-parameter Lorentzians, as well as five additional non-linear peak
shapes.
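To make the curve-fitting step concrete, the following Python sketch fits a single three-parameter Lorentzian (height, center, full width at half height) with a plain textbook Levenberg-Marquardt iteration. It is not the modified algorithm used in NMR1, which is not described here. The function names, the forward-difference Jacobian, the damping schedule and the synthetic noise-free data are all assumptions made for illustration.

```python
def lorentzian(x, height, center, width):
    """Lorentzian line with the given height, center and full width at half height."""
    half = width / 2.0
    return height * half ** 2 / ((x - center) ** 2 + half ** 2)


def solve(a, b):
    """Solve a small dense system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_lorentzian(xs, ys, guess, iterations=100):
    """Least-squares fit of one Lorentzian, starting from an initial guess."""
    def sse(p):
        return sum((y - lorentzian(x, *p)) ** 2 for x, y in zip(xs, ys))

    params, lam = list(guess), 1e-3
    current = sse(params)
    for _ in range(iterations):
        jac, res = [], []
        for x, y in zip(xs, ys):
            f0 = lorentzian(x, *params)
            res.append(y - f0)
            row = []
            for k in range(3):  # forward-difference derivative for parameter k
                step = 1e-6 * max(1.0, abs(params[k]))
                shifted = params[:]
                shifted[k] += step
                row.append((lorentzian(x, *shifted) - f0) / step)
            jac.append(row)
        jtj = [[sum(r[i] * r[j] for r in jac) for j in range(3)] for i in range(3)]
        jtr = [sum(r[i] * e for r, e in zip(jac, res)) for i in range(3)]
        # Marquardt damping: scale the diagonal of J^T J by (1 + lam).
        damped = [[jtj[i][j] + (lam * jtj[i][i] if i == j else 0.0)
                   for j in range(3)] for i in range(3)]
        trial = [p + d for p, d in zip(params, solve(damped, jtr))]
        trial_sse = sse(trial)
        if trial_sse < current:      # accept the step and relax the damping
            params, current, lam = trial, trial_sse, lam / 10.0
        else:                        # reject the step and damp more strongly
            lam *= 10.0
    return params


xs = [i * 0.1 for i in range(-50, 51)]
ys = [lorentzian(x, 10.0, 0.5, 1.2) for x in xs]
print(fit_lorentzian(xs, ys, guess=(8.0, 0.3, 1.0)))
```

The fit depends on the initial guess passed in `guess`, so poor starting values can slow or prevent convergence.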
Generally, a user/chemist may learn a great deal from the
displays of the Fourier transformed spectra using the options for
analysis available with graphics interaction. Nevertheless there is
a great deal of room for improvement. The following list summarizes
the most salient current difficulties with traditional computer-aided
analysis:

1. Subroutines used to obtain quantitative measures of the
   parameters associated with overlapping peaks can end
   in misleading results if an incorrect theoretical line
   shape has been utilized.

2. Automated initial parameter estimation may not be
   accurate and then the user will be required to intercede
   with manual estimation.

3. It may be necessary to manually install or delete peaks,
   especially when signal-to-noise is low or when the peaks
   are largely unresolved. In cases of very small peaks
   or when overlap between peaks is high, standard algorithms
   may fail and return unrealistic linewidth values.

4. If initial estimates are too far from correct values,
   calculations may diverge. Then the process must be
   restarted with better initial estimates.

5. Because of the underlying mathematical assumptions
   inherent in statistical modeling of the data, if the
   assumptions are for any reason incorrect the final fit
   may be poor or good, but not significant. Currently,
   software users must have sufficient background in the field
   to properly interpret the validity of the output from curve
   fitting and other algorithms.

In addition to these problems, there exists the set of conditions
under which the user must manually set all initial parameter
estimates. Manual constraints on the parameters may often be the only
way to obtain a proper convergence if the true, bounded region of the
solution is known by the user, or if one has some specific knowledge
of the correct starting values. This may be particularly true if
there are additional parameters with complex functional forms such as
phase angle. Automated parameter setting which takes into account
some of these problems could lead to more consistent results and
require less user expertise.

MR Imaging

While analytical spectroscopy has been used for many years in order to
obtain information regarding chemical structure, magnetic resonance
imaging is a relatively new field. Misleadingly well-resolved
images may aid an expert physician in diagnosing human tissue
abnormalities, but as little is understood about the relationships
which exist between tissue MRI parameters and tissue health, not to
mention secondary factors (genetic, environmental, macro-physiological,
etc.), such judgements, accurate or not, are often purely subjective.
In similar applications, precedents are well established for the use
of expert systems as medical diagnostic tools (1,1,5).
The optimal research strategy involves the systematic search to
uncover these relationships at the same time as the development of a
computer methodology proceeds. Such software systems will not only
give the kind of information about physicochemical structure as have
previously designed systems for NMR spectroscopic analysis, but will
serve as co-investigators, facilitating through automated procedures
analytical tasks which are normally time-consuming and complex.
Experimental data analysis of tissue parameters and construction
of an automated format for MRI research has been proceeding in our
laboratory with some success. Statistical analysis has revealed
that, with the proper normality transformations applied to T1, T2 and
1H density over eight regions of interest in the human brain (left
and right sides respectively of Cortical White Matter, Internal
Capsule, Caudate Nucleus, and Thalamus), the values within tissue
type generally follow a normal distribution(£). This implies that
existing discriminant functions may be able to optimally classify
data according to tissue type (although initial results also show
large overlaps between the normal distributions of several tissue
types). Indeed, preliminary results have yielded correct
classification percentages between 73 and 86%(2).
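The idea of classifying a region of interest with a discriminant function built on normally distributed parameters can be sketched in a few lines of Python. This is an illustration only and not the discriminant analysis used by the authors. The function names are hypothetical, the training values are invented numbers standing in for already transformed (T1, T2) measurements, the features are treated as independent, and equal prior probabilities are assumed.

```python
import math


def fit_classes(training):
    """training: {tissue: [feature tuples]} -> {tissue: (means, variances)}."""
    model = {}
    for tissue, rows in training.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / (n - 1)
                     for col, m in zip(columns, means)]
        model[tissue] = (means, variances)
    return model


def classify(model, features):
    """Assign the tissue whose normal model gives the highest log-likelihood."""
    def log_likelihood(means, variances):
        return sum(-0.5 * math.log(2.0 * math.pi * var) - (x - m) ** 2 / (2.0 * var)
                   for x, m, var in zip(features, means, variances))
    return max(model, key=lambda tissue: log_likelihood(*model[tissue]))


# Invented (T1, T2) values for two of the tissue types named in the text.
TRAINING = {
    "cortical white matter": [(510, 67), (530, 70), (495, 65), (520, 69)],
    "thalamus": [(700, 75), (720, 80), (690, 78), (710, 74)],
}

model = fit_classes(TRAINING)
print(classify(model, (515, 68)))
```

When the class distributions overlap strongly, as the text notes for several tissue types, a classifier of this kind will misassign a share of the observations, which is consistent with classification rates well below 100%.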
As shown in Figure 1, however, statistical analysis alone is
only one of the steps towards realizing a fully functional system for
MRI tissue discrimination. Experimental data is passed through
software (eventually NMR1/NMR2) for pre-processing. Since the
statistical analyses must themselves be applied to an ever-increasing
number of regions of interest (ROIs), it would be of great use to
develop an automated statistical treatment methodology for future
research. Also, it is not yet known just what final precision to
expect from discriminant analysis, even including secondary descriptor
data which should help to obviate overlaps between MRI responses from
different ROIs. It seems quite likely that additional, qualitative
information, or non-traditional numerical discrimination procedures,
might be required at some point to approach an accuracy acceptable
for future clinical use of MRI expert systems. Lastly, a general
computer system architecture must control these procedures for
highest efficacy.

Overview of Proposed AI Techniques

Most researchers are now aware that "Artificial Intelligence" has
diverse meanings both inside and outside Computer Science. Such a
situation is understandable given the actual variety of methods which
could, without being strictly inaccurate, fall under the heading of
AI. It is therefore important, if one intends to use AI techniques to
develop a computer program architecture, to first define clearly the
AI related approaches which one has chosen to apply.
For the present, the following AI tools and techniques comprise
the building materials intended for construction of our automated
spectroscopic analysis and MRI tissue discrimination systems:
Logic Programming, and specifically the programming language
Prolog, satisfies a specified goal by resolving its premises. For
resolution to take place, these premises must in turn become the
subgoals for premises which can be satisfied. A goal is specified
with a predicate name and a set of arguments whose values must be
instantiated for the goal to succeed. The goal-to-premise structure
forms sets of clauses which operate upon the principles of first
order predicate logic (8).
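The goal-to-premise resolution described above can be mimicked, in a much reduced form, by a backward-chaining routine. The Python sketch below handles only propositional rules (no arguments, variables or unification, which real Prolog provides), and the rule and fact names are hypothetical examples rather than clauses from the authors' system.

```python
# goal -> list of alternative premise lists (hypothetical example rules)
RULES = {
    "classify_tissue": [["data_screened", "data_normalized"]],
    "data_screened": [["has_t1", "has_t2"]],
    "data_normalized": [["has_t1"]],
}

# Facts that hold without further proof (hypothetical).
FACTS = {"has_t1", "has_t2"}


def prove(goal, rules=RULES, facts=FACTS):
    """Return True if the goal is a fact or every premise of some rule is provable."""
    if goal in facts:
        return True
    for premises in rules.get(goal, []):
        # Each premise becomes a subgoal; if any subgoal fails,
        # the next alternative rule for this goal is tried.
        if all(prove(premise, rules, facts) for premise in premises):
            return True
    return False


print(prove("classify_tissue"))
```

Removing `has_t2` from the facts makes `data_screened` fail, and with it the top-level goal, which mirrors how an unsatisfied premise blocks resolution.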
Decision Trees provide the overall structure for problem resolution
in the current system. The outcome of a test at a particular node in
the tree is recorded and directs the next decision for branching. If
a failure is encountered at all possible branches, the unresolved
problem is passed back up to the node at which there last existed a
possible, untested solution. Prolog lends itself nicely to this
structure since its basic architecture includes decision-making via
such a "depth-first" search strategy (2).
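The branching and backing-up behaviour described above can be sketched as a depth-first search with backtracking. This is an illustrative Python sketch, not the authors' Prolog code; the tree, the node names, and the pass/fail outcomes in `TESTS` are hypothetical.

```python
def search(node, tests, path=()):
    """Depth-first search. Returns the first chain of passed tests that
    reaches a leaf, or None if every branch below this node fails."""
    name, children = node
    if not tests[name]:          # the test at this node fails
        return None
    path = path + (name,)        # record the outcome of the test
    if not children:             # leaf reached: problem resolved
        return path
    for child in children:       # try each untested branch in order
        found = search(child, tests, path)
        if found is not None:
            return found
    return None                  # all branches failed: back up to the caller

# Hypothetical decision tree, written as (name, [children]).
TREE = ("root", [
    ("raw", [("transform", []), ("clean", [("discriminate", [])])]),
    ("pre_processed", [("discriminate_direct", [])]),
])
# Hypothetical test outcomes at each node.
TESTS = {"root": True, "raw": True, "transform": False, "clean": True,
         "discriminate": False, "pre_processed": True,
         "discriminate_direct": True}

print(search(TREE, TESTS))
# ('root', 'pre_processed', 'discriminate_direct')
```

Here every branch under `raw` fails, so control returns to `root`, the last node with an untested alternative, and the search continues from there.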
Model Matching and Similarity Nets (10) are the means by which, at
any node in the decision tree, an actual test is made. Within a
database lies a set of model data structures, to which an attempt is
made to match the actual input and output format. The initial problem
is to find a model which, of all stored models, contains the fewest
differences in structure between model and actual data set. An
example is shown in Figure 2, a system designed to classify human
brain tissue type. A data set is entered which contains variable
names "Obs", "T1", "T2" and "1H density" as headings in the first
row. In the first column, it is found, lies a set of integer numbers
running from 1 to 4, and within the set itself are "*"s. Stored
within the system are a series of general models which identify data
matrices in sets of predicates defining "first row", "column", and
other distinguishing characteristics of data sets.
By searching the stored models for such characteristics, the
program constructs a model data set which appears to come closest to
that defined by the structure of the input data set. For that
particular model, there is now a set of options for action upon the
data. The variable name "Obs" is associated with a set of classes,
therefore the program knows that 4 classes are represented by the
data set. The variable names "T1", "T2" and "1H density" are
untransformed names, which information, taken with the fact that "*"
has been used in the place of data in certain positions, identifies
the set as being raw and untransformed. The program will then proceed
by "cleaning up" the raw data set, making appropriate transformations
and applying a discriminant analysis to the set, under the assumption
of four classes.

              NATURAL LANGUAGE INTERFACE

              PRE-PROCESSOR: NMR1/NMR2

              PROLOG CONTROL PROGRAM

    STATISTICAL        LOGICAL          EXPERIMENT
    EXPERT             INFERENCE        CONTROL
    SYSTEM             ENGINE

Figure 1. System Flow Chart

ACTUAL DATA SET

    Obs    T1         T2         Density

    1      663.000    77.0000     96.798
    1      775.000    84.0000    107.554
    2      659.000    82.0000     99.556
    2      *          76.0000    *
    3      619.000    79.0000     99.467
    3      667.000    79.0000    102.868
    4      *          79.0000    *
    4      651.000    80.0000     84.752

LOGIC PROGRAMMING CODE

    list_of_variable_names[a, b, ..., obs, ..., T1, ...,
                           T2, ..., Density, ...].

    symbol_list[<integers>, <real_numbers>, "*", ...].

    position_meanings[column(Number, Symbol, Meaning),
                      row(Number1, Symbol1, Meaning1)].

    Model_data_sets[model1(column(A,B,C), row(D,E,F)), ...,
                    modeln(column(I,J,K), row(X,Y,Z))].

MODEL DATA SET - CLOSEST (GENERAL) MATCH

    Obs    T1        T2     Density

    1      VAL       VAL    VAL
    1      VAL       VAL    VAL
    2      VAL       VAL    VAL
    2      NONVAL    VAL    VAL
    3      VAL       VAL    VAL
    3      VAL       VAL    VAL
    4      NONVAL    VAL    NONVAL
    4      VAL       VAL    VAL

Figure 2. Pattern Matching with Prolog
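The matching step in this example can be sketched as follows, in Python for illustration only (the original is Prolog). The rows are transcribed from the actual data set of Figure 2, and `figure2_model` is transcribed from the model data set of the same figure. The second stored model (`complete`), the function names, and the use of a simple cell-by-cell difference count as the measure of "fewest differences" are assumptions made for this sketch.

```python
ROWS = [  # (Obs, T1, T2, Density), transcribed from Figure 2
    (1, "663.000", "77.0000", "96.798"),
    (1, "775.000", "84.0000", "107.554"),
    (2, "659.000", "82.0000", "99.556"),
    (2, "*", "76.0000", "*"),
    (3, "619.000", "79.0000", "99.467"),
    (3, "667.000", "79.0000", "102.868"),
    (4, "*", "79.0000", "*"),
    (4, "651.000", "80.0000", "84.752"),
]

def is_value(cell):
    """True if the cell holds a number rather than a placeholder."""
    try:
        float(cell)
        return True
    except ValueError:
        return False

def to_pattern(rows):
    """Reduce each data cell to VAL or NONVAL, ignoring the Obs column."""
    return [tuple("VAL" if is_value(c) else "NONVAL" for c in row[1:])
            for row in rows]

def differences(pattern, model):
    """Count the cells in which two equally shaped patterns disagree."""
    return sum(a != b for p_row, m_row in zip(pattern, model)
               for a, b in zip(p_row, m_row))

V3 = ("VAL", "VAL", "VAL")
MODELS = {
    "complete": [V3] * 8,  # hypothetical model with no missing cells
    "figure2_model": [V3, V3, V3, ("NONVAL", "VAL", "VAL"),
                      V3, V3, ("NONVAL", "VAL", "NONVAL"), V3],
}

pattern = to_pattern(ROWS)
n_classes = len({row[0] for row in ROWS})            # distinct Obs labels
is_raw = any("NONVAL" in row for row in pattern)     # "*" marks raw data
best = min(MODELS, key=lambda name: differences(pattern, MODELS[name]))

print(n_classes, is_raw, best)   # 4 True figure2_model
```

The model from Figure 2 differs from the actual pattern in one cell, against four cells for the all-values model, so it is selected as the closest (general) match.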
In actual practice a number of tests must be passed at various
nodes before final classification takes place. Also, a prohibitive
time would be required to search a large database of models for ones
which most closely approximated the actual data set. For this reason
the concept of similarity nets is introduced. In this case, a more
general model is first chosen, one which is clearly not completely
absurd. A subset of other models which are variations of this first
general model then provides the index for the final choice of model.
Such a reduction in the model lists greatly reduces the search space
for the closest fit.
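The two-stage search just described can be sketched as below (Python, for illustration only). The net contents, the feature names, and the set-difference distance are all hypothetical; the chapter does not specify how similarity between models is measured.

```python
def distance(a, b):
    """Number of features present in one description but not the other."""
    return len(a ^ b)

# Hypothetical similarity net: each general model indexes its variations.
NET = {
    "data_matrix": {
        "features": {"header_row", "numeric_cells"},
        "variants": {
            "raw_matrix": {"header_row", "numeric_cells",
                           "missing_marks", "class_column"},
            "transformed_matrix": {"header_row", "numeric_cells",
                                   "class_column", "log_names"},
        },
    },
    "spectrum": {
        "features": {"frequency_axis", "intensity"},
        "variants": {
            "in_vivo_spectrum": {"frequency_axis", "intensity",
                                 "broad_baseline"},
        },
    },
}

def closest_model(observed, net):
    # Stage 1: choose the general model that is least unlike the input.
    general = min(net, key=lambda g: distance(observed, net[g]["features"]))
    # Stage 2: search only the variations indexed by that general model.
    variants = net[general]["variants"]
    best = min(variants, key=lambda v: distance(observed, variants[v]))
    return general, best

observed = {"header_row", "numeric_cells", "missing_marks", "class_column"}
print(closest_model(observed, NET))   # ('data_matrix', 'raw_matrix')
```

Only the variants of one general model are compared in the second stage, which is where the reduction in search space comes from.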
While a great many other techniques may be employed to ensure
consistent and efficient logic-driven software, the techniques
described above are the primary drivers for the effective resolution
of goals in the prototype expert system MRI_LOG_ESP. Once a detailed
accounting is made of the characteristics which fully describe input
and output, providing Prolog code is quite easily accomplished.

Specific Applications:
Imaging

The prototype expert system, MRI_LOG_ESP, has been written to aid
in classifying tissue type from primary and secondary tissue
descriptors and is capable of limited applications. While it would be
incorrect to say that the current system is a robust expert system,
since it is not yet able to fully make the inferences regarding the
input data sets which would lead to automatic tissue classification,
the program does successfully enable serial combinations of
statistical procedures to be run from a central Prolog controlling
program. Commands are parsed so that simple English language
structures can be interpreted, and a tracking procedure keeps an
automatic log of analyses run and steps taken.
At the experimental interface, we expect to ultimately assemble
a data set of approximately 500 patients. Thus far we have only worked
on a much smaller data set of 23 individuals. The actual protocol for
obtaining this data is described elsewhere (S).
Figure 3 provides a sample session with MRI_LOG_ESP. The
controlling code is written in Prolog, while numerically oriented,
analytical procedures for classification are written in Fortran-77.
The system has been written to run under Data General's AOS/VS
operating system (MV series computers) but it is expected to be
easily ported to the Digital Equipment VAX VMS environment.
At the front end of the controlling program there are three main
branches in the form of predicates:

• auto_analysis

• user_driven_analysis

• help_file

    WELCOME TO MRILOG
    Use the options list below to guide your interaction
    while keeping your responses relatively simple, and
    you should have few problems working within the system.

    GENERAL FEATURES

      I.   auto_analysis: based on a user specified
           data file, program determines analyses and
           runs them.
      II.  user_driven_analysis: user specifies files
           and run list.
      III. help_file: probably a good place to start.

    What is your general objective,
    based on the information
    supplied above?

        NOTE: Answers can take a natural language form
        but user should try to respond within the
        context of the prompt.

    |: I would like to perform a user driven analysis.

    Reset? (yes/no/all)

        REMEMBER: Regular "reset" will delete all the data
        files accumulated so far EXCEPT:
            <user_analysis(xxx)>
        and <auto_analysis(xxx)>.
        By specifying "all" in the reset command, these
        will also go. See Users Manual or Help file
        for further information.

    no.

    Please either give a clear and
    concise description of
    your objective, or type in
    a list of the procedures
    you wish to invoke for data
    analysis.

    AVAILABLE PROCEDURES FOR ANALYSIS

    call name        description

    <twod>         - two dimensional graphics.
    <normtest>     - normality testing.
    <tran>         - transformation of variables.
    <tran_disc>    - discriminant analysis of transformed vars.
    <disc_fun>     - discriminant analysis (untransformed data)

    Let's do a normtest.

    SEARCHING AVAILABLE ANALYSES FOR YOUR SPECS

    The following analyses will be run using data from an
    input file. If the list is not correct indicate
    that a change is required. Otherwise, type "go",
    (or some other affirmative)

    RUN LIST:
    [normtest]

    |: go.
    INPUT FILE SPECIFICATION: normtest

    Will this be new data?
    |: no.

    You can use any file which contains PROPERLY FORMATTED data.

    What is (are) your input file name(s)?
    |: I want to examine t1cc and t2cl.

    Do you wish to examine file t1cc
    |: no.

    Do you wish to examine file t2cl
    |: no.

    Specified input files scanned.
    Starting normtest using input file: t1cc
    Starting normtest using input file: t2cl

    NORMALITY TESTING OF VARIABLES (normtest)

    All output has been appended in user_analysis.007

    Do you want a printout
    |: no.

    Please either give a clear and
    concise description of
    your objective, or type in
    a list of the procedures
    you wish to invoke for data
    analysis.

    AVAILABLE PROCEDURES FOR ANALYSIS

    call name        description

    <twod>         - two dimensional graphics.
    <normtest>     - normality testing.
    <tran>         - transformation of variables.
    <tran_disc>    - discriminant analysis of transformed vars.
    <disc_fun>     - discriminant analysis (untransformed data)

    halt.

Figure 3. An Interactive Session with MRI_LOG_ESP.

In the emerging system, auto_analysis represents the automated
logic-driven search tree which is able to apply the appropriate
analysis from logical inference and/or upon receipt from the user of
a description of the problem space (data set). The program will then
search, sort and classify the data as appropriate to each reasoning
technique logically demanded by the problem/data input. In many
respects auto_analysis represents the "expert system core". For the
initial input data set, the user is questioned regarding a final
objective. This objective, in combination with the results obtained
from each analysis, is what provides deterministic control, as the
initial data set is formatted for subsequent analysis and output from
previous analysis is itself reformatted (as a result of computer-
generated interpretation of the output). This reformatting of the
output is, once again, determined by the next analysis which the
program deems essential to the satisfaction of the primary,
user-specified goal.
The capability of determining the order of procedures, based on
the user's specification of the end goal alone, and of knowing at
what point computation should stop with a result recorded, is the
implementation of the theoretical AI techniques described above.
Conversely, user_driven_analysis allows the user to specify from
one to all of the available analyses and the specific data sets to
use for each given analysis. Thus, if predicates twod and normtest
were specified, a "Run_list" would be interpreted consisting of:

[twod, normtest]

It should be noted that all the formatting knowledge required to run
auto_analysis is also required to run a user analysis. In this case
one is simply overriding the computer's "better judgement" in terms
of procedural protocol.
The following predicates comprise the current statistical
procedures handled by the system.

1. <twod> - two dimensional graphics

2. <threeD> - three dimensional graphics

3. <normtest> - normality evaluation of variables

4. <trans> - normality transformation of variables

5. <disc_fun> - linear discriminant analysis

These programs, written in Fortran-77, are accessed through
system calls in Prolog. Naturally this list will be greatly augmented
in future versions of the program. However, this set of statistical
analyses is comprehensive insofar as a data set may be screened and
classified using these routines alone, thereby providing a general
but simplified model upon which to base the logical inferences
which govern choice of analysis.
At the top level of the program, having provided the "analytical
objective" for a given data set, the user is asked to verify that the
program has understood the command correctly. If this is verified the
program will go on to ask the user for the name of the input file
containing data from which it is desired to proceed with testing.
A copy of the results of every test on a set of data is saved
and when a single run of a series of tests has been made, the results
of each procedure are appended in a single file.
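How a run list might drive the procedures in sequence while their results accumulate in one place can be sketched as follows (Python, for illustration only). The functions `twod` and `normtest` here are hypothetical stand-ins for the Fortran-77 routines, and an in-memory list stands in for the single output file.

```python
def twod(data):
    # Hypothetical stand-in for the two dimensional graphics routine.
    return f"twod: plotted {len(data)} points"

def normtest(data):
    # Hypothetical stand-in for the normality testing routine.
    return f"normtest: examined {len(data)} points"

PROCEDURES = {"twod": twod, "normtest": normtest}

def run(run_list, data, log):
    """Run each named procedure in order, appending every result to one log."""
    for name in run_list:
        if name not in PROCEDURES:
            log.append(f"{name}: unknown procedure, skipped")
            continue
        log.append(PROCEDURES[name](data))
    return log

log = run(["twod", "normtest"], [663.0, 775.0, 659.0], [])
print(log)
# ['twod: plotted 3 points', 'normtest: examined 3 points']
```

Keeping the procedures in a lookup table means the same controller serves both the automatic and the user-driven modes; only the source of the run list differs.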
Commands are requested from the user at fairly regular
intervals. The program will accept most general sentences which may be
construed by the parser to elicit some form of action for which the
program has been written.
While several predicates require and accept only affirmative
(yes) or negative (no) responses from the user, for the most part,
communication with the program is governed by what has been termed a
"Context Parser", the main predicate of which has three levels to
handle varying levels of linguistic complexity.
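The chapter does not say what the three levels of the Context Parser are. The sketch below (Python, for illustration only) assumes one plausible division: a bare yes/no answer, a bare list of procedure names, and procedure names embedded in a longer sentence. The word lists and the function name `parse` are hypothetical; the procedure names are those shown in Figure 3.

```python
import re

AFFIRMATIVE = {"yes", "y", "go", "ok"}   # assumed set of affirmatives
NEGATIVE = {"no", "n"}
PROCEDURES = ("twod", "normtest", "tran", "tran_disc", "disc_fun")

def parse(reply):
    """Interpret a user reply at the simplest level that fits it."""
    words = re.findall(r"[a-z_0-9]+", reply.lower())
    # Level 1: a bare affirmative or negative.
    if len(words) == 1 and words[0] in AFFIRMATIVE:
        return ("answer", True)
    if len(words) == 1 and words[0] in NEGATIVE:
        return ("answer", False)
    # Level 2: the reply consists only of procedure names.
    if words and all(w in PROCEDURES for w in words):
        return ("run_list", words)
    # Level 3: procedure names picked out of a longer sentence.
    found = [w for w in words if w in PROCEDURES]
    if found:
        return ("run_list", found)
    return ("unknown", reply)

print(parse("go."))                   # ('answer', True)
print(parse("Let's do a normtest."))  # ('run_list', ['normtest'])
```

A reply that matches none of the levels is returned as unknown, at which point a real system would presumably prompt the user again.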
The aim of the graphics software (twod, threeD) is to enable
the user to rapidly examine a large number of two- and three-
dimensional scatter plots. At present the program is capable of
handling up to 120 variables with up to 200 observations for each.
Predicate normtest tests/evaluates normality of the given set of
data points (corresponding to any variable), while disc_fun performs
a linear discriminant analysis on groups of data (maximum of 10
groups) with respect to any selected variables (maximum of 20
variables).
There are only two types of output files and output file names.
These are:

auto_analysis_out<xxx>

and

user_analysis_out<xxx>

where "<xxx>" symbolizes a sequence number. As runs are made, each
one is placed in a list of output files. As runs are made, whether
in auto or user mode, an accumulation of results is inevitable.
Also, due to the method by which input and output files are appended,
there is some accumulation of "garbage" files. The procedure reset
provides a way of deleting unnecessary files.
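The naming scheme and the reset behaviour can be sketched as below (Python, for illustration only). The three-digit zero padding is an assumption suggested by the "007" seen in the sample session; the function names and the use of plain file-name lists instead of a real file system are also assumptions. The rule that a regular reset spares the analysis output files, while "all" removes them too, follows the REMEMBER note in Figure 3.

```python
def next_output_name(mode, existing):
    """Return '<mode>_analysis_out<xxx>' using the next free sequence number."""
    prefix = f"{mode}_analysis_out"
    used = [int(name[len(prefix):]) for name in existing
            if name.startswith(prefix) and name[len(prefix):].isdigit()]
    return f"{prefix}{max(used, default=0) + 1:03d}"

def reset(files, everything=False):
    """Return the files that survive a reset.

    A regular reset keeps only the analysis output files; with
    everything=True (the "all" option) nothing is kept."""
    if everything:
        return []
    keep = ("auto_analysis_out", "user_analysis_out")
    return [name for name in files if name.startswith(keep)]

files = ["user_analysis_out001", "user_analysis_out002", "scratch.tmp"]
print(next_output_name("user", files))  # user_analysis_out003
print(next_output_name("auto", files))  # auto_analysis_out001
print(reset(files))   # ['user_analysis_out001', 'user_analysis_out002']
```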

NMR1 and Similar Software (Spectroscopic Analysis)


Precisely how the AI techniques discussed above might improve the
current, numerically-based software is still somewhat speculative.
Close inspection of the current failings discussed previously
indicates that their source lies in two main areas which are difficult
for numerically based algorithms to handle:

1. The problem of defining the baseline and locating and
defining the spectral features (peaks).

2. The correct setting of initial parameters.

For quantitative characterization of molecules including analyses
of in vivo (metabolic) NMR spectra, the problem of peak overlap and
baseline identification (which are particularly problematic for in
vivo spectroscopy) is really only one part of the more general
experimental problem of discovering spectral differences arising in
complex environments. We can greatly increase the efficiency of the
overall experimental process and solve the peak quantification
problem utilizing the data base structure inherent in a logic
programming framework.
With sets of rules providing the facts from which a full model
can be constructed, the program is informed regarding the origin of
the spectrum to be analyzed. A comparison is then made between model
and actual spectra. Anomalous features are thus identified.
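The comparison of model and actual spectra can be sketched in its simplest form as a point-by-point check against a tolerance (Python, for illustration only). The intensity values, the tolerance, and the function name are hypothetical; the chapter does not describe how the comparison is actually carried out.

```python
def anomalies(model, actual, tolerance):
    """Return the positions where the actual spectrum departs from the model."""
    return [i for i, (m, a) in enumerate(zip(model, actual))
            if abs(a - m) > tolerance]

# Hypothetical intensities sampled at the same positions.
model = [0.0, 0.1, 1.0, 0.1, 0.0, 0.5, 0.0]
actual = [0.0, 0.1, 1.1, 0.1, 0.6, 0.5, 0.0]

print(anomalies(model, actual, tolerance=0.2))   # [4]
```

The small deviation at position 2 falls within the tolerance, while the unexpected intensity at position 4 is flagged as a feature the model does not account for.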
In (2) above, a second difficulty with spectral analysis is
identified which might be alleviated through the use of logic
programming methods: the setting of initial parameters. As noted
above, this is generally an automated procedure in NMR1 but sometimes
does require user intervention. A minimization algorithm which did
not derive its information from a pre-set local minimum could be
applied at each minimum in the spectrum and then proceed to examine
the consistency of the results. Such a method would not be difficult
to implement within the framework of a logic-driven, controlling
algorithm. By applying this minimization algorithm at multiple minima
and by examining consistency (and not by numerical methods alone!), a
determination would be made as to whether the derived minimum was
false or global.
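The idea of starting a minimization from several points and then examining the results for consistency can be sketched generically (Python, for illustration only). This is not the NMR1 algorithm: the test function, the fixed-step gradient descent, the step size, and the starting points are all assumptions chosen so that one false (local) and one global minimum exist.

```python
def f(x):
    # Hypothetical objective with two minima, near x = -1.30 and x = 1.13.
    return x**4 - 3 * x**2 + x

def df(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1

def descend(x, step=0.01, iterations=5000):
    """Plain fixed-step gradient descent from the starting point x."""
    for _ in range(iterations):
        x -= step * df(x)
    return x

def survey(starts):
    """Minimize from every start, then compare the distinct minima found."""
    minima = sorted({round(descend(s), 3) for s in starts})
    best = min(minima, key=f)                    # lowest value of f
    false_minima = [m for m in minima if m != best]
    return best, false_minima

best, false_minima = survey([-2.0, -0.5, 0.5, 2.0])
print(best, false_minima)   # -1.301 [1.131]
```

Two of the four starts settle in the shallower basin, and only the comparison across starts reveals that this minimum is false. In the scheme proposed above, that comparison is the part a logic-driven controller would supervise.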
Investigation thus far has been made into characteristic in vivo
31P peaks with some thought to localized pattern matching (1). In the
coming year we will begin to look at coding characteristic in vivo
spectra and developing a Prolog algorithm which analyzes the results
of the minimization algorithm. For the most part, it is hoped that
MRI_LOG_ESP will provide the "expert system shell" which may be
effectively applied to the problems in spectroscopic analysis.

Acknowledgments

The authors acknowledge the collaboration of Dr. Felix Wehrli and
co-workers at General Electric Medical Systems and also pilot project
funding from NIH (Grants RR-01317 and RR-01831) and the General
Electric Company.

Literature Cited

1. Dumoulin, C.L., Levy, G.C.; Journal of Molecular Spectroscopy,
   113, 299-310 (1984); Dumoulin, C.L., Levy, G.C.; Computers and
   Chemistry, 5, 9-18 (1981).

2. Levy, G.C., Delaglio, F., Macur, A., Begemann, J.; Computer
   Enhanced Spectroscopy, in press (1986).

3. Buchanan, B.G., Shortliffe, E.H.; "Rule-Based Expert Systems:
   The MYCIN Experiments of the Stanford Heuristic
   Programming Project", Addison-Wesley, Reading, MA (1984).

4. Miller, R., Pople, H., Meyers, J.; New England Journal of Medicine,
   307, 468-476 (1982).

5. Weiss, S.M., Kulikowski, C.A.; "A Practical Guide to Designing
   Expert Systems", Rowman and Allanheld, Totowa, NJ (1984).

6. Dudewicz, E.J.; "Statistical Analysis of Magnetic Resonance
   Imaging Data in The Normal Brain, Part I: Data, Screening,
   Normality, Discrimination, Variability"; unpublished report,
   1985.

7. Levy, G.C., Dudewicz, E.J., Harner, T.J., Wehrli, F.W., Breger,
   R.; (Submitted), Magnetic Resonance in Medicine, (1985).

8. Kowalski, R.; "Logic for Problem Solving", Computer Science
   Series, North-Holland Publishing Co., NY (1979).

9. Clocksin, W.F., Mellish, C.S.; "Programming in Prolog",
   Springer-Verlag, Berlin, (1981).

10. Winston, P.H.; "Artificial Intelligence", Second Edition,
    Addison-Wesley, Reading, Massachusetts (1984).

RECEIVED January 24, 1986
27

An Expert System for Organic Structure Determination

Bo Curry
Chemical Systems Department, Hewlett-Packard Laboratories, Palo Alto, CA 94304-1209

We are developing an expert system which interprets
low-resolution mass spectra, infrared spectra, and
other user-supplied information and produces a list of
functional groups present in an unknown organic
compound. The input data are interpreted as evidence
supporting the presence or absence of each of the over
900 functional groups and organic substructures
represented in the knowledge base. This evidence is then
combined by an "inference engine" to determine the
probability that the group is present. Each type of
input spectrum is interpreted by a separate module,
which has private internal data structures; these
modules can use different techniques and even be
written in different computer languages. The modular
architecture was designed to allow new modules
interpreting different types of spectra to be easily
incorporated into the system. A major goal has been the
reduction of the number of false positive assertions.

Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch027

An analyst attempting to identify an unknown compound from spectral
data begins by searching libraries of spectra of known compounds
(Figure 1). Programs which rapidly and reliably search spectral
libraries are widely available (1-2). However, although these
libraries continue to grow, it will remain true that the majority of
compounds encountered in real samples are not represented in the
libraries. These compounds can at present be identified only through
a laborious manual process requiring considerable expertise.
Interpretation of molecular spectra involves four basic steps.
First, major skeletal and functional group components of the
molecule are identified, either from assumptions about the compound
origin or from features of the spectra. Second, non-localized
molecular properties such as the molecular weight, elemental
composition, and chromatographic behavior are considered. These global
constraints can be used to eliminate unlikely functional groups,
deduce the presence of groups and skeletal units which have no
distinctive features in the spectra, and detect multiple occurrences of
functional groups. Complete candidate structures are then generated
by assembling the functional groups subject to the global
constraints. More data may be collected to narrow down the number of
candidates. Finally, the candidate structures are tested for
compatibility with all original data. Final confirmation is obtained by
synthesis of the candidate compound and comparison with the unknown.

0097-6156/86/0306-0350$06.00/0
© 1986 American Chemical Society
We are developing an expert system to automate the first step
of this process, the interpretation of molecular spectra and
identification of substructures present in the molecule. The automatic
interpretation of spectra would by itself provide a useful tool for
an organic chemist who may not be an expert spectroscopist. Also,
reported algorithms for the assembly of candidate structures from
known substructures, such as the GENOA program (3-6), rely on the
input of accurate and specific substructures in order to function
correctly and efficiently. Identification of substructures is thus a
logical starting point.
Information about substructures present in an unknown can be
obtained from a wide variety of sources, and one of our major
objectives has been to allow all available data to be used by the
program. Programs have been described in the literature which
interpret C-13 and 1-H NMR spectra (7-13), low and high-resolution
mass spectra (14-15), infrared spectra (16-23), MS-MS spectra (24),
and 2D-NMR spectra (25). The methods employed may be generally
classified as rule-based methods or pattern-matching methods.
Rule-based methods apply interpretation rules to discrete features of
the spectra (26). These rules are usually empirical correlations
having physical significance, expressed in a form similar to that used
by human interpreters. Rule-based systems maintain a relatively
detailed internal representation of their knowledge, and can explain
their conclusions in a language intelligible to the user.
Pattern-matching methods attempt to classify the spectrum based on
some global measure of "spectral distance" from spectra of known
compounds (27). Any physical knowledge used by the algorithm is
embodied in its distance measure, which may be a complicated function
of many features of the spectra. The classification decision is made
from a statistical analysis of the distance from representative
members of the classes being distinguished. Explanations of the
system's conclusions are usually limited to reporting the computed
spectral distances. Whichever method is employed, the output is in the
form of a list of suggested substructures, chosen from a predefined
set, with confidence factors variously computed.
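The pattern-matching approach can be illustrated with a minimal nearest-neighbour sketch (Python, for illustration only). The Euclidean distance, the vote among the k nearest reference spectra, the binned intensity vectors, and the class labels are all assumptions; published systems such as STIRS use their own distance measures and statistics, which are not reproduced here.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length intensity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical reference spectra (binned intensities) with class labels.
LIBRARY = [
    ("carbonyl", [0.0, 0.9, 0.1, 0.0]),
    ("carbonyl", [0.1, 1.0, 0.0, 0.0]),
    ("hydroxyl", [0.8, 0.0, 0.0, 0.3]),
    ("hydroxyl", [0.9, 0.1, 0.0, 0.2]),
]

def classify(spectrum, library, k=3):
    """Vote among the k nearest references; also return their distances."""
    nearest = sorted((distance(spectrum, ref), label)
                     for label, ref in library)[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count), nearest

label, nearest = classify([0.05, 0.95, 0.05, 0.0], LIBRARY)
print(label)   # carbonyl
```

As the text notes, the only explanation such a method can offer for its conclusion is the list of computed distances, here returned as `nearest`.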
The choice between rule-based and pattern-matching approaches
depends not only on the predilection of the experimenters, but also
on the nature of the data being interpreted. The reported NMR
interpreters all use rule-based methods. The pattern-matching algorithm
used in the STIRS program (14) appears to be the most successful at
interpreting low-resolution mass spectra of general organic
compounds. Both rule-based and pattern-matching techniques have been
applied to the interpretation of infrared spectra. The rule-based
methods seem to be the most successful (16-23). We have therefore
designed our program to allow each type of spectrum to be interpreted
by the most efficient method; different methods can even be
simultaneously applied to the same spectrum.
When the unknown is present in sub-microgram amounts, as is
often the case when it has been isolated chromatographically, the
primary structural techniques are mass spectrometry, infrared
spectroscopy, and various methods of determining elemental composition.
We have therefore concentrated our initial efforts on interpreting
these types of data, while recognizing the need to be able to use
data from other sources, such as NMR, when they are available. A
skilled chemist can often correctly identify an unknown of moderate
size (molecular weight < 200) using only the IR spectrum, the
low-resolution mass spectrum, and some knowledge of the sample origin.
Even when a precise identification is not possible, a generic
classification of the compound type is useful and often sufficient. A
program which interprets IR and mass spectra is therefore a useful
analytical tool in its own right, and provides the basis for
development of more comprehensive capabilities in the future.
In our present system, infrared spectra are interpreted using a
rule-based approach, while mass spectra are interpreted by the STIRS
algorithm. The ability to use different techniques for different
types of data implies a modular architecture, in which the "expert"
responsible for the interpretation of each spectrum maintains its own
rules and data structures (Figure 2). It is important, however, that
the interpretation of the various spectra be mutually consistent.
Information obtained from the mass spectrum, for example, should
affect the way the infrared spectrum is assigned. Conversely, the
interpretation of mass spectral lines must be consistent with the
presence of functional groups known to be present from other sources.
This requires a means of communication among the parts of the program
responsible for the interpretation of different types of data.
Consistency also requires a means of combining evidence from different
sources. When data from different sources contradict each other, the
individual modules should be able to reinterpret their data so as to
resolve the contradiction.
As in any classification problem, there is a tradeoff between
the rate of recall, or proportion of correct substructures detected,
and the reliability, or avoidance of false positive assertions. It
is rather the exception than the rule for an observation to have a
single, unequivocal explanation. When reasonable alternative
interpretations are possible, a decision must be made about what to
report. At one extreme, all possibilities could be asserted, ensuring
100% recall (i.e. no substructure which is actually present will
fail to be detected) at the cost of a high rate of false positives.
At the other extreme, ambiguous data could be ignored, which
guarantees no false positives, although many substructures which are
present will be missed. We have taken a middle road between these
extremes by developing a measure of the "best" or most probable
interpretation, taking into account all of the data available. When
the best choice is not clearcut, the disjunction of the competing
alternatives is explicitly asserted. The goal has been to minimize
the rate of false positives, while reporting the most specific
possible interpretation of the data.
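
The middle-road reporting policy can be illustrated with a short Python sketch. It is not the program's actual measure of the "best" interpretation, which is not specified here; the function name choose_report and the margin parameter are assumptions introduced only for illustration. Alternatives whose confidence lies within the margin of the best one are reported together as a disjunction.

```python
from typing import Dict, List


def choose_report(candidates: Dict[str, float], margin: float = 10.0) -> List[str]:
    """Return the interpretation(s) to assert for one observation.

    candidates maps each alternative substructure to a confidence level in
    percent. margin is a hypothetical cutoff: alternatives within margin
    points of the best one are treated as not clearly worse.
    """
    if not candidates:
        return []
    best = max(candidates.values())
    # Keep every alternative that is close enough to the best score.
    close = [name for name, level in candidates.items() if best - level <= margin]
    return sorted(close)


# A clear winner is reported alone; a near tie is reported as a disjunction.
clear = choose_report({"ketone": 60.0, "ester": 20.0})
tie = choose_report({"ketone": 42.0, "ester": 38.0, "amide": 5.0})
print(" OR ".join(clear))
print(" OR ".join(tie))
```

A margin of zero reports only the top-scoring alternative, while a very large margin reports every alternative, which corresponds to the two extremes discussed above.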
An important feature of expert systems is the accessibility to
the user of the knowledge base and the reasoning process. Both the
terminology used by the program and its interpretation of data have
chemical significance. Each conclusion reached by the program can be
traced by the user to the original data. When alternative
explanations for an observation are possible, the choice is visible
to the user. If the program has made an error, the user can correct it,
thereby modifying the original conclusions.

[Figure 1 flow chart; recoverable box labels: Identify Subunits; Specify Global Constraints.]

Figure 1. Flow chart for identification of an organic compound.

[Figure 2 diagram; recoverable labels: MS (input spectrum); MW 148; methyl-ketone; monosubst-benzene.]

Figure 2. Schematic drawing of the interpreter. The program is
represented by the area inside the solid rectangle. Program
modules are drawn as circles, and their associated databases as
rectangles. All of the modules have read access to the Chemical
Classes database.

Program Description

The architecture of our current system is shown schematically in
Figure 2. The design is modular, with a Controller module, a
Reasoner module, a database of over 900 organic substructures, and a
separate "Expert" module assigned to each kind of input data. The
Controller module controls the progress of the calculation by
considering each of the substructures which has not yet been eliminated,
beginning with the most general. It requests each of the Expert
modules to supply it with evidence supporting or denying the presence
of the substructure currently being considered. This evidence is
collected and passed to the Reasoner. When no more evidence can be
collected, the analysis is finished.
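
The control flow just described can be sketched in Python as follows. This is a minimal illustration, not the Lisp implementation: the breadth-first traversal, the function run_controller, the callable-based Expert and Reasoner interfaces, and the toy additive reasoner are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# An item of evidence: (substructure name, signed weight in percent).
Evidence = Tuple[str, float]
# A hypothetical Expert: given a substructure, returns its evidence items.
Expert = Callable[[str], List[Evidence]]


def run_controller(
    hierarchy: Dict[str, List[str]],          # parent -> more specific children
    root: str,
    experts: List[Expert],
    reasoner: Callable[[List[Evidence]], None],
    eliminated: Callable[[str], bool],
) -> List[str]:
    """Visit substructures from most general to most specific.

    Returns the order in which substructures were considered.
    """
    visited: List[str] = []
    queue = [root]                             # breadth-first: general first
    while queue:
        current = queue.pop(0)
        if eliminated(current):
            continue                           # skip this group and its subgroups
        visited.append(current)
        collected: List[Evidence] = []
        for expert in experts:
            collected.extend(expert(current))
        reasoner(collected)                    # combine evidence, update beliefs
        queue.extend(hierarchy.get(current, []))
    return visited


# Toy stand-ins (assumptions): beliefs are simply summed weights.
beliefs: Dict[str, float] = {}


def toy_reasoner(items: List[Evidence]) -> None:
    for name, weight in items:
        beliefs[name] = beliefs.get(name, 0.0) + weight


def toy_expert(name: str) -> List[Evidence]:
    # Denies esters outright when asked about carbonyl; mildly supports the rest.
    return [("ester", -100.0)] if name == "carbonyl" else [(name, 10.0)]


tree = {"carbonyl": ["ketone", "ester"], "ketone": ["methyl-ketone"]}
order = run_controller(tree, "carbonyl", [toy_expert], toy_reasoner,
                       lambda s: beliefs.get(s, 0.0) <= -100.0)
print(order)
```

In this toy run the ester branch is never visited, because it was eliminated before the Controller reached it.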

The Reasoner combines evidence from all sources and makes
deductions from this evidence. The combination of evidence results
in a single "confidence level" for each substructure. These
confidence levels designate the degree to which the evidence supports the
presence of the substructure in the unknown compound. They range
from -100% (substructure definitely absent), through 0% (no
information), to +100% (substructure definitely present). The confidence
levels are ultimately derived from statistical analysis of
representative spectral libraries. Details of the generation and propagation
of confidence levels will be described in a separate report.(28)
Each Expert module is permitted to use any convenient method to
carry out its mission of interpreting its assigned data. The Experts
use private rules and data structures, and communicate with the
Controller module both by suggesting the presence of substructures,
and by evaluating the likelihood of substructures under
consideration. Each Expert can read the current confidence level associated
with each substructure, and thus has access to information generated
by other Experts or deduced by the Reasoner.

Communication among these modules is accomplished in two ways.
First, the chemical database, besides storing the chemical knowledge
of the program, serves as a "blackboard" on which the progress of the
computation is recorded.(29) Only the Controller and Reasoner
modules are allowed to write on the blackboard, but all modules can
read it. In this way the conclusions of each Expert module are
available to all the others to guide their interpretation. Second,
the Controller module controls the overall path of the analysis by
sending messages to the individual Experts. The only requirement of
a new Expert module being added to the system is that it be able to
respond appropriately to these messages.
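
The read-by-all, write-by-two blackboard discipline can be sketched in Python. The class name Blackboard, its methods, and the use of module names as strings are assumptions for illustration; the original system stores this state in its Lisp chemical database.

```python
from types import MappingProxyType
from typing import Dict, Mapping


class Blackboard:
    """Shared store of confidence levels (percent, -100 to +100)."""

    # Only these modules may record results, as stated in the text.
    WRITERS = frozenset({"Controller", "Reasoner"})

    def __init__(self) -> None:
        self._levels: Dict[str, float] = {}

    def read(self) -> Mapping[str, float]:
        # Read-only live view: any module may look, none can modify through it.
        return MappingProxyType(self._levels)

    def write(self, module: str, substructure: str, level: float) -> None:
        if module not in self.WRITERS:
            raise PermissionError(f"{module} may not write on the blackboard")
        if not -100.0 <= level <= 100.0:
            raise ValueError("confidence level must lie in [-100, +100]")
        self._levels[substructure] = level


board = Blackboard()
board.write("Reasoner", "ketone", 30.0)
view = board.read()

try:
    board.write("IR Expert", "ketone", 99.0)   # an Expert attempting to write
    blocked = False
except PermissionError:
    blocked = True

print(view["ketone"], blocked)
```

Because the view is live, an Expert holding it sees later updates made by the Reasoner without being able to alter them.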
The current prototype system includes three Expert modules, the
IR Expert, the STIRS Expert, and the Human. All modules are written
in Lisp. The IR Expert is a rule-based infrared interpreter which we
have developed. The STIRS Expert is an interface to the STIRS
program, a pattern-matching mass spectrum interpreter developed by
McLafferty and coworkers at Cornell University, which is written in
Fortran.(14) The interface translates the output of STIRS into a form
palatable to our program, and handles the message-passing protocol
required by the Controller. The Human module controls communication
with the user. It allows user-supplied elemental or substructure
information to influence the course of the analysis. The power of
the modular approach is shown by our ability to integrate the results
of three interpretation methods which differ profoundly in their
internal details.

The Chemical Database. The chemical knowledge of the system is
embodied in a database of over 900 organic substructures, arranged in a
hierarchy (Figure 3). With each of these substructures is associated
a connection table, stability information, and a probability of
occurrence denoting how common the group is. This information may be
used by the Expert modules when deciding among possible
interpretations.
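
One way to picture a database entry is the Python sketch below. The field names, the simplified connection-table form, and all numeric priors are invented for illustration; the actual record layout of the database is not described in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Substructure:
    name: str
    parent: Optional[str]                      # more general group, if any
    connection_table: List[Tuple[str, str]]    # bonded atom pairs (toy form)
    stability: str                             # free-form stability note
    prior: float                               # probability of occurrence, 0..1
    children: List[str] = field(default_factory=list)


def add(db: Dict[str, Substructure], entry: Substructure) -> None:
    """Insert an entry and link it under its parent in the hierarchy."""
    db[entry.name] = entry
    if entry.parent is not None:
        db[entry.parent].children.append(entry.name)


def ancestors(db: Dict[str, Substructure], name: str) -> List[str]:
    """List the progressively more general groups above a substructure."""
    chain: List[str] = []
    parent = db[name].parent
    while parent is not None:
        chain.append(parent)
        parent = db[parent].parent
    return chain


# Priors below are placeholders, not values from the actual database.
db: Dict[str, Substructure] = {}
add(db, Substructure("carbonyl", None, [("C", "O")], "stable", 0.30))
add(db, Substructure("ketone", "carbonyl",
                     [("C", "O"), ("C", "C"), ("C", "C")], "stable", 0.08))
add(db, Substructure("methyl-ketone", "ketone",
                     [("C", "O"), ("C", "CH3"), ("C", "C")], "stable", 0.04))

print(ancestors(db, "methyl-ketone"))
```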
As the analysis progresses, evidence is accumulated supporting
the presence or absence of defined substructures. The evidence is
combined by the Reasoner module to form a belief function, which
describes the degree to which each substructure is currently
believed. This information is stored in the chemical database, where
it is available to the Expert modules and to the Controller as it
decides the course of the analysis. As the belief function evolves,
the current state is displayed graphically to the user, who may halt
the analysis, query the current state, and redirect the course of the
analysis by supplying evidence for or against a substructure.

IR Expert Module. The IR Expert's rule base consists of over 1000
correlations between observed infrared bands and vibrational modes of
specific substructures. Associated with each rule is a wavenumber
range, an intensity range, and two confidence levels. Four intensity
levels are allowed. The intensity levels are defined on an
approximate semilog scale, relative to the most intense peak in the
spectrum: WEAK = 2-5%, MEDIUM = 5-15%, STRONG = 15-40%, VSTRONG =
40-100%. The program does not attempt to assign bands weaker than
2% of the strongest band. Each IR rule is equivalent to the pair of
propositions:

a) IF a band of intensity I appears in the region x1 - x2 cm-1,
THEN it is due to the vibrational mode M of substructure S, AND

b) IF no band of intensity I appears in the region x1 - x2 cm-1,
THEN the substructure S is not present in the unknown.
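
The structure of a single IR rule can be sketched in Python as follows. The class IRRule, its field names, and the handling of bin boundaries (for example, treating exactly 5% as MEDIUM) are assumptions. In the usage example, the wavenumber range and the 23% negative weight are taken from the explanation shown in Figure 5, while the positive confidence of 20% is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Intensity classes from the text, as percent of the strongest band.
INTENSITY_BINS = [("WEAK", 2.0, 5.0), ("MEDIUM", 5.0, 15.0),
                  ("STRONG", 15.0, 40.0), ("VSTRONG", 40.0, 100.0)]


def classify_intensity(percent: float) -> Optional[str]:
    """Map a relative intensity to a class; bands under 2% are not assigned."""
    if percent < 2.0 or percent > 100.0:
        return None
    for name, low, high in INTENSITY_BINS:
        if low <= percent < high:
            return name
    return "VSTRONG"                           # exactly 100%


@dataclass
class IRRule:
    substructure: str
    mode: str
    wavenumber: Tuple[float, float]            # x1..x2 in cm-1
    intensities: Tuple[str, ...]               # allowed intensity classes
    cl_present: float                          # confidence in proposition (a)
    cl_absent: float                           # confidence in proposition (b)

    def apply(self, bands: List[Tuple[float, float]]) -> Tuple[str, float]:
        """bands holds (wavenumber cm-1, intensity %) pairs.

        Returns (substructure, signed confidence level).
        """
        low, high = self.wavenumber
        for position, percent in bands:
            if low <= position <= high and classify_intensity(percent) in self.intensities:
                return (self.substructure, +self.cl_present)    # proposition (a)
        return (self.substructure, -self.cl_absent)             # proposition (b)


rule = IRRule("methyl-benzene", "C-H sym stretch", (2860.0, 2883.0),
              ("MEDIUM",), cl_present=20.0, cl_absent=23.0)
missing = rule.apply([(2933.0, 10.0), (1700.0, 100.0)])
found = rule.apply([(2870.0, 8.0)])
print(missing, found)
```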

About 800 of these rules were chosen by testing all the IR
correlations we could find in the literature,(30-32) mostly for
condensed phases, against the EPA gas-phase library of 2300 compounds.
(33-34) About 30% of the literature correlations were not generally
satisfied by the library spectra, and were discarded. Another 200
rules were discovered by searching for patterns in compound classes
in the library which could reasonably be attributed to expected
vibrational modes of those classes. Statistics were generated for the
probability that each of the IR rules would be satisfied for
compounds which contained, or did not contain, the substructure
specified by the rule. These statistics were used to compute two
confidence levels for each rule, corresponding to the confidence in the
two propositions a) and b) implied by the rule.

Messages. As noted above, the expert modules communicate their
results to the user and to the Controller by responding to messages
sent by the Controller. There are six messages to which each Expert
module is required to respond:

The ALIVE? message asks the Expert if it is available for
consultation in this analysis. The receiving Expert resets its internal
state, and responds TRUE if it has data, FALSE if it doesn't.

The SUGGESTIONS message asks the Expert to report any
substructures it believes, on its own, to be present or absent. The report
takes the form of a list of items of evidence, each supporting the
presence or absence of a particular chemical group.

The SPECIALIZE message asserts the hypothetical presence of a
chemical group, and asks the Expert which subgroups may be present.
For example, the message "SPECIALIZE carbonyl" would cause the
receiving Expert to return evidence for or against the presence of
ketone, aldehyde, ester, amide, and other specific types of carbonyl,
under the assumption (for the moment) that the compound does in fact
contain a carbonyl group.

The TEST message asks the Expert to return any evidence it may
have against the presence of the group being tested.

The REEVALUATE message is sent when a piece of evidence
supplied by an Expert has been contradicted. It asks the Expert to
modify or retract the evidence, if possible. Many infrared
correlations have known exceptions in specific cases. For example, a nitro
group on a benzene ring raises the expected frequency ranges of the
hydrogen wags. If the presence of a nitro group is known or
suspected, the aromatic wag assignments must be reevaluated.

The EXPOUND message asks the Expert to print out, for the user's
benefit, the reasons supporting a piece of evidence. Each piece of
evidence originated in some feature of the data. The
degree of detail supplied in response to this message depends on the
individual Expert. The IR Expert, for example, can report the
infrared bands which were assigned to a particular vibrational mode of a
substructure, as well as possible alternative assignments. The STIRS
Expert reports the incidence of the substructure among the best hits
in different STIRS data classes.
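
The six-message protocol can be summarized as an interface. The Python sketch below is an assumption about how such a protocol might be expressed, with one method per message and a dispatcher keyed on the message name; the class names, method signatures, and the canned confidence values are illustrative and do not reproduce the Lisp implementation.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple

Evidence = Tuple[str, float]      # (substructure, signed confidence in percent)


class Expert(ABC):
    """Hypothetical base class: one method per Controller message."""

    @abstractmethod
    def alive(self) -> bool: ...

    @abstractmethod
    def suggestions(self) -> List[Evidence]: ...

    @abstractmethod
    def specialize(self, group: str) -> List[Evidence]: ...

    @abstractmethod
    def test(self, group: str) -> List[Evidence]: ...

    @abstractmethod
    def reevaluate(self, item: Evidence) -> List[Evidence]: ...

    @abstractmethod
    def expound(self, item: Evidence) -> str: ...

    def handle(self, message: str, *args):
        """Dispatch a Controller message to the matching method."""
        handlers: Dict[str, Callable] = {
            "ALIVE?": self.alive,
            "SUGGESTIONS": self.suggestions,
            "SPECIALIZE": self.specialize,
            "TEST": self.test,
            "REEVALUATE": self.reevaluate,
            "EXPOUND": self.expound,
        }
        return handlers[message](*args)


class CannedExpert(Expert):
    """Toy Expert answering from a fixed table (illustrative values only)."""

    def __init__(self, table: Dict[str, List[Evidence]]) -> None:
        self._table = table

    def alive(self) -> bool:
        return bool(self._table)               # TRUE only if it has data

    def suggestions(self) -> List[Evidence]:
        return [(group, 50.0) for group in self._table]

    def specialize(self, group: str) -> List[Evidence]:
        return list(self._table.get(group, []))

    def test(self, group: str) -> List[Evidence]:
        return []                              # no evidence against the group

    def reevaluate(self, item: Evidence) -> List[Evidence]:
        return []                              # retract the contradicted item

    def expound(self, item: Evidence) -> str:
        return f"{item[0]} at {item[1]:+.0f}%: canned entry"


expert = CannedExpert({"carbonyl": [("ketone", 19.0), ("ester", -10.0)]})
is_alive = expert.handle("ALIVE?")
subgroups = expert.handle("SPECIALIZE", "carbonyl")
reason = expert.handle("EXPOUND", subgroups[0])
print(is_alive, subgroups, reason)
```

Under this arrangement, adding a new Expert amounts to supplying these six methods, which mirrors the stated requirement that a new module need only respond appropriately to the messages.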

Example: 4-phenyl-2-butanone

The results of the interpretation of the gas phase IR and
low-resolution mass spectra of 4-phenyl-2-butanone are given in Figure 4. This
compound, with a molecular weight of 148, is typical of the size and
complexity of compounds which our program handles well. The IR
spectrum was taken from the EPA gas-phase IR library, and the mass
spectrum from the Registry of Mass Spectral Data.(35)

The program was run three times: first with only the STIRS
results, second with only the results of the IR interpretation, and
finally with both spectra together. All functional groups reported
by the program with confidence levels > 10% are listed. In addition,
STIRS correctly determined the molecular weight.

The most specific defined functional groups actually present in
the unknown are benzyl, monosubstituted-benzene, X-CH2CH2-X (where
the "X" represents any group other than -H or -CH2-), and
methyl-ketone. That is, the program would have achieved a perfect score had
it reported these substructures and no others. In fact, the program
was unable to determine the correct environments of the ketone and
-CH2- groups, although it reported only one incorrect substructure.
These results are consistent with our goal of reducing the rate of
false positives, at the cost of failing to report the most specific
possible substructures which are actually present. If the
low-confidence report of the presence of benzyl and X=C-CH3 groups is accepted
(Figure 4), the reported results suffice to uniquely determine the
complete structure.
The effects of the low-level combination of evidence are
illustrated by two features of the output. First, the confidence level
for the ketone group increases from 19% for the IR-only
interpretation to 30% for the combined interpretation, despite the fact that
STIRS had nothing to say about the presence of a ketone or even of a
carbonyl. This is explained by the increased confidence in
monosubstituted-benzene derived from the combined spectra, which causes a
fingerprint line tentatively assigned to an ester C-O stretch to be
reassigned to a phenyl vibration. Reducing the likelihood of an
ester group increases the likelihood that the C=O stretch is due to a
ketone group. Secondly, the contradiction between STIRS' assertion
of methyl-benzene and the IR denial not only reduces the belief in
methyl-benzene, but also allows the assertion of benzyl and
unsaturated-CH3 (X=C-CH3). These substructures were not suggested by either
spectrum taken alone.
A slightly abridged explanation offered by the program for its
belief in methyl-benzene is shown in Figure 5. There is both
positive and negative evidence. The positive evidence comes primarily
from STIRS, and the negative evidence results from the failure to
observe a medium intensity C-H stretching band expected for
methyl-benzene. A small amount of positive support for methyl-benzene is
also supplied by the IR Expert, showing that conflicts can occur
between different features of a single spectrum. The degree to which
each piece of evidence is in conflict with other evidence is noted.
The explanation facility traces the final belief back to primitive
pieces of evidence supplied by the Expert modules. The Experts are
then responsible for explaining how the evidence depends on the
observed spectrum. STIRS is unable to do more than report which of its
data classes supported the substructure and with what probability.
The IR Expert module, on the other hand, can give a richly detailed
description of the assignment of the spectrum.

Results

We have evaluated our prototype system at several levels. Each
Expert module has been tested individually. Detailed results of tests
of the STIRS program have been published by McLafferty et al.(36)
The IR Expert module was tested extensively against the EPA library.

The effects of competition among the IR rules were explored by
using the complete system, with the STIRS module disabled, to
interpret the spectra of 1807 compounds from the library. For the test,
we selected 500 of the 900 chemical substructures which both are
chemically interesting and display at least one distinctive infrared
band. Some of the selected substructures were subsets of others: for
example, alcohol, phenol, and primary alcohol were all in the test
set. As expected, some functional groups displaying very distinctive
infrared bands were detected much more reliably than others. Figure 6
shows the reliability, false positive and recall rates for a few
selected functional groups.


[Figure 3 diagram; recoverable label: 900 defined substructures.]

Figure 3. A subset of the chemical substructures database,
showing the hierarchical ordering.

[Figure 4 table for 4-phenyl-2-butanone; columns: Class, MS Only, IR Only, MS & IR. The substructure drawings and row alignment were not recoverable from the source.]

Figure 4. Substructures reported for 4-phenyl-2-butanone at > 10%
confidence, for three runs of the interpreter using different
data sets.


Why methyl-benzene?

36% POSITIVE:

    41% from STIRS (conflict 27%)

    8% from IR band at 2933 cm-1
       assuming unsaturated-C-CH3 (37%)
       (conflict 27%)

11% NEGATIVE:

    23% because of failure to satisfy
        C-Hsym-methyl-benzene-1
        IR band 2860-2883 m
        (conflict 45%)

Figure 5. Sample of the explanations provided by the program for
its conclusions. More detail about the source of the reported
conflict, the assignments of IR bands, or the data classes
responsible for the STIRS evidence can also be provided.

[Figure 6 chart: "IR results for 1807 compounds"; plotted measures: Reliability, False positives, Recall; threshold: > 45% confidence.]

Figure 6. Statistics for 5 selected substructures of the 500
tested on the EPA IR database. Values of the Reliability, False
Positives, and Recall (see text) are compared at the 45%
confidence level. The number of compounds in the database
containing each substructure is given beneath the substructure
name. Note the expanded scale used to plot the False Positive
measure.


The "recall" is the probability that a substructure present in
the unknown will be reported, while the "reliability" is the
probability that a reported substructure is actually present.(36) These
functions are defined as:

$$\mathrm{Recall}(S) = \frac{\text{Number\_correctly\_reported}(S)}{\text{Total\_number\_present}(S)}$$

$$\mathrm{Reliability}(S) = \frac{\text{Number\_correctly\_reported}(S)}{\text{Total\_number\_reported}(S)}$$

for all compounds in the database containing substructure S. Both
measures are functions of the confidence level (CL) threshold above
which we count a substructure as "reported". All substructures are
reported at CL > -100%, while none are reported at CL > +100%. We
have arbitrarily chosen CL > 45% as a threshold in Figure 6.

An alternative measure of reliability often used is the "false
positive" rate, defined as:

$$\mathrm{FP}(S) = \frac{\text{Number\_falsely\_reported}(S)}{\text{Total\_number\_present}(\mathrm{NOT}\ S)}$$

which is related to the recall and reliability measures by:

$$\mathrm{FP}(S) = \frac{\text{Total\_number\_present}(S) \cdot \mathrm{Recall}(S) \cdot \bigl(1 - \mathrm{Reliability}(S)\bigr)}{\text{Total\_number\_present}(\mathrm{NOT}\ S) \cdot \mathrm{Reliability}(S)}$$

This is the probability that a compound which does not contain
substructure S will be incorrectly reported to contain it. For
substructures which occur rarely in the database, the (1 - FP) rate will
be considerably greater than the reliability, and may be misleading.
For example, for the SO2 group (1% of the database), the FP rate was
< 8%, although the reliability was only 25% (Figure 6). That is,
although the program falsely asserted the presence of an SO2 group
(with > 45% CL) only 8% of the time, 3/4 of the assertions of SO2
were incorrect. The latter statistic is probably of more interest to
an analyst trying to evaluate the program's reports. On the other
hand, the FP is a better measure of the raw discriminating power of
the program, since it would presumably be unchanged by changing the
proportion of the target substructure in the database. The two
measures serve different functions, and should both be reported.
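
The relationship among the three measures can be checked numerically. The Python sketch below takes reliability as the fraction of reported substructures that are actually present, as the prose defines it. The counts are invented for illustration (a rare group present in 18 of 1807 compounds, reported correctly 5 times and falsely 15 times) and are not taken from the test results.

```python
def recall(correct: int, present: int) -> float:
    """Fraction of compounds containing S for which S was reported."""
    return correct / present


def reliability(correct: int, false: int) -> float:
    """Fraction of reports of S that were correct."""
    return correct / (correct + false)


def false_positive_rate(false: int, absent: int) -> float:
    """Fraction of compounds lacking S for which S was reported."""
    return false / absent


def fp_from_recall_reliability(present: int, absent: int,
                               rec: float, rel: float) -> float:
    """False positive rate recovered from recall and reliability."""
    return present * rec * (1.0 - rel) / (absent * rel)


# Illustrative counts only: a rare group in a database of 1807 compounds.
total, present = 1807, 18
absent = total - present
correct, false = 5, 15

rec = recall(correct, present)
rel = reliability(correct, false)
fp_direct = false_positive_rate(false, absent)
fp_derived = fp_from_recall_reliability(present, absent, rec, rel)

print(f"recall={rec:.3f} reliability={rel:.2f} FP={fp_direct:.4f}")
```

With these counts the reliability is 25% while the false positive rate is under 1%, which shows how a rare substructure can have a low false positive rate even though most assertions of it are wrong.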
The tradeoff between reliability and recall can be adjusted for
individual functional groups by changing the frequency ranges allowed
for the IR correlations. For some of the functional groups which are
well represented in the EPA library (e.g. esters, alcohols) we have
manually optimized the rule ranges to maximize (3 * Reliability +
Recall). Since the library is known to contain errors, and is skewed
towards the smallest (often anomalous) members of homologous series,
we have not tried to do this for all groups (e.g. SO2). Further
testing on larger libraries will allow further refinements of the IR
rules.
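
The objective (3 * Reliability + Recall) can be evaluated for candidate rule ranges as in the following Python sketch. The optimization reported above was performed manually; the automated comparison, the one-band-per-compound simplification, and all band positions and candidate ranges below are assumptions made for illustration.

```python
from typing import List, Tuple

Range = Tuple[float, float]


def score(rule_range: Range, with_group: List[float],
          without_group: List[float]) -> float:
    """Return 3 * Reliability + Recall for a rule firing on bands in the range.

    Each compound is reduced to a single band position in cm-1 (a toy
    simplification). with_group holds compounds containing the target
    group, without_group those lacking it.
    """
    low, high = rule_range
    correct = sum(low <= p <= high for p in with_group)
    false = sum(low <= p <= high for p in without_group)
    reported = correct + false
    reliability = correct / reported if reported else 0.0
    recall = correct / len(with_group) if with_group else 0.0
    return 3.0 * reliability + recall


def best_range(candidates: List[Range], with_group: List[float],
               without_group: List[float]) -> Range:
    """Pick the candidate range with the highest score."""
    return max(candidates, key=lambda r: score(r, with_group, without_group))


# Invented band positions (cm-1) for compounds with and without the group.
has_group = [1735.0, 1740.0, 1745.0, 1750.0, 1765.0]
lacks_group = [1700.0, 1715.0, 1720.0, 1760.0]
options = [(1700.0, 1770.0), (1730.0, 1770.0), (1730.0, 1755.0)]

best = best_range(options, has_group, lacks_group)
print(best, round(score(best, has_group, lacks_group), 3))
```

Weighting reliability three times as heavily as recall favors the narrowest range here, which gives up one true detection in order to avoid every false report.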
Many of the errors observed result from the consistent confusion
of two particular functional groups. For example, although the
presence of a methyl group was erroneously reported (at > 45% confidence)
for 30% of the 400 compounds which lack methyl groups, a methyl group
was reported for only 1 of the 33 compounds lacking both CH3 and CH2
groups. Conversely, the presence of a methylene group was never
incorrectly asserted for compounds which lack methyl groups.
Examination of the reasons for the confusion confirms that the C-H stretching
and HCH deformation vibrations, whose frequency and intensity ranges
are similar for methyl and methylene, are often misassigned. Such
consistent confusion between similar substructures can be dealt with
by assigning the bands to a generic -CH2X group, and deciding between
methyl and methylene only after the nearby environment has been
determined.
Average results for 500 IR-active substructures are shown in
Figure 7 at four different confidence levels. The average compound
in the database contains 8.1 of the 500 substructures. At a
confidence level of > 45%, only 1.4 (of 492) incorrect substructures are
reported, while 4.6 of 8.1 substructures actually present are
reported. In other words, a "typical" analysis will report 6.0
substructures at > 45% confidence, of which 4.6 are correct. 3.5
substructures actually present in the compound will fail to be reported. In
an actual analysis, infrared data is combined with other types of
data, so that many of the substructures undetected by infrared would
be found by other techniques.
We have analyzed over 100 unknown compounds using both the mass
spectrum and the IR spectrum in combination. The combination of the
two techniques gives substantially better results than does either
technique alone. As expected, many functional groups are
preferentially detected by one technique or the other. For example, ketone
groups are rarely detected in the mass spectrum, but are usually
correctly interpreted from the infrared. Chlorine and bromine, on the
other hand, are easily detected in the mass spectrum but often missed
by the infrared interpreter. Also, because of the interaction
between the two interpretation methods, substructures are frequently
detected by the combined techniques which are not found by either
technique alone. This can occur as a result of resolving a
contradiction between the two Experts, as in the example above, or because
one Expert is able to further specialize a result suggested by the
other. For example, in the interpretation of bis-2-chloro-ethyl-ether,
the IR Expert alone fails to detect the presence of chlorine.
When chlorine is suggested by the STIRS Expert, however, the IR
Expert correctly reports the -CH2Cl group. A few substructures, such
as non-terminal olefins, are not reliably detected in either mass or
infrared spectra. For such groups, other techniques (NMR, UV
absorption, Raman) are necessary.
In many cases, the results of the IR and mass spectrum
interpretation are sufficient to allow a complete molecular structure to
be deduced. In preliminary tests on 12 unknown compounds of
molecular weight 100-200, the author, using the results reported by the
program but without access to the original spectra, was able to
correctly identify 9 of the unknowns.

These results are encouraging, and suggest that our system in
substantially its present form could serve as a useful tool for an
analytical chemist, as well as eventually providing a framework for
completely automated identification of organic compounds.

Conclusions

We have developed an expert system which can interpret various kinds
of data and report functional groups present in an unknown organic
compound. The program employs a modular construction, which allows
each type of data to be interpreted in the most efficient way. The
conclusions derived by different modules are able to influence each
other at a low level.

[Figure 7 chart: "Average Correct and Incorrect Assertions for 1807 Compounds, Infrared Only"; horizontal axis: Confidence Level (%).]

Figure 7. Average number of substructures reported correctly
(solid color) and incorrectly (hatched) at four different
confidence levels, for IR data only. A total of 500
substructures were considered, of which an average of 8.1 were
present in each compound.

The program knows the chemical relationships between functional
groups, and can use this knowledge in its reasoning process.

The reasoning process is accessible to the user, so that each
conclusion can be traced back to the original data responsible for
it. Choices made by the program can be isolated and overridden by
a knowledgeable user.

Contradictions arising among evidence from different sources
are resolved in a natural way, using knowledge about the effects of
perturbations and common interferences on the spectra.

A rule-based infrared spectra interpreter has been developed as
a major module of the program. This module has been tested as a
stand-alone system, and in conjunction with STIRS. The low rate of
false positive assertions is encouraging, and work continues to
reduce this rate still further by incremental refinement of the
knowledge base.

In its present form, our system can provide significant
assistance to a chemist trying to identify an unknown organic compound.
Research is in progress to extend the capabilities of the program
both by expanding the number of different data sources it can handle
(NMR, UV/visible absorption spectra) and by incorporating a "molecule
builder" which assembles complete candidate structures, where
possible, from the suggested substructures.

Acknowledgments

I would like to thank Reed Letsinger and others in the Expert Systems
Department at HP Labs for helpful discussions and technical
assistance.

Literature Cited

1. Hippe, Z.; Hippe, R. Appl. Spectrosc. Reviews 1980, 16, 135-186.
2. Bally, R. W.; van Krumpen, D.; Cleij, P.; van 't Klooster, H. A.
Anal. Chim. Acta 1984, 157, 227-243.
3. Masinter, L. M.; Sridharan, N. S.; Lederberg, J.; Smith, D. H.
J. Am. Chem. Soc. 1974, 96, 7702-7723.
4. Carhart, R. E.; Smith, D. H.; Gray, N. A. B.; Nourse, J. G.;
Djerassi, C. J. Org. Chem. 1981, 46, 1708-1718.
5. Nelson, D. B.; Munk, M. E.; Gash, K. B.; Herald, D. L.
J. Org. Chem. 1969, 34, 3800.
6. Shelley, C. A.; Hays, T. R.; Munk, M. E. Anal. Chim. Acta
Computer Techniques and Optimization 1978, 103, 121-132.
7. Fujiwara, I.; Okuyama, T.; Yamasaki, T.; Abe, H.; Sasaki, S.
ibid 1981, 133, 527-533.
8. Szalontai, G.; Simon, Z.; Csapo, Z.; Farkas, M.; Pfeifer, Gy.
ibid 1981, 133, 527-533.
9. Debska, B.; Duliban, J.; Guzowska-Swider, B.; Hippe, Z. ibid
1981, 133, 303-318.
10. Dubois, J.-E.; Carabedian, M.; Dagane, I. Anal. Chim. Acta
1984, 158, 217-233.
11. Gribov, L. A.; Elyashberg, M. E.; Koldashov, V. N.; Plentnjov,
I. V. ibid 1983, 148, 159-170.

In Artificial Intelligence Applications in Chemistry; Pierce, T., el al.;


ACS Symposium Series; American Chemical Society: Washington, DC, 1986.
364 A R T I F I C I A L I N T E L L I G E N C E A P P L I C A T I O N S IN C H E M I S T R Y

12. Small, G. W.; Jurs, P. C. Anal. Chem. 1984, 56, 1314-1323.


13. Gray, Ν. A. B. A r t i f i c i a l Intelligence 1984, 22, 1-21.
14. Haraki, K. S.; Venkataraghavan, R.; McLafferty, F. W.
Anal. Chem. 1981, 53, 386-392.
15. Buchs, Α.; Schroll, G.; Duffield, A. M.; Djerassi, C.; Delfino,
A. B.; Buchanan, B. G.; Sutherland, G. L.; Feigenbaum, Ε. Α.;
Lederberg, J. J. Am. Chem. Soc. 1970, 92, 6831.
16. Ishida Y.; Sasaki, S. Computer Enhanced Spectrosc. 1983,
1, 173-184.
17. Varmuza, K. Anal. Chim. Acta 1980, 122, 227-240.
18. Zupan, J. ibid 1978, 103, 273-288.
19. Visser, T.; van der Maas, J. H. ibid 1980, 122, 363-372.
20. Smith, G.; Woodruff, H. B. J. Chem. Inf. Comp. Sci. 1984,
24, 33.
21. Gray, Ν. A. B. Anal. Chem. 1975, 47, 2426.
22. Delaney, M. F.; Denzer, P. C.; Barnes, R. M.; Uden, P. C.
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch027

Anal. Lett. 1979, 12 963-978.


23. Bink, W. G.; van 't Klooster, H. A. Anal. Chim. Acta
1983, 150, 53-59.
24. Cross, K. P.; Giordani, A. B.; Gregg, H. R.; Hoffman, P. Α.;
Beckner, C. F.; Enke, C. G. "An Automated Structure
Determination System for MS/MS Data", 190th ACS National Meeting,
Chicago, IL (1985).
25. Christie, B. D.; Munk, M. E. "Computer-assisted Structure
Elucidation Using 2-Dimensional NMR Data", 190th ACS National
Meeting, Chicago, IL (1985).
26. Buchanan, B. G.; Shortliffe, Ε. H. "Rule-based Expert Systems";
Addison-Wesley: Menlo Park, CA, 1984.
27. Jurs, P. C.; Isenhour, T. L. "Chemical Applications of Pattern
Recognition"; Wiley: New York, NY, 1975.
28. Curry, Β., manuscript in preparation.
29. Charniak, E.; McDermott, D. "Introduction to A r t i f i c i a l
Intelligence"; Addison-Wesley: Menlo Park, CA, 1985.
30. Bellamy, L. J. "The Infrared Spectra of Complex Molecules";
Chapman and Hall: London, 1975.
31. Nyquist, R. A. "The Interpretation of Vapor-Phase Infrared
Spectra", vol. 1; Sadtler Research Labs: Philadelphia, PA, 1984.
32. Socrates, G. "Infrared Characteristic Group Frequencies";
John Wiley and Sons, Ltd.: New York, NY, 1980.
33. Griffiths; et al., GC-IR Subcommittee of the Coblenz Society
Evaluation Committee, Appl. Spectrosc. 1979, 33, 543.
34. de Haseth, J., Chemistry Dept., Univ. of Georgia, Athens, GA,
personal communication.
35. "Registry of Mass Spectral Data"; Electronic Data Division,
Wiley: 605 Third Ave., New York, NY 10158.
36. Dayringer, H. E.; McLafferty, F. W. Org. Mass Spectrosc.
1976, 11, 543-551.

RECEIVED December 17, 1985

28

Concerted Organic Analysis of Materials and
Expert-System Development

S. A. Liebman¹, P. J. Duff¹, M. A. Schroeder¹, R. A. Fifer¹, and A. M. Harper²

¹U.S. Army Ballistic Research Laboratory, Aberdeen Proving Ground, MD 21005-5066
²Chemistry Department, University of Texas at El Paso, El Paso, TX 79968-0513
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch028

A prototype multilevel expert system network
has been developed for application to materials
characterization. Selected analytical
instruments generate databases which are
treated and interpreted within an analytical
strategy toward a desired goal. Using a
commercial expert system shell, TIMM, a linked
network of expert systems, EXMAT, has been
developed. The expertise of a chemometrician
is embedded within the network at the data
analysis and interpretation stages as a linked
expert system, EXMATH. For general chemical
analysis, expert systems capable of symbolic
and numeric processing appear necessary to
provide integrated decision structures using
data generated from appropriate instruments and
sensors. Final implementation of EXMAT will
demonstrate the potential significance of
artificial intelligence (AI) in analytical
chemistry with varied intelligent laboratory
and process instrumentation.

Requirements for high-performance materials have focused on the
ability to relate structure/composition to end-use behavior.
Analytical instrumentation designed over the past decade has made
impressive advances in defining the composition of complex
polymeric systems, including detailed description of polymer
chemical microstructure. Concerted organic analysis has been
followed since the early '70s (1-3), including multivariate
profile analysis of gas chromatographic (GC) patterns (4), and
computer simulations of GC/spectral patterns to aid interpretation
(5). Work reported in 1968 (6,7) included automated data
acquisition and computer-aided interpretation from multiple
analytical spectrometers (mass, nuclear magnetic resonance (NMR),
infrared (IR), and ultraviolet (UV)). The four spectrometers were
tied to individual computers which fed data into a central

0097-6156/86/0306-0365$06.00/0
© 1986 American Chemical Society


computer programmed for structure elucidation based on a
combination of all four types of data. Most recently,
applications of computer models that describe relationships
between chemical, physical, and mechanical responses were
described by Kaelble (8). Many correlations between chemical
structure and polymer composite performance have been established
over the past decade within the industrial R&D community.
Kaelble's work emphasizes the significance of modern
characterization methods for this purpose.

Concurrently, pattern recognition programs were developed as
interpretive aids along with comprehensive experimental design,
factor analysis, and other statistical approaches within the
chemometrics field (9-15). Only within the past few years have the
precision and high reproducibility of appropriate key
instrumentation made possible realistic applications for materials
analysis. Microprocessor-based chromatographic, pyrolysis/concentrator,
thermal, and spectral instrumentation are combined
with chemometric tools to provide chemically significant
information as "intelligent instruments" become available (16).
These advances are allied to highly automated hardware common in
clinical labs and computer-controlled process equipment (17-19).

Automated calibration and data-handling methods have been
integral parts of commercial analytical systems for many years, as
well as embedded software to automate complex pneumatic/electronic
sequences in concentrator and chemical reactor instrumentation
using on-line GC analysis (20). Recently, a commercial high
pressure liquid chromatograph (HPLC) system (21) demonstrated
adaptive intelligence to optimize separations for complex sample
mixtures. The optimization program, OPTIM II, initially queries
the chromatographer and then performs a sequence of automated
steps. Likewise, library search algorithms (22-28), pattern
recognition (29-36), and optimization (37-42) methods have
been developed in numerous laboratories.

The well-known DENDRAL and META-DENDRAL programs (43) are
noted as the major AI success in chemical applications over the
past decade. However, advances in analytical technology and
computer capabilities have led to new approaches (44-56).
Information fusion from selected instrumental tools often is a
more productive route than exhaustive data analysis from a single
source. Furthermore, combination of chromatographic separation
with spectral, thermal, and microchemical analyses can be
realistically achieved in many laboratories. Generalizing and
documenting this trend using an AI approach seemed appropriate at
this time.

Results and Discussion

General. We have studied the characterization of multicomponent
materials by combining modern analytical instrumentation with a
commercially available AI expert system development tool.
Information generated from selected analytical databases may be
accessed using TIMM ("The Intelligent Machine Model"), available
from General Research Corp., McLean, VA. This Fortran expert
system shell has enabled development of EXMAT, a heuristically
linked network of expert systems for materials analysis.


An important aspect of our AI application is the attention
paid to including well-established Fortran programs and database
search methods into the decision structure of an expert system
network. Only certain AI software tools (such as TIMM)
effectively handle this critical aspect for the analytical
instrumentation field at this time (57-60). The ability to
combine symbolic and numeric processing appears to be a major
factor in development of multilevel expert systems for practical
instrumentation use. Therefore, the expert systems in the EXMAT
linked network access factor values and the decisions from EXMATH,
an expert system with chemometric/Fortran routines which are
appropriate to the nature of the instrumental data and the
information needed by the analyst. Pattern recognition and
correlation methods are basic capabilities in this field.

TIMM - The Intelligent Machine Model. The expert system shell,
TIMM, is a frame-like system which employs an analogical partial
match inferencing procedure, similar to a forward-chaining process
when the explicit linking method is followed. Partial match
inferencing, as proposed by Hayes-Roth and Joshi (61), means
matching on a subset of clauses in individual rules. Analogical
inferencing uses similarity, as well as exactitude, to match rule
clauses. Thus, TIMM effectively uses incomplete and approximate
knowledge in a supervised learning format. The created expert
system is divided into two sections: (1) a decision structure
with ordered input factors and values, and (2) the knowledge base
containing rules that are displayed to the user in an "if, then"
format. A set of test conditions is compared to those contained
in the knowledge base and a weighted similarity metric is
applied. A variation of the nearest neighbor search algorithm is
used for pattern-matching.
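
The partial-match idea described above can be made concrete with a small sketch. The Python below is illustrative only and is not TIMM's algorithm or interface: the scoring scheme, the factor weights, and the names `clause_similarity`, `rule_score`, and `best_rule` are assumptions, and the two rules are abbreviated from those listed in Figure 3.

```python
# Hypothetical sketch of analogical partial-match inferencing.
# Not TIMM code: weights, similarity scoring, and names are assumptions.

# Linearly ordered values let "near" matches score above zero.
SAMPLE_AMT_ORDER = ["UNLIMITED", "GM", "MG", "MICROGM", "TRACE"]

RULES = [
    {"if": {"SCOPE": "SCREEN", "SAMPLE AMT": "GM", "SAMPLE FORM": "MULTIMEDIA"},
     "then": {"GC/SYS1": 50, "FTIR/SYS2": 50}},
    {"if": {"SCOPE": "TRACE DETECT", "SAMPLE AMT": "MG", "SAMPLE FORM": "POWDER"},
     "then": {"GC/SYS1": 30, "MS/SYS3": 70}},
]

# Assumed factor weights; a real system would obtain these from training.
WEIGHTS = {"SCOPE": 3.0, "SAMPLE AMT": 1.0, "SAMPLE FORM": 1.0}


def clause_similarity(factor, rule_value, test_value):
    """Return 1.0 for an exact match, partial credit for close ordered values."""
    if rule_value == test_value:
        return 1.0
    if factor == "SAMPLE AMT":
        distance = abs(SAMPLE_AMT_ORDER.index(rule_value)
                       - SAMPLE_AMT_ORDER.index(test_value))
        return 1.0 - distance / (len(SAMPLE_AMT_ORDER) - 1)
    return 0.0  # unordered values either match or do not


def rule_score(rule, case):
    """Weighted similarity over the clauses the test case actually supplies."""
    total = matched = 0.0
    for factor, rule_value in rule["if"].items():
        if factor not in case:      # partial match: skip unknown clauses
            continue
        weight = WEIGHTS[factor]
        total += weight
        matched += weight * clause_similarity(factor, rule_value, case[factor])
    return matched / total if total else 0.0


def best_rule(case):
    """Nearest-neighbor search: the rule with the highest similarity wins."""
    return max(RULES, key=lambda rule: rule_score(rule, case))


case = {"SCOPE": "TRACE DETECT", "SAMPLE AMT": "MICROGM"}  # SAMPLE FORM unknown
winner = best_rule(case)
print(winner["then"], rule_score(winner, case))
```

The test case omits one factor and supplies a sample amount that no rule states exactly, yet a best rule is still selected, which is the behavior the text attributes to incomplete and approximate knowledge.
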
Heuristically linked individual expert systems (ES) are
prepared using implicit and/or explicit linking methods to permit
processing of "microdecisions" that are part of more complex
"macrodecisions". The prototype EXMAT was developed using an
implicit linking procedure wherein the decision choices of one ES
become the first ordered factor/values of another ES. Prior to
linking, each system is independently built, trained, exercised,
checked for consistency and completeness, and then generalized.
Terse or verbose explanations may be included, as well as decision
confidence levels that are trained into the system by the domain
experts. TIMM is domain independent, permitting expert systems to
be readily developed in fields wherein expertise exists. EXMAT
was developed within this protocol with the important advantage
that TIMM ES can be embedded in routines which are basic to the
analyst's problem-solving capability and accessed using the
advanced REASON subroutine developed by General Research
Corporation.
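
Implicit linking, as described above, can be pictured as passing the decision of one expert system to the next as its first factor. The following Python sketch is hypothetical and does not reproduce TIMM or the REASON subroutine: the two toy decision functions and the lookup table are invented for illustration, using choice and factor names that appear in Figures 2 and 4.

```python
# Hypothetical sketch of implicit linking between two expert systems (ES).
# The decision logic and lookup table are invented, not taken from EXMAT.

def es1_analytical_strategy(factors):
    """Toy ES #1: map problem-definition factors to an analytical strategy."""
    if factors.get("SCOPE") == "TRACE DETECT":
        return "MS/SYS3"
    return "GC/SYS1"


def es2_instrument_config(factors):
    """Toy ES #2: its first factor is the decision made by ES #1."""
    table = {
        ("GC/SYS1", "DIRECT GC/FID/TCD"): "GCSYS1/A",
        ("MS/SYS3", "SIM"): "MSSYS3/E",
    }
    key = (factors["ANALY STRATEGY"], factors["CONFIG"])
    return table.get(key, "NO DECISION")


def run_linked(problem, config):
    """Pass the ES #1 decision to ES #2 as its first ordered factor."""
    strategy = es1_analytical_strategy(problem)
    es2_factors = {"ANALY STRATEGY": strategy, "CONFIG": config}
    return strategy, es2_instrument_config(es2_factors)


print(run_linked({"SCOPE": "TRACE DETECT", "SAMPLE AMT": "MG"}, "SIM"))
```

Each function can be built and checked on its own before the two are chained, which mirrors the requirement that every ES be independently trained and exercised prior to linking.
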

EXMAT - A Linked Network of Expert Systems for Materials Analysis.
Seven individual expert systems comprise EXMAT: (1) problem
definition and analytical strategy; (2) instrumental configuration
and conditions; (3) data generation; (4) chemometric/search
algorithms; (5) results; (6) interpretation; (7) analytical goals.
Dynamic headspace (DHS)/GC and pyrolysis GC (PGC)/concentrators


interfaced to Fourier transform infrared (FTIR) and mass spectral
(MS) detectors, combined with HPLC, thermal, and elemental
analyses have been chosen in this approach for composite materials
characterization. Generation of databases in the prototype EXMAT
system will focus on the specific domain of propellants and
polymer composites. However, the general concept of integrating
information from relevant databases emulates the actions of a
pragmatic problem-solver in many domains. Clearly, the specific
analytical strategy, instrumental configurations, databases, and
interpretive aids must be developed accordingly (8). EXMAT
illustrates the inherent potential of combining intelligent
instrumentation with AI symbolic processing in a problem-solving
format.

Figure 1a outlines the decision and control structure of
TIMM; Figure 1b, the expert systems network; and Figure 1c, the
overall decision structure of EXMAT. Expert System (ES) #1 is
given (in part) in Figure 2 showing the decision choices and
factors/values needed to establish the problem definition and
analytical strategy. Analytical systems included in the strategy
specifically emphasize those tools available in the Ballistic
Research Laboratory and which have a proven capability of
generating precise, reproducible data on a wide variety of
materials. Therefore, the analyst may select the combination of
instrumentation (chromatographic, spectrometric, thermal, or
elemental) dependent on the scope of the problem, nature of the
information needed, details of the samples involved, and the
available analytical tools and methods. There are approximately
85 rules in the knowledge base of ES #1 at this time, four of
which are shown in Figure 3.

Figure 4 outlines a portion of ES #2 for choice of the
specific instrumental configuration and conditions which are
indicated by the decisions and factors provided in ES #1. This is
a critical step, since the databases generated in ES #3 must be
directly correlated to the specific instrumental configuration and
conditions in ES #2 for the concerted analysis of samples,
references, etc.; e.g., pattern comparisons between analyses with
specialty GC detectors (FID-flame ionization, TCD-thermal
conductivity, NPD-nitrogen/phosphorus, PID-photoionization). This
stage focuses on the attributes of modern analytical
instrumentation: flexible, modular, microprocessor/computer-controlled
hardware that can be readily interfaced for efficient
data-acquisition and handling. ES #2 also emphasizes varied
sample processing, such as pyrolysis and dynamic headspace, in
order to analyze materials which cannot be introduced directly
into the chromatographic or spectral systems. Also, instrumental
methods designed for trace organic analysis or for sample-limited
studies are important capabilities. The instrumental
configurations are grouped into six major systems - Sys 1-GC, Sys
2-FTIR, Sys 3-MS, Sys 4-HPLC, Sys 5-Thermal, and Sys 6-Elemental.

ES #3 dictates the selected databases and sample-tracking
mechanism that are based on the decisions of ES #2. For example,
data obtained using a direct FTIR method as suggested in the
decisions of ES #1 and #2 would be put into the FTIR database
under D conditions. However, a sample examined with the GC-FTIR
configuration would be entered into the GC-FTIR database with


[Figure 1a diagram: boxes labeled "TIMM EXPERT SYSTEM BUILDER" and "USER APPLICATION SYSTEM"]

Figure 1a. Expert system shell - TIMM.


EXMAT
  ESTABLISH FRAMEWORK FOR INTEGRATING ES
  DECISIONS AND ACTIONS TO-BE-TAKEN

REASON
  A SUBPROGRAM PRODUCED BY GRC
  ENABLES
  (A) APPLICATION TO CALL ES AS A LINKED SUBROUTINE
  (B) AN ES TO CALL AND UTILIZE DATA FROM OTHER FILES/PROGRAMS
  (C) TIMM ES TO PASS A DECISION TO AN ACTION-TO-BE-TAKEN
      COMPONENT OF THE PROGRAM

EXMATH
  CALLS USER DEFINED DATA FILE FOR CHEMOMETRICS
  DECISION INVOKES MATH SUBROUTINE AND ACCEPTS EXMAT DECISIONS
  CONVERTS MATH RESULTS TO FACTOR VALUES FOR INPUT TO TIMM
  EXPERT SYSTEMS

Figure 1b. An expert system network.


A LINKED NETWORK OF EXPERT SYSTEMS FOR MATERIAL ANALYSIS

ES #1 ANALYTICAL STRATEGY FOR DEFINED PROBLEM
ES #2 INSTRUMENTAL CONFIGURATION/CONDITIONS
ES #3 DATABASE GENERATION
ES #4 DATA TREATMENT
ES #5 DATA RESULTS
ES #6 DATA INTERPRETATION
ES #7 ANALYTICAL GOAL

Figure 1c. Development of EXMAT.


DECISION:

ANALY STRATEGY
  Choices:
    GC/SYS1
    FTIR/SYS2
    MS/SYS3
    LC/SYS4
    TA/SYS5
    EL/SYS6

FACTORS:

SCOPE
  Type of Values: Unordered Descriptive Phrases
  Values:
    SCREEN
    TIME/FUND LIMIT
    QUAL/QUANT
    QUANT
    PURITY
    VOLATILES
    TRACE DETECT
    KINETICS
    MECHANISM
    CORRELATION
    R&D

SAMPLE AMT
  Type of Values: Linearly-Ordered Descriptive Phrases
  Values:
    UNLIMITED
    GM
    MG
    MICROGM
    TRACE

SAMPLE FORM
  Type of Values: Unordered Descriptive Phrases
  Values:
    POWDER
    BULK
    SEMISOLID
    LIQUID
    FILM/LAMIN
    FIBER
    MULTIMEDIA

Figure 2. Partial decision structure of ES #1 analytical strategy.


Rule 17

If:
  SCOPE IS SCREEN
  SAMPLE AMT IS GM
  SAMPLE FORM IS MULTIMEDIA
  SAMPLING PROCESS IS RANDOM
  SAMPLE HISTORY IS UNKWN
  INSTR. AVAIL IS NO LC
Then:
  ANALY STRATEGY IS GC/SYS1(50)
                    FTIR/SYS2(50)

Rule 18

If:
  SCOPE IS TRACE DETECT
  SAMPLE AMT IS MG
  SAMPLE FORM IS POWDER
  SAMPLING PROCESS IS STATIC
  SAMPLE HISTORY IS DEGRADATION
  INSTR. AVAIL IS NO LC
Then:
  ANALY STRATEGY IS GC/SYS1(30)
                    MS/SYS3(70)

Rule 19

If:
  SCOPE IS QUANT
  SAMPLE AMT IS TRACE
  SAMPLE FORM IS FILM/LAMIN
  SAMPLING PROCESS IS STATIC
  SAMPLE HISTORY IS UNKWN
  INSTR. AVAIL IS NO METHOD
Then:
  ANALY STRATEGY IS MS/SYS3(100)

Rule 20

If:
  SCOPE IS TRACE DETECT
  SAMPLE AMT IS TRACE
  SAMPLE FORM IS FILM/LAMIN
  SAMPLING PROCESS IS RANDOM
  SAMPLE HISTORY IS DEGRADATION
  INSTR. AVAIL IS NO ELEM
Then:
  ANALY STRATEGY IS GC/SYS1(20)
                    MS/SYS3(80)

Figure 3. Typical rules in ES #1.


DECISIONS:

EXPTL CONFIG
  Choices:
    GCSYS1/A
    GCSYS1/ABC
    FTIRSYS2/D
    FTIRSYS2/ABCD
    MSSYS3/E
    MSSYS3/ABCE
    LCSYS4/FIK
    LCSYS4/GJK
    LCSYS4/FIL
    LCSYS4/GJL
    TASYS5/M
    TASYS5/N
    TASYS5/O
    TASYS5/P
    ELSYS6/Q
    ELSYS6/R

FACTORS:

ANALY STRATEGY
  Type of Values: Unordered Descriptive Phrases
  Values:
    GC/SYS1
    FTIR/SYS2
    MS/SYS3
    LC/SYS4
    TA/SYS5
    EL/SYS6

GC CONFIG
  Type of Values: Unordered Descriptive Phrases
  Values:
    DIRECT GC/FID/TCD
    DIRECT GC/FID/NPD
    DHS/FID/TCD
    DHS/FID/NPD
    PGC/FID/TCD
    PGC/FID/NPD
    DHS/PGC/FID/TCD
    DHS/PGC/FID/NPD

FTIR CONFIG
  Type of Values: Unordered Descriptive Phrases
  Values:
    DIRECT
    MICROSAMPLING
    DRIFT
    ATR
    VARIABLE T
    DHS/FTIR
    GC-FTIR
    DHS/GC-FTIR
    PGC-FTIR
    DHS/PGC-FTIR

MS CONFIG
  Type of Values: Unordered Descriptive Phrases
  Values:
    RIC
    SIM
    PYROL/MS
    DUG/MS
    GC-MS/PID

Figure 4. Partial decision structure of ES #2 configuration.


conditions designated AD, ABD, or ABCD. In the fully documented
ES form, the conditions A/B/C/etc. will be described in
appropriate detail for the user and accessed by using the "verbose
version" from the menu. Format for instrumental database
generation and management was aided by the work reported earlier
by R. Crawford, C. Wong, and coworkers at Lawrence Livermore
Laboratory (62,63). Additionally, data report transfer from GC
data stations to the host VAX-VMS system was aided by recent work
reported from Argonne National Laboratory (64,65).

Data treatment in ES #4 incorporates chemometric methods
available for chromatographic or spectral analysis: preprocessing
of data, normalization, smoothing, deconvolution, optimization,
fingerprinting, pattern recognition, factor analysis (eigenvector
and canonical methods), and other appropriate routines. The
latter have been purchased or incorporated from the literature;
e.g., PAIRS, an infrared interpretive program by H. Woodruff and
coworkers (66), and the MS library/search programs provided by
Hewlett-Packard for their MS systems. These searches provide a
"hit list" from the respective libraries and some additional
options for spectral interpretation.
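
Two of the preprocessing steps named above, normalization and smoothing, are simple enough to sketch. The Python below is a minimal illustration under stated assumptions, not the routines used in ES #4: the text does not say which normalization or smoothing method was applied, so unit-sum scaling and a centered moving average are stand-ins, and the input values are made up.

```python
# Minimal sketch of two preprocessing steps; not the EXMAT routines.
# Unit-sum normalization and a moving average are assumed choices.

def normalize_to_unit_sum(signal):
    """Scale a chromatogram or spectrum so its intensities sum to 1."""
    total = sum(signal)
    if total == 0:
        raise ValueError("cannot normalize an all-zero signal")
    return [value / total for value in signal]


def moving_average(signal, window=3):
    """Smooth with a centered moving average; edges use a shorter window."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


raw = [0.0, 2.0, 8.0, 2.0, 0.0, 4.0]   # made-up detector intensities
profile = normalize_to_unit_sum(moving_average(raw))
print([round(p, 3) for p in profile])
```

Normalizing after smoothing keeps profiles from samples of different size on a common scale, which matters when patterns are later compared or classified.
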
Our linked pattern recognition expert system, EXMATH,
operates on given databases via the preprocessing, data
manipulation, classification, factor analysis, or plotting
packages as driven within EXMAT. For example, the classification
package includes linear discriminant analysis, regression
analysis, principal component category analysis, nonlinear mapping,
and nearest neighbors analysis. The factor analysis package
provides loading extraction, factor scores, factor rotations, and
canonical correlation analysis.

The results of data treatment are documented and evaluated in
ES #5 and the interpretation in ES #6 is guided by the analyst's
constraints and requirements. For instance, simple visual pattern
comparisons may be acceptable for sample identification, or a
combined database (GC-FTIR/GC-MS), (PGC/FTIR), (GC/TA), etc.,
analysis may be required. Judgmental decisions must be trained
into the system as to depth of analysis, its acceptability and
reliability (e.g., the hit quality index (HQI) of the MS search
combined with that from the FTIR search may confirm within a 95%
confidence level the GC peak or sample identity).
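
The text gives no formula for combining the two hit quality indices, so the following Python is only one hypothetical way to do it. It assumes each HQI has already been scaled to the range 0 to 1 with 1.0 as the best match, ranks compounds found by both searches by the geometric mean of their indices, and applies a cutoff of 0.95 that is an illustrative threshold rather than a statistical confidence level. The hit lists are made up.

```python
# Hypothetical sketch: combine two library-search hit lists.
# Assumes each HQI is scaled to 0..1 with 1.0 the best match; the
# geometric-mean rule and the 0.95 cutoff are illustrative choices only.
from math import sqrt


def combined_hits(ms_hits, ftir_hits):
    """Rank compounds found by both searches by the geometric mean of HQIs."""
    shared = set(ms_hits) & set(ftir_hits)
    scored = [(sqrt(ms_hits[name] * ftir_hits[name]), name) for name in shared]
    return sorted(scored, reverse=True)


def confirmed_identity(ms_hits, ftir_hits, cutoff=0.95):
    """Return the top shared compound if its combined score meets the cutoff."""
    ranked = combined_hits(ms_hits, ftir_hits)
    if ranked and ranked[0][0] >= cutoff:
        return ranked[0][1]
    return None


# Made-up hit lists for one GC peak.
ms = {"toluene": 0.97, "ethylbenzene": 0.91, "styrene": 0.60}
ftir = {"toluene": 0.96, "ethylbenzene": 0.80, "o-xylene": 0.85}
print(confirmed_identity(ms, ftir))
```

Requiring a compound to appear in both hit lists is what makes the combination stronger than either search alone; how strict the cutoff should be is exactly the kind of judgmental decision the text says must be trained into the system.
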
Finally, ES #7 incorporates the interpretive results of these
treatments to direct the analyst toward the designated analytical
goal(s) via implicit/explicit linking mechanisms. The final goal
(structure, composition, mechanism, kinetics, correlation,
experimental design analysis, or library extension) is approached
by incorporating the earlier decision/choices of ES #1-6 for
evaluation in the decision structure of ES #7. Some procedures
may be straightforward; e.g., a screening analysis with a single
instrument/configuration generates a sample pattern that visually
matches a known reference to the satisfaction of the analyst.
Other studies involving several instrumental systems (in our
scenario, chromatographic, spectral, thermal, or elemental) may
require feedback from several interpretive results. Since TIMM is
easily modified, the final form of EXMAT will likely be improved
over that described for this prototype; i.e., including explicit
and implicit linking mechanisms.


EXMATH - An Expert System for Pattern Recognition. A prototype
expert system for pattern recognition and data analysis, EXMATH,
has been developed to embed a chemometrician's expertise into an
accessible form for researchers. The selected library of
subroutines developed over the past ten years comprises a portion
of the EXMATH program to permit an integrated expert systems
approach (Figures 5 and 6).

For each analytical system, expert system drivers were
written which control data input to and operation of the
algorithm. A second, more intelligent set of drivers: (1) receive
input in the form of a decision from the external expert system
network; (2) collect the necessary subroutines for a heuristic
algorithm to solve data questions; (3) inspect the validity of the
input data; (4) drive the algorithm; and (5) transfer the results
via GRC's REASON algorithm back to the external expert network for
future decision-making. For example, if a least squares
regression on the data file is called by the external expert
network, the EXMATH system inspects the input data, drives the
regression under jackknifing protocols, and collects variable,
residual, and fit correlation results for analysis by the other
expert system modules. The procedure is implemented and executed
without any mathematical expertise from the user.
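
The least squares example can be sketched in a few lines. The Python below is a minimal illustration and not the EXMATH routine: it assumes a straight-line model and a leave-one-out jackknife, neither of which the text specifies, and the calibration data are made up.

```python
# Minimal sketch of a straight-line least squares fit under a
# leave-one-out jackknife. Not the EXMATH routine; the model form
# and the jackknife variant are assumptions.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def jackknife_residuals(xs, ys):
    """Refit with each point left out and predict the omitted point."""
    residuals = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        residuals.append(ys[i] - (slope * xs[i] + intercept))
    return residuals


conc = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up calibration levels
area = [2.1, 3.9, 6.2, 7.8, 10.1]    # made-up detector responses
slope, intercept = fit_line(conc, area)
press = sum(r ** 2 for r in jackknife_residuals(conc, area))
print(slope, intercept, press)
```

The sum of squared jackknife residuals (`press` here) measures how well the fit predicts points it has not seen, which is the kind of fit result a driver could hand back to the expert system network as a factor value.
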

Summary

Development of a linked network of expert systems, EXMAT, has been
described for application to materials characterization. Selected
instruments which are common to modern laboratories generate
databases that are treated and interpreted within an analytical
strategy directed toward a desired goal. Extension to other
problem-solving situations may use the same format, but with
specialized tools and domain-specific libraries. Importantly, a
chemometrician's expertise has been embedded into EXMAT through
access to information derived from a linked expert system,
EXMATH. Figures 7 and 8 outline this multilevel expert systems
approach developed for application of selected analytical
instruments to the field of materials science.

Additionally, use of a commercial AI shell for expert system
development has been demonstrated without the need to learn
computer programming languages (C, Pascal, LISP or any of its
variations) or to have an intermediary knowledge engineer.
Although this development effort of 4-5 man months was on a
minicomputer, adaptation of EXMAT to the microcomputer version of
TIMM is anticipated. The completed implementation of EXMAT will
support the belief that AI combined with intelligent
instrumentation can have a major impact on future analytical
problem-solving.

In general, it appears that expert systems which combine
symbolic/numeric processing capabilities are necessary to
effectively automate decision-making in applications involving
analytical and process instrumentation/sensors. Furthermore,
these integrated decision structures will likely be embedded
(67-69) within the analytical or process units to provide fully
automated pattern recognition/correlation systems for future
intelligent instrumentation.


EXAMPLE - ALGORITHM BUILDING - EXSPDS

PURPOSE
  - EMULATION OF SPSS PROCEDURE FOR DISCRIMINANT ANALYSIS
  - USED IN ANALYSIS OF VARIANCE MODE
  - PRODUCES DATA MAPPING OF SPACE OF SAMPLE REPLICATE
    VARIATION ABOUT SAMPLE MEANS

OPERATION
  - INSPECTS INPUT DATA FOR
      (1) PRIOR PREPROCESSING
      (2) NECESSITY OF RANK REDUCTION PRIOR TO ANALYSIS
  - SCALES DATA IF NEEDED
  - PERFORMS FACTOR ANALYSIS REDUCTION IF NEEDED
  - COMPUTES SAMPLE MEANS AND ARRANGES DATA AS A TRAINING SET
    OF MEAN VECTORS AND TEST SET OF REPLICATE VECTORS
  - PROJECTS BY FACTOR ANALYSIS OF MEAN VECTORS
  - REPRODUCES VARIABLE WEIGHTS FOR PROJECTION AND FURTHER
    ANALYSIS

Figure 5. EXMATH - heuristic design.
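
One step listed in Figure 5, computing sample means and arranging the data as a training set of mean vectors and a test set of replicate vectors, can be sketched as follows. This Python is a minimal illustration, not the EXSPDS routine: the function name `split_means_and_replicates` and the (sample_id, vector) input layout are assumptions, and the measurements are made up.

```python
# Minimal sketch of the "sample means" step; not the EXSPDS routine.
# The input layout and function name are assumptions for illustration.
from collections import defaultdict


def split_means_and_replicates(measurements):
    """Build a training set of mean vectors and a test set of replicates.

    `measurements` is a list of (sample_id, vector) pairs, where replicate
    analyses of one sample share the same sample_id.
    """
    groups = defaultdict(list)
    for sample_id, vector in measurements:
        groups[sample_id].append(vector)

    training = {}
    for sample_id, vectors in groups.items():
        count = len(vectors)
        # Average each variable (column) across the replicates.
        training[sample_id] = [sum(col) / count for col in zip(*vectors)]

    test = list(measurements)   # every replicate is kept for projection
    return training, test


data = [
    ("A", [1.0, 4.0]), ("A", [3.0, 6.0]),
    ("B", [10.0, 0.0]), ("B", [12.0, 2.0]), ("B", [14.0, 4.0]),
]
means, replicates = split_means_and_replicates(data)
print(means)
```

Projecting the replicates into a space defined by the means shows how tightly each sample's replicates cluster about its own mean, which is the mapping Figure 5 describes.
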


EXTGRT
  - INSPECTS INPUT DATA MATRICES FOR PREPROCESSING TASKS
  - LOCATES TARGET AND MERGES/SORTS FILE FOR INPUT TO DATA ANALYSIS
  - PERFORMS FACTOR ANALYSIS IF NEEDED
  - LEAST SQUARES ROTATION TO TARGET OR HYPOTHESIS
  - RECONSTRUCTION OF MEASUREMENT INFORMATION MATRIX TO REFLECT CORRELATIONS

WHY?
  (1) DECONVOLUTION OF COMPONENTS IN MIXTURES
  (2) HYPOTHESIS TESTS ON INTERPRETATION OF RESPONSES
  (3) SPECTRAL MATCHING TO REFERENCE RESPONSES
  (4) LEAST SQUARES REGRESSION MODELING WITH "NOISE FILTERING"
  (5) DETERMINATION OF FUNDAMENTAL PHYSICAL FACTORS UNDERLYING SAMPLE MEASUREMENT RESPONSES

Figure 6. Target rotation - subroutine in EXMATH.
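
The least-squares rotation step in Figure 6 can be sketched as follows. This is an illustration only, not the EXTGRT routine: the function names and the small data set are hypothetical, and the sketch assumes that the columns of the factor matrix are linearly independent. Given a matrix of abstract factors and a test vector (the target or hypothesis), it finds the transformation vector that best reproduces the target and reports the root-mean-square misfit, which indicates whether the hypothesis is compatible with the factor space.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def target_rotate(factors, target):
    """Least-squares vector t minimising |factors . t - target|, via normal equations."""
    k = len(factors[0])
    ata = [[sum(row[i] * row[j] for row in factors) for j in range(k)]
           for i in range(k)]
    atb = [sum(row[i] * y for row, y in zip(factors, target)) for i in range(k)]
    t = solve(ata, atb)
    predicted = [sum(f * w for f, w in zip(row, t)) for row in factors]
    rms = (sum((p - y) ** 2 for p, y in zip(predicted, target)) / len(target)) ** 0.5
    return t, predicted, rms


# Invented example: four samples described by two abstract factors.
factors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
good_target = [2.0, 3.0, 5.0, -1.0]   # lies in the factor space
poor_target = [1.0, 1.0, 1.0, 1.0]    # does not lie in the factor space

t_good, _, rms_good = target_rotate(factors, good_target)
t_poor, _, rms_poor = target_rotate(factors, poor_target)
```

A misfit near zero, as for `good_target`, supports the hypothesis that the target is a real underlying factor. A large misfit, as for `poor_target`, argues against it.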


EXMAT

A LINKED NETWORK OF EXPERT SYSTEMS
WITH PATTERN RECOGNITION AND SEARCH PROGRAMS
FOR MATERIALS CHARACTERIZATION

COMPONENTS AND ATTRIBUTES

1. DATABASE MANAGEMENT
   A. STORAGE OF PARAMETERS AND DATA ON SAMPLES FOR SELECTED INSTRUMENTAL TECHNIQUES
   B. RETRIEVAL OF SELECTED SAMPLES FORMING A DATA SET FORMATTED FOR MULTIVARIATE ANALYSIS
   C. CREATE, ADD, DELETE, HELP AND SHOW FUNCTIONS

2. EXPERT SYSTEMS AND EMBEDDING SUBPROGRAMS - TIMM
   A. FORTRAN SOURCE CODE
   B. EMBEDDING OF TIMM SYSTEM WITHIN USER PROGRAMS
   C. CAPABLE OF HANDLING METRIC AND NON-METRIC INFORMATION
   D. HEURISTIC DESIGN

3. PATTERN RECOGNITION EXPERT SYSTEM - EXMATH
   A. HEURISTIC DESIGN
   B. SUPERVISED AND UNSUPERVISED PATTERN RECOGNITION, FACTOR ANALYSIS, PLOTTING
   C. EXPERTISE INCLUDES DATA PREPROCESSING AND EVALUATION OF RESULTS
   D. USER INTERVENTION FOR DATABASE MODIFICATIONS
   E. IMPLEMENTABLE AS JACKKNIFING PROCEDURE

4. SPECTRAL MATCH ALGORITHMS, PARTIAL INTERPRETATION AIDS
   A. PAIRS - INFRARED SPECTRA
   B. PBM - MASS SPECTRA

Figure 7. EXMAT - a linked network of expert systems.


[Figure 8 diagram. Recoverable labels: PROBLEM STATEMENT OR HYPOTHESIS; MEASUREMENTS; SOLUTION OR DATA. LABORATORY AUTOMATION USING EXPERT SYSTEMS DRIVERS. EXPERT SYSTEMS: EXPERIMENTAL DESIGN; INSTRUMENT 1, INSTRUMENT 2, ... INSTRUMENT N, each with CONTROL, OPTIMIZATION, PREPROCESSING, INTERPRETATION; DATABASE MANAGEMENT EXPERT SYSTEM; DATA ANALYSIS, INTERPRETATION.]

Figure 8. Multilevel expert systems outline. Automated
experimental design and decision-making.



Acknowledgments

Contributions to data transfer and computer interfacing by
J. Romanski, K. Fickie, and R.M. Cahoon at the Ballistic Research
Laboratory are gratefully acknowledged. We appreciate the
cooperation provided by the General Research Corporation and
discussions with M.J. Aiken during this development effort.

Literature Cited

1. Liebman, S.A. Amer. Lab., 1971, 18.


2. Liebman, S.A.; Ahlstrom, D.H.; Quinn, E.J.; Geigley, A.G.;
Meluskey, J.T. J. Polym. Sci., Part A-1, 1971, 9, 1843.
3. Liebman, S.A. ACS 6th Northeast Regional Meeting,
Burlington, VT, 1974, Sympos. Computers in Chemistry.
4. Liebman, S.A.; Ahlstrom, D.H.; Hoke, A.T. Chromatographia,
1978, 11, 427.


5. Liebman, S.A.; Ahlstrom, D.H.; Starnes, Jr., W.R.; Schilling,
F.C. J. Macromol. Sci., Chem., 1982, A17(6), 935.
6. Sasaki, S.; Abe, H.; Ouki, T. Anal. Chem., 1968, 40, 2221.
7. Yamasaki, T.; Abe, H.; Kudo, Y.; Sasaki, S. In "Computer-
Assisted Structure Elucidation"; Smith, D.H., Ed.; ACS
SYMPOSIUM SERIES, No. 54, American Chemical Society,
Washington, DC, 1977; p. 108.
8. Kaelble, D.H. "Computer-Aided Design of Polymers and
Composites"; Marcel Dekker, Inc., NY, 1985.
9. Jurs, P.C.; Kowalski, B.R.; Isenhour, T.L. Anal. Chem.,
1969, 41, 21; Ibid., 690, 695.
10. Pichler, M.A.; Perone, S.P. Anal. Chem., 1974, 46, 1790.
11. Kowalski, B.R. Ed., "Chemometrics: Theory and
Applications"; ACS SYMPOSIUM SERIES No. 52, American Chemical
Society, Washington, DC, 1977.
12. Malinowski, E.R.; Howery, D.G. "Factor Analysis in
Chemistry", J. Wiley & Sons, NY, 1980.
13. Delaney, M.F. Anal. Chem. Fund. Rev., 1984, 56, 261R.
14. Harper, A.M.; Meuzelaar, H.L.; Metcalf, G.S.; Pope, D.L. In
"Analytical Pyrolysis"; Proc. 5th International Symposium,
Voorhees, K.J., Ed.; Butterworths Publ., London, 1984,
Chapter 6.
15. Harper, A.M. In "Pyrolysis and GC in Polymer Analysis";
Liebman and Levy, Eds., Marcel Dekker, Inc., NY, 1985;
Chapter 8.
16. Harper, A.M.; Liebman, S.A. Chemometrics Research
Conference, Gaithersburg, MD, May 1985, to be published in
NBS Research Journal.
17. Beckman Instruments, Inc., Spinco Div., Palo Alto, CA 94304;
Spin Pro Expert System, Brochure SB-664.
18. Kraus, T.W.; Myron, T.J., Control Engineering, 1984, 106.
19. Proc. 9th Annual Advanced Control Conf., Purdue University,
W. Lafayette, IN, 1983.
20. Chemical Data Systems, Div. of Autoclave Engineering, 7000
Limestone Rd., Oxford, PA 19363. Sample Concentrators, Models
320, 330; CDS Geochemical Research System, Model 820; Model
8000 Series Micro-Pilot Plant Systems.


21. Spectra-Physics, Autolab Div., San Jose, CA; Liquid
Chromatograph, Model SP8100 XR; Technical Bulletin D/S-01,
12/84.
22. Delaney, M.F.; Uden, P.C. J. Chromatogr. Sci., 1979, 17,
428.
23. Delaney, M.F.; Warren, Jr., F.V.; Hallowell, Jr., J.R. Anal.
Chem., 1983, 55, 1925.
24. Delaney, M.F.; Hallowell, Jr., J.R.; Warren, Jr., J.R. J.
Chem. Inf. Comput. Sci., 1985, 25, 27.
25. Stauffer, D.B.; McLafferty, F.W.; Ellis, R.D.; Peterson,
D.W. Anal. Chem., 1985, 57, 1056; Ibid., 899 and refs.
26. Kalchhauser, H.; Robien, W. J. Chem. Inf. Comput. Sci.,
1985, 25, 103.
27. Fein-Marquart Assoc., Inc., 7215 York Rd., Baltimore, MD
21212; Mass Spectral Info. System (MSIS); MASCOT: software
pkg. MS data on a PC.
28. Sadtler Research Laboratories, Inc., Spring Garden St.,
Phila., PA 19122.
29. Lysyj, I.; Newton, P.R. Anal. Chem., 1972, 44, 2385.
30. Kanal, L. IEEE Trans. Info. Theory, 1974, Vol. IT-20, 697;
and Proc. IEEE, 1972, 60, 1200.
31. Kowalski, B.R. Chemtech, 1974, 300.
32. Byers, W.A.; Perone, S.P. Anal. Chem., 1980, 52, 2173.
33. Moncur, J.G.; Bradshaw, W.G. J. High Resol. Chromatogr. & CC,
1983, 6, 595.
34. Frankel, D.S. Anal. Chem., 1984, 56, 1011.
35. Fredericks, P.M.; Osborn, P.R.; Swinkels, D.A.J. Fuel, 1984,
63, 139; and BHP Tech. Bulletin No. 27, 1983.
36. Infometrix, Inc., 2200 Sixth Ave., Seattle, WA 98121.
37. Deming, S.N.; Morgan, S.L. Anal. Chem., 1973, 45, 278A.
38. Deming, S.N. Amer. Lab., 1981, 13, 42.
39. Deming, S.N.; Morgan, S.L. "INSTRUMENTUNE-UP: A Computer
Program for Optimizing Performance of Common Lab.
Instruments"; Elsevier Scientific Software, Amsterdam, The
Netherlands, 1984.
40. Glajch, J.L.; Kirkland, J.J.; Squire, K.M.; Minor, J.M. J.
Chromatogr., 1980, 199, 57.
41. Sabate, L.G.; Diaz, A.M.; Tomas, X.M.; Gassiot, M.M. J.
Chromatogr. Sci., 1983, 21, 439.
42. Statistical Designs, 9941 Rowlett, Suite 6, Houston, TX;
"Software for Experimental Design and Optimization".
43. Barr, A.; Feigenbaum, E.A.; Eds. Handbook of AI, William
Kaufmann, Los Altos, CA, Vol. II, 1981.
44. Cooper, J.R.; Johlman, C.; Laude, D.A.; Brown, R.S.; Wilkins,
C.L. Proc. Pitts. Conf. Anal. Chem. & Spectr., New Orleans,
LA, 1985; and Anal. Chem., 1984, 56, 1163; Ibid., 57, 1044.
45. Greene, W.W.; Isenhour, T.L. Proc. Pitts. Conf. Anal. Chem.
& Spectr., New Orleans, LA, 1985; and Anal. Chem., 1983, 55,
1117.
46. Borman, S.A. Anal. Chem., 1982, 54, 1379.
47. Hayes-Roth, F.; Waterman, D.A.; Lenat, D.B.; Eds. "Building
Expert Systems", Addison-Wesley Publ. Co., Reading, MA, 1983.
48. Buchanan, B.G.; Shortliffe, E.H. "Rule-Based Expert
Systems"; Addison-Wesley Publ. Co., Reading, MA, 1984.
49. Lenat, D.B. Sci. Amer., 1984, 204.


50. Kinnucan, P. High Tech., 1984, 30; Ibid., 1985, 16.


51. Third Annual Conf. on Applied AI, Boston, MA, 1985,
DPMA/Tech. Training Corp., and Embedded Computer Software
Conf., Boston, MA, 1984.
52. Dessy, R.E. Anal. Chem., 1984, 56, 1200; Ibid., 1313.
53. Klass, P.J. Aviation Wk. & Space Tech., April 22, 1985, 46.
54. Harmon, P.; King, D. "Expert Systems", J. Wiley & Sons, Inc.
NY, 1985.
55. Bramer, M. & D. "The Fifth Generation", Addison-Wesley Publ.
Co., Reading, MA, 1984.
56. Pearl, J. "Heuristics-Intelligent Search Strategies for
Computer Problem-Solving", Addison-Wesley Publ. Co., Reading,
MA, 1984.
57. Selected AI Literature: ICS Applied AI Reporter, Univ. Miami,
Intell. Computer Systems Res. Institute, Coral Gables, FL
33124; The AI Magazine, Amer. Assoc. AI (AAAI), Menlo Park,
CA 94025; and Expert Systems, Internat. J. of Knowledge
Engineering, Croall, Ishizuka, Waterman, Eds., Learned
Information, Inc., Marlton, NJ.
58. Michie, D.; Muggleton, S.; Riese, C.; Zubrick, S. First
Conf. AI Applns., Denver, CO, 1984 RuleMaster-A Second-
Generation Knowledge Engineering Facility Radian Tech. Rpt.,
MI-R-623, Radian Corp., P.O. Box 9948, Austin, TX 78766.
59. SRL DEXPERT, Systems Research Labs., Inc., Dayton, OH 45440-
3639; Integrates LISP algorithms into Fortran or Ada systems.
60. Proc. Workshop on Coupling Symbolic and Numerical Computing
in Expert Systems, sponsored by AAAI, Aug 1985, Boeing
Computer Services, Bellevue, WA.
61. Hayes-Roth, F. In "Pattern-Directed Inference Systems",
Waterman and Hayes-Roth, Eds., Academic Press, NY, 1978.
62. Crawford, R.W.; Brand, H.R.; Wong, C.M.; Gregg, H.R.;
Hoffman, P.A.; Enke, C.G. Anal. Chem., 1984, 56, 1121.
63. Wong, C.M. Energy & Techn. Rev., Lawrence Livermore National
Laboratory, University CA, Livermore, CA, 1984, 8.
64. Demirgian, J.C. J. Chromat. Sci., 1984, 22, 153.
65. Demirgian, J.C.; Eikens, D.I. Proc. Pitts. Conf. Anal.
Chem. & Spectr., 1985.
66. Tomellini, S.A.; Hartwick, R.A.; Woodruff, H.B. Appl.
Spectr., 1985, 39, 331. Quantum Chemistry Program Exchange,
Univ. Indiana, Bloomington, IN, Program #426.
67. Wilson, J.W.; Levine, J.B. Business Wk., June 10, 1985, 82.
68. Robinson, P. BYTE, June, 1985, 169.
69. Yianilos, P.N. Electronics, 1983, 56, 113.

RECEIVED December 17, 1985

Author Index

Abbott, Seth, 278
Bach, René, 278
Beckner, C. F., 321
Bellows, James C., 52
Bertz, Steven H., 169
Burnstein, Ilene, 244
Cabrol, Daniel, 125
Cachet, Claude, 125
Corbett, Michael, 244
Cornelius, Richard, 125
Cross, K. P., 321
Curry, Bo, 350
Delaglio, Frank, 337
Dolata, Daniel P., 188
Dudewicz, Edward J., 337
Duff, P. J., 365
Edelson, David, 119
Ehrlich, Steven, 244
Enke, C. G., 321
Evens, Martha, 244
Ferrin, Thomas E., 147
Fifer, R. A., 365
Garfinkel, David, 75
Garfinkel, Lillian, 75
Gasteiger, J., 258
Giordani, A. B., 321
Gough, Alice, 244
Gregg, H. G., 321
Griffith, Owen Mitch, 297
Hahn, Mathew A., 136
Hansch, Corwin, 147
Harner, Teresa J., 337
Harper, A. M., 365
Hawkinson, Lowell B., 69
Heffron, Matt, 297
Hemphill, Charles T., 231
Herndon, William C., 169
Hoffman, P. A., 321
Hohne, Bruce A., 87
Houghton, Richard D., 87
Huang, Conrad, 147
Hutchings, M. G., 258
Johnson, Peter, 244
Karnicky, Joe, 278
Keith, L. H., 31
Klein, Teri E., 147
Knickerbocker, Carl G., 69
Kulikowski, Casimir A., 75
Kumar, Anil, 337
Langridge, Robert, 147
LaRoe, William D., 231
Levinson, Robert A., 209
Levy, George C., 337
Liebman, S. A., 365
Low, P., 258
Martz, Philip R., 297
Moore, Robert L., 69
Moseley, C. Warren, 231
Palmer, P. T., 321
Pavelle, Richard, 100
Renkes, Gordon D., 176
Riese, Charles E., 18
Sailer, H., 258
Schroeder, M. A., 365
Smith, Allan L., 111
Smith, Dennis H., 1
Smith, Graham M., 312
Soo, Von-Wun, 75
Stuart, J. D., 18,31
Tomellini, Sterling A., 312
Trindle, Carl, 159
Wang, Tunghwa, 244
Wilcox, Craig S., 209
Wipke, W. Todd, 136,188
Woodruff, Hugh B., 312

Subject Index

A

Abstraction, 189
Actinospectacin
  digitized spectrum, 315,317t
  PAIRS interpretation, 315,318t
  structure, 315-316
  trace of sulfone functionality during PAIRS interpretation, 315,318f
Actions, definition, 94,95t,96
Agricultural formulations
  requirements, 87
  structure of the expert system, 89,91-97


Agricultural formulations—Continued
  structure of the problem, 89,90f
  types, 87
Agricultural formulations applications, advantages of expert system, 88-89
Analogy and intelligence in model building
  components, 138-139
  example of evaluation, 140,142f
  goals, 137,138f
  hardware configuration, 140,141f
  input screen, 140,141f
  ORTEP plot, 143,144f
  procedure, 139-144
  scoring, 140,142f,143
  speed of building model, 143t
  superposition of model and refinement, 143,144f
Analytical systems, advances, 365-366
Applications of expert systems
  appropriate selection, 7-8
  biological reactors, 9,10f
  chemical science and engineering, 9-15
  communication satellites, 9,11,12f
  computing environment, 18-19
  diagnosis of plant conditions, real time, 69-70
  execution efficiency, real time, 69
  space stations, 11,13-15
Artificial intelligence
  annual growth rates for companies marketing products based on, 2
  change in number of jobs available, 16
  expert systems, 1-16
Artificial intelligence diagnostic system, goals, 56-57
Artificial intelligence in organic chemistry
  advantages, 210
  background, 210
  categories of reactions, 210
  generalizations about reactions and structures, 210
Artificial intelligence systems development
  DENDRAL, 6
  INTERLISP, 6
  LISP, 6
  MACSYMA, 6
Artificial intelligence techniques for nuclear magnetic resonance analysis
  decision trees, 340
  improvements, 347-348
  logic programming, 340
  model matching similarity nets, 340,341f,342
Assignment statement, definition, 111-112
Axiom, definition, 194
Axiomatic theories
  definition, 194
  steps, 194
Axiomatic theory approach, synthesis, 188

B

Backward chaining, definition, 306
Bimodal logic
  definition, 196
  implication rule, 196t,197

C

Calculus, definition, 190
ChemData, description, 152
Chemical education, applications of computers, 125
Chemistry, unique characteristics, 258
Complete reaction concept, definition, 214
Complex equilibrium calculations
  enzyme kinetics, 79-82
  magnesium ions, 78-79
  pharmacokinetics and drug dosage regimen design, 82,83
Computer algebra system—See MACSYMA
Computer-assisted instruction, description, 126
Computer-oriented notation concerning IR spectral evaluation (CONCISE), description, 313
Computer software, expert systems, 1-2
Computers in chemical education
  advantages, 125
  GEORGE, 126-133
  limitations, 126
  software categories, 126
CONCISE—See Computer-oriented notation concerning IR spectral evaluation
Corona determination, decision tree, 21-22f
Corona determination rule expressed in automatically generated radial code, 21-22f
Corona rule, example set, 20-22f

D

Data bases for MS-MS
  spectrum data base, 324-325


Data bases for MS-MS—Continued
  structure data base, 325
Data center displays
  recommendation screen, 65,67f
  recommendation summary screen, 65,66f
Data base organization for organic structures
  comparison to Cambridge crystallographic data base, 227
  comparison to the screen approach, 227-228
  linear notation for reactions and structures, 228-229
  partial ordering, 224,225f
  retrieval algorithm, 224,226-227
DCG—See Definite clause grammar
Decision tree, corona determination, 21-22f
Declarative languages
  characteristics, 112
  description, 112
Definite clause grammar (DCG), 232-233
Definite integration, application of MACSYMA, 107
Di-n-octyl phthalate
  daughter spectrum of C-containing M+, 333,334f,335
  mass spectrum, 328,329f
  match of 105+ daughter spectra vs. di-n-octyl phthalate, 328t,330f
  match of 149 daughter spectra vs. di-n-octyl phthalate, 331t,332f
  parent spectrum of mass 149, 331,333,334f
  spectrum-substructure correlations, 331
  structures, 328,329f
Diagnosis
  definition, 56
  expert system, 57
Diels-Alder reactions
  algorithm for regiochemical selection, 238
  basic frontier molecular orbital theory, 234
  basic highest occupied molecular orbital-lowest unoccupied molecular orbital calculations, 235-236
  determination of permutated lowest unoccupied molecular orbital coefficients, 237-238
  determination of substituent effects, 236-237t
  disconnection approach, 231
  general form derivation, 239,240t
  grammar, 233-234
  naive approach derivation, 238,239t
  notation rearrangement, 241-242
  structural constraints on reactants, 235
  use of general form in rule formation, 240-241
Differential calculus, application of MACSYMA, 104-105
Differential equations, application of MACSYMA, 109
Disconnection approach, description, 231-232

E

Easy distance geometry editor
  control points, 151
  description, 151
  selection, 151,155f
  surface generation, 151
ECAT—See Expert chromatographic assistance team
Elaboration of reactions for organic synthesis (EROS), reaction schemes, 259,261f
Emulsifiable concentrate, description, 88
EROS—See Elaboration of reactions for organic synthesis
Example set, corona rule, 20-22f
Examples of expert-system applications
  biological reactors, 9,10f
  communication satellites, 9,11,12f
  space stations, 11,13-15
Execution efficiency
  radial, 24-25
  real-time application of expert systems, 69
EXMAT—See Linked network of expert systems for materials analysis
EXMATH—See Expert system for pattern recognition
Expert
  experimental design with PENNZYME, 81-82
  fitting of models to data, 80
  selection of a computational model, 80
  selection of a conceptual model, 80
Expert chromatographic assistance team (ECAT)
  automatic testing, 288
  development equipment, 283,285,287f
  development of knowledge bases, 285-286
  elements involved in development and application, 280,281f
  examples of facts and rules, 283,284f
  expert system programming, 279-280
  first rules, 286
  IF-THEN rules, 286,287f,288
  knowledge representation, 294-295
  limitations of conventional programming, 279
  module development, 292-294


ECAT—Continued
  project motivation, 279
  system strategy, 280,283
  task modules, 280,282f
  user interfaces, 295
Expert system
  definition, 56,279-280
  examples of applications, 29
  formulation of agricultural chemicals, 87-97
  major components, 3
  parts, 56
  TOGA, 20-21
Expert system for agricultural formulations
  accessing external software, 93
  future developments, 96-97
  response checking functions, 92t
  status of development, 92
  structure, 89,91f
  structure of conclusions, 93f
  structure of FACTS, 92f,93
  structure of rules, 93,94t,95-96
Expert system for pattern recognition (EXMATH)
  drivers, 376
  heuristic design, 376,377f
  process, 375
  subroutine, 376,378f
Expert system for process control, real time, 69-74
Expert system for transformer fault diagnosis, TOGA, 25-29
Expert system rule base
  activation by inference engine, 58,60
  basic step, 57-58,59f
  building the rule base, 60
  malfunction, 60
  modes, 57
  rules, 57-58
Expert systems
  analysis of multicomponent materials, 366-381
  and traditional software engineering, differences, 7
  applications to supervise calculations and design experiments, 78-85
  applications related to chemical science and engineering, 9-15
  applied to chemistry, 1-16
  artificial intelligence, 1-16
  building, 76,77
  calculation supervision, 76
  characteristics and values, 2-4
  computer algebra system, 100
  computer software, 1-2
  consultation problems, 75
  definition, 18
  description, 3
  diagnosis of plant conditions, real-time application of, 69-70
  Diels-Alder reactions, 231-242
  execution efficiency, real-time application of, 69
  hardware technology revolution, 13
  high-performance liquid chromatographic methods developments, 278-295
  knowledge extraction, 27-28
  MS-MS data, 321-335
  NMR spectroscopy, 337-348
  organic chemistry, 258-274
  organic structure determination, 350-363
  organic syntheses, 244-257
  programs for chemistry, 280
  Rulemaster, 18-29
  scientific and engineering applications, 8
  selecting an appropriate application, 7-8
  types, 75
  ultracentrifugation, 297-311
  uses and values, 4-5
Expert systems applications, computing environment, 18-19
Expert systems builder program, advantages, 76
Expert systems building, example for multiple equilibrium calculation, 76-77
Expert systems developing, model, 7-9

F

Factorization, application of MACSYMA, 105-106
Four major steps, model for developing an expert system, 7-9

G

Generation of molecular structures (GENOA)
  advantage, 333
  description, 324,333
  empirical formula of the unknown compound, 333
  example of di-n-octyl phthalate, 333,334f,335
Generator diagnostic system, actual experience, 65,68


Generic rules
  description, 153
  hydrophobicity examples, 153-154,155-156f
GENOA—See Generation of molecular structures
GEORGE, 126-127,128f
  comparison to other programs, 126
  diagram for determination of aniline molarity, 121,129,132f
  diagram for determination of ethanol density, 129,132f
  display of unit conversion, 129,130f
  domain, 126
  example of a relation page, 131,132f
  extension of the domain of application, 133
  levels of use, 127-132
  logic, 127
  primary menu, 127,128f
  program, 127-132
  screen explaining molecular mass calculation, 127,129,130f
Group theory
  application of symbolic programming, 176-185
  software available, 185

H

Hardware technology revolution, expert systems, 13
Heuristics, definition, 3-4,189

I

IF-THEN rules, rule-based expert systems, 3
Incremental multivalued logic
  description, 199-200
  implication, 201
  incremental acquisition of evidence, 200-201
Indefinite integration, application of MACSYMA, 107
Intuitive theory, definition, 194

K

KARMA
  generic rules, 153-156
  graphic interface, 157
  interactions for enzyme-ligand binding, 152
  knowledge, 152
  molecule editor, 148,150f
  pop-up menus, 148,150f
  rule formulation, 153
  specific rules, 156-157
  system core, 151-157
  system design, 148,149f
  system implementation, 148,151
KarmaData, description, 152
KEE-assisted receptor mapping analysis
  description, 147-148
  difference from traditional approach to drug design, 147-148
Knowledge, manipulation for use in computer programs, 2
Knowledge base
  content, 4
  expert systems, 3-5
Knowledge base for expert systems in organic chemistry
  charge distribution, inductive, and resonance effects, 263,265
  concepts involved in organic reaction causes, 260,264f
  heats of reaction and bond dissociation energies, 260,262t
  hyperconjugation, 265
  multilinear regression analysis, 265-266,267-268f
  polarizability effects, 262-263,264f
  reactivity space approach, 266-274
Knowledge engineering, description, 3-4
Knowledge extraction
  expert systems, 27-28
  Rulemaster, 27-28
  TOGA, 27-28

L

Languages, programming, vs. programming environments, 6
Linked network of expert systems for materials analysis (EXMAT)
  analytical goals, 375
  chemometric-search algorithms, 375
  data generation, 367
  documentation and evaluation of results, 375
  expert system network, 368,370f
  individual systems, 367
  instrumental configuration and conditions, 368
  interpretation, 375
  outline, 376,379-381f

Linked network of EXMAT—Continued
  overall decision structure, 368,371f
  partial decision structure, 368,372f,374f
  problem definition and analytical strategy, 368
  rules, 368,373
LISP for symbolic programming
  advantages, 177-178
  implementation, 178-185
LMA—See Logic machine architecture
Logic
  bimodal, 194-197
  incremental multivalued, 199-201
  Lukasiewicz-Tarski multivalued, 197-199
Logic machine architecture (LMA), definition, 244
Logical inferences per second—See LISP
Logistic regression analysis
  description, 273
  example of problem reaction prediction, 274,275f
  network of bond breaking and making patterns, 274,275f
Lukasiewicz-Tarski multivalued logic
  allowed values, 198t
  cumulative evidence, 198-199
  description, 197

M

Macrooperators, 189
MACSYMA
  advantages, 101-102
  capabilities, 102-103
  description, 100-101
  equipment, 100
  examples, 103-110
  uses, 103
Magnetic resonance imaging
  development, 339
  MRI_LOG_ESP, 342-347
  system flow chart, 339-340,341f
Malfunction
  definition, 56,60
  examples of a once-through boiler system, 60,61t
Manual construction of molecular models, description, 136
Mass spectrometry-mass spectrometry
  data bases, 324-325
  development of data bases, 322,324
  molecular structure generator, 324,333-335
  overall system for determination for structure, 322,323f
  spectra matching program, 325-327
  spectrum-substructure relationship, 326-333
  structure analysis, 322
  substructure-property relationships, 322
Mass spectrometry-mass spectrometry spectra matching program
  description, 325
  match factors, 326,327f
  range of standard conditions, 326
Minimum reaction concept, definition, 214
Model, definition, 259
Model computer software for spectroscopic analysis (NMR1)
  description, 338
  difficulties, 338,339
Model for developing an expert system, four major steps, 7-9
Modules of ECAT
  column and mobile phase design, 288-292
  column diagnosis, 292t
  determination of chemical and structural information on the sample, 292
Molecular model building
  application of analogy and intelligence, 136-144
  automatic, 137
  manual construction, 136
  PRXBLD, 137
  SCRIPT, 137
  WIZARD, 137
Molecular spectral interpretation, steps, 350-351
Molecular structure programs
  accepting the sketch, 160-161
  distance geometry changes distances to Cartesian coordinates, 164,166
  extensions of the functional fragment data structure, 166-167
  fragments, 164,165f
  interacting-fragments modeling schemes, 167
  LISP structural recognizer, 163-164
  obstacles to wide use, 160
  preliminary processing of the sketch, 161
  problem, 160
  representation of the molecule in LISP, 163
  routine to assign Lewis structures, 161,163,165f
  specification of vertices, 161,162f
Molecular structure representation in clause form
  atom function, 246-247
  bond function, 246-247
  clauses, 245

Molecular structure representation in clause form—Continued
  fragments, 246-247
Monitoring, definition, 56
Monitoring system of a power plant
  diagnosis activation limit, 57
  schematic, 54f,57
  sensors, 57,58t
MRI_LOG_ESP
  branches, 342,346
  commands, 347
  output files, 347
  sample session, 342,343-345
  statistical procedures, 346-347
MYCIN, description, 138

N

Necessity, definition, 58
Nuclear magnetic resonance spectroscopic analysis, systems, 337

O

Once-through boiler system
  cation conductivity sensor malfunction, 62t,64f,65
  description, 60,61t
  diagnoses, 60-61
  number diagnosed for each sensor, 60,61t
  sensor validation, 61-62,63f
Organic structure determination
  accessibility to knowledge base and reasoning process, 352
  chemical data base, 355,356f
  example for 4-phenyl-2-butanone, 356-358
  flow chart, 350,353f
  interpretation of spectra, 352,353f
  IR expert module, 355
  messages, 355-356
  program description, 354-355
  recall, 360
  reliability, 360
  testing of known structures, 357-362

P

PAIRS—See Program for the analysis of infrared spectra
Pattern recognition programs, development, 366
Pattern-matching system for spectra classification, 351
PENNZYME
  enzyme and transport kinetic program, 79-80
  fitting of models to data, 81
  interface with Expert, 80-82
Pharmacokinetics and drug dosage regimen design
  description of problem, 82
  modeling considerations, 83-85
  physiological pharmacokinetics, 83-85
  use of expert systems, 84-85
4-Phenyl-2-butanone
  explanations of conclusions of organic structure determination, 357,359f
  interpretation of spectra, 356-357,358f
Physicochemical parameters, examples, 151-152
PICON—See Process intelligent control
Planning, 189
Polarizability, definition, 263
Polynomial equations, applications, application of MACSYMA, 103-104
Postulates, definition, 194-195
Power plant
  malfunction definition, 53
  schematic, 53,54f
  types of boilers, 53
Power plant chemistry
  dependence on boiler, 53,55
  problems, 55
Predicate, definition, 193
Predicate calculus
  formal symbols used QED, 192t
  logic, 190-192
  translation of chemical statements into predicate logic, 192
  working definition, 192
Predicates, definition, 93-94,95t
Problem solving and inference engine, expert systems, 3
Procedural languages
  characteristics, 111-112
  definition, 111
  steps in algebraic equation solving, 113
Process control, real-time expert system for, 69-74
Process intelligent control (PICON)
  backward-chaining inference, 70-71
  design requirements, 70
  example of inference, 73
  focus facility, 71,73
  forward-chaining inference, 70-71
  overall structure of package, 74f
  system for process control, 71,72f

Program for the analysis of infrared spectra (PAIRS)
  automated rule generation program, 313-314
  information flow, 312-313,316f
  strengths, 313
  tracing interpretation rules, 314-319
Programming languages, vs. programming environments, 6
Proof ordering, vs. time-ordered presentation of fired rules, 23
Prototype (building) expert system, refinement, 28
PRXBLD, description, 137

Q

QED program
  agenda list, 204-205
  analysis example, 205,206f,207
  block diagram, 201,202f
  BNF grammar for language, 203f
  compilation process for rules, 202f
  data base, 204
  description, 201-202
  internal form of ALPHA-TO-SC, 204t
  parse tree, 203f
  rule passing, 202
  rules, 205

R

Radial
  discussion, 20-21,23
  error detection at building time, 24-25
  execution efficiency, 24-25
  interfacing software, 23
  language features, 21
  similarities to Pascal and ADA, 21
Reaction rule data base
  connection tables organization, 250-251
  Gelernter reaction rule, 247,249f
  multistep rules, 250-251
  single-step rules, 250-251
Reaction rule translation into clauses
  clause representation of goal and subgoal, 251
  variable substructure molecule vs. known molecule, 251-252
Reactivity space approach
  cluster analysis, 270,272f,273
  compounds used in deriving a reactivity function, 270,271t
  discussion, 266
  heterolysis, 270,271f
  logistic regression, 273-274
  supervised-learning pattern recognition methods, 273
  three-dimensional reactivity space, 266,269f,270
  unsupervised-learning pattern recognition methods, 270
Real-time application of expert systems
  diagnosis of plant conditions, 69-70
  execution efficiency, 69
Real-time expert system for process control, 69-74
Reasoning
  symbolic application appropriate to expert systems, 8
  use in problem solving, 3
Rule-based system for spectra classification, 351
Rule-based systems, definition, 306
RuleMaker
  inductive learning, 20-21
  knowledge acquisition system, 20
RuleMaster
  C-code generation, 24
  expert systems, 18-29
  explanation of the line of reasoning, 23
  external processes, 23-24
  history, 19
  knowledge extraction, 27-28
  portability, 25
  programming skills required, 28-29
  two principal components, 20,21,23

S

Scientific and engineering applications, expert systems, 5-6
SECS—See Simulation and evaluation of chemical synthesis program
Self-organized knowledge base for organic chemistry
  calculation of generalization validities, 217-218
  interactive sessions, 219,220-223f
  reaction generalizations based on specific observations, 212,214,215-216f
  reaction representation, 211-212,213f
Sentential calculus, description, 195
Similarity of molecules
  aliphatic alcohols, 169,170,171f

Similarity of molecules—Continued
  applications, 174
  calculation of similarity index, 170
  complexity measurements, 173,174t
  distance measurements, 170,173t
  perception, 169
  quantification, 170
  similarity matrix, 170,171f
  subgraph enumeration, 170,172t,174
Simplification, application of MACSYMA, 106-107
Simulation and evaluation of chemical synthesis program (SECS), plan representation, 189-190
Simulation of complex kinetics
  adaptability, 123
  applications, 119-120
  approaches to mathematical problems, 120-121
  data structures, 122
  equipment, 121
  input language, 121-122
  mathematical problem, 120
  program output, 122-123
  syntax analysis, 122
Software engineering, traditional, differences, expert systems, 7
Software for scientific computation, review, 111-112
Specific rules
  description, 154
  examples, 154,156f,157
Spectrum-substructure relationships
  example for di-n-octyl phthalate, 328-333
  procedure, 326,328
SpinPro ultracentrifugation expert system
  backward-chaining inference engine, 306-307,308f
  calculation function, 309
  consultation function, 299
  description, 298
  design inputs report, 301-302
  development, 309-310
  information function, 307,308f
  lab plan report, 306
  lab rotors, 300-301
  major functions, 298-299
  methods, 204
  operation, 299-300
  optimal plan report, 301,303f,304
  optimization criteria, 300
  plan comparison report, 301,303f,306
  protein sample separation, 304-305
  user interface, 299
  vs. expert, 310-311
Steam power plant, downtime, 52
Sufficiency, definition, 57-58
Symbolic programs for group theory
  advantages, 176-177
  basic functions, 178-180
  display decomposition of products, 183,184t
  future plans, 185
  language, 177-178
  property lists for cyclic group, 179t,180
  record structure, 183t
  terminal display of character correlation tables, 181,182t
  terminal display of character table, 181t
  terminal display of classes, 180,181t
  terminal display of the decomposition of product, 182t
Symbolic reasoning
  application appropriate to expert systems, 8
  use in problem solving, 3
Synthesis planning programs
  approaches to large search spaces, 189-190
  complexity of synthesis tree, 189,191f
  first-order predicate calculus, 190,192
  problems, 189
  procedure, 188-189
  strategic basis, 189
  symmetry-based strategy for β-carotene, 190,191f
Synthesis with LMA (SYNLMA)
  advantage, 244-245
  definition, 244
  improvements, 256-257
  process, 245
  reaction rule data base, 247-251
  synthetic design process, 253-256
  translation of reaction rules into clauses, 251,252f
Synthetic design process using SYNLMA
  problem-solving tree for synthesis of darvon, 253,254-255f
  two-tree system, 253,256

T

Taylor-Laurent series, application of MACSYMA, 108-109
The intelligent machine model (TIMM)
  decision and control structure, 368,369f
  sections, 367
Time-ordered presentation of fired rules, proof ordering, 23
TIMM—See The intelligent machine model

TK Solver
  acid rain example, 115-116f,117
  capabilities, 113
  chemical applications, 117-118
  computational approach, 112-113
  definition, 112
  van der Waals gas example, 113,114f,115
Tools, used in constructing expert systems description, 6
Transformer fault diagnosis, expert system for, TOGA, 25-29
Transformer oil gas analysis (TOGA)
  expert system, 20-21
  expert system for transformer fault diagnosis, 25-29
  diagnostic approach, 25
  knowledge extraction, 27-28
  knowledge refinement, 28
  operational use, 27
  reasons for building the system, 26
  validation, 26-27

U

Ultracentrifugation, problems, 297

V

Validity, aid in precursor generation, 218
Variance-covariance matrix of parameters, calculation by PENNZYME, 82

W

Wettable powders, description, 88
Wiswesser line notation (WLN), background, 232
