
Mermaid-A Front-End to Distributed Heterogeneous Databases

MARJORIE TEMPLETON, DAVID BRILL, SON K. DAO, ERIC LUND, PATRICIA WARD, ARBEE L. P. CHEN, MEMBER, IEEE, AND ROBERT MACGREGOR
Invited Paper

Mermaid is a system that allows the user of multiple databases stored under various relational DBMSs running on different machines to manipulate the data using a common language, either ARIEL or SQL. It makes the complexity of this distributed, heterogeneous data processing transparent to the user. In this paper, we describe the architecture, system control, user interface, language and schema translation, query optimization, and network operation of the Mermaid system. Future research issues are also addressed.

I. INTRODUCTION

Distributed database management systems (DDBMS) require the integration of technology from other fields such as operating systems, networks, and, for Department of Defense systems, security and formal verification. Two factors moving the field toward DDBMS are the development of networks, which make DDBMS possible, and the migration of users away from large central CPUs toward powerful workstations. Data may reside in different computers for many reasons such as ownership, security classification, performance, or size. Data may be stored redundantly in different computers for reliability or survivability. SDC (now UNISYS Corporation) has developed a prototype named Mermaid that is a front-end to three relational database management systems (DBMS) and that runs on a network of computers. It appears to the user to be a DDBMS. This prototype has demonstrated the feasibility of the concepts and has been used to experiment with different optimization algorithms and system control strategies. The basic system is now being developed into a product while research continues on extensions in the areas of object management, security, and integration with a deductive inference engine.
Manuscript received October 15, 1985; revised October 15, 1986. M. Templeton, D. Brill, S. K. Dao, E. Lund, and P. Ward are with the UNISYS Corporation, Santa Monica, CA 90406-4988, USA. A. L. P. Chen is with Bell Communications Research, Inc., Morristown, NJ 07960, USA. R. MacGregor is with the Information Science Institute, Marina del Rey, CA 90292, USA. IEEE Log Number 8714296.

The current Mermaid configuration consists of four computers running UNIX 4.2 and connected with an Ethernet. Each computer contains a DBMS: a VAX 11/780 which is host to a back-end Britton-Lee IDM database machine [2], a Sun 120 with INGRES, a Sun 170 with INGRES, and a Sun 120 with Mistress [18]. Mermaid, which may reside at any of the computers in the network, operates on top of these DBMSs, which can also be used independently. The advantage of this front-end approach is that each DBMS may operate autonomously to achieve local control over access, accounting, and resource allocation. Moreover, this approach makes it possible to access existing databases stored under various DBMSs running on different computers. Section II of this paper discusses similar efforts and previous SDC work in this area. Section III examines the Mermaid system in detail, Section IV shows examples of operation, and Section V outlines our project plans and future directions.
II. RELATED TECHNOLOGY

Many technical problems must be solved in order to provide a front-end system which can provide transparent access to existing databases. The Mermaid project has made maximum use of related technology. The related technologies may be categorized as standard user language and data model, query optimization, and distributed operating systems.
A. Standard User Language and Data Model

Transparent access to a variety of DBMSs requires the definition of a data model and data manipulation language which can be used to mediate between different DBMSs. The language and model need to be broad enough to cover all potential models used in target DBMSs and all functionality of the target language, or at least a subset of the target language which will be supported in the distributed system. There are two stages of solution to the problem of multiple database access. The first is to provide a standard view of single databases stored under different DBMSs and to provide access to a single database at a time through a standard query language. The second is to provide an integrated view of multiple heterogeneous databases and to provide the capability to access and integrate data from several databases to answer one query. The solution to the first problem, that of a common language, is being addressed by the DODIIS NQL-DD/D Project [29], the SAFE Project [19], the ADAPT Project [14], and FOCUS [12]. SDC is developing a DODIIS (Department of Defense Intelligence Information System) Query Language (DQL) and a common data dictionary and directory (DD/D) for locating data. The goal is to provide a common functionality and language that includes some features not found in the target systems; it does not attempt to provide all features found in all target systems. The next step beyond common access to a single database at a time is common access to multiple databases. CCA (Computer Corporation of America) developed the language ADAPLEX [21], [22] and the system Multibase [23], [13], which is a front-end distributor to heterogeneous DBMSs. Their goals are the same as those for Mermaid although their implementation differs significantly. CCA actually built their own DBMS, called the Local Data Manager, that manipulates data and does postprocessing. Mermaid uses one or more existing DBMSs to do manipulation and postprocessing. CCA has put more emphasis on access to nonrelational DBMSs and translation than Mermaid has and less emphasis on query optimization. Another difference is that Mermaid is stressing a separation of the data model, the user query language, and the intermediate language.

UNIX is a trademark of AT&T.
0018-9219/87/0500-0695$01.00 © 1987 IEEE
PROCEEDINGS OF THE IEEE, VOL. 75, NO. 5, MAY 1987

All communication between Mermaid processes is via an intermediate language which is independent of the user language and the semantic model of the data. Another system, NDMS (Network Data Management System) [24], is being developed at CRAI in Italy. Its goals and architecture are very close to those of Mermaid. NDMS uses SQL for the user language and has the concept of an intermediate language. The user model is the relational data model with semantic extensions, and schemata of the underlying databases are integrated into a federated schema with different user views. The U.S. Air Force has sponsored development of a prototype heterogeneous front-end system called IISS (Integrated Information Support System) [11]. It has a goal of accessing databases on three types of hardware (VAX, IBM, and Honeywell) and many DBMSs. Their approach is batch-oriented. There is no ad hoc query language and all retrieval requests must be precompiled.

B. Query Optimization

Access to data in multiple databases requires careful planning to control the time that is required to process the query. This is important in local DBMSs, but it becomes even more important when data must be moved across a network. Query optimization algorithms have been developed for DDBMS such as SDD-1 [1], distributed INGRES [A, Encompass [2A, and R* [31], [15]. The Mermaid system started with ideas from SDD-1 and distributed INGRES when developing the optimization algorithms that are currently running. These algorithms will be discussed further in Section III-C6.

C. Distributed Operating Systems
BBN has developed a prototype called Cronus [20]. Cronus provides a level above host operating systems that appears to be a distributed operating system to application programs. It provides uniform mechanisms for operating system functions including communication, access control, naming, and data storage and retrieval. The IISS system has an operating system component called NTM (Network Transaction Manager). The NTM is started as an operating system above the existing operating systems and supports the application programs. Communication between processes in NDMS is done through mail messages sent between computers connected with a network that uses the X.25 protocol. The system control and network protocol of the Mermaid system will be discussed in Sections III-C7 and III-C8.

III. MERMAID ARCHITECTURE

A. Assumptions

Mermaid is expected to operate in an environment where there is a federation of preexisting databases. Ad hoc users and application programs will access individual databases directly, and Mermaid will be used primarily for ad hoc access to the federation. Mermaid must operate above the preexisting databases without changing the DBMS or the design of the databases. Mermaid cannot require changes to the database structure for compatibility of data types or for optimizations. Mermaid must provide a query optimization algorithm which can adjust to a wide variety of data distributions. In some environments, there may be closely coupled databases in which a relation from one database may be replicated in another database or in which some relations may be fragmented with fragments stored in different databases. In other environments, the databases may be basically disjoint, although there must be some attributes that can be used to join relations across the databases or there would be little reason to treat them as a federation. It is assumed that processing costs and communications costs will not be uniform. There will be different data management systems which will have different operational characteristics even when running on the same hardware, and there will be different types of computers. The network will also be nonuniform. There may be multiple local networks, possibly different types of networks, connected by gateways. Since it is assumed that users of Mermaid will tend to use it interactively, minimization of response time is important. The response time includes the time to develop the plan for answering the query as well as the time to execute it. Therefore, a major objective has been to design an optimization algorithm that runs quickly even if the plan is not necessarily the optimum. Plans are not compiled and saved because ad hoc use is expected, but, in the future, plans could be saved if queries are saved in the query library or executed from application programs. Since relational databases are becoming prevalent and relational interfaces are being developed for nonrelational DBMSs, Mermaid has emphasized access to relational DBMSs. Mermaid assumes the existence of a local optimizer which determines which indices to use, how to perform a join, and the order of joins if multiple joins are required. Most of the expected users will not be doing updates through Mermaid. The majority of updates will be made by existing application programs. Some updates may be made interactively, but the updates will be made by the owner of the data to a single database. Since Mermaid is expected to be a front-end system to existing databases, we expect few conflicting updates and little need for transactions that cross databases in this environment.

Fig. 1. Mermaid processes.

B. Overview
The development of Mermaid started in 1982. It was initially implemented on a single VAX running UNIX 4.1. The emphasis the first year was on query optimization and translation of queries to IDL, which is the language of the IDM database machine. ARIEL (which is an SDC-developed user-friendly query language) [16] and the user interface environment were added in 1983. The Ethernet was acquired in 1984 and the entire system was moved to UNIX 4.2 and the network of Suns as well as the original VAX. The major processes in the system are shown in Fig. 1 and described as follows:
The user interface process: The user environment has an embedded ARIEL or SQL parser and a translator that produces DIL (Distributed Intermediate Language).
The distributor process: It contains the optimizer and the controller. The user interface process and the distributor process could be on the same computer or different computers.
One DBMS driver process for each database to be accessed: The driver also contains a translator that translates from DIL to the DBMS query language. It interfaces to the DBMS through a DBMS-supplied procedural interface.
All information about schemata, databases, users, host computers, and the network is contained in a DD/D (Data Dictionary/Directory) which is stored in a database and accessed through a special driver. Either of the DBMS drivers could be the DD/D driver. The translator and the optimizer access the DD/D in order to do translation and query planning. The following paragraphs describe the flow of a query through the system. The user builds an ARIEL or SQL query in the user environment. This query is then parsed and validated by the translator. If the query is valid, it is sent to the distributor. The controller reads the query and passes it to the optimizer, which plans the execution. The DIL query may have to be decomposed into several subqueries, and the controller sends them to one or more DBMS drivers for execution. The controller then waits for responses from the DBMS drivers. Each DBMS driver will process the DIL (sub)query by calling the translator to translate the query into the DBMS query language, sending the query to the DBMS, reading the status (tuple count, error messages), and then retrieving the response if any. The result returned to the driver may be just a status if the operation does not retrieve data, it may be a report, or it may be a relation that must be sent to another driver. When the final report has been assembled at a site, the controller directs the driver at that site to send the report to the user interface.
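The query flow described above can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Driver` class, the `run_plan` function, the site names, and the placeholder strings that stand in for DIL and DBMS query text are hypothetical and are not taken from the Mermaid implementation.

```python
class Driver:
    """Hypothetical DBMS driver: translates a DIL subquery and runs it locally."""

    def __init__(self, site, translate, execute):
        self.site = site
        self.translate = translate  # DIL text -> local DBMS query text
        self.execute = execute      # local query text -> list of result rows

    def run(self, dil_subquery):
        local_query = self.translate(dil_subquery)
        rows = self.execute(local_query)
        # The driver reports a status (here only a tuple count) with the data.
        return {"site": self.site, "tuple_count": len(rows), "rows": rows}


def run_plan(plan, drivers):
    """Controller step: send each planned subquery to the driver for its site."""
    return [drivers[site].run(subquery) for site, subquery in plan]


# Placeholder strings stand in for real DIL and DBMS query text.
drivers = {
    "vax": Driver("vax", lambda dil: "IDL:" + dil, lambda q: [("row1",), ("row2",)]),
    "sun": Driver("sun", lambda dil: "QUEL:" + dil, lambda q: [("row3",)]),
}
plan = [("vax", "subquery-1"), ("sun", "subquery-2")]  # as an optimizer might emit
statuses = run_plan(plan, drivers)
print([(s["site"], s["tuple_count"]) for s in statuses])
```

The sketch runs the subqueries one after another for simplicity; it does not model the controller waiting on several drivers at once or the movement of intermediate relations between drivers.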

C. Component Descriptions
1) Data Dictionary/Directory: The information about the databases, the users, the DBMSs, the host computers, and the network is contained in the DD/D. The layers of schema supported in the DD/D are shown in Fig. 2 and described as follows:
Subschema Layer: This layer represents the user view based on the global schema. The user view can be represented in the relational or semantic model.
Global Schema: This schema represents the global (federated) view of all the data defined in the distributed local schema. It is represented in the relational model.
Distributed Local Schema: This schema represents the relational view of the local schema.
Local Schema: This schema corresponds to the external view of the local database. It can be represented in any data model.

Fig. 2. Mermaid schema architecture.

The relational model is being used as the global data model. The local schemata currently supported are all relational, i.e., there is no distinction between the distributed local schema and local schema. However, we maintain this four-layer schema architecture for the future expansion to include nonrelational DBMSs. Much of the friendliness of the ARIEL language results from its exploitation of new modeling constructs that become available when a semantic model is defined above the global schema. For example, in many or most cases the system will be able to infer from the semantic subschema which pair of columns to use when joining a pair of relations. This will allow a user to specify queries which omit references to joins. There could also be multiple global schemata defined over the same set of underlying databases to provide different application views or different security views, but each different global schema is a different "database" that the user opens and each is stored separately in the DD/D. In addition to the schema information, the DD/D contains data describing physical characteristics, both of the data in the local databases (such as the size in bytes of a relation) and of the system (such as the number of sites in the system). It also includes the performance characteristics of the network connecting the various sites and the capabilities of the different DBMSs, which are needed for the optimizer.
2) Languages:
a) User languages: Because of its modular structure, Mermaid can use a variety of user interfaces. Each specific user interface requires a translator to translate the user language into DIL. Currently there are translators that accept ANSI standard SQL or ARIEL. If an application requires another type of interface, such as forms, a specific interface could be written to produce DIL and the full power of Mermaid would be available.
b) Distributed Intermediate Language: DIL is a highly structured language which is designed for ease of translation. The user language can be flexible with a rich syntax, but it is translated into DIL before it is given to the query optimizer. The use of DIL has many advantages:
DIL supplies the functionality of a large class of database query languages.
New translators can be easily written for new user languages without impacting the existing system.
DIL can be entered directly into the distributor or into the driver so that either module can be tested independently.
DIL is a better representation of the query than a tree form for transmission across a network because it contains only ASCII characters.
Nested queries and aggregates in the qualification are difficult for the optimizer and may be difficult for the translator because many DBMSs do not support these features. We solve this problem by using the translator to translate the queries into a multi-step DIL query which contains no nested queries or aggregates in the qualification. We have found from experience that the intermediate language must be human-readable. It allows the system administrator and users to check the functioning of the system via the run-time journals that our system produces. It also simplifies the development of the programs, because each module can be tested independently by entering its input in DIL and looking at its DIL output. The DIL parser will enforce the rule that every relation referenced in a query must appear in a join clause. This restriction is motivated by the desire to avoid the growth of data via a cross-product. It may have to be relaxed so that a warning is given rather than actually forbidding it. However, in our experience, cross-products are generally obtained accidentally when the user forgets a join clause.
c) DBMS languages: All of the DBMSs that are currently accessed are relational. We have translators within the DBMS drivers that translate DIL into IDL, QUEL, and SQL.
3) Query and Data Translation: There are three types of translation that take place within Mermaid. They are language translation, from ARIEL or SQL to DIL and from DIL to QUEL, SQL, or IDL; data translation; and schema translation. The mapping information used for schema and data translation is stored in the DD/D.
a) Language translation: The user language may be SQL or ARIEL, and the DBMS language is whatever is supported by the ad hoc interface to the DBMS. Problems occur when the languages do not have the same functionality. The developer of each translator must have a detailed understanding of the semantics of various operations. Problems can occur with duplicate tuple removal, with building new relations, and with aggregates.
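The DIL parser rule stated earlier in this section, that every relation referenced in a query must appear in a join clause, can be sketched as a small check. This is only an illustration under stated assumptions: the function name `unjoined_relations` and the tuple representation of a join clause are hypothetical, and exempting single-relation queries is an assumption the text does not spell out.

```python
def unjoined_relations(relations, join_clauses):
    """Return the relations that appear in no join clause (hypothetical check).

    relations    -- names of the relations referenced by the query
    join_clauses -- pairs ((rel, attr), (rel, attr)), one pair per join clause
    """
    if len(relations) <= 1:
        # Assumption: a query over a single relation needs no join clause.
        return set()
    joined = {side[0] for clause in join_clauses for side in clause}
    return set(relations) - joined


joins = [(("ship", "port_id"), ("port", "id"))]
print(unjoined_relations(["ship", "port"], joins))          # nothing missing
print(unjoined_relations(["ship", "port", "crew"], joins))  # crew is missing
```

A parser could reject the query when the returned set is non-empty, or only warn, which corresponds to the relaxation the text mentions. The check mirrors the rule as stated and does not test whether the join clauses connect all relations into one group.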
Most relational systems leave duplicate tuples in sets returned unless the user specifies that duplicates are to be removed. For example, INGRES allows the user to say retrieve (A.x) or retrieve unique (A.x). However, a system built on pure relational algebra would automatically remove duplicates. Mermaid builds new relations during query processing to hold temporary results. Dynamically created relations are not allowed in systems that do not have a data dictionary that can be accessed and updated during operation. Even those systems that do allow building new relations have different methods for doing so. For example, ORACLE's [17] SQL does not support select ... insert into, which is used to build new relations from old in Mistress SQL. The DIL retrieve into must be translated into two commands for ORACLE, one to CREATE a new relation and another to INSERT into it. Aggregate processing is not supported in a uniform manner across systems. Even in the current Mermaid system, which is all relational, there are differences between Mistress and the other DBMSs. IDM allows different qualification on each aggregate while Mistress has a global qualification. INGRES and IDM allow aggregates in the qualification clause as well as the target list, but Mistress allows aggregates only in the target list. There is no simple solution to the problem of different functionality across systems. Our current solution in Mermaid is to define a basic set of functions that will be supported. DBMSs which cannot perform the minimum functions will not be supported in the Mermaid system. Other DBMSs may be supported on a retrieve-only basis, that is, they cannot be used to perform ad hoc joins or report generation. Queries are simplified as much as possible when translated into DIL. The translator from the user language to DIL generates a sequence of DIL subqueries for all queries that contain aggregates in the qualification, nested queries, or disjunctions. This procedure is called query flattening. The sequence of subqueries builds temporary relations which are referenced in later subqueries. This simplifies the queries, which in turn makes it easier for the optimizer to plan the query execution and for the translator to translate each subquery into DBMS languages. Other features such as quantifiers, set-operators, trigonometric functions, user-defined functions, and report generation are not supported by the user interface. Instead they are handled in other ways. For example, report generation functions are added in a post-processor. Output may be stored in relations or files for further processing.
b) Data translation: The user views the federated database as a single database with a common representation for the data elements that are of the same domain. However, data may have different representations in different databases. The data used in the qualification of the query are translated from the global representation, as input by the user, into the local representation. The data that are retrieved for joining between databases or for the final report are translated into the global representation. Two basic types of translation are supported: functional translation and enumerated type translation. Functional translation is used for any algorithmic translation such as from miles to kilometers or from one date format to another. Enumerated type translation converts sets of values through a table lookup such as from internal codes to names.
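The two kinds of data translation just described can be illustrated with a short Python sketch. The function names, the code table, and the choice of ISO dates as the global representation are assumptions made for this example; the paper does not give Mermaid's actual mapping format.

```python
from datetime import datetime

KM_PER_MILE = 1.609344  # exact international definition of the mile


def miles_to_km(miles):
    """Functional translation: an algorithmic unit conversion."""
    return miles * KM_PER_MILE


def yymmdd_to_iso(text):
    """Functional translation between two date formats (YYMMDD to YYYY-MM-DD)."""
    return datetime.strptime(text, "%y%m%d").strftime("%Y-%m-%d")


# Enumerated type translation: a lookup table, inverted for the other direction.
CODE_TO_NAME = {"01": "submarine", "02": "destroyer"}  # hypothetical codes
NAME_TO_CODE = {name: code for code, name in CODE_TO_NAME.items()}


def to_local(name):
    """Translate a value typed in the qualification into the stored local code."""
    return NAME_TO_CODE[name]


def to_global(code):
    """Translate a retrieved local code into the global representation."""
    return CODE_TO_NAME[code]


print(to_local("submarine"), to_global("02"), yymmdd_to_iso("870515"))
```

The lookup table is used in both directions because, as the text notes, qualification values travel from the global to the local representation while retrieved values travel the other way. Python's `%y` directive maps two-digit years 69 to 99 onto 1969 to 1999, which is a property of this sketch rather than of Mermaid.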
c) Schema translation: Schema translation is closely related to data translation but it affects the relations and fields named in the query rather than the data. The names of fields may be changed, joins may be added or removed, predicates may be added, and fields may be concatenated or substringed. If fields in the qualification are concatenated or substringed, then the data in the qualification will have to be modified. Schema mapping may be at the relation or the field level. At the relation level, the types of translation include name changes, horizontal partitions, projections, and one-to-many or many-to-one mappings. Horizontal partitions are defined with a predicate. Each partition may be in a different database or a partition may be excluded from the federated view. Projections require no translation because the excluded fields are simply not known to the global view. One-to-many mappings from the global to the local view require the addition of a join in the query and can be done only within a single local database. When the global view is defined we exclude global relations that require a join across databases. Many-to-one mappings from the global to the local view may mean the removal of a join if more than one of the global relations is used in the same query. For example, the relation ship of the global view may map to a single relation in one database but a join of two in another database. The submarine and ship relations in the global view may actually map to the same underlying boat relation. Field translations include name changes and one-to-many or many-to-one mappings on character fields only. For example, a date may be stored in a single field as YYMMDD in one database but as three fields month, day, and year in another database. Our translation design supports another translation from the distributed local relational schema to the corresponding local nonrelational schema. This data model translation is currently not supported in the Mermaid system.
4) User Interface: The user interface appears to the user to be a DBMS because it provides a set of commands similar to those provided by most DBMSs. This includes support for query libraries, query editors, debugging, help, synonym replacement, spelling correction, report manipulation, and options for customizing the system. The current version of the user interface runs on either a standard CRT terminal or on a Sun bit-mapped display. With both types of terminal, there are windows that represent different functions. When running with the standard CRT, the function windows overlay each other. When running with the multiple-window version for the Sun bit-mapped display, the user is able to format and edit queries in one window, receive traces from Mermaid in another window, get output in a third window, and bring up additional windows as needed for help and synonym editing. The implementation uses the standard set of window management tools provided by Sun (the SunTools), which allows the user to move windows around, use pop-up menus and the mouse, and organize the windows so that multiple functions can be viewed simultaneously. With any terminal type, the user interface provides three screen modes: the standard command mode, the help mode, and the report mode. The standard command mode screen has three windows: the data window, the command window, and the message window. The current query is shown in the data window, the commands to edit or run or validate the query are given in the command window, and messages are given in the message window.
For instance, if the user enters run query01 in the command window, the query named query01 is loaded from the current query library and is displayed in the data window. While it is running, messages like parsing of query in progress appear in the message window. The user can enter help mode by entering the command help in the command window. He then gets a new screen, either on top of the command screen or as an additional window, and works in the help window until exiting from the help mode. The help screen, at the user's discretion, allows the user to step backwards and forwards through various levels of help information. When the report is returned, the user enters into report mode. The report mode screen lets the user scroll or page, as well as search, the report. The user interface also supports synonym replacement for any word or string of words in a query as well as spelling correction. Spelling correction requires close integration with the ARIEL and SQL parsers because unrecognized words must be sent to the spelling corrector during parsing and, if the spelling can be corrected, the correct word must replace the invalid word in the parse tree. With these capabilities and facilities, the user is able to manage his own Mermaid system environment.
5) Data Distribution: Mermaid is designed to operate as a front-end to existing DBMSs. This means that, generally, each database will be disjoint. However, there may be databases with the same or similar schema but different data at different sites, and there may be relations that are replicated at sites other than their originating site for performance and reliability. Four types of data distribution are supported: 1) A local relation exists in its entirety at one site. 2) A replicated relation exists in its entirety at more than one but not necessarily all sites. 3) A fragmented relation is located at several sites; each site contains a subset of the relation's tuples and no tuple exists at more than one site. 4) A dependent fragmented relation is fragmented and the location of a tuple depends upon the location of a tuple in another fragmented relation; dependency means that tuples in a fragment at a site can only join to tuples in another fragment at the same site to generate nonempty results. The concept of the dependent fragment or of placement dependency is a semantic concept which can be used to achieve faster query processing. Placement dependencies exist in databases whether they are recognized and used or not. For example, in an employee database, records about an employee may be located at the headquarters of the employee's division. This results in a logical dependency between the employee's record and the division records.
6) Query Decomposition and Optimization: The simplest but possibly most expensive method of answering a query is to move all of the data to a single site and then process the entire query at the site. This is not acceptable because large amounts of data might need to be moved across the network and large amounts of temporary storage might be needed at the receiving site. Therefore, the query must be decomposed and many subqueries run at individual database sites before the final report is assembled at a single site. A high-level description of this process is contained in this section. Details may be found in [3], [32]-[35].
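The placement dependency described under data distribution above can be made concrete with a small Python sketch. The site names, relations, and rows are invented for illustration, and the nested-loop `join` helper is a stand-in for whatever the local DBMSs would do. The sketch shows that when employee fragments depend on division fragments, joining fragment to fragment at each site gives the same tuples as moving everything to one site first.

```python
def join(left, right, key):
    """Nested-loop equijoin on a shared attribute name."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical data: each site holds one division and only that division's employees.
sites = {
    "LA": {
        "division": [{"div": "D1", "city": "Los Angeles"}],
        "employee": [{"emp": "Ada", "div": "D1"}, {"emp": "Bo", "div": "D1"}],
    },
    "NY": {
        "division": [{"div": "D2", "city": "New York"}],
        "employee": [{"emp": "Cy", "div": "D2"}],
    },
}

# Join fragment to fragment at each site, then take the union of the results.
per_site = [
    row
    for s in sites.values()
    for row in join(s["employee"], s["division"], "div")
]

# Reference: move every fragment to one place and join there.
all_emp = [r for s in sites.values() for r in s["employee"]]
all_div = [r for s in sites.values() for r in s["division"]]
centralized = join(all_emp, all_div, "div")


def canonical(rows):
    """Order-independent form of a result, for comparison."""
    return sorted(tuple(sorted(r.items())) for r in rows)


same = canonical(per_site) == canonical(centralized)
print(same)
```

In this toy data no employee row has to cross the network to be joined with its division, which is the kind of saving the text attributes to recognizing placement dependencies.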
The Mermaid query optimization algorithm is one of the most complete algorithms in the current literature according to [15] and one of the few that has been implemented and tested. In 1982 the first algorithm, the semijoin algorithm, was developed and tested. It is an extension to the SDD-1 algorithm, which assumes that the most important cost is the number of bytes transmitted between processes. The algorithm was extended to support fragmented and replicated relations. In 1983 the replicate algorithm was developed and tested. It is derived from distributed INGRES. It assumes that CPU overhead dominates network costs and uses fragmented relations to maximize the amount of parallelism in operations. In 1984 we began work to support nonuniform processor speeds, nonuniform network speeds, better aggregate processing, and a combined algorithm which uses the best of the semijoin and replicate algorithms [5].

a) Semijoin algorithm: The semijoin algorithm assumes that the cost of transferring data between sites greatly outweighs the local processing costs at each site. The basic idea, therefore, is to reduce relations as much as possible before sending them across the network. This is accomplished by a combination of local operations and intersite semijoins. The algorithm contains four stages: 1) site selection, 2) local reduction, 3) global reduction, and 4) assembly. These are briefly discussed in the following.


PROCEEDINGS OF THE IEEE, VOL. 75, NO. 5, MAY 1987

Site selection is concerned with locating the relations that are relevant to the current query. In contrast to the replicate strategy described below, this algorithm chooses exactly one copy of each relation. Since we are attempting to minimize network traffic, it is desirable that these relations reside at as few sites as possible. Therefore, site selection finds the minimal site set that covers all relevant local relations and fragments, and at least one copy of each relevant replicated relation. Note that within a minimal site set there may still be more than one copy. In this case, an arbitrary choice is made, except that a copy at the result site is preferred.

The second stage of the semijoin algorithm is local reduction. In this phase, each relation is reduced by a) projecting out the attributes referenced in the target and join clauses of the query, and b) selecting tuples by applying predicates from the qualification. Local reduction can be performed at all sites in parallel. The distributor sends one command to each driver for each relation at the site controlled by that driver. After this is done, the relations are as fully reduced as possible without interrelation processing.

In the global reduction stage, the reductive effect of the query's join clauses is achieved through a series of semijoins. A semijoin selects the tuples from one relation which are capable of joining with tuples in another relation on some specified attribute. One advantage of semijoins is that they necessarily reduce the total amount of data, whereas this might not be true with a join. Also, fewer data need to be transferred across the network to perform a semijoin than a join. Instead of moving the whole sending relation, the joining attribute(s) of this relation are projected off, leaving only a set of unique values to be copied to the receiving relation's site. The central problem in global reduction is finding the best sequence of semijoins.
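The semijoin operation itself can be illustrated with a minimal Python sketch. This is not Mermaid code: relations are modeled as lists of dictionaries, and the function names and sample tuples are hypothetical. It shows only that the sending relation contributes a set of unique join values, and that the receiving relation can never grow as a result.

```python
# Illustrative only: relations are lists of dicts; names and data are hypothetical.

def project_unique(relation, attribute):
    """Project a relation onto one attribute, removing duplicate values."""
    return {row[attribute] for row in relation}


def semijoin(receiving, recv_attr, sending, send_attr):
    """Keep the tuples of `receiving` that can join with `sending`.

    Only the set of unique join values from `sending` has to cross the
    network; the full sending relation stays where it is.
    """
    shipped_values = project_unique(sending, send_attr)
    reduced = [row for row in receiving if row[recv_attr] in shipped_values]
    return reduced, shipped_values


install = [
    {"shipid": 1, "weapon": "W1"},
    {"shipid": 2, "weapon": "W2"},
    {"shipid": 3, "weapon": "W1"},
]
weapon1 = [{"id": "W1"}]  # a weapon relation after local reduction

install1, shipped = semijoin(install, "weapon", weapon1, "id")
```

Here `install1` keeps only the two tuples that reference weapon `W1`, while the data shipped between sites is the single value in `shipped`.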
Mermaid uses a hill-climbing algorithm to determine such a sequence. The algorithm evaluates the costs and benefits of each possible semijoin and selects the most profitable one, that is, the one whose benefit most exceeds its cost. This semijoin is executed and the procedure is then iterated until no further profitable semijoins can be found.

The cost of semijoining relation R1 with R2 is assumed to be a function of the size of the joining attribute in R2 which must be copied to R1's site. Since duplicate values are removed in the course of projecting this attribute from R2, the cost is estimated from the expected number of unique values and the width of the joining attribute. The benefit is the amount by which R1 is reduced as a result of the semijoin. Benefit is estimated from the total size of R1 and the selectivity of the joining attribute. The selectivity of an attribute is a measure of the variability of the values of that attribute, that is, whether the attribute has few or many of the values it could possibly assume. The selectivity is calculated by dividing the actual number of distinct values in the attribute by the possible number of distinct values, that is, by the size of the attribute's domain. If a relation is joined with a highly variable attribute it will be less reduced than if joined with an attribute having only a small fraction of the different values in the domain. If the joining attribute contains all the values in the domain, no reduction will occur.

Since the estimation of the cost and benefit is imperfect, the selection of semijoins is performed dynamically. The best semijoin is performed and a tuple count is returned. Then the costs and benefits are recomputed and the best remaining semijoin is selected and performed. This process continues until no semijoin is found whose benefit is greater than its cost.

The final stage in this algorithm is assembly. After global reduction, the reduced relations are gathered at a single result site. Currently this is the site where the user's query originates. Not all of the referenced relations need to be copied to the result site. Sometimes, the full reductive effect of a relation is achieved in the semijoin stage and the relation is of no further use. Once the relations are assembled, the distributor sends a query to the result DBMS driver to produce a report. This query may specify that aggregation, arithmetic operations, joins, and sorting are to be performed by the DBMS. The report is then sent by the distributor to the user interface.

b) Replicate algorithm: In contrast to the semijoin algorithm, the replicate algorithm assumes that local processing costs dominate network delay for most queries. Since transmission costs are of less concern, the replicate algorithm does not reduce relations. Instead, it finds an optimal set of sites at which the query can be executed in parallel, and it then combines the partial answers produced at these sites. The main features of this strategy are that it enhances parallelism, reduces the number of intermediate objects which must be built, permits the local DBMS to perform more local optimization, and is conceptually simple and easy to implement. There are four stages in the replicate algorithm: 1) site selection, 2) replication, 3) query execution, and 4) assembly. We will outline these below.

Site selection is the first, and most important, stage of this algorithm. As in the semijoin method discussed above, it serves to choose that set of sites at which query processing will be performed.
There is a significant difference, however, in the handling of replicated relations. Whereas semijoin site selection has the effect of picking only one copy of a relation for consideration, the algorithm outlined here makes full use of all available copies. In fact, the greater the degree of replication, the more efficient this strategy will tend to be.

The replicate algorithm seeks to execute the user's query at several different sites in parallel. All sites containing relations referenced by the query are potential processing sites. However, the set that is actually chosen will have the following characteristics: a) not more than one relation will remain fragmented, i.e., have fragments at the selected sites, and b) there will be complete copies of all other referenced relations at each selected site. Since there will often be no set of sites that initially meets these requirements, fragments or full relations may have to be copied between sites. The chosen set is simply the one which requires the least data transfer.

The procedure for selecting processing sites is similar to that originally specified for distributed INGRES. The procedure first chooses the relation which is to remain fragmented. This is simply the relation having the largest total size. Then, weighting functions are used to evaluate the relative desirability of each site on the basis of global data transfer requirements. A site is selected if the data it receives as a processing site are less than the additional data it would have to send if it were not a processing site.

In Mermaid, the site selection process has been generalized to take placement dependencies into account. The site selection algorithm does not actually operate on relations but rather on classes of relations. Within each such class all relations have valid placement dependencies, but there are no dependencies between relations in different classes. Essentially, a class of relations can be treated as a single fragmented relation, and this substantially reduces transmission costs.

The second stage in this algorithm is replication. Here certain relations or fragments are copied to the sites which have been selected for query processing. The action taken for any given relation R depends on whether R is fragmented and, if so, whether it has been chosen to remain fragmented. If R is not fragmented, it is copied to all processing sites (where it does not already exist). If R is fragmented and is to remain so, then all fragments of R not currently at a processing site are copied to one. Otherwise, all fragments of R are copied to each processing site (where they do not already exist).

Query execution is the next stage. Once the processing sites contain the necessary data, the user's query is sent to each of these sites and is executed in parallel to provide a set of partial answers. Actually, the query may be modified somewhat before it is sent. Queries which specify sorting provide a simple example. Sorting done at the processing sites would only have to be repeated at the result site when the partial answers were combined. Another example involves aggregation. There are certain situations in which operations such as count and average cannot easily be performed in a distributed manner. In such cases, the aggregation operator may be modified or removed from the user's query before it is sent to the processing sites. These operations are delayed until after the partial answers have been assembled at the result site.
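One situation in which an aggregate cannot simply be pushed to the processing sites is an average over unequal partial answers. The following Python sketch uses made-up values at two hypothetical sites; it illustrates the reason for delaying the aggregate until after assembly and is not a description of how Mermaid rewrites queries.

```python
# Hypothetical partial data held at two processing sites.
site_a = [10, 20, 30, 40]   # four values
site_b = [100]              # one value


def average(values):
    return sum(values) / len(values)


# Naive: run the aggregate at each site, then aggregate the partial answers.
naive = average([average(site_a), average(site_b)])

# Delayed: remove the aggregate from the per-site query, union the partial
# answers at the result site, and only then apply the aggregate.
delayed = average(site_a + site_b)
```

The naive result is 62.5 while the delayed result is 40.0, the true average of the five values.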
The fourth and final stage of the replicate algorithm is assembly. The query is executed in parallel and yields a partial answer at each of the processing sites. These partial answers are then moved to the result site where they are unioned into a temporary relation. This relation may constitute the complete answer or it may require further processing, as in the sorting and aggregation cases mentioned above. In either case, a report is generated and sent back to the user interface for display to the user.

c) Combined algorithm: The replicate and the semijoin algorithms are based on different assumptions about the network and the database. Since we are operating in a heterogeneous environment with predefined databases, we need to operate efficiently when there are extensive replicated data as well as when there are basically disjoint databases. The replicate algorithm performs better in the first case while the semijoin algorithm performs better in the second case. We need to operate on fast local networks as well as across internet gateways. The replicate algorithm was designed for operation on local networks while the semijoin algorithm was originally designed for the ARPAnet. Therefore, we have designed and are implementing a combined algorithm that is further described in [5]. The new algorithm is basically an extension of the replicate algorithm. However, selection clauses are applied to individual relations or fragments and some semijoins may be performed to reduce large relations or fragments before they are moved across the network. In addition, processing cost is considered as well as data transmission cost, and the transmission cost may be different between each pair of computers. This algorithm will perform well even when there are no replicated or fragmented relations in the database because it degenerates to a variation of the semijoin algorithm.

7) System Control: The controller does process initialization of the drivers at local or remote sites, sets up interprocess communication, and handles the asynchronous I/O between the distributor and drivers. The controller and the communication mechanism provide the functionality of a distributed operating system, above the independent UNIX 4.2 operating systems. The controller is contained within the distributor process and has four major functions: configuration control, DIL parsing, DIL generation, and I/O control. These provide the central control of the Mermaid system. They are wrapped around the query optimizer and provide services for it.

The configuration control determines what options are to be used for a run and where the databases are to be found. The options control the amount and type of debugging information, which varies when the system is running in different modes. The configuration control journals the options that are in effect for the run, carries out the necessary remote log-ins, and starts the DBMS driver processes. It also provides for the detection and handling of internal errors.

The DIL is transformed by the controller from its external, human readable format into internal structures that are used by the optimizer. As the optimizer plans the query execution, it gives pieces of the query to the DIL generator which turns the internal form back into an ASCII form. This text is sent to DBMS drivers for execution.

The distributor process is the only process that may be handling multiple outstanding commands, but only one command may be outstanding to any driver process. The I/O control maintains read and write queues for each DBMS driver.
As the optimizer writes a command to the DBMS driver, the write request is enqueued. If the site is available, the actual write is performed, the request is dequeued, and control is returned to the optimizer. As the optimizer requests that an acknowledgment be read from any site, the I/O control requests any messages from the DBMS drivers. As messages are received from the DBMS drivers, the read queues are emptied of their outstanding work, and the corresponding write queue is checked for any other work that the optimizer may have requested. Thus many DBMS driver processes may be running in parallel, and the optimizer does not block on every command that it writes.

A handshake mechanism with the supposedly active DBMS drivers is provided. The maximum time for completion of a handshake is so small that it is not noticed by the users, and a failed site can be detected in a matter of seconds instead of in minutes.

8) Network Protocol: TCP/IP is the DoD standard protocol and is therefore the first one selected for use by Mermaid. A main reason for selecting Sun computers is that the UNIX 4.2 comes with an implementation of TCP/IP. However, TCP/IP is designed mainly for terminal-to-host communication and not for distributed application program communication. Mermaid's concept of operation does not fit the TCP/IP model.

In the Mermaid model, a master process is started when a user logs in. The master starts slave processes on the same or remote computers. Any process may communicate with any other process so many-to-many communication connections are needed. All of the processes are operating on behalf of the user and therefore are owned by him. This is necessary for security so that the DBMS access controls can check permissions for the specific user and for accounting so that users can be billed for the resources used.

To support remote process management in Mermaid, at least the following capabilities need to exist within the network or the combination of the network and the operating system.

Remote process initiation. The user interface has to start the distributor process and then the controller starts a local or remote driver process for each database that is to be accessed.
Out-of-band messages. The controller needs to send a software interrupt to the remote process to kill it or to determine whether or not it is still operational.
Permanent communication channels. A permanent I/O channel must be set up between the user interface and the distributor process and between the controller and each driver process.
Dynamic communication channels. Any two processes located anywhere in the network must be able to set up a communication channel between them.
Nonblocking read. It should be possible for the controller to read the various I/O channel connections to the driver processes in an asynchronous, nonblocking fashion. Otherwise, a synchronous polling scheme must be used, with degraded performance due to the (sometimes) unnecessary waiting for data to arrive.
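The nonblocking read requirement can be illustrated with a short Python sketch built on the standard `select` and `socket` modules. It is not Mermaid's implementation: the function `poll_drivers`, the driver names, and the acknowledgment text are hypothetical, and in-process socket pairs stand in for the controller-to-driver channels.

```python
import select
import socket


def poll_drivers(channels, timeout=1.0):
    """Return {driver_name: bytes} for every channel that has data ready.

    `channels` maps a driver name to a connected socket. select() reports
    only the sockets that can be read without blocking, so an idle driver
    never stalls the controller.
    """
    by_socket = {sock: name for name, sock in channels.items()}
    readable, _, _ = select.select(list(by_socket), [], [], timeout)
    return {by_socket[sock]: sock.recv(4096) for sock in readable}


# Two in-process socket pairs stand in for controller<->driver channels.
ctl_s2, drv_s2 = socket.socketpair()
ctl_s3, drv_s3 = socket.socketpair()

drv_s3.sendall(b"ACK LOCAL2")          # only the S3 driver has answered
ready = poll_drivers({"S2": ctl_s2, "S3": ctl_s3})

for sock in (ctl_s2, drv_s2, ctl_s3, drv_s3):
    sock.close()
```

Only the channel with pending data appears in `ready`, so the controller can pick up the S3 acknowledgment without waiting on the driver at S2.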
We have developed a communication mechanism which provides Mermaid with two types of network-related activities to facilitate the operation of the system. They are remote process creation and interprocess communication between any distinct pair of processes at arbitrary locations in the network.

The communication mechanism is layered and has evolved over the course of the project. Initially, we used the standard UNIX 4.2 Socket Interface with TCP as the underlying transport protocol. This is a low-level interface which is relatively nonportable. In the current system, we have moved to the Sun Remote Procedure Call (RPC) facility. This allows Mermaid programs to use function call semantics for networking, and it provides a uniform external data representation. RPC implementations are available for several different machines and operating systems. To take advantage of the emerging OSI network standards, future releases of Mermaid will use the MAP [8] and DODIIS (Department of Defense Intelligence Information System) protocols for interprocess communication. This will support access to computers beyond the local network.

9) DBMS Drivers: Since the DBMS driver does low-level process control that is very dependent upon particular operating system calls and acts as the user of the DBMS, a driver must reside at each DBMS site. A different version of the driver is needed for each DBMS type and operating system. The driver is initiated by the controller when the first query is made. It is started as a local or remote process that is owned by the user of Mermaid. The first message sent to the driver includes the database name and information about the network configuration. The driver establishes a connection with the DBMS and opens a database as a user. It then stores the network configuration table for future reference.

Intersite transfer of relations is done by retrieving a relation, translating the data into global units, and storing them in a buffer or buffers in a standard format. The receiving driver translates the form to the restore format for the particular DBMS and executes a bulk load. If the relation already exists, the restore appends tuples rather than replacing them, which allows Mermaid to gather fragments of relations from several sites into a single relation at a single site.

D. Testing and Measurement

Tests were conducted using a database that exists in a centralized version as well as in different distributed configurations. The test database contains seventeen relations which range from four tuples to 19 000 tuples. The test database is a Navy database that has ships, positions, information about weapons installed on the ships, characteristics of the ships, battle groups which are temporary groupings of ships, and visits to ports by the ships. The database is distributed to four sites which correspond to the second, third, sixth, and seventh U.S. Naval Fleets. Some relations are fragmented, some are replicated, some are dependent fragmented, and some are single-site relations.

A set of test questions was developed to test access to different data distributions, different numbers and types of join, and different features of the query language. Each new release of Mermaid has been tested with this set of test questions.

We have done system tests to compare the operation of the semijoin and replicate algorithms and to compare the operation of the centralized version of the database with operation on four databases on a single computer and on three different computers. The results of the early testing are given in [28]. Results of testing on the network are given in [30].

IV. EXAMPLE TO ILLUSTRATE OPERATION

We will trace the operation of one query to illustrate the operation of the system. First, we will show how the query is answered using the semijoin algorithm, and then we will show the same query using the replicate algorithm.
A. Semijoin Algorithm

In this example, we will be concerned with three relations in a naval database, namely, the ship, weapon, and install relations. The first two describe various characteristics of the ships and weapons systems in the database, and the third specifies which weapons are installed on which ships. There are four sites, S2, S3, S6, and S7, each of which corresponds to a fleet. The ship relation is fragmented at all four of these sites. Thus ships in the second fleet, for example, are described in the fragment at S2. The install relation is also fragmented at the four sites, and it is dependent on ship. That is, install tuples at a given site can only join with ship tuples at the same site to produce a nonempty result. The weapon relation is replicated in its entirety at two sites, S2 and S3.

We assume that a query "Which US destroyers carry Harpoon weapons?" is submitted at S2. The user might express this query in ARIEL as follows:

    retrieve name of ship
    where type = "DD" and flag = "USA"
    and name of weapon = "Harpoon"
    and ship.num = install.shipid
    and install.weapon = weapon.id

The translator in the user interface translates the ARIEL query into DIL as follows:

    BEGIN
    VARIABLES ship IS ship, install IS install, weapon IS weapon
    RETRIEVE name (ship.name)
    SELECT ship.type = "DD"
           ship.flag = "USA"
           weapon.name = "Harpoon"
    MERGE ship.num == install.shipid,
          install.weapon == weapon.id
    END

The user interface sends this DIL query to the distributor. The first stage in the optimization process is site selection. Because the ship and install relations are fragmented at sites S2, S3, S6, and S7, all four of these sites must be chosen. Furthermore, the copy of weapon at S2 is chosen over the one at S3 because S2 is the result site where the user resides.

The goal of the semijoin algorithm is to reduce the size of relations as much as possible before assembling them at the result site. This is accomplished in the local and global reduction stages of the optimization process. For this query, there are two local reduction operations, which we refer to as LOCAL1 and LOCAL2, and two global reduction operations, GLOBAL1 and GLOBAL2 (as discussed below).
Table 1 demonstrates the progressive reduction of all the relevant relations at the four sites. For example, the ship fragment at S2 initially contains 370 tuples, but after LOCAL2 it is reduced to 20 tuples. Similarly, there are initially 772 tuples in install at S2, but the GLOBAL1 and GLOBAL2 operations reduce this fragment to 111 and then 14 tuples.

Table 1: Progressive Reduction

SITE  RELATION  INITIAL  LOCAL1  LOCAL2  GLOBAL1  GLOBAL2
S2    ship          370     370      20       20       20
      install       772     772     772      111       14
      weapon         51       1       1        1        1
S3    ship          361     361      22       22       22
      install       763     763     763      109       16
S6    ship          140     140       5        5        5
      install       184     184     184       13        0
S7    ship          140     140       4        4        4
      install       181     181     181       14        0

In the local reduction phase, the selection clauses found in the user's query are applied to the relations, and these relations are then projected on their joining and target attributes. The LOCAL1 operation reduces the weapon relation. It is accomplished by sending the following command to the driver at site S2:

    retrieve into weapon1 (weapon.id)
    where weapon.name = "Harpoon"

The LOCAL2 operation involves sending the following command to the drivers at S2, S3, S6, and S7:

    retrieve into ship1 (ship.name, ship.num)
    where ship.type = "DD" and ship.flag = "USA"

This command is executed in parallel at the four sites. The distributor waits until all acknowledgments have been received before proceeding.

In the global reduction stage, semijoins are used to further reduce relations. Initially, four semijoins are possible. These are as follows:

    SEMIJOIN ship BY install
    SEMIJOIN install BY ship
    SEMIJOIN weapon BY install
    SEMIJOIN install BY weapon

The cost and benefit of each of these semijoins is computed, and the most profitable over a certain threshold is selected for execution. The most profitable semijoin turns out to be reducing install by weapon. Normally, performing the semijoin would require projection of weapon onto its joining attribute, weapon.id. In this case, however, the projection has already been done in LOCAL1. Therefore, weapon1 can be copied directly to the other sites at which install is fragmented:

    COPY weapon1 @ S2 → weapon1 @ S3, S6, S7
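The cost/benefit selection used in global reduction can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the Mermaid optimizer. The paper says only that cost is estimated from the number of unique values and the attribute width, and that benefit is estimated from the size of the reduced relation and the selectivity; the specific benefit formula, all sizes and domain sizes, and every class and function name below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    name: str
    size: int                 # estimated size in bytes (hypothetical figures)


@dataclass
class JoinAttribute:
    unique_values: int        # expected distinct values after projection
    width: int                # bytes per value
    domain_size: int          # number of values the attribute could assume

    @property
    def selectivity(self) -> float:
        # distinct values actually present / size of the attribute's domain
        return self.unique_values / self.domain_size


@dataclass
class Semijoin:
    label: str
    receiver: Relation        # relation being reduced
    sender: JoinAttribute     # joining attribute shipped to the receiver


def cost(sj: Semijoin) -> float:
    # Bytes shipped: unique join values times attribute width.
    return sj.sender.unique_values * sj.sender.width


def benefit(sj: Semijoin) -> float:
    # ASSUMPTION: expected reduction = size * (1 - selectivity).
    return sj.receiver.size * (1.0 - sj.sender.selectivity)


def hill_climb(candidates, threshold=0.0):
    """Repeatedly apply the most profitable semijoin; stop at the threshold."""
    remaining = list(candidates)
    chosen = []
    while remaining:
        best = max(remaining, key=lambda sj: benefit(sj) - cost(sj))
        if benefit(best) - cost(best) <= threshold:
            break
        # Stand-in for executing the semijoin and reading back a tuple count.
        best.receiver.size -= int(benefit(best))
        chosen.append(best.label)
        remaining.remove(best)
    return chosen


install = Relation("install", 8000)
ship = Relation("ship", 900)
candidates = [
    Semijoin("install BY weapon", install, JoinAttribute(1, 4, 51)),
    Semijoin("install BY ship", install, JoinAttribute(20, 4, 370)),
    Semijoin("ship BY install", ship, JoinAttribute(300, 4, 370)),
]
order = hill_climb(candidates)
```

With these made-up estimates the loop first reduces install by weapon, then install by ship, and stops because the remaining semijoin would cost more to ship than it saves. Costs are recomputed on every pass because each executed semijoin changes the size of its receiving relation.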

Now, in operation GLOBAL1, we actually perform the semijoin by sending the following command to S2, S3, S6, and S7:

    retrieve into install1 (install.shipid, install.weapon)
    where install.weapon = weapon1.id

This command is executed in parallel at the four sites and acknowledgments are sent to the distributor. After they are all received, the distributor examines the remaining semijoins. There are now three possibilities:

    SEMIJOIN ship BY install
    SEMIJOIN install BY ship
    SEMIJOIN weapon BY install

This time the cost/benefit analysis determines that install should be reduced by ship. Because install and ship are fragmented at all four sites and a placement dependency exists between them, it is not necessary to project off the ship.num attribute or to copy this attribute between sites. Instead, the semijoin operation GLOBAL2 is achieved by simply sending the following command to S2, S3, S6, and S7:

    retrieve into install2 (install1.shipid, install1.weapon)
    where install1.shipid = ship1.num

Again, the semijoins are executed in parallel and each driver returns an acknowledgment. At this point, two further semijoins are still possible, but the optimizer decides that neither is worthwhile. Thus the global reduction phase is over. The install relation has been twice reduced and, in fact, Table 1 shows that the fragments at S6 and S7 have no more qualifying tuples at all.

The next stage in the semijoin algorithm is assembly. We build a complete copy of the reduced ship relation at result site S2 by copying the remote ship fragments to S2 and appending them to the fragment already there. This assembly process is then repeated for the install relation. Note that there are only two nonnull fragments of install after global reduction. Also, we do not need to assemble weapon because S2 already contains a complete copy of that relation.

    COPY ship1 @ S3 → ship1 @ S2
    COPY ship1 @ S6 → ship1 @ S2
    COPY ship1 @ S7 → ship1 @ S2
    COPY install2 @ S3 → install2 @ S2

After all relevant relations have been reduced and assembled at the result site, a report is produced and sent to the user interface. This is done by sending the following command to the driver at S2:

    retrieve (ship1.name)
    where ship1.num = install2.shipid

The select clauses in the original query are absent here because they have already been applied during local reduction. Also, it is unnecessary to specify the join between install and weapon because GLOBAL1 has reduced install by weapon as completely as possible.

After the user has received the report, the final step is to send each driver a command to destroy all the temporary relations that have been built during processing. The distributor then waits to receive the next query from the user interface.

While the semijoin algorithm produced satisfactory results in the above example, it should be noted that an even better strategy could have been followed. This would have been to reduce ship by install in GLOBAL2 instead of reducing install by ship. Doing so would have made it unnecessary to move any fragments of install to S2 during assembly. This is because install does not contain target attributes and it could have no further reductive effect on ship during the report stage. Thus the command to generate the report would simply be:

    retrieve (ship1.name)

There are two reasons why this preferred strategy was not followed. First, to compensate for ignoring processing costs, we are currently using a benefit threshold in selecting semijoins. That is, we will not perform a semijoin that can only eliminate a few tuples because its DBMS cost usually outweighs its data transmission benefit. Thus ship was not further reduced because its largest fragment only contained 22 tuples. Second, the algorithm did not recognize the advantage of reducing ship because it was unable to look ahead to the assembly stage. During global reduction, semijoins were chosen for their immediate benefit and the longer range possibility of eliminating unnecessary assembly steps was missed. These problems will be addressed in future versions of the optimizer.

B. Replicate Algorithm

In order to contrast the operation of the semijoin and replicate algorithms, we will use the same query and database environment as above. The first stage in the replicate algorithm is site selection. This involves choosing a relation (or class of relations) to remain fragmented, and then deciding which of the sites containing these fragments will be used for processing. Table 2 shows the initial distribution of the relations at the four sites, with the symbol F denoting fragments and C denoting copies. The size of each fragment and copy is specified accordingly. Table 2 also shows weights that are used in choosing the processing sites.

Table 2: Site Selection

Site  Ship   Install  Weapon  Weight
S2    F 370  F 772    C 51      1142
S3    F 361  F 763    C 51      1124
S6    F 140  F 184               273
S7    F 140  F 181               270
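The weights in Table 2 can be reproduced with a short Python calculation. The rule used here is the size of the fragmented data already at a site minus the size of any other data that would have to be sent there. The function and variable names are ours, introduced only for illustration.

```python
# Fragment sizes from Table 2 (ship and install form one class).
fragments = {
    "S2": {"ship": 370, "install": 772},
    "S3": {"ship": 361, "install": 763},
    "S6": {"ship": 140, "install": 184},
    "S7": {"ship": 140, "install": 181},
}
weapon_size = 51
weapon_sites = {"S2", "S3"}   # sites that already hold a copy of weapon


def site_weight(site, fragments, replicated_size, replica_sites):
    """Fragment data already at the site minus data that must be sent to it."""
    local_fragment_data = sum(fragments[site].values())
    missing = 0 if site in replica_sites else replicated_size
    return local_fragment_data - missing


weights = {
    site: site_weight(site, fragments, weapon_size, weapon_sites)
    for site in fragments
}
processing_sites = [site for site, weight in weights.items() if weight > 0]

# Variation: if the S7 fragments held only 20 tuples each, the weight would
# turn negative and S7 would not be chosen as a processing site.
small_s7 = dict(fragments, S7={"ship": 20, "install": 20})
weight_small_s7 = site_weight("S7", small_s7, weapon_size, weapon_sites)
```

The computed weights are 1142, 1124, 273, and 270, so all four sites qualify. In the variation the S7 weight is 20 + 20 - 51 = -11.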

Because there is a placement dependency between the ship and install relations, these two relations form a class. Normally the largest class is chosen to remain fragmented but, in this case, there is only one possibility (since weapon is not fragmented). Therefore, its members, ship and install, will remain fragmented at some subset of their initial four sites. Each site that is chosen for processing needs to contain a copy of the weapon relation. If weapon does not already exist there, it has to be sent from another site. We will select the set of sites that minimizes data transmission. This could be done by calculating the total transmission cost for every subset of the four sites, but it is more efficient to compute a single weight for each site. These weights represent the size difference between the fragmented data already at a site and the other data needed there. For example, the weight for S6 is calculated as: 140 + 184 - 51 = 273. Table 2 shows that all the weights are positive, and therefore all four sites are chosen to be processing sites.

The next stage of this algorithm is replication. Each processing site must have complete copies of all relations except the one that is to remain fragmented. In this case,

TEMPLETON et a/.: MERMAID

705

we send a copy of weapon to sites S 6 and S7: COPY weapon @ S2


+

weapon @ S6, S7

Schema translation including relation and field m a p ping. * Different network protocols and configurations. The current Mermaid prototype is being enhanced t o provide the followingfeatures: Support for updates. New queryoptimizer thatcombines the semijoin and replicate algorithms. More user interface features such as a forms/menu interface. Provision of a program interface. Mermaid communications packagecalls to isolate the programs from the network protocolorder to make in protocol conversion more transparent. A schema design tool is also being developed which supports the user when developing the global view of the databases from existing schemata.

Once allnecessary data are present at the processing sites, the query can be executed in parallel at eachof these sites. This is accomplished by sending the following command to thedrivers at S2, S3, S6,and S7: retrieve into result (ship.name) where ship.type = "DD" and ship.flag = "USA" and weapon.name = "Harpoon" and ship-num = installshipid and install.weapon = weapon.id Execution of this command results in a partial answer at each processing site. Afterreceivingacknowledgments from all of these sites, the distributor gathers the partial answers into a single complete answer at the result site. COPY result @ S3 COPY result @ S6 COPY result @ S7
+
+

result @ S2 result @ S2 result @a S2

B. Future Research
The major research efforts are in the areas of: Expert systems. Secure systems. Object management systems.

The next stage in the replicate algorithm is to produce the report at the result site and write it to the user interface. We do so by sending the following command to S2:

    retrieve (result.name)

Finally, as in the semijoin algorithm, the distributor destroys all the temporary relations it has created and then waits for another query.

Since the above example is fairly simple, we should briefly consider a more complex case. Referring to Table 2, suppose the sizes of the fragments of ship and install at S7 are 20 each. Then we would obtain a negative weight at S7 and this site would not be selected for processing. That is, it would be more costly to move weapon to S7 than to move the fragments at S7 elsewhere and do the processing at three sites. This latter option, then, is exactly what is done. The fragments of ship and install at S7 are sent to S2 and unioned with copies of S2's ship and install fragments. Then weapon is sent to S6 and the query is executed in parallel at the three processing sites, S2, S3, and S6.
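The more complex case can be traced with a small planning sketch. It is a simplified illustration, not the Mermaid distributor: the function `plan_replication` and its parameters are hypothetical, the fragment totals for S2 and S3 are invented (Table 2 is not reproduced here), and the assumption that S2 and S3 already hold weapon is inferred from the fact that copies are sent only to S6 and S7 in the first example. The destination for displaced fragments is passed in and is assumed to be a processing site.

```python
def plan_replication(fragment_size, has_copy, relation, relation_size,
                     source, destination):
    """Pick processing sites by weight and list the COPY steps needed."""
    weights = {
        site: size - (0 if has_copy[site] else relation_size)
        for site, size in fragment_size.items()
    }
    processing = [site for site, w in weights.items() if w > 0]
    commands = []
    # Fragments at rejected sites move to the chosen destination site.
    for site in fragment_size:
        if site not in processing:
            commands.append(f"COPY fragments @ {site} -> fragments @ {destination}")
    # Processing sites lacking the replicated relation receive a copy.
    for site in processing:
        if not has_copy[site]:
            commands.append(f"COPY {relation} @ {source} -> {relation} @ {site}")
    return weights, processing, commands


fragment_size = {
    "S2": 300,        # hypothetical total
    "S3": 250,        # hypothetical total
    "S6": 140 + 184,  # from the text
    "S7": 20 + 20,    # the modified case from the text
}
has_copy = {"S2": True, "S3": True, "S6": False, "S7": False}

weights, processing, commands = plan_replication(
    fragment_size, has_copy, "weapon", 51, source="S2", destination="S2"
)
print(weights["S7"])  # -11
print(processing)     # ['S2', 'S3', 'S6']
for command in commands:
    print(command)
```

With these inputs S7 is rejected, its fragments go to S2, and weapon is copied only to S6, matching the outcome described in the text.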

V. CONCLUSIONS AND FUTURE RESEARCH

A. Conclusions

Mermaid is an operational prototype which has demonstrated the feasibility of operating as a front-end to distributed heterogeneous databases. It has been used for testing and improving query optimization algorithms and system control strategies. A user language, ARIEL, has been defined which incorporates basic relational functionality with relaxed syntax (compared to relational languages such as QUEL, SQL, and IDL) and extended semantics. An internal language, DIL, has been implemented to support basic relational functionality and network commands. Mermaid supports heterogeneity at many levels:


* Different DBMSs and computers containing databases.
* Data translation including function conversions and enumerated type conversions.

1) Expert Systems: Supporting a semantic data model not only opens the possibility for more sophisticated user interfaces and aids in the translation process, but it provides a basis for enforcing a much higher degree of data integrity than is possible in a purely relational system. Also, we envision that an expert system should be able to tie directly into the semantic model, and thus add inference capabilities to the system. UNISYS is currently developing a Flexible Deductive Engine (the FDE) [36] which can use Mermaid to access external databases. The FDE runs on a Xerox 1100 and is connected to the Ethernet. It can produce a query in DIL and send it to Mermaid for execution. We have proposed to add the capability to send queries from Mermaid to the FDE so that inference capability may be added to the Mermaid system.

2) Secure Systems: An ultimate goal of the Mermaid project is to develop a system that can be evaluated at the B1 security level. This is an NSA security rating as specified in the "Orange Book" [6]. A security model has been developed, and the current Mermaid system is being evaluated against the model. The results of the evaluation will influence the future design of the Mermaid system.

3) Object Management Systems: A longer term research issue is the development of an object management system that will provide integrated management and sharing of structured data objects, text, images, and voice. Current object oriented systems such as the Xerox PARC SMALLTALK [9] and Apple Lisa [26] are not integrated with full DBMS capabilities. In the meanwhile, object oriented machines such as the INTEL iAPX 432 [10] offer new architectures for object management, and new developments in database management such as the CCA Local Data Manager (LDB) [4] and Berkeley's Postgres [25] support new data types. In order for the Mermaid system to integrate different object types, the basic functionality supported by ARIEL and DIL will need to be extended beyond relational operators to support operators for the additional object types.

PROCEEDINGS OF THE IEEE, VOL. 75, NO. 5, MAY 1987

REFERENCES

[1] P. Bernstein, N. Goodman, E. Wong, C. Reeve, and J. Rothnie, Query processing in a system for distributed databases (SDD-1), ACM Trans. Database Syst., Dec. 1981.
[2] Britton Lee Inc., IDM 500 Software Reference Manual, Version 1.7, Nov. 1984.
[3] D. Brill, M. Templeton, and C. Yu, Distributed query processing strategies in Mermaid: A frontend to data management systems, in Proc. IEEE Data Engineering Conf., Apr. 1984.
[4] A. Chan, U. Dayal, S. Fox, N. Goodman, R. Ries, and D. Skeen, Overview of an ADA compatible distributed database manager, in Proc. ACM SIGMOD, May 1983.
[5] A. Chen, D. Brill, M. Templeton, and C. Yu, Distributed query processing in Mermaid: A frontend system for multiple databases, submitted for publication, 1986.
[6] DoD Computer Security Center, Trusted computer system evaluation criteria, Tech. Rep. CSC-STD-001-83, Aug. 1983.
[7] R. Epstein, M. Stonebraker, and E. Wong, Distributed query processing in a relational database system, in Proc. ACM SIGMOD, May 1978.
[8] General Motors Technical Center, Manufacturing Automation Protocol Specification, Version 2.1, Mar. 1985.
[9] A. Goldberg and D. Robson, SMALLTALK-80: The Language and Its Implementation. Reading, MA: Addison-Wesley, 1982.
[10] J. Hemenway and R. Grappel, Intel's iAPX micromainframe, Mini-Micro Systems Rep., May 1981.
[11] Integrated Information Support System (IISS)-An evolutionary approach to integration, AF Wright Aeronautical Lab. Rep., 1985.
[12] Information Builders Inc., FOCUS General Information Guide, 1985.
[13] T. Landers and R. Rosenberg, An overview of Multibase, in Proc. Int. Symp. on Distributed Database, 1982.
[14] Logicon Inc., ADAPT I: Final functional and system design specification, Rep. 76-C-0899-2, Jan. 1978.
[15] G. Lohman, C. Mohan, L. Haas, B. Lindsay, P. Selinger, and P. Wilms, Query processing in R*, IBM Res. Rep. RJ 4272, Apr. 1984.
[16] R. MacGregor, ARIEL-A semantic frontend to relational DBMSs, in Proc. VLDB Conf., Aug. 1985.
[17] Oracle Corp., ORACLE SQL/UFI Reference Manual, Version 4.0, June 1984.
[18] Rhodnius Inc., Mistress: The Query Language, Version 2.2, July 1982.
[19] SAFE Project user interface requirements specification, TRW Tech. Rep. CE-7200E, Feb. 1983.
[20] R. Schantz and R. Thomas, The architecture of the Cronus distributed operating system, BBN Lab. Rep., Apr. 1985.
[21] D. Shipman, The functional data model and the data language DAPLEX, ACM Trans. Database Syst., Mar. 1981.
[22] J. Smith et al., ADAPLEX Reference Manual, Computer Corp. of America, Jan. 1981.
[23] J. Smith, P. Bernstein, U. Dayal, N. Goodman, T. Landers, K. Lin, and E. Wong, Multibase-Integrating heterogeneous distributed database systems, in Proc. AFIPS, 1981.
[24] W. Staniszkis, M. Kowalewski, G. Turco, K. Krajewski, and M. Saccone, Network data management system-General architecture and implementation principles, in Proc. Int. Conf. on Engineering Software, Apr. 1983.
[25] M. Stonebraker and L. Rowe, The design of Postgres, in Proc. SIGMOD, May 1986.
[26] G. Stewart, A first look at Lisa, Popular Computing, Mar. 1983.
[27] Tandem Computers Inc., Distributed Database Management, 1981.
[28] M. Templeton, D. Brill, A. Hwang, I. Kameny, and E. Lund, An overview of the Mermaid system-A frontend to heterogeneous databases, in Proc. IEEE EASCON, Sept. 1983.
[29] M. Templeton and J. Kendall, Solving the DODIIS database interoperability problem, in Proc. AFCEA, Mar. 1985.
[30] ... Mermaid-Experiences with network operation, in Proc. IEEE Data Eng. Conf., Feb. 1986.
[31] R. Williams et al., R*: An overview of the architecture, IBM Res. Rep. RJ3325, Dec. 1981.
[32] C. Yu, C. C. Chang, M. Templeton, D. Brill, and E. Lund, Query processing in a fragmented relational distributed system: MERMAID, IEEE Trans. Software Eng., vol. SE-11, pp. 795-810, Aug. 1985.
[33] C. Yu, K. Guh, C. Chang, C. Chen, M. Templeton, and D. Brill, Placement dependency and aggregate processing in a fragmented distributed database environment, in Proc. IEEE COMPSAC, Nov. 1984.
[34] -, An algorithm to process queries in a fast distributed network, in Proc. Real Time System Symp., Dec. 1984.
[35] C. Yu, C. Chang, M. Templeton, D. Brill, and E. Lund, On the design of a query processing strategy in a distributed database environment, in Proc. ACM SIGMOD, May 1983.
[36] D. Van Buer, D. Kogan, D. McKay, L. Hirschman, R. Whitney, and R. Davis, FDE: A system for experiments in interfaces between logic programming and database, in Proc. NATO Advanced Study Workshop, July 1985.

Ada is a trademark of the Department of Defense (Ada Joint Program Office).

Son K. Dao received the B.S. degree from Purdue University, Lafayette, IN, and the M.S. degree from California State University, Northridge, CA, both in computer science. Since 1979 he has been involved in the development and implementation of a query retrieval language for IMS databases. He was the architect of a micro-mainframe Query Retrieval System developed at Informatics. In 1984 he joined the Mermaid Project at System Development Corporation (now, UNISYS Corporation), Santa Monica, CA, where he developed the procedural interface and was involved with the network protocol enhancement for Mermaid. He is currently manager of the translator group. His responsibilities include the semantic query language, the language and schema translation, and the data dictionary design and implementation. He is also a lecturer at the Electrical Engineering and Computer Science Department, California State University, Northridge. Mr. Dao is a member of the IEEE Computer Society.

Arbee L. P. Chen (Member, IEEE) received the B.S. degree from National Chiao Tung University, Taiwan, Republic of China, in 1977, the M.S. degree from Stevens Institute of Technology, Hoboken, NJ, in 1981, both in computer science, and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, CA, in 1984. He was a Research Scientist at UNISYS Corporation (formerly, System Development Corporation), Santa Monica, CA, where he worked on query optimization, semantic query processing, concurrency control, and performance analysis for the Mermaid system. He is now with Bell Communications Research, Inc., Morristown, NJ. His primary research interests include databases, distributed systems, and knowledge-based systems. Dr. Chen is a member of the Association for Computing Machinery, the IEEE Computer Society, and the ANSI/X3/SPARC Database Systems Study Group.

Robert MacGregor received the A.B., M.A., and Ph.D. degrees in computer science from the University of California at Berkeley, where his research was in the area of concrete complexity theory. Subsequently, he worked in the area of database management systems, and in 1982 he joined System Development Corporation (now, UNISYS Corporation), Santa Monica, CA, to work on a semantic query language and on translators for the Mermaid system. His work at UNISYS also involved object-oriented languages and knowledge representation. He is now at the Information Science Institute, Marina del Rey, CA.

Patricia Ward is currently project manager of the user interface section of the Mermaid Project. She has been with UNISYS Corporation (formerly, the System Development Corporation), Santa Monica, CA, for the past nine years. Prior to her work on the Mermaid Project, she was one of the developers of KVM, a secure operating system. Her more than 25 years of computer experience includes employment at UCM, Planning Research, GTE Data Services, and Honeywell Corporation.
