Failover In-Depth
ISSN: 2041-3114
© Maxwell Scientific Organization, 2010
Submitted date: April 30, 2010 Accepted date: June 14, 2010 Published date: September 13, 2010
Abstract: This study discusses the issue of providing tolerance to hardware and software faults in Internet
systems, as well as issues related to the clusterization of servers. A replication scheme is presented, and a detailed
dependability analysis of this scheme is performed. The proposed model was designed mainly for fault-tolerant
Internet systems in which many unrelated applications could compete for hardware and software resources, thereby
exhibiting highly varying and dynamic system characteristics. A major feature of the model under consideration
is that it attempts the adaptive execution of redundant components for a required level of fault tolerance.
Key words: Clusterization, servers, replication scheme, redundant components, dependability analysis,
reliability, parallel processing, real-time processing, stochastic modeling
Corresponding Author: O.O. Adeosun, Department of Computer Science and Engineering, Ladoke Akintola University of
Technology, Ogbomoso, Nigeria
Res. J. Inform. Technol., 2(2): 35-38, 2010
be competed for by many unrelated but concurrent service requests. In such varying environments, architectures with fixed requirements on hardware components are either inefficient or infeasible.

Fault-Tolerant Internet Connectivity Model: Guerraoui and Schiper (1996) observe that group communication makes it possible to encapsulate a set of entities that cooperate to achieve some common service. A group has a logical address, which allows clients to ignore the existence of its individual members. In Fig. 1, a set of replicated servers composes the group. All replicas must provide access to the same methods and must maintain the same state. Achieving this requires strong consistency, which in turn enables read-one-write-all replicas.

From primary to backup replication: In the replicated service, one of the replicated servers (the primary) executes a transaction locally. Many classical approaches to replication are based on a primary/backup model, in which one device or process has unilateral control over one or more other processes or devices. For example, the primary might perform some computation while streaming a log of updates to a backup (standby) process, which can then take over if the primary fails. That is, the primary forwards updates to all other group members (the backups) using the total order multicast primitive (TOCAST) (Guerraoui and Schiper, 1996). This primitive ensures that updates are delivered in the same order by all correct processes, that is, by all processes that work according to their specification. The termination property of TOCAST, together with its non-blocking characteristics, ensures that the distributed system makes progress despite failures. Typically, the primary waits for the answers of all backups and then returns the response to the client.
To avoid a bottleneck, any replica can be enabled to play the primary role. Backup failure is transparent to the requester, but a faulty primary requires failover. In this study, a client detects a faulty primary using a timeout, and its stub automatically re-routes the failed request to an alternative server.
The weakness of primary/backup schemes is that, in settings where both processes could have been active, only one is actually performing operations. Fault tolerance is gained, but twice as much money is spent to obtain this property. An outgrowth of this work was the emergence of schemes in which a group of replicas cooperate, with each process backing up the others and each handling some share of the workload.

Communication model: We use asynchronous communication in our model; even an overloaded server may be treated as fault-suspected, because there is no way to distinguish between an overloaded server and a faulty one. We also use an underlying group-communication layer to provide the required multicast primitive and a server service. The server service manages the replicated servers and detects fault-suspected servers, removing them from the group. The group communication layer operates in the presence of message omission faults, processor crashes and recoveries, as well as network partitions and merges.

Design approach: As the number of nodes in a distributed computation increases, so does the probability of failure. If one thinks of a system as a collection of functionalities that must perform specific tasks, then the design of a survivable system can be thought of as a multistage process. It should be noted that, in a malicious environment, each stage has its limitations.
In traditional fault tolerance, tolerating faults is typically achieved by applying the principle of redundancy.

Information redundancy: Usually involves the inclusion of additional information as the basis for fault recovery. A typical example is an error correction code.
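
To make the idea of information redundancy concrete, the following is a minimal sketch of one well-known error correction code, Hamming(7,4). The choice of this particular code and the function names are assumptions made for illustration only; the study does not state which code, if any, was used. Four data bits are extended with three parity bits, and the parity checks on the receiving side locate and correct any single flipped bit.

```python
# Illustrative sketch only: Hamming(7,4) is chosen here as a familiar error
# correction code; the function names are hypothetical, not from the study.

def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position of the corrected bit, or 0 if no error was detected.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome read as a binary number
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position
```

For instance, encoding the data bits [1, 0, 1, 1] and then flipping any single bit of the resulting codeword still decodes to [1, 0, 1, 1]. The three extra bits are the added information that makes the recovery possible; two or more flipped bits are beyond what this code can correct.
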
Time redundancy: Relies on multiple executions skewed in time on the same node and is often used to mask omissions.

Spatial redundancy: Uses multiple components, each computing a value, and the final value is derived from a convergence function (e.g., majority voting). The resulting N-modular redundant (NMR) system implements a k-of-N system, which implies that the system functions as long as k or more components are fault free. A typical configuration is triple-modular redundancy (TMR), which is a 2-of-3 system.

Enabling recovery from failures and providing failover service to users: Achieving the proposed Internet fault-tolerant service using replicated servers requires handling both client-primary and primary-backup interaction. The model handles client-primary interaction by switching client requests to an alternative server when the current service is interrupted. The work also handles primary-backup interaction by implementing distributed checkpoints. Recovery from a failed server is thereby made easier: it only requires re-routing clients' requests.

Distributed checkpoint implementation: A distributed checkpoint consists of the local snapshots held on all the replicated servers. Each snapshot holds information about the last executed method, the client that requested this method and the server that executed it. The approach multicasts a snapshot from the primary to all other servers. Whenever the primary receives a transactional request (using point-to-point communication) from a client, it updates its own state and multicasts synchronization messages to the backups using the TOCAST primitive. The primary verifies that the distributed checkpoint was successfully established (by waiting for confirmation messages from all backups) and then answers the client.
Backups process the synchronization messages and automatically store the updates in their own states to establish the distributed checkpoint and to reflect a single distributed global state. If a server fails, clients are guaranteed access to the same data through the backups. When a client-server connection is closed, all servers remove the distributed checkpoint information for that client. Storing this information enables automatic failover during transaction execution: unfinished methods are executed on another server.

Propagating updates to backups: There are two possible strategies for propagating updates: deferred update and immediate update (Wiesmann et al., 2000). In deferred update, transactions are processed locally at one server and forwarded to the backups at commit time, while immediate update synchronizes every transaction across all servers.

Implementation Issues: Two open-source projects were identified: JavaGroups (Ban, 1999) and JOnAS (Java Open Application Server) (Danes et al., 2000). Our replicated server is being developed to work with these two open-source projects. In our model, we changed some classes of JOnAS to include the TOCAST primitive on the server side. Replicated servers join the group and use this primitive to set the distributed checkpoint. We implement the distributed checkpoint by selecting, at compile time, the updates to be forwarded during service execution. An update is assumed to be a method without a result (it returns a null value). On the client side, we modify the client's stub to automatically re-route faulty requests.

CONCLUSION

Transactional systems could benefit from group communication to achieve fault tolerance. The systems are more available for service delivery, and multicasting only the updates provides good performance by eliminating additional communication rounds.
We also expect replication to improve application response time, compared with non-replicated application servers, by allowing requests to be handled by several nodes rather than one, besides eliminating a single point of failure. In addition, deployment and redeployment of new and recovered servers are necessary to maintain Internet availability.

ACKNOWLEDGEMENT

The authors are thankful to Prof. Adetunde, I.A., the Dean of Engineering, University of Mines and Technology, Tarkwa, Ghana, for his advice, valuable comments and suggestions, and for making this article publishable.

REFERENCES

Anderson, T., 1985. Resilient Computing Systems. London, UK, Collins Professional and Technical Books.
Avizienis, A. and L. Chen, 1977. On the implementation of N-version programming for software fault tolerance during program execution. COMPSAC 77, Chicago, IL, pp: 149-155.
Ban, B., 1999. JavaGroups User's Guide. Department of Computer Science, Cornell University, pp: 73. http://JavaGroups.sourceforge.net/.
Bondavalli, A., F. Di Giandomenico and J. Xu, 1993. A cost-effective and flexible scheme for software fault tolerance. J. Comput. Sys. SC. Eng., 8(4): 234-244.
Danes, A., P. Dechamboux, M. Riveill and G. Vandome, 2000. Technologie a Base de Composants EJB Experience et Perspectives Avec JOnAS. OCM 2000, Nantes, May 2000, pp: 11-13. Retrieved from: http://www.objectweb.org/jonas/.
Guerraoui, R. and A. Schiper, 1996. Fault-Tolerance by Replication in Distributed Systems. Departement d'Informatique, Ecole Polytechnique Federale de Lausanne, 1996.
Laprie, J.C., J. Arlat, C. Beounes, K. Kanoun and C. Hourtolle, 1987. Definition and analysis of hardware-and-software fault-tolerant architectures. IEEE Comput., 23(7): 39-51.
Randell, B., 1975. System structure for software fault tolerance. IEEE TSE, Vol. SE-1, No. 2, pp: 220-232.
Wiesmann, M., F. Pedone, A. Schiper, B. Kemme and G. Alonso, 2000. Understanding replication in databases and distributed systems. Proceedings of ICDCS 2000, Taipei, Taiwan, R.O.C., April 2000, pp: 264-274.