Lecture Notes in Artificial Intelligence 1759
Subseries of Lecture Notes in Computer Science
Edited by J. G. Carbonell and J. Siekmann

Lecture Notes in Computer Science


Edited by G. Goos, J. Hartmanis and J. van Leeuwen
Berlin · Heidelberg · New York · Barcelona · Hong Kong · London · Milan · Paris · Tokyo

Mohammed J. Zaki, Ching-Tien Ho (Eds.)

Large-Scale Parallel Data Mining

Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors
Mohammed J. Zaki
Computer Science Department
Rensselaer Polytechnic Institute
Troy, NY 12180, USA
E-mail: [email protected]
Ching-Tien Ho
K55/B1, IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120, USA
E-mail: [email protected]
Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme


Large scale parallel data mining / Mohammed J. Zaki ; Ching-Tien Ho (ed.)
- Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ;
Milan ; Paris ; Singapore ; Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1759 : Lecture notes in
artificial intelligence)
ISBN 3-540-67194-3

CR Subject Classification (1991): I.2.8, I.2.11, I.2.4, I.2.6, H.3, F.2.2, C.2.4

ISBN 3-540-67194-3 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag is a company in the specialist publishing group BertelsmannSpringer
© Springer-Verlag Berlin Heidelberg 2000
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Christian Grosche
Printed on acid-free paper SPIN 10719635 06/3142 543210
Preface

With the unprecedented rate at which data is being collected today in almost all
fields of human endeavor, there is an emerging economic and scientific need to
extract useful information from it. For example, many companies already have
data-warehouses in the terabyte range (e.g., FedEx, Walmart). The World Wide
Web has an estimated 800 million web-pages. Similarly, scientific data is reach-
ing gigantic proportions (e.g., NASA space missions, Human Genome Project).
High-performance, scalable, parallel, and distributed computing is crucial for
ensuring system scalability and interactivity as datasets continue to grow in size
and complexity.
To address this need we organized the workshop on Large-Scale Parallel KDD
Systems, which was held in conjunction with the 5th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, on August 15th,
1999, San Diego, California. The goal of this workshop was to bring researchers
and practitioners together in a setting where they could discuss the design, im-
plementation, and deployment of large-scale parallel knowledge discovery (PKD)
systems, which can manipulate data taken from very large enterprise or scien-
tific databases, regardless of whether the data is located centrally or is globally
distributed. Relevant topics identified for the workshop included:

– How to develop a rapid-response, scalable, and parallel knowledge discovery
system that supports global organizations with terabytes of data.
– How to address some of the challenges facing current state-of-the-art data
mining tools. These challenges include relieving the user from time and vol-
ume constrained tool-sets, evolving knowledge stores with new knowledge
effectively, acquiring data elements from heterogeneous sources such as the
Web or other repositories, and enhancing the PKD process by incrementally
updating the knowledge stores.
– How to leverage high performance parallel and distributed techniques in
all the phases of KDD, such as initial data selection, cleaning and prepro-
cessing, transformation, data-mining task and algorithm selection and its
application, pattern evaluation, management of discovered knowledge, and
providing tight coupling between the mining engine and database/file server.
– How to facilitate user interaction and usability, allowing the representation
of domain knowledge, and to maximize understanding during and after the
process. That is, how to build an adaptable knowledge engine which supports
business decisions, product creation and evolution, and leverages information
into usable or actionable knowledge.

This book contains the revised versions of the workshop papers and it also
includes several invited chapters, to bring the readers up-to-date on the recent
developments in this field. This book thus represents the state-of-the-art in paral-
lel and distributed data mining methods. It should be useful for both researchers
and practitioners interested in the design, implementation, and deployment of
large-scale, parallel knowledge discovery systems.

December 1999 Mohammed J. Zaki


Ching-Tien Ho

Workshop Chairs

Workshop Chair: Mohammed J. Zaki (Rensselaer Polytechnic Institute, USA)


Workshop Co-Chair: Ching-Tien Ho (IBM Almaden Research Center, USA)

Program Committee

David Cheung (University of Hong Kong, Hong Kong)


Alok Choudhary (Northwestern University, USA)
Alex A. Freitas (Pontifical Catholic University of Parana, Brazil)
Robert Grossman (University of Illinois-Chicago, USA)
Yike Guo (Imperial College, UK)
Hillol Kargupta (Washington State University, USA)
Masaru Kitsuregawa (University of Tokyo, Japan)
Vipin Kumar (University of Minnesota, USA)
Reagan Moore (San Diego Supercomputer Center, USA)
Ron Musick (Lawrence Livermore National Lab, USA)
Srini Parthasarathy (University of Rochester, USA)
Sanjay Ranka (University of Florida, USA)
Arno Siebes (Centrum voor Wiskunde en Informatica, Netherlands)
David Skillicorn (Queen's University, Canada)
Paul Stolorz (Jet Propulsion Lab, USA)
Graham Williams (Cooperative Research Center for Advanced Computational
Systems, Australia)

Acknowledgements
We would like to thank all the invited speakers, authors, and participants for
contributing to the success of the workshop. Special thanks are due to the pro-
gram committee for their support and help in reviewing the submissions.
Table of Contents

Large-Scale Parallel Data Mining


Parallel and Distributed Data Mining: An Introduction . . . . . . . . . . . . . . . . . 1
Mohammed J. Zaki

Mining Frameworks
The Integrated Delivery of Large-Scale Data Mining: The ACSys Data
Mining Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Graham Williams, Irfan Altas, Sergey Bakin, Peter Christen,
Markus Hegland, Alonso Marquez, Peter Milne, Rajehndra Nagappan,
and Stephen Roberts

A High Performance Implementation of the Data Space Transfer Protocol
(DSTP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Stuart Bailey, Emory Creel, Robert Grossman, Srinath Gutti,
and Harinath Sivakumar
Active Mining in a Distributed Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Srinivasan Parthasarathy, Sandhya Dwarkadas, and Mitsunori Ogihara

Associations and Sequences


Efficient Parallel Algorithms for Mining Associations . . . . . . . . . . . . . . . . . . . 83
Mahesh V. Joshi, Eui-Hong (Sam) Han, George Karypis,
and Vipin Kumar

Parallel Branch-and-Bound Graph Search for Correlated Association Rules 127


Shinichi Morishita and Akihiro Nakaya

Parallel Generalized Association Rule Mining on Large Scale PC Cluster . . 145


Takahiko Shintani and Masaru Kitsuregawa
Parallel Sequence Mining on Shared-Memory Machines . . . . . . . . . . . . . . . . . 161
Mohammed J. Zaki

Classification
Parallel Predictor Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
D.B. Skillicorn
Efficient Parallel Classification Using Dimensional Aggregates . . . . . . . . . . . 197
Sanjay Goil and Alok Choudhary

Learning Rules from Distributed Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211


Lawrence O. Hall, Nitesh Chawla, Kevin W. Bowyer,
and W. Philip Kegelmeyer

Clustering
Collective, Hierarchical Clustering from Distributed, Heterogeneous Data . 221
Erik L. Johnson and Hillol Kargupta
A Data-Clustering Algorithm On Distributed Memory Multiprocessors . . 245
Inderjit S. Dhillon and Dharmendra S. Modha

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261


Parallel and Distributed Data Mining:
An Introduction

Mohammed J. Zaki

Computer Science Department


Rensselaer Polytechnic Institute
Troy, NY 12180
[email protected]
http://www.cs.rpi.edu/~zaki

Abstract. The explosive growth in data collection in business and sci-
entific fields has literally forced upon us the need to analyze and mine
useful knowledge from it. Data mining refers to the entire process of ex-
tracting useful and novel patterns/models from large datasets. Due to the
huge size of data and amount of computation involved in data mining,
high-performance computing is an essential component for any successful
large-scale data mining application. This chapter presents a survey on
large-scale parallel and distributed data mining algorithms and systems,
serving as an introduction to the rest of this volume. It also discusses
the issues and challenges that must be overcome for designing and im-
plementing successful tools for large-scale data mining.

1 Introduction
Data Mining and Knowledge Discovery in Databases (KDD) is a new interdis-
ciplinary field merging ideas from statistics, machine learning, databases, and
parallel and distributed computing. It has been engendered by the phenomenal
growth of data in all spheres of human endeavor, and the economic and scientific
need to extract useful information from the collected data. The key challenge in
data mining is the extraction of knowledge and insight from massive databases.
Data mining refers to the overall process of discovering new patterns or build-
ing models from a given dataset. There are many steps involved in the KDD
enterprise which include data selection, data cleaning and preprocessing, data
transformation and reduction, data-mining task and algorithm selection, and
finally post-processing and interpretation of discovered knowledge [1,2]. This
KDD process tends to be highly iterative and interactive.
Typically data mining has the two high level goals of prediction and descrip-
tion [1]. In prediction, we are interested in building a model that will predict
unknown or future values of attributes of interest, based on known values of some
attributes in the database. In KDD applications, the description of the data in
human-understandable terms is equally if not more important than prediction.
Two main forms of data mining can be identified [3]. In verification-driven data
mining the user postulates a hypothesis, and the system tries to validate it.

The common verification-driven operations include query and reporting, multi-
dimensional analysis or On-Line Analytical Processing (OLAP), and statistical
analysis. Discovery-driven mining, on the other hand, automatically extracts
new information from data, and forms the main focus of this survey. The typical
discovery-driven tasks include association rules, sequential patterns, classifica-
tion and regression, clustering, similarity search, deviation detection, etc.
While data mining has its roots in the traditional fields of machine learning
and statistics, the sheer volume of data today poses the most serious problem.
For example, many companies already have data warehouses in the terabyte
range (e.g., FedEx, UPS, Walmart). Similarly, scientific data is reaching gigantic
proportions (e.g., NASA space missions, Human Genome Project). Traditional
methods typically made the assumption that the data is memory resident. This
assumption is no longer tenable. Implementation of data mining ideas in high-
performance parallel and distributed computing environments is thus becoming
crucial for ensuring system scalability and interactivity as data continues to grow
inexorably in size and complexity.
Parallel data mining (PDM) deals with tightly-coupled systems including
shared-memory systems (SMP), distributed-memory machines (DMM), or clus-
ters of SMP workstations (CLUMPS) with a fast interconnect. Distributed data
mining (DDM), on the other hand, deals with loosely-coupled systems such as a
cluster over a slow Ethernet local-area network. It also includes geographically
distributed sites over a wide-area network like the Internet. The main differences
between PDM and DDM are best understood if we view DDM as a gradual transition
from tightly-coupled, fine-grained parallel machines to loosely-coupled, medium-
grained LANs of workstations, and finally to very coarse-grained WANs. There is
in fact a significant overlap between the two areas, especially at the medium-
grained level, where it is hard to draw a line between them.
In another view, we can think of PDM as an essential component of a DDM
architecture. An individual site in DDM can be a supercomputer, a cluster of
SMPs, or a single workstation. In other words, each site supports PDM locally.
Multiple PDM sites constitute DDM, much like the current trend in meta- or
super-computing. Thus the main difference between PDM and DDM is that of
scale, communication costs, and data distribution. While, in PDM, SMPs can
share the entire database and construct a global mined model, DMMs generally
partition the database, but still generate global patterns/models. On the other
hand, in DDM, it is typically not feasible to share or communicate data at all;
local models are built at each site, and are then merged/combined via various
methods.
PDM is the ideal choice in organizations with centralized data-stores, while
DDM is essential in cases where there are multiple distributed datasets. In fact, a
successful large-scale data mining effort requires a hybrid PDM/DDM approach,
where parallel techniques are used to optimize the local mining at a site, and
where distributed techniques are then used to construct global or consensus pat-
terns/models, while minimizing the amount of data and results communicated.
In this chapter we adopt this unified view of PDM and DDM.

This chapter provides an introduction to parallel and distributed data min-
ing. We begin by explaining the PDM/DDM algorithm design space, and then
go on to survey current parallel and distributed algorithms for associations, se-
quences, classification and clustering, which are the most common mining tech-
niques. We also include a section on recent systems for distributed mining. After
reviewing the open challenges in PDM/DDM, we conclude by providing a road-
map for the rest of this volume.

2 Parallel and Distributed Data Mining


Parallel and distributed computing is expected to relieve current mining meth-
ods from the sequential bottleneck, providing the ability to scale to massive
datasets, and improving the response time. Achieving good performance on to-
day’s multiprocessor systems is a non-trivial task. The main challenges include
synchronization and communication minimization, work-load balancing, finding
good data layout and data decomposition, and disk I/O minimization, which is
especially important for data mining.

2.1 Parallel Design Space


The parallel design space spans a number of systems and algorithmic components
including the hardware platform, the kind of parallelism exploited, the load
balancing strategy, the data layout and the search procedure used.

Distributed Memory Machines vs. Shared Memory Systems. The performance
optimization objectives change depending on the underlying architecture. In
DMMs synchronization is implicit in message passing, so the goal becomes com-
munication optimization. For shared-memory systems, synchronization happens
via locks and barriers, and the goal is to minimize these points. Data decom-
position is very important for distributed memory, but not for shared memory.
While parallel I/O comes for “free” in DMMs, it can be problematic for SMP
machines, which typically serialize I/O. The main challenge for obtaining good
performance on DMM is to find a good data decomposition among the nodes, and
to minimize communication. For SMP the objectives are to achieve good data
locality, i.e., maximize accesses to local cache, and to avoid/reduce false sharing,
i.e., minimize the ping-pong effect where multiple processors may be trying to
modify different variables which coincidentally reside on the same cache line.
For today’s non-uniform memory access (NUMA) hybrid and/or hierarchical
machines (e.g., cluster of SMPs), the optimization parameters draw from both
the DMM and SMP paradigms.
Another classification of the different architectures comes from the database
literature. Here, shared-everything refers to the shared-memory paradigm, with a
global shared memory and common disks among all the machines. Shared-nothing
refers to distributed-memory architecture, with a local memory and disk for each
processor. A third paradigm called shared-disks refers to the mixed case where
processors have local memories, but access common disks [4,5].

Task vs. Data Parallelism. These are the two main paradigms for exploiting al-
gorithm parallelism. Data parallelism corresponds to the case where the database
is partitioned among P processors. Each processor works on its local partition
of the database, but performs the same computation of evaluating candidate
patterns/models. Task parallelism corresponds to the case where the processors
perform different computations independently, such as evaluating a disjoint set
of candidates, but have/need access to the entire database. SMPs have access
to the entire data, but for DMMs this can be done via selective replication or
explicit communication of the local data. Hybrid parallelism combining both
task and data parallelism is also possible, and in fact desirable for exploiting all
available parallelism in data mining methods.

Static vs. Dynamic Load Balancing. In static load balancing work is initially
partitioned among the processors using some heuristic cost function, and there
is no subsequent data or computation movement to correct load imbalances
which result from the dynamic nature of mining algorithms. Dynamic load bal-
ancing seeks to address this by stealing work from heavily loaded processors
and re-assigning it to lightly loaded ones. Computation movement also entails
data movement, since the processor responsible for a computational task needs
the data associated with that task as well. Dynamic load balancing thus incurs
additional costs for work/data movement, but it is beneficial if the load imbal-
ance is large and if load changes with time. Dynamic load balancing is especially
important in multi-user environments with transient loads and in heterogeneous
platforms, which have different processor and network speeds. These kinds of en-
vironments include parallel servers, and heterogeneous, meta-clusters. With very
few exceptions, most extant parallel mining algorithms use only a static load
balancing approach that is inherent in the initial partitioning of the database
among available nodes. This is because they assume a dedicated, homogeneous
environment.
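To make the contrast concrete, the following is a minimal sketch of a dynamic scheme on a shared-memory machine, written for illustration only (the function mine_chunk and the chunk sizes are hypothetical placeholders, not part of any surveyed system): idle worker threads repeatedly pull the next unit of work from a shared queue instead of keeping a fixed initial partition.

    # Minimal sketch of dynamic load balancing on a shared-memory machine:
    # worker threads repeatedly pull the next unit of mining work from a
    # shared queue instead of receiving a fixed partition up front.
    import queue
    import threading

    def mine_chunk(chunk):
        # Hypothetical placeholder for the real work (e.g., counting
        # candidates against one block of transactions).
        return sum(chunk)

    def worker(tasks, results):
        while True:
            try:
                chunk = tasks.get_nowait()
            except queue.Empty:
                return                      # no work left: thread exits
            results.append(mine_chunk(chunk))
            tasks.task_done()

    def dynamic_mine(chunks, num_threads=4):
        tasks = queue.Queue()
        for c in chunks:
            tasks.put(c)
        results = []                        # list.append is thread-safe in CPython
        threads = [threading.Thread(target=worker, args=(tasks, results))
                   for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

    if __name__ == "__main__":
        # Uneven chunk sizes stand in for the skewed work distribution that
        # static partitioning handles poorly.
        chunks = [list(range(n)) for n in (10, 10000, 50, 70000, 5)]
        print(dynamic_mine(chunks))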

Horizontal vs. Vertical Data Layout. The standard input database for mining
is a relational table having N rows, also called feature vectors, transactions, or
records, and M columns, also called dimensions, features, or attributes. The data
layout can be row-wise or column-wise. Many data mining algorithms assume a
horizontal or row-wise database layout, where they store, as a unit, each trans-
action (tid), along with the attribute values for that transaction. Other methods
use a vertical or column-wise database layout, where they associate with each at-
tribute a list of all tids (called tidlist) containing the item, and the corresponding
attribute value in that transaction. Certain mining operations are more efficient
using a horizontal format, while others are more efficient using a vertical format.
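The two layouts are easy to picture with a small illustrative sketch (not taken from the chapter): the same transactions stored row-wise, and the corresponding per-item tidlists.

    # Illustrative sketch: the same database in horizontal (row-wise) and
    # vertical (column-wise, tidlist) layouts.
    from collections import defaultdict

    # Horizontal layout: one row per transaction id (tid).
    horizontal = {
        1: {"bread", "milk"},
        2: {"bread", "butter"},
        3: {"milk", "butter", "bread"},
    }

    def to_vertical(db):
        """Build a tidlist for every item: item -> set of tids containing it."""
        tidlists = defaultdict(set)
        for tid, items in db.items():
            for item in items:
                tidlists[item].add(tid)
        return dict(tidlists)

    vertical = to_vertical(horizontal)
    print(vertical["bread"])   # {1, 2, 3} -> support of {bread} is 3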

Complete vs. Heuristic Candidate Generation. The final results of a mining
method may be sets, sequences, rules, trees, networks, etc., ranging from simple
patterns to more complex models, based on certain search criteria. In the inter-
mediate steps several candidate patterns or partial models are evaluated, and
the final result contains only the ones that satisfy the (user-specified) input pa-
rameters. Mining algorithms can differ in the way new candidates are generated
for evaluation. One approach is that of complete search, which is guaranteed
to generate and test all valid candidates consistent with the data. Note that
completeness doesn’t mean exhaustive, since pruning can be used to eliminate
useless branches in the search space. Heuristic generation sacrifices completeness
for the sake of speed. At each step, it only examines a limited number (or only
one) of “good” branches. Random search is also possible. Generally, the more
complex the mined model, the more the tendency towards heuristic or greedy
search.

Candidate and Data Partitioning. An easy way to discuss the many parallel
and distributed mining methods is to describe them in terms of the computa-
tion and data partitioning methods used. For example, the database itself can
be shared (in shared-memory or shared-disk architectures), partially or totally
replicated, or partitioned (using round-robin, hash, or range scheduling) among
the available nodes (in distributed-memory architectures).
Similarly, the candidate concepts generated and evaluated in the different
mining methods can be shared, replicated or partitioned. If they are shared
then all processors evaluate a single copy of the candidate set. In the replicated
approach the candidate concepts are replicated on each machine, and are first
evaluated locally, before global results are obtained by merging them. Finally, in
the partitioned approach, each processor generates and tests a disjoint candidate
concept set.
In the sections below we describe parallel and distributed algorithms for some
of the typical discovery-driven mining tasks including associations, sequences,
decision tree classification and clustering. Table 1 summarizes in list form where
each parallel algorithm for each of the above mining tasks lies in the design
space. It would help the reader to refer to the table while reading the algorithm
descriptions below.

2.2 Association Rules


Given a database of transactions, where each transaction consists of a set of
items, association discovery finds all the item sets that frequently occur together,
the so called frequent itemsets, and also the rules among them. An example of
an association could be that, “40% of people who buy Jane Austen’s Pride and
Prejudice also buy Sense and Sensibility.” Potential application areas include
catalog design, store layout, customer segmentation, telecommunication alarm
diagnosis, etc.
The Apriori [6] method serves as the base algorithm for the vast majority
of parallel association algorithms. Apriori uses a complete, bottom-up search,
with a horizontal data layout and enumerates all frequent itemsets. Apriori is an
iterative algorithm that counts itemsets of a specific length in a given database
pass. The process starts by scanning all transactions in the database and com-
puting the frequent items. Next, a set of potentially frequent candidate itemsets
of length 2 is formed from the frequent items. Another database scan is made
to obtain their supports. The frequent itemsets are retained for the next pass,
and the process is repeated until all frequent itemsets (of various lengths) have
been enumerated.

Algorithm | Base Algorithm | Machine | Parallelism | Load Balancing | DB Layout | Concepts | Database

Association Rule Mining
CD, PEAR, PDM, FDM, NPA | Apriori | DMM | Data | Static | Horizontal | Replicated | Partitioned
DD, SPA, IDD | Apriori | DMM | Task | Static | Horizontal | Partitioned | Partitioned
HD | Apriori | DMM | Hybrid | Hybrid | Horizontal | Hybrid | Partitioned
CCPD | Apriori | SMP | Data | Static | Horizontal | Shared | Partitioned
CandD, HPA, HPA-ELD | Apriori | DMM | Task | Static | Horizontal | Partitioned | Partially Replicated
PCCD | Apriori | SMP | Task | Static | Horizontal | Partitioned | Shared
APM | DIC | SMP | Task | Static | Horizontal | Shared | Partitioned
PPAR | Partition | DMM | Task | Static | Horizontal | Replicated | Partitioned
PE, PME, PC, PMC | Eclat, Clique | CLUMPS | Task | Static | Vertical | Partitioned | Partially Replicated

Sequence Mining
NPSPM | GSP | DMM | Data | Static | Horizontal | Replicated | Partitioned
SPSPM | GSP | DMM | Task | Static | Horizontal | Partitioned | Partitioned
HPSPM | GSP | DMM | Task | Static | Horizontal | Partitioned | Partially Replicated
pSPADE | SPADE | SMP | Task | Dynamic | Vertical | Partitioned | Shared
D-MSDD | MSDD | DMM | Task | Static | Horizontal | Partitioned | Replicated

Decision Tree Classification
SPRINT, SLIQ/R, SLIQ/D, ScalParC | SLIQ/SPRINT | DMM | Data | Static | Vertical | Replicated | Partitioned
DP-att, DP-rec, PDT | C4.5 | DMM | Data | Static | Horizontal | Replicated | Partitioned
MWK | SPRINT | SMP | Data | Dynamic | Vertical | Shared | Shared
SUBTREE | SPRINT | SMP | Hybrid | Dynamic | Vertical | Partitioned | Partitioned
HTF | SPRINT | DMM | Hybrid | Dynamic | Vertical | Partitioned | Partitioned
pCLOUDS | CLOUDS | DMM | Hybrid | Dynamic | Horizontal | Partitioned | Partitioned

Clustering
P-CLUSTER | K-Means | DMM | Data | Static | Horizontal | Replicated | Partitioned
MAFIA | - | DMM | Task | Static | Horizontal | Partitioned | Partitioned

Table 1. Design Space for Parallel Mining Algorithms: Associations, Sequences, Classification and Clustering.
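The level-wise iteration just described can be made concrete with a short, self-contained Python sketch; this is an editorial illustration of plain in-memory Apriori, not code from any of the surveyed systems.

    # Minimal in-memory Apriori sketch: level-wise candidate generation and
    # counting over a horizontally stored transaction database.
    from itertools import combinations

    def apriori(transactions, minsup):
        transactions = [frozenset(t) for t in transactions]
        # Level 1: frequent single items.
        items = {i for t in transactions for i in t}
        frequent = {frozenset([i]) for i in items
                    if sum(i in t for t in transactions) >= minsup}
        all_frequent = set(frequent)
        k = 2
        while frequent:
            # Candidate generation: join frequent (k-1)-itemsets and keep
            # only candidates whose (k-1)-subsets are all frequent.
            candidates = {a | b for a in frequent for b in frequent
                          if len(a | b) == k}
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent
                                 for s in combinations(c, k - 1))}
            # One database pass counts all candidates of length k.
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            frequent = {c for c, n in counts.items() if n >= minsup}
            all_frequent |= frequent
            k += 1
        return all_frequent

    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(apriori(db, minsup=3))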
Other sequential methods for associations that have been parallelized in-
clude DHP [7], which tries to reduce the number of candidates by collecting
approximate counts (using hash tables) in the previous level. These counts can
be used to rule out many candidates in the current pass that cannot possibly be
frequent. The Partition algorithm [8] minimizes I/O by scanning the database
only twice. It partitions the database into small chunks which can be handled in
memory. In the first pass it generates a set of all potentially frequent itemsets,
and in the second pass it counts their global frequency. In both phases it uses a
vertical database layout. The DIC algorithm [9] dynamically counts candidates
of varying length as the database scan progresses, and thus is able to reduce the
number of scans.
A completely different design characterizes the equivalence class based algo-
rithms (Eclat, MaxEclat, Clique, and MaxClique) proposed by Zaki et al. [10].
These methods utilize a vertical database format, complete search, a mix of
bottom-up and hybrid search, and generate a mix of maximal and non-maximal
frequent itemsets. The algorithms utilize the structural properties of frequent
itemsets to facilitate fast discovery. The items are organized in a subset lattice
search space, which is decomposed into small independent chunks or sub-lattices,
which can be solved in memory. Efficient lattice traversal techniques are used,
which quickly identify all the frequent itemsets via tidlist intersections.
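The core of the vertical approach is that support counting reduces to set intersection. The following is a simplified Eclat-style sketch (an illustration under that assumption, not the authors' implementation):

    # Illustrative Eclat-style enumeration: supports come from tidlist
    # intersections rather than from repeated database scans.
    def eclat(prefix, items, minsup, out):
        """items: list of (item, tidset) pairs, all frequent, sharing `prefix`."""
        for i, (item, tids) in enumerate(items):
            new_prefix = prefix + [item]
            out[frozenset(new_prefix)] = len(tids)
            # Extend the current equivalence class by intersecting tidlists.
            suffix = []
            for other, other_tids in items[i + 1:]:
                joined = tids & other_tids
                if len(joined) >= minsup:
                    suffix.append((other, joined))
            if suffix:
                eclat(new_prefix, suffix, minsup, out)

    # Vertical database: item -> set of tids (see the layout sketch above).
    vertical = {"a": {1, 2, 3, 5}, "b": {1, 2, 4, 5}, "c": {1, 3, 4, 5}}
    minsup = 3
    result = {}
    eclat([], [(i, t) for i, t in sorted(vertical.items())
               if len(t) >= minsup], minsup, result)
    print(result)   # includes frozenset({'a', 'b'}): 3, frozenset({'b', 'c'}): 3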

Replicated or Shared Candidates, Partitioned Database. The candidate
concepts in association mining are the frequent itemsets. A common paradigm for
parallel association mining is to partition the database in equal-sized horizontal
blocks, with the candidate itemsets replicated on all processors. For Apriori-
based parallel methods, in each iteration, each processor computes the frequency
of the candidate set in its local database partition. This is followed by a sum-
reduction to obtain the global frequency. The infrequent itemsets are discarded,
while the frequent ones are used to generate the candidates for the next iteration.
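A hedged sketch of this sum-reduction step appears below. It assumes an MPI environment with mpi4py and NumPy available; the helper count_locally and the parameter local_partition are hypothetical stand-ins for the parts that depend on the actual data.

    # Count-Distribution-style iteration sketch (assumes mpi4py + NumPy).
    # Every process holds the full candidate list but only a horizontal
    # block of the transactions; a sum-reduction yields global supports.
    import numpy as np
    from mpi4py import MPI

    def count_locally(candidates, transactions):
        # Hypothetical helper: count each candidate (a frozenset) in the
        # local block of transactions (sets of items).
        return np.array([sum(c <= t for t in transactions) for c in candidates],
                        dtype="i8")

    def count_distribution_step(candidates, local_partition, minsup):
        comm = MPI.COMM_WORLD
        local_counts = count_locally(candidates, local_partition)
        global_counts = np.zeros_like(local_counts)
        # Only counts are exchanged; no transaction data moves between nodes.
        comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
        return [c for c, n in zip(candidates, global_counts) if n >= minsup]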
Barring minor differences, the methods that follow this data-parallel ap-
proach include PEAR [11], PDM [12], Count Distribution (CD) [13], FDM [14],
Non-Partitioned Apriori (NPA) [15], and CCPD [16]. CCPD uses shared-memory
machines, and thus maintains a shared candidate set among all processors. It
also parallelizes the candidate generation.
The other algorithms use distributed-memory machines. PDM, based on
DHP, prunes candidates using approximate counts from the previous level. It
also parallelizes candidate generation, at the cost of an extra round of
communication. The remaining methods simply replicate the computation for
candidate generation. FDM is further optimized to work on distributed sites. It
uses novel pruning techniques to minimize the number of candidates, and thus
the communication during sum-reduction.

The advantage of replicated candidates and partitioned database, for Apriori-
based methods, is that they incur only a small amount of communication. In
each iteration only the frequencies of candidate concepts are exchanged; no data
is exchanged. These methods thus outperform the pure partitioned candidates
approach described in the next section. Their disadvantage is that the aggregate
system memory is not used effectively, since the candidates are replicated.
Other parallel algorithms that use a different base sequential method in-
clude APM [17], a task-parallel, shared-memory, asynchronous algorithm, based
on DIC. Each processor independently applies DIC to its local partition. The
candidate set is shared among processors, but is updated asynchronously when
a processor inserts new itemsets.
PPAR [11], a task-parallel, distributed-memory algorithm, is built upon Par-
tition, with the exception that PPAR uses the horizontal data format. Each
processor gathers the locally frequent itemsets of all sizes in one pass over their
local database (which may be partitioned into chunks as well). All potentially
frequent itemsets are then broadcast to other processors. Then each processor
gathers the counts of these global candidates in the second local pass. Finally a
broadcast is performed to obtain the globally frequent itemsets.

Partitioned Candidates, Partitioned Database. Algorithms implementing
this approach include Data Distribution (DD) [13], Simply-Partitioned Apriori
(SPA) [15], and Intelligent Data Distribution (IDD) [18]. All three are Apriori-
based, and employ task parallelism on distributed-memory machines. Here each
processor computes the frequency of a disjoint set of candidates. However, to
find the global support each processor must scan the entire database, both its
local partition, and other processor’s partitions (which are exchanged in each it-
eration). The main advantage of these methods is that they utilize the aggregate
system-wide memory by evaluating disjoint candidates, but they are impractical
for any realistic large-scale dataset.
The Hybrid Distribution (HD) algorithm [18] adopts a middle ground be-
tween Data Distribution and Count Distribution. It utilizes the aggregate mem-
ory, and also minimizes communication. It partitions the P processors into G
equal-sized groups. Each of the G groups is considered a super-processor, and
applies Count Distribution, while the P/G processors within a group use Intel-
ligent Data Distribution. The database is horizontally partitioned among the G
super-processors, and the candidates are partitioned among the P/G processors
in a group. HD cuts down the database communication costs by 1/G.
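The group structure that HD relies on maps naturally onto MPI communicators. The sketch below is only an illustration of that layout, assuming mpi4py is available; it is not code from the original paper.

    # Sketch of the Hybrid Distribution process layout (assumes mpi4py).
    # P processes are arranged as G groups of P/G members: Count Distribution
    # runs across groups, Intelligent Data Distribution within each group.
    from mpi4py import MPI

    def hd_communicators(num_groups):
        world = MPI.COMM_WORLD
        rank, size = world.Get_rank(), world.Get_size()
        assert size % num_groups == 0, "P must be a multiple of G"
        group_id = rank // (size // num_groups)     # which super-processor
        # All members of one group share a communicator (candidates are
        # partitioned inside it); equal-ranked members across groups share
        # another, used for the Count-Distribution-style sum-reduction.
        intra_group = world.Split(color=group_id, key=rank)
        across_groups = world.Split(color=intra_group.Get_rank(), key=rank)
        return intra_group, across_groups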

Partitioned Candidates, Selectively Replicated or Shared Database. A
third approach is to evaluate a disjoint candidate set and to selectively replicate
the database on each processor. Each processor has all the information to gener-
ate and test candidates asynchronously. Methods in this paradigm are Candidate
Distribution (CandD) [13], Hash Partitioned Apriori (HPA) [15], HPA-ELD [15],
and PCCD [16], all of which are Apriori-based. PCCD uses SMP machines, and
accesses a shared database, but is not competitive with CCPD. Candidate Dis-
tribution is also outperformed by Count Distribution. Nevertheless, HPA-ELD,
a hybrid between HPA and NPA, was shown to be better than NPA, SPA, and
HPA.
Zaki et al. [19] proposed four algorithms, ParEclat (PE), ParMaxEclat
(PME), ParClique (PC), and ParMaxClique (PMC), targeting hierarchical sys-
tems like clusters of SMP machines. The data is assumed to be vertically parti-
tioned among the SMP machines. After an initial tidlist exchange phase and class
scheduling phase, the algorithms proceed asynchronously. In the asynchronous
phase each processor has available the classes assigned to it, and the tidlists for
all items. Thus each processor can independently generate all frequent itemsets
from its classes. No communication or synchronization is required. Further, all
available memory of the system is used, no in-memory hash trees are needed,
and only simple intersection operations are required for itemset enumeration.
Most of the extant association mining methods use a static load balancing
scheme; a dynamic load balancing approach on a heterogeneous cluster has been
presented in [20]. For more detailed surveys of parallel and distributed associa-
tion mining see [21] and the chapter by Joshi et al. in this volume.

2.3 Sequential Patterns


Sequence discovery aims at extracting frequent events that commonly occur over
a period of time [22]. An example of a sequential pattern could be that “70% of
the people who buy Jane Austen’s Pride and Prejudice also buy Emma within
a month”. Sequential pattern mining deals with purely categorical domains, as
opposed to the real-valued domains used in time-series analysis. Examples of
categorical domains include text, DNA, market baskets, etc.
In essence, sequence mining is “temporal” association mining. However, while
association rules discover only intra-transaction patterns (itemsets), we now also
have to discover inter-transaction patterns (sequences) across related transac-
tions. The set of all frequent sequences is a superset of the set of frequent
itemsets. Hence, sequence search is much more complex and challenging than
itemset search, thereby necessitating fast parallel algorithms.
Serial algorithms for sequence mining that have been parallelized include
GSP [23], MSDD [24], and SPADE [25]. GSP is designed after Apriori. It com-
putes the frequency of candidate sequences of length k in iteration k. The can-
didates are generated from the frequent sequences from the previous iteration.
MSDD discovers patterns in multiple event sequences; it explores the rule space
directly instead of the sequence space. SPADE is similar to Eclat. It uses verti-
cal layout and temporal joins to compute frequency. The search space is broken
into small memory-resident chunks, which are explored in depth- or breadth-first
manner.
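To give a flavour of SPADE's vertical format, a simplified temporal-join sketch follows (an editorial illustration, not the published algorithm): each pattern carries an id-list of (sequence id, event time) pairs, and a join keeps occurrences of the second pattern that happen strictly after the first within the same sequence.

    # Simplified SPADE-style temporal join on vertical id-lists.
    # An id-list is a list of (sid, eid) pairs: sequence id and event time.
    def temporal_join(idlist_p, idlist_x):
        """Occurrences of pattern x strictly after pattern p, per sequence."""
        joined = []
        for sid_p, eid_p in idlist_p:
            for sid_x, eid_x in idlist_x:
                if sid_x == sid_p and eid_x > eid_p:
                    joined.append((sid_x, eid_x))
        # Duplicates are harmless for support counting but easy to drop:
        return sorted(set(joined))

    def support(idlist):
        return len({sid for sid, _ in idlist})

    # id-lists for the 1-sequences <A> and <B>.
    A = [(1, 10), (1, 30), (2, 15), (3, 10)]
    B = [(1, 20), (2, 10), (3, 40)]
    AB = temporal_join(A, B)        # occurrences of <A -> B>
    print(AB, support(AB))          # [(1, 20), (3, 40)] 2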
Three parallel algorithms based on GSP were presented in [26]. All three
methods use the partitioned database approach, and are distributed-memory
based. NPSPM (with replicated candidates) is equivalent to NPA, SPSPM (with
partitioned candidates) is the same as SPA, and HPSPM is equivalent to HPA,
which have been described above. HPSPM performed the best among the three.
A parallel and distributed implementation of MSDD was presented in [27].
A shared-memory, SPADE-based parallel algorithm, utilizing dynamic load
balancing is described by Zaki, and new algorithms for parallel sequence mining
are also described by Joshi et al. in this volume.

2.4 Classification

Classification aims to assign a new data item to one of several predefined cat-
egorical classes [28,29]. Since the field being predicted is pre-labeled, classifica-
tion is also known as supervised induction. While there are several classification
methods including neural networks [30] and genetic algorithms [31], decision
trees [32,33] are particularly suited to data mining, since they can be constructed
relatively quickly, and are simple and easy to understand. Common applications
of classification include credit card fraud detection, insurance risk analysis, bank
loan approval, etc.
A decision tree is built using a recursive partitioning approach. Each internal
node in the tree represents a decision on an attribute, which splits the database
into two or more children. Initially the root contains the entire database, with
examples from mixed classes. The split point chosen is the one that best separates
or discriminates the classes. Each new node is recursively split in the same
manner until a node contains only one or a majority class.
Decision tree classifiers typically use a greedy search over the space of all
possible trees; there are simply too many trees to allow a complete search. The
search is also biased towards simple trees. Existing classifiers have used both the
horizontal and vertical database layouts. In parallel decision tree construction
the candidate concepts are the possible split points for all attributes within a
node of the expanding tree. For numeric attributes a split point is of the form
A ≤ vi, and for categorical attributes the test takes the form A ∈ {v1, v2, ...},
where vi is a value from the domain of attribute A.
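For concreteness, here is a minimal single-attribute, two-class sketch of evaluating candidate numeric split points with the Gini index; it illustrates the general idea only and is not the SLIQ/SPRINT code.

    # Evaluate candidate splits "A <= v" for one numeric attribute using the
    # Gini index; every midpoint between consecutive sorted values is tested.
    def gini(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        p = sum(labels) / n              # fraction of class "1"
        return 1.0 - p * p - (1.0 - p) * (1.0 - p)

    def best_numeric_split(values, labels):
        pairs = sorted(zip(values, labels))        # pre-sorted attribute list
        best_v, best_score = None, float("inf")
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                            # no split between equal values
            v = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [y for x, y in pairs[:i]]
            right = [y for x, y in pairs[i:]]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
            if score < best_score:
                best_v, best_score = v, score
        return best_v, best_score

    ages = [23, 25, 30, 41, 52, 60]
    default = [0, 0, 0, 1, 1, 1]                    # two-class labels
    print(best_numeric_split(ages, default))        # (35.5, 0.0) - a pure split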
Below we look at some parallel decision tree methods. Recent surveys on
parallel and scalable induction methods are also presented in [34,35].

Replicated Tree, Partitioned Database. SLIQ [36] was one of the earliest
scalable decision tree classifiers. It uses a vertical data format, called attribute
lists, allowing it to pre-sort numeric attributes in the beginning, thus avoiding the
repeated sorting required at each node in traditional tree induction. Nevertheless
it uses a memory-resident structure called class-list, which grows linearly in the
number of input records. SPRINT [37] removes this memory dependence, by
storing the classes as part of the attribute lists. It uses data parallelism, and a
distributed-memory platform.
In SPRINT and parallel versions of SLIQ, the attribute lists are horizontally
partitioned among all processors. The decision tree is also replicated on all pro-
cessors. The tree is constructed synchronously in a breadth-first manner. Each
processor computes the best split point, using its local attribute lists, for all the
nodes on the current tree level. A round of communication takes place to de-
termine the best split point among all processors. Each processor independently
splits the current nodes into new children using the best split point, setting
the stage for the next tree level. Since a horizontal record is split in multiple
attribute lists, a hash table is used to note which record belongs to which child.
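That round of communication can be sketched as follows, assuming mpi4py is available; local_best_split is a hypothetical stand-in for the per-processor evaluation of its local attribute lists.

    # Sketch of the SPRINT-style round of communication that picks the global
    # best split for a tree node (assumes mpi4py).
    from mpi4py import MPI

    def local_best_split(node, rank):
        # Hypothetical stand-in for the local evaluation (e.g., the Gini scan
        # sketched earlier) over this processor's attribute-list partitions;
        # returns (impurity, attribute, split_value).
        return (0.25 + 0.01 * rank, "age", 35.5)

    def global_best_split(node):
        comm = MPI.COMM_WORLD
        mine = local_best_split(node, comm.Get_rank())
        # One round of communication: every processor sees every local winner
        # and deterministically agrees on the split with the lowest impurity,
        # with no redistribution of attribute lists.
        candidates = comm.allgather(mine)
        return min(candidates, key=lambda s: s[0])

    if __name__ == "__main__":
        print(global_best_split(node=0))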
The parallelization of SLIQ follows a similar paradigm, except for the way
the class list is treated. SLIQ/R uses a replicated class list, while SLIQ/D uses
a distributed class list. Experiments showed that while SLIQ/D is better able
to exploit available memory, SLIQ/R was better in terms of performance, but
SPRINT outperformed both SLIQ/R and SLIQ/D.
ScalParC [38] is also an attribute-list-based parallel classifier for distributed-
memory machines. It is similar in design to SLIQ/D (except that it uses hash
tables per node, instead of global class lists). It uses a novel distributed hash
table for splitting a node, reducing the communication complexity and memory
requirements over SPRINT, making it scalable to larger datasets.
The DP-rec and DP-att [39] algorithms exploit record-based and attribute-
based data parallelism, respectively. In record-based data parallelism (also used
in SPRINT, ScalParC, SLIQ/D, and SLIQ/R), the records or attribute lists are
horizontally partitioned among the processors. In contrast, in attribute-based
data parallelism, the attributes are divided so that each processor is responsible
for an equal number of attributes. In both the schemes processors cooperate to
expand a tree node. Local computations are performed in parallel, followed by
information exchanges to get a global best split point.
Parallel Decision Tree (PDT) [40], a distributed-memory, data-parallel algo-
rithm, splits the training records horizontally in equal-sized blocks, among the
processors. It follows a master-slave paradigm, where the master builds the tree,
and finds the best split points. The slaves are responsible for sending class fre-
quency statistics to the master. For categorical attributes, each processor gathers
local class frequencies, and forwards them to the master. For numeric attributes,
each processor sorts the local values, finds class frequencies for split points, and
exchanges these with all other slaves. Each slave can then calculate the best local
split point, which is sent to the master, who then selects the best global split
point.

Shared Tree, Shared Database. MWK (and its precursors BASIC and
FWK) [41], a shared-memory implementation based on SPRINT, uses this ap-
proach. MWK uses dynamic attribute-based data parallelism. Multiple proces-
sors co-operate to build a shared decision tree in a breadth-first manner. Using
a dynamic scheduling scheme, each processor acquires an attribute for any tree
node at the current level, and evaluates the split points, before processing an-
other attribute. The processor that evaluates the last attribute of a tree node,
also computes the best split point for that node. Similarly, the attribute lists are
split among the children using attribute parallelism.

Hybrid Tree Parallelism. SUBTREE [41] uses dynamic task parallelism (that
exists in different sub-trees) combined with data parallelism on shared-memory
systems. Initially all processors belong to one group, and apply data parallelism
at the root. Once new child nodes are formed, the processors are also partitioned
into groups, so that a group of child nodes can be processed in parallel by a
processor group. If the tree nodes associated with a processor group become
pure (i.e., contain examples from a single class), then these processors join some
other active group.
The Hybrid Tree Formulation (HTF) in [42] is very similar to SUBTREE.
HTF uses distributed memory machines, and thus data redistribution is required
in HTF when assigning a set of nodes to a processor group, so that the processor
group has all records relevant to an assigned node.
pCLOUDS [43] is a distributed-memory parallelization of CLOUDS [44]. It
does not require attribute lists or the pre-sorting for numeric attributes; instead
it samples the split points for numeric attributes followed by an estimation step
to narrow the search space for the best split. It thus reduces both computation
and I/O requirements. pCLOUDS employs a mixed parallelism approach. Ini-
tially, data parallelism is applied for nodes with many records. All small nodes
are queued to be processed later using task parallelism. Before processing small
nodes the data is redistributed so that all required data is available locally at a
processor.

2.5 Clustering
Clustering is used to partition database records into subsets or clusters, such
that elements of a cluster share a set of common properties that distinguish
them from other clusters [45,46,47,48]. The goal is to maximize intra-cluster
and minimize inter-cluster similarity. Unlike classification which has predefined
labels, clustering must in essence automatically come up with the labels. For this
reason clustering is also called unsupervised induction. Applications of clustering
include demographic or market segmentation for identifying common traits of
groups of people, discovering new types of stars in datasets of stellar objects,
and so on.
The K-means algorithm is a popular clustering method. The idea is to ran-
domly pick K data points as cluster centers. Next, each record or point is assigned
to the cluster it is closest to in terms of squared-error or Euclidean distance. A
new center is computed by taking the mean of all points in a cluster, setting the
stage for the next iteration. The process stops when the cluster centers cease to
change. Parallelization of K-means received a lot of attention in the past. Differ-
ent parallel methods, mainly using hypercube computers, appear in [49,50,51,52].
We do not describe these methods in detail, since they used only small memory-
resident datasets.
Hierarchical clustering represents another common paradigm. These methods
start with a set of distinct points, each forming its own cluster. Then recursively,
two clusters that are close are merged into one, until all points belong to a
single cluster. In [49,53], parallel hierarchical agglomerative clustering algorithms
were presented, using several inter-cluster distance metrics and parallel computer
architectures. These methods also report results on small datasets.
P-CLUSTER [54] is a distributed-memory client-server K-means algorithm.
Data is partitioned into blocks on a server, which sends initial cluster centers and
data blocks to each client. A client assigns each record in its local block to the
nearest cluster, and sends results back to the server. The server then recalculates
the new centers and another iteration begins. To further improve performance
P-CLUSTER uses the fact that after the first few iterations only a few
records change cluster assignments, and also the centers have less tendency to
move in later iterations. They take advantage of these facts to reduce the number
of distance calculations, and thus the time of the clustering algorithm.
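A compact sketch of one such data-parallel iteration is given below, assuming mpi4py and NumPy are available; it shows the generic assign/reduce/re-center step rather than the P-CLUSTER implementation itself.

    # One data-parallel K-means iteration (assumes mpi4py + NumPy): each
    # process assigns its local block of points, partial sums are combined
    # with a sum-reduction, and every process ends up with the new centers.
    import numpy as np
    from mpi4py import MPI

    def kmeans_step(local_points, centers):
        comm = MPI.COMM_WORLD
        k, dim = centers.shape
        # Assign every local point to its nearest center.
        dists = np.linalg.norm(local_points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Local partial sums and counts per cluster.
        sums = np.zeros((k, dim))
        counts = np.zeros(k)
        for j in range(k):
            members = local_points[assign == j]
            sums[j] = members.sum(axis=0)
            counts[j] = len(members)
        # Global reduction: only k*dim + k numbers are communicated.
        comm.Allreduce(MPI.IN_PLACE, sums, op=MPI.SUM)
        comm.Allreduce(MPI.IN_PLACE, counts, op=MPI.SUM)
        counts = np.maximum(counts, 1)            # guard against empty clusters
        return sums / counts[:, None]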
Among the recent methods, MAFIA [55], is a distributed memory algorithm
for subspace clustering. Traditional methods, like K-means and hierarchical clus-
tering, find clusters in the whole data space, i.e., they use all dimensions for dis-
tance computations. Subspace clustering focuses on finding clusters embedded
in subsets of a high-dimensional space. MAFIA uses adaptive grids (or bins) in
each dimension, which are merged to find clusters in higher dimensions. Parallel
implementation of MAFIA is similar to association mining. The candidates here
are the potentially dense units (the subspace clusters) in k dimensions, which
have to be tested if they are truly dense. MAFIA employs task parallelism,
where data as well as candidates are equally partitioned among all processors.
Each processor computes local density, followed by a reduction to obtain global
density.
The paper by Dhillon and Modha in this volume presents a distributed-
memory parallelization of K-means, while the paper by Johnson and Kargupta
describes a distributed hierarchical clustering method.

2.6 Distributed Mining Frameworks


Recently, there has been an increasing interest in distributed and wide-area data
mining systems. The fact that many global businesses and scientific endeavors
require access to multiple, distributed, and often heterogeneous databases, un-
derscores the growing importance of distributed data mining.
An ideal platform for DDM is a cluster of machines at a local site, or cluster
of clusters spanning a wide area, the so-called computational grids, connected
via Internet or other high speed networks. As we noted earlier, PDM is best
viewed as a local component within a DDM system. Further, the main differences
between the two are the cost of communication or data movement, and the fact
that DDM must typically handle multiple (possibly heterogeneous) databases.
Below we review some recent efforts in developing DDM frameworks.
Most methods/systems for DDM assume that the data is horizontally par-
titioned among the sites, and is homogeneous (share the same feature space).
Each site mines its local data and generates locally valid concepts. These con-
cepts are exchanged among all the sites to obtain the globally valid concepts.
The Partition [8] algorithm for association mining is a good example. It is in-
herently suitable for DDM. Each site can generate locally frequent itemsets at a
given threshold level. All local results are combined and then evaluated at each
site to obtain the globally frequent itemsets.
Another example is JAM [56,57], a java-based multi-agent system utilizing
meta-learning, used primarily in fraud-detection applications. Each agent builds
a classification model, and different agents are allowed to build classifiers using
different techniques. JAM also provides a set of meta-learning agents for combin-
ing multiple models learnt at different sites into a meta-classifier that in many
cases improves the overall predictive accuracy. Knowledge Probing [58] is another
approach to meta-learning. Knowledge probing retains a descriptive model af-
ter combining multiple classifiers, rather than treating the meta-classifier as a
black-box. The idea is to learn on a separate dataset, the class predictions from
all the local classifiers.
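The combining step can be illustrated with a deliberately simple sketch: site-local models trained on horizontal partitions and combined by majority vote. The ThresholdModel class and the toy data are invented for illustration; JAM's actual meta-learning strategies are more sophisticated.

    # Toy meta-learning sketch: each site trains its own model on local data,
    # and a meta-classifier combines their predictions by majority vote.
    from collections import Counter

    class ThresholdModel:
        """Stand-in for a site-local classifier: one feature, one threshold,
        with the comparison direction learned from that site's data."""
        def __init__(self, feature_index):
            self.i = feature_index

        def fit(self, rows, labels):
            pos = [r[self.i] for r, y in zip(rows, labels) if y == 1]
            neg = [r[self.i] for r, y in zip(rows, labels) if y == 0]
            self.t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
            self.hi_is_pos = sum(pos) / len(pos) > sum(neg) / len(neg)
            return self

        def predict(self, row):
            above = row[self.i] > self.t
            return 1 if above == self.hi_is_pos else 0

    def meta_predict(models, row):
        votes = Counter(m.predict(row) for m in models)
        return votes.most_common(1)[0][0]

    # Two "sites" with their own horizontal partitions of the data.
    site1 = ([[1.0, 9.0], [2.0, 8.0], [8.0, 1.0], [9.0, 2.0]], [0, 0, 1, 1])
    site2 = ([[1.5, 7.0], [2.5, 9.0], [7.5, 2.0], [8.5, 1.0]], [0, 0, 1, 1])
    models = [ThresholdModel(0).fit(*site1), ThresholdModel(1).fit(*site2)]
    print(meta_predict(models, [8.0, 1.5]))   # -> 1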
PADMA [59] is an agent based architecture for distributed mining. Individual
agents are responsible for local data access, hierarchical clustering in text doc-
ument classification, and web based information visualization. The BODHI [60]
DDM system is based on the novel concept of collective data mining. Naive min-
ing of heterogeneous, vertically partitioned sites can lead to an incorrect global
data model. BODHI guarantees correct local and global analysis with minimum
communication.
In [61] a new distributed do-all primitive, called D-DOALL, was described
that allows easy scheduling of independent mining tasks on a network of work-
stations. The framework allows incremental reporting of results, and seeks to
reduce communication via resource-aware task scheduling principles.
The Papyrus [62] java-based system specifically targets wide-area DDM over
clusters and meta-clusters. It supports different data, task and model strate-
gies. For example, it can move models, intermediate results or raw data between
nodes. It can support coordinated or independent mining, and various meth-
ods for combining local models. Papyrus uses PMML (Predictive Model Markup
Language) to describe and exchange mined models. Kensington [63] is another
java-based system for distributed enterprise data mining. It is a three-tiered sys-
tem, with a client front-end for GUI, and visual programming of data mining
tasks. The middle-layer application server provides persistent storage, task exe-
cution control, and data management and preprocessing functions. The third-tier
implements a parallel data mining service.
Other recent work in DDM includes decision tree construction over dis-
tributed databases [64], where the learning agents can only exchange summaries
instead of raw data, and the databases may have shared attributes. The main
challenge is to construct a decision tree using implicit records rather than ma-
terializing a join over all the datasets. The WoRLD system [65] describes an
inductive rule-learning program that learns from data distributed over a net-
work. WoRLD also avoids joining databases to create a central dataset. Instead
it uses marker-propagation to compute statistics. A marker is a label of a class
of interest. Counts of the different markers are maintained with each attribute
value, and used for evaluating rules. Markers are propagated among different
tables to facilitate distributed learning.
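The flavour of marker counting can be sketched on a single table as below; the toy rows and the counting loop are invented for illustration and do not reproduce WoRLD's cross-table propagation.

from collections import defaultdict

# Each row: attribute values plus a marker flagging membership in the class of interest.
rows = [
    {"region": "north", "product": "auto", "marker": True},
    {"region": "north", "product": "home", "marker": False},
    {"region": "south", "product": "auto", "marker": True},
    {"region": "south", "product": "auto", "marker": False},
]

# Maintain marker counts per (attribute, value) instead of joining tables centrally.
counts = defaultdict(lambda: [0, 0])            # value -> [marker, non-marker]
for row in rows:
    for attr, val in row.items():
        if attr == "marker":
            continue
        counts[(attr, val)][0 if row["marker"] else 1] += 1

# A candidate rule condition can now be scored from the propagated counts alone.
for (attr, val), (pos, neg) in sorted(counts.items()):
    print(f"{attr}={val}: marker rate {pos / (pos + neg):.2f}")
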
For more information on parallel and distributed data mining see the book
by Freitas and Lavington [66] and the edited volume by Kargupta and Chan [67].
Also see [68] for a discussion of cost-effective measures for assessing the perfor-
mance of a mining algorithm before implementing it.

3 Research Issues and Challenges


In this section we highlight some of the outstanding research issues and a number
of open problems for designing and implementing the next-generation large-scale
mining methods and KDD systems.

High Dimensionality. Current methods are only able to handle a few thousand
dimensions or attributes. Consider association rule mining as an example. The
second iteration of the algorithm counts the frequency of all pairs of items,
which has quadratic complexity. In general, the complexity of different mining
algorithms may not be linear in the number of dimensions, and new parallel
methods are needed that are able to handle large numbers of attributes.
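The quadratic growth of that second iteration is easy to quantify directly; the attribute counts below are arbitrary examples.

from math import comb

# Number of candidate 2-itemsets the second pass of association mining must count.
for m in (1_000, 10_000, 100_000):
    print(f"{m:>7} items -> {comb(m, 2):>14,} candidate pairs")
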

Large Size. Databases continue to increase in size. Current methods are able
to (perhaps) handle data in the gigabyte range, but are not suitable for terabyte-
sized data. Even a single scan for these databases is considered expensive. Most
current algorithms are iterative, and scan data multiple times. For example, it
is an open problem to mine all frequent associations in a single pass, although
sampling based methods show promise [69,70]. In general, minimizing the num-
ber of data scans is paramount. Another factor limiting the scalability of most
mining algorithms is that they rely on in-memory data structures for storing
potential patterns and information about them (such as candidate hash tree [6]
in associations, tid hash table [71] in classification). For large databases these
structures will certainly not fit in aggregate system memory. This means that
temporary results will have to be written out to disk or the database will have
to be divided into partitions small enough to be processed in memory, entailing
further data scans.

Data Location. Today’s large-scale data sets are usually logically and phys-
ically distributed, requiring a decentralized approach to mining. The database
may be horizontally partitioned where different sites have different transactions,
or it may be vertically partitioned, with different sites having different attributes.
Most current work has only dealt with the horizontal partitioning approach. The
databases may also have heterogeneous schemas.

Data Type. To-date most data mining research has focused on structured data,
as it is the simplest, and most amenable to mining. However, support for other
data types is crucial. Examples include unstructured or semi-structured (hy-
per)text, temporal, spatial and multimedia databases. Mining these is fraught
with challenges, but is necessary as multimedia content and digital libraries pro-
liferate at astounding rates. Techniques from parallel and distributed computing
will lie at the heart of any proposed scalable solutions.

Data Skew. One of the problems adversely affecting load balancing in paral-
lel mining algorithms is sensitivity to data skew. Most methods partition the
database horizontally in equal-sized blocks. However, the number of patterns
generated from each block can be heavily skewed, i.e., while one block may con-
tribute many, the other may have very few patterns, implying that the processor
responsible for the latter block will be idle most of the time. Randomizing the
blocks is one solution, but it is still not adequate, given the dynamic and inter-
active nature of mining. The effect of skewness on different algorithms needs to
be further studied (see [72] for some recent work).

Dynamic Load Balancing. Most extant algorithms use only a static par-
titioning scheme based on the initial data decomposition, and they assume a
homogeneous, dedicated environment. This is far from reality. A typical parallel
database server has multiple users, and has transient loads. This calls for an in-
vestigation of dynamic load balancing schemes. Dynamic load balancing is also
crucial in a heterogeneous environment, which can be composed of meta- and
super-clusters, with machines ranging from ordinary workstations to supercom-
puters.

Incremental Methods. Every day new data is being collected, and existing
data stores are being updated with new data or purged of old data. To-
date there have been no parallel or distributed algorithms that are incremental
in nature, which can handle updates and deletions without having to recompute
patterns or rules over the entire database.

Multi-table Mining, Data Layout, and Indexing Schemes. Almost no
work has been done on mining over multiple tables or over distributed databases
which have different schemas. Data in a warehouse is typically arranged in a star
schema, with a central fact table (e.g., point-of-sales data), and associated dimen-
sion tables (e.g., product information, manufacturer, etc.). Traditional mining
over these multiple tables would first require us to create a large single table that
is the join of all the tables. The joined table also has tremendous amounts of re-
dundancy. We need better methods for processing such multiple tables, without
having to materialize a single large view. Also, little work has been done on the
optimal or near-optimal data layout or indexing schemes for fast data access for
mining.

Parallel DBMS/File Systems. To-date most results reported have hand-
partitioned the database, mainly horizontally, on different processors. There has
been very little study conducted in using a parallel database/file system for
managing the partitioned database, and the accompanying striping, and lay-
out issues. Recently there has been increasing emphasis on tight database in-
tegration of mining [73,74,75,76], but it has mainly been confined to sequential
approaches. Some exceptions include Data Surveyor [77], a mining tool that
uses the Monet database server for parallel classification rule induction. Also,
generic set-oriented primitive operations were proposed in [78] for classification
and clustering. These primitives were fully integrated with a parallel DBMS.

Interaction, Pattern Management, and Meta-level Mining. The KDD
process is highly interactive, as the human participates in almost all the steps.
For example, the user is heavily involved in the initial data understanding, se-
lection, cleaning, and transformation phases. These steps in fact consume more
time than mining per se. Moreover, depending on the parameters of the search,
mining methods may generate too many patterns to be analyzed directly. One
needs methods to allow meta-level queries [79,80,81] on the results, to impose
constraints that focus on patterns of interest [82,83], to refine or generalize
rules [84,85], etc. Thus there is a need for a complete set of tools that query
and mine the pattern/model database as well. Parallel methods can be success-
ful in providing the desired rapid response in all of the above steps.

4 Book Organization

This book contains chapters covering all the major tasks in data mining including
parallel and distributed mining frameworks, associations, sequences, clustering
and classification. We provide a brief synopsis of each chapter below, organized
under four main headings.

4.1 Mining Frameworks

Graham Williams et al. present Data Miner’s Arcade, a java-based platform-
independent system for integrating multiple analysis and mining tools, using a
common API, and providing seamless data access across multiple systems. Com-
ponents of the DM Arcade include parallel algorithms (e.g., BMARS - multiple
adaptive regression B-splines), virtual environments for data visualization, and
data management for mining.
Bailey et al. describe the implementation of Osiris, a data server for wide-
area distributed data mining, built upon clusters, meta-clusters (with commodity
network like Internet) and super-clusters (with high-speed network). Osiris ad-
dresses three key issues: What data layout should be used on the server? What
tradeoffs are there in moving data or predictive models between nodes? How data
should be moved to minimize latency; what protocols should be used? Experi-
ments were performed on a wide-area system linking Chicago and Washington
via the NSF/MCI vBNS network.
Parthasarathy et al. present InterAct, an active mining framework for dis-
tributed mining. Active mining refers to methods that maintain valid mined pat-
terns or models in the presence of user interaction and database updates. The
framework uses mining summary structures that are maintained across updates
or changes in user specifications. InterAct also allows effective client-server data
and computation sharing. Active mining results were presented on a number of
methods like discretization, associations, sequences, and similarity search.

4.2 Association Rules and Sequences

Joshi et al. open this section with a survey chapter on parallel mining of as-
sociation rules and sequences. They discuss the many extant parallel solutions,
and give an account of the challenges and issues for effective formulations of
discovering frequent itemsets and sequences.
Morishita and Nakaya describe a novel parallel algorithm for mining corre-
lated association rules. They mine rules based on the chi-squared metric that
optimizes the statistical significance or correlation between the rule antecedent
and consequent. A parallel branch-and-bound algorithm was proposed that uses
a term rewriting technique to avoid explicitly maintaining lists of open and
closed nodes on each processor. Experiments on SMP platforms (with up to 128
processors) show very good speedups.
Shintani and Kitsuregawa propose new load balancing strategies for general-
ized association rule mining using a gigabyte-sized database on a cluster of 100
PCs connected with an ATM network. In generalized associations the items are
at the leaf levels in a hierarchy or taxonomy of items, and the goal is to discover
rules involving concepts at multiple (and mixed) levels. They show that load
balancing is crucial for performance on such large-scale clusters.
Zaki presents pSPADE, a parallel algorithm for sequence mining. pSPADE
divides the pattern search space into disjoint, independent sub-problems based
on suffix-classes, each of which can be solved in parallel in an asynchronous
manner. Task parallelism and dynamic inter- and intra-class load balancing is
used for good performance. Results on a 12 processor SMP using up to a 1 GB
dataset show good speedup and scaleup.

4.3 Classification

Skillicorn presents parallel techniques for generating predictors for classification
and regression models. A recent trend in learning is to build multiple prediction
models on different samples from the training set, and combine them, allowing
faster induction and lower error rates. This framework is highly amenable to
parallelism and forms the focus of this paper.
Goil and Choudhary implemented a parallel decision tree classifier using the
aggregates computed in multidimensional analysis or OLAP. They compute ag-
gregates/counts per class along various dimensions, which can then be used for
computing the attribute split-points. Communication is minimized by coalescing
messages and is done once per tree level. Experiments on a 16 node IBM SP2
were presented.
Hall et al. describe distributed rule induction for learning a single model
from disjoint datasets. They first learn local rules from a single site; these are
merged to form a global rule set. They show that while this approach promises
fast induction, accuracy tapers off (as compared to directly mining the whole
database) as the number of sites increases. They suggested some heuristics to
minimize this loss in accuracy.

4.4 Clustering

Johnson and Kargupta present the Collective Hierarchical Clustering algorithm
for clustering over distributed, heterogeneous databases. Rather than gathering
the data at a central site, they generate local cluster models, which are subse-
quently combined to obtain the global clustering.
Dhillon and Modha parallelized the K-means clustering algorithm on a 16
node IBM SP2 distributed-memory system. They exploit the inherent data par-
allelism of the K-means algorithm, by performing the point-to-centroid distance
calculations in parallel. They demonstrated linear speedup on a 2GB dataset.

5 Conclusion

We conclude by observing that the need for large-scale data mining algorithms
and systems is real and immediate. Parallel and distributed computing is es-
sential for providing scalable, incremental and interactive mining solutions. The
field is in its infancy, and offers many interesting research directions to pur-
sue. We hope that this volume, representing the state-of-the-art in parallel and
distributed mining methods, will be successful in bringing to the surface the
requirements and challenges in large-scale parallel KDD systems.

References

1. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge
discovery: An overview. [86]
2. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: The KDD process for extracting
useful knowledge from volumes of data. Communications of the ACM 39 (1996)
3. Simoudis, E.: Reality check for data mining. IEEE Expert: Intelligent Systems
and Their Applications 11 (1996) 26–33
4. DeWitt, D., Gray, J.: Parallel database systems: The future of high-performance
database systems. Communications of the ACM 35 (1992) 85–98
5. Valduriez, P.: Parallel database systems: Open problems and new issues. Dis-
tributed and Parallel Databases 1 (1993) 137–165
6. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery
of association rules. In Fayyad, U., et al, eds.: Advances in Knowledge Discovery
and Data Mining, AAAI Press, Menlo Park, CA (1996) 307–328
7. Park, J.S., Chen, M., Yu, P.S.: An effective hash based algorithm for mining
association rules. In: ACM SIGMOD Intl. Conf. Management of Data. (1995)
8. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining asso-
ciation rules in large databases. In: 21st VLDB Conf. (1995)
9. Brin, S., Motwani, R., Ullman, J., Tsur, S.: Dynamic itemset counting and im-
plication rules for market basket data. In: ACM SIGMOD Conf. Management of
Data. (1997)
10. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast dis-
covery of association rules. In: 3rd Intl. Conf. on Knowledge Discovery and Data
Mining. (1997)
11. Mueller, A.: Fast sequential and parallel algorithms for association rule mining: A
comparison. Technical Report CS-TR-3515, University of Maryland, College Park
(1995)
12. Park, J.S., Chen, M., Yu, P.S.: Efficient parallel data mining for association rules.
In: ACM Intl. Conf. Information and Knowledge Management. (1995)
13. Agrawal, R., Shafer, J.: Parallel mining of association rules. IEEE Trans. on
Knowledge and Data Engg. 8 (1996) 962–969
14. Cheung, D., Han, J., Ng, V., Fu, A., Fu, Y.: A fast distributed algorithm for mining
association rules. In: 4th Intl. Conf. Parallel and Distributed Info. Systems. (1996)
15. Shintani, T., Kitsuregawa, M.: Hash based parallel algorithms for mining associa-
tion rules. In: 4th Intl. Conf. Parallel and Distributed Info. Systems. (1996)
16. Zaki, M.J., Ogihara, M., Parthasarathy, S., Li, W.: Parallel data mining for asso-
ciation rules on shared-memory multi-processors. In: Supercomputing’96. (1996)
17. Cheung, D., Hu, K., Xia, S.: Asynchronous parallel algorithm for mining asso-
ciation rules on shared-memory multi-processors. In: 10th ACM Symp. Parallel
Algorithms and Architectures. (1998)
18. Han, E.H., Karypis, G., Kumar, V.: Scalable parallel data mining for association
rules. In: ACM SIGMOD Conf. Management of Data. (1997)
19. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: Parallel algorithms for fast
discovery of association rules. Data Mining and Knowledge Discovery: An Inter-
national Journal 1(4):343-373 (1997)
20. Tamura, M., Kitsuregawa, M.: Dynamic load balancing for parallel association rule
mining on heterogeneous PC cluster systems. In: 25th Intl Conf. on Very Large
Data Bases. (1999)
21. Zaki, M.J.: Parallel and distributed association mining: A survey. IEEE Concur-
rency 7 (1999) 14–25
22. Agrawal, R., Srikant, R.: Mining sequential patterns. In: 11th Intl. Conf. on Data
Engg. (1995)
23. Srikant, R., Agrawal, R.: Mining sequential patterns: Generalizations and perfor-
mance improvements. In: 5th Intl. Conf. Extending Database Technology. (1996)
24. Oates, T., Schmill, M.D., Jensen, D., Cohen, P.R.: A family of algorithms for
finding temporal structure in data. In: 6th Intl. Workshop on AI and Statistics.
(1997)
25. Zaki, M.J.: Efficient enumeration of frequent sequences. In: 7th Intl. Conf. on
Information and Knowledge Management. (1998)
26. Shintani, T., Kitsuregawa, M.: Mining algorithms for sequential patterns in paral-
lel: Hash based approach. In: 2nd Pacific-Asia Conf. on Knowledge Discovery and
Data Mining. (1998)
27. Oates, T., Schmill, M.D., Cohen, P.R.: Parallel and distributed search for structure
in multivariate time series. In: 9th European Conference on Machine Learning.
(1997)
28. Weiss, S.M., Kulikowski, C.A.: Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman (1991)
29. Michie, D., Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Sta-
tistical Classification. Ellis Horwood (1994)
30. Lippmann, R.: An introduction to computing with neural nets. IEEE ASSP
Magazine 4 (1987)
31. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learn-
ing. Morgan Kaufmann (1989)
32. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regres-
sion Trees. Wadsworth, Belmont (1984)
33. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufman (1993)
34. Provost, F., Aronis, J.: Scaling up inductive learning with massive parallelism.
Machine Learning 23 (1996)
35. Provost, F., Kolluri, V.: A survey of methods for scaling up inductive algorithms.
Data Mining and Knowledge Discovery: An International Journal 3 (1999) 131–169
36. Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A fast scalable classifier for data
mining. In: Proc. of the Fifth Intl Conference on Extending Database Technology
(EDBT), Avignon, France (1996)
37. Shafer, J., Agrawal, R., Mehta, M.: Sprint: A scalable parallel classifier for data
mining. In: 22nd VLDB Conference. (1996)
38. Joshi, M., Karypis, G., Kumar, V.: ScalParC: A scalable and parallel classifica-
tion algorithm for mining large datasets. In: Intl. Parallel Processing Symposium.
(1998)
39. Chattratichat, J., Darlington, J., Ghanem, M., Guo, Y., Huning, H., Kohler, M.,
Sutiwaraphun, J., To, H.W., Dan, Y.: Large scale data mining: Challenges and
responses. In: 3rd Intl. Conf. on Knowledge Discovery and Data Mining. (1997)
40. Kufrin, R.: Decision trees on parallel processors. In Geller, J., Kitano, H., Suttner,
C., eds.: Parallel Processing for Artificial Intelligence 3, Elsevier-Science (1997)
41. Zaki, M.J., Ho, C.T., Agrawal, R.: Parallel classification for data mining on shared-
memory multiprocessors. In: 15th IEEE Intl. Conf. on Data Engineering. (1999)
42. Srivastava, A., Han, E.H., Kumar, V., Singh, V.: Parallel formulations of decision-
tree classification algorithms. Data Mining and Knowledge Discovery: An Interna-
tional Journal 3 (1999) 237–261
43. Sreenivas, M., Alsabti, K., Ranka, S.: Parallel out-of-core divide and conquer
techniques with application to classification trees. In: 13th International Parallel
Processing Symposium. (1999)
44. Alsabti, K., Ranka, S., Singh, V.: Clouds: A decision tree classifier for large
datasets. In: 4th Intl Conference on Knowledge Discovery and Data Mining. (1998)
45. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall (1988)
46. Cheeseman, P., Kelly, J., Self, M., et al.: AutoClass: A Bayesian classification
system. In: 5th Intl Conference on Machine Learning, Morgan Kaufman (1988)
47. Fisher, D.H.: Knowledge acquisition via incremental conceptual clustering. Ma-
chine Learning 2 (1987)
48. Michalski, R.S., Stepp, R.E.: Learning from observation: Conceptual clustering.
In Michalski, R.S., Carbonell, J.G., Mitchell, T.M., eds.: Machine Learning: An
Artificial Intelligence Approach. Volume I. Morgan Kaufmann (1983) 331–363
49. Li, X., Fang, Z.: Parallel clustering algorithms. Parallel Computing 11 (1989)
270–290
50. Rivera, F., Ismail, M., Zapata, E.: Parallel squared error clustering on hypercube
arrays. Journal of Parallel and Distributed Computing 8 (1990) 292–299
51. Ranka, S., Sahni, S.: Clustering on a hypercube multicomputer. IEEE Trans. on
Parallel and Distributed Systems 2(2) (1991) 129–137
52. Rudolph, G.: Parallel clustering on a unidirectional ring. In et al., R.G., ed.:
Transputer Applications and Systems ’93: Volume 1. IOS Press, Amsterdam (1993)
487–493
53. Olson, C.: Parallel algorithms for hierarchical clustering. Parallel Computing 21
(1995) 1313–1325
54. Judd, D., McKinley, P., Jain, A.: Large-scale parallel data clustering. In: Intl Conf.
Pattern Recognition. (1996)
55. Goil, S., Nagesh, H., Choudhary, A.: MAFIA: Efficient and scalable subspace cluster-
ing for very large data sets. Technical Report 9906-010, Center for Parallel and
Distributed Computing, Northwestern University (1999)
56. Stolfo, S., Prodromidis, A., Tselepis, S., Lee, W., Fan, W., Chan, P.: Jam: Java
agents for meta-learning over distributed databases. In: 3rd Intl. Conf. on Knowl-
edge Discovery and Data Mining. (1997)
57. Prodromidis, A., Stolfo, S., Chan, P.: Meta-learning in distributed data mining
systems: Issues and approaches. [67]
58. Guo, Y., Sutiwaraphun, J.: Knowledge probing in distributed data mining. In: 3rd
Pacific-Asia Conference on Knowledge Discovery and Data Mining. (1999)
59. Kargupta, H., Hamzaoglu, I., Stafford, B.: Scalable, distributed data mining using
an agent based architecture. In: 3rd Intl. Conf. on Knowledge Discovery and Data
Mining. (1997)
60. Kargupta, H., Park, B.H., Hershberger, D., Johnson, E.: Collective data mining:
A new perspective toward distributed data mining. [67]
61. Parthasarathy, S., Subramonian, R.: Facilitating data mining on a network of
workstations. [67]
62. Grossman, R.L., Bailey, S.M., Sivakumar, H., Turinsky, A.L.: Papyrus: A system
for data mining over local and wide area clusters and super-clusters. In: Super-
computing’99. (1999)
63. Chattratichat, J., Darlington, J., Guo, Y., Hedvall, S., Kohler, M., Syed, J.: An
architecture for distributed enterprise data mining. In: 7th Intl. Conf. High-
Performance Computing and Networking. (1999)
64. Bhatnagar, R., Srinivasan, S.: Pattern discovery in distributed databases. In:
AAAI National Conference on Artificial Intelligence. (1997)
65. Aronis, J., Kolluri, V., Provost, F., Buchanan, B.: The WoRLD: Knowledge discov-
ery from multiple distributed databases. In: Florida Artificial Intelligence Research
Symposium. (1997)
66. Freitas, A., Lavington, S.: Mining very large databases with parallel processing.
Kluwer Academic Pub., Boston, MA (1998)
67. Kargupta, H., Chan, P., eds.: Advances in Distributed Data Mining. AAAI Press,
Menlo Park, CA (2000)
68. Skillicorn, D.: Strategies for parallel data mining. IEEE Concurrency 7 (1999)
26–35
69. Toivonen, H.: Sampling large databases for association rules. In: 22nd VLDB Conf.
(1996)
70. Zaki, M.J., Parthasarathy, S., Li, W., Ogihara, M.: Evaluation of sampling for data
mining of association rules. In: 7th Intl. Wkshp. Research Issues in Data Engg.
(1997)
71. Shafer, J., Agrawal, R., Mehta, M.: SPRINT: A scalable parallel classifier for data
mining. In: Proc. of the 22nd Intl Conference on Very Large Databases, Bombay,
India (1996)
72. Cheung, D., Xiao, Y.: Effect of data distribution in parallel mining of associations.
Data Mining and Knowledge Discovery: An International Journal 3 (1999) 291–314
73. Agrawal, R., Shim, K.: Developing tightly-coupled data mining applications on
a relational database system. In: 2nd Intl. Conf. on Knowledge Discovery in
Databases and Data Mining. (1996)
74. Meo, R., Psaila, G., Ceri, S.: A new SQL-like operator for mining association rules.
In: 22nd Intl. Conf. Very Large Databases. (1996)
75. Meo, R., Psaila, G., Ceri, S.: A tightly-coupled architecture for data mining. In:
Intl. Conf. on Data Engineering. (1998)
76. Sarawagi, S., Thomas, S., Agrawal, R.: Integrating association rule mining with
databases: alternatives and implications. In: ACM SIGMOD Intl. Conf. Manage-
ment of Data. (1998)
77. Holsheimer, M., Kersten, M.L., Siebes, A.: Data surveyor: Searching the nuggets
in parallel. [86]
78. Lavington, S., Dewhurst, N., Wilkins, E., Freitas, A.: Interfacing knowledge discov-
ery algorithms to large databases management systems. Information and Software
Technology 41 (1999) 605–617
79. Kamber, M., Han, J., Chiang, J.Y.: Metarule-guided mining of multi-dimensional
association rules using data cubes. In: 3rd Intl. Conf. on Knowledge Discovery and
Data Mining. (1997)
80. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., Verkamo, A.I.: Finding
interesting rules from large sets of discovered association rules. In: 3rd Intl. Conf.
Information and Knowledge Management. (1994) 401–407
81. Shen, W.M., Ong, K.L., Mitbander, B., Zaniolo, C.: Metaqueries for data mining.
[86]
82. Ng, R.T., Lakshmanan, L., Han, J., Pang, A.: Exploratory mining and prun-
ing optimizations of constrained association rules. In: ACM SIGMOD Intl. Conf.
Management of Data. (1998)
83. Srikant, R., Vu, Q., Agrawal, R.: Mining Association Rules with Item Constraints.
In: 3rd Intl. Conf. on Knowledge Discovery and Data Mining. (1997)
84. Matheus, C., Piatetsky-Shapiro, G., McNeill, D.: Selecting and reporting what
is interesting. In Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy,
R., eds.: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press
(1996)
85. Toivonen, H., Klemettinen, M., Ronkainen, P., Hätönen, K., Mannila, H.: Prun-
ing and grouping discovered association rules. In: MLnet Wkshp. on Statistics,
Machine Learning, and Discovery in Databases. (1995)
86. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., eds.: Advances in
Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA (1996)
The Integrated Delivery of
Large-Scale Data Mining:
The ACSys Data Mining Project

Graham Williams¹, Irfan Altas², Sergey Bakin³, Peter Christen⁴,
Markus Hegland⁴, Alonso Marquez⁵, Peter Milne¹,
Rajehndra Nagappan⁵, and Stephen Roberts⁴

¹ Cooperative Research Centre for Advanced Computational Systems,
  CSIRO Mathematical and Information Sciences,
  GPO Box 664, Canberra, ACT 2601, Australia
  [email protected], http://www.cmis.csiro.au/ALCD
² School of Information Studies, Charles Sturt University,
  Wagga Wagga, NSW 2678, Australia
  [email protected]
³ Department of Mathematics, The University of Queensland,
  Brisbane, Qld 4072, Australia
  [email protected]
⁴ Computer Sciences Laboratory, Australian National University,
  Canberra, ACT 0200, Australia
  [email protected]
⁵ Department of Computer Science, Australian National University,
  Canberra, ACT 0200, Australia
  [email protected]

Abstract. Data Mining draws on many technologies to deliver novel
and actionable discoveries from very large collections of data. The Aus-
tralian Government’s Cooperative Research Centre for Advanced Com-
putational Systems (ACSys) is a link between industry and research fo-
cusing on the deployment of high performance computers for data min-
ing. We present an overview of the work of the ACSys Data Mining
projects where the use of large-scale, high performance computers plays
a key role. We highlight the use of large-scale computing within three
complementary areas: the development of parallel algorithms for data
analysis, the deployment of virtual environments for data mining, and
issues in data management for data mining. We also introduce the Data
Miner’s Arcade which provides simple abstractions to integrate these
components providing high performance data access for a variety of data
mining tools communicating through XML.

1 Introduction
High performance computers and parallel algorithms provide the necessary plat-
form for the delivery of novel and actionable discoveries from extremely large
collections of data. The Australian Government’s Cooperative Research Centre
for Advanced Computational Systems (ACSys) investigates industrial problems
to direct research on the deployment of high performance computers for data
mining. The multidisciplinary ACSys team draws together researchers in Statis-
tics, Machine Learning, and Numerical Algorithms from The Australian National
University and the Australian Government’s research organisation CSIRO Aus-
tralia. Commercial projects are drawn from the banking, insurance, and health
sectors.
There are many components that contribute to the successful deployment of
data mining solutions. Parallel algorithms exploit the processing capabilities of
multi-processor environments to deliver models in a timely fashion. Visualisa-
tion and Virtual Environments provide useful insights into relationships in the
data. And underlying all of these activities is the data itself, and in particular,
the mechanisms for accessing the data. Finally, we need to provide a standard,
integrated environment that can be easily tuned for particular applications, and
that can facilitate the communication of data mining outcomes. In this paper
we describe these components as have and are being developed collaboratively
by ANU and CSIRO researchers through ACSys in partnership with Australian
Industry.
We begin with a review of two algorithms developed for data mining: TPS-
FEM and BMARS. Predictive model building is a core component of data
mining—whether it is modelling response to marketing campaigns, modelling
patterns of health care, or modelling fraudulent behaviours. Gigabytes of data
collected over decades are available. And yet, it is often groups that occur infre-
quently that are important to our business (whether it is identifying the 5% who
will respond to a mail campaign, or the less than 1% who will commit insurance
fraud). Sampling is generally not an appropriate action, but instead we wish to
analyse all of the data.
Given the large amount of data as well as the large number of attributes
involved in data mining problems, two core challenges need to be faced. The
first concerns the computational feasibility of the techniques used to build the
predictive models used in data mining. This translates into the requirement that
data mining techniques scale to large data sets. The second challenge is the
interpretability of the resulting models. Specifically, one often has not only to
be able to build a predictive model but also to obtain insight from the structure
exhibited by the model. Distributing and sharing models, and combining models
built from different runs over possibly different data, can benefit from addressing
the interpretability question.
Exploring very large datasets with high dimensionality requires considerable
support to provide the Data Miner with insights that aid in their understanding
of the data. Virtual environments (VEs) for data mining are being explored
towards a number of ends. The high dimensionality of the data often presented
to the Data Miner leads to considerable complexity in coming to understand
the interplay of the many features. Exploring this interplay more effectively can
assist in the identification and selection of important features to be used for later
predictive modelling and other data mining tasks. Also, as model builders are
applied to ever larger datasets, the complexity of the resulting models increases
correspondingly. Virtual environments can also effectively provide insights into
the modelling process, and the resulting models themselves.
All aspects of data mining revolve around the data. Data is stored in a variety
of formats and within a variety of database systems. Data needs to be accessed
in a timely manner and potentially multiple times. Managing, transforming,
and efficiently accessing the data is a crucial issue. The Semantic Extension
Framework provides an environment for seamlessly extending the semantics of
Java objects, allowing those objects to be instantiated in different ways and from
different sources. We are beginning to explore the benefits of such a framework for
ongoing data mining activities. The potential of this approach lies in all stages
of the data mining process [1], from data management and data versioning,
through to access mechanisms highly tuned to suit the behaviour of access of
the particular predictive modelling tool being employed.
Finally, we need to bring these tools together to deliver highly configurable,
and often pre-packaged or ‘canned’ solutions for particular applications. The
Data Miner’s Arcade provides simple abstractions to integrate these components
providing high performance data access for a variety of data mining tools com-
municating through standard interfaces, and building on the developing XML
standards for data mining [2].

2 Parallel Algorithms
Careful, detailed examination of each and every customer, patient, or claimant
that exists in a very large dataset made available for data mining might well
lead to a better understanding of the data and of underlying processes. Given
the sheer size of data we are talking about in data mining, this is, of course not
generally feasible, and probably not desirable. Yet, with the desire to analyse
all the data, rather than statistical samples of the data, a data mining exercise
is often required to apply computationally complex analysis tools to extremely
large datasets.
Often, we characterise the task as being one of building an indicator func-
tion as a predictor of fraud, of propensity to purchase, or of improved health
outcomes. We can view the function as

y = f (x)

where y is the real valued response, indicating the likelihood of the outcome,
and x is the array of predictor variables (attributes or features) which encode
the information thought to be relevant to the outcome. The function f can be
trained on the collected data by, for example, (logistic) regression. We have been
developing new computational techniques to identify such predictive models from
large data sets.
Applications for such model building abound. Another example is in insur-
ance where a significant problem is to determine optimal premium levels. When
a new insurance policy is being underwritten, it is important for an insurance
company to estimate the risk (based on the information provided by the policy
holder) or the likelihood of a claim being made against the policy. With this
knowledge the insurance companies would be able to set the ‘correct’ premium
levels and avoid undercharging as well as overcharging their customers (although
competitive factors must also come into play). To estimate the risk one has to
produce two models: one to predict if a policy holder is likely to make a claim;
and one to predict the amount of the claim.
Algorithms commonly used in such data mining projects include generalised
additive models [3], thin plate splines [4], decision tree and rule induction [5],
multivariate adaptive regression splines [6], patient rule induction methods [7],
evolutionary rule induction [8] and the combination of simple rules [9]. For data
mining, the issue of scalability must be addressed. We illustrate this with two
developments in parallel algorithms: thin plate spline finite element methods;
and Multivariate Adaptive Regression Splines using B-splines.

3 Predictive Modelling with Thin Plate Splines

A first computational challenge faced in generating a predictive model originates
from the large number of attributes or predictor variables. This challenge is often
referred to as the curse of dimensionality [10]. An effective way to deal with this
curse is provided by additive models of the form [11]


$$ f(x) = f_0 + \sum_{i=1}^{d} f_i(x_i). $$

Similar models are used in ANOVA, where all the variables xi are categorical.
The effects of the predictor variables are added up. Thus, the effect of the value
of a variable xi is independent of the effect of a different variable xj . We have
suggested and discussed a new scalable and parallel algorithm for the determi-
nation of a (generalised) additive model in [3].
A better model includes interactions between the variables. For example, it
could be the case that for different incomes the effect of the level of deductions
from taxable income on the likelihood of fraud varies. Interaction models are of
the form:

$$ f(x) = f_0 + \sum_{i=1}^{d} f_i(x_i) + \sum_{i,j=1}^{d} f_{i,j}(x_i, x_j). $$
i=1 i,j=1

This model is made identifiable by additional constraints, and the components
f_i and f_{i,j} are determined by the backfitting algorithm [11], which consists of
repeated estimation of the components. Thus only methods for the estimation
of one- and two-dimensional models are required.
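A minimal backfitting sketch for an additive model follows; it substitutes a crude binned-mean smoother for the one-dimensional cubic smoothing splines used here, so it only illustrates the cycling structure of the algorithm, and all names are our own.

import numpy as np

def bin_smoother(x, r, n_bins=20):
    """Crude univariate smoother: average the partial residual r within bins of x."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return means[idx]

def backfit(X, y, n_iter=20):
    n, d = X.shape
    f0 = y.mean()
    f = np.zeros((d, n))                   # current estimate of each component f_i(x_i)
    for _ in range(n_iter):
        for i in range(d):
            # Partial residual: remove the intercept and all other components.
            r = y - f0 - (f.sum(axis=0) - f[i])
            f[i] = bin_smoother(X[:, i], r)
            f[i] -= f[i].mean()            # identifiability constraint
    return f0, f

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(2000, 3))
y = 1.0 + np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=2000)
f0, f = backfit(X, y)
print("fitted intercept:", round(float(f0), 3))
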
The form of the models depends on the type of predictor variables. In the
following we will only discuss the case of real predictor variables. In order not
to exclude important functions we choose a nonparametric approach and find
predictors f which are smooth and fit the data. Thin plate splines [12] are an
established smooth model. They are designed to have small curvature. The one-
dimensional components f_i(x_i) turn out to be cubic splines which are computa-
tionally very tractable using a B-spline basis. The form of the interaction terms
is also known:


$$ f_{x_i,x_j}(x_i, x_j) = c_0 + c_1 x_i + c_2 x_j + \sum_{k=1}^{n} b_k\, \phi\!\left( (x_i - x_i^{(k)})^2 + (x_j - x_j^{(k)})^2 \right) $$

where φ(r^2) = r^2 log(r^2) [12]. The coefficients of the thin plate splines are de-
termined by the linear system of the form

$$ \begin{pmatrix} \Phi + \alpha I & X \\ X^T & 0 \end{pmatrix} \begin{pmatrix} b \\ c \end{pmatrix} = \begin{pmatrix} y \\ 0 \end{pmatrix} $$

where Φ is an n by n matrix with matrix elements Φ_{i,j} = φ(||x^{(j)} − x^{(i)}||^2), I
is the identity, X is an n by 3 matrix whose i-th row is (1, x_1^{(i)}, x_2^{(i)}), b is
the vector with k-th component b_k, and c = (c_0, c_1, c_2)^T. Computationally, these
equations are intractable for large data sizes n by standard direct or iterative
methods, as even the formation of the matrix Φ requires O(n^2) operations since it
is dense. The standard techniques thus give examples of algorithms which are not
scalable with respect to the data size. Only a few years ago it was thought that
the feasibility of thin plate splines (and similar radial-basis function approaches)
was limited to the case of a few hundred to thousand observations. However,
new techniques have been developed since then to push these limits. One school
of thought uses the locality of the problem, i.e., the fact that the value f (x) only
depends on observations x(k) which are near x [13,14]. The algorithms developed
are mainly for interpolation, i.e., the case α = 0.
We have developed a different approach which is provably scalable and may
be extended to higher order interactions. We use the fact that the thin plate
spline interpolant minimises the functional


$$ J_1(f) = \sum_{k=1}^{n} \left( f(x^{(k)}) - y^{(k)} \right)^2 + \alpha \int \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2 \right] dx_1\, dx_2 . \tag{1} $$

The minimiser of this functional can be approximated in a finite-element space.


For the solution of this problem we suggest a non-conforming method based on
piecewise bilinear functions such that on the rectangular elements the function is
of the form a+bx1 +cx2 +dx1 x2 . The method finds an approximation u = (u1 , u2 )
of the gradient of f as a piecewise bilinear function. Instead of J1 , the following
function is minimised (obtained by inserting the gradient in J1 ):



$$ J_2(f) = \sum_{k=1}^{n} \left( f(x^{(k)}) - y^{(k)} \right)^2 + \alpha \int \left[ \left( \frac{\partial u_1}{\partial x_1} \right)^2 + \left( \frac{\partial u_1}{\partial x_2} \right)^2 + \left( \frac{\partial u_2}{\partial x_1} \right)^2 + \left( \frac{\partial u_2}{\partial x_2} \right)^2 \right] dx_1\, dx_2 . \tag{2} $$

It can be seen that if one chooses u with curl u = 0 such that

$$ \Delta f(x) = \operatorname{div}\, u(x), \quad x \in G $$

and

$$ \frac{\partial f}{\partial n}(x) = u_n(x), \quad x \in \partial G $$
then the same solution as above is obtained. However, practical tests show that
the curl condition is not important in achieving a good approximation [4].
The finite element solution of the optimisation problem proceeds in two
stages:
1. The matrix and right-hand side of the linear system of equations is assem-
bled. The matrix of this linear system is the sum of low rank matrices, one
for each data point x(i) .
2. The linear system of equations is solved.
The time for the first (assembly) stage depends linearly on the data size n and
the time for the second (solution) stage is independent of n. Thus the overall
algorithm scales with the number of data points. The data points only need
to be visited once, thus there is no need to either store the entire data set
in memory nor revisit the data points several times. The basis functions are
piecewise bilinear and require a small number of operations for their evaluation.
With this technique the smoothing of millions of data points becomes feasible.
The parallel algorithm exploits different aspects of the problem for the as-
sembly and the solution stage. The time required for the assembly stage grows
linearly as a function of data size. For simplicity we assume that the data is ini-
tially equally distributed between the local disks of the processors. (If this is not
the case initial distribution costs would have to be included in the analysis.) In
a first step of the assembly stage a local matrix is assembled for each processor
based on the data available on its local disk. The matrix of the full problem is
then the sum of the local matrices and can thus be obtained through a reduction
step. This algorithm was developed and tested on a cluster of 10 Sun Sparc-5
workstations networked with a 10 Mbit/s twisted pair Ethernet using MPI [15].
The total time spent in this assembly phase is of the order

$$ T_p = O(n/p) + O(m \log_2(p)) $$

where m characterises the size of the assembled matrix. Thus, if the number n
of the data points grows like O(p log2 (p)) for fixed matrix size m the parallel
efficiency is
$$ E_p = \frac{T_1}{p\,T_p} = O\!\left( \frac{n}{n + m\,p \log_2(p)} \right) = O(1) $$
and thus there is no drop in parallel efficiency for larger numbers of processors.
This basic trend is confirmed by practical experiments [15].
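The assemble-then-reduce structure can be mimicked on a single machine as below; the four-function "basis" merely stands in for the piecewise bilinear finite element basis, and the chunk loop stands in for per-processor passes over local disks, so none of this reflects the actual TPSFEM matrices.

import numpy as np

def basis(x):
    """Stand-in for the piecewise bilinear basis functions evaluated at a point x."""
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

def assemble_local(points, targets):
    """One processor's pass over its local data: a low-rank update per data point."""
    m = len(basis(points[0]))
    A, b = np.zeros((m, m)), np.zeros(m)
    for x, y in zip(points, targets):
        phi = basis(x)
        A += np.outer(phi, phi)            # rank-one contribution of this point
        b += y * phi
    return A, b

rng = np.random.default_rng(2)
X = rng.uniform(size=(10_000, 2))
y = 1 + 2 * X[:, 0] - X[:, 1] + 0.05 * rng.normal(size=len(X))

# Each chunk plays the role of the data held on one processor's local disk.
chunks = np.array_split(np.arange(len(X)), 4)
local_parts = [assemble_local(X[idx], y[idx]) for idx in chunks]

# Reduction step: the matrix of the full problem is the sum of the local matrices
# (in the real setting this is an MPI reduction across the processors).
A = sum(A_loc for A_loc, _ in local_parts)
b = sum(b_loc for _, b_loc in local_parts)
print(np.linalg.solve(A + 1e-8 * np.eye(len(b)), b).round(3))
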
In the solution stage the spatial parallelism of the problem is exploited. As-
sume for simplicity that the domain is rectangular. If the domain was split into
strips of equal size the values on the boundaries between the strips depends on
the data in the neighbouring strips. However, as this dependency is local, only
a fixed number of points in the neighbouring strip really have an influence on
the function values f (x) in the strip. A good approximation is obtained for the
values on the strip by solving the smoothing problem for an expanded region
containing the original strip and a sufficient number of neighbouring points. Note
that by introducing redundant computations in this way, communication can be
avoided. The size of the original strip is proportional to m/p and, in order to
add the extra k neighbouring points, it has to be expanded by a factor kp/n.
Thus the size of the expanded strip is of the order of

s = (m/p)(1 + kp/n).

As we assumed n = O(p log_2(p)) to get isoefficiency [16] of the assembly phase,
the size of the strips is proportional to m/p asymptotically in p, which shows
isoefficiency for the solution stage.
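The expanded-strip construction reduces to simple arithmetic; the sketch below computes overlapping strips for a one-dimensional split of the domain, with the halo of k data points chosen arbitrarily.

def overlapping_strips(x_min, x_max, n_points, p, k):
    """Split [x_min, x_max] into p strips, each expanded by roughly the width of k
    data points on either side so neighbouring data is available without communication."""
    width = (x_max - x_min) / p
    halo = (x_max - x_min) * k / n_points      # expansion by a factor of about kp/n
    strips = []
    for i in range(p):
        lo = max(x_min, x_min + i * width - halo)
        hi = min(x_max, x_min + (i + 1) * width + halo)
        strips.append((lo, hi))
    return strips

# Four processors, one million data points, a 50-point halo on each side of a strip.
for lo, hi in overlapping_strips(0.0, 1.0, 1_000_000, 4, 50):
    print(f"[{lo:.6f}, {hi:.6f}]")
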
This approach thus ensures a fast and efficient path to the development of
predictive models.

4 Predictive Modelling with Multivariate Regression Splines

The popular Multivariate Adaptive Regression Splines (MARS) algorithm by
Friedman [6] is able to produce continuous as well as easily interpretable regres-
sion models. The regression models are the special class of predictive models
intended to model numeric response variables as opposed to the generalised re-
gression models used in situations where the response is discrete. Here we give an
overview of the original MARS algorithm followed by a discussion of its parallel
version based on B-splines (BMARS).
MARS constructs a linear combination of basis functions which are products
of one-dimensional basis functions (indicator functions in the case of categorical
variables and truncated power functions in the case of numeric variables). The
key to the method is that the basis functions are generated recursively and
depend on the data. The important implication of the approach is that models
produced by MARS involve only variables and their interactions relevant to the
problem at hand. This property is especially useful in the data mining context.
BMARS [17] improves upon MARS by: using compactly supported B-spline
basis functions; utilising a new scale-by-scale model building strategy; and in-
troducing a parallel implementation. These modifications allow the stable and
fast (compared to MARS) analysis of very large datasets.

4.1 Multivariate Adaptive Regression Splines


For the sake of simplicity, we confine ourselves to the case of purely numeric data
though it should be remembered that the (appropriately modified) algorithm is
able to deal with data of mixed type. The required modification will be discussed
briefly, below.
In a nutshell, the original MARS is an efficient technique designed to select
a (relatively high quality) model from the space of multivariate piecewise linear
functions, that is, functions which are piecewise linear with respect to any of their
numeric variables. Any such function can be represented as a linear combination
of the tensor product basis functions T_{k_1...k_d}(x):


$$ f(x) = \sum_{k_1=0}^{K_1} \cdots \sum_{k_d=0}^{K_d} a_{k_1 \ldots k_d}\, T_{k_1 \ldots k_d}(x), \qquad T_{k_1 \ldots k_d}(x) = \prod_{j=1}^{d} b_{k_j,j}(x_j) \tag{3} $$

where b_{0,j}(x_j) = 1 and b_{k_j,j}(x_j), k_j = 1, ..., K_j, are univariate piecewise
linear basis functions of the variable x_j, j = 1, ..., d. The original MARS is based
on the univariate truncated power basis functions:

$$ b_{k_j,j}(x_j) = [x_j - t_{k_j}]_+ , \qquad k_j = 1, \ldots, K_j, $$


where t_{k_j}, k_j = 1, ..., K_j, are certain prespecified knot locations on the variable
x_j, taken to be, for example, quantiles of the corresponding marginal distribution
of the data points. The coefficients a_{k_1...k_d} can be determined based on the
least squares fit of the general model (3) to the data at hand.
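For illustration, the sketch below builds the full tensor product basis and fits model (3) by least squares; it is feasible only because d and the K_j are tiny here, which is exactly the limitation discussed next, and the knot choices and helper names are our own.

import numpy as np
from itertools import product

def truncated_powers(x, knots):
    """Columns [1, (x - t_1)_+, ..., (x - t_K)_+] for one numeric variable."""
    return np.column_stack([np.ones_like(x)] + [np.maximum(x - t, 0.0) for t in knots])

def tensor_basis(X, knot_lists):
    """All products of one univariate basis function per variable, as in model (3)."""
    uni = [truncated_powers(X[:, j], knots) for j, knots in enumerate(knot_lists)]
    sizes = [u.shape[1] for u in uni]
    cols = [np.prod([uni[j][:, k[j]] for j in range(len(uni))], axis=0)
            for k in product(*[range(s) for s in sizes])]
    return np.column_stack(cols)

rng = np.random.default_rng(3)
X = rng.uniform(size=(1000, 2))
y = np.maximum(X[:, 0] - 0.3, 0.0) * X[:, 1] + 0.05 * rng.normal(size=1000)

# Knots at marginal quantiles, as suggested in the text.
knots = [np.quantile(X[:, j], [0.25, 0.5, 0.75]) for j in range(2)]
B = tensor_basis(X, knots)                  # (K_1 + 1)(K_2 + 1) = 16 columns
coeffs, *_ = np.linalg.lstsq(B, y, rcond=None)
print(B.shape, coeffs.round(2))
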

As can be seen, there are (K_1 + 1)(K_2 + 1) · · · (K_d + 1) basis functions in the expansion
(3). Therefore, the application of this approach would be feasible only in the
situation where one has to deal with a moderate number of variables as well as
knot locations. Also, it appears difficult to make any conclusion concerning the
structure of the regression function (3): all variables as well as a large number
of basis functions would generally be involved. These observations lead to the
conclusion that the approach is less appropriate in the data mining context.
The MARS algorithm aims to overcome the above problems. It traverses
the space of piecewise linear multivariate functions in a stepwise manner and
eventually arrives at a function which, on one hand, has much simpler structure
compared to the general function (3) and, on the other hand, is an adequate
model for the data. The models produced by MARS have the following structure


$$ f(x) = \sum_{m=0}^{J} a_m T_m(x) $$

where the basis functions T_m(x), m = 0, ..., J, have the form

$$ T_m(x) = \prod_{j=1}^{d_m} [x_{v(j,m)} - t_{jm}]_+ . $$

As can be seen, this model is similar to the general model (3) in that both belong
to the same function space. However, the distinct feature of MARS models is
that they are normally based on only a very small subset of the complete set of
tensor product basis functions. The pseudo-code of the procedure which builds
the subset of functions is shown below.

Algorithm 1 MARS algorithm

model ← {T_0(x) = 1}
for m = 1 to J_max do
    T_m(x) ← 0
    for s = 0 to m − 1 do
        for j = 1 to d do
            if x_j involved in T_s(x) then
                continue
            else
                for k_j = 1 to K_j do
                    Form T_m^c(x) = T_s(x) b_{k_j,j}(x_j)
                    if T_m^c(x) better than T_m(x) then
                        T_m(x) ← T_m^c(x)
                    end if
                end for
            end if
        end for
    end for
    model ← model ∪ {T_m(x)}
end for

The algorithm starts with the model containing only the constant function.
All subsequent functions are produced one at a time. At each step the algorithm
enumerates all possible candidate basis functions T_m^c(x) and selects the
one whose inclusion in the model results in the largest improvement of the least
squares fit of the model to the data. The three nested internal loops (correspond-
ing to the s, j, kj loop variables) implement this selection process. The selected
basis function is added to the model.
The set of candidate basis functions is seen to be comprised of all basis
functions which can be derived from the ones contained in the model via multi-
plication by a univariate basis function. Due to the utilisation of this definition
of the set of candidates, the MARS algorithm allows for a considerable reduction
in the computational cost compared with another popular technique (forward
subset selection procedure [18]). The number of basis functions Jmax produced

by MARS has to be specified by a user. It turns out that the quality of the
model can even further be improved via removal of the less optimal tensor prod-
uct basis functions from the model. This can be accomplished by means of the
backward elimination procedure (see [6] for details).
As mentioned, this approach can be modified to data of mixed types. Uni-
variate indicator functions I[x ∈ A] can be used instead of the truncated powers
whenever a categorical variable x is encountered in the Algorithm (1). Thus, the
typical tensor product basis function would have the form:

dnum
m dcat
m

Tm (x) = [xv(j,m) − tjm ]+ I[xv(j,m) ∈ Ajm ].


j=1 j=1

The algorithm for finding the appropriate subsets Ajm is very similar to the
ordinary forward stepwise regression procedure [18]. The detailed discussion of
the algorithm is given in [19].

4.2 Refinement of MARS via B-splines

MARS is thus based on truncated power basis functions which are used to form
tensor product basis functions. However, truncated powers are known to have
poor numerical properties. In our work we sought to develop a MARS-like algo-
rithm based on B-splines which form a basis with better numerical properties.
In our algorithm, called BMARS, we use B-splines of the second order (piecewise
linear B-splines) to form tensor product basis functions B_{k_1,1}(x_1) · · · B_{k_d,d}(x_d).
Thus, the models produced by MARS and BMARS belong to the space of piece-
wise linear multivariate functions. In common with MARS, BMARS traverses
the space of piecewise linear multivariate functions until it arrives at the model
which provides an adequate fit. However, the way in which the traversal occurs
is somewhat different. Apart from being a more stable basis, B-splines possess a
compact support property which allows us to build models in the scale-by-scale
way. The pseudo-code (Algorithm 2) illustrates the strategy.
To implement the scale-by-scale strategy, one needs B-splines of different
scales. The scale is the size of the support interval of a B-spline. Given a set
K of K = 2^{l_0} + 1 knots on a variable x one can construct B-splines of l_0 + 1
different scales based on l_0 + 1 nested subsets K_l of K^l = (K − 1)/2^{l−1} + 1
knots, l = 1, ..., l_0 + 1, respectively. The l-th subset is obtained from the full
set by retaining every 2^{l−1}-th knot and disposing of the rest. Thus, the B-splines
constructed using the l-th subset of knots have on average twice as long support
intervals as the B-splines constructed using the (l − 1)-st subset.
At the start of the algorithm, the scale parameter l is set to the largest pos-
sible value l0 . Subsequently, B-splines of the largest scale only are used to form
new tensor product basis functions. Upon the formation of each new tensor prod-
uct basis function, the algorithm checks if the improvement of the fit due to the
inclusion of the new basis function is appreciable. We use the Generalised Cross-
Validation score [6] to decide if the inclusion of a new basis function improves
34 Graham Williams et al.

the fit. If this is not the case, the algorithm switches over to using B-splines of
the second largest scale.
Thus, new tensor product basis functions continue to be generated using B-
splines of the second largest scale. Again, as soon as the algorithm detects that
the inclusion of new basis functions fails to improve the fit, it switches over to
using B-splines of the third largest scale. This procedure is repeated until the
Jmax number of tensor product basis functions is produced.

Algorithm 2 BMARS algorithm

model ← {T_0(x) = 1}
l ← l_0 {set current scale to largest scale}
for m = 1 to J_max do
    T_m(x) ← 0
    for s = 0 to m − 1 do
        for j = 1 to d do
            if x_j involved in T_s(x) then
                continue
            else
                for k_j = 1 to K_j^l do
                    Form T_m^c(x) = T_s(x) B_{k_j,j}^l(x_j)
                    if T_m^c(x) better than T_m(x) then
                        T_m(x) ← T_m^c(x)
                    end if
                end for
            end if
        end for
    end for
    model ← model ∪ {T_m(x)}
    if no significant improvement of fit then
        l ← l − 1 {decrease current scale}
    end if
end for

The advantage of this strategy over that of MARS is that it results in a consid-
erable reduction of the number of candidate basis functions to be tested at each
step of the algorithm. This is due to the fact that the number K_j^l of B-splines
of a particular scale l is less than the total number of knots K_j: K_j / K_j^l = 2^{l−1}.
This ratio is seen to be greater than one for all scales but the smallest (l = 1).
This results in fewer iterations being carried out by the inner-most
loop of Algorithm 2 compared to the similar loop of Algorithm 1. The results of
experiments suggest that this reduction in the computational complexity comes
at no cost in terms of the quality of the resulting models [20].
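The nested knot subsets and the corresponding piecewise linear (hat) B-splines are easy to generate; the sketch below uses equally spaced knots and ignores boundary splines, so it only illustrates the scale structure assumed by BMARS.

import numpy as np

def knot_subsets(l0):
    """Nested subsets of a full set of K = 2**l0 + 1 equally spaced knots: the l-th
    subset keeps every 2**(l - 1)-th knot, for l = 1, ..., l0 + 1."""
    full = np.linspace(0.0, 1.0, 2 ** l0 + 1)
    return [full[:: 2 ** (l - 1)] for l in range(1, l0 + 2)]

def hat(x, left, centre, right):
    """Second-order (piecewise linear) B-spline with compact support (left, right)."""
    up = (x - left) / (centre - left)
    down = (right - x) / (right - centre)
    return np.clip(np.minimum(up, down), 0.0, None)

subsets = knot_subsets(l0=3)
for l, knots in enumerate(subsets, start=1):
    print(f"scale {l}: {len(knots)} knots, spacing {knots[1] - knots[0]:.3f}")

# Evaluate one coarse-scale hat function (knots 0, 0.5, 1) on a small grid.
x = np.linspace(0.0, 1.0, 9)
print(hat(x, *subsets[2]).round(2))
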

4.3 Parallel Implementation of BMARS

It can be shown that the computational complexity of both MARS and BMARS
algorithms is linear in the number of data points as well as the number of at-
tributes. However, when large amounts of data are to be processed, the com-
putational time still can be prohibitively large. In order to reduce the cost of
running BMARS we have developed a parallel version of the algorithm based on
the Parallel Virtual Machine (PVM) system [21]. An advantage of PVM is its
wide availability on a number of platforms, so that software based on it is very
portable.
The idea of the parallel BMARS is very simple. Following from the struc-
ture of the Algorithm (2), each new tensor product basis function is the best
function selected from the pool of candidates. The goodness of each candidate is
determined via least squares fit. It turns out that these least squares fits account
for the bulk of the computational cost of running BMARS. Thus, an efficient
parallelisation of BMARS can be achieved via parallelisation of the least squares
procedure. We use the Gram-Schmidt algorithm [22] to perform the least squares
fit. It amounts to the computation of a number of scalar products and, there-
fore, can be efficiently parallelised using the data-partitioning approach (see, for
example [21]).
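The following sketch illustrates the data-partitioning idea on a single scalar
product: each worker sums over its own block of rows and the partial results are
then reduced. Java threads are used here purely for illustration; the actual
implementation distributes the blocks over PVM processes, and none of the names
below are taken from it.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * A minimal sketch of the data-partitioning approach: each worker computes a
 * partial scalar product over its block of rows and the partial sums are
 * reduced at the end. Threads stand in for the PVM worker processes.
 */
public final class PartitionedDotProduct {

    static double dot(double[] a, double[] b, int numWorkers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        int blockSize = (a.length + numWorkers - 1) / numWorkers;
        List<Future<Double>> partials = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            final int from = w * blockSize;
            final int to = Math.min(a.length, from + blockSize);
            partials.add(pool.submit(() -> {
                double sum = 0.0;
                for (int i = from; i < to; i++) {
                    sum += a[i] * b[i];          // partial sum over this block
                }
                return sum;
            }));
        }
        double total = 0.0;
        for (Future<Double> p : partials) {
            total += p.get();                    // reduction step on the "master"
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] y = {8, 7, 6, 5, 4, 3, 2, 1};
        System.out.println(dot(x, y, 4));        // 120.0
    }
}
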
Parallel BMARS was tested on a multiprocessor system with ten SPARC
processors. It was applied to the analysis of a large motor vehicle insurance data
set (∼1,000,000 data records) [20] as well as to taxation data [17]. The results of
the experiments show that the efficiency of the algorithm is close to that of an
ideal algorithm [20].
Once again, by focusing on issues relating to the performance of the algo-
rithms on extremely large datasets from real world applications, significant im-
provements can be made in the “responsiveness” of the algorithms. The result is
that these tools can be employed much more effectively in data mining.

5 Virtual Environments for Data Mining

All stages of a data mining project require considerable understanding of multi-
dimensional data. Visualisation tools, both for exploratory data analysis and for
exploring the models produced by the data analysis algorithms, can play a sig-
nificant role, particularly in the context of complex models generated through
data mining [23,8]. Traditional approaches tend to be limited by the mouse-
keyboard-monitor interface. Virtual environments (VEs) dramatically increase
the “canvas” on which to render graphic representations of the data that scale
to large numbers of dimensions through an interactive, immersive, environment.
An approach being explored for this task is a technique for partitioning a 3D
VE into smaller working regions, each of which is capable of holding a subspace
of the original multidimensional data space [24]. The algorithm distributes a set
of partitioning axes in a radial arrangement from a single common origin, with
one axis for each dimension in the data set. The ends of the axes thus lie on
the surface of a sphere. A convex hull is generated to connect the ends of the
axes together. The axes and the space that they form can be used for a number
of visualisation strategies, including rectangular prisms and the use of density
functions.

5.1 Multidimensional Data Sets


Representing data of high dimensionality in a form that humans can both see and
understand is a considerable challenge. Understanding a large, multidimensional
data set is not a trivial task. A number of methods have been developed to
try to visualise multidimensional data, including parallel coordinates [25], the
hyperbox [26], pixel colouration techniques [27], worlds within worlds [28], virtual
towns [29], the grand tour [30] and Chernoff faces [31].
However, to deal with the complexity and size of contemporary data sets
we are investigating new approaches to the problem using Virtual Environment
(VE) technology. The Multidimensional Data Orb (mdOrb) [32] has a number of
properties that differentiate it from those above. Firstly, it exploits the geometric
and perceptual properties of a VE to enable the presentation of more complex
data. Secondly, it is a framework on which a family of distinct visualisation
strategies can be carried out, rather than being a single fixed implementation.
Finally, it is a highly interactive framework in which the user actively explores
the data.
The mdOrb is a technique for partitioning a 3D VE into smaller working
regions, each of which is capable of holding a subspace of the original multi-
dimensional data space (see Figure 1). The algorithm first distributes a set of
partitioning axes in a radial arrangement from a single common origin, with one
axis for each dimension in the data set. The ends of the axes thus lie on the
surface of a sphere. A convex hull is generated to connect the ends of the axes
together using a Delaunay triangulation. The shape thus formed is a convex
polygonal mesh with every vertex of the mesh being linked to the centre of the
figure by an axis. Hence for a data set with N dimensions the mesh will consist of
N vertices. Each triangle in the surface mesh has three vertices, and each vertex
has its own axis that links it to the centre of the figure. The triad of axes forms
the corner of a skewed rectangular prism - the axes and the space that they form
can be used for a number of visualisation strategies.
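A sketch of the axis-placement step is given below: one unit direction per data
dimension, with the end-points lying on a unit sphere. The spherical-spiral spacing
is an illustrative choice only (the distribution is not prescribed here), and the
convex hull and Delaunay triangulation steps are omitted.

/**
 * A minimal sketch of placing one partitioning axis per data dimension so
 * that the axis end-points lie on a unit sphere. The spiral placement is an
 * illustrative assumption; hull construction is left out.
 */
public final class OrbAxes {

    /** Returns an N x 3 array of unit vectors, one axis direction per dimension. */
    static double[][] axisDirections(int dimensions) {
        double[][] axes = new double[dimensions][3];
        double golden = Math.PI * (3.0 - Math.sqrt(5.0));   // golden angle
        for (int i = 0; i < dimensions; i++) {
            double z = 1.0 - 2.0 * (i + 0.5) / dimensions;  // spread from pole to pole
            double r = Math.sqrt(1.0 - z * z);
            double theta = golden * i;
            axes[i][0] = r * Math.cos(theta);
            axes[i][1] = r * Math.sin(theta);
            axes[i][2] = z;
        }
        return axes;
    }

    public static void main(String[] args) {
        for (double[] a : axisDirections(6)) {
            System.out.printf("%.3f %.3f %.3f%n", a[0], a[1], a[2]);
        }
    }
}
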
The first strategy is to use each rectangular prism formed by a triad of axes
as a skewed Cartesian three-space for a scatter plot of points (see Figure 2).
The points in each three-space are given by the values of each point from the N
dimensional data space in the dimensions specified by the bounding axes. Hence
a single data point is represented by a mark in every three-space, where each
mark is composed of the vector sum of three vectors. Each vector’s direction is
that of one of the axes that define the three-space. Each vector’s magnitude is
the value of that point in the dimension in the data space that corresponds to
the given axis.
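In code, placing a mark for one data point inside one triad might look like the
following sketch: the mark is the vector sum of the triad's three axis directions,
each scaled by the point's value in the corresponding dimension. The method and
parameter names are hypothetical, and the values are assumed to be already
normalised to the axis lengths.

/**
 * A minimal sketch of the scatter-plot strategy: the mark for a data point
 * in a triad's skewed three-space is the sum of the three axis directions,
 * each scaled by the point's value in that dimension.
 */
public final class TriadProjection {

    /**
     * @param axisDirections 3 x 3 array: the unit direction of each axis in the triad
     * @param values         the data point's values in the three dimensions of the triad
     */
    static double[] markPosition(double[][] axisDirections, double[] values) {
        double[] mark = new double[3];
        for (int axis = 0; axis < 3; axis++) {
            for (int c = 0; c < 3; c++) {
                mark[c] += values[axis] * axisDirections[axis][c];
            }
        }
        return mark;
    }

    public static void main(String[] args) {
        double[][] triad = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};   // an unskewed triad
        double[] point = {0.2, 0.7, 0.5};                        // values in the triad's dimensions
        double[] mark = markPosition(triad, point);
        System.out.printf("%.2f %.2f %.2f%n", mark[0], mark[1], mark[2]);
    }
}
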
Fig. 1. Composition of projection spaces in the Orb.

Fig. 2. Orb visualisation of multidimensional data.

The second method calculates polygons for each data point in the N dimensional
data space. For a given data point, the corresponding polygons’ vertices lie
on each of the orb’s axes at lengths determined by the entry’s value in the corre-
sponding dimension. Each entry is thus represented by a tessellation of polygons
similar to the triangles that define the three-spaces of the orb. The entry’s tes-
sellation of polygons is identical to the triangles forming the orb surface mesh as
shown in Figure 1, but the length of each vertex from the origin varies according
to the value of the data point in the given dimension. Due to occlusion problems
we do not render each individual polygon opaquely. Instead we render a density
function that illustrates how many polygons pass through each region in space.
Densely populated areas appear opaque whilst sparsely populated areas appear
transparent.
The mdOrb is not a static visualisation, but rather a framework on which
dynamic interactive investigations can be carried out. Unlike a scatter plot ma-
trix [33], the mdOrb does not display every possible combination of dimensions
concurrently. Rather, the only combinations shown are those in close proximity
to each other, as determined by the current tessellation. However, this does not
mean that visible relationships are limited. Each axis can be moved around the
orb at will, thus allowing the user to pry apart certain regions or close them
together. Additionally, if the user moves an axis past the bounds of the triangles
that it forms, the surface mesh is recalculated for the new axis position. This
allows the user to interactively change the combinations of axes and their neigh-
bours. For example, if a user wishes to plot two dimensions against each other,
they simply move the relevant axes until they are adjacent; a visual guide to the
current tessellation, like that shown in Figure 1, aids them in this task.
A user may wish to “brush” (or highlight) a region of interest in the orb.
When brushing occurs all marks or other representations that correspond to the
same data entries can be highlighted. For example, if a user brushes a cluster
in one three-space, then the marks in all other three-spaces that correspond to
those same entries will also be highlighted. In this way the user can correlate
the different properties of individual entries or groups of entries across the entire
multidimensional space.
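A minimal sketch of this brushing behaviour is shown below: a shared selection of
entry identifiers is broadcast to every registered view, and each view highlights
the marks belonging to those entries. The listener-style interface is an assumption
made for illustration, not the interface of the mdOrb software.

import java.util.HashSet;
import java.util.Set;

/**
 * A minimal sketch of brushing: entries selected in one three-space are
 * highlighted in every other view that shows the same data entries.
 */
public final class Brushing {

    interface View {
        void highlight(Set<Integer> entryIds);
    }

    static final class BrushState {
        private final Set<Integer> brushed = new HashSet<>();
        private final Set<View> views = new HashSet<>();

        void register(View view) {
            views.add(view);
        }

        void brush(Set<Integer> entryIds) {
            brushed.clear();
            brushed.addAll(entryIds);
            for (View v : views) {
                v.highlight(brushed);      // the same entries light up everywhere
            }
        }
    }

    public static void main(String[] args) {
        BrushState state = new BrushState();
        state.register(ids -> System.out.println("three-space A highlights " + ids));
        state.register(ids -> System.out.println("three-space B highlights " + ids));
        state.brush(Set.of(3, 17, 42));    // a cluster brushed in one view
    }
}
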

5.2 Structural Data Models


Structural information, such as decision trees, network diagrams and program
structures, is often large, heavily connected and difficult to describe textually.
A structural diagram such as a graph can often convey the layout of the overall
data, but its size often means that it is difficult to study in detail.
One possibility is to use a VE for visualising such a graph, and to alter the
graph structure such that close inspection is possible [34]. The graph describing
the information is first broken into multiple sections, forming a Multiple Layer
and Multiple Relationship (MLMR) graph [35]. The MLMR graph separates
nodes and edges into coherent groups that form modules or building blocks of the
overall structure. When visualised in a VE, the standard operations of altering
the viewpoint and moving and rotating the graph are supported. Additionally,
the user can interactively turn on and off individual groups of nodes and edges.
This allows the user to switch between visualising the entire graph
from a global viewpoint and drilling down into a particular group of nodes and
edges with irrelevant sections of the graph removed for clarity.
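The sketch below captures the visibility mechanism in its simplest form: nodes are
held in named groups, each group can be toggled on or off, and the visible subgraph
is recomputed from the groups that remain switched on. The class and group names
are illustrative and do not reflect the MLMR implementation.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * A minimal sketch of toggling groups of nodes in an MLMR-style graph:
 * the visible subgraph is whatever remains after hidden groups are removed.
 */
public final class MlmrGraph {

    private final Map<String, Set<String>> nodeGroups = new HashMap<>();
    private final Set<String> hiddenGroups = new HashSet<>();

    void addGroup(String group, Set<String> nodes) {
        nodeGroups.put(group, nodes);
    }

    void toggle(String group) {
        if (!hiddenGroups.remove(group)) {
            hiddenGroups.add(group);       // was visible, now hidden
        }
    }

    /** Nodes of all groups that are currently switched on. */
    Set<String> visibleNodes() {
        Set<String> visible = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : nodeGroups.entrySet()) {
            if (!hiddenGroups.contains(e.getKey())) {
                visible.addAll(e.getValue());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        MlmrGraph graph = new MlmrGraph();
        graph.addGroup("java.api", Set.of("java.lang.Object", "java.util.List"));
        graph.addGroup("program", Set.of("Main", "Parser"));
        graph.toggle("java.api");                 // drill down: hide the API layer
        System.out.println(graph.visibleNodes()); // [Main, Parser] (order may vary)
    }
}
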
Figure 3 shows a visualisation of a program written in the Java language. The
view shows the entire Java API with edges representing inheritance links; the
program itself is shown by the grey group of nodes. The navigation icons in the
lower left corner allow the user to interactively control which groups are visible.
Each group of nodes is represented by a node icon and each group of edges by
an edge icon; by selecting and deselecting the icons, the groups of elements in
the graph are turned on and off. In Figure 4 many of the groups of nodes and
edges have been turned off; the only ones remaining in view are the nodes of the
program and the Java packages that it inherits from. The viewpoint has been
rotated and zoomed into the visible part of the graph to examine it in greater
detail.

Fig. 3. Overview visualisation of structural data.

While still in its early stages, the deployment of virtual environments in
data mining has much unexplored potential. Providing insights into the data
through visual and immersive means allows the user to more quickly understand
relationships in the data and assists in the selection of appropriate features for
data mining. Further explorations are underway to use VEs in the actual model
building process as well as in the visualisation of the resulting models themselves.

Fig. 4. In depth visualisation of portion of structural data.

6 Data Management

Data is stored in a variety of formats and within a variety of database systems
and data warehouses, across multiple platforms. The data needs to be accessed
in a timely manner, often after it has been pre-processed to suit the particular
application. And the data will often need to be accessed multiple times for use
in a single application. Efforts in this direction, including the ongoing develop-
ment of the Data Space Transfer Protocol [36], have begun to demonstrate
the significance of the data access issue. Here, we describe an initial approach
to effectively and seamlessly providing sophisticated data access mechanisms for
data mining. A particular focus of this research is on smart caching and other
optimisations which may be tuned for particular classes of analysis algorithms
to improve the run time performance for data mining over very large datasets.
We are employing the semantic extension framework (SEF) for Java as the en-
vironment for this work.
The semantic extension framework (SEF) for Java and the High Performance
Orthogonal Persistent Java (HPOPJ) built on top of SEF [37] are abstraction
tools which provide orthogonality of algorithms with respect to the data sources.
This approach allows datasets to be transparently accessed and efficiently man-
aged from any source. Algorithms accessing the data simply view the
data as Java data structures which are intended to be efficiently instantiated
as required and as determined by the semantic extensions provided for the rele-
vant objects. We are now exploring the use of the SEF and HPOPJ to provide
orthogonality and optimised access to large scale datasets.
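The intended effect for the data mining programmer can be pictured with the sketch
below: the algorithm is written once against plain Java types, while the origin of
the objects (a flat file, a database, a persistent store managed by HPOPJ) is hidden
behind an interface. The interface and record type are assumptions made for this
example; they are not the SEF or HPOPJ API.

import java.util.List;

/**
 * A minimal sketch of data-source orthogonality: the mining code below sees
 * ordinary Java objects and does not know where they come from.
 */
public final class OrthogonalAccess {

    record Transaction(String customer, double amount) {}

    interface TransactionSource {
        Iterable<Transaction> transactions();     // could be backed by any store
    }

    /** The algorithm neither knows nor cares where the data lives. */
    static double totalFor(TransactionSource source, String customer) {
        double total = 0.0;
        for (Transaction t : source.transactions()) {
            if (t.customer().equals(customer)) {
                total += t.amount();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        TransactionSource inMemory = () -> List.of(
                new Transaction("acme", 10.0),
                new Transaction("acme", 5.5),
                new Transaction("zenith", 3.0));
        System.out.println(totalFor(inMemory, "acme"));   // 15.5
    }
}
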
An important problem encountered when designing data mining applications
is that the programming language and the database system are different envi-
ronments. Moreover, most databases do not support the same data model as the
programming language. This quite common phenomenon, called the impedance
mismatch, means that the programmer has to map persistent variables onto
ACSys Data Mining 41

the database environment. Solving such mapping problems and keeping explicit
track of persistent information wastes a significant portion of development time
(sometimes more than 30%) and accounts for many programming errors. The
use of the SEF for data mining enables prototype-oriented development, where
complex algorithms are implemented and tested quickly.

6.1 Separation of Concerns and Orthogonal Persistent Java


Separation of concerns is a new subfield of software engineering [38]. Its goal is
to enable the encapsulation of all kinds of concerns in a software system such as
persistence, versioning, configuration, etc. An outstanding example of separation
of concerns with respect to the persistence operations is orthogonal persistence.
Orthogonal persistence provides programmers with an elegant abstraction over
the persistence of data. Programmers are freed from the burden of having to
explicitly program the movement of data between persistent and transient stores.
Orthogonally persistent Java (OPJ) refers to the application of the principles
of orthogonal persistence to the Java programming language. The separation
of concerns with respect to the persistence operations needs to be complemented
by a similar separation with respect to the storage medium. For this purpose,
a standard interface to the underlying storage medium is necessary. The PSI
interface [39] has been defined in order to address this issue. In designing PSI,
we sought to balance a number of objectives: to flexibly support the needs of
persistent programming languages such as OPJ; and to admit small and fast
implementations.
The ACSys UPSIDE project is concerned with taking the ideals of OPJ
towards industrial relevance through performance and functionality. For this
reason, performance issues being addressed by the project include high efficiency
storage, byte code optimisations and Java Virtual Machine (JVM) optimisations.
Key functionality issues include the efficient integration of powerful transaction
models into the OPJ VM (long and short transactions), and support for object
instance and class versioning.

6.2 The Semantic Extension Framework


There are a number of ways in which standard Java semantics can be transpar-
ently extended, including:
1. Modifying the virtual machine to directly implement the semantic extensions
either through the existing byte-code set [40,41], or via additional byte-codes
[42].
2. Modifying the virtual machine to implement extended reflection capabilities
through which semantic extensions can be implemented [42].
3. Preprocessing source code [43].
4. Modifying the compiler [44,45].
5. Preprocessing byte-codes (statically) [46].
6. Transforming byte-codes at class load time [47,44].
The first two approaches clearly violate the goal of portability as they depend
on a modified virtual machine. The next three approaches produce portable byte-
codes but require each producer of semantically extended code to have access to
a modified compiler or preprocessor. Moreover, the compilation approach pre-
cludes the dynamic composition of semantic extensions. Only the last method is
compatible with our goals of dynamic composition and portability. Consequently,
we have adopted the last approach to semantic extensions as the basis for our
semantic extension framework and our OPJ implementation (a semi-dynamic
approach).
Byte-code transformations are notoriously error-prone. A simple mistake dur-
ing the transformation process can destroy type safety or the semantics of the
program, and may lead to the byte-code-modified class being rejected at class
load time. A type-safe and declarative way to specify program transformations is
essential to the practical application of byte-code transformations. To this end,
we have defined the Semantic Extension Framework. Our framework allows for
both the semantic extension of methods and the inclusion of special ‘triggers’
(similar in concept to database triggers) that are activated on the occurrence of
particular events such as the execution of getfield or putfield Java byte-codes.
The semantic extension framework is invoked when a user class is loaded. This
action triggers a special semantic extension class loader to search for and load
any semantic extension classes that are applicable to the user class being loaded.
A first prototype of the framework has been implemented. It has been ap-
plied to the implementation of a portable OPJ and a portable object versioning
framework. We have implemented the framework using the ‘PoorMan’ library
that provides facilities for class file parsing and basic class transformations [47].
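In outline, load-time transformation can be pictured with a plain class loader that
intercepts the class file bytes before they are defined, which is the point at which
a semantic extension framework can rewrite byte-codes around events such as getfield
and putfield. The sketch below uses only the standard ClassLoader API; the transform
itself is left as a no-op placeholder, and the routing of user classes to this loader
by package prefix is an illustrative assumption rather than the SEF mechanism.

import java.io.InputStream;

/**
 * A minimal sketch of class-load-time transformation. Classes whose names
 * start with the managed prefix are read as bytes, passed through a
 * transform hook and then defined; everything else is delegated to the
 * parent loader. The no-op transform stands in for real byte-code rewriting.
 */
public final class TransformingLoader extends ClassLoader {

    private final String managedPrefix;

    public TransformingLoader(ClassLoader parent, String managedPrefix) {
        super(parent);
        this.managedPrefix = managedPrefix;
    }

    /** Placeholder for byte-code rewriting; returns the bytes unchanged. */
    private byte[] transform(String className, byte[] classBytes) {
        return classBytes;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                c = name.startsWith(managedPrefix) ? findClass(name) : super.loadClass(name, resolve);
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        String resource = name.replace('.', '/') + ".class";
        try (InputStream in = getParent().getResourceAsStream(resource)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            byte[] rewritten = transform(name, in.readAllBytes());
            return defineClass(name, rewritten, 0, rewritten.length);
        } catch (ClassNotFoundException e) {
            throw e;
        } catch (Exception e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
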

6.3 Orthogonally Persistent Systems


Orthogonally persistent systems are distinguished from other persistent systems
such as object databases by an orthogonality between data use and data persis-
tence. This orthogonality comes as the product of the application of the following
principles of persistence [48]:
Persistence Independence
The form of a program is independent of the longevity of the data which it
manipulates.
Data Type Orthogonality
All data types should be allowed the full range of persistence, irrespective of
their type.
Persistence Identification
The choice of how to identify and provide persistent objects is orthogonal to
the universe of discourse of the system.
These principles impart a transparency of persistence from the perspective of
the programming language, which obviates the need for programmers to maintain
mappings between persistent and transient data. The same code will thus operate
over persistent and transient data without distinction.
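As a small illustration of what this means for application code, the method in the
sketch below is written once and runs unchanged whether the list it receives was
created transiently a moment ago or was reached from a persistent root maintained by
an orthogonally persistent run-time; there is no explicit load or save call anywhere.
The root-registration call mentioned in the comment is hypothetical and only indicates
where persistence by reachability would come in.

import java.util.ArrayList;
import java.util.List;

/**
 * A minimal sketch of persistence independence: the same computation is
 * applied to transient and (conceptually) persistent data with no explicit
 * movement of data between stores in the application code.
 */
public final class PersistenceIndependence {

    static double averageBalance(List<Double> balances) {
        double sum = 0.0;
        for (double b : balances) {
            sum += b;
        }
        return balances.isEmpty() ? 0.0 : sum / balances.size();
    }

    public static void main(String[] args) {
        List<Double> transientBalances = new ArrayList<>(List.of(10.0, 20.0, 30.0));
        // Under an OPJ runtime the same list could instead be reachable from a
        // persistent root, e.g. something like: store.root("balances", list);
        // (a hypothetical call, shown only to mark where persistence arises).
        System.out.println(averageBalance(transientBalances));   // 20.0
    }
}
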
cucumbers a dressing made of two parts of salad oil and one part of
lemon juice, with salt and paprika to taste.

Daisy salad

Cut two-inch rounds of cream or Neufchâtel cheese one-half inch in


thickness, and place on crisp lettuce leaves. Put the yolks of two
hard-boiled eggs through the vegetable-press and place a
teaspoonful of this yellow powder in the center of each round. Serve
mayonnaise or French dressing in a separate bowl.

Tongue salad

Make a good French dressing. Dip into it firm, crisp lettuce leaves.
Have ready cold boiled tongue, cut as thin as writing paper. Lay a
slice upon each leaf, and serve with heated and buttered crackers.
You can substitute ham for the tongue.

Tomato aspic

Soak a half-box of gelatine in a half-pint of water for an hour. Bring


to a boil the liquor drained from a quart can of tomatoes, and add to
it a teaspoonful of onion juice, two teaspoonfuls of sugar, a bay leaf
and a teaspoonful of minced parsley, with pepper and salt to taste.
Simmer for twenty minutes, add the gelatine, stir until dissolved, and
strain through flannel into a jelly mold. Serve when firm, garnished
with lettuce and pour over all a mayonnaise dressing. This jelly—in
culinary phrase, “aspic”—lends itself agreeably to many
combinations of salad, being susceptible of countless variations.

Tomatoes with whipped cream

Carefully peel and halve ripe tomatoes and lay them on the ice for
several hours. Transfer to a chilled platter, sprinkle with salt, garnish
with lettuce leaves and put a great spoonful of whipped cream upon
each tomato half.

Tomato and corn salad

Pour boiling water over large, smooth tomatoes to loosen the skins,
and set on ice. When perfectly cold, gouge out the center of each
tomato with a spoon, and fill the cavity with boiled corn cut from the
cob and left to get perfectly cold; then mix with mayonnaise
dressing. Arrange the tomatoes on a chilled platter lined with
lettuce, and leave on ice until wanted. Pass more mayonnaise with
the salad.

Tomato and peanut salad

Prepare the tomatoes as in the last recipe. Have ready a pint or


more of roasted peanut meats, blanched by pouring boiling water
over them, then skinned, and when cold pounded finely and mixed
with mayonnaise dressing. Fill the tomatoes with this. Serve on
lettuce leaves.

Iced tomato salad

(Contributed)
Cook a quart of raw tomatoes soft, strain and season with nutmeg,
sugar, paprika, a pinch of grated lemon peel and salt. Freeze until
firm; put a spoonful upon a crisp lettuce leaf in each plate, cover
with mayonnaise and serve immediately. It is still prettier if you can
freeze it in round apple-shaped molds.
Canned tomatoes may be used if you have not fresh.

Clam salad

(Contributed)
Remove the skins and black heads of cold clams. Marinade for ten
minutes in a French dressing and serve on a bed of shredded
lettuce.

Pear salad

(Contributed)
Peel and slice five sweet, ripe pears, sprinkle with fine sugar, and
add a little maraschino or ginger syrup. Serve with a little cream. Or
pare and slice enough ripe, sweet pears to make one pint; add one-
half cupful of blanched and chopped almonds, one-fourth of a cupful
of powdered sugar and the strained juice of two lemons. Serve in a
cup of lettuce leaves made by placing together the stem end of two
lettuce leaves taken from the inside of a head of lettuce.

Hot potato salad

(Contributed)
Put into a frying-pan one-fourth of a pound of bacon, cut into dice;
when light brown take out and sauté in the fat a small onion cut
fine. Add one-half as much vinegar as fat, a few grains of salt and
cayenne and one-half as much hot stock as vinegar. Have ready the
potatoes boiled in skins. Remove the skins and slice hot into the
frying-pan enough to take up the liquid. Add the diced bacon, toss
together and serve.

Asparagus and shrimp salad

(Contributed)
To one cupful of shrimps add two cupfuls of cold cooked asparagus
tips, and toss lightly together. Season with salt and pepper. Make a
dressing of the yolks of three hard-boiled eggs, rubbed through a
sieve, and sufficient oil and vinegar to make the consistency of
cream, using twice as much oil as vinegar. Pour over the asparagus
and shrimps.

Asparagus salad

(Contributed)
Asparagus tips heaped on lettuce leaves and served with French,
mayonnaise or boiled dressing, poured over all, make a very good
salad.

Endive salad

(Contributed)
Use the well-blanched leaves only. Wipe these with a damp cloth.
Pour over this a French dressing and serve with roasted game.

Sweetbreads and cucumber salad

(Contributed)
Marinate one pair of sweetbreads in French dressing. Chill
thoroughly. Drain and mix with equal parts of sliced cucumber; cover
with French dressing into which has been stirred whipped cream.

Spinach salad

(Contributed)
Select the young, tender leaves from the center of the stock; wash
carefully, drain and chill and serve with French dressing.

Lenten salad

(Contributed)
Line the bottom of the salad-dish with crisp lettuce leaves. Fill the
center of the dish with cold boiled or baked fish, cut into pieces, and
pour over it a pint of mayonnaise dressing. Garnish with rings of
hard-boiled eggs.

Apple and cress salad

(Contributed)
Pare and cut into small pieces four medium-sized apples. Pour over
this a French dressing. Pick carefully the leaves from a bunch of
cress. Arrange around the outside of the salad-dish and heap the
apples in the center of the dish.

Strawberry salad

(Contributed)
Choose the heart from a nice head of lettuce, putting the stems
together to form a cup. Put a few strawberries in the center and
cover with powdered sugar and one teaspoonful of mayonnaise
dressing.

Banana salad

(Contributed)
Sliced bananas, served in the same manner as the strawberries in
the above recipe, make an excellent salad.

Veal salad

(Contributed)
Use equal parts of well-cooked cold veal cut into small pieces, and
finely-chopped white cabbage. Marinate the veal for two hours.
Drain and mix with the cabbage. Season with salt and pepper, and a
little chopped pickle, and cover with mayonnaise dressing.

Cherry salad

(Contributed)
Stone a pint of large cherries, being careful not to bruise the fruit.
Place a hazelnut in each cherry to preserve the form. Chill
thoroughly, arrange in a salad dish on lettuce leaves and pour over
all a cream mayonnaise dressing.

Peach salad

(Contributed)
Pare a quart of ripe yellow peaches, and cut into thin slices; slice
very thin a half cupful of blanched almonds. Mix the fruit and nuts
with two-thirds of a cupful of mayonnaise, to which has been added
one-third of a cupful of whipped cream. Serve immediately on
lettuce leaves.

Ham salad

(Contributed)
Mix equal portions of minced, well-cooked ham and English walnuts
or almonds. Serve with mayonnaise on lettuce leaves.

Sweetbreads with celery salad

(Contributed)
Wash the sweetbreads thoroughly and let them stand in cold water
half an hour. Boil in salted water twenty minutes and then put in
cold water again for a few minutes, to harden. To one cupful of
minced sweetbreads add one cupful of diced celery and one-half
cupful of chopped nuts. Cover well with mayonnaise dressing to
which some whipped cream has been added.

Green bean salad

(Contributed)
Select fresh string beans and boil until tender in salted water. Or use
a good quality of canned string beans. Arrange on a dish and serve
with mayonnaise dressing.

Pea salad

(Contributed)
Drain and press through a sieve a can of green peas. Dissolve one
box of gelatine in one-fourth of a cup of cold water and stir over a
hot fire until heated. Take from the fire and add one-fourth
teaspoonful of onion juice, one-half teaspoonful of salt, and a dash
of pepper. Serve very cold with the following dressing: Put into a
double boiler the yolks of two eggs, two tablespoonfuls of stock and
two tablespoonfuls of oil. Stir until thick, take from the fire and add
slowly one tablespoonful of tarragon vinegar, one chopped olive and
two teaspoonfuls of chopped parsley.

Nut salad

Blanch almond kernels, and when cold and crisp shred into shavings.
Mix with these an equal quantity of English walnuts, broken into bits,
and pecan kernels. Stir a good mayonnaise dressing into the mixture
and heap within curled lettuce leaves.
LUNCHEON FRUITS, COOKED AND
RAW
Stewed rhubarb

Select only good, firm stalks, and reject those that are withered. Lay
them in cold water for an hour, and cut into half-inch pieces. Put
them over the fire in a porcelain-lined saucepan and strew each
layer plentifully with sugar. Pour in enough water to cover all, and
bring very slowly to a boil. Let the rhubarb stew gently until it is very
tender, then remove from the fire. When cold, serve with plain cake.

Rhubarb and raisins

For every cupful of raw rhubarb cut into inch lengths add a third as
much of raisins seeded and cut in half. Cook until soft, as directed in
last recipe.

Rhubarb and dates

Stone a quarter of a pound of dates, cover with hot water, and cook
five minutes. Add three cupfuls of raw rhubarb, cut into inch lengths,
and cook, closely covered, until the rhubarb is tender. Sweeten to
taste and set aside to cool in a covered bowl, after which set on ice
until needed.

Rhubarb and figs

Soak a quarter-pound of figs in warm water for two hours. Cut into
small pieces and cook as previously directed with three cups of raw
rhubarb, cut into inch lengths, until the rhubarb is tender. Eat cold.
This dish is cooling to the blood, gently laxative and pleasing to the
taste.

Stewed gooseberries

Remove the tops and stems from one quart of gooseberries, wash
and drain. Put them into a saucepan with barely enough boiling
water to cover them. Let them stew until tender. Dissolve one cupful
of sugar in one-half cupful of water and boil to a syrup, then mix it
with the fruit and set away to cool.
Agate-nickel-steel ware is altogether the best in the market for
stewing acid fruits. They should never be cooked in tin or in iron,
and unless copper has just been cleaned with vinegar to remove all
suspicion of verdigris, the use of it is dangerous. I can not say too
much of the ware I have named. It is easily kept clean, durable and
safe.

Hot green apple sauce

Utilize in this way early windfalls and unripe summer apples,


proverbially dear to the heart of the small boy and harmful to his
digestive organs.
Pare and slice thin with a silver knife or with a fruit-knife of Swedish
bronze. The crude acid forms an instant and unpleasant combination
with steel. As you slice, drop into cold water to keep the color. Cook
in an agate-nickel-steel saucepan, with just enough boiling water to
keep the apples from burning to the bottom. Fit on a close lid and
do not open the pan for half an hour, lest the steam escape. Shake
up, and sidewise, every ten minutes to insure uniform steaming.
When the half-hour is up open the saucepan, and if the apples are
soft rub quickly through a colander of the same ware with the
saucepan. Beat in sugar to taste, also a lump of butter—about a
tablespoonful to a quart of the stewed fruit; turn into a covered bowl
and serve hot. Pass thin graham bread and butter with it.
It is wholesome, anti-bilious and palatable.

Cold apple sauce

Make in the same way of ripe, tart apples, a seasoning with mace or
nutmeg to taste. When it has cooled set on ice until wanted.

Stewed apples

Pare and core a dozen tart, juicy apples. Put them into a saucepan
with just enough cold water to cover them. Cook slowly until they
are tender and clear. Then remove the apples to a bowl, and cover
to keep hot; put the juice into a saucepan with a cupful of sugar,
and boil for half an hour. Season with mace or nutmeg. Pour hot
over the apples and set away covered until cold. Eat with cream.

Baked sweet apples

Wash and core, but do not pare them. Arrange in a deep pudding-
dish; put a teaspoonful of sugar and the tiniest imaginable bit of salt
into the cavities left by coring; pour in a half cupful of water for a
large dishful of apples; cover closely and bake in a good oven forty
minutes or until soft.
Eat ice-cold, with cream and sugar.

Stewed prunes

Wash dried prunes and soak them for at least five hours in cold
water. Put them into a saucepan with enough water to cover them
and simmer very gently for twenty minutes. Now add sufficient
granulated sugar to sweeten liberally, and simmer until the prunes
are tender. Take from the fire and set aside to cool. Eat with plain
cake.
Steamed prunes

Soak as directed above. Place them in a covered roaster and steam


steadily for two hours. Make a syrup in a separate vessel with the
water left from the soaking. This recipe is especially suited to those
who desire but little sugar in prunes, as but little sweetness can be
added to the prunes in steaming.
Never boil prunes, as the flavor is thereby injured. When cooked as
directed, if the syrup is not heavy enough to suit, remove the prunes
from the syrup and boil the syrup down to the required consistency.

Stewed prunelles and sultanas

Prunelles are more than subacid, and need the modifying influence
of sweeter fruits. Allow equal parts of prunelles and of the small
sultana raisins. Wash the fruit in tepid water, and soak it in enough
cold water to cover it for several hours, on the back of the range.
Draw them forward where they will simmer gently until soft. Add
sugar to taste, let the syrup boil up once, then set away to cool.

Dried apples and peaches

The prejudice against the dried apple of commerce is pronounced,


and founded upon traditions we should have outlived. The kiln-dried
fruit of to-day is a respectable edible and capable of excellent
results. It is especially good if mixed with equal parts of dried
peaches, soaked for three hours in just enough tepid water to cover
the fruit (having been first washed); then put over the fire with the
water in which they were soaked, and simmer tender. Rub through a
colander, add sugar, cinnamon and cloves to taste, and let the
mixture get perfectly cold.

Stewed cherries
None of our small fruits are more injured by transportation than
these same luscious and ruddy lobes. If you must buy cherries which
are brought from a distance and are, of necessity, several days old,
cook them if you regard the welfare of the digestive organs of your
family. The verse that tells us “cherries are ripe” would be more
reassuring if it also informed us that they were recently picked.
Wash and pick over carefully; put over the fire in a “safe” saucepan,
such as I have already indicated, with just enough water to prevent
burning, cover closely and stew until soft, but not broken. Strain off
the liquor; set aside the cherries in a covered bowl, add three
tablespoonfuls of sugar to each pint of the juice, return to the fire;
boil fast for half an hour and pour over the fruit. Keep covered until
cold.

Raw cherries

To be eaten at their raw best they should be kept in the ice-box until
needed. Then they may be served with their stems still on in a glass
bowl with fragments of ice scattered among them.

Sugared cherries

Use large, firm cherries for this dish. Have in front of you a soup-
plate containing the whites of three eggs mixed with five
tablespoonfuls of cold water, another plate filled with sifted
powdered sugar at your right, the bowl of cherries at your left. Dip
each cherry in the water and white of egg, turn it over and over in
the sugar and lay on a chilled platter to dry. When all are done sift
more powdered sugar over the fruit and arrange carefully on a glass
dish.

Glacé cherries
Select firm, sweet cherries from which the stems have not been
removed. Into a perfectly clean porcelain-lined saucepan put a
pound of granulated sugar and a gill of cold water, and boil to a
syrup. Do not stir during the process of cooking. Try the syrup
occasionally by dropping a little in cold water. When it changes to a
brittle candy it is done. Remove the saucepan at once from the fire
and set it in a pan of boiling water. Dip each cherry quickly in the hot
syrup and lay on a waxed paper to dry. If the syrup shows signs of
becoming too thick, add more boiling water to that in the outside
pan. When all the cherries have been “dipped” stand them in a
warm place to dry.

Pineapple and orange

Cut the top from a pineapple and carefully remove the inside, so that
the shell may not be broken. Cut the pulp into bits, mix it with the
pulp of three ripe oranges, also cut very small, and liberally sweeten
the mixture. Smooth off the bottom of the pineapple shell so that it
will stand upright, refill with the fruit pulp, put on the tip and set in
the ice for three hours.

Creamed peaches

Lay large, ripe free-stone peaches on the ice for several hours, peel,
cut them in half and remove the stones. Whip half a pint of cream
light, with two tablespoonfuls of powdered sugar. Fill the hollows left
by the stones to heaping with the whipped cream. Keep in the ice-
box until time to serve the fruit.

Grapefruit and strawberries

Cut grapefruit in half and remove the tough fiber and part of the
pulp. Chop this pulp and add it to mashed and sweetened
strawberries. Refill the grapefruit rinds with the mixture, and set on
the ice for an hour or two.
Strawberries and cream

Cap the berries, one at a time, using the tips of your fingers. The
practice of holding capped berries in the hollow of the hand until one
has as many as the space will accommodate, is unclean and
unappetizing. Cap them deftly and quickly, letting each fall into a
chilled bowl, and do this just before serving, keeping in a cool place
until they are ready to go to table. Pass powdered sugar and cream,
also ice-cold, with them.

Raspberries and cream

Follow the directions given in last recipe.

Bartlett pears and cream

Select sweet, ripe pears and lay them in the ice for two hours. Do
not peel until just before they are needed. Pare deftly and quickly,
slice, sprinkle with sugar, cover with cream and serve.

Bananas and cream

Bananas are very good treated as the pears were in the last recipe.
It is a good plan to bury these in the ice until wanted for dessert.
Then the hostess may, at the table, quickly peel and slice them into
different saucers. Bananas thus prepared do not have time to
become discolored from exposure to the air.
SWEET OMELETS
Apple sauce omelet (baked)

Beat the yolks of seven eggs light; stir into them five tablespoonfuls
of powdered sugar and a cupful and a half of sweetened apple
sauce. Beat long and hard, stir in the stiffened whites, beat for a
minute longer and turn into a greased pudding-dish. Bake, covered,
for about ten minutes, then uncover and brown. Serve at once with
whipped cream. It is also good served with a hot sauce made by the
following recipe:
Into a pint of boiling water stir a half-cupful of sugar, and when this
dissolves add a teaspoonful of butter, the juice and the grated rind
of a lemon and the stiffened white of an egg. Beat for a minute over
the fire, but do not let the sauce boil.

Jam omelet

Beat the yolks of five eggs light with a heaping tablespoonful of


powdered sugar. Into this stir a teaspoonful of corn-starch dissolved
in three tablespoonfuls of milk, then the stiffened whites of the eggs.
Cook in a frying-pan until set; spread with strawberry jam, fold and
serve as dessert.

Omelet soufflé

Beat the yolks of five eggs very light, adding, gradually, four
tablespoonfuls of powdered sugar. In another dish whip the whites
to a standing froth. With a few long strokes blend the two; pour into
a buttered bake-dish and bake quickly. Sift powdered sugar on the
top at the end of two minutes, and very quickly, as the omelet will
fall if the oven stands open even a few seconds. Serve at once in the
bake-dish.

Orange omelet

(Contributed)
Beat the yolks of five eggs together until thick and lemon-colored.
Add five tablespoonfuls of orange juice, the grated rind of one
orange and five tablespoonfuls of powdered sugar. Then fold in
lightly the beaten whites of four eggs. Put a little butter in an omelet
pan, and when hot pour in the omelet mixture and spread in evenly.
Let it cook through, but not harden. Fold the edges over and turn
out upon a hot dish. Serve with a dressing of sliced oranges and
powdered sugar.

Omelet with marmalade

(Contributed)
Beat the yolks of three eggs very light. Then fold in the whites
beaten dry. Turn into an omelet pan in which one teaspoonful of
butter has been melted. Spread the omelet evenly and cook over a
slow fire to set the eggs. Then put in the oven until done. Spread
one-half of the omelet with marmalade, fold and serve on hot
platter.

Queen Mab omelets

Beat four eggs, the yolks as smooth as cream, the whites to a


standing froth. Into the yolks whip three tablespoonfuls of powdered
sugar. Mix all together, add a tablespoonful of thick cream, whip
lightly and pour into buttered “nappies,” filling each half-way to the
top. Set in a pan of boiling water in a quick oven and bake five
minutes, covered. Turn out upon a hot platter, sift powdered sugar
over them and serve at once.
FAMILIAR TALK
A commonsensible talk with the nominal
mistress of the house

There is not that household in the land where servants are employed
which is not measurably dependent upon them for peace of mind as
well as for comfort of body. Every housewife who reads this will
recall the sinking of heart, the damp depression of spirit, which has
suddenly overtaken a cheerful mood when the kitchen barometer
beckoned “storm” or “change.” Such an overtaking is not an
affliction, but it sometimes comes dangerously near to sorrow. The
independent maid of all work has it in her power to alter the family
plans with a word, when that word is “going.” Should she elect to
stay, her lowering brows and sharp or sullen speech abash a
mistress who quails at little else. In wealthier households a domestic
“strike” involves panic, disorder and suffering.
I know of a wet-nurse whose abandonment of her infant charge,
without a word of warning, at ten o’clock one Saturday night, caused
a long and terrible illness, resulting in infantile paralysis. A cook who
had lived in one family for three years resented the arrival of
unexpected guests, packed her trunk and left her mistress to get
dinner. The lady was in delicate health and all unused to such work.
She became overheated and exhausted, took a heavy cold, which
ripened into pneumonia, and died three days after the cook’s
desertion.
I need not multiply illustrations of the helplessness of American
housewives in the face of such disasters, and the possibility that
these may befall any one of us. We have no redress. The women
who helped organize the “Protective League” know this. The law
does not protect the employer. Public opinion gives her no support.
The cook whose fit of temper cost a kind mistress her life was
recommended to me within a month after an event that should have
shocked the moral sense of every housewife in the community, and
recommended by a friend of the murdered woman and of myself.
When I exclaimed in surprise, I was told: “We can not be judges of
our neighbors’ domestic affairs.”
There is no class spirit among us. For some reasons this is a matter
of congratulation to us and the public. All that is needed to make the
opening gulf between mistresses and maids impassible is
organization on our part, which signifies open war. It is,
nevertheless, I note in passing, patent that there should be a code
of honor among us with regard to employment of those who have
proved absolutely untrustworthy in other households.
We are not true to one another in this matter, and our employées,
who are held together by the unwritten laws of a union, none the
less strong because nameless and informal, know this as well as we
do. The knowledge is one of the most potent weapons in their
armory.
Let this pass for the present. I would direct your attention, my
sister-worker in the home missionary field, to the brighter side of the
vexed question.
After forty years’ careful study of this matter of domestic service—
study carried on in other lands as well as in our own—I record
thankfully my conviction that the domestics in well-regulated
American homes are better cared for, better paid and more
thoroughly appreciated than any other class of working women in
this country or abroad. I record, likewise and confidently, that the
proportion of faithful, valued and even belovéd domestics among us
is much larger than that of indifferent or worthless. Most cheerfully
and thankfully I add to this record that, personally, I have a list of
honest, virtuous, willing workers, whose terms of service in my
family varied from three to thirteen years, and who went from my
house to homes of their own, bearing with them the cordial esteem
of those they had served. Nor is my experience singular, even in
these United States. It is so far from being exceptional that I
deprecate, almost as an individual grievance, any attempt to
organize those who should be our coworkers into a faction that
considers us as “the opposition.” It is a putting asunder of those
whom a mutual need should join together.
Backed by my two-score years of experiment and action, I dare
believe that a leaf or two from my book of household happenings
may be of service to younger women and novices in the profession
which absorbs the major part of our time and strength.
To begin with—beware of discouragement during the early trial-days
of the new maid. Be slow to say, even to yourself: “She will never
suit me!” The first days and weeks of a strange “place” are a crucial
test for her as for you, and she has not your sense of proportion,
your discipline of emotion and your philosophical spirit to help her to
endure the discomforts of new machinery.
Looking back upon my housewifely experiences, I am moved to the
conclusion that the domestics who stayed with me longest and
served me best were those who did not promise great things in their
novitiate.
One—“a greenhorn, but six weeks in the country”—frankly owned
that she knew nothing of American houses and ways. She was
“willing to learn,” and—with a childish tremble of the chin—“didn’t
mind how hard she worked if people were kind to her.” I think the
quivering chin and the clouding of the “Irish blue” eyes moved me to
give her a trial. She did not know a silver fork from a pepper cruet,
or a tea-strainer from a colander, and distinguished the sideboard
from the buffet by calling the one the “big,” the other the “little
dresser.” She had been with me a month when I trusted her to
prepare some melons for dessert, giving her careful and minute
directions how to halve the nutmeg melons, take out the seeds and
fill the cavities with cracked ice, while the watermelon—royal in
proportions and the first fruits of our own vines—was to be washed,
wiped, and kept in the ice-chest until it was wanted.
At dinner the “nutmegs” appeared whole; the watermelon had been
cut across the middle and eviscerated—scraped down to the white
lining of the rind—then filled with pounded ice. The succulent
sweetness, the rosy lusciousness of the heart, had gone into the
garbage can.
Nevertheless, I kept blue-eyed Margaret for eight years. She stands
out in my grateful memory as the one and only maid I have ever
had who washed dishes “in my way.” Never having learned any
other, she mastered and maintained the proper method.
The best nursery-maid I ever knew, and who blessed my household
for eleven years, objected diffidently at our first interview to giving a
list of her qualifications for the situation. She “would rather a lady
would find out for herself by a fair trial whether she would fit the
place or not.” I engaged her because the quaint phrase took my
fancy. She proved such a perfect fit that she continued to fill the
place until she went to a snug home of her own.
What may be called the New Broom of Commerce has no misgivings
as to her ability to fill any place, however important. Upon inquiry of
the would-be employer as to the latter’s qualifications for that high
position, the N. B. of C. may decline to accept her offer of an office
which promises more work than “privileges.” But she could fill it—full
—if she were willing to “take service” with the applicant.
One of the oddest incongruities of the new-broom problem is that
we are always disposed to take it at its own valuation. With each
fresh experiment we are confident that—at last!—we have what we
have been looking for lo! these many years. She is a shrewd house-
mother who reserves judgment until the first awkward week or the
crucial first month has brought out the staying power or proved the
lack of it.
Officious activity in unusual directions is a bad omen in the New
Broom of Commerce. In sporting parlance, I at once “saw the finish”
of one whom I found upon the second day of service with me
washing a window in the cellar. She “couldn’t abide dirt nowhere,”
she informed me, scrubbing vehemently at the dim panes. I had just
passed through the kitchen where a grateful of fiery coals was
heating the range plates to an angry glow. All the drafts were open;
the boiler over the sink was at a bubbling roar; upon the tables was
a litter of dirty plates and dishes; pots, pans and kettles filled the
sink.
It is well to have a care of the corners, but the weightier matters of
the law of cleanliness are usually in full sight.
I once knew a woman who, deliberately, and of purpose, changed
servants every month. She said no new broom lasted more than four
weeks, and when one became grubby and stumpy she got rid of it.
Her house was the cleanest in town and her temper did not seem
worse for friction.
Another woman who, strange to tell, lived to be ninety years old,
“liked moving” and never lived two years in one and the same
house. She maintained that she kept clear of rubbish by frequent
flittings, and enjoyed rubbing out and beginning again. Personally, I
should have preferred a clean, lively conflagration every three years
or so, but she throve upon nomadism.
In minor details of housewifery, as in more important, make up your
mind how you will manage the home and turn a deaf ear to
gratuitous suggestions from people whose own households would be
better conducted if their energies were concentrated.
Let one example suffice: A so-called reformer felt herself called in
(or out of) the Gospel of Humanity, the other day, to inveigh in a
parlor lecture upon the unkindness and general unchristianliness of
the maid’s cap and apron which all would-be stylish mistresses insist
upon. “Have I, a Christian woman in a republic,” cried the oratress,
“the right to put the badge of servitude upon my sister woman,
because, having less money than I have, she is obliged to earn her
living? Do I not tend to degrade, instead of elevating her?
“Of a piece with the cap and apron is the black dress, now ‘the thing’
for girls in domestic service. Why should not Bridget and Dinah
exercise their own right in dress as well as I?”
These questions have been put to me many times by women who
think and act for themselves without regard to arbitrary
conventionalities.
I am so well assured that most conventionalities have a substratum
of common sense that I am slow to condemn any one of them.
I dispute, at the outset, the insinuation that black dress, white cap
and apron are a badge of servitude. I know no more independent
class of women than trained nurses, no more arbitrary men than
railway officials. I should certainly never consider the distinctive garb
of the Sisters of Charity—Protestant or Roman Catholic—as
degrading. The idea of humiliation attached to the uniform of
housemaid and child’s nurse in the mind of employees or employer is
founded upon the conviction that domestic service demeans her who
performs it. This is precisely the prejudice which sensible,
philanthropic women are trying to beat down—a prejudice that has
more to do with the complications of the servant question than all
other influences combined. If I hesitate to ask a maid entering my
service to wear the uniform of her calling, I intimate too broadly to
be misunderstood that there is something in that service which
would demean her were it generally known that she is in it.
I had one maid, years ago, who would not run around the corner to
grocery or haberdasher’s without taking time to put on her Sunday
coat and hat, and to lay off her apron. When I spoke to her of the
absurdity and inconvenience of this, she confessed, blushingly, that
the porter at the grocery was “keeping company with her,” and “it
was nat’ral a gurrel should want to look her best when she was like
to see him.”
“Ah,” I said, “doesn’t he know what your position is in my house?
Has he never seen you in cap and apron?”
“Shure, mem! Every day when he fetches the groceries.”
“Then, if he is a sensible fellow, he will respect you all the more for
not pretending to be what you are not. Since he knows what your
business is, show him that you are not ashamed of it. You are as
respectable in your place as he is in his—as I am in mine—always
providing that you respect your service and yourself.” Call the
distinctive dress of your maid a “uniform,” not a livery. Point out to
her the examples of trained nurses, of railway conductors, of the
very porters who “keep company” with her; the policemen she
admires afar off; the soldiers, whose brass buttons dazzle her
imagination. Remind her that saleswomen in fashionable shops wear
the black gown, white apron, deep linen collar and cuffs and pride
themselves upon looking their best in them. Especially make her
comprehend (if you can, for the ways of the untrained mind are past
finding out), that she has an honorable calling and need not be
ashamed to advertise it.
Congratulate yourself, above all, that a sensible fashion holds back
Bridget and Dinah from the “exercise of their own taste in dress.”
The modification of that taste wrought by the neat and modest
costume prescribed by a majority of modern housewives may be in
itself a good thing, sparing the eyes of spectators of her toilettes
when she becomes “Mrs.” and independent, and the purse of the
porter, or truckman, or mechanic, who will have to pay for them.
I have laid stress upon the advantages of long terms of service, to
maid and to mistress. Like all other good things it has its perils and
its abuses to be avoided.
Two-thirds of the scandals that poison the social atmosphere steal
out, like pestilential fogs, through servants’ gossip. We discuss “the
girl” in our bedchambers, and if so much stirred up by her works and
ways as to forget what is due to our ladyhood, compare notes in the
parlor as to these same works and ways. Being well-bred women,
the traditions of our caste prevent us from making domestic
grievances the staple of drawing-room conversation and the marrow
of table-talk. The electroplated vulgarian never calls attention more
emphatically to the absence of the “Sterling” stamp upon her
breeding, than when she chatters habitually of the virtues and the
faults of her household staff.
On the other hand, the most sophisticated of us would be amazed
and confounded if she knew what a conspicuous part She plays in
talk below stairs and on afternoons and evenings “out.”
Thackeray, prince of satirists, puts it cleverly:
“Some people ought to have mutes for servants in Vanity Fair—
mutes who could not write. If you are guilty—tremble! That fellow
behind your chair may be a Janissary with a bowstring in his plush
breeches pocket. If you are not guilty, have a care of appearances,
which are as ruinous as guilt.” We should be neither shocked nor
confounded that these things are so. If we are mildly surprised, it
argues ignorance of human nature, and of the general likeness of
one human creature to another, that proves the whole world kin.
When mistresses in Parisian toilettes, clinking gold spoons against
Dresden as they sip Bohea in boudoir or drawing-room, raise their
eyebrows or laugh musically over the latest bit of social carrion in
“our set”—Jeames or Abigail, who has caught a whiff at a door ajar,
or through a keyhole, is the lesser sinner in serving up the story in
the kitchen cabinet. The domestics are in, yet not of, the employer’s
world, living for six and a half days of the week among people with
whom they have no affinity by nature or education. Where we would
talk of “things,” the lower classes discuss what they name “folks.”
Their range of thought is pitifully narrow; the happenings in their
social life are few and tame. What wonder if they retail what we say
and do and are, as sayings, doings and characters appear to them?
What would be extraordinary, if it were not so common, is the
opportunity gratuitously afforded in—we will say, guardedly—one
family out of three for the collection of material for these sensations
of the nether story. I speak by the card in asserting that the
influence gained by the confidential maid over her well-born, well-
mannered, well-educated mistress is greater than that possessed by
any friend in the (alleged) superior’s proper circle of equals.
Without taxing memory I can tell off on my fingers ten
gentlewomen, in every other sense of the word, whose intimate
confidantes are hirelings who were strangers until they entered the
employ of their respective mistresses(?). We need not cross the
ocean to listen with incredulous horror to insinuations and open
assertions as to the hold a gigantic Scotch gilly acquired over a royal
widow. Our next-door neighbors on both sides and our
acquaintances across the way are in like bondage.
I have in mind one of the best and most refined women I ever knew
whose infatuation for her incomparable Jane was the laughing-stock
of some, the surprise and grief of others. Jane disputed the dear
soul’s will, oft and again; gave her more advice than she took, and,
behind her back, ridiculed her unsparingly—as many of the
mistress’s friends were aware. The dupe would resign the affection
and society of one and all of her compeers sooner than part with
Jane.
Another “just could not live without my Mary.” The remote
suggestion throws her into a paroxysm of distress. Her own husband
knows it to be necessary to warn her not to tell this and that
business or family secret to Mary, knowing, the while, in his sad
soul, the chances to be against her keeping her promise not to share
it with her factotum.
Ellen is the bosom friend of a third; Bridget is the right hand, the
counsellor and colleague of a fourth. A fifth confides to her second-
rate associates that her faithful Fanny knows as much of family
histories (and there are histories in the clan) as she does, and that
she—the miscalled mistress—takes no step of importance without
consulting her.
Perhaps one man in five hundred is under the thumb of his
employee, and then because the underling has come into possession
of some dangerous secret, or has a “business hold” upon him.
Have wives more need of sympathy? or are they less nice in the
choice of intimates, and more reckless in confidences?
LUNCHEON CAKES
Huckleberry shortcake

Sift two heaping teaspoonfuls of baking-powder and one of salt into
a quart and a pint of flour. Chop into this two tablespoonfuls of
cottolene or other fat and two of butter. Beat two eggs light and add
them to a pint of sweet milk. Make a hole in the flour, pour in the
milk and egg, and mix with a wooden spoon. Turn out upon a pastry
board and roll into two sheets, about a third of an inch in thickness.
Line a greased biscuit-pan with one sheet, cover it three-quarters of
an inch thick with huckleberries, strew these with granulated sugar,
fit the upper sheet of dough on the pan and bake in a steady oven
until done. Cut into squares and send to table. Split, and eat with
butter and sugar.

Currant shortcake

Mash a quart of ripe red currants and stir into them two cups of
granulated sugar. Cover and set aside for half an hour.
Make a dough as for quick biscuit, only using a tablespoonful more
butter than usual. Roll into a large round biscuit about ten inches in
diameter. Bake, and, as soon as done, split open, spread with butter
and then with half the sweetened currants. Replace the top of the
biscuit and pour the remainder of the currants and juice over and
around the shortcake. Serve at once.

Hot strawberry shortcake

Mash a quart of berries, sweeten them with plenty of granulated
sugar, and let them stand for an hour and a half.
Into a pint of flour sift a teaspoonful of baking-powder, and half a
teaspoonful of salt. Chop into this one tablespoonful of butter until it
is thoroughly incorporated. Add enough milk to make a dough that
can be easily handled. Turn this upon a floured pastry-board, roll
lightly into a huge biscuit as large as a pie-plate. Put into a greased
pan and bake in a quick oven. When done, split open quickly, spread
with butter, then thickly with the mashed berries, put the two halves
together again, pour the remaining mashed berries over the entire
cake, and serve very hot.

Cold strawberry shortcake

Cream two tablespoonfuls of butter with a cup of powdered sugar.
Beat three eggs light, add to them a quarter of a cup of cream, and
stir into the creamed butter and sugar. Beat long and hard before
adding a cupful of flour sifted twice with a teaspoonful of baking-
powder. Grease three jelly-cake tins, half-fill with the batter and bake
in a quick oven. When cold, remove the cakes from the tins, spread
each layer with halved strawberries, sprinkle with sugar and pile on
a dish. Serve with an abundance of cream.

Scotch shortcake

(Contributed)
Cream a half-pound of fresh butter with a quarter-pound of sugar,
and work into it with the hands a pound of flour. Knead long, then
turn upon a pastry-board and press into a flat sheet half an inch
thick. Cut into squares and bake until light-brown and crisp.

Orange shortcake

(Contributed)
Sift into one and one-half cupfuls of flour one-half cupful of corn-
starch, one level teaspoonful of baking-powder and one-half
teaspoonful of salt. Rub into this with the tips of the fingers one-
third of a cup of butter and moisten with milk enough to make a soft
dough. Divide the dough in halves, spread over the bottoms of two
tins, and bake. When done, butter the cakes, sift powdered sugar
over each, and put between them thin slices of peeled oranges.

German coffee cake (No. 1)

Two cupfuls of scalded milk, one cupful of water, one yeast-cake
(one-cent size), one cupful of sugar, one-half cupful of butter, two
eggs, a little salt.
Cream sugar and butter, add milk and yeast dissolved in the water,
the salt and eggs, well-beaten. Thicken with enough flour to make a
batter that can be stirred with a spoon. Beat well and set to rise for
about three hours. When light, add enough flour to enable you to
roll it out. Roll about an inch thick, and place in long, shallow pans.
Set to rise. When light, drop over the top bits of butter about the
size of a hickory-nut, and sprinkle generously with sugar and a little
cinnamon. Bake about thirty minutes.

German coffee cake (No. 2)

To two cupfuls of soft bread sponge that has been allowed to rise,
add one-half cupful of warm milk, a little salt, one-quarter cupful of
melted shortening, two eggs, beaten with three-quarters of a cup of
sugar. Add one-half grated nutmeg, some raisins or currants, and as
much warmed flour as can be worked in with a spoon. Put it into a
greased tin and let it rise. When very light, moisten the top with
milk, sprinkle with sugar and cinnamon, and bake in a slow oven
forty minutes. Cover with brown paper until almost done.

Potato cake
Two cupfuls of white sugar, one cupful of butter, four eggs, one-half
cupful of milk, one cupful of potatoes, one teaspoonful, each, of
cinnamon and cloves, one-half cup of chocolate, two cups of flour,
two teaspoonfuls of baking-powder, one cup of almonds. Blanch and
chop almonds; grate cold boiled potatoes; beat eggs separately,
adding whites last. Bake in a shallow pan in a moderate oven, and
cover with caramel frosting.

Huckleberry cake

Sift a scant quart of flour twice with two teaspoonfuls of baking-
powder. Cream together one cupful of butter and two of sugar, add
to them five beaten eggs, a cup and a half of milk, a half-
teaspoonful, each, of powdered cinnamon and nutmeg and the
prepared flour. Last of all, stir in a cupful of huckleberries thoroughly
dredged with flour. Bake in greased muffin tins in a steady oven.
This excellent cake is better when twenty-four hours old than when
freshly baked.

Apple cake

Cream together a half-cupful of butter and two cupfuls of sugar, and
beat into them a half-cupful of milk and five whipped eggs. Last of
all, add three cupfuls of flour into which have been sifted two small
teaspoonfuls of baking-powder. Bake in layers. When cold, make the
filling by heating in a double boiler a cupful of apple sauce, adding
sugar to taste, and then beating in gradually the yolks of two eggs
and the juice of a lemon. Cook, stirring, for a minute, and set aside
until cold before spreading on the cake.

Springleys (No. 1)

(A German recipe.)
Beat one pound of granulated sugar for ten minutes with four eggs,
leave for an hour, then add one tablespoonful of lemon extract, and
one teaspoonful of hartshorn. Work in enough flour (about two
pounds) to make it stiff enough to roll out. Powder the forms with
flour before using, so as to prevent sticking. Cut apart and lay on a
smooth slab until morning. Sprinkle anise seed in the bottom of the
tins before putting cakes in. Bake in a quick oven and watch very
closely in order to keep them from burning.

Springerlein (No. 2)

(An old German recipe.)
One cup of powdered sugar, rolled fine, sifted and warmed. Four
large eggs. Grated rind of one lemon. One pound of flour thoroughly
dried and sifted three times. One-half teaspoonful of baking-powder
sifted thoroughly with the flour.
With a silver or wooden spoon stir the sugar and eggs steadily for
one hour, stirring one way, add rind of lemon, flour and baking-
powder, mix quickly into a loaf-shape without much handling. Set
aside in a cool place for two hours. Flour your baking-board lightly—
take a small piece of dough, which by this time must be stiff enough
to cut with a knife, roll out to about a quarter of an inch thick. Put
about two tablespoonfuls of flour in a small cheese-cloth bag and
with this lightly dust the mold. Press the dough on the mold, lightly
but firmly with the finger tips, then turn the mold over and carefully
remove. With a cutter cut off surplus dough, put with remainder and
proceed as before. Use as little flour as possible in rolling out. Put a
cloth on the table, sprinkle it with anise-seed, lay the cakes on this
and stand them for twelve hours in a cool room. Bake in a moderate
oven in lightly-buttered pans. This recipe will make from sixty-five to
seventy-five cakes.

Currant bun
Warm a cupful of cream in a double-boiler, take it from the fire and
stir into it a cupful of melted butter, which has not been allowed to
cook in melting. Beat three eggs very light, add them to the cream
and butter, then stir in a cupful of sugar. Dissolve a half-cake of
yeast in a couple of tablespoonfuls of water, sift a good quart of
flour, and make a hollow in it. Stir the dissolved yeast into the
hollow; add to the other mixture a teaspoonful, each, of powdered
mace and cinnamon, then put in the flour and the yeast. Beat all well for a few
minutes, add a cupful of currants that have been washed, dried and
dredged with flour, pour into a shallow baking-pan, let it rise for
several hours, until it has doubled in size; bake one hour in a rather
quick oven; sprinkle with fine sugar when done.

Cinnamon buns

Save a cupful of bread dough from the second rising. Cream a half-
cupful of butter with a half-cupful of sugar, stir in a well-beaten egg
and work these into the dough. Now add a half-teaspoonful of
cinnamon, a teaspoonful of soda, dissolved in a little hot water and a
half-cupful of cleaned currants, dredged with flour. Knead for several
minutes, form into buns, set to rise for a half-hour, then bake.

Parkin

Mix together three pounds of oatmeal, a pound and a half of
molasses, a half-pound of butter creamed with a half-pound of
sugar, a dash of ginger and as much baking-soda as will lie upon a
shilling, dissolved in a little boiling water. Mix thoroughly and bake in
flat pans.

Grandmother’s apple cake

(From an old family recipe.)
Three cups of dried apples stewed slowly in two cups of molasses,
then set aside to cool. Three cups of flour; two-thirds of a cup of
butter; two cups of brown sugar; one-half cup of raisins; currants
and grated lemon peel, mixed; eight teaspoonfuls of water, one level
teaspoonful of soda dissolved in the water, three eggs, spices to
taste.
This cake will keep for weeks. It is better when a few days old than
when first made.
The apples should be carefully washed, first in warm, then in cold
water, lying in this last for half an hour. Drain and toss in a towel
before adding the molasses.
In the “old times” the quantity of cake made by this recipe lasted the
children a month.

Bun loaf

(An English recipe.)
Cream together half a cupful of mixed butter and lard with a half-
cupful of brown sugar; beat into this one egg and work both into a
cupful of bread dough that has had its second rising. Work in, also,
half a teaspoonful of cinnamon and quarter of a grated nutmeg, half
a cupful of mixed raisins and currants, the raisins seeded and
chopped, the currants washed and dried, and both dredged with
flour, a tablespoonful of citron shredded and also dredged, and
knead all well for three or four minutes. Make into a loaf, let it rise
half an hour and bake in a moderate oven.

Fruit cake (No. 1)

One cupful of butter; one and a half cupfuls of powdered sugar; two
cupfuls of flour; six eggs; half a pound, each, of raisins and currants;
quarter-pound of citron; teaspoonful of cinnamon and nutmeg; half
teaspoonful of ground cloves; three tablespoonfuls of brandy.
Cream butter and sugar, beat in the whipped yolks of the eggs, stir
in the flour, the spice, the raisins, seeded and chopped; the currants,
washed; the citron, shredded, and all the fruit, well dredged with
flour, then the whites, beaten stiff, and the brandy. Bake about two
hours in a steady oven.

Fruit cake (No. 2)

Seed and chop a quarter of a pound of raisins; stem and wash a
quarter of a pound of currants; and mince three tablespoonfuls of
citron. Mix all this fruit together and thoroughly dredge with flour.
Rub to a cream a generous cupful of powdered sugar and a half-
cupful of butter, and beat into this five whipped eggs. Now add half
a teaspoonful, each, of ground cinnamon, nutmeg and mace, and
stir in a cupful of flour. Last of all, add the fruit, turn into a greased
cake tin and bake steadily, not fast, until done. This will probably
take from an hour to an hour and a half.

Fruit cake (No. 3)

Cream one cupful of butter with two cupfuls of powdered sugar, beat
the yolks of six eggs and add to the butter and sugar. Put in two and
a half cupfuls of sifted flour, half a pound, each, of seeded and
chopped raisins, and of washed and dried currants, a quarter of a
pound of shredded citron, all well dredged with flour, and a
teaspoonful, each, of cinnamon and grated nutmeg. Last of all, put
in the whites of the eggs beaten stiff. Bake in a steady oven.

Christmas fruit cake

This cake may be made as long before Christmas as you desire, as it
will keep for months. Cream together a half-pound, each, of butter
and sugar, and stir in six beaten eggs. Now beat in one teaspoonful,
each, of powdered nutmeg, cloves and cinnamon, one cupful of
flour, a half pound, each, of cleaned currants, seeded and chopped
raisins, and a quarter of a pound of shredded citron—all thoroughly
dredged with flour. Last of all, add a tablespoonful of rose water.
Turn into a deep tin, well greased, and bake in a steady oven until
done.

Pound cake

One pound, each, of butter, of sugar, of eggs, of flour; one
tablespoonful of brandy, one-half teaspoonful of mace.
Cream butter and sugar, beat whites and yolks separately and very
light. Add the brandy and mace to the creamed butter and sugar, stir
in the yolks, and, after beating hard for a couple of minutes, add the
flour and whites alternately, whipping them in lightly, but not stirring
after they have gone in. A pound cake batter should be as stiff as it
can be stirred. Bake in brick tins, or in small pans in a steady oven,
covering with paper to prevent too quick browning.

Grafton cake

Cream together three tablespoonfuls of butter with two cupfuls of
sugar and beat into these the yolks of three eggs, whipped light.
Add a cupful of cold water and two cupfuls of sifted flour. Stir in,
then, the whites of the eggs, beaten stiff, and another cupful of flour
into which has been sifted a heaping teaspoonful of baking-powder.
Flavor with a half-teaspoonful of nutmeg and cinnamon, mixed.

Gold cake

Cream together a cupful of butter and two cupfuls of sugar. When
well blended, stir in the beaten yolks of four eggs and a scant cupful
of milk. Now add, gradually, enough prepared flour to make a good
batter, and, at the last, the juice and grated rind of one orange. Turn
into a greased tin and bake until a straw comes out clean from the
thickest part of the loaf. Frost with an icing made by beating a cupful
of powdered sugar into the unbeaten white of one egg. When light
and smooth, add a teaspoonful of orange juice and a tablespoonful
of grated orange peel.

Silver cake

Cream together a cupful of sugar and a half-cupful of butter, and
beat into them the whites of four eggs, then a half-cupful of cold
water. Sift a pint of flour with a heaping teaspoonful of baking-
powder and add this gradually, beating to a light batter. Stir in, at
the last, a teaspoonful of rose-water and bake in a loaf. Cover with
icing flavored with rose-water.

Chocolate loaf cake (No. 1)

Cream together a cupful of sugar and a half-cupful of butter; add a
cupful of milk, four beaten eggs, and three ounces of grated
chocolate dissolved in a little milk. Beat all hard, then stir in quickly
two cupfuls of sifted prepared flour; flavor with vanilla and turn all
into a greased cake tin. Bake in a steady oven until a straw comes
out clean from the thickest part of the loaf.

Chocolate loaf cake (No. 2)

Dissolve eight tablespoonfuls of sweet grated chocolate in a gill of
hot milk. Rub to a cream a half-cupful of butter and a large cupful of
sugar, and into this beat five whipped eggs, the dissolved chocolate,
a pint of prepared flour and a teaspoonful of vanilla. Turn into a loaf-
tin and bake. Cover with chocolate icing.