Lecture Notes in Artificial Intelligence 1759
Subseries of Lecture Notes in Computer Science
Edited by J. G. Carbonell and J. Siekmann
Large-Scale Parallel Data Mining
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Volume Editors
Mohammed J. Zaki
Computer Science Department
Rensselaer Polytechnic Institute
Troy, NY 12180, USA
E-mail: [email protected]
Ching-Tien Ho
K55/B1, IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120, USA
E-mail: [email protected]
Cataloging-in-Publication Data applied for
CR Subject Classification (1991): I.2.8, I.2.11, I.2.4, I.2.6, H.3, F.2.2, C.2.4
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag is a company in the specialist publishing group BertelsmannSpringer
© Springer-Verlag Berlin Heidelberg 2000
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Christian Grosche
Printed on acid-free paper SPIN 10719635 06/3142 543210
Preface
With the unprecedented rate at which data is being collected today in almost all
fields of human endeavor, there is an emerging economic and scientific need to
extract useful information from it. For example, many companies already have
data-warehouses in the terabyte range (e.g., FedEx, Walmart). The World Wide
Web has an estimated 800 million web-pages. Similarly, scientific data is reach-
ing gigantic proportions (e.g., NASA space missions, Human Genome Project).
High-performance, scalable, parallel, and distributed computing is crucial for
ensuring system scalability and interactivity as datasets continue to grow in size
and complexity.
To address this need we organized the workshop on Large-Scale Parallel KDD
Systems, which was held in conjunction with the 5th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, on August 15th,
1999, San Diego, California. The goal of this workshop was to bring researchers
and practitioners together in a setting where they could discuss the design, im-
plementation, and deployment of large-scale parallel knowledge discovery (PKD)
systems, which can manipulate data taken from very large enterprise or scien-
tific databases, regardless of whether the data is located centrally or is globally
distributed. Relevant topics identified for the workshop included:
This book contains the revised versions of the workshop papers and it also
includes several invited chapters, to bring the readers up-to-date on the recent
developments in this field. This book thus represents the state-of-the-art in paral-
lel and distributed data mining methods. It should be useful for both researchers
Workshop Chairs
Program Committee
Acknowledgements
We would like to thank all the invited speakers, authors, and participants for
contributing to the success of the workshop. Special thanks are due to the pro-
gram committee for their support and help in reviewing the submissions.
Table of Contents
Mining Frameworks
The Integrated Delivery of Large-Scale Data Mining: The ACSys Data
Mining Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Graham Williams, Irfan Altas, Sergey Bakin, Peter Christen,
Markus Hegland, Alonso Marquez, Peter Milne, Rajehndra Nagappan,
and Stephen Roberts
Classification
Parallel Predictor Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
D.B. Skillicorn
Efficient Parallel Classification Using Dimensional Aggregates . . . . . . . . . . . 197
Sanjay Goil and Alok Choudhary
Clustering
Collective, Hierarchical Clustering from Distributed, Heterogeneous Data . 221
Erik L. Johnson and Hillol Kargupta
A Data-Clustering Algorithm On Distributed Memory Multiprocessors . . 245
Inderjit S. Dhillon and Dharmendra S. Modha
Parallel and Distributed Data Mining
Mohammed J. Zaki
1 Introduction
Data Mining and Knowledge Discovery in Databases (KDD) is a new interdis-
ciplinary field merging ideas from statistics, machine learning, databases, and
parallel and distributed computing. It has been engendered by the phenomenal
growth of data in all spheres of human endeavor, and the economic and scientific
need to extract useful information from the collected data. The key challenge in
data mining is the extraction of knowledge and insight from massive databases.
Data mining refers to the overall process of discovering new patterns or build-
ing models from a given dataset. There are many steps involved in the KDD
enterprise which include data selection, data cleaning and preprocessing, data
transformation and reduction, data-mining task and algorithm selection, and
finally post-processing and interpretation of discovered knowledge [1,2]. This
KDD process tends to be highly iterative and interactive.
Typically data mining has the two high level goals of prediction and descrip-
tion [1]. In prediction, we are interested in building a model that will predict
unknown or future values of attributes of interest, based on known values of some
attributes in the database. In KDD applications, the description of the data in
human-understandable terms is equally if not more important than prediction.
Two main forms of data mining can be identified [3]. In verification-driven data
mining the user postulates a hypothesis, and the system tries to validate it.
Task vs. Data Parallelism. These are the two main paradigms for exploiting al-
gorithm parallelism. Data parallelism corresponds to the case where the database
is partitioned among P processors. Each processor works on its local partition
of the database, but performs the same computation of evaluating candidate
patterns/models. Task parallelism corresponds to the case where the processors
perform different computations independently, such as evaluating a disjoint set
of candidates, but have/need access to the entire database. On SMPs all processors
have access to the entire data; on DMMs such access can be provided via selective
replication or explicit communication of the local data. Hybrid parallelism combining both
task and data parallelism is also possible, and in fact desirable for exploiting all
available parallelism in data mining methods.
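To make the two paradigms concrete, the following minimal Python sketch counts candidate itemset frequencies under each decomposition. It is only an illustration: the sequential loop over partitions stands in for real SMP or DMM processors, and the toy database and candidates are invented.

```python
from collections import Counter

def count_partition(transactions, candidates):
    """Count how many transactions contain each candidate itemset."""
    counts = Counter()
    for t in transactions:
        for c in candidates:
            if c <= t:          # candidate is a subset of the transaction
                counts[c] += 1
    return counts

db = [frozenset(t) for t in (["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"])]
cands = [frozenset(c) for c in (["a"], ["b"], ["a", "c"], ["b", "c"])]
P = 2  # number of (simulated) processors

# Data parallelism: partition the database; every processor evaluates ALL
# candidates on its local partition, and global counts are the sum of local counts.
data_par = Counter()
for p in range(P):
    data_par += count_partition(db[p::P], cands)

# Task parallelism: partition the candidates; every processor evaluates a DISJOINT
# candidate subset, but needs access to the ENTIRE database.
task_par = Counter()
for p in range(P):
    task_par.update(count_partition(db, cands[p::P]))

assert data_par == task_par   # both decompositions yield the same global counts
```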
Static vs. Dynamic Load Balancing. In static load balancing work is initially
partitioned among the processors using some heuristic cost function, and there
is no subsequent data or computation movement to correct load imbalances
which result from the dynamic nature of mining algorithms. Dynamic load bal-
ancing seeks to address this by stealing work from heavily loaded processors
and re-assigning it to lightly loaded ones. Computation movement also entails
data movement, since the processor responsible for a computational task needs
the data associated with that task as well. Dynamic load balancing thus incurs
additional costs for work/data movement, but it is beneficial if the load imbal-
ance is large and if load changes with time. Dynamic load balancing is especially
important in multi-user environments with transient loads and in heterogeneous
platforms, which have different processor and network speeds. These kinds of en-
vironments include parallel servers, and heterogeneous, meta-clusters. With very
few exceptions, most extant parallel mining algorithms use only a static load
balancing approach that is inherent in the initial partitioning of the database
among available nodes. This is because they assume a dedicated, homogeneous
environment.
Horizontal vs. Vertical Data Layout. The standard input database for mining
is a relational table having N rows, also called feature vectors, transactions, or
records, and M columns, also called dimensions, features, or attributes. The data
layout can be row-wise or column-wise. Many data mining algorithms assume a
horizontal or row-wise database layout, where they store, as a unit, each trans-
action (tid), along with the attribute values for that transaction. Other methods
use a vertical or column-wise database layout, where they associate with each at-
tribute a list of all tids (called tidlist) containing the item, and the corresponding
attribute value in that transaction. Certain mining operations are more efficient
using a horizontal format, while others are more efficient using a vertical format.
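The difference between the two layouts can be seen on a toy market-basket database; the tidlist dictionary below illustrates the generic idea rather than any particular system's storage format.

```python
# Horizontal layout: one record per transaction id (tid), listing its items.
horizontal = {
    1: {"bread", "milk"},
    2: {"bread", "beer"},
    3: {"milk", "beer", "bread"},
}

# Vertical layout: one tidlist per item, listing the tids that contain it.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

# Support of an itemset is a full scan in the horizontal format ...
support_h = sum(1 for items in horizontal.values() if {"bread", "milk"} <= items)
# ... but a tidlist intersection in the vertical format.
support_v = len(vertical["bread"] & vertical["milk"])
assert support_h == support_v == 2
```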
the final result contains only the ones that satisfy the (user-specified) input pa-
rameters. Mining algorithms can differ in the way new candidates are generated
for evaluation. One approach is that of complete search, which is guaranteed
to generate and test all valid candidates consistent with the data. Note that
completeness doesn’t mean exhaustive, since pruning can be used to eliminate
useless branches in the search space. Heuristic generation sacrifices completeness
for the sake of speed. At each step, it only examines a limited number (or only
one) of “good” branches. Random search is also possible. Generally, the more
complex the mined model, the more the tendency towards heuristic or greedy
search.
Candidate and Data Partitioning. An easy way to discuss the many parallel
and distributed mining methods is to describe them in terms of the computa-
tion and data partitioning methods used. For example, the database itself can
be shared (in shared-memory or shared-disk architectures), partially or totally
replicated, or partitioned (using round-robin, hash, or range scheduling) among
the available nodes (in distributed-memory architectures).
Similarly, the candidate concepts generated and evaluated in the different
mining methods can be shared, replicated or partitioned. If they are shared
then all processors evaluate a single copy of the candidate set. In the replicated
approach the candidate concepts are replicated on each machine, and are first
evaluated locally, before global results are obtained by merging them. Finally, in
the partitioned approach, each processor generates and tests a disjoint candidate
concept set.
In the sections below we describe parallel and distributed algorithms for some
of the typical discovery-driven mining tasks including associations, sequences,
decision tree classification and clustering. Table 1 summarizes in list form where
each parallel algorithm for each of the above mining tasks lies in the design
space. It would help the reader to refer to the table while reading the algorithm
descriptions below.
of length 2 is formed from the frequent items. Another database scan is made
to obtain their supports. The frequent itemsets are retained for the next pass,
and the process is repeated until all frequent itemsets (of various lengths) have
been enumerated.
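The level-wise procedure just described can be illustrated with a deliberately naive Apriori-style sketch (no hash trees, subset pruning, or other optimizations), making one database scan per level.

```python
def level_wise(transactions, minsup):
    """Naive Apriori-style level-wise mining: one database scan per level."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items.
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) >= minsup}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Candidates of length k, joined from the frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # One scan of the database to count candidate supports.
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent

db = [{"a", "b", "c"}, {"a", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
print(sorted(map(sorted, level_wise(db, minsup=3))))
```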
Other sequential methods for associations that have been parallelized in-
clude DHP [7], which tries to reduce the number of candidates by collecting
approximate counts (using hash tables) in the previous level. These counts can
be used to rule out many candidates in the current pass that cannot possibly be
frequent. The Partition algorithm [8] minimizes I/O by scanning the database
only twice. It partitions the database into small chunks which can be handled in
memory. In the first pass it generates a set of all potentially frequent itemsets,
and in the second pass it counts their global frequency. In both phases it uses a
vertical database layout. The DIC algorithm [9] dynamically counts candidates
of varying length as the database scan progresses, and thus is able to reduce the
number of scans.
A completely different design characterizes the equivalence class based algo-
rithms (Eclat, MaxEclat, Clique, and MaxClique) proposed by Zaki et al. [10].
These methods utilize a vertical database format, complete search, a mix of
bottom-up and hybrid search, and generate a mix of maximal and non-maximal
frequent itemsets. The algorithms utilize the structural properties of frequent
itemsets to facilitate fast discovery. The items are organized in a subset lattice
search space, which is decomposed into small independent chunks or sub-lattices,
which can be solved in memory. Efficient lattice traversal techniques are used,
which quickly identify all the frequent itemsets via tidlist intersections.
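The core vertical-format operation, support counting by tidlist intersection, can be sketched as a simple depth-first search over one equivalence class; the lattice decomposition, hybrid search, and maximal-itemset variants of the actual algorithms are omitted.

```python
def eclat(prefix, tidlists, minsup, out):
    """Depth-first frequent-itemset search over a vertical database.

    tidlists is a list of (item, set_of_tids) pairs forming one equivalence class.
    """
    for i, (item, tids) in enumerate(tidlists):
        support = len(tids)
        if support < minsup:
            continue
        itemset = prefix + [item]
        out[frozenset(itemset)] = support
        # New class: intersect this tidlist with the tidlists of the later items.
        suffix = [(other, tids & other_tids)
                  for other, other_tids in tidlists[i + 1:]]
        eclat(itemset, suffix, minsup, out)
    return out

vertical = [("a", {1, 2, 3, 5}), ("b", {1, 3, 4, 5}), ("c", {1, 2, 4, 5})]
print(eclat([], vertical, minsup=3, out={}))
```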
which have been described above. HPSPM performed the best among the three.
A parallel and distributed implementation of MSDD was presented in [27].
A shared-memory, SPADE-based parallel algorithm, utilizing dynamic load
balancing, is described by Zaki, and new algorithms for parallel sequence mining
are also described by Joshi et al. in this volume.
2.4 Classification
Classification aims to assign a new data item to one of several predefined cat-
egorical classes [28,29]. Since the field being predicted is pre-labeled, classifica-
tion is also known as supervised induction. While there are several classification
methods including neural networks [30] and genetic algorithms [31], decision
trees [32,33] are particularly suited to data mining, since they can be constructed
relatively quickly, and are simple and easy to understand. Common applications
of classification include credit card fraud detection, insurance risk analysis, bank
loan approval, etc.
A decision tree is built using a recursive partitioning approach. Each internal
node in the tree represents a decision on an attribute, which splits the database
into two or more children. Initially the root contains the entire database, with
examples from mixed classes. The split point chosen is the one that best separates
or discriminates the classes. Each new node is recursively split in the same
manner until a node contains only one class, or is dominated by a majority class.
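A minimal sketch of this recursive partitioning follows, using a Gini-based numeric split of the form A ≤ v and growing the tree until nodes are pure; it illustrates the idea only and is not any of the classifiers discussed below.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def build_tree(rows, labels):
    """Recursively split on the split point A <= v with the lowest weighted Gini
    impurity; stop when a node contains a single class."""
    if len(set(labels)) <= 1:
        return {"class": labels[0]}
    best = None
    for a in range(len(rows[0])):                  # candidate attribute
        for v in sorted({r[a] for r in rows}):     # candidate split value
            left = [i for i, r in enumerate(rows) if r[a] <= v]
            right = [i for i in range(len(rows)) if i not in left]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, a, v, left, right)
    _, a, v, left, right = best
    return {"split": (a, v),
            "left": build_tree([rows[i] for i in left], [labels[i] for i in left]),
            "right": build_tree([rows[i] for i in right], [labels[i] for i in right])}

X = [[23, 1], [45, 3], [31, 2], [52, 3], [27, 1]]   # toy numeric records
y = ["no", "yes", "no", "yes", "no"]
print(build_tree(X, y))
```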
Decision tree classifiers typically use a greedy search over the space of all
possible trees; there are simply too many trees to allow a complete search. The
search is also biased towards simple trees. Existing classifiers have used both the
horizontal and vertical database layouts. In parallel decision tree construction
the candidate concepts are the possible split points for all attributes within a
node of the expanding tree. For numeric attributes a split point is of the form
A ≤ vi, and for categorical attributes the test takes the form A ∈ {v1, v2, . . . },
where vi is a value from the domain of attribute A.
Below we look at some parallel decision tree methods. Recent surveys on
parallel and scalable induction methods are also presented in [34,35].
Replicated Tree, Partitioned Database. SLIQ [36] was one of the earliest
scalable decision tree classifiers. It uses a vertical data format, called attribute
lists, allowing it to pre-sort numeric attributes in the beginning, thus avoiding the
repeated sorting required at each node in traditional tree induction. Nevertheless
it uses a memory-resident structure called class-list, which grows linearly in the
number of input records. SPRINT [37] removes this memory dependence, by
storing the classes as part of the attribute lists. It uses data parallelism, and a
distributed-memory platform.
In SPRINT and parallel versions of SLIQ, the attribute lists are horizontally
partitioned among all processors. The decision tree is also replicated on all pro-
cessors. The tree is constructed synchronously in a breadth-first manner. Each
processor computes the best split point, using its local attribute lists, for all the
nodes on the current tree level. A round of communication takes place to de-
termine the best split point among all processors. Each processor independently
splits the current nodes into new children using the best split point, setting
the stage for the next tree level. Since a horizontal record is split in multiple
attribute lists, a hash table is used to note which record belongs to which child.
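This communication round can be sketched as follows, assuming mpi4py is available (an assumption, not part of the chapter) and using a placeholder for the local split evaluation; each processor contributes its best local split, and all agree on the global winner before splitting their local attribute lists identically.

```python
# Run with e.g. `mpiexec -n 4 python best_split.py`.
from mpi4py import MPI
import random

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def best_local_split(local_attribute_lists):
    """Placeholder: return (impurity, attribute, split_value) for the best split
    found in this processor's horizontal partition of the attribute lists."""
    random.seed(rank)
    return (random.random(), "age", 30 + rank)

local_best = best_local_split(local_attribute_lists=None)

# One round of communication: every processor learns every local optimum and
# picks the global winner (lowest impurity), keeping the replicated tree consistent.
global_best = min(comm.allgather(local_best))
if rank == 0:
    print("globally best split:", global_best)
```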
The parallelization of SLIQ follows a similar paradigm, except for the way
the class list is treated. SLIQ/R uses a replicated class list, while SLIQ/D uses
a distributed class list. Experiments showed that while SLIQ/D is better able
to exploit available memory, SLIQ/R was better in terms of performance, but
SPRINT outperformed both SLIQ/R and SLIQ/D.
ScalParC [38] is also an attribute-list-based parallel classifier for distributed-
memory machines. It is similar in design to SLIQ/D (except that it uses hash
tables per node, instead of global class lists). It uses a novel distributed hash
table for splitting a node, reducing the communication complexity and memory
requirements over SPRINT, making it scalable to larger datasets.
The DP-rec and DP-att [39] algorithms exploit record-based and attribute-
based data parallelism, respectively. In record-based data parallelism (also used
in SPRINT, ScalParC, SLIQ/D, and SLIQ/R), the records or attribute lists are
horizontally partitioned among the processors. In contrast, in attribute-based
data parallelism, the attributes are divided so that each processor is responsible
for an equal number of attributes. In both the schemes processors cooperate to
expand a tree node. Local computations are performed in parallel, followed by
information exchanges to get a global best split point.
Parallel Decision Tree (PDT) [40], a distributed-memory, data-parallel algo-
rithm, splits the training records horizontally in equal-sized blocks, among the
processors. It follows a master-slave paradigm, where the master builds the tree,
and finds the best split points. The slaves are responsible for sending class fre-
quency statistics to the master. For categorical attributes, each processor gathers
local class frequencies, and forwards them to the master. For numeric attributes,
each processor sorts the local values, finds class frequencies for split points, and
exchanges these with all other slaves. Each slave can then calculate the best local
split point, which is sent to the master, who then selects the best global split
point.
Shared Tree, Shared Database. MWK (and its precursors BASIC and
FWK) [41], a shared-memory implementation based on SPRINT, uses this ap-
proach. MWK uses dynamic attribute-based data parallelism. Multiple proces-
sors co-operate to build a shared decision tree in a breadth-first manner. Using
a dynamic scheduling scheme, each processor acquires an attribute for any tree
node at the current level, and evaluates the split points, before processing an-
other attribute. The processor that evaluates the last attribute of a tree node,
also computes the best split point for that node. Similarly, the attribute lists are
split among the children using attribute parallelism.
Hybrid Tree Parallelism. SUBTREE [41] uses dynamic task parallelism (that
exists in different sub-trees) combined with data parallelism on shared-memory
systems. Initially all processors belong to one group, and apply data parallelism
at the root. Once new child nodes are formed, the processors are also partitioned
into groups, so that a group of child nodes can be processed in parallel by a
processor group. If the tree nodes associated with a processor group become
pure (i.e., contain examples from a single class), then these processors join some
other active group.
The Hybrid Tree Formulation (HTF) in [42] is very similar to SUBTREE.
HTF uses distributed memory machines, and thus data redistribution is required
in HTF when assigning a set of nodes to a processor group, so that the processor
group has all records relevant to an assigned node.
pCLOUDS [43] is a distributed-memory parallelization of CLOUDS [44]. It
does not require attribute lists or the pre-sorting for numeric attributes; instead
it samples the split points for numeric attributes followed by an estimation step
to narrow the search space for the best split. It thus reduces both computation
and I/O requirements. pCLOUDS employs a mixed parallelism approach. Ini-
tially, data parallelism is applied for nodes with many records. All small nodes
are queued to be processed later using task parallelism. Before processing small
nodes the data is redistributed so that all required data is available locally at a
processor.
2.5 Clustering
Clustering is used to partition database records into subsets or clusters, such
that elements of a cluster share a set of common properties that distinguish
them from other clusters [45,46,47,48]. The goal is to maximize intra-cluster
and minimize inter-cluster similarity. Unlike classification which has predefined
labels, clustering must in essence automatically come up with the labels. For this
reason clustering is also called unsupervised induction. Applications of clustering
include demographic or market segmentation for identifying common traits of
groups of people, discovering new types of stars in datasets of stellar objects,
and so on.
The K-means algorithm is a popular clustering method. The idea is to ran-
domly pick K data points as cluster centers. Next, each record or point is assigned
to the cluster it is closest to in terms of squared-error or Euclidean distance. A
new center is computed by taking the mean of all points in a cluster, setting the
stage for the next iteration. The process stops when the cluster centers cease to
change. Parallelization of K-means received a lot of attention in the past. Differ-
ent parallel methods, mainly using hypercube computers, appear in [49,50,51,52].
We do not describe these methods in detail, since they used only small memory-
resident datasets.
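For reference, a minimal sequential sketch of the K-means iteration just described; initialization and empty-cluster handling are kept deliberately simple.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: assign each point to its nearest center, then recompute
    each center as the mean of its cluster, until assignments stop changing."""
    random.seed(seed)
    centers = random.sample(points, k)
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k),
                          key=lambda j: sum((p - c) ** 2
                                            for p, c in zip(x, centers[j])))
                      for x in points]
        if new_assign == assign:
            break
        assign = new_assign
        for j in range(k):
            members = [x for x, a in zip(points, assign) if a == j]
            if members:   # keep the old center if a cluster goes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assign

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(pts, k=2))
```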
Hierarchical clustering represents another common paradigm. These methods
start with a set of distinct points, each forming its own cluster. Then recursively,
two clusters that are close are merged into one, until all points belong to a
single cluster. In [49,53], parallel hierarchical agglomerative clustering algorithms
were presented, using several inter-cluster distance metrics and parallel computer
architectures. These methods also report results on small datasets.
P-CLUSTER [54] is a distributed-memory client-server K-means algorithm.
Data is partitioned into blocks on a server, which sends initial cluster centers and
data blocks to each client. A client assigns each record in its local block to the
nearest cluster, and sends results back to the server. The server then recalculates
the new centers and another iteration begins. To further improve performance
P-CLUSTER uses the fact that after the first few iterations only a few
records change cluster assignments, and also the centers have less tendency to
move in later iterations. They take advantage of these facts to reduce the number
of distance calculations, and thus the time of the clustering algorithm.
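The client-server pattern can be illustrated with the sketch below, in which clients exchange only per-cluster sums and counts rather than raw records; this shows the communication pattern only and omits P-CLUSTER's distance-pruning optimizations (the "clients" here are plain function calls).

```python
def local_statistics(block, centers):
    """Client-side step: assign each local record to its nearest center and return
    per-cluster (sum_vector, count) statistics instead of raw points."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x in block:
        j = min(range(k),
                key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[c])))
        counts[j] += 1
        sums[j] = [s + xi for s, xi in zip(sums[j], x)]
    return sums, counts

def server_update(all_stats, centers):
    """Server-side step: merge the clients' statistics and recompute the centers."""
    k, dim = len(centers), len(centers[0])
    totals = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for sums, cnts in all_stats:
        for j in range(k):
            counts[j] += cnts[j]
            totals[j] = [t + s for t, s in zip(totals[j], sums[j])]
    return [tuple(t / counts[j] for t in totals[j]) if counts[j] else centers[j]
            for j in range(k)]

data = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
blocks = [data[0::2], data[1::2]]     # two "clients", each holding one data block
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(10):                   # a few synchronous iterations
    centers = server_update([local_statistics(b, centers) for b in blocks], centers)
print(centers)
```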
Among the recent methods, MAFIA [55], is a distributed memory algorithm
for subspace clustering. Traditional methods, like K-means and hierarchical clus-
tering, find clusters in the whole data space, i.e., they use all dimensions for dis-
tance computations. Subspace clustering focuses on finding clusters embedded
in subsets of a high-dimensional space. MAFIA uses adaptive grids (or bins) in
each dimension, which are merged to find clusters in higher dimensions. Parallel
implementation of MAFIA is similar to association mining. The candidates here
are the potentially dense units (the subspace clusters) in k dimensions, which
have to be tested if they are truly dense. MAFIA employs task parallelism,
where data as well as candidates are equally partitioned among all processors.
Each processor computes local density, followed by a reduction to obtain global
density.
The paper by Dhillon and Modha in this volume presents a distributed-
memory parallelization of K-means, while the paper by Johnson and Kargupta
describes a distributed hierarchical clustering method.
given threshold level. All local results are combined and then evaluated at each
site to obtain the globally frequent itemsets.
Another example is JAM [56,57], a java-based multi-agent system utilizing
meta-learning, used primarily in fraud-detection applications. Each agent builds
a classification model, and different agents are allowed to build classifiers using
different techniques. JAM also provides a set of meta-learning agents for combin-
ing multiple models learnt at different sites into a meta-classifier that in many
cases improves the overall predictive accuracy. Knowledge Probing [58] is another
approach to meta-learning. Knowledge probing retains a descriptive model af-
ter combining multiple classifiers, rather than treating the meta-classifier as a
black-box. The idea is to learn, on a separate dataset, from the class predictions made by
all the local classifiers.
PADMA [59] is an agent based architecture for distributed mining. Individual
agents are responsible for local data access, hierarchical clustering in text doc-
ument classification, and web based information visualization. The BODHI [60]
DDM system is based on the novel concept of collective data mining. Naive min-
ing of heterogeneous, vertically partitioned, sites can lead to an incorrect global
data model. BODHI guarantees correct local and global analysis with minimum
communication.
In [61] a new distributed do-all primitive, called D-DOALL, was described
that allows easy scheduling of independent mining tasks on a network of work-
stations. The framework allows incremental reporting of results, and seeks to
reduce communication via resource-aware task scheduling principles.
The Papyrus [62] java-based system specifically targets wide-area DDM over
clusters and meta-clusters. It supports different data, task and model strate-
gies. For example, it can move models, intermediate results or raw data between
nodes. It can support coordinated or independent mining, and various meth-
ods for combining local models. Papyrus uses PMML (Predictive Model Markup
Language) to describe and exchange mined models. Kensington [63] is another
java-based system for distributed enterprise data mining. It is a three-tiered sys-
tem, with a client front-end for GUI, and visual programming of data mining
tasks. The middle-layer application server provides persistent storage, task exe-
cution control, and data management and preprocessing functions. The third-tier
implements a parallel data mining service.
Other recent work in DDM includes decision tree construction over dis-
tributed databases [64], where the learning agents can only exchange summaries
instead of raw data, and the databases may have shared attributes. The main
challenge is to construct a decision tree using implicit records rather than ma-
terializing a join over all the datasets. The WoRLD system [65] describes an
inductive rule-learning program that learns from data distributed over a net-
work. WoRLD also avoids joining databases to create a central dataset. Instead
it uses marker-propagation to compute statistics. A marker is a label of a class
of interest. Counts of the different markers are maintained with each attribute
value, and used for evaluating rules. Markers are propagated among different
tables to facilitate distributed learning.
For more information on parallel and distributed data mining see the book
by Freitas and Lavington [66] and the edited volume by Kargupta and Chan [67].
Also see [68] for a discussion of cost-effective measures for assessing the perfor-
mance of a mining algorithm before implementing it.
High Dimensionality. Current methods are only able to handle a few thousand
dimensions or attributes. Consider association rule mining as an example. The
second iteration of the algorithm counts the frequency of all pairs of items,
which has quadratic complexity. In general, the complexity of different mining
algorithms may not be linear in the number of dimensions, and new parallel
methods are needed that are able to handle large numbers of attributes.
Large Size. Databases continue to increase in size. Current methods are able
to (perhaps) handle data in the gigabyte range, but are not suitable for terabyte-
sized data. Even a single scan for these databases is considered expensive. Most
current algorithms are iterative, and scan data multiple times. For example, it
is an open problem to mine all frequent associations in a single pass, although
sampling based methods show promise [69,70]. In general, minimizing the num-
ber of data scans is paramount. Another factor limiting the scalability of most
mining algorithms is that they rely on in-memory data structures for storing
potential patterns and information about them (such as candidate hash tree [6]
in associations, tid hash table [71] in classification). For large databases these
structures will certainly not fit in aggregate system memory. This means that
temporary results will have to be written out to disk or the database will have
to be divided into partitions small enough to be processed in memory, entailing
further data scans.
Data Location. Today’s large-scale data sets are usually logically and phys-
ically distributed, requiring a decentralized approach to mining. The database
may be horizontally partitioned where different sites have different transactions,
or it may be vertically partitioned, with different sites having different attributes.
Most current work has only dealt with the horizontal partitioning approach. The
databases may also have heterogeneous schemas.
Data Type. To-date most data mining research has focused on structured data,
as it is the simplest, and most amenable to mining. However, support for other
data types is crucial. Examples include unstructured or semi-structured (hy-
per)text, temporal, spatial and multimedia databases. Mining these is fraught
with challenges, but is necessary as multimedia content and digital libraries pro-
liferate at astounding rates. Techniques from parallel and distributed computing
will lie at the heart of any proposed scalable solutions.
Data Skew. One of the problems adversely affecting load balancing in paral-
lel mining algorithms is sensitivity to data skew. Most methods partition the
database horizontally in equal-sized blocks. However, the number of patterns
generated from each block can be heavily skewed, i.e., while one block may con-
tribute many, the other may have very few patterns, implying that the processor
responsible for the latter block will be idle most of the time. Randomizing the
blocks is one solution, but it is still not adequate, given the dynamic and inter-
active nature of mining. The effect of skewness on different algorithms needs to
be further studied (see [72] for some recent work).
Dynamic Load Balancing. Most extant algorithms use only a static par-
titioning scheme based on the initial data decomposition, and they assume a
homogeneous, dedicated environment. This is far from reality. A typical parallel
database server has multiple users, and has transient loads. This calls for an in-
vestigation of dynamic load balancing schemes. Dynamic load balancing is also
crucial in a heterogeneous environment, which can be composed of meta- and
super-clusters, with machines ranging from ordinary workstations to supercom-
puters.
4 Book Organization
This book contains chapters covering all the major tasks in data mining including
parallel and distributed mining frameworks, associations, sequences, clustering
and classification. We provide a brief synopsis of each chapter below, organized
under four main headings.
Joshi et al. open this section with a survey chapter on parallel mining of as-
sociation rules and sequences. They discuss the many extant parallel solutions,
and give an account of the challenges and issues for effective formulations of
discovering frequent itemsets and sequences.
Morishita and Nakaya describe a novel parallel algorithm for mining corre-
lated association rules. They mine rules based on the chi-squared metric that
optimizes the statistical significance or correlation between the rule antecedent
and consequent. A parallel branch-and-bound algorithm was proposed that uses
a term rewriting technique to avoid explicitly maintaining lists of open and
closed nodes on each processor. Experiments on SMP platforms (with up to 128
processors) show very good speedups.
Shintani and Kitsuregawa propose new load balancing strategies for general-
ized association rule mining using a gigabyte-sized database on a cluster of 100
PCs connected with an ATM network. In generalized associations the items are
at the leaf levels in a hierarchy or taxonomy of items, and the goal is to discover
rules involving concepts at multiple (and mixed) levels. They show that load
balancing is crucial for performance on such large-scale clusters.
Zaki presents pSPADE, a parallel algorithm for sequence mining. pSPADE
divides the pattern search space into disjoint, independent sub-problems based
on suffix-classes, each of which can be solved in parallel in an asynchronous
manner. Task parallelism and dynamic inter- and intra-class load balancing is
used for good performance. Results on a 12 processor SMP using up to a 1 GB
dataset show good speedup and scaleup.
4.3 Classification
4.4 Clustering
5 Conclusion
We conclude by observing that the need for large-scale data mining algorithms
and systems is real and immediate. Parallel and distributed computing is es-
sential for providing scalable, incremental and interactive mining solutions. The
field is in its infancy, and offers many interesting research directions to pur-
sue. We hope that this volume, representing the state-of-the-art in parallel and
distributed mining methods, will be successful in bringing to the surface the require-
ments and challenges of large-scale parallel KDD systems.
References
1. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge
discovery: An overview. [86]
2. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: The KDD process for extracting
useful knowledge from volumes of data. Communications of the ACM 39 (1996)
3. Simoudis, E.: Reality check for data mining. IEEE Expert: Intelligent Systems
and Their Applications 11 (1996) 26–33
4. DeWitt, D., Gray, J.: Parallel database systems: The future of high-performance
database systems. Communications of the ACM 35 (1992) 85–98
5. Valduriez, P.: Parallel database systems: Open problems and new issues. Dis-
tributed and Parallel Databases 1 (1993) 137–165
6. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery
of association rules. In Fayyad, U., et al, eds.: Advances in Knowledge Discovery
and Data Mining, AAAI Press, Menlo Park, CA (1996) 307–328
7. Park, J.S., Chen, M., Yu, P.S.: An effective hash based algorithm for mining
association rules. In: ACM SIGMOD Intl. Conf. Management of Data. (1995)
8. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining asso-
ciation rules in large databases. In: 21st VLDB Conf. (1995)
9. Brin, S., Motwani, R., Ullman, J., Tsur, S.: Dynamic itemset counting and im-
plication rules for market basket data. In: ACM SIGMOD Conf. Management of
Data. (1997)
10. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast dis-
covery of association rules. In: 3rd Intl. Conf. on Knowledge Discovery and Data
Mining. (1997)
11. Mueller, A.: Fast sequential and parallel algorithms for association rule mining: A
comparison. Technical Report CS-TR-3515, University of Maryland, College Park
(1995)
12. Park, J.S., Chen, M., Yu, P.S.: Efficient parallel data mining for association rules.
In: ACM Intl. Conf. Information and Knowledge Management. (1995)
13. Agrawal, R., Shafer, J.: Parallel mining of association rules. IEEE Trans. on
Knowledge and Data Engg. 8 (1996) 962–969
14. Cheung, D., Han, J., Ng, V., Fu, A., Fu, Y.: A fast distributed algorithm for mining
association rules. In: 4th Intl. Conf. Parallel and Distributed Info. Systems. (1996)
15. Shintani, T., Kitsuregawa, M.: Hash based parallel algorithms for mining associa-
tion rules. In: 4th Intl. Conf. Parallel and Distributed Info. Systems. (1996)
16. Zaki, M.J., Ogihara, M., Parthasarathy, S., Li, W.: Parallel data mining for asso-
ciation rules on shared-memory multi-processors. In: Supercomputing’96. (1996)
17. Cheung, D., Hu, K., Xia, S.: Asynchronous parallel algorithm for mining asso-
ciation rules on shared-memory multi-processors. In: 10th ACM Symp. Parallel
Algorithms and Architectures. (1998)
18. Han, E.H., Karypis, G., Kumar, V.: Scalable parallel data mining for association
rules. In: ACM SIGMOD Conf. Management of Data. (1997)
19. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: Parallel algorithms for fast
discovery of association rules. Data Mining and Knowledge Discovery: An Inter-
national Journal 1(4):343-373 (1997)
20. Tamura, M., Kitsuregawa, M.: Dynamic load balancing for parallel association rule
mining on heterogeneous PC cluster systems. In: 25th Intl Conf. on Very Large
Data Bases. (1999)
21. Zaki, M.J.: Parallel and distributed association mining: A survey. IEEE Concur-
rency 7 (1999) 14–25
22. Agrawal, R., Srikant, R.: Mining sequential patterns. In: 11th Intl. Conf. on Data
Engg. (1995)
23. Srikant, R., Agrawal, R.: Mining sequential patterns: Generalizations and perfor-
mance improvements. In: 5th Intl. Conf. Extending Database Technology. (1996)
24. Oates, T., Schmill, M.D., Jensen, D., Cohen, P.R.: A family of algorithms for
finding temporal structure in data. In: 6th Intl. Workshop on AI and Statistics.
(1997)
25. Zaki, M.J.: Efficient enumeration of frequent sequences. In: 7th Intl. Conf. on
Information and Knowledge Management. (1998)
26. Shintani, T., Kitsuregawa, M.: Mining algorithms for sequential patterns in paral-
lel: Hash based approach. In: 2nd Pacific-Asia Conf. on Knowledge Discovery and
Data Mining. (1998)
27. Oates, T., Schmill, M.D., Cohen, P.R.: Parallel and distributed search for structure
in multivariate time series. In: 9th European Conference on Machine Learning.
(1997)
28. Weiss, S.M., Kulikowski, C.A.: Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman (1991)
29. Michie, D., Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Sta-
tistical Classification. Ellis Horwood (1994)
30. Lippmann, R.: An introduction to computing with neural nets. IEEE ASSP
Magazine 4 (1987)
31. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learn-
ing. Morgan Kaufmann (1989)
32. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regres-
sion Trees. Wadsworth, Belmont (1984)
33. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufman (1993)
34. Provost, F., Aronis, J.: Scaling up inductive learning with massive parallelism.
Machine Learning 23 (1996)
35. Provost, F., Kolluri, V.: A survey of methods for scaling up inductive algorithms.
Data Mining and Knowledge Discovery: An International Journal 3 (1999) 131–169
36. Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A fast scalable classifier for data
mining. In: Proc. of the Fifth Intl Conference on Extending Database Technology
(EDBT), Avignon, France (1996)
37. Shafer, J., Agrawal, R., Mehta, M.: SPRINT: A scalable parallel classifier for data
mining. In: 22nd VLDB Conference. (1996)
38. Joshi, M., Karypis, G., Kumar, V.: ScalParC: A scalable and parallel classifica-
tion algorithm for mining large datasets. In: Intl. Parallel Processing Symposium.
(1998)
39. Chattratichat, J., Darlington, J., Ghanem, M., Guo, Y., Huning, H., Kohler, M.,
Sutiwaraphun, J., To, H.W., Dan, Y.: Large scale data mining: Challenges and
responses. In: 3rd Intl. Conf. on Knowledge Discovery and Data Mining. (1997)
40. Kufrin, R.: Decision trees on parallel processors. In Geller, J., Kitano, H., Suttner,
C., eds.: Parallel Processing for Artificial Intelligence 3, Elsevier-Science (1997)
41. Zaki, M.J., Ho, C.T., Agrawal, R.: Parallel classification for data mining on shared-
memory multiprocessors. In: 15th IEEE Intl. Conf. on Data Engineering. (1999)
42. Srivastava, A., Han, E.H., Kumar, V., Singh, V.: Parallel formulations of decision-
tree classification algorithms. Data Mining and Knowledge Discovery: An Interna-
tional Journal 3 (1999) 237–261
43. Sreenivas, M., Alsabti, K., Ranka, S.: Parallel out-of-core divide and conquer
techniques with application to classification trees. In: 13th International Parallel
Processing Symposium. (1999)
44. Alsabti, K., Ranka, S., Singh, V.: Clouds: A decision tree classifier for large
datasets. In: 4th Intl Conference on Knowledge Discovery and Data Mining. (1998)
45. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall (1988)
46. Cheeseman, P., Kelly, J., Self, M., et al.: AutoClass: A Bayesian classification
system. In: 5th Intl Conference on Machine Learning, Morgan Kaufman (1988)
47. Fisher, D.H.: Knowledge acquisition via incremental conceptual clustering. Ma-
chine Learning 2 (1987)
48. Michalski, R.S., Stepp, R.E.: Learning from observation: Conceptual clustering.
In Michalski, R.S., Carbonell, J.G., Mitchell, T.M., eds.: Machine Learning: An
Artificial Intelligence Approach. Volume I. Morgan Kaufmann (1983) 331–363
49. Li, X., Fang, Z.: Parallel clustering algorithms. Parallel Computing 11 (1989)
270–290
50. Rivera, F., Ismail, M., Zapata, E.: Parallel squared error clustering on hypercube
arrays. Journal of Parallel and Distributed Computing 8 (1990) 292–299
51. Ranka, S., Sahni, S.: Clustering on a hypercube multicomputer. IEEE Trans. on
Parallel and Distributed Systems 2(2) (1991) 129–137
52. Rudolph, G.: Parallel clustering on a unidirectional ring. In et al., R.G., ed.:
Transputer Applications and Systems ’93: Volume 1. IOS Press, Amsterdam (1993)
487–493
53. Olson, C.: Parallel algorithms for hierarchical clustering. Parallel Computing 21
(1995) 1313–1325
54. Judd, D., McKinley, P., Jain, A.: Large-scale parallel data clustering. In: Intl Conf.
Pattern Recognition. (1996)
55. Goil, S., Nagesh, H., Choudhary, A.: MAFIA: Efficient and scalable subspace cluster-
ing for very large data sets. Technical Report 9906-010, Center for Parallel and
Distributed Computing, Northwestern University (1999)
56. Stolfo, S., Prodromidis, A., Tselepis, S., Lee, W., Fan, W., Chan, P.: JAM: Java
agents for meta-learning over distributed databases. In: 3rd Intl. Conf. on Knowl-
edge Discovery and Data Mining. (1997)
57. Prodromidis, A., Stolfo, S., Chan, P.: Meta-learning in distributed data mining
systems: Issues and approaches. [67]
58. Guo, Y., Sutiwaraphun, J.: Knowledge probing in distributed data mining. In: 3rd
Pacific-Asia Conference on Knowledge Discovery and Data Mining. (1999)
59. Kargupta, H., Hamzaoglu, I., Stafford, B.: Scalable, distributed data mining using
an agent based architecture. In: 3rd Intl. Conf. on Knowledge Discovery and Data
Mining. (1997)
60. Kargupta, H., Park, B.H., Hershberger, D., Johnson, E.: Collective data mining:
A new perspective toward distributed data mining. [67]
61. Parthasarathy, S., Subramonian, R.: Facilitating data mining on a network of
workstations. [67]
62. Grossman, R.L., Bailey, S.M., Sivakumar, H., Turinsky, A.L.: Papyrus: A system
for data mining over local and wide area clusters and super-clusters. In: Super-
computing’99. (1999)
63. Chattratichat, J., Darlington, J., Guo, Y., Hedvall, S., Kohler, M., Syed, J.: An
architecture for distributed enterprise data mining. In: 7th Intl. Conf. High-
Performance Computing and Networking. (1999)
64. Bhatnagar, R., Srinivasan, S.: Pattern discovery in distributed databases. In:
AAAI National Conference on Artificial Intelligence. (1997)
65. Aronis, J., Kolluri, V., Provost, F., Buchanan, B.: The WoRLD: Knowledge discov-
ery from multiple distributed databases. In: Florida Artificial Intelligence Research
Symposium. (1997)
66. Freitas, A., Lavington, S.: Mining very large databases with parallel processing.
Kluwer Academic Pub., Boston, MA (1998)
67. Kargupta, H., Chan, P., eds.: Advances in Distributed Data Mining. AAAI Press,
Menlo Park, CA (2000)
68. Skillicorn, D.: Strategies for parallel data mining. IEEE Concurrency 7 (1999)
26–35
69. Toivonen, H.: Sampling large databases for association rules. In: 22nd VLDB Conf.
(1996)
70. Zaki, M.J., Parthasarathy, S., Li, W., Ogihara, M.: Evaluation of sampling for data
mining of association rules. In: 7th Intl. Wkshp. Research Issues in Data Engg.
(1997)
71. Shafer, J., Agrawal, R., Mehta, M.: SPRINT: A scalable parallel classifier for data
mining. In: Proc. of the 22nd Intl Conference on Very Large Databases, Bombay,
India (1996)
72. Cheung, D., Xiao, Y.: Effect of data distribution in parallel mining of associations.
Data Mining and Knowledge Discovery: An International Journal 3 (1999) 291–314
73. Agrawal, R., Shim, K.: Developing tightly-coupled data mining applications on
a relational database system. In: 2nd Intl. Conf. on Knowledge Discovery in
Databases and Data Mining. (1996)
74. Meo, R., Psaila, G., Ceri, S.: A new SQL-like operator for mining association rules.
In: 22nd Intl. Conf. Very Large Databases. (1996)
75. Meo, R., Psaila, G., Ceri, S.: A tightly-coupled architecture for data mining. In:
Intl. Conf. on Data Engineering. (1998)
76. Sarawagi, S., Thomas, S., Agrawal, R.: Integrating association rule mining with
databases: alternatives and implications. In: ACM SIGMOD Intl. Conf. Manage-
ment of Data. (1998)
77. Holsheimer, M., Kersten, M.L., Siebes, A.: Data surveyor: Searching the nuggets
in parallel. [86]
78. Lavington, S., Dewhurst, N., Wilkins, E., Freitas, A.: Interfacing knowledge discov-
ery algorithms to large databases management systems. Information and Software
Technology 41 (1999) 605–617
79. Kamber, M., Han, J., Chiang, J.Y.: Metarule-guided mining of multi-dimensional
association rules using data cubes. In: 3rd Intl. Conf. on Knowledge Discovery and
Data Mining. (1997)
80. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., Verkamo, A.I.: Finding
interesting rules from large sets of discovered association rules. In: 3rd Intl. Conf.
Information and Knowledge Management. (1994) 401–407
81. Shen, W.M., Ong, K.L., Mitbander, B., Zaniolo, C.: Metaqueries for data mining.
[86]
82. Ng, R.T., Lakshmanan, L.V.S., Han, J., Pang, A.: Exploratory mining and prun-
ing optimizations of constrained association rules. In: ACM SIGMOD Intl. Conf.
Management of Data. (1998)
83. Srikant, R., Vu, Q., Agrawal, R.: Mining Association Rules with Item Constraints.
In: 3rd Intl. Conf. on Knowledge Discovery and Data Mining. (1997)
84. Matheus, C., Piatetsky-Shapiro, G., McNeill, D.: Selecting and reporting what
is interesting. In Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy,
R., eds.: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press
(1996)
85. Toivonen, H., Klemettinen, M., Ronkainen, P., Hätönen, K., Mannila, H.: Prun-
ing and grouping discovered association rules. In: MLnet Wkshp. on Statistics,
Machine Learning, and Discovery in Databases. (1995)
86. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., eds.: Advances in
Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA (1996)
The Integrated Delivery of
Large-Scale Data Mining:
The ACSys Data Mining Project
1 Introduction
High performance computers and parallel algorithms provide the necessary plat-
form for the delivery of novel and actionable discoveries from extremely large
predictive modelling and other data mining tasks. Also, as model builders are
applied to ever larger datasets, the complexity of the resulting models increases
correspondingly. Virtual environments can also effectively provide insights into
the modelling process, and the resulting models themselves.
All aspects of data mining revolve around the data. Data is stored in a variety
of formats and within a variety of database systems. Data needs to be accessed
in a timely manner and potentially multiple times. Managing, transforming,
and efficiently accessing the data is a crucial issue. The Semantic Extension
Framework provides an environment for seamlessly extending the semantics of
Java objects, allowing those objects to be instantiated in different ways and from
different sources. We are beginning to explore the benefits of such a framework for
ongoing data mining activities. The potential of this approach lies in all stages
of the data mining process [1], from data management and data versioning,
through to access mechanisms highly tuned to suit the behaviour of access of
the particular predictive modelling tool being employed.
Finally, we need to bring these tools together to deliver highly configurable,
and often pre-packaged or ‘canned’ solutions for particular applications. The
Data Miner’s Arcade provides simple abstractions to integrate these components
providing high performance data access for a variety of data mining tools com-
municating through standard interfaces, and building on the developing XML
standards for data mining [2].
2 Parallel Algorithms
Careful, detailed examination of each and every customer, patient, or claimant
that exists in a very large dataset made available for data mining might well
lead to a better understanding of the data and of underlying processes. Given
the sheer size of data we are talking about in data mining, this is, of course not
generally feasible, and probably not desirable. Yet, with the desire to analyse
all the data, rather than statistical samples of the data, a data mining exercise
is often required to apply computationally complex analysis tools to extremely
large datasets.
Often, we characterise the task as being one of building an indicator func-
tion as a predictor of fraud, of propensity to purchase, or of improved health
outcomes. We can view the function as
y = f (x)
where y is the real valued response, indicating the likelihood of the outcome,
and x is the array of predictor variables (attributes or features) which encode
the information thought to be relevant to the outcome. The function f can be
trained on the collected data by, for example, (logistic) regression. We have been
developing new computational techniques to identify such predictive models from
large data sets.
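As a toy illustration of training such an indicator function, the sketch below fits a logistic-regression predictor by gradient descent; the feature names and data are invented, and this is not the technique developed in the chapter.

```python
import math

def fit_logistic(X, y, steps=2000, lr=0.1):
    """Fit f(x) = sigmoid(w . x + b) by gradient descent on the log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi                    # gradient factor of the log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def score(x, w, b):
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

# x = (claimed_deductions, income), scaled to [0, 1]; y = 1 if the case was fraudulent.
X = [(0.9, 0.2), (0.8, 0.3), (0.2, 0.8), (0.1, 0.9)]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)
print(round(score((0.85, 0.25), w, b), 2), round(score((0.15, 0.85), w, b), 2))
```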
Applications for such model building abound. Another example is in insur-
ance where a significant problem is to determine optimal premium levels. When
f(x) = f_0 + \sum_{i=1}^{d} f_i(x_i).
Similar models are used in ANOVA, where all the variables xi are categorical.
The effects of the predictor variables are added up. Thus, the effect of the value
of a variable xi is independent of the effect of a different variable xj . We have
suggested and discussed a new scalable and parallel algorithm for the determi-
nation of a (generalised) additive model in [3].
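To illustrate the additive structure, a backfitting sketch with a crude binned-mean smoother follows; this is not the scalable parallel algorithm of [3], only the decomposition f(x) = f_0 + sum_i f_i(x_i) fitted coordinate by coordinate.

```python
def binned_smoother(x, r, bins=10):
    """Crude univariate smoother: average the residuals r within equal-width bins
    of x and return the per-point fitted values."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0
    idx = [min(int((v - lo) / width), bins - 1) for v in x]
    means = []
    for b in range(bins):
        vals = [ri for ri, i in zip(r, idx) if i == b]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return [means[i] for i in idx]

def backfit_additive(X, y, sweeps=20):
    """Fit f(x) = f0 + sum_i fi(xi) by cycling over the coordinates and smoothing
    the partial residuals against each coordinate in turn (backfitting)."""
    n, d = len(y), len(X[0])
    f0 = sum(y) / n
    comps = [[0.0] * n for _ in range(d)]   # fitted values of each fi at the data
    for _ in range(sweeps):
        for i in range(d):
            resid = [y[k] - f0 - sum(comps[j][k] for j in range(d) if j != i)
                     for k in range(n)]
            comps[i] = binned_smoother([row[i] for row in X], resid)
    return f0, comps

X = [(k / 20.0, (k % 5) / 5.0) for k in range(20)]
y = [2.0 + 3.0 * x1 + (x2 - 0.4) ** 2 for x1, x2 in X]
f0, comps = backfit_additive(X, y)
print(round(f0, 2))
```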
A better model includes interactions between the variables. For example, it
could be the case that for different incomes the effect of the level of deductions
from taxable income on the likelihood of fraud varies. Interaction models are of
the form:
f(x) = f_0 + \sum_{i=1}^{d} f_i(x_i) + \sum_{i,j=1}^{d} f_{i,j}(x_i, x_j).
predictors f which are smooth and fit the data. Thin plate splines [12] are an
established smooth model. They are designed to have small curvature. The one-
dimensional components fi (xi ) turn out to be cubic splines which are computa-
tionally very tractable using a B-spline basis. The form of the interaction terms
is also known:
f_{x_i,x_j}(x_i, x_j) = c_0 + c_1 x_i + c_2 x_j + \sum_{k=1}^{n} b_k \, \phi\big( (x_i - x_i^{(k)})^2 + (x_j - x_j^{(k)})^2 \big)
where φ(r^2) = r^2 log(r^2) [12]. The coefficients of the thin plate splines are de-
termined by the linear system of the form
\begin{pmatrix} \Phi + \alpha I & X \\ X^T & 0 \end{pmatrix} \begin{pmatrix} b \\ c \end{pmatrix} = \begin{pmatrix} y \\ 0 \end{pmatrix}.
J_1(f) = \sum_{k=1}^{n} \big( f(x^{(k)}) - y^{(k)} \big)^2
       + \alpha \int \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2
       + 2 \left( \frac{\partial^2 f}{\partial x_1 \partial x_2} \right)^2
       + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2 . \qquad (1)
and
\frac{\partial f}{\partial n}(x) = u_n(x), \quad x \in \partial G
then the same solution as above is obtained. However, practical tests show that
the curl condition is not important in achieving a good approximation [4].
The finite element solution of the optimisation problem proceeds in two
stages:
1. The matrix and right-hand side of the linear system of equations is assem-
bled. The matrix of this linear system is the sum of low rank matrices, one
for each data point x(i) .
2. The linear system of equations is solved.
The time for the first (assembly) stage depends linearly on the data size n and
the time for the second (solution) stage is independent of n. Thus the overall
algorithm scales with the number of data points. The data points only need
to be visited once, thus there is no need to either store the entire data set
in memory nor revisit the data points several times. The basis functions are
piecewise bilinear and require a small number of operations for their evaluation.
With this technique the smoothing of millions of data points becomes feasible.
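A hedged sketch of the assembly stage on a regular grid of piecewise bilinear basis functions: each observation contributes one small rank-one update and the data is streamed exactly once. The smoothing penalty is replaced here by a crude ridge term, so this illustrates the data-access pattern rather than the chapter's finite element formulation.

```python
import numpy as np

def assemble(stream, g):
    """Single pass over (x1, x2, y) observations on [0,1]^2: each point touches the
    four bilinear basis functions of its grid cell, contributing a rank-one update
    to the (g*g) x (g*g) matrix A and to the right-hand side b."""
    A = np.zeros((g * g, g * g))
    b = np.zeros(g * g)
    for x1, x2, y in stream:                        # data is visited exactly once
        i = min(int(x1 * (g - 1)), g - 2)           # lower-left node of the cell
        j = min(int(x2 * (g - 1)), g - 2)
        u, v = x1 * (g - 1) - i, x2 * (g - 1) - j   # local coordinates in [0,1]
        nodes = [i * g + j, (i + 1) * g + j, i * g + j + 1, (i + 1) * g + j + 1]
        w = np.array([(1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v])
        A[np.ix_(nodes, nodes)] += np.outer(w, w)   # low-rank (rank-one) update
        b[nodes] += y * w
    return A, b

rng = np.random.default_rng(0)
data = ((x1, x2, np.sin(3 * x1) + x2) for x1, x2 in rng.random((1000, 2)))
A, b = assemble(data, g=8)
# A crude ridge term stands in for the smoothing penalty of the real formulation.
coeff = np.linalg.solve(A + 1e-3 * np.eye(A.shape[0]), b)
print("solved for", coeff.size, "coefficients")
```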
The parallel algorithm exploits different aspects of the problem for the as-
sembly and the solution stage. The time required for the assembly stage grows
linearly as a function of data size. For simplicity we assume that the data is ini-
tially equally distributed between the local disks of the processors. (If this is not
the case initial distribution costs would have to be included in the analysis.) In
a first step of the assembly stage a local matrix is assembled for each processor
based on the data available on its local disk. The matrix of the full problem is
then the sum of the local matrices and can thus be obtained through a reduction
step. This algorithm was developed and tested on a cluster of 10 Sun Sparc-5
workstations networked with a 10 Mbit/s twisted pair Ethernet using MPI [15].
The total time spent in this assembly phase is of the order

O(n/p + m \log_2(p)),

where m characterises the size of the assembled matrix. Thus, if the number n of the data points grows like O(p \log_2(p)) for fixed matrix size m, the parallel efficiency is

E_p = \frac{T_1}{p T_p} = O\!\left( \frac{n}{n + m p \log_2(p)} \right) = O(1)
and thus there is no drop in parallel efficiency for larger numbers of processors.
This basic trend is confirmed by practical experiments [15].
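A hedged sketch of this local-assembly-plus-reduction pattern, written with mpi4py purely for illustration (the experiments cited used MPI on the workstation cluster described above); the matrix size and the synthetic sparse basis rows are assumptions of the example.

# Run with, e.g.: mpiexec -n 4 python assemble.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
m = 81                                              # size of the assembled matrix

# Each process creates (or would read from local disk) only its share of the data.
rng = np.random.default_rng(rank)
local_n = 10_000
basis = rng.normal(size=(local_n, m)) * (rng.random((local_n, m)) < 4 / m)  # sparse rows
y = rng.normal(size=local_n)

A_local = basis.T @ basis                           # sum of rank-one updates
b_local = basis.T @ y

A = np.empty_like(A_local) if rank == 0 else None
b = np.empty_like(b_local) if rank == 0 else None
comm.Reduce(A_local, A, op=MPI.SUM, root=0)         # reduction over the processes
comm.Reduce(b_local, b, op=MPI.SUM, root=0)

if rank == 0:
    coeffs = np.linalg.solve(A + 1e-6 * np.eye(m), b)   # solution stage (serial here)
    print("solved system of size", m, "using", nprocs, "processes")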
In the solution stage the spatial parallelism of the problem is exploited. As-
sume for simplicity that the domain is rectangular. If the domain is split into strips of equal size, the values on the boundaries between the strips depend on the data in the neighbouring strips. However, as this dependency is local, only
a fixed number of points in the neighbouring strip really have an influence on
the function values f (x) in the strip. A good approximation is obtained for the
values on the strip by solving the smoothing problem for an expanded region
containing the original strip and a sufficient number of neighbouring points. Note
that by introducing redundant computations in this way, communication can be
avoided. The size of the original strip is proportional to m/p and, in order to
add the extra k neighbouring points, it has to be expanded by a factor kp/n.
Thus the size of the expanded strip is of the order of
s = (m/p)(1 + kp/n).
As we assumed n = O(p log2 (p)) to get isoefficiency [16] of the assembly phase
the size of the strips is proportional to m/p asymptotically in p which shows
isoefficiency for the solution stage.
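A quick numeric check of the isoefficiency argument, with arbitrarily chosen constants: when n grows like p \log_2(p), the efficiency expression stays constant and the expanded strip size stays proportional to m/p.

# Evaluate the efficiency and strip-size formulas for increasing p
# (m, k and c are assumed constants of the example).
import math

m, k, c = 10_000, 32, 2_000
for p in (2, 8, 32, 128, 512):
    n = int(c * p * math.log2(p))                    # n = O(p log2 p)
    eff = n / (n + m * p * math.log2(p))             # efficiency stays ~ c/(c+m)
    strip = (m / p) * (1 + k * p / n)                # expanded strip size
    print(f"p={p:4d}  n={n:9d}  efficiency={eff:.3f}  strip*p/m={strip * p / m:.3f}")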
This approach thus ensures a fast and efficient path to the development of
predictive models.
The general piecewise linear multivariate model considered here has the form

f(x) = \sum_{k_1=0}^{K_1} \cdots \sum_{k_d=0}^{K_d} a_{k_1 \ldots k_d} \, T_{k_1 \ldots k_d}(x), \qquad T_{k_1 \ldots k_d}(x) = \prod_{j=1}^{d} b_{k_j,j}(x_j)   (3)

where b_{0,j}(x_j) = 1 and \{ b_{k_j,j}(x_j) \}_{k_j=1}^{K_j} are univariate piecewise linear basis functions of the variable x_j, j = 1, \ldots, d. The original MARS is based on the univariate truncated power basis functions:

f(x) = \sum_{m=0}^{J} a_m T_m(x)
where

T_m(x) = \prod_{j=1}^{d_m} [ x_{v(j,m)} - t_{jm} ]_+ .

(Here a piecewise linear multivariate function is one which is piecewise linear with respect to any of its numeric variables.)
As can be seen, this model is similar to the general model (3) in that both belong
to the same function space. However, the distinct feature of MARS models is
that they are normally based on only a very small subset of the complete set of
tensor product basis functions. The pseudo-code of the procedure which builds
the subset of functions is shown below.
The algorithm starts with the model containing only the constant function.
All subsequent functions are produced one at a time. At each step the algorithm enumerates all possible candidate basis functions T_m^c(x) and selects the
one whose inclusion in the model results in the largest improvement of the least
squares fit of the model to the data. The three nested internal loops (correspond-
ing to the s, j, kj loop variables) implement this selection process. The selected
basis function is added to the model.
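Algorithm 1 itself is not reproduced here; the following is only a hedged sketch of the forward-selection loop just described, using truncated-power factors on a fixed knot grid as assumptions of the example and omitting the backward elimination pass.

# MARS-style forward step: multiply an existing basis function by a univariate
# truncated-power factor and keep the candidate that most improves the LSQ fit.
import numpy as np

rng = np.random.default_rng(4)
n, d, J_max = 1_000, 3, 6
X = rng.uniform(size=(n, d))
y = 2 * np.maximum(X[:, 0] - 0.3, 0) * np.maximum(X[:, 1] - 0.5, 0) \
    + 0.05 * rng.normal(size=n)

knots = np.linspace(0.1, 0.9, 9)
B = [np.ones(n)]                                     # model: constant function only

def rss(columns):
    M = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    return np.sum((y - M @ coef) ** 2)

for _ in range(J_max):
    best = (np.inf, None)
    for parent in B:                                 # functions already in the model
        for j in range(d):                           # variables
            for t in knots:                          # knots
                cand = parent * np.maximum(X[:, j] - t, 0)
                score = rss(B + [cand])
                if score < best[0]:
                    best = (score, cand)
    B.append(best[1])
    print(f"added basis function, RSS = {best[0]:.3f}")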
The set of candidate basis functions comprises all basis functions which can be derived from the ones contained in the model via multiplication by a univariate basis function. Because of this definition of the set of candidates, the MARS algorithm allows for a considerable reduction in the computational cost compared with another popular technique, the forward subset selection procedure [18]. The number of basis functions Jmax produced
by MARS has to be specified by the user. It turns out that the quality of the model can be improved even further by removing the less useful tensor product basis functions from the model. This can be accomplished by means of the
backward elimination procedure (see [6] for details).
As mentioned, this approach can be modified to handle data of mixed types. Univariate indicator functions I[x ∈ A] can be used instead of the truncated powers whenever a categorical variable x is encountered in Algorithm 1. Thus, the typical tensor product basis function would have the form:

T_m(x) = \prod_{j=1}^{d_m^{num}} [ x_{v(j,m)} - t_{jm} ]_+ \; \prod_{j=1}^{d_m^{cat}} I[ x_{v(j,m)} \in A_{jm} ]
The algorithm for finding the appropriate subsets Ajm is very similar to the
ordinary forward stepwise regression procedure [18]. The detailed discussion of
the algorithm is given in [19].
MARS is thus based on truncated power basis functions which are used to form
tensor product basis functions. However, truncated powers are known to have
poor numerical properties. In our work we sought to develop a MARS-like algo-
rithm based on B-splines which form a basis with better numerical properties.
In our algorithm, called BMARS, we use B-splines of the second order (piecewise linear B-splines) to form tensor product basis functions \prod_{j=1}^{d} B_{k_j,j}(x_j).
Thus, the models produced by MARS and BMARS belong to the space of piece-
wise linear multivariate functions. In common with MARS, BMARS traverses
the space of piecewise linear multivariate functions until it arrives at the model
which provides an adequate fit. However, the way in which the traversal occurs
is somewhat different. Apart from being a more stable basis, B-splines possess a
compact support property which allows us to build models in the scale-by-scale
way. The pseudo-code (Algorithm 2) illustrates the strategy.
To implement the scale-by-scale strategy, one needs B-splines of different scales. The scale is the size of the support interval of a B-spline. Given a set \mathcal{K} of K = 2^{l_0} + 1 knots on a variable x, one can construct B-splines of l_0 + 1 different scales based on l_0 + 1 nested subsets \mathcal{K}_l of K^l = (K - 1)/2^{l-1} + 1 knots, l = 1, ..., l_0 + 1, respectively. The l-th subset is obtained from the full set by retaining every 2^{l-1}-th knot and disposing of the rest. Thus, the B-splines constructed using the l-th subset of knots have on average twice as long support intervals as the B-splines constructed using the (l - 1)-st subset.
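A small worked example of the nested knot subsets (the constants are assumptions of the example):

# With K = 2**l0 + 1 knots, the l-th subset keeps every 2**(l-1)-th knot,
# so it contains (K - 1)/2**(l-1) + 1 knots and the support roughly doubles
# from one scale to the next.
import numpy as np

l0 = 4
K = 2 ** l0 + 1                              # 17 knots
knots = np.linspace(0.0, 1.0, K)

for l in range(1, l0 + 2):
    subset = knots[:: 2 ** (l - 1)]          # retain every 2**(l-1)-th knot
    assert len(subset) == (K - 1) // 2 ** (l - 1) + 1
    print(f"scale l={l}: {len(subset)} knots, spacing {subset[1] - subset[0]:.4f}")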
At the start of the algorithm, the scale parameter l is set to the largest pos-
sible value l0 . Subsequently, B-splines of the largest scale only are used to form
new tensor product basis functions. Upon the formation of each new tensor prod-
uct basis function, the algorithm checks if the improvement of the fit due to the
inclusion of the new basis function is appreciable. We use the Generalised Cross-
Validation score [6] to decide if the inclusion of a new basis function improves
the fit. If this is not the case, the algorithm switches over to using B-splines of
the second largest scale.
Thus, new tensor product basis functions continue to be generated using B-
splines of the second largest scale. Again, as soon as the algorithm detects that
the inclusion of new basis functions fails to improve the fit, it switches over to
using B-splines of the third largest scale. This procedure is repeated until the
Jmax number of tensor product basis functions is produced.
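Algorithm 2 is likewise not reproduced here; the sketch below only illustrates the scale-switching control flow, using hat-function factors and a simple GCV-style score, both of which are assumptions of the example rather than the exact BMARS procedure.

# Scale-by-scale forward selection: candidates use only knots of the current
# scale; when the best candidate no longer improves the score, move to the
# next finer scale.
import numpy as np

rng = np.random.default_rng(5)
n, J_max, l0 = 1_000, 8, 4
x = rng.uniform(size=(n, 2))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=n)
knots = np.linspace(0, 1, 2 ** l0 + 1)

def hat(u, centre, width):
    return np.maximum(0.0, 1.0 - np.abs(u - centre) / width)

def gcv(columns):
    M = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    resid = np.sum((y - M @ coef) ** 2)
    J = M.shape[1]
    return resid / (n * (1 - 3 * J / n) ** 2)        # simple GCV-style score

B, scale, score = [np.ones(n)], l0, np.inf
while len(B) <= J_max and scale >= 1:
    sub = knots[:: 2 ** (scale - 1)]                  # knots of the current scale
    width = sub[1] - sub[0]
    cands = [p * hat(x[:, j], c, width)
             for p in B for j in range(2) for c in sub]
    scores = [gcv(B + [c]) for c in cands]
    best = int(np.argmin(scores))
    if scores[best] < score:                          # appreciable improvement
        B.append(cands[best])
        score = scores[best]
    else:
        scale -= 1                                     # switch to the next finer scale
print(f"model uses {len(B)} basis functions, final score {score:.4f}")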
The advantage of this strategy over that of MARS is that it results in a consid-
erable reduction of the number of candidate basis functions to be tested at each
step of the algorithm. This is due to the fact that the number K_j^l of B-splines of a particular scale l is less than the total number of knots K_j: K_j / K_j^l = 2^{l-1}. This ratio is greater than one for all scales but the smallest one (l = 1). This results in fewer iterations carried out by the inner-most loop of Algorithm 2 compared to the similar loop of Algorithm 1. The results of
experiments suggest that this reduction in the computational complexity comes
at no cost in terms of the quality of the resulting models [20].
It can be shown that the computational complexity of both MARS and BMARS
algorithms is linear in the number of data points as well as the number of at-
tributes. However, when large amounts of data are to be processed, the computational time can still be prohibitively large. In order to reduce the cost of
running BMARS we have developed a parallel version of the algorithm based on
the Parallel Virtual Machine (PVM) system [21]. An advantage of PVM is its
wide availability on a number of platforms, so that software based on it is very
portable.
The idea of the parallel BMARS is very simple. Following from the structure of Algorithm 2, each new tensor product basis function is the best
function selected from the pool of candidates. The goodness of each candidate is
determined via least squares fit. It turns out that these least squares fits account
for the bulk of the computational cost of running BMARS. Thus, an efficient
parallelisation of BMARS can be achieved via parallelisation of the least squares
procedure. We use the Gram-Schmidt algorithm [22] to perform the least squares
fit. It amounts to the computation of a number of scalar products and, there-
fore, can be efficiently parallelised using the data-partitioning approach (see, for
example [21]).
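As an illustration of the data-partitioning idea only (not the PVM implementation), the sketch below computes the scalar products needed by such a least-squares step as sums of per-block partial results; Python's multiprocessing stands in for the PVM workers, and the vectors and block count are assumptions of the example.

# Scalar products <u, v> and <u, y> computed as sums of partial results
# over p blocks of the data.
import numpy as np
from multiprocessing import Pool

def partial_products(args):
    """Partial <u, v> and <u, y> over one block of the data."""
    u_block, v_block, y_block = args
    return np.dot(u_block, v_block), np.dot(u_block, y_block)

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    n, p = 1_000_000, 4
    u, v, y = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)

    blocks = [(u[i::p], v[i::p], y[i::p]) for i in range(p)]   # partition the data
    with Pool(p) as pool:
        parts = pool.map(partial_products, blocks)

    uv = sum(a for a, _ in parts)        # full scalar products assembled from parts
    uy = sum(b for _, b in parts)
    print("relative error vs serial:", abs(uv - np.dot(u, v)) / abs(np.dot(u, v)))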
Parallel BMARS was tested on a multiprocessor system having 10 SPARC
processors. It was applied to the analysis of a large motor vehicle insurance data
set (∼ 1, 000, 000 data records) [20] as well as taxation data [17]. The results of
the experiments show that the efficiency of the algorithm is close to that of an
ideal algorithm [20].
Once again, by focusing on issues relating to the performance of the algo-
rithms on extremely large datasets from real world applications, significant im-
provements can be made in the “responsiveness” of the algorithms. The result is
that these tools can be employed much more effectively in data mining.
In the mdOrb visualisation, each dimension of the data is represented by an axis whose end lies on the surface of a sphere. A convex hull is generated to connect the ends of the axes together. The axes and the space that they form can be used for a number of visualisation strategies, including rectangular prisms and the use of density functions.
Each data entry is plotted by placing a vertex on each of the orb’s axes at a length determined by the entry’s value in the corresponding dimension. Each entry is thus represented by a tessellation of polygons similar to the triangles that define the three-spaces of the orb. The entry’s tes-
sellation of polygons is identical to the triangles forming the orb surface mesh as
shown in Figure 1, but the length of each vertex from the origin varies according
to the value of the data point in the given dimension. Due to occlusion problems
we do not render each individual polygon opaquely. Instead we render a density
function that illustrates how many polygons pass through each region in space.
Densely populated areas appear opaque whilst sparsely populated areas appear
transparent.
The mdOrb is not a static visualisation, but rather a framework on which
dynamic interactive investigations can be carried out. Unlike a scatter plot ma-
trix [33] the mdOrb does not display every possible combination of dimensions
concurrently. Rather the only combinations shown are those in close proximity
to each other as determined by the current tessellation. However, this does not
mean that visible relationships are limited. Each axis can be moved around the
orb at will, thus allowing the user to pry apart certain regions or close them
together. Additionally, if the user moves an axis past the bounds of the triangles
that it forms, the surface mesh is recalculated for the new axis position. This
allows the user to interactively change the combinations of axes and their neigh-
bours. For example, if a user wishes to plot two dimensions against each other
they simply move the relevant axes until they are adjacent; a visual guide of the current tessellation, like that shown in Figure 1, aids them in this task.
A user may wish to “brush” (or highlight) a region of interest in the orb.
When brushing occurs all marks or other representations that correspond to the
same data entries can be highlighted. For example, if a user brushes a cluster
in one three-space, then the marks in all other three-spaces that correspond to
those same entries will also be highlighted. In this way the user can correlate
the different properties of individual entries or groups of entries across the entire
multidimensional space.
The graph visualisation supports both examining the structure from a global viewpoint and drilling down into a particular group of nodes and edges, with irrelevant sections of the graph removed for clarity.
Figure 3 shows a visualisation of a program written in the Java language. The
view shows the entire Java API with edges representing inheritance links; the program itself is shown by the grey group of nodes. The navigation icons in the lower left corner allow the user to interactively control which groups are visible. Each group of nodes is represented by a node icon and each group of edges by an edge icon; by selecting and deselecting the icons, the groups of elements in the graph are turned on and off. In Figure 4 many of the groups of nodes and edges have been turned off; the only ones remaining in view are the nodes of the program and the Java packages that it inherits from. The viewpoint has been
rotated and zoomed into the visible part of the graph to examine it in greater
detail.
6 Data Management
Data is stored in a variety of formats and within a variety of database systems
and data warehouses, across multiple platforms. The data needs to be accessed in a timely manner, often after it has been pre-processed to suit the particular application, and the data will often need to be accessed multiple times within a single application. Efforts in this direction, including the ongoing development of the Data Space Transfer Protocol [36], have begun to demonstrate
the significance of the data access issue. Here, we describe an initial approach
to effectively and seamlessly providing sophisticated data access mechanisms for
data mining. A particular focus of this research is on smart caching and other
optimisations which may be tuned for particular classes of analysis algorithms
to improve the run time performance for data mining over very large datasets.
We are employing the semantic extension framework (SEF) for Java as the en-
vironment for this work.
The semantic extension framework (SEF) for Java and the High Performance
Orthogonal Persistent Java (HPOPJ) built on top of SEF [37] are abstraction
tools which provide orthogonality of algorithms with respect to the data sources.
This approach allows datasets to be transparently accessed and efficiently managed from any source. Algorithms accessing the data simply view the data as Java data structures which are intended to be efficiently instantiated as required, as determined by the semantic extensions provided for the relevant objects. We are now exploring the use of the SEF and HPOPJ to provide
orthogonality and optimised access to large scale datasets.
An important problem encountered when designing data mining applications
is that the programming language and the database system are different envi-
ronments. Moreover, most databases do not support the same data model as the
programming language. This quite common phenomenon, called the impedance
mismatch, means that the programmer has to map persistent variables onto
the database environment. Solving such mapping problems and keeping explicit
track of persistent information wastes a significant portion of development time
(sometimes more than 30%) and accounts for many programming errors. The
use of the SEF for data mining enables prototype-oriented development where
complex algorithms are implemented and tested quickly.
The first two approaches clearly violate the goal of portability as they depend
on a modified virtual machine. The next three approaches produce portable byte-
codes but require each producer of semantically extended code to have access to
a modified compiler or preprocessor. Moreover, the compilation approach pre-
cludes the dynamic composition of semantic extensions. Only the last method is
compatible with our goals of dynamic composition and portability. Consequently,
we have adopted the last approach to semantic extensions as the basis for our
semantic extension framework and our OPJ implementation (a semi-dynamic
approach).
Byte-code transformations are notoriously error prone. A simple mistake dur-
ing the transformation process can destroy type safety or the semantics of the
program, and may lead to the byte-code modified class being rejected at class
load time. A type-safe and declarative way to specify program transformations is
essential to the practical application of byte-code transformations. To this end,
we have defined the Semantic Extension Framework. Our framework allows for
both the semantic extension of methods and the inclusion of special ‘triggers’
(similar in concept to database triggers) that are activated on the occurrence of
particular events such as the execution of getfield or putfield Java byte-codes.
The semantic extension framework is invoked when a user class is loaded. This
action triggers a special semantic extension class loader to search for and load
any semantic extension classes that are applicable to the user class being loaded.
A first prototype of the framework has been implemented. It has been ap-
plied to the implementation of a portable OPJ and a portable object versioning
framework. We have implemented the framework using the ‘PoorMan’ library
that provides facilities for class file parsing and basic class transformations [47].
Daisy salad
Tongue salad
Make a good French dressing. Dip into it firm, crisp lettuce leaves.
Have ready cold boiled tongue, cut as thin as writing paper. Lay a
slice upon each leaf, and serve with heated and buttered crackers.
You can substitute ham for the tongue.
Tomato aspic
Carefully peel and halve ripe tomatoes and lay them on the ice for
several hours. Transfer to a chilled platter, sprinkle with salt, garnish
with lettuce leaves and put a great spoonful of whipped cream upon
each tomato half.
Pour boiling water over large, smooth tomatoes to loosen the skins,
and set on ice. When perfectly cold, gouge out the center of each
tomato with a spoon, and fill the cavity with boiled corn cut from the
cob and left to get perfectly cold; then mix with mayonnaise
dressing. Arrange the tomatoes on a chilled platter lined with
lettuce, and leave on ice until wanted. Pass more mayonnaise with
the salad.
(Contributed)
Cook a quart of raw tomatoes soft, strain and season with nutmeg,
sugar, paprika, a pinch of grated lemon peel and salt. Freeze until
firm; put a spoonful upon a crisp lettuce leaf in each plate, cover
with mayonnaise and serve immediately. It is still prettier if you can
freeze it in round apple-shaped molds.
Canned tomatoes may be used if you have not fresh.
Clam salad
(Contributed)
Remove the skins and black heads of cold clams. Marinade for ten
minutes in a French dressing and serve on a bed of shredded
lettuce.
Pear salad
(Contributed)
Peel and slice five sweet, ripe pears, sprinkle with fine sugar, and
add a little maraschino or ginger syrup. Serve with a little cream. Or
pare and slice enough ripe, sweet pears to make one pint; add one-
half cupful of blanched and chopped almonds, one-fourth of a cupful
of powdered sugar and the strained juice of two lemons. Serve in a
cup of lettuce leaves made by placing together the stem end of two
lettuce leaves taken from the inside of a head of lettuce.
(Contributed)
Put into a frying-pan one-fourth of a pound of bacon, cut into dice;
when light brown take out and sauté in the fat a small onion cut
fine. Add one-half as much vinegar as fat, a few grains of salt and
cayenne and one-half as much hot stock as vinegar. Have ready the
potatoes boiled in skins. Remove the skins and slice hot into the
frying-pan enough to take up the liquid. Add the diced bacon, toss
together and serve.
(Contributed)
To one cupful of shrimps add two cupfuls of cold cooked asparagus
tips, and toss lightly together. Season with salt and pepper. Make a
dressing of the yolks of three hard-boiled eggs, rubbed through a
sieve, and sufficient oil and vinegar to make the consistency of
cream, using twice as much oil as vinegar. Pour over the asparagus
and shrimps.
Asparagus salad
(Contributed)
Asparagus tips heaped on lettuce leaves and served with French,
mayonnaise or boiled dressing, poured over all, make a very good
salad.
Endive salad
(Contributed)
Use the well-blanched leaves only. Wipe these with a damp cloth.
Pour over this a French dressing and serve with roasted game.
(Contributed)
Marinate one pair of sweetbreads in French dressing. Chill
thoroughly. Drain and mix with equal parts of sliced cucumber; cover
with French dressing into which has been stirred whipped cream.
Spinach salad
(Contributed)
Select the young, tender leaves from the center of the stock; wash
carefully, drain and chill and serve with French dressing.
Lenten salad
(Contributed)
Line the bottom of the salad-dish with crisp lettuce leaves. Fill the
center of the dish with cold boiled or baked fish, cut into pieces, and
pour over it a pint of mayonnaise dressing. Garnish with rings of
hard-boiled eggs.
(Contributed)
Pare and cut into small pieces four medium-sized apples. Pour over
this a French dressing. Pick carefully the leaves from a bunch of
cress. Arrange around the outside of the salad-dish and heap the
apples in the center of the dish.
Strawberry salad
(Contributed)
Choose the heart from a nice head of lettuce, putting the stems
together to form a cup. Put a few strawberries in the center and
cover with powdered sugar and one teaspoonful of mayonnaise
dressing.
Banana salad
(Contributed)
Sliced bananas, served in the same manner as the strawberries in
the above recipe, make an excellent salad.
Veal salad
(Contributed)
Use equal parts of well-cooked cold veal cut into small pieces, and
finely-chopped white cabbage. Marinate the veal for two hours.
Drain and mix with the cabbage. Season with salt and pepper, and a
little chopped pickle, and cover with mayonnaise dressing.
Cherry salad
(Contributed)
Stone a pint of large cherries, being careful not to bruise the fruit.
Place a hazelnut in each cherry to preserve the form. Chill
thoroughly, arrange in a salad dish on lettuce leaves and pour over
all a cream mayonnaise dressing.
Peach salad
(Contributed)
Pare a quart of ripe yellow peaches, and cut into thin slices; slice
very thin a half cupful of blanched almonds. Mix the fruit and nuts
with two-thirds of a cupful of mayonnaise, to which has been added
one-third of a cupful of whipped cream. Serve immediately on
lettuce leaves.
Ham salad
(Contributed)
Mix equal portions of minced, well-cooked ham and English walnuts
or almonds. Serve with mayonnaise on lettuce leaves.
(Contributed)
Wash the sweetbreads thoroughly and let them stand in cold water
half an hour. Boil in salted water twenty minutes and then put in
cold water again for a few minutes, to harden. To one cupful of
minced sweetbreads add one cupful of diced celery and one-half
cupful of chopped nuts. Cover well with mayonnaise dressing to
which some whipped cream has been added.
(Contributed)
Select fresh string beans and boil until tender in salted water. Or use
a good quality of canned string beans. Arrange on a dish and serve
with mayonnaise dressing.
Pea salad
(Contributed)
Drain and press through a sieve a can of green peas. Dissolve one
box of gelatine in one-fourth of a cup of cold water and stir over a
hot fire until heated. Take from the fire and add one-fourth
teaspoonful of onion juice, one-half teaspoonful of salt, and a dash
of pepper. Serve very cold with the following dressing: Put into a
double boiler the yolks of two eggs, two tablespoonfuls of stock and
two tablespoonfuls of oil. Stir until thick, take from the fire and add
slowly one tablespoonful of tarragon vinegar, one chopped olive and
two teaspoonfuls of chopped parsley.
Nut salad
Blanch almond kernels, and when cold and crisp shred into shavings.
Mix with these an equal quantity of English walnuts, broken into bits,
and pecan kernels. Stir a good mayonnaise dressing into the mixture
and heap within curled lettuce leaves.
LUNCHEON FRUITS, COOKED AND
RAW
Stewed rhubarb
Select only good, firm stalks, and reject those that are withered. Lay
them in cold water for an hour, and cut into half-inch pieces. Put
them over the fire in a porcelain-lined saucepan and strew each
layer plentifully with sugar. Pour in enough water to cover all, and
bring very slowly to a boil. Let the rhubarb stew gently until it is very
tender, then remove from the fire. When cold, serve with plain cake.
For every cupful of raw rhubarb cut into inch lengths add a third as
much of raisins seeded and cut in half. Cook until soft, as directed in
last recipe.
Stone a quarter of a pound of dates, cover with hot water, and cook
five minutes. Add three cupfuls of raw rhubarb, cut into inch lengths,
and cook, closely covered, until the rhubarb is tender. Sweeten to
taste and set aside to cool in a covered bowl, after which set on ice
until needed.
Soak a quarter-pound of figs in warm water for two hours. Cut into
small pieces and cook as previously directed with three cups of raw
rhubarb, cut into inch lengths, until the rhubarb is tender. Eat cold.
This dish is cooling to the blood, gently laxative and pleasing to the
taste.
Stewed gooseberries
Remove the tops and stems from one quart of gooseberries, wash
and drain. Put them into a saucepan with barely enough boiling
water to cover them. Let them stew until tender. Dissolve one cupful
of sugar in one-half cupful of water and boil to a syrup, then mix it
with the fruit and set away to cool.
Agate-nickel-steel ware is altogether the best in the market for
stewing acid fruits. They should never be cooked in tin or in iron,
and unless copper has just been cleaned with vinegar to remove all
suspicion of verdigris, the use of it is dangerous. I can not say too
much of the ware I have named. It is easily kept clean, durable and
safe.
Make in the same way of ripe, tart apples, a seasoning with mace or
nutmeg to taste. When it has cooled set on ice until wanted.
Stewed apples
Pare and core a dozen tart, juicy apples. Put them into a saucepan
with just enough cold water to cover them. Cook slowly until they
are tender and clear. Then remove the apples to a bowl, and cover
to keep hot; put the juice into a saucepan with a cupful of sugar,
and boil for half an hour. Season with mace or nutmeg. Pour hot
over the apples and set away covered until cold. Eat with cream.
Wash and core, but do not pare them. Arrange in a deep pudding-
dish; put a teaspoonful of sugar and the tiniest imaginable bit of salt
into the cavities left by coring; pour in a half cupful of water for a
large dishful of apples; cover closely and bake in a good oven forty
minutes or until soft.
Eat ice-cold, with cream and sugar.
Stewed prunes
Wash dried prunes and soak them for at least five hours in cold
water. Put them into a saucepan with enough water to cover them
and simmer very gently for twenty minutes. Now add sufficient
granulated sugar to sweeten liberally, and simmer until the prunes
are tender. Take from the fire and set aside to cool. Eat with plain
cake.
Steamed prunes
Prunelles are more than subacid, and need the modifying influence
of sweeter fruits. Allow equal parts of prunelles and of the small
sultana raisins. Wash the fruit in tepid water, and soak it in enough
cold water to cover it for several hours, on the back of the range.
Draw them forward where they will simmer gently until soft. Add
sugar to taste, let the syrup boil up once, then set away to cool.
Stewed cherries
None of our small fruits are more injured by transportation than
these same luscious and ruddy lobes. If you must buy cherries which
are brought from a distance and are, of necessity, several days old,
cook them if you regard the welfare of the digestive organs of your
family. The verse that tells us “cherries are ripe” would be more
reassuring if it also informed us that they were recently picked.
Wash and pick over carefully; put over the fire in a “safe” saucepan,
such as I have already indicated, with just enough water to prevent
burning, cover closely and stew until soft, but not broken. Strain off
the liquor; set aside the cherries in a covered bowl, add three
tablespoonfuls of sugar to each pint of the juice, return to the fire;
boil fast for half an hour and pour over the fruit. Keep covered until
cold.
Raw cherries
To be eaten at their raw best they should be kept in the ice-box until
needed. Then they may be served with their stems still on in a glass
bowl with fragments of ice scattered among them.
Sugared cherries
Use large, firm cherries for this dish. Have in front of you a soup-
plate containing the whites of three eggs mixed with five
tablespoonfuls of cold water, another plate filled with sifted
powdered sugar at your right, the bowl of cherries at your left. Dip
each cherry in the water and white of egg, turn it over and over in
the sugar and lay on a chilled platter to dry. When all are done sift
more powdered sugar over the fruit and arrange carefully on a glass
dish.
Glacé cherries
Select firm, sweet cherries from which the stems have not been
removed. Into a perfectly clean porcelain-lined saucepan put a
pound of granulated sugar and a gill of cold water, and boil to a
syrup. Do not stir during the process of cooking. Try the syrup
occasionally by dropping a little in cold water. When it changes to a
brittle candy it is done. Remove the saucepan at once from the fire
and set it in a pan of boiling water. Dip each cherry quickly in the hot
syrup and lay on a waxed paper to dry. If the syrup shows signs of
becoming too thick, add more boiling water to that in the outside
pan. When all the cherries have been “dipped” stand them in a
warm place to dry.
Cut the top from a pineapple and carefully remove the inside, so that
the shell may not be broken. Cut the pulp into bits, mix it with the
pulp of three ripe oranges, also cut very small, and liberally sweeten
the mixture. Smooth off the bottom of the pineapple shell so that it
will stand upright, refill with the fruit pulp, put on the tip and set in
the ice for three hours.
Creamed peaches
Lay large, ripe free-stone peaches on the ice for several hours, peel,
cut them in half and remove the stones. Whip half a pint of cream
light, with two tablespoonfuls of powdered sugar. Fill the hollows left
by the stones to heaping with the whipped cream. Keep in the ice-
box until time to serve the fruit.
Cut grapefruit in half and remove the tough fiber and part of the
pulp. Chop this pulp and add it to mashed and sweetened
strawberries. Refill the grapefruit rinds with the mixture, and set on
the ice for an hour or two.
Strawberries and cream
Cap the berries, one at a time, using the tips of your fingers. The
practice of holding capped berries in the hollow of the hand until one
has as many as the space will accommodate, is unclean and
unappetizing. Cap them deftly and quickly, letting each fall into a
chilled bowl, and do this just before serving, keeping in a cool place
until they are ready to go to table. Pass powdered sugar and cream,
also ice-cold, with them.
Select sweet, ripe pears and lay them in the ice for two hours. Do
not peel until just before they are needed. Pare deftly and quickly,
slice, sprinkle with sugar, cover with cream and serve.
Bananas are very good treated as the pears were in the last recipe.
It is a good plan to bury these in the ice until wanted for dessert.
Then the hostess may, at the table, quickly peel and slice them into
different saucers. Bananas thus prepared do not have time to
become discolored from exposure to the air.
SWEET OMELETS
Apple sauce omelet (baked)
Beat the yolks of seven eggs light; stir into them five tablespoonfuls
of powdered sugar and a cupful and a half of sweetened apple
sauce. Beat long and hard, stir in the stiffened whites, beat for a
minute longer and turn into a greased pudding-dish. Bake, covered,
for about ten minutes, then uncover and brown. Serve at once with
whipped cream. It is also good served with a hot sauce made by the
following recipe:
Into a pint of boiling water stir a half-cupful of sugar, and when this
dissolves add a teaspoonful of butter, the juice and the grated rind
of a lemon and the stiffened white of an egg. Beat for a minute over
the fire, but do not let the sauce boil.
Jam omelet
Omelet soufflé
Beat the yolks of five eggs very light, adding, gradually, four
tablespoonfuls of powdered sugar. In another dish whip the whites
to a standing froth. With a few long strokes blend the two; pour into
a buttered bake-dish and bake quickly. Sift powdered sugar on the
top at the end of two minutes, and very quickly, as the omelet will
fall if the oven stands open even a few seconds. Serve at once in the
bake-dish.
Orange omelet
(Contributed)
Beat the yolks of five eggs together until thick and lemon-colored.
Add five tablespoonfuls of orange juice, the grated rind of one
orange and five tablespoonfuls of powdered sugar. Then fold in
lightly the beaten whites of four eggs. Put a little butter in an omelet
pan, and when hot pour in the omelet mixture and spread in evenly.
Let it cook through, but not harden. Fold the edges over and turn
out upon a hot dish. Serve with a dressing of sliced oranges and
powdered sugar.
(Contributed)
Beat the yolks of three eggs very light. Then fold in the whites
beaten dry. Turn into an omelet pan in which one teaspoonful of
butter has been melted. Spread the omelet evenly and cook over a
slow fire to set the eggs. Then put in the oven until done. Spread
one-half of the omelet with marmalade, fold and serve on hot
platter.
There is not that household in the land where servants are employed
which is not measurably dependent upon them for peace of mind as
well as for comfort of body. Every housewife who reads this will
recall the sinking of heart, the damp depression of spirit, which has
suddenly overtaken a cheerful mood when the kitchen barometer
beckoned “storm” or “change.” Such an overtaking is not an
affliction, but it sometimes comes dangerously near to sorrow. The
independent maid of all work has it in her power to alter the family
plans with a word, when that word is “going.” Should she elect to
stay, her lowering brows and sharp or sullen speech abash a
mistress who quails at little else. In wealthier households a domestic
“strike” involves panic, disorder and suffering.
I know of a wet-nurse whose abandonment of her infant charge,
without a word of warning, at ten o’clock one Saturday night, caused
a long and terrible illness, resulting in infantile paralysis. A cook who
had lived in one family for three years resented the arrival of
unexpected guests, packed her trunk and left her mistress to get
dinner. The lady was in delicate health and all unused to such work.
She became overheated and exhausted, took a heavy cold, which
ripened into pneumonia, and died three days after the cook’s
desertion.
I need not multiply illustrations of the helplessness of American
housewives in the face of such disasters, and the possibility that
these may befall any one of us. We have no redress. The women
who helped organize the “Protective League” know this. The law
does not protect the employer. Public opinion gives her no support.
The cook whose fit of temper cost a kind mistress her life was
recommended to me within a month after an event that should have
shocked the moral sense of every housewife in the community, and
recommended by a friend of the murdered woman and of myself.
When I exclaimed in surprise, I was told: “We can not be judges of
our neighbors’ domestic affairs.”
There is no class spirit among us. For some reasons this is a matter
of congratulation to us and the public. All that is needed to make the
opening gulf between mistresses and maids impassible is
organization on our part, which signifies open war. It is,
nevertheless, I note in passing, patent that there should be a code
of honor among us with regard to employment of those who have
proved absolutely untrustworthy in other households.
We are not true to one another in this matter, and our employées,
who are held together by the unwritten laws of a union, none the
less strong because nameless and informal, know this as well as we
do. The knowledge is one of the most potent weapons in their
armory.
Let this pass for the present. I would direct your attention, my
sister-worker in the home missionary field, to the brighter side of the
vexed question.
After forty years’ careful study of this matter of domestic service—
study carried on in other lands as well as in our own—I record
thankfully my conviction that the domestics in well-regulated
American homes are better cared for, better paid and more
thoroughly appreciated than any other class of working women in
this country or abroad. I record, likewise and confidently, that the
proportion of faithful, valued and even belovéd domestics among us
is much larger than that of indifferent or worthless. Most cheerfully
and thankfully I add to this record that, personally, I have a list of
honest, virtuous, willing workers, whose terms of service in my
family varied from three to thirteen years, and who went from my
house to homes of their own, bearing with them the cordial esteem
of those they had served. Nor is my experience singular, even in
these United States. It is so far from being exceptional that I
deprecate, almost as an individual grievance, any attempt to
organize those who should be our coworkers into a faction that
considers us as “the opposition.” It is a putting asunder of those
whom a mutual need should join together.
Backed by my two-score years of experiment and action, I dare
believe that a leaf or two from my book of household happenings
may be of service to younger women and novices in the profession
which absorbs the major part of our time and strength.
To begin with—beware of discouragement during the early trial-days
of the new maid. Be slow to say, even to yourself: “She will never
suit me!” The first days and weeks of a strange “place” are a crucial
test for her as for you, and she has not your sense of proportion,
your discipline of emotion and your philosophical spirit to help her to
endure the discomforts of new machinery.
Looking back upon my housewifely experiences, I am moved to the
conclusion that the domestics who stayed with me longest and
served me best were those who did not promise great things in their
novitiate.
One—“a greenhorn, but six weeks in the country”—frankly owned
that she knew nothing of American houses and ways. She was
“willing to learn,” and—with a childish tremble of the chin—“didn’t
mind how hard she worked if people were kind to her.” I think the
quivering chin and the clouding of the “Irish blue” eyes moved me to
give her a trial. She did not know a silver fork from a pepper cruet,
or a tea-strainer from a colander, and distinguished the sideboard
from the buffet by calling the one the “big,” the other the “little
dresser.” She had been with me a month when I trusted her to
prepare some melons for dessert, giving her careful and minute
directions how to halve the nutmeg melons, take out the seeds and
fill the cavities with cracked ice, while the watermelon—royal in
proportions and the first fruits of our own vines—was to be washed,
wiped, and kept in the ice-chest until it was wanted.
At dinner the “nutmegs” appeared whole; the watermelon had been
cut across the middle and eviscerated—scraped down to the white
lining of the rind—then filled with pounded ice. The succulent
sweetness, the rosy lusciousness of the heart, had gone into the
garbage can.
Nevertheless, I kept blue-eyed Margaret for eight years. She stands
out in my grateful memory as the one and only maid I have ever
had who washed dishes “in my way.” Never having learned any
other, she mastered and maintained the proper method.
The best nursery-maid I ever knew, and who blessed my household
for eleven years, objected diffidently at our first interview to giving a
list of her qualifications for the situation. She “would rather a lady
would find out for herself by a fair trial whether she would fit the
place or not.” I engaged her because the quaint phrase took my
fancy. She proved such a perfect fit that she continued to fill the
place until she went to a snug home of her own.
What may be called the New Broom of Commerce has no misgivings
as to her ability to fill any place, however important. Upon inquiry of
the would-be employer as to the latter’s qualifications for that high
position, the N. B. of C. may decline to accept her offer of an office
which promises more work than “privileges.” But she could fill it—full
—if she were willing to “take service” with the applicant.
One of the oddest incongruities of the new-broom problem is that
we are always disposed to take it at its own valuation. With each
fresh experiment we are confident that—at last!—we have what we
have been looking for lo! these many years. She is a shrewd house-
mother who reserves judgment until the first awkward week or the
crucial first month has brought out the staying power or proved the
lack of it.
Officious activity in unusual directions is a bad omen in the New
Broom of Commerce. In sporting parlance, I at once “saw the finish”
of one whom I found upon the second day of service with me
washing a window in the cellar. She “couldn’t abide dirt nowhere,”
she informed me, scrubbing vehemently at the dim panes. I had just
passed through the kitchen where a grateful of fiery coals was
heating the range plates to an angry glow. All the drafts were open;
the boiler over the sink was at a bubbling roar; upon the tables was
a litter of dirty plates and dishes; pots, pans and kettles filled the
sink.
It is well to have a care of the corners, but the weightier matters of
the law of cleanliness are usually in full sight.
I once knew a woman who, deliberately, and of purpose, changed
servants every month. She said no new broom lasted more than four
weeks, and when one became grubby and stumpy she got rid of it.
Her house was the cleanest in town and her temper did not seem
worse for friction.
Another woman who, strange to tell, lived to be ninety years old,
“liked moving” and never lived two years in one and the same
house. She maintained that she kept clear of rubbish by frequent
flittings, and enjoyed rubbing out and beginning again. Personally, I
should have preferred a clean, lively conflagration every three years
or so, but she throve upon nomadism.
In minor details of housewifery, as in more important, make up your
mind how you will manage the home and turn a deaf ear to
gratuitous suggestions from people whose own households would be
better conducted if their energies were concentrated.
Let one example suffice: A so-called reformer felt herself called in
(or out of) the Gospel of Humanity, the other day, to inveigh in a
parlor lecture upon the unkindness and general unchristianliness of
the maid’s cap and apron which all would-be stylish mistresses insist
upon. “Have I, a Christian woman in a republic,” cried the oratress,
“the right to put the badge of servitude upon my sister woman,
because, having less money than I have, she is obliged to earn her
living? Do I not tend to degrade, instead of elevating her?
“Of a piece with the cap and apron is the black dress, now ‘the thing’
for girls in domestic service. Why should not Bridget and Dinah
exercise their own right in dress as well as I?”
These questions have been put to me many times by women who
think and act for themselves without regard to arbitrary
conventionalities.
I am so well assured that most conventionalities have a substratum
of common sense that I am slow to condemn any one of them.
I dispute, at the outset, the insinuation that black dress, white cap
and apron are a badge of servitude. I know no more independent
class of women than trained nurses, no more arbitrary men than
railway officials. I should certainly never consider the distinctive garb
of the Sisters of Charity—Protestant or Roman Catholic—as
degrading. The idea of humiliation attached to the uniform of
housemaid and child’s nurse in the mind of employees or employer is
founded upon the conviction that domestic service demeans her who
performs it. This is precisely the prejudice which sensible,
philanthropic women are trying to beat down—a prejudice that has
more to do with the complications of the servant question than all
other influences combined. If I hesitate to ask a maid entering my
service to wear the uniform of her calling, I intimate too broadly to
be misunderstood that there is something in that service which
would demean her were it generally known that she is in it.
I had one maid, years ago, who would not run around the corner to
grocery or haberdasher’s without taking time to put on her Sunday
coat and hat, and to lay off her apron. When I spoke to her of the
absurdity and inconvenience of this, she confessed, blushingly, that
the porter at the grocery was “keeping company with her,” and “it
was nat’ral a gurrel should want to look her best when she was like
to see him.”
“Ah,” I said, “doesn’t he know what your position is in my house?
Has he never seen you in cap and apron?”
“Shure, mem! Every day when he fetches the groceries.”
“Then, if he is a sensible fellow, he will respect you all the more for
not pretending to be what you are not. Since he knows what your
business is, show him that you are not ashamed of it. You are as
respectable in your place as he is in his—as I am in mine—always
providing that you respect your service and yourself.” Call the
distinctive dress of your maid a “uniform,” not a livery. Point out to
her the examples of trained nurses, of railway conductors, of the
very porters who “keep company” with her; the policemen she
admires afar off; the soldiers, whose brass buttons dazzle her
imagination. Remind her that saleswomen in fashionable shops wear
the black gown, white apron, deep linen collar and cuffs and pride
themselves upon looking their best in them. Especially make her
comprehend (if you can, for the ways of the untrained mind are past
finding out), that she has an honorable calling and need not be
ashamed to advertise it.
Congratulate yourself, above all, that a sensible fashion holds back
Bridget and Dinah from the “exercise of their own taste in dress.”
The modification of that taste wrought by the neat and modest
costume prescribed by a majority of modern housewives may be in
itself a good thing, sparing the eyes of spectators of her toilettes
when she becomes “Mrs.” and independent, and the purse of the
porter, or truckman, or mechanic, who will have to pay for them.
I have laid stress upon the advantages of long terms of service, to
maid and to mistress. Like all other good things it has its perils and
its abuses to be avoided.
Two-thirds of the scandals that poison the social atmosphere steal
out, like pestilential fogs, through servants’ gossip. We discuss “the
girl” in our bedchambers, and if so much stirred up by her works and
ways as to forget what is due to our ladyhood, compare notes in the
parlor as to these same works and ways. Being well-bred women,
the traditions of our caste prevent us from making domestic
grievances the staple of drawing-room conversation and the marrow
of table-talk. The electroplated vulgarian never calls attention more
emphatically to the absence of the “Sterling” stamp upon her
breeding, than when she chatters habitually of the virtues and the
faults of her household staff.
On the other hand, the most sophisticated of us would be amazed
and confounded if she knew what a conspicuous part She plays in
talk below stairs and on afternoons and evenings “out.”
Thackeray, prince of satirists, puts it cleverly:
“Some people ought to have mutes for servants in Vanity Fair—
mutes who could not write. If you are guilty—tremble! That fellow
behind your chair may be a Janissary with a bowstring in his plush
breeches pocket. If you are not guilty, have a care of appearances,
which are as ruinous as guilt.” We should be neither shocked nor
confounded that these things are so. If we are mildly surprised, it
argues ignorance of human nature, and of the general likeness of
one human creature to another, that proves the whole world kin.
When mistresses in Parisian toilettes, clinking gold spoons against
Dresden as they sip Bohea in boudoir or drawing-room, raise their
eyebrows or laugh musically over the latest bit of social carrion in
“our set”—Jeames or Abigail, who has caught a whiff at a door ajar,
or through a keyhole, is the lesser sinner in serving up the story in
the kitchen cabinet. The domestics are in, yet not of, the employer’s
world, living for six and a half days of the week among people with
whom they have no affinity by nature or education. Where we would
talk of “things,” the lower classes discuss what they name “folks.”
Their range of thought is pitifully narrow; the happenings in their
social life are few and tame. What wonder if they retail what we say
and do and are, as sayings, doings and characters appear to them?
What would be extraordinary, if it were not so common, is the
opportunity gratuitously afforded in—we will say, guardedly—one
family out of three for the collection of material for these sensations
of the nether story. I speak by the card in asserting that the
influence gained by the confidential maid over her well-born, well-
mannered, well-educated mistress is greater than that possessed by
any friend in the (alleged) superior’s proper circle of equals.
Without taxing memory I can tell off on my fingers ten
gentlewomen, in every other sense of the word, whose intimate
confidantes are hirelings who were strangers until they entered the
employ of their respective mistresses(?). We need not cross the
ocean to listen with incredulous horror to insinuations and open
assertions as to the hold a gigantic Scotch gilly acquired over a royal
widow. Our next-door neighbors on both sides and our
acquaintances across the way are in like bondage.
I have in mind one of the best and most refined women I ever knew
whose infatuation for her incomparable Jane was the laughing-stock
of some, the surprise and grief of others. Jane disputed the dear
soul’s will, oft and again; gave her more advice than she took, and,
behind her back, ridiculed her unsparingly—as many of the
mistress’s friends were aware. The dupe would resign the affection
and society of one and all of her compeers sooner than part with
Jane.
Another “just could not live without my Mary.” The remote
suggestion throws her into a paroxysm of distress. Her own husband
knows it to be necessary to warn her not to tell this and that
business or family secret to Mary, knowing, the while, in his sad
soul, the chances to be against her keeping her promise not to share
it with her factotum.
Ellen is the bosom friend of a third; Bridget is the right hand, the
counsellor and colleague of a fourth. A fifth confides to her second-
rate associates that her faithful Fanny knows as much of family
histories (and there are histories in the clan) as she does, and that
she—the miscalled mistress—takes no step of importance without
consulting her.
Perhaps one man in five hundred is under the thumb of his
employee, and then because the underling has come into possession
of some dangerous secret, or has a “business hold” upon him.
Have wives more need of sympathy? or are they less nice in the
choice of intimates, and more reckless in confidences?
LUNCHEON CAKES
Huckleberry shortcake
Currant shortcake
Mash a quart of ripe red currants and stir into them two cups of
granulated sugar. Cover and set aside for half an hour.
Make a dough as for quick biscuit, only using a tablespoonful more
butter than usual. Roll into a large round biscuit about ten inches in
diameter. Bake, and, as soon as done, split open, spread with butter
and then with half the sweetened currants. Replace the top of the
biscuit and pour the remainder of the currants and juice over and
around the shortcake. Serve at once.
Scotch shortcake
(Contributed)
Cream a half-pound of fresh butter with a quarter-pound of sugar,
and work into it with the hands a pound of flour. Knead long, then
turn upon a pastry-board and press into a flat sheet half an inch
thick. Cut into squares and bake until light-brown and crisp.
Orange shortcake
(Contributed)
Sift into one and one-half cupfuls of flour one-half cupful of corn-
starch, one level teaspoonful of baking-powder and one-half
teaspoonful of salt. Rub into this with the tips of the fingers one-
third of a cup of butter and moisten with milk enough to make a soft
dough. Divide the dough in halves and spread over the bottom of
two tins. When done butter the cakes, sift over each powdered
sugar, and put between them thin slices of peeled oranges.
To two cupfuls of soft bread sponge that has been allowed to rise,
add one-half cupful of warm milk, a little salt, one-quarter cupful of
melted shortening, two eggs, beaten with three-quarters of a cup of
sugar. Add one-half grated nutmeg, some raisins or currants, and as
much warmed flour as can be worked in with a spoon. Put it into a
greased tin and let it rise. When very light, moisten the top with
milk, sprinkle with sugar and cinnamon, and bake in a slow oven
forty minutes. Cover with brown paper until almost done.
Potato cake
Two cupfuls of white sugar, one cupful of butter, four eggs, one-half
cupful of milk, one cupful of potatoes, one teaspoonful, each, of
cinnamon and cloves, one-half cup of chocolate, two cups of flour,
two teaspoonfuls of baking-powder, one cup of almonds. Blanch and
chop almonds; grate cold boiled potatoes; beat eggs separately,
adding whites last. Bake in a shallow pan in a moderate oven, and
cover with caramel frosting.
Huckleberry cake
Apple cake
Springleys (No. 1)
(A German recipe.)
Beat one pound of granulated sugar for ten minutes with four eggs,
leave for an hour, then add one tablespoonful of lemon extract, and
one teaspoonful of hartshorn. Work in enough flour (about two
pounds) to make it stiff enough to roll out. Powder the forms with
flour before using, so as to prevent sticking. Cut apart and lay on a
smooth slab until morning. Sprinkle anise seed in the bottom of the
tins before putting cakes in. Bake in a quick oven and watch very
closely in order to keep them from burning.
Springerlein (No. 2)
Currant bun
Warm a cupful of cream in a double-boiler, take it from the fire and
stir into it a cupful of melted butter, which has not been allowed to
cook in melting. Beat three eggs very light, add them to the cream
and butter, then stir in a cupful of sugar. Dissolve a half-cake of
yeast in a couple of tablespoonfuls of water, sift a good quart of
flour, make a hollow in it, stir into it the yeast and then, after adding
to the other mixture, a teaspoonful, each, of powdered mace and
cinnamon, put in the flour and the yeast. Beat all well for a few
minutes, add a cupful of currants that have been washed, dried and
dredged with flour, pour into a shallow baking-pan, let it rise for
several hours, until it has doubled in size; bake one hour in a rather
quick oven; sprinkle with fine sugar when done.
Cinnamon buns
Save a cupful of bread dough from the second rising. Cream a half-
cupful of butter with a half-cupful of sugar, stir in a well-beaten egg
and work these into the dough. Now add a half-teaspoonful of
cinnamon, a teaspoonful of soda, dissolved in a little hot water and a
half-cupful of cleaned currants, dredged with flour. Knead for several
minutes, form into buns, set to rise for a half-hour, then bake.
Parkin
Bun loaf
One cupful of butter; one and a half cupfuls of powdered sugar; two
cupfuls of flour; six eggs; half a pound, each, of raisins and currants;
quarter-pound of citron; teaspoonful of cinnamon and nutmeg; half
teaspoonful of ground cloves; three tablespoonfuls of brandy.
Cream butter and sugar, beat in the whipped yolks of the eggs, stir
in the flour, the spice, the raisins, seeded and chopped; the currants,
washed; the citron, shredded, and all the fruit, well dredged with
flour, then the whites, beaten stiff, and the brandy. Bake about two
hours in a steady oven.
Cream one cupful of butter with two cupfuls of powdered sugar, beat
the yolks of six eggs and add to the butter and sugar. Put in two and
a half cupfuls of sifted flour, half a pound, each, of seeded and
chopped raisins, and of washed and dried currants, a quarter of a
pound of shredded citron, all well dredged with flour, and a
teaspoonful, each, of cinnamon and grated nutmeg. Last of all, put
in the whites of the eggs beaten stiff. Bake in a steady oven.
Pound cake
Grafton cake
Gold cake
Silver cake