Cp5293 Big Data Analytics 1
Intelligent
Data Analysis
An Introduction
Springer
Editors
Michael Berthold
Universität Konstanz
FB Informatik und Informationswissenschaft
78457 Konstanz
Germany
[email protected]
David J. Hand
Department of Mathematics
Imperial College
Huxley Building
180 Queen's Gate
London, SW7 2BZ
UK
[email protected]
ACM Computing Classification (1998): I.2, H.3, G.3, I.5.1, I.4, J.2, J.1, J.3, F.4.1, F.1
November 2002
The obvious question, when confronted with a book with the title of this
one, is why "intelligent" data analysis? The answer is that modern data analysis
uses tools developed by a wide variety of intellectual communities and that
"intelligent data analysis", or IDA, has been adopted as an overall term. It should
be taken to imply the intelligent application of data analytic tools, and also the
application of "intelligent" data analytic tools, computer programs which probe
more deeply into structure than first generation methods. These aspects reflect
the distinct influences of statistics and machine learning on the subject matter.
The importance of intelligent data analysis arises from the fact that the
modern world is a data-driven world. We are surrounded by data, numerical
and otherwise, which must be analysed and processed to convert it into
information which informs, instructs, answers, or otherwise aids understanding and
decision making. The quantity of such data is huge and growing, the number of
sources is effectively unlimited, and the range of areas covered is vast: industrial,
commercial, financial, and scientific activities are all generating such data.
The origin of this book was a wish to have a single introductory source to
which we could direct students, rather than having to direct them to multiple
sources. However, it soon became apparent that wider interest existed, and that
potential readers other than our students would appreciate a compilation of some
of the most important tools of intelligent data analysis. Such readers include
people from a wide variety of backgrounds and positions who find themselves
confronted by the need to make sense of data.
Given the wide range of topics we hoped to cover, we rapidly abandoned
the idea of writing the entire volume ourselves, and instead decided to invite
appropriate experts to contribute separate chapters. We did, however, make
considerable efforts to ensure that these chapters complemented and built on
each other, so that a rounded picture resulted. We are especially grateful to the
authors for their patience in putting up with repeated requests for revision so
as to make the chapters meld better.
In a volume such as this there are many people whose names do not explicitly
appear as contributors, but without whom the work would be of substantially
reduced quality. These people include Jay Diamond, Matt Easley, Sibylle Frank,
Steven Greenberg, Thomas Hofmann, Joy Hollenback, Joe Iwanski, Carlo Marchesi,
Roger Mitton, Vanessa Robins, Nancy Shaw, and Camille Sinanan for their
painstaking proofreading and other help, as well as Stefan Wrobel, Chris Road-
VIII Preface to the First Edition
February 1999
1. Introduction 1
1.1 Why "Intelligent Data Analysis"? 1
1.2 How the Computer Is Changing Things 4
1.3 The Nature of Data 8
1.4 Modern Data Analytic Tools 12
1.5 Conclusion 14
2. Statistical Concepts 17
2.1 Introduction 17
2.2 Probability 18
2.3 Sampling and Sampling Distributions 29
2.4 Statistical Inference 33
2.5 Prediction and Prediction Error 46
2.6 Resampling 57
2.7 Conclusion 68
3. Statistical Methods 69
3.1 Introduction 69
3.2 Generalized Linear Models 70
3.3 Special Topics in Regression Modelling 93
3.4 Classical Multivariate Analysis 100
3.5 Conclusion 129
4. Bayesian Methods 131
4.1 Introduction 131
4.2 The Bayesian Paradigm 132
4.3 Bayesian Inference 135
4.4 Bayesian Modeling 143
4.5 Bayesian Networks 153
4.6 Conclusion 167
References 475
Index 501
Author Addresses 513