Data Quality Management with SPSS

This document is a comprehensive guide on data quality using SPSS, covering various aspects such as completeness, uniformity, duplicates, missings, outliers, and plausibility. It provides practical recommendations, control options, and methodologies for ensuring high data quality, along with real-world examples of data quality issues. The book serves as a multi-tool for understanding and resolving data problems, emphasizing the importance of data quality in decision-making processes.


Table of Contents
Preface to International Edition
Preface
1 The DQ Pyramid
1.1 The basis and the next level: First DQ criteria
1.2 Let’s talk about the DQ in DQ Pyramid
1.3 DQ: Cost-benefit considerations. Really?
1.4 Fact before fiction (range of effects)
2 Recommendations
2.1 Structures: Language, infrastructure, and priorities
2.2 Proposals for sustainable work
2.3 A few Words of Caution
3 Completeness
3.1 Control options at the level of number of datasets
3.2 Control options at the level of the number of cases (rows)
3.3 Control options at the level of the number of variables (columns)
3.4 Control options on the level of values resp. missings
4 Uniformity - Standardizing numbers, time units and strings
4.1 A first simple example: Inconsistent data
4.2 Identifying Inconsistency: Check for Variations in Strings
4.3 Standardizing Strings 1: REPLACE
4.4 Standardizing Strings 2: UPCASE, LTRIM, DO-IF, IF, INDEX and
SUBSTR
4.5 Standardizing symbols and special characters
4.6 Standardizing currencies and units of measurements
4.7 Standardizing via acronyms
4.8 Standardizing by removing identical strings
4.9 Standardizing via counting of String Templates
4.9.1 Standardizing using one template (loop, LOOP)
4.9.2 Standardizing using several templates (macro)
4.10 Standardizing mixed strings (phone numbers)
4.10.1 Completely filled fields (LOOP-END LOOP)
4.10.2 Incompletely filled fields (IF)
4.11 Standardizing time and date specifications
4.11.1 Date specifications: The special role of the date format
4.11.2 Time specifications: Three classic errors: A typical example for
inconsistent time data
4.12 Uniformity of punctuation resp. decimal places
4.12.1 Adding punctuation resp. decimal places
4.12.2 Removing punctuation from strings
4.12.3 Standardizing the "punctuation" of date variables
4.13 Uniformity of Missings
4.14 Consistency of analyses and designs (SET, SHOW and others)
4.14.1 Representative uniformity ('Corporate Design')
4.14.2 Technical and methodological consistency ('Technical Design')
5 Duplicate values and multiple data rows
5.1 Causes and consequences of duplicates
5.2 Checking for duplicates: To be or not to be?
5.2.1 Situation 1: The dataset contains one case per data row
5.2.2 Situation 2: The dataset contains several identical cases
5.3 Removing duplicate data rows only via ID Variable
5.4 Removing duplicate data rows over several variables (also excl. ID)
5.5 Information about type and number of duplicates (identification)
5.6 Displaying filtered and duplicate data rows
5.7 Identification of duplicates when reading data rows (grouped data)
5.8 Identification of duplicates when reading data rows (nested data)
6 Missings
6.1 Causes (patterns), consequences, extent and mechanisms
6.1.1 Causes and patterns of Missings
6.1.2 Consequences of missings
6.1.3 Mechanisms of missings
6.2 Which missings should not be replaced or deleted by values?
6.3 Deleting Missings
6.3.1 Deleting pair-wise vs. list-wise
6.3.2 Technical problems as a cause of missings - Delete completely empty
rows
6.4 Reconstruction and replacement of missings
6.4.1 Cold deck Imputation
6.4.2 Random-based approach
6.4.3 Logical approach
6.4.4 Stereotype-guided approach
6.4.5 Univariate estimation
6.4.6 Multivariate similarity (hot deck imputation)
6.4.7 Multivariate estimation
6.4.8 Conclusion
6.5 Calculating with missings
7 Outliers - Identify, Understand and Handle
7.1 Characteristics of outliers
7.1.1 The perspective is also decisive ("Frames")
7.1.2 Univariate or/and multivariate
7.1.3 The data is to blame: Which data?
7.2 Univariate Outliers
7.2.1 Identification via measures
7.2.2 Identification via rules
7.2.3 Identification via tests
7.2.4 Identification via diagrams
7.3 Multivariate Outliers
7.3.1 Identification via measures
7.3.2 Identification via rules
7.3.3 Special features of (bivariate) measurement series
7.3.4 Identification via diagrams
7.4 Causal analysis: Outliers or not?
7.5 Handling of outliers
8 Plausibility
8.1 Formal and content-related approach
8.2 Practical check of data plausibility
8.2.1 Plausibility of one variable
8.2.2 Plausibility of two or more variables: "Qualitative" approach
8.2.3 Multivariate data plausibility (detection of anomalies): "Quantitative"
approach
9 More Efficiency
9.1 Validate data: Basic Checks
9.2 Loading and applying predefined validation rules for single variables
9.2.1 By mouse
9.2.2 By syntax
9.2.3 Explanation of VALIDATEDATA syntax
9.3 Creating and executing custom rules for single variables
9.4 Programming and executing rules for multiple variables
9.5 Further examples of check rules (uncommented)
9.6 Checking of rules (conditions)
10 More Flexibility: Screening and more
10.1 Counting and ID Variables: Options for "Counting" through a Dataset
10.2 Screenings within a column (variable)
10.3 Screenings within several columns (variables)
10.3.1 Counting specific Values, Strings or Missings
10.3.2 Counting the Combinatorics of several Variables – Analysis of the
Levels in several Variables
10.3.3 Column-wise analysis for absolute match
10.3.4 Column-wise and row-wise analysis of several numerical data
10.3.5 Recoding of values and missings in several variables
10.3.6 Uniform "filling" of several data rows (LAG function)
10.3.7 Renaming numerous variable names (prefixes, suffixes)
11 Working with several (separate) datasets
11.1 Checking rules for joining
11.2 Checking several datasets for completeness
11.2.1 Checking segmented stored data
11.2.2 Checking continuously stored data
11.3 Screening of separate consecutively named datasets (macro)
11.4 Combining of consecutively named datasets (macro)
11.5 Comparing structurally identical datasets on absolutely identical
content
11.6 Macro for standardizing values in separate datasets
11.7 Splitting a dataset
11.7.1 Splitting a dataset by categories (e.g. IDs) (macro)
11.7.2 Splitting a dataset into uniformly filtered subsets
11.8 Working with several files (SPSS command DATASET)
11.8.1 Meaning and limits of the DATASET approach
11.8.2 Examples of common applications
11.8.3 An overview of DATASET syntax
11.9 Digression: Working with FILE HANDLE
12 Time and date related problems – Detect and resolve
12.1 Insights by time differences
12.2 Checking date entries (transposed digits)
12.3 Variants for solving the "Year 2000" problem (ISO 8601, Y2K)
12.4 Time stamps
13 Further criteria for data quality
14 A little exercise
15 A program example for a first strategy
16 Notes on IBM SPSS Modeler
16.1 Nodes Palette
16.2 Nodes for Data Preparation and Data Quality
16.2.1 Data Audit node
16.2.2 Auto Data Prep node
16.2.3 Distinct node
16.2.4 Other nodes
16.3 Modeler and SPSS Syntax: An old story …
17 Notes for Macintosh Users
18 Checklist: Test documentation
19 Communication of quality
19.1 Criteria for data quality
19.2 Criteria for the quality of data analysis
19.3 Criteria for the quality of communicating results
19.4 Criteria for "mortal sins" of professional work
20 Literature
21 Your opinion about this book
22 Author

Preface to International Edition


This international edition is the only title worldwide on data quality with
SPSS. It is based on an updated translation of a successful German-language
publication on data quality with SPSS.
This book provides you with a Swiss Army knife, a real multi-tool:
- You learn criteria for data quality and their relationships
- You recognize data problems and the possible risks they pose
- You understand your data problems, and their consequences
- You solve data problems = establish data quality
- You’re enabled to communicate data quality
Suboptimal data quality appears everywhere and anywhere, for example:
Completeness: The UK health authority failed to process
complete Covid-19 test results, so thousands of Britons probably
learned of their potential risk contacts too late (BBC News,
2020b).
Duplicates: The German Federal Central Tax Office issues the
11-digit personal tax identification number (TIN), which is
supposed to be as unique as a fingerprint. In more than 164,000
cases it either assigned one taxpayer two TINs or gave two
taxpayers the same TIN (Süddeutsche Zeitung, 2014).
Uniformity: NASA lost a $125 million spacecraft on
September 23, 1999, because one team of engineers worked in
metric units while another team worked in English units.
Missings: In 2006, the Alaskan tax authorities accidentally
deleted data and damaged both the data drive and the backup
drive. 800,000 documents had to be rescanned, checked and
processed over a period of months.
Correctness: Since 2008, 50-peso coins have been circulating in
Chile on which CHIIE was carelessly minted instead of CHILE.
The coins were not withdrawn from circulation, but the manager
of the national mint was.
Plausibility: In 2004, "Der Spiegel" published a university
ranking that even placed institutions at the top ranks which did
not actually exist in that form. In computer science, for example,
the top three institutions did not even offer a diploma in
computer science (Schendera, 2006).

How to use this book:


If you want to start immediately with checking and correcting your data
quality with SPSS, go straight to Chapter 3, about Completeness. However, if
you want to know something about the relationship of the quality criteria to
each other, and therefore also about the theoretical and didactical structure of
this book, I recommend the original preface and at least a peek at the DQ
Pyramid in Chapter 1; Chapter 2 offers recommendations on how to tackle a
DQ project and reasons why you should work with syntax (experienced users
may want to skip this). If you have little or no experience with programming
in SPSS, don't worry: the data and programs are manageable, and
transparency and clarity will guide and motivate you as you learn.
Still, note one difference from data analysis: if you include a value in a data
analysis, you usually process it only once, e.g. as input to a sum, mean,
t-test, you name it. A value (even a blank) in a cell is processed a single
time. In data quality work, each cell (entry or blank) is processed several
times, because each cell undergoes multiple quality checks, e.g.
completeness, uniformity, duplicates, missings, outliers and plausibility.
Already in the review phase, the effort for data quality is therefore several
times higher, in proportion to the number of quality criteria, not to mention
the possible correction phase. The effort has got to be much higher. If the
above examples have not yet answered why you should invest it, I also
addressed this question in a more business-oriented article, which I will
reproduce here in extracts:
Trust is good, control is better: The quality of data put to the test
Incorrect data causes incorrect decisions. Those erroneous decisions cause
damages in the multi-digit millions, and such damages have also led to
personnel consequences for those responsible (e.g. Citibank was ordered to
pay a $400 million civil money penalty related to deficiencies in risk
management, data quality and governance, and internal controls; OCC,
07.10.2020). The causes of incorrect data and figures vary: erroneous
collection, incorrect calculation, incorrect programming, targeted
manipulation. What do examples on the plus side show?

The quality of data in business processes is an indispensable requirement
for decisions. Data is trusted because its quality is an indispensable
prerequisite for those decisions.

Who benefits from the raised forefinger ...


The (obvious) question arises: if the quality of data is so important, why can
such cases happen? The superficially simple answer: people generally
think and act economically. Data is (initially) interpreted as reliable.
"Initially" means: as long as there is no reason to assume the opposite (and
even indications of the opposite can be relativized by various self-reassurance
strategies). Just imagine having to check every piece of information in an
annual report, a management summary or a productive DWH (even if you
sometimes should). In some environments, e.g. the BCBS (Basel Committee
on Banking Supervision) context, you are meanwhile required to. In fact, the
answer to how the examples listed at the beginning can happen lies in the
question of how to check them, and thus, more precisely, in the concrete
criteria for checking the quality of data.
... if it does not point in the right direction?
One could also ask: why do we not hear more often about data quality
problems in, say, the public sector, banks, health care or defense? Well,
what if we did? One would lose confidence in the authority concerned, the
company, the bank, the health care and defense sector, and so on. Is that
what you want? The damage might possibly be even greater if factual data
quality problems were communicated. To avoid playing into the hands of
the competition, for example, organizations often try (for too long) to solve
these problems internally (if they are even recognized in their full extent)
before they become public or the competition notices them.
Let's look around. Not only companies, banks or shareholders are
affected by data quality, but everyone. Every day. In the news, these
references are often found in anecdotal form (incorrectly issued parking fees
in the millions, incorrectly printed tickets, etc.), but the consequences are not
always so entertaining. For example, incorrectly calculated tax assessments
by the tax office, incorrect health insurance notifications, unemployment
benefits not transferred on time, bodily injury due to incorrect drug names or
dosages, unreliable scientific publications, incorrect information in censuses,
etc.
We live in an information society. The quantities, structures and formats of
data in storage and in processes are increasing every day; one may
realistically assume that these trends are now accelerating exponentially. We
also live in an information society that is in constant competition for
performance, efficiency and optimization, both internally and externally. At
the same time, the demands on the architectures of data storage and
processes, on data analysis procedures, and on the quality of data, which is
the essential basis of everything, will increase, and should be met with a
forward-looking concept. If you take a closer look around, you have to doubt
that this is the case. For numerous examples of the consequences of poor
data quality, the (unnecessary) insight applies: "Afterwards, one is always
smarter ...". But do you want to follow such "wisdom" when a lot, or even
loads, of money is involved?
What you can know in advance, you should know in advance ... A first
look at the concept of data quality
Where can you learn about data quality? First of all, from the experiences of
others with the more or less optimal quality of data. So first, read this book:
for more than 30 years now I have been confronted with professional work
with data. Another recommended source is the data warehouse literature.
Why? Because a lack of data quality in business usually shows quickly
where it hurts: money, money, money. So true.
However, what does the literature say about data quality? A classic
definition is Juran & Godfrey's (1999, 2): data are "of high quality if they
are fit for their intended uses in operations, decision making and planning".
So far, so good. What is striking about this definition? It makes things easy
for itself and only claims that data is ready for use as soon as it is "fit" for
its intended uses in operations, decision making and planning. It does not
specify when data is ready for use, i.e. which processes it has to go through,
which criteria (e.g. accuracy, uniformity, etc.) it has to meet, and which
tolerances it should satisfy. I define the quality of data somewhat more
complexly, but perhaps more realistically, as a multi-parametric relation
predicate, which results from the type and number of the required criteria
("criteria canon"), the methods of their examination, the tolerances/limit
values of the respective criteria used, as well as the respectively excluded
criteria, etc. (cf. Schendera, 2007, 7). As concrete test criteria I suggested
then (and now): "completeness", "uniformity", "duplicates", "missings",
"outliers", "time/date related problems", as well as "plausibility". And this
book will help you get through it.
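The idea of a criteria canon with per-criterion check methods and tolerances can be made concrete in a short sketch. The following Python fragment is purely illustrative (the book itself works in SPSS syntax); all names, checks and tolerance values are invented:

```python
# Illustrative sketch: data quality as a relation over criteria,
# check methods and tolerances ("criteria canon"). Not from the book.

def check_completeness(values, max_missing_ratio):
    """Share of missing (None) entries must not exceed the tolerance."""
    missing = sum(1 for v in values if v is None)
    return missing / len(values) <= max_missing_ratio

def check_uniformity(values, allowed):
    """Every non-missing value must come from a controlled vocabulary."""
    return all(v in allowed for v in values if v is not None)

# Each canon entry names a criterion, its check method, and its tolerance.
canon = [
    ("completeness", check_completeness, {"max_missing_ratio": 0.1}),
    ("uniformity",   check_uniformity,   {"allowed": {"m", "f"}}),
]

def data_quality(values, canon):
    """The quality 'predicate': which criteria does the column satisfy?"""
    return {name: check(values, **tol) for name, check, tol in canon}

# 1 of 5 entries missing (20% > 10%), and "M" is outside the vocabulary:
result = data_quality(["m", "f", None, "f", "M"], canon)
```

The point of the sketch is that "quality" is not a single yes/no verdict but the outcome per criterion, given explicitly chosen tolerances.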
What you need to be aware of: The quality of data is a complex
matter. If data represent the (economic) reality (of a company), then the
semantic and syntactic structures (of the company) in turn define the
relations and definitions of the data. Data storage is nothing more than a
reflection of the real complexity of a company, its data, processes and
criteria. Data is complex because its context is complex. One could, quite
consistently, add that data can also be chaotic and lacking in concepts
because the management itself still lacks concepts in this respect. For
example, so-called distributed spreadsheet databases in particular do not
even "notice" that their data is out of order, because of their "architecture".
Ensuring the quality of data is complex because this process must be
appropriate to the complexity of the company. Often, the power of
imagination (be it business, technical, or IT) is put under great strain at this
point:
One simply cannot imagine in every respect what is so difficult about
data quality.
It is precisely for this reason that many DWHs, and even the attempts to
rescue the quality of the data they contain, risk failing. The difficult thing
about the quality of data is the processes and structures in which it is
embedded and which it should represent realistically.
In the face of reality: A look ahead
What should distinguish a well-functioning company is not only that it is
economical, but also that it is competitive. This also applies to the quality of
its data, because data and figures, as results of data storage and processes,
flow back into the reality of the company and form the basis for future
entrepreneurial action. If the data underlying entrepreneurial action is no
longer correct, the basis for entrepreneurial decisions is pulled out from
under one's feet. That the consequences are not purely "academic" is
illustrated by the examples presented later (and it can be assumed that the
number of undetected cases is considerable). Data quality is not a matter of
faith, but a verifiable fact.
What is the benefit of data quality (if it exists), viewed soberly? If we
disregard all the promises of salvation, potential competitive advantages and
possible savings, data quality tells you, objectively and verifiably: you know
where you stand. This can be good news, but also bad news. First, the bad
news: you learn what went "wrong", to what extent the data was incorrect,
when and how which errors were made, and what consequences these errors
had. Take fraud, for example: only correct data protects against fraud, for
how else could we distinguish correct data from data shaped by fraudulent
behavior? The keywords here are fraud detection and anti-money laundering
measures in the banking sector, and in the government sector, for example,
checking data for corruption and benefit fraud. You will also learn, for
example, how many customers you really have (there may be unpleasantly
fewer than you thought). You only find out all this, in retrospect, when you
take care of your data quality. Was that really necessary?
Data quality clears the fog
It could also be good news: you find out that your data and decisions were
okay, and you can build on that for your company. So data quality not only
protects against misconduct, bad investments and fraud. Data
quality also opens up perspectives. Precise data enables you, for example, to
carry out sophisticated forecasts and simulations of future processes beyond
standard reporting of the current status. Only quality data enables you to
design prediction scenarios (in a fertile combination of entrepreneurial and
analytical expertise, of course) that can tell you very precisely what to
decide, when, and why. Data quality also proactively supports
entrepreneurial action and profitable investments. In the health care sector,
for example, only quality data enables the correspondingly reliable
management of health insurance companies and clinics, an optimal
comparison between service providers and service billers, right up to the
optimal care of a patient already at the moment of admission.
Is the journey the destination?
There are no limits to the entrepreneurial imagination. However, you should
also know what you can get out of data in order to really recognize what
treasure is hidden in it. It is not reliability. Despite all the cost-benefit
calculations, indices and factors, it is its closeness to reality. Only if the
reality of a company is reflected in its data can the data provide information
about its past, present and future business development. One can be satisfied
with poking around in the fog, accepting that one might already be standing
at an abyss. Or one can enjoy a clear view.

Hergiswil NW, Switzerland, November 2020


Dr. CFG Schendera

PS:
Although the original title remained largely unchanged, minor adjustments
were made. For example, some programming examples were
internationalized, e.g. names or places. All SPSS programs were also adapted
to changed IBM SPSS functionalities, e.g. in programming with VECTOR.
All LOOP programs were tested on SPSS v22. Various IBM SPSS product
names have been updated. On top of that, I did all the translation myself. If
you still find some Master Yoda lingo, incomprehensible statements or
undetected SPSS bugs, please feel free to send me feedback. Your feedback
is encouraged and appreciated. Let me help you improve your data; help me
in return to improve this book. Welcome on our journey into Data Quality land!
Schendera

Preface
This applies not only to working with SPSS:
Data quality is not everything, but without data quality everything is nothing.
For almost 20 years I have been confronted with the professional work with
data. "Confrontation" because the analysis of data is indeed a challenge,
even on three levels of working with data: Data management - data quality -
data analysis (overlaps are possible). Data management is the more general
preparation of data, data quality is the more criteria-led provision of data,
data analysis is the (primarily) inferential statistical analysis of data. Already
during my studies at the University of Heidelberg I had numerous
opportunities to apply and expand my analysis skills (at that time still on
mainframes). Something struck me again and again: Textbooks (e.g. on SPSS
or statistics) always referred only to ideal data situations. If there is one thing
I have experienced during all these years, hundreds of projects and countless
analyses, it is that data is usually not ideal: data is flawed, has outliers, gaps,
duplicates, and all kinds of errors, both conceivable and inconceivable.
Data is "dirty". Clean data is the exception; contaminated data is the rule.
What I often wished for at that time was a compilation of rules, as well as an
overarching concept that would have told me: “So, when you work with data,
the following problems can occur, among others. If you have this problem
with your data: Check this. If you have that problem, check that one.” But
first of all, this would have been the wrong approach: Data quality problems
are not dealt with reactively in case of recognizable errors, but proactively
because of the possibility of problems that cannot be recognized by sight.
Secondly, unfortunately, there was no such book or concept. On the contrary,
this book you are holding in your hands might even be the first book on SPSS
ever to deal exclusively with selected data quality problems and their
practical solution. I sincerely wish this book had been around twenty years
ago. How much work, trouble and time do you think it would have saved
me? And what data aspects do you think could cause problems and pitfalls?
The next section will tell you.
This book will first introduce you to criteria and measures to ensure a
definable and verifiable optimal data quality with SPSS. The criteria were
compiled from recognized standards as well as suggestions and
recommendations from friends and colleagues in the DWH and
methodological area. Despite the scope of this publication it must be
emphasized that it can only be a selection of criteria and recommendations.
The book is kept so general that the compiled criteria can be applied to SPSS
datasets, but in principle also to data warehouses (e.g. using IBM SPSS
Modeler aka Clementine, see Chapter 16). Possible differences in the
definition of data (e.g. business data vs. data from basic research) and in the
process of their examination are discussed in Chapter 2. Aspects such as the
analysis of complex (business, data, etc.) processes or the concrete planning
of data quality projects can only be hinted at. Unfortunately, data
warehousing or data mining in connection with special fields of application
(e.g. Basel II) cannot be discussed due to lack of space.
Chapter 1 introduces the most frequently occurring problem areas, e.g.
completeness, uniformity, duplicates, missings, outliers and plausibility. The
DQ Pyramid concept clarifies the interrelationships of these criteria, and the
fundamental benefits of data quality. Further criteria for the quality of data, as
well as their communication, are presented in chapters 13 and 19. The DQ
Pyramid provides also the structure for this book.
Chapter 2 offers basic recommendations and proposals on how to tackle a DQ
project, covering among other things language, resources, support, structures,
priorities and metrics, and considerations about sustainable work (e.g.
switching from a reactive to a proactive approach).
Chapter 3 describes first options to control the completeness of datasets,
cases (rows), variables (columns) and values resp. missings; e.g. metadata,
counters, checksums, visual inspection, and possible limits of verifiability.
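As a language-neutral illustration of such counters and checksums (the book itself uses SPSS; the CSV snippet and the expected figures below are invented), completeness can be checked at three levels roughly like this:

```python
# Illustrative sketch: completeness checks at three levels -- case count,
# variable count, and a checksum over a numeric column. Invented example.
import csv
import io

raw = "id,value\n1,10\n2,20\n3,30\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Expected figures would normally come from metadata of the data delivery.
expected_rows, expected_cols, expected_sum = 3, 2, 60

assert len(rows) == expected_rows                          # number of cases
assert len(rows[0]) == expected_cols                       # number of variables
assert sum(int(r["value"]) for r in rows) == expected_sum  # checksum
```

If any of the three assertions fails, the delivery is incomplete (or the metadata is wrong), which is exactly the ambiguity the chapter's "limits of verifiability" refers to.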
Chapter 4 presents numerous possibilities to identify inconsistencies in
numerical values, time units and strings, and to standardize them.
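The chapter works with SPSS string functions such as UPCASE, LTRIM and REPLACE. Purely as an illustration of the same idea, a Python sketch (the street-name example is invented):

```python
# Illustrative sketch of string standardization, mirroring the roles of
# SPSS's LTRIM/RTRIM, UPCASE and REPLACE. Example data is invented.

def standardize(s):
    s = s.strip()                       # LTRIM/RTRIM: drop surrounding blanks
    s = s.upper()                       # UPCASE: one consistent case
    s = s.replace("STR.", "STRASSE")    # REPLACE: expand an abbreviation
    return s

variants = [" Hauptstr. 1", "HAUPTSTR. 1 ", "hauptstr. 1"]
# All three spellings collapse to a single standardized form:
assert {standardize(v) for v in variants} == {"HAUPTSTRASSE 1"}
```

The order of operations matters: upcasing before replacing means one replacement pattern covers every case variant.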
Chapter 5 introduces the problem area of recognizing, interpreting and (if
necessary) filtering multiple values or data rows. This section is
supplemented by hints for the identification of duplicates when reading in
grouped as well as nested data.
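In SPSS this is typically done by sorting and flagging the first case per key. As a rough, invented Python sketch of the "first case per ID wins" logic:

```python
# Illustrative sketch of duplicate identification by ID. Data is invented.
rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bo"},
    {"id": 1, "name": "Ada"},   # duplicate ID
]

seen, unique, dupes = set(), [], []
for row in rows:
    # First occurrence of an ID is kept; later ones are flagged as duplicates.
    (dupes if row["id"] in seen else unique).append(row)
    seen.add(row["id"])

assert len(unique) == 2 and len(dupes) == 1
```

Note that flagging (keeping the duplicates in a separate list) rather than silently deleting them is what allows the causal analysis the chapter recommends.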
Chapter 6 introduces the handling of missing data. After evaluating missing
data with regard to causes (patterns), consequences, extent and mechanisms,
numerous methods of reconstruction and replacement of missing data are
described: These include cold deck imputation, random or logical
approaches, univariate estimation, multivariate similarity (hot deck
imputation) or multivariate estimation (internal consistency approach,
missing value analysis, MVA).
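The simplest of the replacement methods listed, univariate estimation by the mean, can be sketched as follows (illustrative Python, not the book's SPSS code; missings are represented as None):

```python
# Illustrative sketch of univariate mean imputation. Data is invented.
from statistics import mean

values = [4.0, None, 6.0, None, 8.0]

m = mean(v for v in values if v is not None)   # mean of observed values
imputed = [m if v is None else v for v in values]

assert imputed == [4.0, 6.0, 6.0, 6.0, 8.0]
```

The sketch also shows the method's well-known drawback: every gap receives the same value, which shrinks the variance of the variable.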
Chapter 7 explains how to recognize, interpret and deal with outliers. In
connection with the characteristics of outliers, the special role of expectations
("frames") is first discussed. Following the identification of univariate or
multivariate outliers by means of measures, rules, tests and diagrams,
possibilities of dealing with outliers are presented.
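One widely used identification-via-rules example is the 1.5 x IQR fence (Tukey). A minimal Python sketch with invented data (the book presents the corresponding SPSS measures, rules, tests and diagrams):

```python
# Illustrative sketch: flag values outside 1.5 interquartile ranges of the
# quartiles (Tukey's fences). Data is invented.
from statistics import quantiles

data = [10, 11, 12, 11, 10, 12, 11, 40]        # 40 looks suspicious

q1, _, q3 = quantiles(data, n=4)               # lower and upper quartile
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < low or x > high]
assert outliers == [40]
```

As the chapter stresses, such a rule only identifies candidates; whether 40 is an error or a genuine value is a matter of causal analysis.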
Chapter 8 describes qualitative and quantitative approaches to check
plausibility. The review of the quality of data in practice is first explained
using a single variable (with examples for a categorical variable, a string
variable and a metric variable). Subsequently, the examination of the
multivariate quality of data is presented using a qualitative as well as a
genuinely quantitative approach (anomaly approach).
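A "qualitative" cross-variable plausibility check can be as simple as a logical condition over two variables. The rule and the records below are invented purely for illustration:

```python
# Illustrative sketch of a cross-variable plausibility rule: a record is
# implausible if AGE is below 18 but MARITAL is "widowed". Invented rule.
records = [
    {"age": 45, "marital": "widowed"},
    {"age": 12, "marital": "widowed"},   # implausible combination
    {"age": 12, "marital": "single"},
]

implausible = [r for r in records
               if r["age"] < 18 and r["marital"] == "widowed"]

assert implausible == [{"age": 12, "marital": "widowed"}]
```

Each value passes a univariate check in isolation; only the combination of the two variables reveals the problem, which is the essence of the multivariate approach.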
Chapter 9 introduces the efficient checking of several variables and criteria
using check rules. This chapter introduces the powerful menu item
"Validation" (available since SPSS 14, if licensed) and/or the SPSS
procedure VALIDATEDATA. The chapter first introduces control by mouse
and finally moves on to the extension by self-written check programs in
SPSS syntax.
Chapter 10 contains numerous further examples for checking several values,
rows and columns in one dataset at once. Of particular interest are, among
other things, the numerous variants of the counting variables (counters)
presented, as well as other special applications, e.g., the renaming of
numerous variable names (prefixes, suffixes).
Chapter 11 contains numerous other examples of working with several
(separate) datasets at once. Of particular interest are the various macros for
screening, splitting or merging several datasets. Various application
possibilities of the options of the SPSS command DATASET are also
presented.
Chapter 12 deals with time- or date-related problems, and their recognition
and solution. Of special interest may be the section on time stamps.
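A basic form of such a date check, catching transposed or otherwise impossible digits, is to test whether a string parses as a real calendar date. Illustrative Python, not the book's SPSS approach; the date format is assumed to be ISO (YYYY-MM-DD):

```python
# Illustrative sketch: a date string is suspect if it does not parse
# as a real calendar date in the assumed ISO format.
from datetime import datetime

def valid_date(s):
    try:
        datetime.strptime(s, "%Y-%m-%d")
        return True
    except ValueError:
        return False

assert valid_date("2020-02-29")        # 2020 is a leap year
assert not valid_date("2020-02-31")    # impossible day: typo or transposition
```

Of course, a transposition that yields another valid date (e.g. day 12 instead of 21) passes this check, which is why the chapter adds further plausibility tests such as time differences.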
Chapter 13 introduces further criteria for the quality of data, such as
quantity, unambiguity, relevance, accuracy or comprehensibility. The
adherence to these criteria can usually rather be judged by the user and/or
client on the basis of exactly formulated (semantic) objectives than on the
part of SPSS by means of formal check rules.
Chapters 14 to 18 contain a small exercise (Chapter 14), a program example
for the implementation of a first strategy (Chapter 15), notes on data
quality and SPSS syntax in IBM SPSS Modeler (Chapter 16) and for
Macintosh users (Chapter 17), and a checklist (test documentation). These
chapters, which are less concerned with criteria than with concrete working
practice with SPSS, together also form a demarcation from the following
chapters on planning, analysis and result quality. Chapter 16 demonstrates
how IBM SPSS Modeler can be used to check data quality, and how earlier
versions could even be extended by including SPSS syntax and procedures.
Chapter 18 contains a check list of selected criteria that users can use to log
the way in which quality criteria are implemented.
Chapter 19 compiles commented criteria for communicating the quality of
data, surveys and analyses. If one is concerned about the quality of data and
results during the analysis phase, one should avoid destroying this positive
impetus through suboptimal communication. A separate chapter is reserved for
the "mortal sins" of professional work. The purpose of this chapter is to
prevent, from the outset, certain practices from not being recognized as
dubious or even unprofessional until it is too late. This chapter is also
intended to counteract the increasing decline in quality in scientific
publications, which is observed with growing concern by the professionally
working part of the scientific community. It is important to prevent
scientific work from no longer being taken seriously by the public and by
parts of the scientific community itself. I would like to take the liberty
at this point of asking all readers to contribute their feedback and
suggestions to supplement or improve this book, especially (but not only)
for this chapter, in future editions (see Chapter 21).
Chapters 20 to 24 contain the literature, a request for feedback on this book,
a short introduction of the author, and finally the lists of SPSS syntax and
keywords.
By the end of this book, you should know many data quality criteria and the
most important features of SPSS to ensure them, be able to define them
according to your standard of optimality, and apply them to your data using
mouse or syntax controls. Users can check existing data ex post to see if it
meets defined criteria, and by developing a standard (rules) before entering or
migrating data, they can also anticipate how to prevent new data from
entering the analysis system incorrectly in the first place, e.g. by integrating
filters during data access or data entry.
The measures for ensuring data quality are summarized in the literature under
various generic terms. The terms vary from describing the process (e.g. data
cleaning, data cleansing, data standardization, data checking,
deduplication, data refinement, data hygiene, preprocessing, plausibility
analysis, plausibility check, data scrubbing, data stewardship) to
describing the target (e.g. data plausibility, data quality, data
integrity, quality assurance, quality management). These terms originate
primarily from the context of data warehousing, which seems to be a broad
field (or rather: a playground?) for many other technicist-sounding -ing and
-ion neologisms (see matching, parsing, householding, consolidation,
standardization, etc.). These measures support you not only in ensuring the
quality of quantitative entries, but also of qualitative entries, e.g.
letters or texts. A very first application would be, for example, checking
and ensuring the uniformity of product names or other qualitative
information or codes.
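Such a uniformity check can be sketched outside SPSS as well. The following Python fragment is only an illustration: the product names are fictitious, and the normalization rule (trim, collapse whitespace, lowercase) is an assumption that would have to be agreed on per project. It flags canonical names that appear in more than one spelling:

```python
from collections import defaultdict

def canonical(name):
    # Hypothetical normalization rule: trim, collapse inner spaces, lowercase.
    return " ".join(name.split()).lower()

# Fictitious entries: three spellings of one product, plus one other product.
products = ["Widget A", "widget  a", "WIDGET A", "Widget B"]

variants = defaultdict(set)
for p in products:
    variants[canonical(p)].add(p)

# Canonical names written in more than one way violate uniformity.
non_uniform = {k: v for k, v in variants.items() if len(v) > 1}
print(list(non_uniform))   # → ['widget a']
```

Within SPSS itself, the same idea would be expressed with string functions (e.g. LOWER, LTRIM, RTRIM) before a frequency check; the sketch merely illustrates the principle.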
However, all measures to ensure data quality have in common that they take
priority over the actual analysis and can be strenuous, time-consuming, and
even dangerous. "Dangerous" in the truest sense of the word: boredom or
stress in particular can quickly cause errors through a loss of concentration
or motivation.
You will therefore be challenged: in contrast to the generally rather
uncomplicated hypothesis test, the demanding measures for ensuring data
quality do not aim to examine known factors or variables for expected
correlations or differences, but rather to examine unknown factors or
variables for possibly unexpected effects. In all this detective work and causal
analysis, remember a maxim of Sherlock Holmes: "When you have
eliminated the impossible, whatever remains, however improbable, must be
the truth".
Data analysis is not much different from real life. "Dirt", "fog" or "noise"
(whichever image you choose for data contamination) only reveal the actual
"nature of things" once they are gone. I therefore connect with this book the
hope that you will later be able to use it as a kind of filter, a kind of
door opener: with error or plausibility analyses you can remove the "dirty
layer" from the data and get to the real phenomena. I would be pleased if
this book could also support and inspire users to exploit and interpret the
possibilities of complex multivariate statistics (necessarily only touched
upon in this book) and to discover real phenomena. Ensuring and guaranteeing
data quality is an art in itself, and in terms of effort and complexity it is
at least equal to, if not often more demanding than, the "actual" analysis of
data (cf. Schendera, 2005). Wilcox (1999) is not alone in wondering how many
phenomena have remained undiscovered simply because users have not been able
to apply and exploit the possibilities of complex multivariate statistics
with regard to data analysis and data quality. The reader is also urgently
advised to plan early and not to underestimate the resources required for
data quality (Chapter 2).
The quality of data is not an end in itself. Data at the end of a quality
process is always also information. If information is the basis of knowledge,
and knowledge in turn means power, then it should be clear what erroneous
data implies. Data quality comes before analysis quality and is the very
first basis of knowledge construction, i.e. of the objectivity, reliability
and validity of knowledge, including research methods and statistics. Accordingly, the
quality of data is not "only" the basis of a doctoral or diploma thesis. The
quality of data also determines the reputation of entire disciplines, companies
or even research areas, scientific professionalism and credibility per se (see
also Chapter 19).
The concept of "quality" represented here (for an elaboration of this term,
see the following chapters) is understood as independent of the strategy of
scientific research orientation, i.e. of whether "quantitative" or
"qualitative" research approaches are concerned. The conceptual dichotomy
"quantitative vs. qualitative" is only artificially selective with respect to
the actually differentiating methodological characteristics and is in reality
misleading (e.g. Kromrey, 2005). The methodologies of both disciplines are
partly characterized by fundamental similarities: different questions
necessarily require different methodological approaches, which places these
disciplines and their procedures on an equal footing and lets them complement
each other constructively in the sense of methodological pluralism. Different
questions require a differentiated, but always professional (i.e. rule- and
criteria-based) approach. Some qualitative approaches even sketch an explicit
quantifiability in their procedure (e.g. Mayring, 1990), while conversely
quantitative approaches are quite capable of working with qualitative "data",
quantified to varying degrees (see the examples below). Both research areas,
for example, naturally require "data" (measurements, texts, interviews,
images, etc.) of optimally high quality in the sense of images of empirical
reality resulting from an interaction between object (via prior
understanding) and method. Both disciplines also share problem areas that may
limit the quality of the respective approach, such as bias, subjectivity or
selectivity (especially Wilson, 1981, 57-61). Both qualitatively and
quantitatively based publications can reveal massive errors when critically
examined. In scientific institutions, the critical reading of such
publications is sometimes part of the methodological training. I even have
publications on file, some of them award-winning, whose authors were not even
able to correctly interpret a Pearson correlation coefficient.
The presented quality criteria (e.g. completeness, uniformity, or criteria
presented later, such as timeliness, quantity or relevance) can be applied
without restriction to work with qualitative data. One reason, as already
mentioned above, is that the semantics of qualitative "data" (e.g. texts,
interview transcriptions) can often be further processed by software, both
"qualitatively" and "quantitatively":

– e.g. in SPSS as a standalone tool (see e.g. Schendera, 2005, Chapter 7)
– e.g. in IBM SPSS Text Analytics [for Surveys]. This application lets you
transform unstructured survey text into quantitative data. The solution uses
natural language processing (NLP) technologies, categorizes responses (e.g.
using pre-built categories) and integrates results with other survey data to
gain insight using sentiment analysis.
– e.g. in combination with third-party tools, for example as an application
for content analysis, e.g. MAXqda (formerly: WinMax), [Link] or NVivo
(formerly: NUD*IST).

The same therefore applies to both methodological approaches (disciplines):
data quality comes before analysis quality. The presented criteria are
therefore also suitable for working with qualitative "data" (texts, strings)
and (depending on the criterion, measure and approach) also for large to very
large data volumes, such as can occur in data mining with IBM SPSS Modeler.
The book was developed mainly on the basis of and for SPSS syntax (originally
v15, but also successfully tested on v22). Beginners in SPSS for Windows
should know that SPSS syntax can be requested automatically via mouse clicks,
i.e. SPSS programs can easily be written by the user (cf. "Datenmanagement
mit SPSS", Schendera, 2005). By the way, this book and "Data Management with
SPSS" (Schendera, 2005) were developed from the beginning as separate but
complementary manuals. "Data Management with SPSS" offers a solid
introduction to working with SPSS syntax, including a first programming of
macros; building on this, "Data Quality with SPSS" covers a complex and
practical application of SPSS syntax for checking data and, if necessary,
optimizing its quality. Experienced SPSS users probably already know the
numerous advantages of working with SPSS (see Chapter 2). It should be
pointed out here that scripts created in IBM SPSS Statistics can no longer be
imported and run within IBM SPSS Modeler. In the past, Clementine/Modeler
users may have found it advantageous to be able to use SPSS syntax and
procedures to significantly enhance the performance of Clementine/Modeler in
terms of criteria-driven data quality checking and assurance: by including
simple SPSS functions (such as "SPSS Transform" and "SPSS Output"), SPSS
syntax and procedures (e.g. REPLACE), complex user-developed syntax programs,
as well as special SPSS procedures such as VALIDATEDATA. Since this allowed
the delivery of data, results and models based on a systematic, verifiable
canon of criteria for the quality of the underlying data, it increased the
transparency, credibility and thus the professionalism of data mining with
Clementine/Modeler. The chapter on Modeler was written for the latest
version, IBM SPSS Modeler Subscription 1.0.
In the previous sections the terms "truth" and "professionalism" have been
mentioned. The central intention of this book is to compile clear and
verifiable criteria, unambiguous standards to be adhered to, and conduct to
be refrained from. Of course, these requirements are not an end in
themselves, but an expression of professional ethics, of responsibility
towards professional scientific work. With these compiled criteria, standards
and proscribed behaviors, I also hope to communicate clearly and lastingly
which basic expectations have been and will continue to be placed on the
quality of data, analyses and research in general.
There is one point on which I ask for your indulgence and would like to
emphasize clearly: Despite the wealth of material compiled, it cannot be an
exhaustive presentation. On the contrary, it is only a selection of the most
important aspects, guided by a certain subjectivity. The selection of the
aspects, as well as their respective weighting, could certainly be discussed,
and I would like to expressly invite all users to do so.
I would like to take this opportunity to invite all readers once again to
contribute their feedback and suggestions to supplement or improve this book
in future editions (see Chapter 21).
However, I would not want any reader to assume that he or she knows
everything about the quality of data, analysis or research after reading this
book. I must clearly counter this assumption: on the contrary, this is only a
beginning, a first attempt at an explorative systematization, a first
starting point for interested readers and professional scientists.
执子之手，与子偕老 ("To hold your hand, and grow old together with you") —
dedicated to my beloved Chinese princess.
Also dedicated to my grandparents Lore & Albert Schmid and Klara
Schendera

I am grateful for professional advice and/or contributions in the form of
syntax, data and/or documentation to, among others: Prof. Gerd Antos (Martin-
Luther-Universität Halle-Wittenberg), Prof. Johann Bacher (Johannes-
Kepler-Universität Linz, Österreich), Prof. Vijay Chatterjee (Mount Sinai
Medical School, New York University, USA), Prof. Mark Galliker
(Universität Bern, Schweiz), Werner E. Helm (FH Darmstadt), Prof. Jürgen
Janssen (Universität Hamburg), Raynald Levesque (Boucherville QC,
Canada), Prof. Roderick J.A. Little (University of Michigan USA), Prof.
Daniel McFadden (University of Berkeley USA), Dr. James W. McNally
(University of Michigan USA), Prof. Theo Van der Weegen (Radboud
Universität Nijmegen, Niederlande), Prof. Rainer Schlittgen (Universität
Hamburg), Dr. Jonathan D. Shanklin (Head of Meteorology & Ozone
Monitoring Unit, British Antarctic Survey, Madingley Road, Cambridge,
England, United Kingdom), Prof. Stephen G. West (Arizona State University
USA), and Matthew M. Zack (Centers for Disease Control, Atlanta, Georgia,
USA).
I would also like to thank Mr. Alexander Bohnenstengel, as well as Ms.
Sabine Wolfrum and Ms. Ingrid Abold of [formerly] SPSS GmbH Software
(Munich, Germany) for generously providing the software and technical
documentation. I would also like to thank Dr. Schechler of Oldenbourg
Verlag for his confidence in publishing this book as well as for his always
generous support. Volker Stehle (Eppingen) designed the print style sheet.
Stephan Lindow (Hamburg) designed the graphics. If anything in this book
should be unclear or faulty, the responsibility lies solely with the author.
Being allowed to do research is a privilege. Being able to do research is a
value. Sustain both.

Heidelberg, July 2007


CFG Schendera

1 The DQ Pyramid
Data quality [DQ] is essential and omnipresent:

– e.g. in governments, offices and agencies (Office of Management and Budget,
2007, Chapter III, 2006; US Census Bureau, 2006; United Nations, 2003, 1995,
1983; OECD, 2003; Eurostat, 2004, 2003, 2002, 1999/1998; Statistische Ämter
des Bundes und der Länder, 2003; Körner & Schmidt, 2006; Blanc et al., 2001;
etc.)
– e.g. in banks and business (Lee et al., 2006; Konno, 2006; Willeke et al.,
2006; Bettschen, 2005; Goerk, 2005, 2004; Infeld & Sebastian-Coleman, 2004;
McKeon, 2003; Wan et al., 2002; Ofori-Kyei et al., 2002; English, 1999;
International Monetary Fund (IMF), e.g. Carson & Liuksila, 2001; Carson,
2000; etc.)
– e.g. in research associations (e.g. DFG, 2005, 1998; DeGEval, 2004;
Bundesärztekammer, 2003; Wilkinson & APA Task Force on Statistical Inference,
1999; etc.).
However, the quality of data is not a matter of course. Only if the quality
criteria, and the measures necessary for checking them, are known can the
quality of data be established. In other words: only those who know what to
look for have a chance to find the errors in their data and (hopefully)
correct them in time. This book is based on the assumption that the quality
of data is not something automatically given, but rather a product that can
only be purposefully induced and proven by explicit definitions, measures
and criteria.
Data quality is a genuinely interdisciplinary topic. The quality criteria
presented in this book originate from survey research (primary goal: data
collection), but also from the field of data warehousing (primary goal: data
management). In each case, the field of statistics provides further criteria,
assumptions and corresponding background concepts. The literature is
correspondingly heterogeneous, see e.g. Batini & Scannapieco, 2006, Chap. 2;
Lee et al., 2006; Gackowski, 2004; Laliberté et al., 2004; Fugini et al., 2002;
Pernici & Scannapieco, 2002; English, 2002, 1999; Helfert et al., 2001;
Carson & Liuksila, 2001; Carson, 2000; Berry & Linoff, 2000; Naumann &
Rolker, 2000; Brackstone, 1999; Chapman et al., 1999; Garvin, 1998; Wang
& Strong, 1996, and many more.
The following anecdote may be a banal example, but it is revealing for the
need to meaningfully coordinate data quality, data warehouses and expertise
already at the level of basic statistical concepts. A large company planned
to implement a check algorithm based on the null hypothesis test into a DWH.
The reasoning of the responsible project manager was as follows: if a certain
value differs too much from all other values in the DWH, the value reaches
statistical significance and can thus be identified as an error. What was
overlooked was that the DWH contained millions of entries and almost every
tested value achieved "significance" simply because of the huge amount of
data (see also Chapter 19.3 for the significance concept), less because of
its absolute deviation from the other values. After consultation with
statistically experienced data analysts, a check algorithm was implemented
that successfully identified erroneous values regardless of the amount of
data. To be fair, it must be emphasized that the responsible project manager
had fallen for top-level management information in which the functionality of
the classical significance test was not correctly represented.
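The pitfall in this anecdote can be reproduced in a few lines. The following Python sketch (illustrative numbers only, not the company's actual data, and not SPSS syntax) shows how a fixed, trivially small deviation of 0.01 standard deviations becomes "significant" once n is large enough:

```python
import math

def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    # Classical one-sample z statistic: grows with the square root of n.
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

# A value deviating by only 0.01 standard deviations (mean 100, sd 15):
for n in (100, 10_000, 1_000_000, 100_000_000):
    z = one_sample_z(100.15, 100, 15, n)
    print(f"n={n:>11,}  z={z:8.2f}  'significant': {abs(z) > 1.96}")
```

At n = 1,000,000 the same trivial deviation crosses the nominal 1.96 threshold, which is why the revised check algorithm had to work independently of the amount of data, e.g. via absolute deviation limits.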
Depending on the primary goal, there are currently still different
application-related accentuations within the individual systematizations and
hierarchizations of the respective disciplines. However, it can be assumed
that they will converge over time, since they are ultimately elements of one
and the same process: data collection always requires data storage for the
data obtained, and data storage always requires that data collection has
taken place, i.e. that collected data is available. Both sides, however,
assume, and ideally proactively advocate, that data quality is given. Most
methodologies, whether as a systematic collection of the users' subjective
assessments of data quality (e.g. as an IQA survey) or as an objective test
rule at the concrete data level (e.g. as a Codd integrity constraint), test
more or less the same set of criteria (e.g. Lee et al., 2006, Chap. 3, 4,
also Appendix 3; Batini & Scannapieco, 2006, Chap. 7.2; Laliberté et al.,
2004; Long et al., 2004, 201–203; Gackowski, 2004; Statistics Canada, 2000;
Carson, 2000; Davies & Smith, 1999c, d).
Even apparently theoretically divergent background frameworks, such as
empirical, semiotic, ontological, or TDQM (Total Data Quality Management)
approaches, show wide overlaps with respect to the concrete criteria, which
in turn sometimes have completely different definitions (see Gackowski,
2004, 127-130). The terminology for the criteria, in detail and as a generic
term, is currently anything but uniform. Various publications, for example,
use the term "dimension" (i.a. Lee et al., 2006; Batini & Scannapieco, 2006,
e.g. Chap. 2; Redman, 2004, 2001), while others use "attribute" (Gackowski,
2004; O'Brien & Marakas, 2003), etc. These terms, however, represent
(statistical) concepts which, in terms of complexity, in some cases go far
beyond individual (techno)logical terms originating more from computer
science, such as the dataset-centered term "integrity constraint".

1.1 The basis and the next level: First DQ criteria
The basis: Six first criteria
Of the numerous quality criteria, completeness, uniformity, missings,
duplicates, outliers, and plausibility are presented first (see Chapters 13
and 19 for further criteria). Many further criteria build on these; the
criteria "completeness", "uniformity", "duplicates", and "missings", for
example, form a basis. However, checking these criteria may require that
further preconditions are met, e.g. that a plan and enough time for the
review with SPSS are available (see Chapter 2). Before using SPSS it is
recommended to get an overview of the other criteria and their fulfillment.
The six criteria for data quality, which will be discussed in more detail in
the following chapters,

– completeness (Chapter 3) and controlled missings (Chapter 6),
– avoiding duplicate data (Chapter 5),
– uniformity (Chapter 4),
– evaluation of outliers (Chapter 7), and
– plausibility (interpretability) (Chapter 8)

are closely interwoven. These criteria, as well as others presented in
Chapter 13, together form the basis of the checklist in Chapter 18.
Completeness is defined, for example, in such a way that the number of data
in a final analysis dataset corresponds exactly to the sum of valid and missing
data in a structured environment, e.g. in a survey (questionnaire) or the partial
datasets from which it was formed. If, for example, information is missing in
a questionnaire, this should correspond to controlled missings in the dataset.
If there is more valid or missing data in the analysis dataset than was
collected or delivered, this is referred to as overcompleteness (to be carefully
checked). Uncontrolled missings and duplicates should be checked carefully
after ensuring completeness.
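This count reconciliation can be sketched as follows (Python illustration with invented counts; in practice the numbers would come from the collection protocol and the SPSS dataset):

```python
def completeness_status(valid, controlled_missing, collected):
    # Reconcile dataset counts against what was collected/delivered.
    total = valid + controlled_missing
    if total == collected:
        return "complete"
    if total < collected:
        return "incomplete"     # records lost between collection and analysis
    return "overcomplete"       # more data than collected: check carefully!

# Hypothetical counts: 1200 questionnaires collected, 1143 valid values
# and 57 controlled missings in the final analysis dataset.
print(completeness_status(1143, 57, 1200))   # → complete
print(completeness_status(1143, 57, 1150))   # → overcomplete
```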
For many analysis, transformation and verification measures, the uniformity
of the data, especially if they come from different sources, is an indispensable
prerequisite. Only when the uniformity of the data is given, e.g. of currencies
or time units, is it possible to check for outliers. The following diagram
introduces the concept underlying this book.
The criteria "completeness", "uniformity", "duplicates", as well as "missings"
form the basis (see the lowest level in the diagram above) of the DQ
Pyramid. All other criteria are based on this (for other hierarchies see e.g.
Gackowski, 2004, 139-140; Oliveira et al., 2005). At first you need to read
the DQ Pyramid from bottom to top; the base itself is to be read from left to
right (as is the layer above, but not the third layer): For example,
completeness is a prerequisite for ensuring uniformity, excluding duplicate
lines or values, and dealing appropriately with possible missings.
If the DQ Pyramid is read from bottom to top, it becomes evident that as long as
e.g. no complete, uniform etc. data are available, it makes little sense to check
data for further criteria (e.g. plausibility), let alone to perform inferential
statistical analyses or to communicate any "results" (hastily). If the DQ
Pyramid is read from bottom to top, it can be a first orientation to organize
your own data checks in reasonable steps based on each other. However, the
DQ Pyramid can also be read from top to bottom: Starting with the "crown"
of professional (scientific) work, a publication or a report, e.g. to
management or superiors, the publication or report itself can already be
checked for compliance with quality standards, right down to the concrete
professional handling of further criteria, such as outliers, missings or even
completeness. The "bottom-up" perspective is linked with the hope that this
DQ Pyramid will already optimize the process of scientific work and
communication with regard to selected aspects of data quality resp. provide a
certain degree of security. The "top-down" perspective can, among other
things, help to evaluate a product of scientific work and communication
(possibly before publication or forwarding) and, if necessary, to optimize it in
time (for a more differentiated interpretation of the DQ Pyramid with regard
to the criterion "plausibility", see Chapter 8).
This diagram also highlights the dimensions that structure this book. The
book is divided according to quality criteria: Each chapter introduces a
quality criterion. For example, Chapter 3 deals with completeness, Chapter 4
with uniformity, etc. Some chapters deal with the same criterion; Chapters 8
and 9, for example, each deal with plausibility. Chapter 8 introduces the basic
principle and initially simple approaches; Chapter 9 introduces the use of
much more sophisticated screening programs. Some (later) chapters deal with
working with several criteria resp. introduce further criteria (e.g., chapters 9,
13, 16, and 19). Chapter 13 introduces criteria for which the use of SPSS is
not necessarily required, e.g. quantity, unambiguity, relevance, accuracy, or
even comprehensibility; nevertheless, these criteria are relevant for
professional work with data. In chapter 19, for example, criteria for
communicating the quality of data, surveys and analyses are presented. Each
chapter on quality criteria is preceded by the DQ Pyramid in order to
illustrate once again the interconnectedness of the presented criterion
(highlighted by a subtle 3d-effect) with the others. A further dimension of the
book's structure is the complexity of SPSS programs; this has always been
ordered from simple to sophisticated, whenever possible. Complexity is also
related to the application possibilities of SPSS; these begin with single
values, variables and datasets and extend to the quite demanding work with,
and programming for, several values, variables or datasets (e.g. Chapters 10
and 11). Because the DQ Pyramid is structured according to quality criteria,
the diagram does not contain all chapters of this book; e.g. the chapters on
IBM SPSS Modeler aka Clementine (Chapter 16), Macintosh (Chapter 17),
and communicating data and analysis quality (Chapter 19) are not shown.
Chapter 12 is the only chapter whose criterion (time- and date-related
problems) appears in the pyramid hierarchy in a different position than in
the book's table of contents. This chapter has a special position
because the "time" factor does not always need to be checked. This means:
Chapter 12 is only necessary if the data to be checked actually contains time
or date variables. In this case the criteria from Chapter 12 are to be applied
before the examination of the general plausibility in the sense of Chapters 8
or 9 (see diagram). However, since datasets do not always contain time or
date variables, Chapter 12 has been placed after Chapters 10 and 11 because
it may be used less frequently than the other chapters (see Table of Contents).
The next level: Plausibility
All measures to ensure data quality initially strive together for the goal of
plausibility. Plausibility is defined in such a way that the content of the data
in a dataset not only corresponds exactly to the content of an already existing
dataset or information in a structured survey, e.g. in a dataset or a
questionnaire, but qualitatively goes beyond this, e.g. by identifying,
correcting or removing errors made inadvertently (or even intentionally).
Data of higher quality can thus be interpreted more accurately and
semantically more clearly. Derived information is more exact and better
justifiable or without fuzziness and uncertainty. Data at the end of a quality
process is information.
Plausibility is the most demanding to check and requires (as shown in the
DQ Pyramid) among other things completeness, uniformity and further
formal correctness of the data (in fact, the various criteria can in principle be
assigned to concrete "degrees" of "plausibility", see Chapter 8). The
plausibility of the data to be checked presupposes the plausibility of other
data for the reconstruction of missings or for checking outliers. Every
approach to ensure plausibility is based on the fact that the data that form the
actual basis for the plausibility decision are themselves plausible. To this
extent, such verification and correction measures must be carried out with
great care. However, the plausibility of data, analyses and results does not
mean that the work on quality is finished. Quality must also be communicated
correctly (see Chapter 19); it is essential to prevent suboptimal
communication from giving a dilettantish impression of what is actually
professional work.
1.2 Let’s talk about the DQ in DQ
Pyramid
The verification of data quality with SPSS is based on the assumption that
data quality is objectively definable, measurable and therefore provable.
Completeness, for example, is objectively definable, measurable and
provable: Either there is an entry in a field (valid value) or not (missing). The
same applies to all values of a data storage: Either a data storage is 100%
complete or only 95%, 88% etc. The same point of view can be applied to
duplicate entries etc.: Either a data storage does not contain any
(inadmissible) duplicates or, if it does, it does so to an objectively and
precisely quantifiable extent. The situation is different, however, when it
comes to data quality requirements. Data requirements can be quite
subjective and can deviate from optimal objective quantities. For example,
depending on the application, data storage can only be 95% complete from a
decision maker's point of view, a certain scatter range is permissible for
certain quantities according to an expert, etc. However, the subjective
requirements for data quality should not be confused with their factually
objective measurability. Whatever reasons, criteria or causes are put forward
by users to explain the deviation from optimal data quality, the consequences
are the same:

– Any deviation from perfect data quality is always suboptimal data quality.
– Suboptimal data quality therefore always has to be explained and accounted
for by the subjective reasons, criteria or causes brought forward by the
users.
– (Sub)optimal data quality is not an end in itself, but an attribute with
consequences.
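The objective measurability claimed above can be illustrated with two elementary measures (Python sketch with fictitious values; None stands for a missing entry):

```python
def completeness_pct(values):
    # Share of non-missing entries; None is treated as missing.
    valid = sum(v is not None for v in values)
    return 100.0 * valid / len(values)

def duplicate_extent(records):
    # Number of records beyond the first occurrence of each value.
    return len(records) - len(set(records))

ages = [34, 41, None, 29, None, 56, 62, 47, 38, 51]  # fictitious column
ids  = ["A1", "A2", "A3", "A2", "A4"]                # "A2" occurs twice

print(completeness_pct(ages))   # → 80.0 (not 100%: objectively quantified)
print(duplicate_extent(ids))    # → 1
```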

Data quality: Definitions


The classical definition for data quality is "data to be of high quality if
they are fit for their intended uses in operations, decision making and
planning" (Juran & Godfrey, 1999, 2; see the definition of the US Census
Bureau, 2006, 1: "The Census Bureau defines quality as 'fitness for use'.").
As accurate as this definition is in its universality, it is equally
imprecise: it specifies neither which criteria constitute the quality of data
(e.g. accuracy, uniformity etc.) nor to what extent. I take the liberty of
defining data quality as follows:
The quality of data is a multi-parametric relation predicate that results
from the type and number of required criteria (the "criteria canon"), the
methods of their examination, the tolerances/limit values of the respective
criteria (e.g. 100%, 3 sigma, etc.), as well as the excluded criteria, etc.
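As a sketch of this definition, a criteria canon can be represented as a set of (measure, tolerance, comparison rule) triples. The criterion names and tolerances below are invented for illustration; in practice they would have to be defined, and their choice justified, by the user:

```python
def evaluate(column, canon):
    # For each criterion: measured value, tolerance/limit, and pass/fail.
    # Excluded criteria and chosen tolerances must be made transparent.
    report = {}
    for name, (measure, limit, ok) in canon.items():
        value = measure(column)
        report[name] = {"value": value, "limit": limit, "passed": ok(value, limit)}
    return report

# Hypothetical canon: criterion -> (measure, tolerance, comparison rule).
canon = {
    "completeness_pct": (
        lambda col: 100.0 * sum(v is not None for v in col) / len(col),
        95.0,                     # tolerance set by the user, not by the tool
        lambda v, lim: v >= lim,
    ),
    "duplicates": (
        lambda col: len([v for v in col if v is not None])
                    - len({v for v in col if v is not None}),
        0,                        # no (inadmissible) duplicates allowed
        lambda v, lim: v <= lim,
    ),
}

ids = ["A1", "A2", "A3", None, "A2"]   # fictitious column
report = evaluate(ids, canon)
for name, r in report.items():
    print(name, r["value"], "passed:", r["passed"])
```

The point of the sketch is the shape of the definition, not the two sample criteria: the quality verdict is a relation between data, criteria, methods and tolerances, all of which are explicit and auditable.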
The user does not only have to define the optimum; this is self-evident. The
real challenge is to avoid any deviation from the optimum or, where a
deviation is unavoidable for pragmatic reasons, to explain and justify it and
its consequences. To avoid any misunderstanding right from the start: data
quality is not relative. Each context may well have its own definition of
data quality, appropriate to the subject or data. The respective definition
and guarantee of data quality are central tasks of the user. The
justification demanded here serves as proof of the adequacy of the quality
definition and thus of its objectivity, and in practice also excludes
arbitrary measures. This book will therefore not provide a definition of data
quality in the sense of a single global criterion, but a framework concept, a
canon of criteria, that should make it possible to analyze any definition of
data quality. The starting point for this is the totality of all criteria,
optimal methods and tolerances/limits as a goal. However, any exclusion of
criteria, any use (or waiver) of test methods, as well as any applied
tolerance criteria must be made explicitly transparent and justified.
Data quality before analysis quality: Experiences with DWH
Data quality comes before analysis quality. That dirty data lead to
distorted results is self-evident and can be demonstrated in detail with
numerous specific examples, depending on the type of "contamination",
e.g. for completeness (Chapter 3), duplicates (Chapter 5), and outliers
(Chapter 6). Without knowing the specific type, cause and extent of data
errors, their effects are inconsistent and unpredictable (e.g. Haughton et al.,
2003, 69-75 based on a simulation using credit data). Decisions based on
such data can therefore be correspondingly completely incorrect. Data quality
first creates the foundation and prerequisite for meaningful analyses.
Securing and warranting data quality cannot be taken for granted; complexity
or effort can easily exceed that of the planned "real" analyses. In some
projects the phases of data management and data cleansing are many times
more complex than the analysis itself. Berry & Linoff (2000, 177) therefore
call "dirty data" the "curse of data mining".
In a survey among data warehouse practitioners, 95% of respondents
indicated that poor data quality impacts their business performance, yet only
30% (sic) admitted their own data is of poor quality (Experian, 2019, 10, 6).
Approx. 90% of the time
required is spent on data preparation and post-processing (Cabena et al.
1998, 43); approx. 70% of the costs are caused, for example, by measures to
ensure data quality. The resources and skills required for this should by no
means be underestimated. Gartner (2018) estimates that this is costing
companies an average of $15 million per year. Eckerson (2002) quotes a
study by the Data Warehouse Institute, according to which "managing data
quality and consistency" and "reconciling customer records" are among the
greatest challenges in CRM projects. Data warehouses often fail due to
insufficient data quality. The FleetBoston Financial Corp. (USA), for
example, failed in 1996, despite a budget running into millions, because it
was underestimated how difficult and time-consuming it can be to
standardize and merge data from 66 systems (Eckerson, 2002, 7). In addition
to cuts in the actually available budget, ensuring data quality is the second
biggest problem (Peterson, 2003). The author, for example, took on a project that was
initially completely muddled, in which the quality assurance measures
required were ultimately in a ratio of approx. 20:1 to the effort required for
the actual analysis. This was mainly due to datasets and databases with
non-uniform formats and contents. Dismissing this activity as mere
"footwork" therefore completely misunderstands its importance and
complexity (see also Lee et al., 2006, 2). One of the aims of this book is
therefore to bring this all too often neglected activity and its significance to
the fore.
A professionally executed data quality project, on the other hand, is quite
capable of saving several hundred million euros over a period of several
years. Batini & Scannapieco (2006, 188-199), for example, cite a case in
which a budget of about €6 million for architecture and maintenance over a
three-year period was expected to allow savings of at least €600 million, and
depending on the calculation method, even up to €1.2 billion.
Data quality before analysis quality: Experiences in research
Researchers often put a lot of effort into analyzing and publishing their
studies in the most sophisticated way possible. This contrasts with an often
frighteningly naive and concept-free handling of their foundation, the quality
of the data. One could almost say that some users are blind in one eye (data
quality) and see only with the other (analysis quality). In this context, I
could provide numerous, rather drastic examples from personal experience.
For example, some time ago I was commissioned to examine the securing
and warranting of data quality in a multi-center clinical study supervised by
a third party with a duration of several years. The specific cause was that
important publications were due to be published, but inconsistencies in the
data were discovered by chance, which could neither be checked nor
explained. After several days of reviewing data and documentation, it was
clear that the study directors and their staff had made almost every mistake
they could. The most massive mistakes that were immediately noticed
included duplicate datasets and data rows, but also data gaps in the master
dataset despite complete individual datasets. Unnoticed, encodings for
missings were also included as real values in the analysis. When checking the
programs for data management and data analysis (so-called code review) in
detail, further logical or statistical errors were found. In the end, a page-long
list of errors and omissions came together, which at the same time created the
conditions for a constructive start. In the course of troubleshooting, all
possible sources of error were systematically eliminated one after the other
with the help of this list. Analyses of the now quality-optimized data
finally revealed numerous informative effects, whose nuances and
differentiations were included in numerous international publications. While
this project escaped with a "black eye", due to my own re-analyses I am aware
of other projects whose incorrect results even found their way into
publications at governmental level. Some projects could only be declared
failed beyond repair. Every reader may have his or her own thoughts about the
consequences. The examples outlined above will hopefully make it clear
that not checking one's data or its transformations is no reason to assume
that the data is in order. A lack of data quality is not fiction; it can occur
at any time and on any scale, from student theses to data warehouses.
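The missing-code and duplicate problems from the clinical study above can be sketched in a few lines. This is a hypothetical illustration in Python/pandas (the book itself works with SPSS, where MISSING VALUES declarations and duplicate checks serve the same purpose); the values and the code 999 are invented:

```python
import pandas as pd
import numpy as np

# Hypothetical patient data: row for id 2 is duplicated,
# and 999 is an (undeclared) code for "missing".
df = pd.DataFrame({"id": [1, 2, 2, 3], "age": [34, 41, 41, 999]})

# Naive analysis: the duplicate and the missing code enter the mean unnoticed.
naive_mean = df["age"].mean()    # (34 + 41 + 41 + 999) / 4 = 278.75 -> nonsense

# Cleansed analysis: drop duplicate cases and declare 999 as missing
# (the pandas analogue of SPSS's MISSING VALUES).
clean = df.drop_duplicates(subset="id").replace({"age": {999: np.nan}})
true_mean = clean["age"].mean()  # (34 + 41) / 2 = 37.5
```

The naive mean of 278.75 versus the true mean of 37.5 shows how a single undeclared missing code can render a result worthless.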
1.3 DQ: Cost-benefit considerations. Really?
“Is the effort for data quality worth it?” is certainly the next, but clearly
absurd question. You fasten your seatbelt in your car because it is safer, but
not because an accident is going to happen (this logic should also be followed
by seatbelt opponents). Of course it is worth the effort. Data with higher
quality are more accurate and more interpretable. If the data contains e.g.
cash-value information, then data quality ensures more precise analyses (e.g.
customer groups, customer relationship phases), more concrete predictions
(e.g. product acceptance, future purchasing behavior), justifiable decisions,
and ultimately a competitive advantage. Every inaccuracy costs real money,
e.g. if an entrepreneur works with inaccurate sales data or purchase
profiles of his customers and this blurred information flows into strategies.
Data quality leads to more information (because it is correct) and thus to
more sales and profit. Data quality helps to optimize the benefit factor and
supports a more precise identification of potential customers (target
marketing, up-/cross-selling), increased hit rates in direct marketing, and thus
reduction of advertising costs. Data quality thus also protects against
misconduct and bad investments. In CRM (Customer Relationship
Management) data quality prevents, for example, that customers are treated
incorrectly. In marketing, data quality prevents, for example, that actions
based on duplicate, inaccurate or incorrect information cause damage to the
image or deter interested parties (e.g., letters with misspelled surnames,
advertising of undesirable products or services, or the issuing of incorrect
invoices). Optimized processes, products and strategies support demand and
offer acceptance, customer satisfaction and customer loyalty. Misdirected
marketing investments (e.g. by multiple mailings to the same address) can be
saved immediately, losses due to incorrectly calculated or transmitted prices
can be directly prevented (see for numerous examples among others Redman,
2004, 2001; Eckerson, 2002 passim). Inadequate data also causes erroneous
analyses, which in turn lead to wrong decisions and ultimately to a refusal
attitude of decision makers towards the data storages (Helfert, 2000, 65).
Working with high-quality data not only increases transparency within the
company, but also leads to quality awareness, which in turn will have a
positive effect on the working and business climate right up to the interested
party. In particular, a documentation of quality assurance measures and their
fulfillment that is communicated to the outside world supports the confidence
of interested parties in the data quality. The return on investment (ROI)
therefore shows up on two sides: avoiding costs (bad investments, wrong
decisions) and increasing benefits (profit, competitive advantage, competitiveness). The
effort to ensure data quality is therefore worthwhile as long as these measures
prevent misinvestments and wrong decisions and help maximize profits and
competitive advantages. The ROI for ensuring data quality is likely to be
quickly recouped, not only in the case of corporate data. Ultimately, data
quality has clear material consequences, above all the credibility that
legitimizes one's own actions. To name some facts in conclusion: Redman
(2004), for example, developed the COPDQ factor (cost of poor data quality)
over the years and assumes that poor data quality increases costs tenfold. If,
for example, measures with data that are OK cost €1, then
the same measures with data that are not OK would cost €10. Redman
estimates, however, that the true costs are much higher and that suboptimal
data quality costs the industry at least 10%, if not 20% of revenues. The Data
Warehousing Institute estimates that, for example, American companies lost
more than $610 billion in revenue in 2001 alone due to "dirty" data; however,
the true cost is probably much higher (Eckerson, 2002, 5; Peterson, 2003).
Causes of suboptimal data quality
The causes of suboptimal data quality are multi-layered and, especially with
increasingly large heterogeneous data storages, are also extremely complex
from a technical point of view. The following causes are possible with data
warehouses (see also Lee et al., 2006, 80–108; Eckerson, 2002):

- suboptimal (too restrictive/too tolerant) or only temporarily working test algorithms or tools, or none at all.
- subjective factors in data delivery: e.g. unintentional/intentional bias.
- fuzzy definitions: In data management systems, precise data definitions must be used. "Old", however, can be relative. In the food trade, for example, goods are moved as quickly as possible because otherwise they can no longer be consumed. An incorrect definition of "age" in the database of a large food retailer, however, led to even wines being sold quickly at a low price. Because of this mistake, high-quality "old" wines were sold at a fraction of their actual value.
- high volatility of the content of the data storage: Certain types of data are invariant (e.g. dates of birth, publication or foundation), while others can be highly volatile (e.g. stock prices, schedules, sales data). Personal data, for example, is subject to permanent dynamics (e.g. marriage, relocation, change of address, etc.).
- inconsistent representation of data: e.g. names in the form first name - middle name - last name as opposed to last name - first name - middle name.
- complex data representations: e.g. in the form of graphics or detailed texts instead of compressed (numeric) codes.
- increased or substantively changed requirements for the data: If the data management does not allow flexible changes of data, there is a risk that data is "provisionally" transferred to the DWH, with the consequence that, among other things, inconsistencies may occur.
- different storage locations or formats for data: e.g. storage of data in dBASE, Microsoft Access, ORACLE and Microsoft Excel, among others (partly caused by a change to another software, by changed organizational or departmental structures, etc.).
- insufficient hardware and software: e.g. unreliable computers (servers, systems), data transfers, inaccurate statistical programs or algorithms (e.g. Keeling & Pavur, 2007; Knusel, 2005; McCullough & Wilson, 2005, 2002, 1999; McCullough, 1999, 1998; Davies & Smith, 1999b).
- conflicts within and between hardware and software: The functionality of accesses (e.g. SQL queries) may depend on hardware, operating system or even the format of the data to be accessed. The data itself may be OK, but its (technically speaking: more complex) environment may cause problems, especially if the systems are (geographically) distributed.
- huge amounts of data: The more data that accumulates, the less time is often left for thorough examination and implementation of long-term solutions to ensure data quality.
- insufficient understanding of data quality (e.g. of inputters, users or decision makers): e.g. with regard to the identification, interpretation and correction of possible errors (e.g. Laliberté et al., 2004, 16), as well as in relation to realistic but complex cost-benefit assessments (see also Lee et al., 2006, 13-26; Batini & Scannapieco, 2006, 88-95; Eppler & Helfert, 2004).
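Two of the causes above, inconsistent representations and different formats, can be illustrated with a small, hypothetical harmonization sketch in Python/pandas (the names, dates and source systems are invented; the book itself works with SPSS):

```python
import pandas as pd

# Hypothetical customer records from two source systems with
# inconsistent name order and different date formats.
a = pd.DataFrame({"name": ["Meier, Anna"], "since": ["2019-03-01"]})
b = pd.DataFrame({"name": ["Bert Schulz"], "since": ["01.03.2019"]})

def normalize_name(n: str) -> str:
    # Unify to "Last, First" regardless of the source representation.
    if "," in n:
        return n.strip()
    parts = n.split()
    return f"{parts[-1]}, {' '.join(parts[:-1])}"

# Parse each source with its own explicit date format BEFORE merging,
# so that "01.03.2019" cannot silently become January 3.
a["since"] = pd.to_datetime(a["since"], format="%Y-%m-%d")
b["since"] = pd.to_datetime(b["since"], format="%d.%m.%Y")

merged = pd.concat([a, b], ignore_index=True)
merged["name"] = merged["name"].map(normalize_name)
```

After harmonization, both records carry the same date and a uniform name representation; without the explicit per-source formats, the ambiguous day/month order would be a classic source of the inconsistencies described above.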

1.4 Fact before fiction (range of effects)


It is pointless to speculate about the causes, the number of unreported cases,
and the consequences of undiscovered errors. In each case, the correctness of
one's own actions was either not questioned (by the respective user or by
others), or correction processes were not (effectively) checked to see whether
they were correct or successful.
In basic science, for example, it might be of interest how users and others can
trust the results achieved. In a science based on "publish or perish", nothing is
more feared than if it is found after publication that the results are based on
errors in data or simplest calculations (see also Schendera, 2006; Höding et
al., 2005; Friedmann et al., 2004; McKinsey & Company, 2004). If the data
contain, for example, clinical information, then in the event of incorrect data
patients, physicians and pharmaceutical companies not infrequently risk
reputation, trust and health; this is ruled out from the outset with
responsibly verified data quality, and often enough the benefits outweigh the
costs (see Richardson & Chen, 2001; McFadden, 1998; Gassman et al.,
1995).
Effects of poor data quality can go in the direction of alpha or beta errors,
i.e. simulate differences where there are none, or conceal effects that would
be highly interesting (see also Haughton et al., 2003, 72). When data quality
is warranted, one has the certainty that the results at hand (regardless of
whether they point in the direction of significance or non-significance) are
not impaired by poor data quality, and others have the certainty of being able
to base their decisions on reliable information.
Some scientistic-euphemistic formulations pretend that any lack of data
quality is only a matter of subtle nuances in the correctness of considerations
or hypothesis tests, which could have turned out differently, but at least
similarly. To put it bluntly: "suboptimal" data quality can also mean that the
data and everything based on it is complete junk. Irrecoverably lost, not
recyclable, a complete waste of resources, with consequences up to and
including recourse claims or litigation.
Poor data quality can sometimes have far more devastating consequences:
Incorrectly converted data caused, for example, the crash of the first Ariane 5
rocket on June 4, 1996, and incorrect address data caused NATO's accidental
attack on the Chinese embassy in the Kosovo war on May 8, 1999. Due to a
wrong signal, the US space agency NASA lost contact with the Mars probe
"Mars Global Surveyor" in November 2006. The Federal Employment
Agency reported false unemployment figures, for example, in the months
December 2006 to April 2007 (SPIEGEL ONLINE, 2007). Incorrect data
endanger annual reports of companies and may deter investors, jeopardize
trust in democratic elections (e.g. counting of votes), the fight against
poverty, terrorism or crime, etc. (see also Atherton, 2007; Beikler, 2005;
HEISE online news, 2005; Klotz, 2005; IMF Survey, 2004; Redman, 2004,
2001; Bange & Schinzer, 2001; Eurostat, 1999/1998).
Plausibility analyses do not just make sense: plausibility analyses create
sense. It is plausibility analyses that first create the credibility of data
(trust). Without data plausibility, no analysis. Plausibility analyses make
sense for data volumes of any size, up to data warehouses: With very small
datasets, plausibility analyses rule out that a single data error has a massive
negative impact on the results. With very large datasets, plausibility analyses
rule out that apparently isolated errors add up to unmanageable
misinvestments. Data quality is a priority not only for very large datasets.
Ensuring and providing data quality is an element of phase 2 ("data
understanding") of the CRISP-DM methodology (see SPSS, 2000; Chapman
et al., 1999). Experience shows that data mining (also) involves data of
untested quality, where the various errors and problems often only become
apparent at the moment of analysis or modeling (Berry & Linoff, 2000,
177-181). "Data cleaning" is one of the best practices; its omission a clear
"mistake" (Rud, 2001).
Plausibility analyses protect professional data mining from data littering and
create the basis for credible results, considerations and decisions. And the
best thing is: Data quality is not a matter of faith, but a verifiable fact.
Criteria of data quality (e.g. completeness) can be objective and measurable
(e.g. either a value is present or not). The following measures for verifiably
ensuring optimal data quality in the most frequently occurring problem areas
of completeness, uniformity, duplication, missings, outliers and plausibility
show how. Data that has been successfully subjected to these measures can
be justified and verifiably described as being of high quality.
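A minimal sketch of such a plausibility check, here as a rule-based range check (the columns, values and tolerance bounds are invented for illustration; Python/pandas stands in for the SPSS procedures used in the book):

```python
import pandas as pd

# Hypothetical records; two age values are implausible.
df = pd.DataFrame({"age": [34, 41, -2, 130], "weight_kg": [70, 81, 64, 59]})

# Plausibility rules as (column, lower bound, upper bound) tolerances.
rules = [("age", 0, 120), ("weight_kg", 30, 250)]

def implausible(df: pd.DataFrame, rules) -> pd.DataFrame:
    # Collect every row that violates at least one rule.
    mask = pd.Series(False, index=df.index)
    for col, lo, hi in rules:
        mask |= ~df[col].between(lo, hi)
    return df[mask]

flagged = implausible(df, rules)  # flags the rows with age -2 and 130
```

The flagged rows are not automatically "errors"; they are candidates that must be diagnosed against the data definitions, in line with the check schemes described in Chapter 2.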
Optimal data quality plays a fundamental role especially in studies with
extreme consequences (lethality), effort (resources) and phenomena in
fringe ranges. Possible consequences of suboptimal data quality can be
illustrated by a case in which data quality certainly played a decisive role.
In 2004, for example, in the scandal surrounding the pain medication
Vioxx and its side effects (including a doubling of heart attacks and strokes
when taking medication for more than 18 months), a large pharmaceutical
company was reported to the US Securities and Exchange Commission
(SEC) by consumers and relatives of actual and alleged fatalities for "false
and misleading information regarding the safety profile". Only a few minutes
after the report, the company's share price plummeted. A few weeks later,
even the US Department of Justice intervened. According to the U.S. Food
and Drug Administration (FDA), there was a so-called "signal" of a potential
side effect when the drug was approved, but to a statistically insignificant
extent. A few months later, the FDA estimated that in the U.S. alone, between
88,000 and 140,000 people may have suffered serious cardiovascular disease
from taking Vioxx since its approval in 1999 (see also Graham et al., 2005).
The drug was withdrawn from the market at the end of September 2004.
According to its own statements, the pharmaceutical company was facing
several thousand lawsuits. In August 2006, the pharmaceutical manufacturer
was sentenced by a U.S. court to pay compensation in the millions following
a lawsuit because it knowingly provided false information to physicians and
acted negligently because it did not sufficiently inform physicians about the
risks involved in taking Vioxx. In another case, the Supreme Court of the
State of New Jersey overturned a verdict in favor of the pharmaceutical
company and ordered a new trial. The court justified its ruling by stating that
since the first ruling, new evidence had emerged that the drug company knew
that patients could suffer heart attacks even if they had been taking Vioxx for
less than 18 months.
The literature on data quality is heterogeneous, mostly theory-based and often
contradictory. In analogy to the impetus of so-called Good Clinical Practice
to identify and eliminate quality problems (e.g. in the form of the ICH
Guidelines), there are repeated efforts to establish (inter)national standards
for data quality in connection with quality assurance and quality
management, e.g. in the form of DIN or ISO standards (e.g. ISO 9000, ISO
8402). So far, however, no standard has been able to establish itself because
it seems to be difficult to agree on a selection and hierarchization of quality
criteria and then to enforce them as mandatory. The "Curriculum
Qualitätssicherung / Ärztliches Qualitätsmanagement" (Curriculum Quality
Assurance / Medical Quality Management) published by the German Medical
Association (2003), for example, emphasizes on the one hand that "clinical
studies require the highest degree of data quality" (see p. 78), but on the other
hand it defines data quality in another place in an extremely meaningless way
as the "property of a date with regard to the quality criteria objectivity,
validity and reliability" (p. 72). As will also be seen later, this definition is
clearly incorrect. ISO 9000, for example, is a series of standards (ISO 9001 -
ISO 9004, see also: ISO 10011-ISO 10013) with recommendations and
standards for quality management. Concrete specifications regarding the
quality of a product or service and the measures to be taken are not part of the
content of these standards. The definition of the term quality as well as the
criteria and processes required to achieve the objectives are determined by
the institution to be certified. ISO 9000 thus only attests that a certified
institution maintains certain procedures and forms of documentation; the
institution is free to design these procedures itself and is not bound to any
prescribed content.
The next chapter will now provide initial information for planning a data
quality project, including the prioritization of the first substeps.

2 Recommendations
The intention of this chapter is to give you some basic recommendations on
how to tackle a DQ project; in principle, make sure that you have the
necessary time, tranquility and resources for your work.

- Language and support: Choose the right language and metrics.
- Develop a strategy.
- Objective: Choose the big picture (context!).
- Relevance vs. interesting: Choose relevant data.
- Overview, package and prioritization: Develop an overall accepted priority list (work packages).
- Different approaches for different data: Sometimes correcting data is not enough.
- Proposals for sustainable work: From an ex post process to prevention, process-oriented vs. state-oriented approach, control by protocol, work with syntax, and make copies.
- Words of Caution: Avoid ad hoc approaches; avoid "optimal data quality in the shortest time with minimal effort".

2.1 Structures: Language, infrastructure, and priorities
Language, support and strategy
Establish a connection between poor data quality and a risk to research and
business outcomes. Make sure that improved data quality is desirable,
quantifiable and tangible to stakeholders. Use research or business language
to make stakeholders understand the importance of data quality, and request
support and sponsorship. If you link data quality with metrics, make sure you
provide metrics for positively formulated successes (e.g. higher ROI, higher
sales, more precise results, higher speed etc.) and negatively formulated ones
(e.g. reducing cost, avoiding fines (remember Citibank?), avoiding image loss etc.).
Make sure data quality in itself is an asset, thus also an incentive and
motivation. However, beware of promises of salvation (cf. Preface to
International Edition) (e.g. Batini & Scannapieco, 2006, 7.5.; Lee et al., 2006,
Chap. 2, 7 and 10; Redman, 2004, 2001; OECD, 2003; Berry & Linoff, 2000,
Part Three; Kimball & Merz, 2000, Chap. 15; Calvert & Ma, 1996, Part II).

Objective
Before you get down to work, make sure that strategy and general objective
are present and accepted.
An overriding objective can be e.g. the improvement (positively formulated;
see above) of a product, a service or a process (treatment, marketing,
production). The specific goal is then to check and optimize the quality of the
available data of the product, service or process. In data warehouses, for
example, data with a high ROI has priority over data with a low ROI, etc. In
clinical research, for example, so-called primary variables have priority over
secondary and other variables.
Relevance vs. interesting
In a first step, separate relevant resp. current data (thus data to be checked)
from data that is neither current nor relevant, nor will ever be included in the
analysis(es). Relevant data has priority over interesting data.
Once I was involved in a major data mining project where we categorized a
possible relation of hundreds of input variables on selected target variables
from a content-related point of view. We categorized them as A (relevant by
fact), B (relevant by theory), and C (not sure); all the others we excluded.
This categorization helped a lot to narrow the focus on really relevant fields
and to allocate resources to time-consuming activities like data cleansing and
feature-engineering.
Think in the long term. If you are not sure whether certain data might become
relevant at some point in the future, it is more economical to also check it on
this occasion. Imagine yourself: While you are reviewing the data, you are
actually doing nothing more than designing a concept for ensuring data
quality. The verification process, which now may take days or weeks to
complete, will take only a few hours on the next run.

Overview, prioritization and conflict resolution
Consult the DQ Pyramid (cf. 3.). Develop an overview of the checks and data
transformations to be performed. Create priorities. Each application area can
have different priorities. Again, relevant data has priority over interesting
data. Translate priorities (e.g. replacing missings) into concrete individual
measures. Develop work packages, e.g. replacing missings in names, dates
and values. If it is not possible to agree on the respective maximum
requirements, then at least minimum compliance requirements should be
formulated. Check your measures for success and acceptance, e.g. of
measures, codes or labels; e.g. for companies with an international focus it is
worth considering whether e.g. datetime formats should be standardized to
international, European or American datetime formats.
Try to agree on a generally accepted priority list with all parties. Since
different requirements (rapid implementation of measures vs. thorough
measures) in terms of different criteria (timeliness vs. consistency) can often
conflict with each other, or different priorities are assigned to one and the
same requirement (e.g. 100% vs. 75% elimination of inconsistencies), this is
not always easy in practice (see below also the comments on unfulfillable
maximum requirements and their resolution). A related problem is that
although data quality criteria can be objective and context-independent, the
demands made by the participants on these criteria are often subjective and
context-dependent. The definition of criteria is one of the most demanding
tasks and, experience shows, can often face several difficulties at once:

- Decision-makers and sometimes analysts cannot or do not always want to assess the status quo prior to a data quality measure. The reasons are often lack of time, lack of patience and/or insufficient technical understanding of the complexity of the required measures (Lee et al., 2006).
- Decision-makers and sometimes analysts cannot or do not always want to communicate the required complexity of data quality measures. This situation is particularly difficult if these measures must be communicated to external third parties, e.g. financiers or customers, who have no technical knowledge at all or often (much worse) a pseudo-understanding (Kromrey, 1999).
- Ultimately, one consequence is that necessary data quality measures often cannot be prioritized sufficiently. Some decision-makers then want everything at once, as immediately as possible, which de facto cannot work. Other decision-makers may be content with short-sighted, apparently cost-effective criteria, which in the long term may have inadequate and possibly even more expensive consequences; this can only be understood as strategic or self-calming pseudo-activism instead of the actually required measures. In the long run this cannot work either.

These problems are mostly characterized by technical complexities. The only
viable solution is to use a research or business language to make people
understand the importance of data quality, and to request support and
sponsorship from key stakeholders.

Different approaches for different data

Different data require different approaches. Sometimes correcting the data is
not enough; sometimes you have to check the causes too (cf. 7.1 for the
relevance of "frames"). Depending on the outcome of your checks, different
results require different approaches. When checking for data quality, the
following (rather data-driven) scheme can often be used first. The basic
principle applies to smaller SPSS datasets as well as complete data
warehouses (e.g. Batini & Scannapieco, 2006, 170-188; Lee et al., 2006,
65ff.; Eckerson, 2002):

- Definition of the data
- Creation of a test plan (incl. resources, data processes etc.)
- Checking the data (based on defined criteria, measures and tolerances)
- Diagnosis of the causes (in case of deviations from definitions)
- Development of a solution for data and causes (planning, calculation)
- Correction of the data
- Checking the data (after correction)
- Monitoring of data and its quality (ideally including prevention measures and auditing of project activities)

A (very simple) example of this scheme could be the replacement of two-digit
year values by four-digit year values (the so-called Y2K problem). However,
this scheme cannot always be applied.
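As an aside, the two-digit year replacement just mentioned can be sketched in a few lines; the pivot year 30 and the century window are assumptions for illustration, not a prescription of the book:

```python
def expand_year(two_digit: int, pivot: int = 30) -> int:
    # Correction step: map two-digit years below the (assumed) pivot
    # to 2000+, all others to 1900+.
    return 2000 + two_digit if two_digit < pivot else 1900 + two_digit

years = [99, 7, 30, 29]
expanded = [expand_year(y) for y in years]   # [1999, 2007, 1930, 2029]

# "Checking the data (after correction)": every result must be a
# four-digit year inside the defined window.
assert all(1900 <= y <= 2099 for y in expanded)
```

Note how the final assertion corresponds to the "checking after correction" step of the scheme: the correction itself is not trusted until it has been verified against the definition.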
Basic research is one example. If unexpected data occurs here, then it is not
necessarily the data that needs to be adjusted, but possibly the theory behind
it, i.e. the definition of the data. In this case a (rather theory-driven)
scheme could look like this:

- Definition of the data
- Creation of a test plan (incl. resources, data processes etc.)
- Checking the data (based on defined criteria, measures and tolerances)
- Diagnosis of the causes (in case of deviations from definitions)
- Development of a solution for data and causes (planning, calculation) [from here it's different!]
- Re-definition of data (e.g. using optimized theories)
- Checking the theory (cause)
- Monitoring of data and its quality (process control and auditing of project activities)

An example for this scheme could be the changed view on the development
of the ozone concentration over Antarctica. Accordingly, outliers are no
longer errors in the sense of deviations from an expected development, but
empirically valid representations of an (un)expected behavior due to the
changed view on reality (see Chapters 7 and 8). Thus, this process of
checking the data quality should ideally be carried out by someone who has
the necessary familiarity with the subject. This expert knowledge is
necessary

- to assess the content-related plausibility of data about a product, a service or a process (comparison with content-related expectations).
- to understand the realization (collection, transformation, migration, etc.) of the data itself and its sources of variation and error.

2.2 Proposals for sustainable work

From an ex post process to prevention
Checking for data quality is a process that often starts only when the data is
already present, a reactive ex post process so to speak. However, the
knowledge gained in this process can be translated into a proactive
prevention concept that allows future quality assurance measures to be
reduced to a minimum. If, for example, it is ascertained during quality control
that certain process phases are prone to errors, then these phases, for
example, can be specifically optimized and thus drastically reduce the
frequency of errors. If, for example, it is determined that one of the most
difficult errors to correct was the inconsistent spelling of product names, then
preventive measures (e.g. standardizing codes, instructions, training) can also
drastically reduce the extent of errors. The earlier sources of error are
identified and corrected, the better.
Process-oriented vs. state-oriented approach
The measures can be differentiated between a more process-oriented and a
more state-oriented approach:
In the more process-oriented approach, just the required subset is cleansed (the complete data source is possibly not cleansed at all). Faulty data does not enter analyses at all because of suitable measures (e.g. applying the required filters).
In the more state-oriented approach, the erroneous data is corrected in the data storage. A required subset does not need to be cleansed because the complete data source was cleansed beforehand.
Although the analysis of a data subset itself appears to be the same, the point is the efficiency and the resources of the (possibly repeated) data quality measures and processes involved.

The process-oriented approach is required every time (different) data is accessed or processed. If repeated, this process therefore requires resources multiple times (maybe each time on just a subset of variables). The process-oriented approach is possibly more efficient for a single or simple process (because little or even no filtering etc. has to be done), but not for changing or iterative processes.
In comparison, the state-oriented approach is done only once (but on a possibly complex and huge data volume). A state-oriented approach appears too costly for a single or simple process, especially if the correction of the complete data storage is quite extensive: you simply do not want or need to cleanse a whole data storage if you just require a small subset. For iterative or changing processes, however, it pays off.

The good news is: any data quality process developed separately for a subset can also be collected and applied to the whole data source. And a cleansed data source does not require extra data quality processes for retrieving a subset.
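The contrast between the two approaches can be sketched outside SPSS. The following Python fragment uses invented example records (not from the book): "faulty" here simply means a product name not in a canonical form, standing in for the inconsistent spellings discussed above.

```python
# Illustrative sketch with invented example data: a record is "faulty"
# if its product name is not in canonical form.
records = [
    {"id": 1, "product": "Widget"},
    {"id": 2, "product": "widget "},   # inconsistent spelling -> faulty
    {"id": 3, "product": "Gadget"},
]

VALID = {"Widget", "Gadget"}

# Process-oriented: the stored data stays as-is; every access applies a
# filter, so the filtering cost recurs with each (possibly different) analysis.
def process_oriented_subset(data):
    return [r for r in data if r["product"] in VALID]

# State-oriented: the storage itself is corrected once; later subset
# retrievals need no extra cleansing step.
def state_oriented_cleanse(data):
    for r in data:
        r["product"] = r["product"].strip().capitalize()
    return data

subset = process_oriented_subset(records)    # faulty row silently excluded
cleansed = state_oriented_cleanse(records)   # all rows corrected in place
```

In the process-oriented variant the faulty row never enters the analysis but remains wrong in storage; in the state-oriented variant the storage is repaired once and all later accesses benefit.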
Control and protocol
Log each of your actions in such detail that, if necessary, certain corrections can also be traced by third parties and, if necessary, reversed. You can log information by technical protocols, by notes in project flows, or by comments in the header or lines of syntax programs. A protocol gives you control.
Do not be tempted to make any changes to data by hand. Compliance usually requires that data not be modified manually. One reason is that many systems do not automatically log data modifications (by whom, when, reason, authorization, and so on). The more profane reason is that you certainly will not remember all manual changes. So, when it comes to ensuring data quality, I strongly recommend working with syntax (see Schendera, 2005).

Why you should work with syntax


The main reason is: a syntax protocol or syntax program is the only way to check the mouse clicks made; there is no alternative. The sequence of mouse clicks is not recorded in any other form. To assume that the type and sequence of mouse clicks can be reconstructed from the result, an SPSS output, is a fundamental error. An output only shows the descriptive, graphical or inferential result, but does not log all SPSS preferences, e.g. how missings are handled. It is also a misconception to assume that you can always remember the mouse clicks you have made; this is certainly not possible in a situation where you have misclicked, and mouse control is typically very susceptible to this. Ultimately, this should also be training for working process-oriented with the output log and the user-developed syntax program instead of result-oriented with the output window. The results should only be viewed once options and other requirements have been checked and found to be OK.

Validation: Syntax control entails the (constructively intended) necessity of validating an analysis in terms of content; this means that programming is more likely than mouse control to force you to think about why and how something should be executed by SPSS. Mouse control, by contrast, can sometimes be quite thoughtless. The mechanical use of menus, buttons and options is generally not recommended.
SPSS as syntax generator: SPSS can be set to output the command
syntax generated in the background for mouse clicks and inputs,
which you can then save, copy, rewrite and whatever else for your
own purposes.
Automation and reusability: Once written or saved, you can use a
syntax program over and over again. In addition, you can use
INSERT (from version 13 on; stops on errors, but can also be set to
continue running by using ERROR=CONTINUE) or INCLUDE (old
version, stops on errors) to execute other SPSS programs. With
SCRIPT you can also execute further commands from an SPSS
program.
Speed: The processing of a syntax program is many times faster than
(repeated) clicking on menus.
Openness: You can always extend or revise a program by copying
lines of code directly into it or by hand.
Efficiency: You can rewrite program code to macros that further
increase the automation and efficiency of process flows. With
increasing professionalization, syntax enables you to write programs
(using macros, among other things) that achieve the same scope of
performance with only a fraction of code lines, for example.
Flexibility: Syntax control is more flexible and offers more
possibilities for data management than menu control; there are some
functions in SPSS that you cannot access using the mouse but only
using syntax control (e.g. MANOVA, Ridge Regression).
Clarity and systematization: Syntax control offers clarity when
analyzing even several hundred variables. Syntax is much more
suitable for the analysis of large datasets than mouse control.
Uniformity: Syntax is a uniform and technically clearly defined
language, and in principle always explains itself. An instruction for
syntax control is therefore also an instruction for mouse control.
The exchange of syntax, which is in principle self-explanatory,
between or within internationally working research projects facilitates
communication, evaluation and interaction, and contributes to
clarification. The performance range of the various SPSS procedures
can be explained in a more differentiated way by the corresponding
syntax than by illustrations (screenshots).
Individualization: Syntax control allows you to check every option
set by reading the syntax; this means that you will also discover
(well-intentioned) defaults from SPSS, but which can definitely be
dysfunctional for your individual data situation. Trust is good, control
is better.
Exchange: Syntax programs can be sent all over the world as text
documents; if you have questions about the appropriateness of an
analysis, for example, simply copy the syntax into an e-mail and off
you go. Try this with mouse clicks.....
Protocolling and documentation: This aspect is not unimportant and
can help to avoid embarrassing situations, especially in peer reviews
and evaluations. For example, if you have performed a lengthy
analysis using mouse control and someone wants to see how you
calculated the analysis, and you don't have a syntax program at hand,
then you don't necessarily look good in this situation. Some
institutions require the analysis syntax by default in order to be able
to review works or projects.
Permanence: You cannot save years of mouse clicks, but you can save years of syntax programming. Once you have written a program, you can use it again years later without any changes. Syntax once offered by SPSS is not "thrown away": if a procedure (e.g. LOGLINEAR, which does not output deviance residuals) is replaced in the menus by another one (e.g. GENLOG, which does), you can still take advantage of the replaced procedure via syntax. In contrast to GENLOG, LOGLINEAR e.g. allows reparametrizing the categories of a factor via contrasts. So what cannot be calculated with GENLOG via mouse or syntax access, LOGLINEAR manages via syntax control.
Independence: SPSS syntax is upward compatible and largely
platform independent. Once an SPSS program has been written, it
will run on any higher SPSS version (which also means that once
syntax has been written, it will automatically address any optimized
algorithms). If the SPSS program does not include any hardware
specifics, the SPSS code is also platform independent. If you procure
other operating systems, you can continue to use SPSS programs.
Error susceptibility: Syntax is generally less error-prone and also works when buttons or menus fail under mouse control (see for example MAPS). In mouse-controlled analysis it is not uncommon for errors in the programming of buttons or menus to interfere.
Ease of learning: Easier learning of other programming languages
(e.g. C, GPL, Python, .NET, R, SaxBasic, Spark, XML) is also an
additional qualifying investment into the future. Knowledge of other
external programming languages can also be useful for expanding the
scope of SPSS or even IBM SPSS Modeler aka Clementine. Python
can also be used in Modeler, for example.
Extensibility: You can extend the regular performance range of
SPSS by integrating additional functions into SPSS via ready-made
SPSS macros, SPSS programs or scripts (e.g. via R, Python or Spark).
Customizing: SPSS can be individually extended, e.g. via your own programming in .NET, R or Python. With Python you can develop your own SPSS procedures, e.g. comparable to REGRESSION; alternatively, SPSS data can be transferred to external programming languages for further processing (e.g. Levesque, 2007, Part II). Using BEGIN PROGRAM - END PROGRAM, instructions and data can be transferred to other programming languages, e.g. Python. For Python, the IBM SPSS Statistics-Python Integration Plug-In (SPSS version 14.0 or later) is required; for other external languages such as .NET, the installation of the Microsoft .NET Framework is required.
IBM SPSS Modeler qualification: You will benefit from programming skills in SPSS in two ways. You can apply programming skills (Python, R, Spark, SPSS, etc.), depending on the Modeler version. In earlier Clementine/Modeler versions, you could even create SPSS syntax via special nodes; already created SPSS syntax programs could be integrated into the desired analysis sequence (stream). At the moment, you cannot import and run scripts created in IBM SPSS Statistics within IBM SPSS Modeler. Programming skills also help to enhance Modeler's functionality range for more sophisticated data quality checks and more analytics-oriented application purposes (cf. Chapter 16).

Copies, copies and again copies


Always make backup copies of the original data and of your programs.
Always work with backup copies only. Design "Worst Case" and "Data
Recovery" scenarios. For example, develop strategies that go beyond simple
backups. Always make copies of the original variables before transformations
etc. of variables or values; always make backup copies of the output after
transformations, e.g. to be able to detect undocumented changes of the
software algorithms and thus of your results in case of version changes
(upgrades).
Perform transformations etc. systematically and always only on the copies of
the original variables. For orientation purposes, you can add a "T" for
"Transformed", an "N" for "New", or a "C" for "Corrected" or similar before
the numerical suffix (before the numerical suffix because it may still be
needed for vectors (SPSS command VECTOR)). Check each of your steps
several times. From time to time, make comparisons between the transformed
data and the original data; in the case of complex transformations, after each
individual step. Your work is the basis for any work, analysis and decision
based on it.

2.3 A few Words of Caution


Avoid ad hoc approaches (efficiency and consistency)
Ad hoc approaches are procedures that can be characterized by the following
features:

The procedure is conceptless. It is not based on a plan, a logical sequence of steps or a goal. Suboptimal working conditions (e.g. deadline constraints) will cause suboptimal data quality even after the installation of completely new systems.
Not all relevant criteria are checked, but only an arbitrary selection.
This means that fewer criteria are checked than would be necessary.
The measures used to check the selected criteria are banal, e.g. only a
univariate reconstruction of missings instead of the required
multivariate approach.
The set criteria are too tolerant. Instead of e.g. 100% correct data,
only 75% of the data is corrected, and this without further
justification.
The ad hoc measure(s) are performed by mouse click and are not
documented anywhere. Instead of a process-oriented approach via
syntax, a state-oriented approach via mouse is chosen.
The actual causes of the data problems are not identified and corrected. The data problems continue to occur and have to be solved again and again from the beginning, or are increasingly "tolerated".

Avoid ad hoc approaches from the outset. Ad hoc approaches are not only
ineffective in the long term, but also cause more and more costs (time,
money). Supporting causes can be: an ignorance of the complexity of the
problem and an illusion of being protected from its far-reaching material
consequences, an irrational error tolerance towards (unresolved) data
problems that are still occurring or will occur again, as well as a certain
frustration instead of the necessary motivation.
Approach a data quality project with the necessary resources, first of all
professionalism counts. Design solutions or avoid problems by planning long
term and with foresight. Depending on the nature of your project, consider
the following factors (see Lee et al., 2006; Totterdell, 2005; Dravis, 2004,
Appendix A; OECD, 2003; United Nations, 2003):

management (number, hierarchies, competencies, availability, support),
personnel (number, skills, availability, support),
financial budget (volume, availability, flexibility),
time frame (weeks, months; urgency, flexibility),
computers (hardware, program versions, support),
data (volume, location, contact persons, primary/secondary variables,
etc.),
project phases and complexity (e.g. information on project schedules,
quality criteria, possible solutions, among other things),
already existing materials (e.g. documentation or standards for data quality or data problems, programs (modules) or SPSS syntax),
any resources required for monitoring and auditing and, if applicable,
know-how to be procured or purchased (workshops, training, external
support, etc.).

Do not (initially) take anything for granted in the area of data quality;
question everything. Practice the maxim: "Trust is good, control is better".
Develop comprehensive and detailed documentation. Develop a process
structure (e.g. either by quality criteria, by variables or by data storage).
Consider alternatives in the project schedule if necessary. Run continuous
monitoring, e.g. between planned and actually necessary project phases, e.g.
a comparison between planned and actual time, etc.
Why you should avoid "Optimal data quality in the shortest time with
minimal effort"
Finally, once again: make sure that you have the necessary time and tranquility for your work. Stress always has a negative impact on data quality. Avoid being torn between the three "millstones" of the possible claim of (1) optimal data quality (2) in the shortest time (3) with minimal effort. This claim is easy to state, but all three criteria cannot possibly be fulfilled at the same time. Optimal data quality simply requires the necessary (maximum) temporal as well as material effort. At first glance, only two arbitrary sides of this requirement can ever be fulfilled; in principle, the third has to be accepted as not fulfilled. Since the work goal is optimal data quality, however, resolutions are only permissible in the direction of (a) only minimum costs (in maximum time), (b) only maximum costs (in minimum time) or (c) the maximally necessary costs and time. If this maximum requirement is not resolved to one side, one ends up with maximum costs (up to personal breakdown), data quality not worth discussing, and/or missed deadlines. If the necessary resources are not available for the necessary effort, the data quality is not yet directly at risk, provided that at least enough time is available. Positively formulated, the solutions are: maximum effort (costs) and maximum time yield maximum data quality. Maximum effort (costs) and highest data quality do not necessarily require a long project time. Highest data quality in maximum time does not necessarily cause maximum effort (costs). Communicate the three sides of the maximum requirement with superiors (supervisors), co-workers or employees and lay down an obligatory resolution. If an obligatory resolution of the maximum demand is not feasible, the criteria should at least be prioritized.
Now that you have structured and planned your project, you can start by
checking the individual criteria, obviously at first with "Completeness" (see
the next chapter).

3 Completeness
Completeness is defined here as data that is expected to exist in the dataset is
actually available in the dataset. To be more precise: The number of data in
a final analysis dataset corresponds exactly to the sum of valid plus missing
information in a structured environment, e.g. in one or more questionnaires,
customer files, text collections, partial datasets, etc. This definition highlights
two aspects:

the complete mapping and storage of external information;
a quantifying definition of internal completeness, i.e. the ratio of valid values and (ideally: controlled) missing information ("missingness").

This differentiation allows to better assess and control completeness. If, for
example, all available data are without any entries, then this data is complete
externally, but incomplete internally, i.e. completely empty.
This following description of completeness uses a simplifying static scenario;
the same principles apply to highly dynamic complex (e.g. real-time
distributed) business landscapes.
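The distinction between external and internal completeness can be quantified in a minimal, hypothetical Python sketch (the field names and counts are invented, not from the book): external completeness asks whether all expected records are stored at all, while internal completeness measures the share of valid (non-missing) entries.

```python
# Hypothetical sketch: rows of a survey dataset; None marks a missing value.
expected_rows = 4                      # from metadata, e.g. a return log
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},    # controlled missing in "age"
    {"age": 29, "income": None},       # controlled missing in "income"
]

# External completeness: are all expected records present at all?
externally_complete = (len(rows) == expected_rows)   # False: one row missing

# Internal completeness: ratio of valid values to all cells.
cells = [v for row in rows for v in row.values()]
internal_completeness = sum(v is not None for v in cells) / len(cells)
# 4 valid of 6 cells -> about 0.67
```

A dataset whose rows are all present but entirely empty would thus be externally complete and internally completely incomplete, matching the extreme case described above.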

"Completeness" is one of the basic criteria in the DQ Pyramid, along with "uniformity", "duplicates", and "missings", which SPSS can be used to check. The review of all other criteria is based on these. Completeness, in turn, is a prerequisite for ensuring uniformity, excluding duplicate rows or values, as well as appropriately handling possible missings. Completeness is, so to speak, the most important prerequisite, the conditio sine qua non (with one clarification!): it must be the correct dataset. Checking the wrong dataset (e.g. for completeness) is one of the biggest mistakes that can happen. Two deviations from completeness are possible: something is missing (cf. 3.2ff., 6.), or something is too much. For the latter, two causes are possible (also in combination): duplicates may cause overcompleteness (cf. 5.), or the expectations do not reflect the reality of the system (cf. 7.1.1 on "frames"; there are more data, but these are not duplicates). Completeness can be checked on several levels:

The number of individual datasets (e.g. updates) that have been included in a master dataset (cf. 3.1.).
The number of cases (observations) that have entered the dataset in the form of data rows (cf. 3.2.).
The number of variables that have been entered into the dataset in the form of data columns (cf. 3.3.).
The existing (valid) values or controlled missings for each case, e.g. participants in a survey (cf. 3.4.).
Comment: Observations, rows or cases are sometimes also referred to as "datasets". This book reserves the term "dataset" exclusively for a file (e.g. an SPSS dataset), i.e. the entirety of its cells, even if this file should contain only one row.

3.1 Control options at the level of the number of datasets
Completeness on the level of datasets is defined here as: all data storages are available, e.g. all SPSS files. Whether these files also contain data as expected is checked in a next step. First of all, however, it is checked whether all data storages are present. The question whether all files are present seems to be a trivial problem, but it is not: if complete datasets are missing, then a lot of data is missing. The consequences can be far-reaching (see also 3.2 for what may happen if you lose only a few data rows).
For example, the Federal Employment Agency reported incorrect
unemployment figures for the months December 2006 to April 2007. The
reason was that a complete dataset was lost during data transfer. However,
the error log created by the system for such cases was apparently also
overlooked (SPIEGEL ONLINE, 2007). In July 2006, a computer technician
from the Alaskan IRS accidentally deleted data worth about $38 billion; the
real problem here was that the nightly backup tapes were not readable
(Sutton, 2007). The Bundeswehr Intelligence Center, built for the most
sensitive data in Germany (including intelligence reports, reports from the
CIA and other intelligence services), lost militarily extremely sensitive data
at the end of 2004, according to the current state of knowledge in connection
with a defective data backup robot, thus triggering political as well as IT-
related scandals (REPORT MAINZ, 2007) (see 6.1.2.).
Complete datasets can be lost faster than one suspects; possible causes include, for example, saving to defective hard disk sectors, losses when sending via email/internet (e.g. caused by antivirus programs or firewalls), or conflicts with other software, e.g. zip programs. The various datasets in a project should be logged in more than one form, e.g. as syntax, log, project documentation (audit trail) or as a regular copy of the hard disk or server.
In this context, a problem with a zip program happened to colleagues. A zip
program is generally used to compress the size of files. If several files are
zipped at the same time, files that belong together are also together in one
folder, which is practical and clear. One department of a company now
packed several files into a single zip file and sent them to another department
for review. Unfortunately, the second department always came up with
different results than the first department. The excitement was great. Until,
after a lengthy search and only by chance, it turned out that although the
correct communication channels, the correct analysis procedures and also the
correct zip file were involved, the zip file itself had a "life of its own". The
zip program "swallowed" data. After unzipping, only one of the original three
zipped files was left. If the files were then individually zipped and sent, the
data was complete and the analysis results were consistent.
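The zip incident above suggests a simple safeguard: after zipping, verify the archive's member list (and its internal checksums) before sending. A minimal Python sketch using only the standard library; the file names are invented for illustration.

```python
import zipfile

def verify_zip(zip_path, expected_files):
    """Check that the archive lists exactly the expected members and that
    each member passes zipfile's internal CRC check."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
        missing = set(expected_files) - names
        corrupt = zf.testzip()          # returns first bad member, or None
        return not missing and corrupt is None

# Usage sketch: build a small archive and verify it round-trips completely.
members = ("a.txt", "b.txt", "c.txt")
for name in members:
    with open(name, "w") as f:
        f.write("data")
with zipfile.ZipFile("bundle.zip", "w") as zf:
    for name in members:
        zf.write(name)

ok = verify_zip("bundle.zip", list(members))
```

Had the sending department run such a check (or the receiving one, against a transmitted file list), the "swallowed" files would have been detected before any analyses diverged.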
The completeness of a dataset is defined as the sum of its parts, e.g. the number of dataset deliveries, partial datasets (subsets), updates, etc. The
completeness of datasets is usually not easily verifiable within a data
management system. Usually it should be possible to access external
information (see the above notes on logging). The number of subsets can be
checked in different ways depending on the application area, e.g. by
comparing them with data documentation (e.g. project plan, metadata) or
with other, already existing datasets. Individual datasets should be checked
for a part-whole ratio, e.g. if individual datasets (so-called master datasets)
are composed of other partial datasets. This situation often occurs when, for
example, information from different departments within a company is
brought together, e.g. marketing and controlling, or when, for example,
scientists use several questionnaires and want to evaluate these together in
one file during analysis.
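A part-whole check of this kind can be sketched generically: the row counts of the partial datasets should add up to the row count of the master dataset. The dataset names and counts below are invented for illustration.

```python
# Hypothetical row counts per delivered subset, e.g. from departmental exports.
def part_whole_check(subset_rows, master_rows):
    """Return the gap between the expected total (sum of the parts)
    and the observed master row count; 0 means the counts reconcile."""
    expected = sum(subset_rows.values())
    return expected - master_rows

gap = part_whole_check(
    {"marketing.sav": 1200, "controlling.sav": 800},  # parts: 2000 rows
    master_rows=1950,                                 # observed whole
)
# gap == 50: 50 rows are unaccounted for in the master dataset
```

A nonzero gap does not say which rows are missing (or duplicated), only that the part-whole ratio is violated and the merge deserves closer inspection.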

Check option: Visual inspection:

Check the completeness of all rows and columns visually using the
variable and data view. SPSS numbers the existing entries on the left side
of the Variable and Data View. This method may be sufficient for
relatively small amounts of data; for large to very large amounts of data,
other approaches should be chosen.

Check option: Metadata:

Based on a project documentation (e.g. project plan, questionnaire, coding list, etc.), you can check how many subsets the whole dataset should at least contain. If, for example, the number of rows (cases) is smaller, the checked dataset is probably not complete.
Based on system data (e.g. dictionaries, system info, etc.), you could check whether the datasets you want to analyze exist at all. You do this by searching for their names in the enterprise-wide system. If they exist, you could start checking by analyzing their names (e.g. whether their numeric pre-/suffixes are without gaps), whether they contain the same number of columns, whether the column names are identical, and the like. Dictionaries are a big help (cf. Schendera, 2012).
Check option: Comparison dataset:

The present dataset is compared with another dataset (e.g. from the
previous year) with regard to its completeness, e.g. with regard to the
number of variables (columns). If, for example, the number of columns is
lower, the checked dataset is probably not complete. If you want to know
exactly, you can also use SPSS to check two structurally identical datasets
for absolutely identical contents (see 11.5.). Since SPSS v21 you can use
Data → Compare Datasets... (SPSS procedure COMPARE
DATASETS). Non-SPSS Statistics data must be opened in SPSS
Statistics before they can be compared.
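Outside SPSS, the structural part of such a comparison can be sketched in a few lines of Python (the column names are invented); COMPARE DATASETS itself goes further and also compares values case by case.

```python
# Hypothetical column lists of last year's and this year's dataset.
previous = ["id", "age", "income", "region"]
current = ["id", "age", "income"]

# Columns present last year but absent now hint at incompleteness;
# unexpected extra columns hint at a structural change to clarify.
missing_cols = [c for c in previous if c not in current]
extra_cols = [c for c in current if c not in previous]
# missing_cols == ["region"]: the checked dataset is probably not complete.
```

The same pattern applies to row counts or variable labels: compare the observed structure against a trusted reference and investigate every deviation.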

Check option: Syntax:

The (partial) datasets in a master dataset can be checked e.g. via the executed syntax or via ID variables. It does not matter whether UPDATE, ADD FILES or MATCH FILES was used: from the program itself you can read which datasets have been joined, but not necessarily how often. Also check whether the output (log) reports any errors.
Macro programs can have their own pitfalls, e.g. if central variables for macro control contain missings. Chapter 11.7.1 shows an example where a dataset could not be created because the required variable contained a missing. If a later step of the macro then automatically joins all automatically created files together, this is done without detecting the missing file.
Even with a correct program, it is not always possible to read how often
the datasets were merged. If an ADD FILES command is executed several
times, for example, the new dataset is appended to the master dataset
several times; it is possible that datasets were included too often. ADD
FILES or MATCH FILES only allow the handling of a maximum of 50
datasets at once; it is possible that the remaining datasets were not
included. To exclude these and other possibilities, further options are
available:

Check option: Counters (row number, record count):

The completeness of a dataset can be checked by simply counting the number of rows (syn.: records, cases). The summed row counts of the individual partial subsets should match the total row count of the whole dataset. A row number is the number SPSS assigns to the existing entries on the left side of the Data View. A record count is the total number of data rows. Batch totals and hash totals are more sophisticated variants of such counters and are presented in the section about how to control the number of cases (rows, 3.2.).

If a number of updates always contains an identical number of rows, the row count can be multiplied by the number of updates to arrive at the total row number that the master dataset should contain.
If the updates contain control variables for the update number (cf. 11.1, 11.2), the analysis of these variables can be used to check the correct one-time (or possibly inadvertently repeated) combination.

Even if neither counters nor irregular variable structures are present, duplicate datasets can be identified (and removed) by checking for duplicate rows. Always save an updated master dataset under a different name than the old master dataset. It would be disastrous if the master dataset were accidentally overwritten with the wrong values and no copy were available anymore.

Check option: Names of datasets and variables:

The completeness of the individual datasets can be checked by the names of the contained variables, which often use systematic prefixes or suffixes.
For example, the dataset names of the Socio-Economic Panel (SOEP) of
the German Institute for Economic Research vary according to the
prefixes, e.g. "AHBRUTTO", "BHBRUTTO", "CHBRUTTO" etc. The
prefixes "A", "B" and "C" represent the respective year of the survey
wave, in this case 1984, 1985 and 1986. The remaining string "BRUTTO"
indicates the main content of the annual datasets, in this case the annual
gross (“brutto”) person data. Alternatively, it would be conceivable to use
dataset names where the prefixes remain constant and only the suffixes
vary, e.g. HBRUTO84, HBRUTO85, HBRUTO86, etc. For each dataset
that should be present, the name must also appear in the syntax. If the
expected partial data (e.g. from the year 1985) are not available in the
complete dataset, either the name in the syntax was wrong or the syntax
was correct, but the dataset was inadvertently incorrectly designated. The
same applies to the level of variable names. All SOEP variables have the
same prefix indicating the corresponding year of the survey wave. So, if
for example, the variable list of the “complete” dataset does not contain
any names beginning with B, you can conclude that the contents of the
1985 dataset have not arrived in the overall dataset for reasons yet to be
clarified.
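The naming logic described for the SOEP can be turned into a small automated check. The wave-to-prefix mapping follows the example in the text (A=1984, B=1985, C=1986); the variable names themselves are invented for illustration.

```python
# Survey waves and their name prefixes, as in the SOEP example in the text.
waves = {"A": 1984, "B": 1985, "C": 1986}
variables = ["AHERGS", "AANZAHL", "CHERGS", "CANZAHL"]  # invented names

# A wave counts as present if at least one variable carries its prefix.
present = {p for p in waves if any(v.startswith(p) for v in variables)}
missing_waves = [year for p, year in waves.items() if p not in present]
# missing_waves == [1985]: no variable starts with "B", so the 1985
# contents apparently did not arrive in the overall dataset.
```

Real prefix schemes need care (a prefix may coincidentally match other names), but even this crude check flags an entire missing wave immediately.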

Limits of verifiability:
A statement about the completeness of a data storage stands and falls with
the quality (completeness) of its documentation (metadata). If a data
documentation does not allow a statement about the TARGET
completeness, then a review is not able to derive a deviating ACTUAL
completeness or incompleteness.
Practical work with several datasets simultaneously is generally considered
one of the more demanding requirements in the field of data
management/data analysis and is therefore dealt with in Chapter 11 of this
book.

3.2 Control options at the level of the number of cases (rows)
Files are composed of rows (cases, observations, questionnaires, etc.),
possibly also in the form of row-wise updates. What would be the
consequence if you lost only a few rows instead of entire records? It depends
on the type of data you lose.
During the Covid-19 pandemic, the UK Health Authority used the outdated Excel file format .XLS to automatically pull lab data together so that it could then be uploaded to a central system. As the XLS file type tolerates a maximum of only about 65,000 rows, further Covid-19 test results transmitted every evening were simply left out. As a consequence, 15,841 positively tested cases did not appear in the corona case overview. The root cause was not the choice of a limited file format, but that there were not even basic checks and controls for the data upload, such as record counts, batch totals, or even hash totals.
Getting back to the question at the beginning: what would be the consequence if you lost only a few rows instead of entire records? In this case, thousands of lives were put at risk because of a delayed contact-tracing process (BBC News, 2020b).
As seen in the example above, the completeness of a dataset is defined by the sum of its parts, e.g. the number of rows, counted either individually or group-wise (subsets). If we understand groups or single rows as datasets themselves, then checking them for completeness follows that of datasets, i.e. checking a part-whole ratio. Basic approaches are e.g. visual inspection, metadata, a comparison dataset (e.g. before the upload), or counters. Checking the completeness of columns (variables) is treated in 3.3.

Check option: Visual inspection:

Check the completeness of all rows visually using the data view. SPSS
numbers the existing entries on the left side of the Data View (see below).
This method may be sufficient for relatively small amounts of data; for
large to very large amounts of data, other approaches should be chosen.

Check option: Metadata:

Based on project data (e.g. return logs, inbox lists, tally sheets, electronic logs, etc.), check how many cases (rows) the dataset should at least contain. If, for example, the actual number of cases (rows) is lower, the checked dataset is probably not complete.

Check option: Counter (record count):

The number of cases (syn.: rows, observations, records, questionnaires) can be controlled by creating a systematically incrementing counting variable (a so-called counter, see 10.1), which increases by one unit for each newly added questionnaire. A record count is the total number of data rows.

Assign a counter (increment) per data row:
compute ID=$CASENUM.
exe.

COMPUTE assigns a unique number to each data row. The SPSS system variable $CASENUM counts the data rows by assigning each row a value increasing by 1, starting from 1 and ending at the last row (for details cf. 10.1). Note that $CASENUM may also assign a value to completely empty data rows.
An ID can be checked by using the row number on the left in the Data
View.
The screenshot shows the row number on the left, and the counter ID next to it. Each entry in ID corresponds to its row number.
However, it is recommended not to
check the beginning of a dataset, but
its end (see below).
The dataset used for this
demonstration is “mouse
[Link]”. $CASENUM was used
to add the ID column.

If a counter starts at 1 and has no gaps up to its maximum, only the number of the last row needs to be compared with the ID; if both match, the data is complete, i.e. all questionnaires have been added.

Beware: $CASENUM may assign a value to completely empty data rows (cf. highlighted row number 208 and higher). Empty rows may appear after loading third-party files into SPSS.

If row number and ID do not match (cf. ID 210 and higher), there may be several causes to be checked: (1) There may be gaps in the data, e.g. IDs 207 etc. exist but are missing. (2) IDs 207 etc. do not exist and are not missing (no data gaps). (3) Both scenarios may happen at the same time.
If a counting variable does not start at 1, the gaps must be checked especially carefully. If the number of (complete) questionnaires corresponds to the maximum of the counter (record count), all questionnaires have been entered completely.
For row-wise updates, the number of rows can be multiplied by the
number of updates to arrive at the total number of rows that the master
dataset would have to contain. If the update datasets contain control
variables, the analysis of these counters can be used to check the
completeness of the row-wise updates (cf. 10.1 on group-wise counting of
rows).
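The group-wise counting referred to here can be sketched, for example, with AGGREGATE. The batch variable BATCH and the expected batch size are assumptions for illustration, not part of the original example:

* Hedged sketch: count the data rows per update batch.
* BATCH is a hypothetical control variable identifying the update each row came from.
aggregate outfile=* mode=addvariables
 /break=BATCH
 /N_ROWS=n.
* N_ROWS now holds the row count per batch; any batch whose N_ROWS
* deviates from the expected batch size points to missing or surplus rows.

A FREQUENCIES of N_ROWS, or a comparison with the documented batch sizes, then shows at a glance whether all updates arrived completely.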

Check option: Check sums (batch total, hash total):

A batch total is the sum of a selected numeric field, used as a control to ensure that the data are complete. Choose any numeric variable (ideally without any missings), e.g. an account or product number, and calculate an aggregated total. Compare this so-called batch total with a value calculated on the same data, but from a different source, medium or location. If the numbers match exactly, the number of rows should match as well; if the numbers differ, data are missing or surplus rows are present. A hash total can be calculated over several numeric fields. For comparisons for completeness or changes, this value is compared with the original; mismatches indicate data differences.
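A batch total of this kind can be sketched in SPSS, for example, with DESCRIPTIVES. The variable ACCT_NO is a hypothetical account number used only for illustration:

* Hedged sketch: calculate a batch total over a numeric field.
descriptives variables=ACCT_NO
 /statistics=sum.
* Compare the reported sum with the batch total calculated at the source
* (e.g. before the upload); a mismatch indicates missing or surplus rows.

The same idea extends to a hash total by summing several numeric fields and comparing the results field by field.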

Check option: Syntax:

Also check whether the output (log) reports any errors. For example, SAVE or XSAVE should report the number of stored cases after a write operation; if this feedback is missing, the dataset may not have been saved completely.

Limits of verifiability:

Here, too, a completeness assessment stands and falls with the quality (completeness) of the documentation (metadata). If a data documentation does not allow a statement about the TARGET completeness, then a review cannot derive a deviating ACTUAL (over)completeness or incompleteness.
The analysis of data from third-party data storages and surveys can pose particular problems. Because control is limited, experience shows that it is advisable to make precisely defined, complete and unambiguous agreements, including control options regarding completeness, bias, etc. If, for example, you commission a third party to collect data online, it is possible that the third party will provide only complete data rows, but not e.g. the data of the dropouts or data that is completely missing apart from the ID. If you only have the "complete" data rows available, you will not be able to easily estimate the quantitative and qualitative extent of completeness. You simply do not know how much is missing (possibly detectable via an incomplete ID). You also do not know what is missing, or for what reasons. The interpretability of the data is closely linked to its completeness; without recourse to the missing data, it cannot reliably be ruled out that the existing data carries a bias whose nature and extent cannot be reconstructed, precisely because the missings were not made available. Completeness of the missings themselves is therefore also meaningful, as the section on handling missings will show.

3.3 Control options at the level of the number of variables (columns)

The completeness of variables (columns; in data mining also: fields) is usually determined by the number of variables. It is usually not easily verifiable from within a data storage alone; normally, external information must be accessible. The number of variables can be checked in different ways depending on the application area (see 3.1):

Check option: Visual inspection:

Check the completeness of all columns visually using the variable view.
SPSS numbers the existing entries on the left side of the variable view.
This method may be sufficient for relatively small amounts of data; for
large to very large amounts of data, other approaches should be chosen.

Check option: Metadata:

A comparison with a data documentation (e.g. coding lists, project plan, metadata, etc.) is used to check how many columns (variables) the dataset should at least contain. If, for example, the number of columns (variables) is lower, the checked dataset is probably not complete.

Check option: Variables:

If the contained variables are named according to a certain systematics (e.g. ITEM_1, ITEM_3, etc.), then breaks in the systematics are an indication that variables are missing (e.g. ITEM_2). Such breaks can also be detected by checking syntax programs. See also the SOEP example about names of datasets and variables in 3.1.

Check option: Check sums (hash total):

If a hash total was calculated using two or more numeric fields, then
mismatches with the same data (but in a different source, medium or
location) may indicate data differences caused by missing variables.

Check option: Syntax:

Also check if the output (log) reports any errors.

Check option: Comparison dataset:

The present dataset is checked for completeness (number of variables) against other, already existing datasets, or by creating a test dataset as a mask for questionnaire input. If the test dataset cannot accommodate all the information from the questionnaire, the dataset is not complete, and the missing variables must be added. This approach is especially recommended for questionnaires that show slight variations (which should be avoided in practice from the outset), and for the timely detection of input problems in the case of multiple responses.

Limits of verifiability:

A statement about the completeness of a data storage stands and falls with the quality (completeness) of its documentation (metadata). For example, the comparison with a complete test dataset is more complex than the comparison with a coding plan, but it is much safer, because the comparison with a coding plan depends on a completeness that is itself yet to be checked. The "Data Audit" node in IBM SPSS Modeler, for example, calculates as a quality measure the completeness of variables (fields) in percent. However, IBM SPSS Modeler does not check whether the number of fields is complete, but the completeness of the entries, i.e. the completeness of the values.

3.4 Control options at the level of values and missings

The completeness of the existing values or missings can be determined using simple test programs. The completeness of values has to be checked in two ways: first, whether the series of values is complete (i.e. without missings), and second, whether the data are complete (i.e. no specific values are missing) (see 10.2). The "Data Audit" node in IBM SPSS Modeler 18, for example, calculates as a quality measure the completeness of variables (fields) in percent, as well as the number of user- and system-defined missing values. The quality of a variable is thus derived from the completeness of the entries. This node can also be used to replace missing values with values calculated by various methods.
For example, the following check programs determine the sum of the system-defined missings (SYSMIS) and, below that, the sum of the system- and user-defined missings (MISSING) (cf. Schendera, 2005). If the number of existing values corresponds to the number of variables, then the data is complete (cf. 10.3). Another approach would be to form sums, e.g. using the function SUM. These sums can be compared with each other or with external reference values (e.g. sums of other data storages). Matching sums then indicate that the valid values within the selected variables included in the summation are equal to each other and therefore complete, provided it can be assumed that the reference values used are themselves valid. Differences between sums can be attributed to variables missing from the calculation and/or to missing values. The problem of missing values, and a frequent error in the calculation of completeness indices, is presented in chapter 6 on Missings.
count SYSMISUM=ITEM1 ITEM2 ITEM3 (SYSMIS).
exe.
count MISSUM =ITEM1 ITEM2 ITEM3 (MISSING).
exe.
compute SYSMISUM=SYSMIS(ITEM1) + SYSMIS(ITEM2)
+ SYSMIS(ITEM3).
exe.
compute MISSUM=MISSING(ITEM1) + MISSING(ITEM2)
+ MISSING(ITEM3).
exe.
A prerequisite for this approach is that missing values have already been assigned a user-defined code during data entry. This ensures that at the time of the analysis a distinction can be made between missings that are known, i.e. controlled, to be missing, and those missings which escaped control during data input and have to be checked again. This procedure could be extended to an index for a quality criterion in general (for a completeness index see 6.1.2) or for plausibility, i.e. data quality in particular (see Chapters 8 and 9). Here, the contents of different variables are first checked for plausibility. If a value is correct, the index variable of the first checked variable receives the value 1, otherwise the value 0. Once all variables are checked, all index variables are added up and divided by their number. The resulting quotient indicates the proportion of correct values per case (row). Other calculation variants are possible, e.g. 1 minus the number of faulty entries divided by their number, etc. (see also Batini & Scannapieco, 2006, chapter 2). This procedure applies to strings only with restrictions, e.g. for strings or missings up to 8 characters in length.
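The index just described can be sketched as follows; the variables AGE and GENDER and their valid ranges are assumptions for illustration only:

* Hedged sketch: per-case data quality index from plausibility checks.
compute CHK_AGE=range(AGE, 18, 99).
* 1 if AGE lies within the assumed plausible range, otherwise 0.
compute CHK_GEN=any(GENDER, 1, 2).
* 1 if GENDER has one of the expected codes, otherwise 0.
compute DQ_INDEX=mean(CHK_AGE, CHK_GEN).
* Proportion of passed checks per case (row): 1 = all checks passed.
exe.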
Checking the completeness of expected values is best explained using a categorical variable; in principle this procedure is also valid for variables of higher scaling. For a categorical variable, the available values are requested by means of simple frequency tables. Most categorical variables have a manageable number of values and can be requested and checked list-wise. By means of these frequency tables it is checked whether all expected values are present. If, for example, when checking the ALPHABET variable, only the values "A" and "C" are displayed although "B" should be present, then there are two usual possibilities: the "B" values are missing completely, or they have been assigned to another category by mistake, e.g. "A" or "C". Differences in the frequencies often quickly indicate which possibility is present (a further step would be to check for redundancy).
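Such a check can be sketched with FREQUENCIES and a temporary selection; the variable ALPHABET and its expected values are taken from the example above:

* Hedged sketch: request all occurring values of a categorical variable.
frequencies variables=ALPHABET.
* List only the cases whose values deviate from the expected set.
temporary.
select if not any(ALPHABET, 'A', 'B', 'C').
list.

TEMPORARY restricts the SELECT IF to the immediately following procedure, so the working dataset itself remains unchanged.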
However, all of the presented possibilities for checking completeness presuppose that the data has been recorded at all and is available in some data storage. Often enough, however, it turns out that the data has not yet been collected (recorded, measured), entered or stored. The basic test is therefore whether the available data is complete at all in the sense of being accessible. Once the data is complete, not only large, complex datasets should then be checked for duplicate data.
A special form of completeness for values can be necessary if there are several (duplicate) data rows for one ID which are, however, complete to different degrees. In the case of duplicate data rows, the following approach (see also the approach using the LAG function under 10.3.6) shows how to identify and sort gaps (missings) and fill them with the respective content of the first (complete) data row. This approach only works for numeric values; strings can be replaced uniformly with the LAG function introduced in 10.3.6. In the following example, data rows for one and the same ID (e.g. "111") are distributed over several rows. Note the missing postal code information. The goal is to fill in the postal code in the rows with gaps. If the data is sorted by the ID and at the same time by the variable to be filled in, the gaps are filled using the existing (and, due to the sorting, preceding) value. Variant (a) shows how easily gaps can be closed with uniform values. Variant (b) shows a common error, namely what this program does not do.
data list list (",")
/ID (F3.0) PLZ (F5.0) ORT (A10) PRODNR (F5.0).
begin data
111,, Hamburg, 655
111, 20245, Hamburg, 541
111,, , Hamburg, 41
111,, Hamburg, 652
222, 60598, Frankfurt, 3412
222,, Frankfurt, 3221
333, 81669, Munich, 65464
333,, Munich, 64623
333,, Munich, 65435
end data.
list.
* (Example a) Intentional uniform filling: Filling of PLZ *.
sort case ID(a) PLZ(d) .
add files file=*
/ by ID / first=start1 .
if start1 eq 1 #tmpPLZ=PLZ .
if start1 eq 0 PLZ=#tmpPLZ .
exe .
list.
ID PLZ ORT PRODNR start1
111 20245 Hamburg 541 1
111 20245 Hamburg 655 0
111 20245 . 0
111 20245 Hamburg 652 0
222 60598 Frankfurt 3412 1
222 60598 Frankfurt 3221 0
333 81669 Munich 65464 1
333 81669 Munich 64623 0
333 81669 Munich 65435 0
Number of cases read: 9 Number of cases listed: 9
Using variant (a), the postal code gaps per ID were correctly filled with
uniform values. Per ID (e.g. "111"), complete and uniform postal codes are
available (e.g. 20245).
* (Example b) Potentially incorrect standardization: Filling of PRODNR
*.
sort case ID(a) PRODNR(d) .
add files file=*
/ by ID / first=start2 .
if start2 eq 1 #tmpPNR=PRODNR .
if start2 eq 0 PRODNR=#tmpPNR .
exe .
list.

ID PLZ ORT PRODNR start1 start2


111 20245 Hamburg 655 0 1
111 20245 Hamburg 655 0 0
111 20245 Hamburg 655 1 0
111 20245 655 0 0
222 60598 Frankfurt 3412 1 1
222 60598 Frankfurt 3412 0 0
333 81669 Munich 65464 1 1
333 81669 Munich 65464 0 0
333 81669 Munich 65464 0 0
Number of cases read: 9 Number of cases listed: 9
In contrast to the first example, variant (b) overwrites the differing PRODNR values with uniform values (after the replacement, e.g. 655); this is an error, however, if the value variations (before the replacement: 541, 655, 652) were not caused by typing errors but were correct entries in PRODNR. The presented variant for filling up missings across multiple data rows is therefore well suited for correcting such typing errors in an uncomplicated way and standardizing the entries at the same time, provided it is ensured that no value variation may occur per ID and corresponding PRODNR. This approach to ensuring uniformity is of a rather general nature and can also be used for missings that have to be replaced uniformly when dealing with duplicate data rows, as well as for the uniformity of entries.

4 Uniformity - Standardizing
numbers, time units and strings
Uniformity must be ensured in data analysis at several levels: At the level of
datasets, of numerical variables, labels for variables and values, and
especially of unique information (usually in string form) and date variables.
"Uniformity" is one of the basic criteria of the DQ Pyramid, along with
"completeness", "duplicates", and "missings", which SPSS can be used to
check. The examination of all other criteria is based on these. However,
uniformity requires at least completeness.
At the level of a single dataset, uniformity means that the type of a variable (Variable View) matches the content of its column (Data View). This means that variables which contain exclusively numerical values (e.g. age) are not defined as string, but as "numeric". This is not a matter of course, especially with data from third parties or after data migrations. At the level of several datasets which are, for example, to be appended one after the other (e.g. updates), uniformity means that name, type (e.g. string/numeric), width, format and label of the variables are absolutely identical in all datasets. A variable that, for example, is called ITEM237 in a master dataset, is numeric, has width 12, format 8.2, and the label "Test variable" must also be called ITEM237 in an update dataset, be numeric, have width 12, format 8.2, and the label "Test variable".
At the level of numerical variables, uniformity means that the variables in the update dataset must have the same codes and value labels as in the master dataset. An ITEM237 variable must have the same codes (e.g. 0-1-2) and value labels in the update dataset as in the master dataset. For string variables, uniformity is limited to name, type, label and length.
At the level of labels for variable names and values, uniformity usually means that they are spelled correctly. This does not seem to be self-evident in practice. Not so long ago, I was given a dataset in which more than every tenth variable label was orthographically incorrect, including about ten (!) variations of the word "Inanspruchnahme" (transl.: utilization; see also the comments on the professional presentation of scientific work under 4.14.1).
Particularly important, but sometimes also quite complicated, is the
uniformity of values (especially strings, but not only), especially in unique
information in key variables ("keys"), such as names of persons, places and
medicines, or even dates, telephone or also personal numbers. Since names
(e.g. people, products or places) are often used as key variables, special care
is required when entering and checking them. The consequences of non-
uniformly written strings can be serious.

SPSS is not able to automatically treat upper- and lowercase strings ("SPSS", "spss"), i.e. variants of the same string, as semantically identical (uniform) in an analysis. If no precautions are taken for standardization, upper- and lowercase variants are evaluated as different although they actually belong together.
If, for example, the "same" strings are written in upper and lower case, they are not sorted uniformly via SORT CASES. Instead, uppercase strings (e.g. "SPSS") are sorted before lowercase strings (e.g. "spss"), which can cause massive problems at the latest when merging datasets, if these strings are e.g. key variables.
Automatic recoding using AUTORECODE, for example, also requires a uniform spelling. Upper- and lowercase strings are not recoded uniformly, but differently. The result is a variable that contains incorrect recodings. Applying an incorrectly working AUTORECODE recoding scheme to other variables can also have devastating consequences.
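One hedged precaution against the AUTORECODE pitfall is to standardize the case first and recode the standardized copy; MY_STRING and MY_CODE are hypothetical names used only for illustration:

* Hedged sketch: unify case before automatic recoding.
string TMP (A20).
compute TMP=upcase(MY_STRING).
exe.
autorecode variables=TMP /into MY_CODE.
* "SPSS" and "spss" now receive the same code, since both became "SPSS" in TMP.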
Common sources of variation in strings are typing errors, umlauts, the German sharp S ("ß"), upper and lower case, falling for name stereotypes, hyphenated terms, blanks, special and accented characters, foreign languages, or the change of names, e.g. through marriage.
For all these sources of variation (and others as well), a guideline (standard) should be drafted from the very beginning, especially if it is clear from the outset that over a longer period of time several (changing) persons will be entrusted with entering or documenting the data. The easiest way is to write the names in the name field starting at the very left (!), correctly (!), in uppercase, without umlauts (i.e. UE instead of Ü) and without any other special characters, blanks or spacing, e.g. "MUELLERTHURGAU" instead of "Müller-Thurgau". These uniform strings, in the sense of keys, can then be assigned the desired notation with all necessary linguistic or technical nuances via a macro or a control dataset.
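A guideline of this kind can be sketched with string functions; NAME and KEY are hypothetical variable names, and the REPLACE function requires SPSS version 14 or later (cf. 4.3):

* Hedged sketch: derive a uniform key from a name string.
string KEY (A40).
compute KEY=upcase(NAME).
compute KEY=replace(KEY, "Ü", "UE").
compute KEY=replace(KEY, "-", "").
compute KEY=replace(KEY, " ", "").
exe.
* "Müller-Thurgau" becomes "MUELLERTHURGAU"; further umlauts ("Ä", "Ö")
* and special characters would be handled by analogous REPLACE steps.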
Labels of variable names or values are not data in the true sense, but represent data and their quality to the outside world. Often enough, however, inconsistent or misspelled labels are only detected in outputs or even during a presentation. Such errors on the surface often lead to the sometimes not unjustified conclusion: if the labels on the surface of the data have been edited sloppily, the quality of the data behind them may be comparably poor. Such apparently only cosmetic impairments are particularly embarrassing when names of clients, products or services, i.e. the primary variables per se, are affected. Since it is unproductive to have correct data values but to assign non-uniform or even misspelled labels to them, it is useful to use a spell checker to obtain orthographically correct and uniform labels for variable names and values; this can be done by loading the corresponding syntax program, or a saved document with the data file information, into a spell checker program. The following approaches can also help.
Uniformity in programs is not directly related to data quality, but under
certain circumstances to the criterion “efficiency” however. If, for example,
you refer within an SPSS program (possibly even in a macro) to US
American data inconsistently with "US", "USA", or "U.S.A.", the readability
of the program in general and the writing of macros in particular is made
unnecessarily complicated. Therefore, use uniform terms resp. codes in
programs as far as possible.
In the following, several possibilities for the standardization of data are
presented. The approaches are based on different procedures because they are
tailored to different properties of the source data to be standardized, e.g.
string vs. numeric vs. mixed, umlauts vs. ASCII character sets, upper vs.
lower case, replacement of single characters vs. replacement of several
characters (strings), single vs. several variables, etc. Some corrections can
easily be made using string functions, but unfortunately not all
standardizations are as uncomplicated as with mixed character strings (e.g.
telephone numbers or dates).
Finally, possibilities for the representation of companies, institutions or even
projects ('Corporate Design', 4.14.1.) and the technical-methodical uniformity
of data, analyses etc. ('Technical Design', 4.14.2.) are presented.

4.1 A first simple example: Inconsistent data
The following example is taken from my rather early analysis activities. During the cleaning of a medical database, at least two variants were found as entries for the result "positive": the symbol "+" and at the same time the character string "positive" (compare the values in the variable MY_RESULT in the example syntax below). Experience has shown that causes for such inconsistent entries are e.g. missing guidelines, several (changing) persons, illegible documents, or time pressure, which in the moment of data entry tempts the data entry clerks to use the shortest possible replacement character for a longer formulation, e.g. the "+" symbol instead of the formulation "positive". On the contrary, this apparent (individual) "abbreviation" often causes a considerable amount of rework and is therefore more labor-intensive than labor-saving. Experience shows that proactive training, as a preventive measure, would significantly reduce the extent of errors.
If a guideline for data entry (if any) had taken possible time pressure into account from the outset, perhaps only "+" would have been used as the code for a positive result. The phrase "positive" would then not have appeared at all. The same applies to the formulation or symbol for "negative". At this point, however, no recommendation for the string or the symbol variant can or should be given. String variants are more complex and more susceptible to typing errors, but are resistant to confusion, i.e. even a misspelled variant such as "psosiive" can still be interpreted as "positive". Small character variation, no (clinical) effect. Symbol variants are more efficient; however, entries due to keyboard confusion lead to semantically radically different meanings. Instead of a "+", confusions such as "++" or even "-" can mean something completely different from a clinical point of view. Small character variation, large (clinical) effect. In addition, one-digit codes, such as the number "0", often run the risk of being typed as the letter "O" in strings.
data list
/ID 1-3 MY_RESULT 5-20 (A) .
begin data
001 positive
002 +
003 negative
004 -
005 positive
006 psosiive
007 +
end data.
exe.
Statistical programs such as SPSS are not "smart" enough by themselves to automatically understand that different codes for the same information are equivalent. It is therefore necessary to relieve SPSS of the interpretation work and to standardize the different codes so that SPSS can process them as uniform. The following program shows how the input variants "positive", "+", etc. can be standardized. Since analysts can never predict the "orthographic creativity" of the data entry staff, and thus cannot anticipate every input variant with 100% certainty, the standardization program presented here includes a useful "safeguard" in the form of a preceding COMPUTE section.
compute REF_CODE=999.
exe.
if MY_RESULT="positive" REF_CODE=1.
* Explanation: Via IF the phrase "positive" becomes the numeric code "1".*
exe.
if MY_RESULT="+" REF_CODE=1.
* Via IF the symbol "+" becomes the numeric code "1".*.
exe.
if MY_RESULT="negative" REF_CODE=0.
* Via IF the formulation "negative" becomes the numerical code "0". *.
exe.
if MY_RESULT="-" REF_CODE=0.
* Via IF the symbol "-" becomes the numeric code "0". *.
exe.
value labels
/REF_CODE
1 "positive"
0 "negative".
exe.
frequencies variables=REF_CODE .
list variables=ID MY_RESULT REF_CODE .
Following COMPUTE, the actual standardization program consists of four IF commands (which could have been programmed more elegantly, but here the focus is first on the transparency of the procedure and only then on the aesthetics of programming). The standardization program begins with a "safeguard" in the form of COMPUTE REF_CODE=999: COMPUTE creates a numeric variable REF_CODE that contains only 999 values. For each string "positive" in the variable MY_RESULT, the first IF assigns a 1 and thereby overwrites the existing 999 values in REF_CODE. For the symbol "+", the second IF overwrites the 999 in REF_CODE with ones. The other IFs overwrite entries in REF_CODE with zeros. The "safeguard" is thus: the replacement of strings was successful if REF_CODE no longer contains any 999. The final FREQUENCIES and LIST sections perform a check for this purpose. Before that, the assigned 0 and 1 are provided with the (hopefully correct) labels for "positive" and "negative".
REF_CODE

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid  negative       2         28,6        28,6              28,6
       positive       4         57,1        57,1              85,7
       999,00         1         14,3        14,3             100,0
       Total          7        100,0       100,0

The FREQUENCIES step shows that the 999 code occurs somewhere (but
not exactly where). The LIST step then shows that the 999 code occurs at ID
6 because in MY_RESULT not only the phrase "positive" occurs but also
another variant, namely "psosiive". LIST can be a bit impractical for very
large amounts of data, and a user may need to use a (preceding) SELECT IF
as an alternative.
ID   MY_RESULT   REF_CODE
 1   positive        1,00
 2   +               1,00
 3   negative         ,00
 4   -                ,00
 5   positive        1,00
 6   psosiive      999,00
 7   +               1,00
Number of cases read: 7 Number of cases listed: 7
For the variant "psosiive", only one additional IF step analogous to the standardization program would have to be created to make even the last 999 code disappear. If there are many spelling or error variants, one of the further variants could be used, see e.g. 4.3, 4.4, 4.7 or 4.9.

4.2 Identifying Inconsistency: Check for Variations in Strings
Variations in the sense of unwanted non-uniformity are not always as obvious as in example 4.1. Normally, users are forced to check the uniformity of the character strings of each relevant variable. The approach is theoretically quite uncomplicated: you define a TARGET as a template (e.g. ANSWER="yes") and check whether there are any deviations in the sense of an (unwanted) ACTUAL (e.g. ANSWER="yeah"). Since there are certainly more verification variants than templates, some of the more straightforward but very effective approaches (SELECT IF, INDEX, VALUE LABELS) to identifying deviations (inconsistency) are summarized below with more or less brief comments. The correction of string variations is the subject of the following chapters. The following syntax examples are given without SPSS output.
Deviations in one string variable
This SELECT IF approach outputs all cases that are coded neither as "m" nor as "w" in GENDER.
get file="C:\YOUR_DATA.SAV".
select if not any (GENDER,'m','w') .
exe.
This SELECT IF approach outputs all cases that have encodings in STATUSG other than those specified.
get file="C:\YOUR_DATA.SAV".
select if not any (statusg,'ISEF','JIOJ','KERK','LPMS','MEDE',
'OJFW','OWJG','P2PC','PMKO','STST','TEKL',
'UHNS','WERE').
exe.
A special feature for detecting deviations in string variables (INDEX function)
This INDEX approach flags all cases whose STATUSG contains any of the characters in the given list (without blanks). For example, if digits in strings were undesirable, the INDEX function would detect the "2" in the string "P2PC". Thus, if the variable ERROR contains any number not equal to 0, this is an indication that the examined string (e.g. STATUSG) contains at least one of the specified unwanted characters.
compute ERROR=index(STATUSG,'0123456789',1) .
exe.
Deviations with several string variables (multivariate)
Of course it is also possible to check several variables at once. This SELECT
IF approach with OR outputs all cases which have different encodings in
GENDER or in STATUSG than the specified codes.
get file="C:\YOUR_DATA.SAV".
select if not any (GENDER,'m','w') or
not any (statusg,'ISEF','JIOJ','KERK','LPMS','MEDE',
'OJFW','OWJG','P2PC','PMKO','STST','TEKL',
'UHNS','WERE').
exe.
Deviations with several string variables incl. missings (multivariate)
In the event that missings should also represent an undesirable variation, this
approach makes it possible to check several variables not only for deviations
from predefined codes, but also for missings. This SELECT IF approach
outputs all cases that may contain missings ("") in addition to deviations in
GENDER or STATUSG.
get file="C:\YOUR_DATA.SAV".
select if not any (GENDER,'m','w', '') or
not any (statusg,'ISEF','JIOJ','KERK','LPMS','MEDE',
'OJFW','OWJG','P2PC','PMKO','STST','TEKL','UHNS',
'WERE', '').
exe.
Experience shows that multivariate OR approaches are often too coarse-
meshed, since they filter out every case that has a deviating value anywhere;
it is often not easy to keep an overview of the output. Multivariate AND
approaches, on the other hand, are often too strict, since they only filter out
cases that have a deviating value in all specified variables. The following
approach might be an alternative.
A completely different approach: Formatting (VALUE LABELS using
the example of strings)
For the sake of completeness, an approach based on assigning labels to the
expected correct values will be presented. If, for example, the expected
values are all assigned the label "-valid-", deviations from the expected
correct spellings are quickly noticed in a subsequent frequency table.
value labels
/STATUSG
"ISEF" "-valid-"
"JIOJ" "-valid-"
"KERK" "-valid-". etc.
All presented programs are case-sensitive and assume an absolutely correct
spelling of the templates. If the templates can be uniformly capitalized, the
string function UPCASE can be used, for example (see 4.4.); if the templates
cannot be standardized (e.g., because a capitalized string represents
something different than a lowercase one), both upper- and lowercase
templates can be included in the ANY value lists.
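Building on the note above, the case-sensitivity problem can also be sidestepped by wrapping the checked variable in UPCASE and listing only uppercase templates. The following sketch reuses the file name and GENDER codes from the examples above; it is an illustration, not part of the original programs.

```spss
* Case-insensitive check: UPCASE standardizes the tested values on the fly. *.
get file="C:\YOUR_DATA.SAV".
select if not any (upcase(GENDER),'M','W') .
exe.
```

This way, "m", "M", "w" and "W" are all accepted, while every other spelling is still flagged.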

4.3 Standardizing Strings 1: REPLACE


The REPLACE function (available since version 14) is especially useful for
the systematic searching and replacing of strings, as long as there are not too
many spelling and/or error variants in the strings. If there are many string
variants, or if you use an SPSS version prior to 14, the solutions under 4.4.
will help.
REPLACE proceeds in only two steps. In the first step, a string variable must
be created via STRING in which the result of the REPLACE function can be
stored. Make sure that the string variable defined with STRING is long
enough to accommodate the changes. This is especially important if
REPLACE can cause an extension of the original variable (see, among
others, the following examples for the created string variables TEST_RP1 to
TEST_RP6).
In a second step, four parameters are defined. First, the source variable is
defined. The source variable is the variable where changes (e.g.
standardizations) are to be made. The source variable must be a string
variable. In the following examples the variable TEST_WORD is the source
variable.
Next, the character string to be searched for is specified; then, the character
string that replaces the found character string. Please note that REPLACE is
case-sensitive when searching for and inserting character strings. REPLACE
is also suitable for searching and replacing punctuation marks (see also
4.12.), as well as missings (if strings, see below). Finally, a fourth parameter
(integer value) can be set to define how often the replacement process should
be executed. If no integer is specified, every string found will be replaced.
Example: REPLACE approach (Before: “HELLO”, After: “HEYLO”)
string TEST_RP1 (A8).
compute TEST_RP1=replace(TEST_WORD,"L","Y",1).
exe.
REPLACE replaces a found capital "L" with a capital "Y". The
replacement process takes place only once. If further "L"s are found in a
string of the source variable TEST_WORD, they are not replaced by a "Y".
Before: “HEyLO”, After: “HELLO”
string TEST_RP2 (A8).
compute TEST_RP2=replace(TEST_WORD,"y","L").
exe.
REPLACE replaces a found lowercase "y" with a capital "L". The
replacement process is not restricted, so it happens as often as "y" is found.
If further "y"s are found in a string of the source variable TEST_WORD,
they would also be replaced by an "L".
Before: “HEL:LO”, After: “HEL-LO”
string TEST_RP3 (A8).
compute TEST_RP3=replace(TEST_WORD,":","-").
exe.
REPLACE also works for searching and replacing punctuation marks.
REPLACE replaces a found ":" with a "-". The replacement process is not
restricted, so it will occur as often as ":" is found (see also 4.12.).
Before: “ ”, After: “_BLANK!_”
string TEST_RP4 (A8).
compute TEST_RP4=replace(TEST_WORD," ","_BLANK!_").
exe.
REPLACE also works for searching and replacing missings with strings.
REPLACE replaces a found " " with "_BLANK!_". The replacement
process is not restricted, so it will take place as often as " " is found (as
long as the storage variable is long enough).
Before: “ 12345678 ”, After: “ --- ”
string TEST_RP5 (A30).
compute TEST_RP5=replace(TEST_S30,"12345678","---", 3).
exe.
REPLACE also works for searching and replacing in strings longer than 8
characters. For example, the source variable TEST_S30 is 30 characters
long. The replacement process should take place three times. The above
example replaces an 8-character search string "12345678" with a 3-
character replacement string "---". The entries in the source variable are
therefore shortened, not the length of the source variable itself.
Before: “ 1234567890 ”, After: “12345678901234”
string TEST_RP6 (A40).
compute TEST_RP6=replace(TEST_S30,"1234567890","12345678901234", 2).
exe.
REPLACE also works for searching and replacing with strings longer than
8 characters. In the example above, a 10-character search string
"1234567890" is replaced by a 14-character string. The entries of the
source variable are thereby extended. The storage variable (cf. string
TEST_RP6) must therefore be sufficiently large to accommodate the
extension of the source variable. The replacement process should take
place twice.
The SPSS 15 and 14 introductions to string functions ("Universals") are not
quite correct: the source variable must not be specified in quotation marks in
the REPLACE function. REPLACE is quite convenient to use and is
therefore very suitable for standardizing different string variants.

4.4 Standardizing Strings 2: UPCASE, LTRIM, DO IF, IF, INDEX and SUBSTR


The following two examples standardize strings when there are many
spelling or error variants, especially in names.
The first standardization is done in three steps. The strings of the WINE
variable are first capitalized uniformly using the UPCASE function. The
LTRIM function is then used to move the strings uniformly to the left.
Finally, the INDEX function is used to search for a consistently occurring
text string, e.g., "SILVA" in "SILVANER" or "SILVAHNER". The constant
position of the text string is stored as a code in a numeric variable and can be
viewed in the dataset. Several IFs are used to assign uniform notations to
these codes. The functionality of this program is independent of the order in
which the respective strings are standardized; if you changed the order of the
program steps, the program would still work without errors.
data list
/ID 1-3 WINE_0 5-30 (A) .
begin data
001 Müller-Thurgau
002 Müller-Thurgau
003 Mueller-Thurgau
004 Müller Thrugau
005 MÜLLER-THURGAU
006 Riesling
007 RIESLING
008 Silvaner
009 SILVANER
010 Silvahner
end data.
exe.
string WINE_1 (A18) .
compute WINE_1=upcase(WINE_0).
* Explanation: All strings are capitalized. *.
exe.
string WINE_2 (A18) .
compute WINE_2=ltrim(WINE_1).
* All strings are moved to the left. *.
exe.
compute INDEXS=index(WINE_2,"SILVA").
exe.
compute INDEXT=index(WINE_2,"GAU").
exe.
compute INDEXR=index(WINE_2,"SLING").
exe.
string WINE_3 (A18) .
if INDEXT > 0 WINE_3 = "MUELLERTHURGAU".
* The INDEXT index standardizes "Müller-Thurgau" *.
exe.
if INDEXR > 0 WINE_3 = "RIESLING".
* The INDEXR index standardizes "Riesling" *.
exe.
if INDEXS=1 WINE_3="SILVANER".
* The INDEXS index standardizes "Silvaner" *.
exe.
list variables=WINE_0 WINE_1 WINE_2 WINE_3 .
frequencies variables=WINE_0 WINE_1 WINE_2 WINE_3 .
exe.
WINE_3
                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid  MUELLERTHURGAU          5      50,0            50,0                 50,0
       RIESLING                2      20,0            20,0                 70,0
       SILVANER                3      30,0            30,0                100,0
       Total                  10     100,0           100,0

In the following standardization, the strings of the variable WINE_0 are also
first uniformly capitalized via UPCASE and uniformly moved to the left via
LTRIM. The further procedure differs from the approach presented above.
The INDEX function is used to search the "SILVANER" variants for a
consistently occurring text string (the uniform spelling is assigned to the
position code only at the end of the program).
The following DO IF function assigns the spelling "RIESLING" in WINE_3
to all "RIESLING" variants in WINE_2 (which is seemingly redundant, but
prevents "RIESLING" variants from being overwritten by mistake in the next
step).
Afterwards, in the remaining character strings, i.e. the "MÜLLER..." and the
"SILVANER" variants, the rest of each string is standardized from the
position of the umlaut in "MÜLLER..." onwards (the "SILVANER" variants
are thus called "SUELLERTHURGAU" until the next step).
By means of the position code, however, this incorrect spelling is finally
overwritten with the correct "SILVANER".
data list
/ID 1-3 WINE_0 5-30 (A) .
begin data
001 Müller-Thurgau
002 Müller-Thurgau
003 Mueller-Thurgau
004 Müller Thrugau
005 MÜLLER-THURGAU
006 Riesling
007 RIESLING
008 Silvaner
009 SILVANER
010 Silvahner
end data.
exe.
string WINE_1 (A18) .
compute WINE_1=upcase(WINE_0).
* All strings are capitalized. *.
exe.
string WINE_2 (A18) .
compute WINE_2=ltrim(WINE_1).
* All strings are moved to the left. *.
exe.
compute INDEX=INDEX(WINE_2,"SILVA").
exe.
string WINE_3 (A18) .
compute WINE_3=WINE_2.
exe.
do if WINE_2="RIESLING".
* Explanation: Strings are changed according to conditions. *.
compute WINE_3="RIESLING".
else .
compute substr(WINE_3,2,14)="UELLERTHURGAU".
* Explanation: With the umlaut also the remaining string of "Müller-
Thurgau" is unified. *.
end if.
if INDEX=1 WINE_3="SILVANER".
* The INDEX index standardizes "Silvaner". *.
exe.
list variables=WINE_0 WINE_1 WINE_2 WINE_3 .
frequencies variables=WINE_0 WINE_1 WINE_2 WINE_3 .
exe.
In this approach, the timing of the standardization is decisive; in other words,
the functionality of the program depends on the sequence of the program
steps. This program unifies in the sequence "RIESLING",
"MUELLERTHURGAU" and "SILVANER". If you were to rearrange the
program steps without further adjustments and, for example, unified
"SILVANER" first, the program would no longer work correctly.
WINE_3
                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid  MUELLERTHURGAU          5      50,0            50,0                 50,0
       RIESLING                2      20,0            20,0                 70,0
       SILVANER                3      30,0            30,0                100,0
       Total                  10     100,0           100,0

The change of names (e.g. after marriage) can be checked by first name,
place of residence or date of birth. For the date of birth, however, this
requires that this and other date variables have been checked and found to be
correct.

4.5 Standardizing symbols and special characters


The previous techniques standardized complete strings. The following
examples allow you to standardize special characters easily.
For example, the following program replaces the $ character with the €
character.
data list
/ID 1-3 PRICES 5-15 (A) .
begin data
001 $100000
002 $10000
003 $1000
004 $100
005 $10
end data.
exe.
string PRICES2 (A10) .
compute PRICES2=PRICES. /* Create copy of PRICES. */
exe.
set mxloops=1.
loop if index(PRICES2,"$")>0.
compute substr(PRICES2, index(PRICES2,"$"),1)="€".
end loop.
exe.
list.
The SUBSTR assignment overwrites the $ sign in PRICES2 with a € sign, so
that PRICES2 stores the strings with a € sign instead of the $ sign.
MXLOOPS specifies that LOOP IF is executed a maximum of once. If
multiple $ characters occur, the MXLOOPS value should be adjusted
accordingly.
ID PRICES PRICES2
1 $100000 €100000
2 $10000 €10000
3 $1000 €1000
4 $100 €100
5 $10 €10
Number of cases read: 5 Number of cases listed: 5
In the following example, a symbol should precede sequences of numbers
(source variable PARAGRAF, numeric). The target variable is PARAMIT
(string format).
data list free
/PARAGRAF(F3).
begin data
123
456
789
end data.
string PARAMIT (A4).
compute PARAMIT = concat('$',string(PARAGRAF,F3)).
exe.
list.
PARAGRAF PARAMIT
123 $123
456 $456
789 $789
Number of cases read: 3 Number of cases listed: 3
STRING converts PARAGRAF into a string component, which CONCAT
concatenates with a preceding symbol (here e.g. $) and stores in the variable
PARAMIT.

4.6 Standardizing currencies and units of measurement


Before an analysis is performed, units of measurement (e.g. kilometer, mile,
liter, gallon, pint, etc.) as well as currencies (e.g. Deutsche Mark ("DM"),
Dollar, Euro, etc.) must be standardized. Beforehand, the units actually used
should be carefully checked and documented.
The National Treasury Management Agency explained that the agency
purchased a fund in dollars, but it was not designated or marked on the
spreadsheet record as such. It was recorded as a euro fund. “Subsequently,
when the error was discovered, the dollar exchange rate had moved against
the NTMA and the investment return was down € 750.000” (RTÉ, 2019).
The introduction of the Euro in 2002, for example, meant that some datasets
contained DM and Euro amounts at the same time (the so-called "dual
currency phase"), e.g. in the variable PRICE. The following DO IF program
standardizes the currency in PRICE to Euro and stores the results in the
variable E_PRICE.
data list free
/YEAR PRICE .
begin data
1998 1 1999 10 2000 100 2001 1000 2002 5000 2003 10000 2004 15000
end data.
exe.
compute E_PRICE=PRICE.
exe.
do if (YEAR < 2002).
compute E_PRICE=(PRICE/1.95583).
end if.
exe.
format YEAR PRICE E_PRICE (F7.1).
list.
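A similar DO IF pattern can standardize units of measurement. The following sketch assumes a hypothetical dataset with a string variable UNIT ("mi" or "km") and a numeric variable DISTANCE, and converts all entries to kilometers; the variable names and data are illustrative only.

```spss
data list free
/UNIT (A5) DISTANCE (F8.2).
begin data
mi 100 km 160.93 mi 55 km 88.51
end data.
exe.
compute KM_DIST=DISTANCE.
do if (UNIT='mi').
* 1 mile = 1.609344 kilometers. *.
compute KM_DIST=DISTANCE*1.609344.
end if.
exe.
list.
```

As with the currency example, the original values are preserved and the standardized values are stored in a new variable, so the conversion remains documented and reversible.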

4.7 Standardizing via acronyms


The previous techniques standardized strings without changing them in
semantics or length. In the course of a historically grown business
development, different names of one and the same company can appear in a
database, e.g. "MB", "Daimler-Benz", "Mercedes-Benz" or "Daimler-
Chrysler". Standardization using acronyms (abbreviations formed from the
first letters of several words, e.g. "IBM") limits the probability of reading and
typing errors and leads to better use of processing speed and capacity for
large amounts of data. The LOOP example standardizes a group of variants.
In principle, the macro example can standardize any number of name
variants. For an introduction to macro programming please refer to Schendera
(2005, Chapter 9).
Note: The following example has since (June 2007) been confirmed by
reality. After the sale of Chrysler, the name "Daimler-Chrysler" was no
longer used, but rather the brand name "Daimler AG".
data list
/ID 1-3 COMPANY 5-25 (A) .
begin data
001 Daimler-Benz
002 Mercedes-Benz
003 Daimler-Chrysler
004 MB
005 Benz-Daimler
end data.
exe.
string COMPANY2 (A25) .
compute COMPANY2=COMPANY. /* Copy of COMPANY is created. */
exe.
loop if index(COMPANY2,"Daimler") > 0 .
compute substr(COMPANY2,1,25)="Daimler-Chrysler".
end loop.
loop if index(COMPANY2,"Mercedes") > 0 .
compute substr(COMPANY2,1,25)="Daimler-Chrysler".
end loop.
exe.
list.
This program is based on a sequence of two LOOP IF statements, each
concluded with an END LOOP. If "Daimler" or "Mercedes" is found
anywhere in a string (see ID 005), the string is completely replaced by
"Daimler-Chrysler". The long strings in the example can be replaced by any
abbreviation. "MB" is not replaced because the program does not search for
it.
ID COMPANY COMPANY2
1 Daimler-Benz Daimler-Chrysler
2 Mercedes-Benz Daimler-Chrysler
3 Daimler-Chrysler Daimler-Chrysler
4 MB MB
5 Benz-Daimler Daimler-Chrysler
Number of cases read: 5 Number of cases listed: 5
In contrast to the LOOP example, the following macro example can
standardize any number of name variants. In the FINDEN macro, the
templates and the target acronyms are defined.
data list
/ COMPANY 1-50 (a).
begin data
IBM
Industrial Business Machines
IBM Ltd.
Industrial Business Machines International
MB
Daimler-Benz
DaimlerChrysler
Mercedes-Benz
Daimler-Chrysler
end data.

define FINDEN (!pos !charend('/') / !pos !tokens(1)).


!do !i !in (!1).
string AKRONYM (A20).
if (index(upcase(COMPANY), (!quote(!upcase(!i)))) ne 0)
AKRONYM = (!quote(!2)).
exe.
!doend.
!enddefine.
FINDEN IBM Industrial Business Machines / IBM.
FINDEN MB Daimler Benz Chrysler Mercedes / DC.
list variables = AKRONYM.
Compared to the original dataset, the output of the macro is uniform and
manageable.
AKRONYM
IBM
IBM
IBM
IBM
DC
DC
DC
DC
DC
Number of cases read: 9 Number of cases listed: 9
This macro is also very well suited for the analysis of free-text responses. If
short texts contain only one answer, this macro can scan the strings and
assign uniform codes to them. The prerequisite is that all occurring answer
variants are specified under FINDEN. If the result still contains gaps after a
first run, further FINDEN lines can be added to the program.
In the section on working with multiple datasets you can find a program
version, which also stores the values separately in uniformly filtered datasets.
SPSS issues a warning message when the acronyms are created repeatedly;
this can be ignored.
>Error # 4822 in column 8. Text: AKRONYM
>A variable with this name is already defined.
>This command not executed.

4.8 Standardizing by removing identical strings


After data transfers, it may happen that strings are appended with uniform
characters, e.g. "0" (see the demonstration below).
data list
/ STRING 1-20 (a).
begin data
STRINGA0
STRINGB000
STRINGC0000
STRINGD00000
end data.
For example, the COMPUTE program removes all superfluous zeros.
string STRING2 (A20).
compute STRING2=
ltrim(rtrim(lpad(rtrim(STRING,' '), 20,' '),'0'),' ').
list.
The strings are now without trailing zeros.
STRING STRING2
STRINGA0 STRINGA
STRINGB000 STRINGB
STRINGC0000 STRINGC
STRINGD00000 STRINGD
Number of cases read: 4 Number of cases listed: 4
The REPLACE function (see 4.3.) would be an alternative approach.
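Such a REPLACE alternative could look like the following sketch on the example data. Note that, unlike the LPAD/RTRIM construction above, REPLACE would also remove zeros occurring inside a string; the variable name STRING3 is an assumption, and the approach is only safe when zeros occur exclusively at the end.

```spss
* Alternative: remove all "0" characters via REPLACE. *.
* Only safe if zeros occur exclusively at the end of the strings. *.
string STRING3 (A20).
compute STRING3=replace(STRING,'0','').
exe.
list.
```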

4.9 Standardizing via counting of String Templates


If identical character strings (e.g. "Daimler") occur in strings, you can also
count them. The MY_COUNTER value then only needs to be provided with
a uniform value label.

4.9.1 Standardizing using one template (LOOP)


This program checks for the occurrence of a template ("Daimler") and
distinguishes between upper and lower case.
data list / ID 1-3 CORPORATE 5-20 (A).
begin data
001 Daimler
002 Benz-Daimler
003 Daimler-Chrysler
004 MB-Daimler
005 Daimler-Benz
end data.
exe.
compute MY_COUNTER= 0.
loop #i=1 to 6.
compute MY_COUNTER = MY_COUNTER +
INDEX (SUBSTR(CORPORATE,#i,7),'Daimler').
end loop.
exe.
list.
The created code MY_COUNTER now only needs to be assigned a uniform
value label (output not shown).
ID CORPORATE MY_COUNTER
1 Daimler 1,00
2 Benz-Daimler 1,00
3 Daimler-Chrysler 1,00
4 MB-Daimler 1,00
5 Daimler-Benz 1,00

Number of cases read: 5 Number of cases listed: 5

4.9.2 Standardizing using several templates (macro)


The following macro works for several templates (e.g. "daimler", "benz") and
can ignore different capitalization because of UPCASE.
data list / ID 1-3 CORPORATE 5-20 (A).
begin data
001 Daimler
002 Benz-Daimler
003 Mercedes-Benz
003 Daimler-Chrysler
004 MB-Daimler
005 DAIMLER-Benz
006 Benz
end data.
exe.
define mactext (!pos !charend('/')).
!do !i !in (!1).
if (index(upcase(CORPORATE), (!quote(!upcase(!i)))) ne 0)
MY_COUNTER = 1.
exe.
!doend.
!enddefine.
mactext daimler benz /.
list.
The created code MY_COUNTER now only needs to be assigned a uniform
value label (output not shown).
ID CORPORATE MY_COUNTER
1 Daimler 1,00
2 Benz-Daimler 1,00
3 Mercedes-Benz 1,00
3 Daimler-Chrysler 1,00
4 MB-Daimler 1,00
5 DAIMLER-Benz 1,00
6 Benz 1,00
Number of cases read: 7 Number of cases listed: 7
Please note the logic of this program version: if both search terms occur in
one string (see ID 2), only one (the first) code is assigned.

4.10 Standardizing mixed strings (phone numbers)


The unification of mixed strings, e.g. entries in a telephone directory, can be
quite tedious. The more inconsistent the character strings are, the more
extensive the necessary corrections. The LOOP variant, for example, only
requires that all three fields of the telephone directory (international area
code, area code and extension) have been filled in and that all letters and
special characters may be removed.

4.10.1 Completely filled fields (LOOP-END LOOP)


If telephone fields are inconsistent but completely filled in, the spelling of
telephone numbers can be standardized using the following program.
data list
/ T_NUMBER (A20).
begin data
(491) 234-567
(+4912)3-4567
+49/123-4567
+49+123-4567 PRIV
49 123 4567
49-1234567
+49 123-4567
+491234567
end data.
list.
string TN_OLD (A20).
compute TN_OLD=T_NUMBER.
exe.
loop .
compute #I
= INDEX(T_NUMBER,"+()-
/.\ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß ",1) .
if #I > 0
T_NUMBER=concat(substr(T_NUMBER,1,#I-1),
substr(T_NUMBER,#I+1)).
end loop if #I=0 .
list variables TN_OLD T_NUMBER.
Using "compute TN_OLD" creates a copy of the telephone numbers. The
corrections mainly base on INDEX, SUBSTR and CONCAT. INDEX
removes special characters and strings, such as blanks, letters, umlauts, and
symbols such as "+", "(" and ")" from the strings of the phone numbers; while
doing so, "#I" counts the position of the first occurrence of a removed
character.
SUBSTR defines the individual digit sequences as strings with different
lengths (the position of the removed characters is important for measuring of
the respective length) and CONCAT concatenates them to a complete string.
TN_OLD T_NUMBER
(491) 234-567 491234567
(+4912)3-4567 491234567
+49/123-4567 491234567
+49+123-4567 PRIV 491234567
49 123 4567 491234567
49-1234567 491234567
+49 123-4567 491234567
+491234567 491234567
Number of cases read: 8 Number of cases listed: 8
The LOOP-END LOOP program works regardless of the number of filled
fields. If the program is unchanged, the result for five fields (see TN_OLD)
looks like this:
TN_OLD T_NUMBER
(491) 234-567-8-910 4912345678910
(+4912)3-4567-8-910 4912345678910
+49/123-4567-89-10 4912345678910
+49+123-4567-89-10 P 4912345678910
49 123 4567-89 10 4912345678910
49-1234567-89+10 4912345678910
+49 123-4567-8910 4912345678910
+491234567-8910 4912345678910
Number of cases read: 8 Number of cases listed: 8
Strictly speaking, the LOOP-END LOOP program even works independently
of the completeness of the filled fields (see lines 7 and 8). In the end, it is
only important that the numerical information is in the correct sequence; it is
irrelevant whether it is distributed over three, four or five fields with or
without gaps. For example, the LOOP-END LOOP program would not work
for the following data (see line 5).

4.10.2 Incompletely filled fields (IF)


Standardization is more complicated if fields are neither completely nor
uniformly filled out. In principle, the program must provide a separate
correction for each individual spelling variant. On the data of the following
example, the above LOOP-END LOOP program would make an error in
line 5.
DATA LIST
/ TN_OLD (A14).
BEGIN DATA
(+49) 123-4567
(+49)123-4567
+49/123-4567
123-4567 PRIV
123-4567/9876
123 4567
1234567
+49 123-4567
+491234567
END DATA.
LIST.
STRING T_NUMBER (A14).
if (index(TN_OLD, "(") = 1 and index (TN_OLD, " ") =6)
T_NUMBER = substr (TN_OLD, 7, 8).
if (index(TN_OLD, "(" )= 1 and index (TN_OLD, " ") =14)
T_NUMBER = substr(TN_OLD, 6, 8).
if (index (TN_OLD, "1") =1)
T_NUMBER = substr (TN_OLD, 7,8).
if (index (TN_OLD, "/")=4 and index (TN_OLD, " ") =13)
T_NUMBER = substr (TN_OLD, 5, 8).
if (index (TN_OLD, "-")= 4 and index (TN_OLD, " ") =9)
T_NUMBER = substr (TN_OLD, 1, 8).
if (index (TN_OLD, "-") =4 and index (TN_OLD, "/")=9)
T_NUMBER = substr (TN_OLD, 1, 8).
if (index (TN_OLD, "-") =4 and rindex (TN_OLD, "-")=9)
T_NUMBER = substr (TN_OLD, 1, 8).
if (index (TN_OLD, " ") =4 and length (rtrim (TN_OLD))=8)
T_NUMBER = concat (substr(TN_OLD,1,3), "-",substr (TN_OLD, 5,4)).
if (length (rtrim (TN_OLD))=7)
T_NUMBER = concat (substr(TN_OLD,1,3), "-",substr(TN_OLD, 4,4)).
if (index (TN_OLD, " ")=4 and length (rtrim (TN_OLD)) >= 12)
T_NUMBER = concat (substr(TN_OLD,5,3), "-",substr (TN_OLD, 9,4)).
if (length (rtrim(TN_OLD))=10)
T_NUMBER = concat (substr (TN_OLD,4,3),"-", substr(TN_OLD, 7,4)).
exe.
list.
The corrections are based mainly on INDEX, CONCAT and SUBSTR and
were explained in the previous example.
RTRIM removes trailing blanks from the string, so that LENGTH determines
the length of the actual entry.
TN_OLD T_NUMBER
(+49) 123-4567 123-4567
(+49)123-4567 123-4567
+49/123-4567 123-4567
123-4567 PRIV 123-4567
123-4567/9876 123-4567
123 4567 123-4567
1234567 123-4567
+49 123-4567 123-4567
+491234567 123-4567
Number of cases read: 9 Number of cases listed: 9

4.11 Standardizing time and date specifications


Date and time specifications are particularly prone to problems. While string
variables can still be "read" to a certain extent via their semantics,
standardizing date and time specifications is often a very tedious and
thankless matter (not to mention identifying and correcting other date errors,
e.g. typing errors), although it is absolutely central to time- or date-related
data analyses, because their plausibility is the foundation of the analysis.
In the recent past, a client gave me a patient database for review and, if
necessary, correction and analysis. One of the variables contained the
recorded duration from the time of the accident to the first aid measure. A
closer examination revealed that although the entries were all numerical,
they logged different time units, namely seconds, minutes or even hours.
A "1" could therefore be interpreted as a second, a minute or an hour. A
direct analysis was not possible; a complex standardization of these entries
into uniform minutes was necessary first.
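Such a standardization can only succeed once the unit of each entry is known, e.g. from a separate variable. The following sketch is not the client's actual program; it assumes hypothetical variables DURATION (numeric) and UNIT ('s', 'min', 'h') and converts everything to minutes.

```spss
data list free
/DURATION (F8.2) UNIT (A3).
begin data
90 s 1 h 15 min
end data.
exe.
do if (UNIT='s').
* Seconds are divided by 60. *.
compute MINUTES=DURATION/60.
else if (UNIT='h').
* Hours are multiplied by 60. *.
compute MINUTES=DURATION*60.
else.
* Minutes are taken over unchanged. *.
compute MINUTES=DURATION.
end if.
exe.
list.
```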
While a uniform date format (see 4.11.1.) plays a central role for date
specifications (e.g. "20-DEC-2005"), the logging of uniform time
specifications (e.g. "11:07") is prone to several problems (see 4.11.2.). In
variables that log date and time at the same time (e.g. as "20-DEC-2005
11:07", so-called DATETIME format), such problems may add up under
certain circumstances.

4.11.1 Date specifications: The special role of the date format


All requirements that apply to numeric variables also apply to date variables.
This means, for example: a variable that is called GEBDATUM in a master
dataset, is a date variable, has a certain date format (there are many different
ones), and has the label "date of birth" must also be called GEBDATUM in
an update dataset, be a date variable, have the same date format, and have the
label "date of birth".
Beyond that, date variables can cause special problems:

The date variables are inconsistently in European (e.g. DD.MM.YYYY) or
American format (e.g. MM/DD/YYYY).
The date variables are inconsistent in the number of year digits, e.g.
DD.MM.YY or DD.MM.YYYY.
The date variables are inconsistent in punctuation, e.g.
DD.MM.YYYY or DD-MM-YYYY (see 4.12.3.).
The date variables are inconsistent in the level of detail indicated, e.g.
DD.MM.YYYY, MM.YYYY or YYYY etc.
The following syntax example contains date specifications in a seemingly
uniform format. In fact, the data situation in the import step (after BEGIN
DATA) is not as clear as it appears at first. The final output in American
(MMDDYYYY) and European (DDMMYYYY) format will draw attention to
the problems associated with this.
data list free
/ MIX_DATA (A10).
begin data
1/7/99 12/11/98 1/12/01
10/1/98 12-8-2002
10-7-00 1.12.99
end data .
compute L=index(MIX_DATA,'/.-,',1).
compute R=rindex(MIX_DATA,'/.-,',1).
compute T=number(substr(MIX_DATA,L+1,R-L-1),F2).
compute M=number(substr(MIX_DATA,1,L-1),F2).
compute J=number(substr(MIX_DATA,R+1),F4).
compute DATE_US=date.mdy(M,T,J).
format DATE_US (date).
compute DATE_EU=date.dmy(T,M,J).
format DATE_EU (date).
list
variables= DATE_US DATE_EU.
The central aspects to be considered are: (a) The source format of the data
must be known in order to be reconstructable. Ideally, the source format
should be uniform; inconsistent, i.e. European and American formats at the
same time, are tantamount to a data disaster due to different day and month
specifications. (b) The target format should be defined and matched to the
source format of the date data.
DATE_US DATE_EU
01-JUL-1999 07-JAN-1999
12-NOV-1998 11-DEC-1998
01-DEC-2001 12-JAN-2001
10-JAN-1998 01-OCT-1998
12-AUG-2002 08-DEC-2002
10-JUL-2000 07-OCT-2000
01-DEC-1999 12-JAN-1999
Number of cases read: 7 Number of cases listed: 7
Differently complete date variables are difficult to standardize. As long as it
does not mean a loss of information, date variables in different formats can
be unified to the "lowest common denominator", e.g. full dates and
month/year specifications to a common month/year level.
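Such a reduction to a common month/year level can be sketched with the XDATE and DATE functions. DATEVAR is a hypothetical date variable in an open dataset; this is an illustration, not one of the book's original programs.

```spss
* Sketch: reduce full dates to a common month/year level. *.
compute MONTH=xdate.month(DATEVAR).
compute YEAR=xdate.year(DATEVAR).
compute MOYR=date.moyr(MONTH,YEAR).
format MOYR (moyr8).
exe.
```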
Many studies use two dates (start, end) to record the duration of a campaign,
treatment or even a (hospital) stay. The check in the form of a subtraction of
the start date from the end date must yield a positive value. Negative values
are mainly caused by input errors, which can be identified e.g. by a check
variable ERROR. The error index ERROR only needs to be queried via
FREQUENCIES or LIST (e.g.):
data list
/TIME_1 1-8 (EDATE) TIME_2 10-17 (EDATE) .
begin data
13/08/90 21/10/90
end data.
if TIME_1 > TIME_2 ERROR=1.
exe.
In some analyses, data must be available from exactly the same point in time;
inconsistent date specifications are interpreted as indications of incorrect
surveys. Sometimes, however, it is sufficient if the date specifications lie
within (or even outside) a defined range. For further possibilities of handling
date variables, please refer to the unit on data management (see also
Schendera, 2005).
Finally, a note on standardization with regard to time itself. In distributed
systems, the time stamp with which certain (automatically generated) data
packets are sent to a master dataset may be relevant. In order to guarantee the
uniformity of time here, it should be ensured that the connected data-
delivering workstations have set the identical system time, which is
especially important in hierarchically layered manufacturing lines and
corresponding data flows (e.g. process, MES, ERP according to ISA
standards). Time stamps are introduced in 12.4.
The standardization of the punctuation of date variables in general is
discussed in section 4.12.3.

4.11.2 Time specifications: Three classic errors: A typical example of inconsistent time data


Errors can easily occur when logging time specifications, similar to date
specifications. The following example, based on the logging of
measurements within twenty-four hours, is intended to raise awareness of
three classic errors.
data list free
/ MIX_DATA (A6).
begin data
0h 1:30 12 18 20”15’ 24h
end data .
These six apparently correct time entries contain three errors. The three
errors, which can almost be called classic, are the inconsistent assignment of
a unit, inconsistent punctuation, and an inconsistent definition of the end
points of the 24h measurement period.

Error 1: The specification of the unit "h" is inconsistent (0 is logged with,
12 without "h").
Remedy: Log the specifications without "h".
Error 2: The punctuation is inconsistent. The specification "1:30" is
logged with a colon, the specification "20”15’" with quotation marks.
Remedy: Standardize the punctuation (see 4.12.).
Error 3: The logging of "0h" and "24h" is inconsistent. For a
measurement within 24 hours, a measurement at 24h corresponds to a
measurement at 0h.
Remedy: Standardize the logging to either 24h or 0h.
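These remedies can be sketched with the REPLACE function from 4.3, applied to the six example entries in MIX_DATA. The variable name CLEANED is an assumption; the quotation-mark characters are taken from the example data.

```spss
string CLEANED (A6).
* Error 1: remove the unit "h". *.
compute CLEANED=replace(MIX_DATA,'h','').
* Error 2: standardize the punctuation to a colon. *.
compute CLEANED=replace(CLEANED,'”',':').
compute CLEANED=replace(CLEANED,'’','').
* Error 3: map the end point 24h to 0h. *.
if CLEANED='24' CLEANED='0'.
exe.
list.
```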


Further problems may occur when logging time specifications in string
format. Typical possible errors are e.g. the intentional input of originally
helpful comments (e.g. placing a "?" to indicate that an input might need to
be checked again), typing errors when entering letters (an "s" or a special
character is accidentally entered instead of the unit "h"), or typing errors
when entering numbers (transposed digits, e.g. 23:95 is entered instead of 23:59).
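The three checks described above can also be sketched outside SPSS; the following illustrative Python snippet (the function name and the simple heuristics are our own assumptions, not part of the original program) scans a batch of logged time strings for mixed units, mixed punctuation, and the simultaneous use of both end points:

```python
def check_time_entries(entries):
    """Flag three classic inconsistencies in a batch of logged time strings:
    mixed use of the unit "h", mixed punctuation, and use of both the
    0h and 24h end points of a 24-hour measurement period."""
    problems = []
    # Error 1: some entries carry the unit "h", others do not
    units = {e.endswith("h") for e in entries}
    if len(units) > 1:
        problems.append("inconsistent unit")
    # Error 2: more than one separator character type is in use
    separators = {c for e in entries for c in e if not c.isalnum()}
    if len(separators) > 1:
        problems.append("inconsistent punctuation")
    # Error 3: both end points of the 24-hour period occur
    if "0h" in entries and "24h" in entries:
        problems.append("inconsistent end points (0h and 24h)")
    return problems

print(check_time_entries(["0h", "1:30", "12", "18", "20\"15'", "24h"]))
```

Applied to the six example entries, the sketch reports all three classic errors.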
Similarly to the differentiation between an American (e.g. 3 PM) and a
European notation (e.g. 15:00), the format of a time specification plays a
comparably important role for its correct display. For example, the
internally stored numerical value "12994157300" is displayed as
"3609488:08" using the TIME8 format, but as "21-JUL-1994 08:08" using
the DATETIME17 format. If you now computed the difference between the values
"3609488:08" and "21-JUL-1994 08:08", the result would be "0" (both
specifications are stored identically as "12994157300").
Nevertheless, it can happen that the difference between two identically
displayed dates is not the expected zero, but a positive or negative value.
If, in the following example, the difference between the two values
"21-JUL-94" is computed, the result is not zero but "-1".
data list free
/ DATE_3 (F15) DATE_4 (F15).
begin data
12994157300 12994157301
end data .
exe.
formats DATE_3 (date9) DATE_4 (date9).
compute DIFF3=DATE_3-DATE_4.
exe.
list.
DATE_3 DATE_4 DIFF3
21-JUL-94 21-JUL-94 -1,00
Number of cases read: 1 Number of cases listed: 1
The explanation is that the specifications "21-JUL-94" are based on a
formatting of the values 12994157300 and 12994157301 respectively. These
two values differ by one unit at the last position. In the display, the differing
values are rounded, so to speak, and thus appear in the display as equal,
although they actually are not. However, calculating the difference uses the
non-rounded values and thus arrives at a difference between the apparently
identical date specifications.
Thus, if arithmetic operations are to be performed on date values, it should
be ensured that values from different sources and with different formats are
ultimately stored uniformly and lead to exactly matching results.
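The display-versus-storage effect can be reproduced with any date/time library; the following Python sketch (illustrative, not the book's SPSS example) shows two timestamps that render identically at minute resolution yet differ internally by one second:

```python
from datetime import datetime

# Two timestamps that differ internally by one second ...
a = datetime(1994, 7, 21, 8, 8, 20)
b = datetime(1994, 7, 21, 8, 8, 21)

# ... but are displayed identically at minute resolution
print(a.strftime("%d-%b-%Y %H:%M"))
print(b.strftime("%d-%b-%Y %H:%M"))

# The difference uses the stored values, not the rounded display
print((a - b).total_seconds())  # -1.0
```

As in the SPSS listing, the apparently equal timestamps yield a non-zero difference.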

4.12 Uniformity of punctuation resp. decimal places
The standardization of punctuation is usually a fairly straightforward process,
since punctuation usually occurs regularly, in a uniform manner (comma,
period, etc.) and with equal frequency (see 4.12.1). Removing punctuation
from strings may be a somewhat trickier process from a programming resp.
logical point of view, because experience shows that punctuation can occur
irregularly, in different ways and with different frequency (see 4.12.2).

4.12.1 Adding punctuation resp. decimal places
In practice, the problem may occur that in (parts of) data storages the
punctuation (commas, decimal points) was either not included, got lost
during migrations or has to be adjusted in the context of new data
definitions. If values (e.g., 3.14) lose their decimal points, they take on
completely different orders of magnitude (e.g., 314). Since these values
would thus distort any analysis, a uniform and correct punctuation must be
reconstructed. An explicit assumption of the following approach is that the
number of digits before the decimal point is known and, above all, uniform.
For a non-uniform number of digits before the decimal point, the values would
have to be subjected to separate standardization steps.
data list free
/NOCOMMA(F11).
begin data
1
123
12345
12345678
end data.
string NSTRING (A11).
compute NSTRING = rpad(rtrim(ltrim(string(NOCOMMA,F8))),8,'0').
exe.
compute COMMA = number(concat(substr(NSTRING,1,3),'.',substr
(NSTRING,4,2)),F6.2).
exe.
list variables= NOCOMMA COMMA.
NOCOMMA COMMA
1 100,00
123 123,00
12345 123,45
12345678 123,45
Number of cases read: 4 Number of cases listed: 4
In a first step, the approach converts the read-in numerical values
(variable NOCOMMA) into left-aligned strings (NSTRING), which are then
padded with zeros into digit sequences of uniform length. In a second step,
the first SUBSTR reads the digits for the places before the decimal point
and the second SUBSTR reads the digits for the places after the decimal
point from this string. CONCAT places these extracted digit sequences before
and after the required decimal point. Finally, F6.2 defines the format for
this variable (COMMA). This format must not be wider than the total length
of the extracted digits including the decimal point. If the "F" width
exceeds this sum of characters, an error message is triggered and SPSS is
not able to create the COMMA variable. This variant is based on the fact
that the digits were aligned to the left and padded to the specified length
with so-called "trailing zeros" (e.g. "1" became "10000000").
Using SPSS, it is also possible to align the digits to the right and fill them
up to the specified length with so-called "leading zeros"; in this case, e.g.
"1" becomes "00000001". The requirement is that the variable in question is
numeric; subsequently, only the N format needs to be assigned to this variable
using FORMATS resp. in the COMPUTE expression (see "N8" below).
The N format ("restricted numeric" format) is only permissible for unsigned
integer values.
data list free
/NOCOMMA(F11).
begin data
1
123
12345
12345678
end data.
formats NOCOMMA (N8.0).
string NSTRING2 (A11).
compute NSTRING2 = rpad(rtrim(ltrim(string(NOCOMMA,N8))),8,'0').
exe.
list var NOCOMMA NSTRING2 .
NOCOMMA NSTRING2
00000001 00000001
00000123 00000123
00012345 00012345
12345678 12345678
Number of cases read: 4 Number of cases listed: 4
The same approach can be used to insert several punctuation marks into a
sequence of digits. Inserting several dots may be necessary, for example, if
product codes have been extended from a one-digit to a two-digit punctuation
and the existing data storages must be adapted accordingly. With this approach
the resulting variable can only be a string variable. It is assumed that the
number of characters before, between and after the respective punctuation
marks is known and uniform.
data list free
/NOCOMMA(F11).
begin data
1
123
12345
12345678
end data.
string NSTRING (A11).
compute NSTRING = rpad(rtrim(ltrim(string(NOCOMMA,F8))),8,'0').
exe.
string COMMAS (A11).
compute COMMAS = concat(substr(NSTRING,1,3),'.',
substr(NSTRING,4,2),'.',substr(NSTRING,6,2)).
exe.
string COMMAS2 (A11).
compute COMMAS2 = concat(substr(NSTRING,1,3),':',
substr(NSTRING,4,2),':',substr(NSTRING,6,2)).
exe.
list variables= NOCOMMA COMMAS COMMAS2.
NOCOMMA COMMAS COMMAS2
1 100.00.00 100:00:00
123 123.00.00 123:00:00
12345 123.45.00 123:45:00
12345678 123.45.67 123:45:67
Number of cases read: 4 Number of cases listed: 4
The COMMAS2 variant uses a colon instead of a dot. The created character
sequences COMMAS and COMMAS2 are strings. Using a similar method, first and
last names stored in separate variables, or month and year specifications,
can be combined into one single string variable (with or without
punctuation).
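The multiple-separator variant can likewise be sketched in Python (the helper and its group widths are illustrative assumptions mirroring the 3-2-2 split above):

```python
def insert_separators(value, widths=(3, 2, 2), sep="."):
    """Pad the digits with trailing zeros and join fixed-width groups with
    a separator; the group widths are assumed known and uniform."""
    total = sum(widths)
    digits = str(value).ljust(total, "0")   # trailing zeros
    parts, pos = [], 0
    for w in widths:
        parts.append(digits[pos:pos + w])
        pos += w
    return sep.join(parts)

print(insert_separators(12345678))        # 123.45.67
print(insert_separators(12345, sep=":"))  # 123:45:00
```

Swapping the `sep` argument reproduces the COMMAS/COMMAS2 pair from the listing above.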
A slightly modified approach can be used to create the levels of a string
variable from the individual values of several variables. This procedure is
useful if e.g. a new classification variable is to be created from the values
of several variables. With two to three variables, including the values in
strings allows e.g. the display of the position in a two- or three-dimensional
coordinate system (see below).
data list free
/VARX(F3) VARY(F3).
begin data
123 123
124 124
125 125
126 126
127 127
128 128
end data.
string LABEL2 (A7).
compute LABEL2= concat((string(VARX,F3.0)),":",
(string(VARY,F3.0))) .
exe.
list variables= LABEL2.
LABEL2
123:123
124:124
125:125
126:126
127:127
128:128
Number of cases read: 6 Number of cases listed: 6
The created character sequence LABEL2 is a string. When defining the
string, make sure that the assigned format (e.g., A7) exactly matches the sum
of the individual string elements, e.g., 2 x F3.0 plus one position for the
punctuation symbol (e.g. ":"). If the "A" value is smaller than the sum of the
combined characters, a variable LABEL2 is created, but the strings are
truncated.

GRAPH
/SCATTERPLOT(BIVAR)=
VARX WITH VARY BY
LABEL2 (NAME)
/MISSING=LISTWISE .
By integrating numbers in strings it is possible to display a position in a
coordinate system.
For very large amounts of data resp. in the case of many variables with many
levels each, it is often also necessary to form grouping variables which have
exactly one single value or code for each combination of levels from the
variables included, e.g. as follows:
if (VAR1='AA' & VAR2='AA' & VAR3='AA' & VAR4='AA') KODE = 1 .
exe.
if (VAR1='AA' & VAR2='AA' & VAR3='AA' & VAR4='BB') KODE = 2 .
exe.
The derivation using IF statements is theoretically possible, but with many
variables, levels, and level combinations it is rather costly resp.
error-prone to program. An IF approach also implicitly presupposes the
consistency of the assigned codes, while the CONCAT/STRING approach can
rather be regarded as an automatism for which it is secondary how the data
were coded.
The CONCAT/STRING approach could replace an IF approach completely
or at least support it substantially. In the case of a data volume that is
already fixed (thus not further updated resp. changed), this approach could
be used to explore first which of the theoretically possible level
combinations are actually (empirically) present at all. If you have to
program an IF approach anyway, such programming can then be limited to those
level combinations which actually occur in the data.
The following approach was designed for four short strings (ideally of equal
length), but can easily be extended to any number or length of strings.
data list free
/VAR1(A2) VAR2(A2) VAR3 (A2) VAR4 (A2).
begin data
AA AA AA BB BB BB CC CC CC AA BB CC
AA BB CC AA BB BB BB CC AA AA BB AA
AA AA AA BB BB BB CC CC CC AA BB CC
AA BB CC AA BB BB BB CC AA AA BB AA
end data.
string CODE (A11).
compute CODE= concat(substr(VAR1,1,2),':',
substr(VAR2,1,2),':',
substr(VAR3,1,2),':',
substr(VAR4,1,2)) .
exe.
AUTORECODE
VARIABLES=CODE /INTO CODE_NUM
/PRINT.
list variables= CODE CODE_NUM .
The approach essentially corresponds to the procedures already presented and
is therefore not explained further. For the sake of clarity, this
CONCAT/SUBSTR combination was supplemented by a final
AUTORECODE step (for its weaknesses see Schendera, 2005). The
punctuation makes the joined strings easier to read.
CODE CODE_NUM
AA:AA:AA:BB 1
BB:BB:CC:CC 5
CC:AA:BB:CC 6
AA:BB:CC:AA 3
BB:BB:BB:CC 4
AA:AA:BB:AA 2
AA:AA:AA:BB 1
BB:BB:CC:CC 5
CC:AA:BB:CC 6
AA:BB:CC:AA 3
BB:BB:BB:CC 4
AA:AA:BB:AA 2
Number of cases read: 12 Number of cases listed: 12
Each CODE or CODE_NUM value represents a unique combination of the
assembled variables. "1" represents e.g. the combination "AA:AA:AA:BB"
and thus all cases which have the string "AA" in the variable VAR1, likewise
the string "AA" in the variables VAR2 and VAR3, and the string "BB" in the
variable VAR4. The code "1" was assigned only to this combination and no
other. The highest CODE_NUM value represents the variability of the
combinatorics. In the example, the value "6" indicates that the row-wise
combination of the four variables results in six different combinations. Or,
expressed in other words, the cases/rows can be divided into six different
groups according to their levels in the variables VAR1 to VAR4.
The combination of the values of several variables is in principle nothing
else than the creation of a so-called compound index (syn.: indicator). An
indicator is used e.g. if a dataset is sorted by the levels of several, but
always the same, variables. Especially in the case of very large data
storages, it is much more effective from an information technology point of
view to sort a dataset using an indicator instead of using all the variables
of which it is composed at once. CODE_NUM consists e.g. of the variables
VAR1 to VAR4 and can thus be used as an index instead of these variables.
The resulting sort order is in each case the same, only the speed is many
times higher. This way, the performance when working with large datasets can
be increased very simply.
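The CONCAT + AUTORECODE step can be mimicked in a few lines of Python (the function is an illustrative stand-in; like AUTORECODE, it numbers the sorted distinct strings consecutively):

```python
def autorecode(codes):
    """Assign consecutive numeric codes to the sorted distinct strings,
    mirroring the CONCAT + AUTORECODE combination above."""
    mapping = {c: i for i, c in enumerate(sorted(set(codes)), start=1)}
    return [mapping[c] for c in codes]

# The six distinct rows of the SPSS example, joined with ":" as in CODE
rows = [("AA","AA","AA","BB"), ("BB","BB","CC","CC"), ("CC","AA","BB","CC"),
        ("AA","BB","CC","AA"), ("BB","BB","BB","CC"), ("AA","AA","BB","AA")]
codes = [":".join(r) for r in rows]
print(autorecode(codes))  # [1, 5, 6, 3, 4, 2]
```

The numeric codes match the CODE_NUM column of the SPSS listing.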

4.12.2 Removing punctuation from strings


Removing punctuation from strings may be a trickier process from a
programming point of view or logically, because punctuation may occur
irregularly, in different ways and with different frequency.
If no existing punctuation needs to be preserved when removing a large
number of different punctuation marks resp. symbols, then the approach
introduced in section 4.10.1 can be applied to punctuation too. This
approach removes the indicated punctuation completely, independent of where
it occurs.
data list free
/COMMA (A9).
begin data
1:2.3456 12.:3456 1.23:45.6
1.234:56 1.23.45:6
end data.
string COMMA2 (A9).
compute COMMA2=COMMA.
exe.
loop .
compute #I = INDEX(COMMA,".;,:",1) .
if #I > 0
COMMA=concat(substr(COMMA,1,#I-1),substr(COMMA,#I+1)).
end loop if #I=0 .
list variables COMMA COMMA2 .
The COMPUTE line containing INDEX is important; in this line, the
punctuation marks to be eliminated can be specified in the second argument
of INDEX.
COMMA COMMA2
123456 1:2.3456
123456 12.:3456
123456 1.23:45.6
123456 1.234:56
123456 1.23.45:6
Number of cases read: 5 Number of cases listed: 5
Running a REPLACE function twice would have a similar effect. REPLACE
is e.g. also suitable for searching and replacing punctuation marks, but is only
available in SPSS from version 14 on (see 4.3.).
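Outside SPSS, the same blanket removal can be sketched with a character translation table in Python (illustrative; `str.translate` deletes every listed character regardless of its position, much like the INDEX loop or a repeated REPLACE):

```python
def strip_punctuation(s, chars=".;,:"):
    """Remove the listed punctuation characters wherever they occur."""
    return s.translate(str.maketrans("", "", chars))

values = ["1:2.3456", "12.:3456", "1.23:45.6", "1.234:56", "1.23.45:6"]
print([strip_punctuation(v) for v in values])
```

All five example strings are reduced to "123456", as in the COMMA column of the SPSS listing.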
However, if only certain characters are to be removed, the removal of
punctuation from strings is a little trickier. Explicit assumptions of the
following approach are that a processed string contains only one punctuation
character at a time and that (initially) only one character type (comma,
colon, etc.) is used for punctuation by default.
data list free
/COMMA (A7).
begin data
1:23456 12:3456 123:456
1234:56 12345:6
end data.
compute P_POSITION=index (COMMA,':').
exe.
string DIGITS_LEFT (A7).
compute DIGITS_LEFT=substr(COMMA,1,P_POSITION-1).
exe.
string DIGITS_RIGHT (A7).
compute DIGITS_RIGHT=substr(COMMA,P_POSITION+1,7-
P_POSITION).
exe.
string NOCOMMA (A12).
compute NOCOMMA=concat ((rtrim(DIGITS_LEFT,' ')),
(rtrim(DIGITS_RIGHT,' '))).
exe.
list.
In a first step, INDEX determines the position of the (uniform) punctuation
mark and stores it as a value in the variable P_POSITION. In a second step,
SUBSTR reads all characters of the respective string up to the punctuation
mark (P_POSITION) and stores them in the variable DIGITS_LEFT. In a
third step, all characters after P_POSITION are read and stored in the
variable DIGITS_RIGHT. In a fourth step, the DIGITS_LEFT and
DIGITS_RIGHT strings are concatenated and stored in the variable
NOCOMMA.
COMMA P_POSITION DIGITS_LEFT DIGITS_RIGHT NOCOMMA
1:23456 2,00 1 23456 123456
12:3456 3,00 12 3456 123456
123:456 4,00 123 456 123456
1234:56 5,00 1234 56 123456
12345:6 6,00 12345 6 123456
Number of cases read: 5 Number of cases listed: 5
This approach also works if the strings have different punctuation characters
(provided there is only one such character per string). The output error
message can be ignored. If, for example, the above syntax is run on slightly
modified sample data which has a dot instead of the usual colon at one
position, all strings except one are stored cleansed in the created variable
NOCOMMA.
begin data
1:23456 12:3456 123:456
1234:56 12345.6
end data.

COMMA P_POSITION DIGITS_LEFT DIGITS_RIGHT NOCOMMA
1:23456 2,00 1 23456 123456
12:3456 3,00 12 3456 123456
123:456 4,00 123 456 123456
1234:56 5,00 1234 56 123456
12345.6 ,00 12345.6 12345.6
Number of cases read: 5 Number of cases listed: 5
The program then only needs to be adapted for the exception "12345.6" in
such a way that it searches the entries in NOCOMMA for "." instead of the
entries in COMMA for ":". If, however, only dots are to be eliminated from
the beginning, only a dot instead of a colon needs to be specified in the
INDEX line of the above program. Removing dots leaves the correctly placed
colons unchanged. The data of this example contained only one punctuation
mark per string. If multiple equal punctuation marks (e.g., two colons)
occur, the INDEX approach is only valid under the condition that only the
first of the equal punctuation marks is to be replaced resp. removed,
because the INDEX function always finds the first occurrence of a given
character string. If the punctuation is irregular (e.g., a string contains
two punctuation characters of which the first is correct), the INDEX
function would remove the actually correct character and leave the
subsequent incorrect character standing. The RINDEX function, which finds
the last occurrence of a given string, has the same problem with irregular
punctuation, only mirrored: if a string contains two punctuation characters
of which e.g. the second is correct, the RINDEX function would remove this
actually correct character and leave the wrong character occurring before it
(see the example for "1:234:56" below). RINDEX alone could thus be a first
solution for this sample data only if the character to be removed is always
the last one. Furthermore, a special difficulty occurs: several punctuation
marks per string mean that the strings become longer; accordingly, the
greater length of the strings must be considered in CONCAT resp. in the
definition of the strings. If, for example, slightly modified example data
show two colons instead of the usual single colon at one place, then the
accordingly adapted syntax stores all strings cleansed, with one exception.
The following example illustrates this with the value "1:234:56". It is only
important that this string contains two punctuation characters; the
punctuation in the other strings is irrelevant.
begin data
1:23456 12:3456 123:456
1:234:56 12345:6
end data.

COMMA P_POSITION DIGITS_LEFT DIGITS_RIGHT NOCOMMA
1:23456 2,00 1 23456 123456
12:3456 3,00 12 3456 123456
123:456 4,00 123 456 123456
1:234:56 2,00 1 234:56 1234:56
12345:6 6,00 12345 6 123456
Number of cases read: 5 Number of cases listed: 5
Assuming the first colon is actually superfluous in the value "1:234:56",
the program worked correctly and removed it from the string. However, if the
second colon were superfluous, the program would have removed the wrong
colon from the string. The same applies to the RINDEX function, only
reversed: if the first colon were superfluous, the program would have worked
incorrectly; but if the second colon were superfluous, the program would
have worked correctly.
In order to avoid such pitfalls (again assuming a regularity of the multiply
occurring symbols, e.g. that the second symbol always has to be removed),
the value of the P_POSITION variable can be corrected using a combination of
the INDEX and RINDEX functions, e.g. in the following way (please note "do
if (P_POSITION_2 > P_POSITION_1)" in the syntax below):
data list free
/COMMA (A8).
begin data
1:23456 12:3456 123:456
[Link] 12345:6
end data.
compute P_POSITION = index(COMMA,':').
exe.
compute P_POSITION_1=index(COMMA,':').
exe.
compute P_POSITION_2=rindex(COMMA,':').
exe.
do if (P_POSITION_2 > P_POSITION_1) .
compute P_POSITION=P_POSITION_2 .
end if.
string DIGITS_LEFT (A9).
compute DIGITS_LEFT=substr(COMMA,1,P_POSITION-1).
exe.
string DIGITS_RIGHT (A9).
compute DIGITS_RIGHT=substr(COMMA,P_POSITION+1,8-
P_POSITION).
exe.
string NOCOMMA (A9).
compute NOCOMMA=concat ((rtrim(DIGITS_LEFT,' ')),
(rtrim(DIGITS_RIGHT,' '))).
exe.
list.
The implicit assumption of this example is that always the first punctuation
character should be kept and always the second character removed.
COMMA P_POSITION P_POSITION_1 P_POSITION_2 DIGITS_LEFT DIGITS_RIGHT NOCOMMA
1:23456 2,00 2,00 2,00 1 23456 123456
12:3456 3,00 3,00 3,00 12 3456 123456
123:456 4,00 4,00 4,00 123 456 123456
1:234:56 6,00 2,00 6,00 1:234 56 1:23456
12345:6 6,00 6,00 6,00 12345 6 123456
Number of cases read: 5 Number of cases listed: 5
If there is more than one colon per string, e.g. three colons, this
correction factor would have to be checked again carefully for its
functionality, more precisely: whether the correct (resp. wrong) colons
always occur at the beginning, in the middle, or at the end of a string,
and if not, under which conditions they occur at these positions.
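The INDEX/RINDEX correction can be sketched in Python with `str.find` and `str.rfind` (an illustrative helper assuming, as above, that of two separators the second one is always the superfluous one):

```python
def remove_second_separator(s, sep=":"):
    """If a string contains two separators, keep the first and remove the
    last (find vs. rfind mirror SPSS's INDEX vs. RINDEX)."""
    first, last = s.find(sep), s.rfind(sep)
    if last > first:          # two (or more) separators present
        return s[:last] + s[last + 1:]
    return s                  # zero or one separator: leave unchanged

print(remove_second_separator("1:234:56"))  # 1:23456
print(remove_second_separator("12:3456"))   # 12:3456 (unchanged)
```

As in the corrected SPSS program, "1:234:56" loses only its second colon.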

4.12.3 Standardizing the "punctuation" of date variables
One and the same date can be written with different punctuation. For
example, August 14, 2005 can be written not only as "14.08.2005", but also
as "14-08-2005" or "14/08/2005", and even completely irregularly as
"14.08-2005" or "14/08.2005". If there are inconsistent date spellings,
mainly in external text files (see e.g. the variable MIX_DATA below), they
can be standardized using the following program.
data list free
/ MIX_DATA (A10).
begin data
14/8/2005 14-8-2005 14.8.2005 14.8/2005
14.8-2005 14-8/2005 14/8.2005 14/8/2005
14-8.2005 14/8-2005 14.8-2005 14/8.2005
end data .
compute #x=index(MIX_DATA,'/.-,',1).
compute #y=rindex(MIX_DATA,'/.-,',1).
compute MY_DAY=number(substr(MIX_DATA,1,#x-1),F2).
compute MY_MONTH=number(substr(MIX_DATA,#x+1,#y-#x-1),F2).
compute MY_YEAR=number(substr(MIX_DATA,#y+1),F4).
format MY_DAY MY_MONTH (F2.0) MY_YEAR (F4.0).
compute
STANDARD_DATA=date.mdy(MY_MONTH,MY_DAY,MY_YEAR).
format STANDARD_DATA (edate10).
list.
The mainly used INDEX, RINDEX and SUBSTR functions have already
been explained earlier (for further work with time and date variables see
Schendera, 2005).
MIX_DATA MY_DAY MY_MONTH MY_YEAR STANDARD_DATA
14/8/2005 14 8 2005 14.08.2005
14-8-2005 14 8 2005 14.08.2005
14.8.2005 14 8 2005 14.08.2005
14.8/2005 14 8 2005 14.08.2005
14.8-2005 14 8 2005 14.08.2005
14-8/2005 14 8 2005 14.08.2005
14/8.2005 14 8 2005 14.08.2005
14/8/2005 14 8 2005 14.08.2005
14-8.2005 14 8 2005 14.08.2005
14/8-2005 14 8 2005 14.08.2005
14.8-2005 14 8 2005 14.08.2005
14/8.2005 14 8 2005 14.08.2005
Number of cases read: 12 Number of cases listed: 12
For time and date variables, the format must be observed, e.g. European vs.
American format (see 4.11.).
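A Python sketch of the same standardization (illustrative; like the SPSS program, it assumes day-month-year order and the separator set . / -):

```python
import re
from datetime import date

def standardize_date(s):
    """Split a day-month-year string on any of the separators . / - and
    rebuild it with uniform punctuation."""
    day, month, year = (int(p) for p in re.split(r"[./-]", s))
    return date(year, month, day).strftime("%d.%m.%Y")

mixed = ["14/8/2005", "14-8-2005", "14.8.2005", "14.8/2005", "14-8.2005"]
print([standardize_date(s) for s in mixed])
```

All mixed spellings are reduced to the uniform "14.08.2005", matching the STANDARD_DATA column above.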

4.13 Uniformity of Missings


Missings can be inconsistent in at least three ways:

Variant 1: There are codes for user-defined missings, some of which have not yet been transferred to SPSS.
Variant 2: There are different codes for the same phenomenon (the opposite case, same codes for different phenomena, would also have to be checked).
Variant 3: There are both system-defined and user-defined missings.
Variant 1:
For all codes for user-defined missings it should be checked whether they
have been transferred to SPSS. An existing syntax (if there is one at all)
provides an overview:
MISSING VALUES VAR_1 (98,99) VAR_2 (99) VAR_STRG (' ').
If individual codes (e.g. 98 in VAR_2) or even complete variables (e.g.
VAR_3, see "Completeness") are not included, then they must be included in
this definition of user-defined missings (see variant 2).
Variant 2:
There are different codes for the same phenomenon (or vice versa). A
unified coding can be achieved either from the outset during data entry
resp. afterwards via standardizing IF, RECODE or DO IF instructions (cf.
Schendera, 2005). A standardizing definition resp. pass to SPSS can be
achieved with the following syntax statement.
MISSING VALUES VAR_1 VAR_2 VAR_3 (98,99) VAR_STRG
('Missing').
Note: Often, when entering data, both 0 and an empty cell ('blank') are used
simultaneously as 'codes' for missing values. Because the value 0 is here
erroneously equated with the information 'cell is empty', a quite serious
problem arises: experience shows that these data usually cannot be
distinguished afterwards and thus cannot be corrected resp. unified by
syntax statements. Furthermore, a sequence problem of SPSS has to be
considered when unifying zeros and empty cells in pairs (e.g., VAR_1 and
VAR_2).
If, for example, VAR_1 and VAR_2 are to be set to missing whenever VAR_1
and VAR_2 each contain zeros or blanks, then the first of the following SPSS
statements is logically correct, but only the second statement works for
both variables.
* (a) Command does not work *.
if any (VAR_1, 0, SYSMIS) & any(VAR_2, 0, SYSMIS)
VAR_1=$SYSMIS .
exe.
if any (VAR_1, 0, SYSMIS) & any(VAR_2, 0, SYSMIS)
VAR_2=$SYSMIS .
exe.
* (b) Command works *.
if any (VAR_1, 0, SYSMIS) & VAR_2 <=0 VAR_1=$SYSMIS .
exe.
if SYSMIS(VAR_1) & VAR_2 <=0 VAR_2=$SYSMIS .
exe.
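The sequence problem can be illustrated in Python (a rough sketch with a helper of our own; as in the working variant (b), the second condition tests the already-updated first variable instead of repeating the original test on it):

```python
def blank_pair(v1, v2):
    """Set both values to missing (None) only if each is 0 or empty."""
    missing = lambda v: v is None or v == 0
    if missing(v1) and missing(v2):
        v1 = None
    if v1 is None and missing(v2):  # tests the updated v1, as in variant (b)
        v2 = None
    return v1, v2

print(blank_pair(0, 0))   # (None, None)
print(blank_pair(5, 0))   # (5, 0) – blanked only if both are 0/blank
```

Once the first variable has been overwritten, its original value is gone; any later condition must therefore be phrased against the updated state.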
Variant 3:
If both system-defined and user-defined missings are present, this is not an
error. For economic reasons, however, it should be considered whether a
uniform definition, i.e. only system-defined or only user-defined missings,
would not be more efficient for implementing the present requirements.

4.14 Consistency of analyses and designs (SET, SHOW and others)
The importance of representative labels for the names or values of variables
was already hinted at in the section on correctly spelled names of clients,
products or services. The related representation of companies, institutions
or projects through a uniform design is often referred to as 'Corporate
Design'. The visually uniform 'Corporate Design' serves the
self-presentation of enterprises, institutions or projects through uniform
communication of their values, goals and qualities ('Corporate Identity').
'Corporate Design' in turn includes, among other things, designs, colors,
slogans, logos etc.
A uniform procedure is often necessary for analyses as well. Uniformity is
important here, for example, when (partial) analyses with SPSS require
uniformity of decimal places, measurement levels, currency units, etc.,
despite different platforms (e.g. Macintosh, Windows), versions (e.g. SPSS
18, 17, 16, 15, 14 etc.), (distributed) workstations resp. (changing) users,
both internally and externally. In contrast to the 'Corporate Design'
(4.14.1), we could assign this aspect of rather technical-methodical and
only subordinately representative uniformity to the 'Technical Design'
(4.14.2).
By using the SET command, SPSS allows uniform designs resp. analyses to be
developed in several dimensions, resp. ensured through the uniform use of
SPSS options and settings:

Uniform color definitions and sequences.
Uniform fonts (font, size, formatting).
Templates for pivot tables, depending on the type, if necessary.
Templates for diagrams, depending on the type, if necessary.
Uniform settings (e.g. storage, language etc.).
You have to realize: Using the various possibilities of SPSS defaults, SET
makes it possible to tailor data management and data analysis in such a way
that (1) post-processing becomes only marginally necessary, and (2)
coordination problems between different jobs, coworkers or project phases
are excluded as far as possible. The procedure is as simple as it is
effective: Together, you develop a SET file that all project participants
can agree on, possibly with the appropriate templates for diagrams and
tables, install them on each computer, and you have optimal
interchangeability. In order to check in between which SPSS defaults are in
effect resp. whether all user-defined settings are still active, only one
simple command is needed. All currently effective SET options can be viewed
with a simple SHOW command.
SHOW.
This command e.g. outputs all current settings in alphabetical order (the
following output is abbreviated). The meaning of the respective SET options
is explained in detail later.
System Settings
Keyword Description Setting
BLANKS Value to which blanks are translated System-missing
CACHE Data caching setting 5 (2 data source(s) in use now)
CCA Custom currency format A "-,,,", e.g. -1,234.56
… Output abbreviated
WEIGHT Variable used to weight cases File is not weighted
WIDTH Page width for text output 80
WORKSPACE Special workspace memory limit in kilobytes 6148

Usually the settings remain effective until they are overwritten by a new SET
default. By the way, not everything that SHOW outputs (e.g. LICENSE)
corresponds to a SET option. The installed license can be viewed e.g. via
SHOW LICENSE., but e.g. cannot be set via SET LICENSE.
SPSS offers further possibilities for the uniformity of 'Corporate Design'
resp. 'Technical Design', e.g. the sequence of colors, patterns or markers.
These, however, are currently not directly controllable via the SPSS command
syntax. Adjustments made using mouse control can nevertheless be stored as
so-called chart templates (*.sgt; in SPSS V15 e.g. under Edit → Options in
the "Charts" tab) and reused for automation using the SET option
CTEMPLATE.
The following section presents the most important possibilities of the SET
option for the so-called 'Corporate Design' (for further possibilities please
refer to the SPSS Command Syntax Reference).

4.14.1 Representative uniformity ('Corporate Design')
In this section different options for designing a representative uniformity
('Corporate Design') are presented.
For example, the OLANG option can be used to set the language for most
elements of the SPSS output.
SET OLANG=German.
This SET command causes e.g. a uniform output in German. OLANG is not
applicable to simple text output and interactive diagrams (IGRAPH) or maps
(MAPS). SPSS offers different output languages. In case a desired output
language is not available, the software's default language setting can be used
via DEFOLANG.
The TLOOK option can be used to specify either SPSS table templates or
user-defined templates for pivot tables. The TLOOK template can be used to
control e.g., borders, headline placement, line and column labels, and text
font.
SET TLook 'C:\My_data\[Link]' .
Using TLOOK, this SET command assigns e.g. a uniform table design, in
this case a design offered by SPSS.
SET TLook 'C:\My_data\Client [Link]' .
This SET command assigns e.g. via TLOOK a uniform table design, in this
case a design developed by the user himself and provided with an own name.
The default setting for TLOOK is NONE.
With the CTEMPLATE option, user-developed templates for diagrams can
be specified. The template specified under CTEMPLATE can control lines,
colors, fill patterns and the font of the text, among other things.
SET CTEMPLATE 'C:\My_data\SPSS\Looks\[Link]' .
This SET command defines e.g. via CTEMPLATE and the user-defined
diagram template "[Link]" a uniform design for
diagrams. The default setting for CTEMPLATE is NONE. SET
CTEMPLATE has the same, only more general effect as a /TEMPLATE='...'
[TEMPLATE without 'C'!] in a GRAPH program. CTEMPLATE applies to
all diagrams, /TEMPLATE= only to the one for which the template was
explicitly requested.
The TFIT option can be used to specify the type of column width for pivot
tables.
SET TFIT both .
This SET command assigns via BOTH e.g. a uniform column width, which is
wide enough for labels as well as for values. Using LABELS, the column
width would only be wide enough for labels; values that are longer than
labels are displayed as asterisks in this case.
The options ONUMBERS and OVARS resp. TNUMBERS and TVARS
can be used to specify how variables should be displayed in pivot tables. The
TNUMBERS and TVARS options are used to set the display of values and
variables in the table itself. The ONUMBERS and OVARS options are used
to set the display of values and variables in the frame ('Outline') of the table.
It is possible to display variables with names (NAMES), values (VALUES),
labels (LABELS) or both (BOTH, labels and values or names and values).
SET TNUMBERS values TVARS labels ONUMBERS labels OVARS labels .
This SET command defines e.g. a uniform display of variables.
The CCA to CCE options can be used to pass up to five user-defined
currency and/or other special output formats with prefixes and suffixes to
SPSS and assign them in analyses e.g., through FORMATS (see below, also:
WRITE FORMATS, PRINT FORMATS, WRITE and PRINT). The
designations CCA to CCE are fixed for user-defined formats and cannot be
changed.
set CCA=',$,,'
CCB=',,%,%'
CCC='[,mmHg ,,]'
CCD=',€,,'.
formats VARA(CCA4.0) / VARB(CCB4.0) / VARC(CCC4.0) / VARD(CCD4.0).
SPSS allows you to specify up to five user-defined (currency) formats, each
built from four elements (here, four formats are defined). The separation
within the fixed sequence of negative prefix, prefix, suffix and negative
suffix is done either by dots or commas (depending on which is not used as
the decimal point) and thus always contains three dots or commas. A single
specification cannot be longer than 16 characters. The expression is placed
in quotation marks. By the way, the example CCC uses a blank to the right of
"mmHg"; see the output below.
Examples for four user-defined formats

Format element     CCA      CCB      CCC           CCD
Negative prefix    None     None     [             None
Prefix             $        None     mmHg          €
Suffix             None     %        None          None
Negative suffix    None     %        ]             None
Delimiter          Comma    Comma    Comma         Comma
Pos. example       $1200    1200%    mmHg 1200     €1200
Neg. example       $1200    1200%    [mmHg 1200]   €1200
The LENGTH (default: 59; minimum: 40, maximum: 999,999 lines) and
WIDTH (default: 80; minimum: 80, maximum: 255 characters) options allow
you to uniformly define the maximum length and width of text output.
SET LENGTH=60 WIDTH=100 .
This SET command uses e.g. LENGTH to specify a page length of 60 lines
and a page width of 100 characters to define a uniform text output.
The HEADER option can be used to assign headings to the SPSS output.
HEADER refers both to headings generated by the system and to those
defined by the user via TITLE resp. SUBTITLE.
SET HEADER=YES.
This SET command provides e.g. the SPSS output with headings.
HEADER=NO suppresses headings. The keyword BLANK is interesting in
that it likewise does not produce headings, but it can still cause a page break.
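In analogy to the examples above, this setting could be sketched as follows (assuming the BLANK keyword of the HEADER option):
SET HEADER=BLANK.
This SET command suppresses the headings themselves, while a page break can still occur in the output.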
4.14.2 Technical and methodological consistency ('Technical Design')
This section presents various options for ensuring technical and
methodological consistency ('Technical Design').
Uniform analyses, values (blanks) and decimal places, scientific notation
The DECIMAL option can be used to specify for numeric values whether
the decimal separator is output as a comma or a dot. This option is new in
version 15.
SET DECIMAL=COMMA [DOT].
DOT outputs the decimal separator as a dot, COMMA outputs it as a
comma. DECIMAL can override the settings made by the
operating system or a preceding LOCALE command. However, a subsequent
LOCALE command can overwrite DECIMAL again. LOCALE is presented
at the end of this chapter.
The FORMAT option can be used to change the default F8.2 print and write
format for numeric variables.
SET FORMAT=F6.1.
Via FORMAT, this SET command e.g. assigns numeric variables a uniform
print and write format with six digits, including decimal point and one
decimal place. If, for example, new variables are created using COMPUTE,
they are immediately created in F6.1 format and not in the preset F8.2 format.
This format refers to the display and not the internal SPSS storage.
If two-digit years are present or if two-digit years are to be subjected to
certain date operations, the "Y2K" problem can be avoided by using the SET
option EPOCH. EPOCH tells SPSS how to "understand" certain two-digit
numbers by specifying an "interpretational range" of 100 years.
SET EPOCH=AUTOMATIC.
With AUTOMATIC, this EPOCH option causes SPSS to interpret
all two-digit values as if they were within a span beginning 69 years before
the current system year and ending 30 years after it. For example, if the
system year is 2005, SPSS interprets the year specification "05" as "2005"
and not as "1905". However, if the year is to be "1905", EPOCH must be
changed by specifying the beginning of the 100-year span.
SET EPOCH=1900.
This EPOCH option passes the information to SPSS, for example, that all
two-digit values are in the time or number space from 1900 to 1999. The year
"05" is interpreted as "1905" and not as "2005".
An EPOCH option that does not begin at the turn of a century is interpreted
by SPSS in a slightly more complicated way.
SET EPOCH=1950.
This EPOCH-option passes e.g. the information to SPSS that all values
between 50 and 99 start with "19", and all values between 00 and 49 start
with "20". The year "05" is thus interpreted as "2005" and not as "1905".
The EPOCH option is very useful for two-digit year specifications. For many
reasons, however, you should rather try not to have any two-digit years in
your data resp. replace them with correct four-digit years (see the examples
elsewhere).
SPSS is set to interpret an empty space or empty cell (a so-called 'blank') as a
system-defined missing when reading text files or creating datasets. With the
BLANKS option, this default setting (BLANKS=SYSMIS) can be adapted to
the actual data situation resp. to the conventions of the data entry staff, e.g. if
missing data is coded as "0". With BLANKS=0, SPSS may therefore expect
that data coded in this way has no gaps; if data gaps occur nevertheless, they
would trigger an error message. BLANKS works only for empty cells resp.
empty fields of numeric variables.
SET BLANKS=0.
This BLANKS command causes SPSS to read in the value "0" as a
placeholder for a missing value (only one uniform code can be assigned);
"correct" blanks would now lead to error messages during the read-in
process. The value "0" is not yet interpreted by SPSS as a (user-defined)
missing. If SPSS is to "understand" it as such, the value "0" must be passed
to SPSS as a missing in a second step, e.g. via MISSING VALUES.
When reading in data, the SET option BLANKS=... must be set before
initiating the reading process. SPSS also offers the possibility to issue a
warning message for invalid values using UNDEFINED.
SET BLANKS=0 /UNDEFINED=WARN.
This BLANKS command causes SPSS to issue a warning message if
anything other than the specified number or blank occurs at the position of
the numeric value.
The SMALL option can be used to set the display of small numeric values.
The SMALL option only refers to the display, not to the derivation and/or
internal storage of the values.
SET SMALL=0.000100 .
This SET command uses SMALL to specify that all values smaller than
0.0001 should be output uniformly in scientific notation.
SET SMALL=0 .
This example specifies, for example, that no value should be displayed in
scientific notation.
Uniform functionality
The option RNG ("random number generator") refers to the random number
generators integrated in SPSS (in SPSS version 12 or earlier: MC, from
version 13 on: MT) and their initial values, the so-called 'seeds' of these
random number generators (to be specified for MC via SEED, for MT via
MTINDEX). The RNG option together with the SEED or MTINDEX
specification is especially useful if you want to reproduce random results; in
other words, if you want to achieve the same results e.g. with repeated
random drawings. The fact that this notion of randomness turns out to be
pseudo-randomness (because it depends on the given starting value) should
only be mentioned in passing. Using the 'seed' value RANDOM, MT and MC
output results that are not reproducible.
SET RNG=MC SEED=4567.
This SET command instructs SPSS to use the MC (Monte Carlo) random
number generator and its starting value 4567. The start value must be an
integer between 0 and 2,000,000,000 (incl.).
SET RNG=MT MTINDEX=-123.4.
This SET command instructs SPSS to use the random number generator MT
(Mersenne Twister) and the start value -123.4. Unlike MC's SEED, the start
value can be specified with decimal places and can also be negative.
RNG=MT does not necessarily reproduce the same results as RNG=MC for
the same start value (the same applies vice versa).
The CACHE option saves a complete copy of the active dataset in a
temporary file after a certain number of changes (default: 20). Working with
CACHE not only speeds up the work with SPSS, but is indispensable for
certain tasks, e.g. accessing databases via SQL.
SET CACHE 10.
For example, this SET command causes SPSS to write a copy to a temporary
file after 10 changes to the active dataset.
The COMPRESSION option can also influence the processing speed.
COMPRESSION specifies whether so-called "scratch files" are created in
compressed or uncompressed form. Compressed scratch files take up less
disk space; uncompressed scratch files can be processed faster.
SET COMPRESSION=ON.
This SET command causes e.g. SPSS to create compressed "scratch files".
The following options are not settings for a uniform way of working, but
rather options for when individual computers run the risk of not functioning
uniformly due to insufficient memory. If certain processes overload the
currently available RAM, the SET options MXCELLS resp. WORKSPACE
can be used to provide enough memory to complete the process in question.
SPSS first gives feedback that not enough memory is available and how
much memory should be allocated using MXCELLS or WORKSPACE.
If e.g., pivot tables are requested which exceed the available memory, the
SET option MXCELLS can be used to short-term increase the memory.
The SET option WORKSPACE can be used to allocate more memory if e.g.,
certain procedures exceed the available working memory, e.g.,
FREQUENCIES or CROSSTABS.
The SPSS documentation recommends using WORKSPACE and
MXCELLS cautiously and only if SPSS has issued a corresponding
warning message before, and resetting e.g. MXCELLS to its default value
immediately after the output of the corresponding pivot table.
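A temporary increase along the lines of this recommendation could be sketched as follows; the concrete values are placeholders only and should be taken from the warning message issued by SPSS:
SET MXCELLS=1500000.
* ... request the memory-intensive pivot table here ... .
SET MXCELLS=AUTOMATIC.
AUTOMATIC restores the default behavior, in which SPSS determines the maximum number of cells itself.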
Uniform processes
The following options are important settings for customizing your
workflows. These settings are recommended if employees want to have their
familiar working and analysis environment on different PCs. In this way,
SPSS sessions that are uniformly set up increase the efficiency of workflows.
The SET option MXLOOPS applies independently of the SPSS installation.
SORT, MXWARNS, MXERRS, as well as LOCALE are important if SPSS
is installed on a server (SPSSB).
With a maximum value (default: 40), MXLOOPS prevents infinite loops in
the LOOP-END LOOP command. However, if MXLOOPS is set too low, a
loop may be exited before a condition has been processed correctly. Thus, if
programs with complex LOOP-END LOOP constructs are executed on
different computers, it is recommended to make sure that all of them have
appropriate MXLOOPS defaults.
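Such a uniform default could, for example, be raised as follows; the value 500 is only an illustrative assumption:
SET MXLOOPS=500.
This SET command permits up to 500 loop passes instead of the preset maximum of 40.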
SPSS is preset so that it does not use its internal sorting algorithm
(SORT=SPSS), but tries to access server-based sorting algorithms via the
preset SET option SORT=EXTERNAL (syn.: SS). If an SPSSB server is
available on which additional sorting algorithms are installed, SPSS
accesses these server-based sorting algorithms to reduce the processing
time for large datasets.
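If, for example, the internal algorithm is to be enforced uniformly on all computers, a sketch using the keywords named above could look like this:
SET SORT=SPSS.
Conversely, SET SORT=EXTERNAL would re-enable the attempt to use server-based sorting.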
MXERRS is only valid for a server-based SPSS installation (SPSSB) and not
for SPSS for Windows or SPSS for Macintosh, for example. MXERRS
counts errors up to a certain maximum limit (default: 100). If the limit is
reached, SPSS stops processing the commands, but scans the programs for
further errors. In SPSSB, MXWARNS is related to MXERRS (see below).
MXWARNS limits the number of warnings issued (default: 10); the mode of
operation of MXWARNS differs depending on the installed version. For
example, if the limit is exceeded in SPSS for Windows or SPSS for
Macintosh, no further warnings will be issued. If, for example, no further
warnings are issued in SPSS for Windows, this should not be interpreted as
meaning that no further errors occur. If, for example, the MXWARNS limit is
exceeded in SPSSB or the total number of warnings and errors exceeds the
MXWARNS limit, SPSS will abort the processing of the commands but scan
the programs for further errors.
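Uniform limits could, for example, be distributed as follows; both values are illustrative assumptions:
SET MXERRS=200 MXWARNS=50.
On an SPSSB installation, this SET command would stop command processing only after 200 errors and would limit the number of warnings issued to 50.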
The following options are not presented in detail: Using BLOCK and BOX,
you can e.g. set the characters with which boxes (for frames) or blocks (in bar
charts) are to be drawn in case of output in text format (e.g. Text Viewer).
SCALEMIN is only relevant for SPSS datasets prior to version 8.0.
The options ERRORS, MESSAGES, RESULTS and PRINTBACK control
the text output. The other default settings and sub-commands depend on the
operating system; possible aliases are e.g. ON/YES resp. OFF/NO. ERRORS
(error and warning messages; LISTING by default), MESSAGES (e.g.,
headings and memory requirements; NONE by default) and RESULTS
(result output in text format; LISTING by default) refer to text output only.
PRINTBACK (default: BOTH), on the other hand, refers to commands
passed through syntax or dialog boxes, e.g., as they are logged in the journal
('Log').
MEXPAND, MPRINT, MNEST and MITERATE are settings for macro
programming and are introduced e.g. in Schendera (2005). In order to ensure
uniform functionality of macros, of course, make sure to use uniform default
settings also here. The EXTENSIONS and MXMEMORY commands are no
longer supported.
The SET option LOCALE is new in version 15. LOCALE allows you to
change the default local settings ('locales', including country, system
language and character set) that SPSS or the SPSS Analytic Server uses for
data analysis. The advantage of LOCALE is that the user can process datasets
in other locales without having to change the computer settings. A default
setting could look like this: "German_Germany.1252". The first
specification (given in English) sets the system language, the second the
country and the third the so-called code page for the character set used (here
e.g. Windows Latin-1). The change to other locales is persistent and can only
be undone by another explicit LOCALE command. LOCALE overwrites local
commands such as DECIMAL (see above). For the definition resp.
application of server-based locales it is recommended to consult the
server administrators.
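Based on the default setting quoted above, an explicit assignment could be sketched like this:
SET LOCALE='German_Germany.1252'.
This SET command instructs SPSS to use German as the system language, Germany as the country and the Windows Latin-1 code page 1252, regardless of the settings of the operating system.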
In SPSS version 15 the SET option JOURNAL (also: LOG) is no longer
supported. In earlier SPSS versions JOURNAL supported the quality and
speed of transformations and data analysis. Such log files, so-called logs,
promote process-oriented instead of result-oriented working and contribute
sustainably to the securing and proof of data and analysis quality.
SET JOURNAL ON.
This SET command causes SPSS e.g. to create a so-called log. This log file
logs all (mouse) commands and warning messages (from SPSS) of a session.
The logged commands can be transferred to a separate syntax file or can also
be rewritten to macros (see Schendera, 2005). Warning messages are
indications of errors that still occur. Warning messages that are not issued are
not necessarily a guarantee that there are no errors. Logs are an "automatic
memory" and especially useful for phases of creative trying out different
syntax resp. calculation variants.
SET JOURNAL 'C:\Programs\SPSS\project17_2005.jnl' ON.
This SET command causes SPSS to create a so-called log with the name
"project17_2005" in a specific directory. The advantage of user-defined logs
is that they are not deleted at the beginning of a session but are continued
constantly. However, logs that are continued for a very long time can become
correspondingly extensive and may even slow down the speed of the
computer.
If the SET option JOURNAL is passed to SPSS V15, SPSS will issue the
following error message.
SET JOURNAL ON.
>Warning # 802 in column 5. Text: JOURNAL
>You have attempted to use a SET subcommand which is obsolete.
If uniform default settings (and templates, if applicable) were agreed upon for
a project, a unifying SET section could be submitted in advance of every
SPSS program, for example. Such a partial program could then look like this:
SET width 80.
SET length none.
SET compression=on.
SET header=no.
SET mxwarns=300000.
SET blanks=sysmis.
SET tlook 'C:\My_data\[Link]' .
SET ctemplate 'C:\My_data\SPSS\Looks\ [Link]' .
After that the usual SPSS syntax programs could follow (see e.g. Chapter 15).
As a final note: do not rely on any software, not even SPSS, to keep
syntax-based templates, settings etc. 100% upward-compatible or
hardware-independent. When upgrading to a newer version resp. migrating to
a new operating system, the functionality of all settings and templates should
be carefully checked against the previous versions that worked successfully
before.
5 Duplicate values and multiple data rows

Identify, understand and (if necessary) filter
If the data is complete, the next step is to check whether data occurs more
than once (so-called duplicates, multiple entries/listings). Duplicates are
generally cases that have the same values in at least one key variable (e.g.
IDs) (assuming the entries in the key variable are correct). In the following,
duplicates are therefore (at first) single values or complete data rows, which
occur more than once. In section 5.5, the attribute "duplicate" is further
differentiated into "multiple" and "duplicate".
Multiple occurring single values (e.g., IDs) do not automatically mean
that the complete data row is identical; however, this possibility should be
investigated and explicitly excluded as a source of error. Checking for
duplicates is a necessity especially for large, complex databases up to data
warehouses, and should in principle be performed even for small datasets.
Along with "completeness", "uniformity" and "missings", "duplicates" is one
of the basic criteria of the DQ Pyramid that SPSS can be used to check. The
examination of all other criteria builds on these; the check for duplicates, in
turn, requires completeness and uniformity.
Duplicates cause a so-called overcompleteness: more data is present in the
dataset than expected (cf. 3., 7.1.). The data themselves are not invalid; there
are just too many copies, either of individual rows or even of identical
subsets. This is an example of why the definition in the "Curriculum
Qualitätssicherung / Ärztliches Qualitätsmanagement" of the German
Medical Association [Bundesärztekammer] of data quality as a "property of a
datum with regard to the quality criteria objectivity, validity and reliability"
(2003, 72) is clearly wrong: duplicates (especially intra-group duplicates) are
by definition objective, valid and reliable with regard to the characteristics of
a single case.
5.1 Causes and consequences of duplicates
There are numerous reasons for multiple occurrences of data. Possible causes
for duplicate data rows (not only) in large data stores may be:
- accidentally appending identical cases multiple times in SPSS (e.g. by ADD FILES),
- the different allocation (coding) of the same personal or address data,
- ID variables changing over time for longitudinal data,
- the automatic repeated saving of the data of a single case, e.g. due to incorrect data collection (especially for online surveys),
- or the creation of copies of identical rows (cases) or even complete datasets by faulty programs.
Even with small datasets, it can happen that the same questionnaire is
entered several times, especially with different data entry staff or
different input stations.
The practical consequences of duplicate data are by no means trivial. Want
an example? Not so long ago, a data error inflated Wells Fargo's operational
risk capital by $5.2 billion. The removal of duplicate data led to a
sharp fall in Q1 op risk-weighted assets from $403.6 billion to $338.7 billion,
naturally also raising questions about the soundness of the bank's op risk
management (RiskNet, 2020).

- The mere occurrence of duplicates affects storage capacity and computer speed.
- Depending on the type and function of the duplicate data, duplicates can lead to multiple processes that per se consume resources and cause further damage.
- Duplicate address data leads to multiple mailings to the same address, and thus also to image damage.
- Duplicate patient data leads to incorrect billing of patients or health insurance companies.
- Duplicate product data leads to incorrect calculations of sales, excessive income or expenditure, etc.
- Duplicate trigger data leads to unnecessary multiple execution of further actions (e.g. sending, saving, loops, clustering, etc.).
- Due to their more frequent occurrence, multiple cases have at least twice the probability of being sampled compared to cases that occur only once, and not only if the duplicates are confined to one list, group or cell (within-group duplicates).
- A more complicated situation exists if duplicates are distributed over several groups (between-group duplicates).
- Duplicates may already be distributed to two different groups due to one single different coding, e.g. two different sampling units ("frame units").

A first difficulty is to be able to identify these multiple cases at all. Who
would expect that separate groups would contain the same cases as
duplicates? If these duplicates, like other cases occurring with a
corresponding frequency (weight), are not identified, they can reduce a
possible difference between the two groups precisely because their identical
(duplicate) characteristics are distributed over both groups. A further
difficulty is to be able to decide with certainty which of the two cases is the
correct one and which the wrong one. With unclear criteria and/or large
amounts of data (e.g. > 100,000 cases), this is not always easy (Kostanich &
Haines, 2003; Kostanich, 2003, 2002; Mule, 2002, 2001).
In inferential statistics or knowledge construction based on it (incl. data
mining), the effects of duplicate data can be more subtle.
- Duplicate data (especially of the within-group type) generally endanger the variation of measured values (variance). For example, the more often the value 2 in column X is caused by duplicate data rows, the more the expected mean value of X approaches the value 2. A restricted variance undermines, beyond a certain extent, the application of further multivariate procedures, e.g. correlation and factor analysis. Cluster procedures generally assume an equal weighting of cases or variables. The overweighting of repeatedly occurring person or feature characteristics distorts, e.g. in cluster analyses, the resulting clusters in the direction of the overrepresented case or feature. Numerous multivariate procedures assume data independence, e.g. multiple linear regression. Duplicates of absolutely identical entries of one and the same case favor problems like multicollinearity. Data mining in the sense of revealing data structures in general, as well as derived key figures such as predictive accuracy in particular, are fundamentally endangered by duplicates.
- Duplicate data of the between-group type undermine e.g. tests for differences between groups. Depending on the frequency of different duplicates in the groups to be compared, the location, dispersion and distribution form of the comparison groups can, among other things, become alike.
If it concerns a repeatedly occurring interval-scaled value (in extreme cases: a
constant), then, e.g. in a parametric analysis of variance, the F-value strives
for a minimum value and the significance for a maximum value. If the
multiple occurring value is e.g. categorically scaled (in extreme cases: a
constant), then, e.g., in a table analysis, the Chi²-value strives towards a
minimum value and the significance towards a maximum value.
Duplicate data endangers e.g. differentiated weightings of data. Usually, a
data row is assigned the weight 1 (weight, w = 1); duplicates now lead to a
higher weight (w > 1) being assigned to affected data (groups). Duplicates
thus endanger the representativeness and thus the basic interpretability of the
data within which they occur. The resulting bias can be positive or negative
and can vary depending on the subgroup (not to be confused with over- or
undercoverage; Rothhaas, 2002; Mariotte, 1999; Prewitt 2000). Regardless of
whether a dataset or database content should "actually" be either a sample or
a full survey: Because the dataset content is distorted by duplicates, strictly
speaking the data no longer represent a sample or a full survey. Two
measures for the coverage of samples, coverage error and coverage
efficiency, thus become pseudo-plausible due to duplicate sampling. Since
there is no representative random selection of data, a basic requirement of the
significance test is no longer fulfilled. Since sampling units ("frame units")
are also distorted, reliable further processing and analysis of the data
is no longer possible, because the data can no longer be reliably filtered,
stratified or combined.
Particularly when working with time series and panel data, there is a risk
that different codings exist for persons in different datasets (because these
codings can change over time, for example) and that, when the data is
merged, persons will appear several times in the final dataset instead of only
once. In order to avoid these problems, nationwide full surveys, for example,
strive for an elaborate and careful un- resp. deduplication of their data to
avoid overcoverage or asymmetrical weighting (Statistics Canada, 2003,
2000; Rothhaas, 2002, Fay, 2002). However, despite all care taken, errors
may occur: An evaluation of the Census 2000 of the US Census Bureau, for
example, revealed incorrect counts, especially in the form of duplicates
(Kostanich & Haines, 2003; Kostanich, 2003; Mule, 2002, 2001). One cause
was the different allocation (coding) of address data, which led to one and the
same person being counted several times. With regard to representativeness,
it should therefore be carefully checked before data delivery or analysis
whether the properties of the data also correspond to the assumed properties
of the cases or variables investigated.
Starting with version 12, SPSS also offers the menu "Identify Duplicate
Cases…"; however, the functionality and flexibility of the following syntax
programs go beyond this menu.

5.2 Checking for duplicates: To be or not to be?
When checking for duplicates, it must be clarified whether the dataset should
contain only one case per row (situation 1) or whether, for example, there is
a repeated measurement and the same cases may well occur in several rows
(situation 2). The approaches described below are generally suitable for
checking for within-group as well as between-group duplicates.
At this point, reference should also be made to the module "Data Preparation"
or the VALIDATEDATA procedure presented in chapter 9. This SPSS
application for data screening allows, besides many other data checks, also
the check for duplicate IDs or case identifiers.

5.2.1 Situation 1: The dataset contains one case per data row
If the dataset contains only one case per data row, the check of the counter or
ID variables can be done with a simple bar chart for frequencies. In a simple
univariate bar chart, multiple IDs are most easily identified in the form of
individual peaks. Normally all IDs should have the frequency "1". Every
"peak" beyond 1 is an indication that an ID occurs multiple times. If values of
the counter or ID variables occur more than once, it does not necessarily
follow right away that complete data rows are present more than once in the
dataset. It is also possible that only the ID occurs multiple times (e.g. due to
a typing error) while the rest of the row is correct. In order to check the
equality of complete data rows, one would have to resort to the approaches
listed below.
data list
/ID 1-3 GRUPPE 5(A)
ALTER 7-8 .
begin data
001 a 8
002 a 17
003 b 23
004 b 75
002 a 17
003 b 23
005 a 65
end data.
exe.
FREQUENCIES
VARIABLES=ID
/BARCHART FREQ
/ORDER= ANALYSIS
.
The dataset contains two IDs (cf. 002, 003) multiple times. For these data
rows you would have to check in a further step whether only the IDs
(accidentally) occur multiple times or whether the rest of the data rows is
also identical; in other words: whether not only the ID but the complete row
is the same. A bar chart is unsurpassed for quickly checking the correct
single assignment of IDs. For reading off the specific duplicate ID, a bar
chart reaches its limit with very many values, because the x-axis becomes
increasingly difficult to read. However, bar charts are very helpful, at least
for narrowing down the values. The specific ID occurring multiple times can
then be taken from an additionally requested frequency table, for example.
If the dataset does not contain any counter or ID variables, or if their
correctness is questionable, multiple data rows can be identified via unique
values. The most uncomplicated procedure for short datasets is to view a
selected string variable case by case. If two persons have made absolutely
identical entries (including typos, if applicable), it is probably the same data
row of the same person. The simplest procedure for large datasets is the
output of presumably unique numerical values (e.g. a laboratory parameter)
in the form of a bar chart for frequencies. If values of the supposedly unique
laboratory parameter occur more than once, one or more data rows are
probably present more than once in the dataset. For large datasets it may be
useful to combine several test criteria.
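Such a check can be prepared, for example, with an occurrence counter per value; the variable names NAME and N_OCC in this sketch are freely chosen, and MODE=ADDVARIABLES assumes SPSS version 14 or later:
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
 /BREAK=NAME
 /N_OCC=N.
FREQUENCIES VARIABLES=N_OCC.
Every case with N_OCC greater than 1 carries a value of the checked variable that occurs more than once and should be inspected further.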

5.2.2 Situation 2: The dataset contains several identical cases
If the dataset legitimately contains several cases in the data rows, the counter
or ID variables can be checked using a frequency diagram grouped
according to the coding for the repeated occurrence of the case in question,
e.g. measurement time or group membership. If the counter or ID variables
occur more frequently than measurement times are coded, one or more data
rows are probably present multiple times in the dataset. A similar result can
be achieved by sorting the data rows according to the counter resp. ID
variable and the repeated measurement coding at the same time, or by
outputting them as a cross table.
data list
 /ID 1-3 GROUP 5(A) ALTER 7-8 .
begin data
001 a 8
002 a 17
003 a 23
004 a 74
001 b 75
002 b 17
003 b 23
003 b 65
004 b 46
end data.
exe.

CROSSTABS
 /TABLES=GROUP BY ID
 /FORMAT = AVALUE TABLES
 /CELLS= COUNT
 /COUNT ROUND CELL
 /BARCHART .

The interpretation is analogous to the univariate bar chart. Multiple IDs are
most easily recognized by single peaks. For example, ID 3 occurs several
times in group "b". Here too, in a further step it would be necessary to check
whether only the ID (accidentally) occurs multiple times or whether the rest
of the data row is also the same.

5.3 Removing duplicate data rows only via ID variable
If IDs have been identified that occur more than once, the next step is often to
delete the relevant data rows, a process called deduplication.
The following program uses an ID variable to remove all data rows that have
the same ID variable more than once. This approach assumes that the same
IDs also mean completely identical data rows. However, this is not self-
evident (see ID 4). Exclusively unequal IDs also do not exclude the
possibility that IDs were accidentally entered incorrectly, e.g. differently. If
the data should be checked for duplicates in a more sophisticated way, the
use of programs from 5.4. onwards is recommended.
When excluding multiple data rows, you have the option to keep one of the
duplicate data rows (Approach I) or to remove all duplicate data rows
completely (Approach II). The final LAG approach identifies duplicate
values in a variable even independently of an ID variable; here you have the
option of keeping resp. deleting the single or multiple values. Approach III
can also be applied to ID variables.
data list free
/ ID VAR1 VAR2 VAR3 VAR4 .
begin data
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
4 6 7 8 9
5 6 7 8 9
end data.
exe.

Approach I: Keeping one of the multiple rows
Using approach I, one of the duplicate data rows is kept. The two SPSS
commands CASESTOVARS and MATCH FILES differ in filtering in that
CASESTOVARS only keeps the first of the multiple rows, while MATCH
FILES can keep either the first or the last. MATCH FILES approaches are
also suitable for very large amounts of data (rows and columns).
a) CASESTOVARS approach
sort cases by ID.
casestovars
/id = ID
/groupby = variable.
save outfile="C:\DE_DUP.sav"
/rename =(VAR1.1 VAR2.1 VAR3.1 VAR4.1 = VAR1 VAR2 VAR3 VAR4)
/drop VAR1.2 VAR2.2 VAR3.2 VAR4.2.
get file="C:\DE_DUP.sav".
list.
With the CASESTOVARS approach, only an ID variable needs to be
specified under /ID=. Everything else runs fully automatically. If the data
does not contain duplicate IDs, the dataset is not restructured and can be
used directly. If the data to be checked contains duplicate IDs (see
above, ID=4), CASESTOVARS restructures the dataset: a copy of each
variable is created, and the copies are compared with each other.
From the finally remaining dataset, these copies must be removed using
SAVE OUTFILE= and /DROP=, and the original names must be restored
via /RENAME=.
Result
ID VAR1 VAR2 VAR3 VAR4
1,0 2,0 3,0 4,0 5,0
2,0 3,0 4,0 5,0 6,0
3,0 4,0 5,0 6,0 7,0
4,0 5,0 6,0 7,0 8,0
5,0 6,0 7,0 8,0 9,0
Number of cases read: 5 Number of cases listed: 5
The number of copies depends directly on the number of duplicate IDs. If
an ID occurs twice (see above), only two copies of each variable are
compared; if an ID occurs three times, three comparison copies are created,
and so on. With few variables, the verification process via CASESTOVARS
can be efficient. With many IDs and many variables, however, removing and
renaming the comparison copies becomes extremely time-consuming.
b1) MATCH FILES approach (keeping the last row)
sort cases by ID.
match files file=*
 /by=ID
 /last=LAST_CASE.
exe.
compute FILTER=
 ( (LAST_CASE=1 and lag(LAST_CASE)=1)
 or ($CASENUM=1 and LAST_CASE=1) ).
filter by LAST_CASE.
list.

b2) MATCH FILES approach (keeping the first row)
sort cases by ID.
match files file=*
 /by=ID
 /first=FRST_CASE.
list.
select if (not(FRST_CASE=0)).
exe.
list.
b1) MATCH FILES approach: Keeping the last duplicate row
The MATCH FILES command uses /LAST= to create a variable
LAST_CASE, which assigns a 1 to the last row of the cases sorted by ID
and a 0 to all other rows. Rows with the same ID values thus have both zeros
and ones in the LAST_CASE variable. A FILTER variable is created via
COMPUTE, which assigns a 1 to data rows that occur only once. "filter by
LAST_CASE" keeps all non-duplicate data rows as well as the last data row of
each group of duplicates. If "filter by LAST_CASE" is changed to "filter by
FILTER" in the MATCH FILES approach, its mode of operation corresponds
to approach II.
Result (b1)
ID VAR1 VAR2 VAR3 VAR4 LAST_CASE FILTER
1,0 2,0 3,0 4,0 5,0 1 1,0
2,0 3,0 4,0 5,0 6,0 1 1,0
3,0 4,0 5,0 6,0 7,0 1 1,0
4,0 6,0 7,0 8,0 9,0 1 ,0
5,0 6,0 7,0 8,0 9,0 1 1,0
Number of cases read: 5 Number of cases listed: 5
b2) MATCH FILES approach: Keeping the first duplicate row
The MATCH FILES command uses /FIRST= to create a variable FRST_CASE,
which assigns a 1 to the first row of the cases sorted by ID and
a 0 to all other rows. Rows with the same ID values thus have both zeros and
ones in the FRST_CASE variable. SELECT IF (NOT(FRST_CASE=0)) keeps
all non-duplicate data rows as well as the first data row of each group of
duplicates. The result equals the output of the CASESTOVARS approach above.
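As a cross-check outside SPSS, the same keep-first and keep-last deduplication can be sketched in Python with pandas. This is only an assumption-based analogue, not part of the book's SPSS workflow; the frame below mirrors the toy data, and `drop_duplicates` plays the role of the FRST_CASE/LAST_CASE filters:

```python
import pandas as pd

# toy data from the SPSS example, including the duplicate ID 4
df = pd.DataFrame({
    "ID":   [1, 2, 3, 4, 4, 5],
    "VAR1": [2, 3, 4, 5, 6, 6],
    "VAR2": [3, 4, 5, 6, 7, 7],
    "VAR3": [4, 5, 6, 7, 8, 8],
    "VAR4": [5, 6, 7, 8, 9, 9],
}).sort_values("ID", kind="mergesort")  # stable sort, like SORT CASES BY ID

# b2 analogue: keep the first row of each group of duplicate IDs
first_kept = df.drop_duplicates(subset="ID", keep="first")

# b1 analogue: keep the last row of each group of duplicate IDs
last_kept = df.drop_duplicates(subset="ID", keep="last")
```

Both results contain five rows; they differ only in which of the two ID-4 rows survives.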

Approach II: Excluding all multiple rows


Approach II deletes all duplicate data rows. The concluding SELECT IF
removes all data rows with multiple ID values.
sort cases by ID (A).
compute MY_COUNTER=1.
exe.
if ID=lag(ID) MY_COUNTER=MY_COUNTER+1.
create ID2=LEAD (ID,1).
if ID=ID2 MY_COUNTER=MY_COUNTER+1.
select if (MY_COUNTER = 1).
exe.
list.
Result (Approach II)
ID VAR1 VAR2 VAR3 VAR4 MY_COUNTER ID2
1,0 2,0 3,0 4,0 5,0 1,0 2,0
2,0 3,0 4,0 5,0 6,0 1,0 3,0
3,0 4,0 5,0 6,0 7,0 1,0 4,0
5,0 6,0 7,0 8,0 9,0 1,0 .
Number of cases read: 4 Number of cases listed: 4
Since these approaches assume that identical IDs also mean completely
identical data rows, the rows with ID 4 were removed from the dataset even
though their values in the variables VAR1 to VAR4 differed.
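Approach II also has a direct pandas analogue, again only as a hedged cross-check sketch (pandas is an assumption, not part of the SPSS program): `keep=False` drops every member of a duplicate group, just like the MY_COUNTER filter.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 2, 3, 4, 4, 5],
    "VAR1": [2, 3, 4, 5, 6, 6],
    "VAR2": [3, 4, 5, 6, 7, 7],
    "VAR3": [4, 5, 6, 7, 8, 8],
    "VAR4": [5, 6, 7, 8, 9, 9],
})

# remove ALL rows whose ID occurs more than once (both ID-4 rows are dropped)
no_multiples = df.drop_duplicates(subset="ID", keep=False)
```

The result contains the four rows with IDs 1, 2, 3 and 5, matching the SPSS output.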
If the kind and extent of agreement of the data rows in further variables
(with or without ID) are to be considered, the approaches in section 5.4
are suitable. Before that, approach III is briefly presented: it identifies
duplicate values in a single variable (e.g. VAR1) independently of an ID
variable.

Approach III: Identifying duplicate values in a single variable


Approach III makes it possible to identify duplicate values in a single variable
(e.g. VAR1) independently of an ID variable. This approach allows keeping
or deleting either the single or the multiple values. Of course, approach III
can also be applied to ID variables themselves.
sort cases by VAR1 (A) .
exe.
compute DUPLICATE = 0 .
exe.
if (lag(VAR1)=VAR1) DUPLICATE = 1 .
exe.
select if (DUPLICATE=0).
* alternatively: select if (DUPLICATE=1). *.
exe.
Result
ID VAR1 VAR2 VAR3 VAR4 DUPLICATE
1 2 3 4 5 0
2 3 4 5 6 0
3 4 5 6 7 0
4 5 6 7 8 0
4 6 7 8 9 0
Number of cases read: 5 Number of cases listed: 5
Although approach III appears uncomplicated, it has the same disadvantage
as the other approaches: it still implicitly assumes that equal values in one
variable (an ID or another variable, e.g. VAR1) mean completely equal data
rows. If the kind and extent of matching of the data rows in further variables
(with or without ID) shall be considered, the approaches of the next section
are recommended.
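For comparison, approach III corresponds to a plain `duplicated` flag on a single column in pandas. The sketch below is again only an assumed cross-check, not the book's method:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 2, 3, 4, 4, 5],
    "VAR1": [2, 3, 4, 5, 6, 6],
})

# flag every later occurrence of a VAR1 value, like the LAG comparison after sorting
df = df.sort_values("VAR1", kind="mergesort")
df["DUPLICATE"] = df["VAR1"].duplicated(keep="first").astype(int)

singles = df[df["DUPLICATE"] == 0]      # select if (DUPLICATE=0)
copies  = df[df["DUPLICATE"] == 1]      # alternatively: select if (DUPLICATE=1)
```

As in the SPSS result, five rows remain after filtering; only the second row with VAR1=6 (ID 5) is flagged as a copy.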

5.4 Removing duplicate data rows over several variables (also excl. ID)
The following approach assumes that equal IDs do not always necessarily
mean completely identical data rows. When you want to delete data rows,
and you want to consider more information than just the entries in the ID
variable alone, then the following approach is recommended.
Basically, this approach is an addition to the MATCH FILES approach
already introduced under 5.3. with the important difference that you can
specify other variables for the comparison besides the ID variable, e.g.
VAR1 to VAR3.
match files
 /file = *
 /first = FRST_CASE
 /by ID VAR1 VAR2 VAR3.
select if (FRST_CASE).
exe.
list.
Result
ID VAR1 VAR2 VAR3 VAR4 FRST_CASE
1,0 2,0 3,0 4,0 5,0 1
2,0 3,0 4,0 5,0 6,0 1
3,0 4,0 5,0 6,0 7,0 1
4,0 5,0 6,0 7,0 8,0 1
4,0 6,0 7,0 8,0 9,0 1
5,0 6,0 7,0 8,0 9,0 1
Number of cases read: 6 Number of cases listed: 6
The consideration of further information for the filtering was successful;
this approach kept 6 rows as "unique" instead of 5 (see 5.3).
The approaches presented under 5.3 and 5.4 can also be used for matching
tasks. Matching generally eliminates all unpaired data rows (i.e.
non-duplicate matching IDs). For such an application, the SELECT IF
condition would have to be adjusted accordingly, e.g. to keep only row pairs
(for case-control matching).
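The widened comparison key of this section can likewise be cross-checked with pandas by passing several columns to `subset` (a sketch only; pandas is not part of the book's SPSS workflow):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   [1, 2, 3, 4, 4, 5],
    "VAR1": [2, 3, 4, 5, 6, 6],
    "VAR2": [3, 4, 5, 6, 7, 7],
    "VAR3": [4, 5, 6, 7, 8, 8],
    "VAR4": [5, 6, 7, 8, 9, 9],
})

# comparing on ID alone treats the two ID-4 rows as duplicates ...
by_id = df.drop_duplicates(subset=["ID"], keep="first")

# ... while the wider key ID+VAR1..VAR3 keeps both, as in MATCH FILES /BY ID VAR1 VAR2 VAR3
by_key = df.drop_duplicates(subset=["ID", "VAR1", "VAR2", "VAR3"], keep="first")
```

`by_id` has 5 rows, `by_key` all 6, mirroring the SPSS result above.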

The concluding example demonstrates, by means of another MATCH FILES
variant, how to keep, not remove, data rows with matching ID values.
sort cases by ID.
match files
 /file=*
 /last=LAST_CASE
 /by ID.
exe.
if ID=lag(ID) LAST_CASE=lag(LAST_CASE).
exe.
select if (LAST_CASE=0).
exe.
list.
Result
ID VAR1 VAR2 VAR3 VAR4 LAST_CASE
4,0 5,0 6,0 7,0 8,0 0
4,0 6,0 7,0 8,0 9,0 0
Number of cases read: 2 Number of cases listed: 2
The MATCH FILES command uses /LAST= to create a variable
LAST_CASE, which assigns a 1 to the last row of the cases sorted by ID
and a 0 to all other rows. Rows with the same ID values thus have both zeros
and ones in the LAST_CASE variable. For the IDs with duplicates, the LAG
function is used to standardize all entries to zero; these rows are then kept
using SELECT IF (LAST_CASE=0), which removes all other cases.
The approaches introduced in the previous sections filter immediately,
without providing "direct" key figures on whether, how many, and what kind
of duplicate data rows occur. If (e.g. in the context of a plausibility check)
information about the type and number of multiple (i.e. more than once
occurring) and unique data rows is needed first, the following approach is
recommended.

5.5 Information about type and number of duplicates (identification)
The following program determines information about the type and number of
data rows. First, however, the somewhat fuzzy term "duplicate" needs to be
differentiated. Each data row has two attributes with regard to its
occurrence: frequency and diversity. The attribute "frequency" indicates
whether a certain data row occurs once or multiple times. The attribute
"diversity" indicates whether a data row is unique or also occurs in copies.
Frequency and diversity do not necessarily correspond (see below). The
result is influenced, among other things, by the data characteristics, but also
by the list of variables. This variable list decides in which variables and
values the data rows are compared with each other and considered equal or
different; it thus defines what counts as agreement between data rows. The
variable ID can, but does not have to, be included in this list. If the ID is not
included, the correctness (uniqueness) of the ID variable itself can be
checked. In the case of "diversity", the assignment of the attribute "unique" is
also co-determined by the scanning or sorting direction of the dataset (see
example below).
If the variable list covers only a subset of the checked data storage (variable
set), then it must be verified before eliminating rows whether the test result is
reliable at all. Especially for large data storages with little variation in
measured values, it is recommended to include sufficiently differentiating
variables in the test list. In large amounts of data, genuinely coincidental
matches may occur; a list of insufficiently differentiating variables could
erroneously flag such rows as duplicates. Apparently duplicate data rows
should therefore not be eliminated prematurely.
The program interprets missings as a uniform code. When selecting the
variables for the checklist, care should therefore be taken that they do not
contain any missings.
data list
/ID 1-3 GROUP 5(A) AGE 7-8 VAR_NUM1 to VAR_NUM3 10-12
VAR_CHAR1 to VAR_CHAR3 14-16 (A).
begin data
001 a 8 101 bca
002 a 17 010 abc
003 b 23 110 abc
004 b 75 010 bac
002 a 17 010 abc
003 b 23 110 abc
005 a 65 100 cba
002 a 17 010 abc
003 b 23 110 abc
005 a 65 100 cba
end data.
exe.
sort cases by GROUP (A) AGE (A)
VAR_NUM1 (A) VAR_NUM2 (A) VAR_NUM3 (A)
VAR_CHAR1 (A) VAR_CHAR2 (A) VAR_CHAR3 (A) .
match files
/file = *
/by GROUP AGE
VAR_NUM1 VAR_NUM2 VAR_NUM3
VAR_CHAR1 VAR_CHAR2 VAR_CHAR3
/first = ROW_FRST
/last = ROW_LAST.
do if (ROW_FRST).
compute COUNTER = 1 - ROW_LAST.
else.
compute COUNTER = COUNTER + 1.
end if.
leave COUNTER.
compute MULTIPLES = COUNTER > 0.
sort cases MULTIPLES(D).
match files
/file = *
/drop = ROW_FRST .
variable labels
MULTIPLES 'Frequency of Rows: Once vs. multiple'
ROW_LAST 'Diversity of Rows: Unique vs. duplicate'
COUNTER 'Summary of Frequency of Duplicate Rows' .
value labels
/MULTIPLES
0 'Single row'
1 'Multiple row'
/ROW_LAST
1 'Unique row'
0 'Duplicate row' .
frequencies
variables = MULTIPLES ROW_LAST COUNTER .
list.

The program consists of four sections. The DATA LIST section reads in a
manageable dataset. The rows with the IDs "002" and "003", for example, are
duplicates. The program will identify these and other data rows as duplicates.
This section will not be explained further. The section starting with SORT
CASES prepares the identification process and will be explained in detail.
The two sections starting with VARIABLE LABELS and FREQUENCIES
conclude the program with a formatting of the created variables and an
analysis of possible duplicates in the dataset.
The SORT CASES line specifies the variable list as the criterion according
to which data rows are checked for duplicates. SORT CASES sorts the listed
variables uniformly (in this case in ascending order). The consequence of this
uniform sorting is that the dataset is sorted according to duplicate data rows
(if duplicate data rows exist). In the following, a "data row" is thus a
sequence of variables and their values that are sorted in SORT CASES with
the same completeness, sequence and sorting direction, and are also specified
accordingly under MATCH FILES / BY. The variables specified there must
therefore correspond exactly to the variables specified under SORT CASES.
In the MATCH FILES section, the imported dataset is merged with itself.
This trick makes use of the ability of MATCH FILES using FIRST= or
LAST= to identify and mark the first or last data row in the data rows sorted
by BY. All data rows that are not marked are duplicate data rows. The
markings are stored in the variables ROW_FRST and ROW_LAST (these
names are intended to make the operation of SPSS transparent). The
ROW_FRST variable contains the markings from the beginning of the data
row grouped by BY; the ROW_LAST variable contains the markings from
the end of the data row grouped by BY. The variables specified under BY
must match the variables specified under SORT CASES exactly.
ROW_LAST therefore assigns the attribute "Duplicate row" until the first list
of identical data rows is completed with the attribute "Unique row". The list
of the next identical rows also begins with the attribute "Duplicate row" and
is concluded with "Unique row", and so on.
At this point, the MATCH FILES section has only identified the
positions of the start and end of duplicate data rows; it does not yet specify
how many duplicate data rows occur in total in the dataset, nor how often
each of these data rows has absolutely identical values in the variables
specified under BY. The section starting at DO IF (ROW_FRST) creates a
variable COUNTER which counts through the previously established lists of
equal data rows from the first "Duplicate row" to the final "Unique row". If a
multiple data row occurs for the first time, it is assigned a 1. If another
matching row appears, it gets the value 2. A third identical data row gets the
value 3, and so on. Data rows that occur only once receive the value 0 (not 1,
which is reserved for multiple rows). COUNTER thus uses a slightly
different counting method than MULTIPLES and ROW_LAST; this can also
be recognized by the fact that the row with the attribute "Unique row" within
a group of multiple rows does not receive the value 1 but the respective
empirical maximum.
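The interplay of the frequency flag, the per-group counter and the unique flag can be reproduced with grouped counters in pandas. The sketch below is an assumed analogue, not the book's method; for brevity the comparison key is shortened to GROUP and AGE, which happens to separate the same groups as the full variable list in this toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    ["001", "002", "003", "004", "002", "003", "005", "002", "003", "005"],
    "GROUP": ["a", "a", "b", "b", "a", "b", "a", "a", "b", "a"],
    "AGE":   [8, 17, 23, 75, 17, 23, 65, 17, 23, 65],
})

key = ["GROUP", "AGE"]
df = df.sort_values(key, kind="mergesort").reset_index(drop=True)
grp = df.groupby(key, sort=False)

size = grp["AGE"].transform("size")                 # rows per identical group
df["MULTIPLES"] = (size > 1).astype(int)            # frequency: single vs. multiple
df["COUNTER"] = grp.cumcount() + 1                  # counts 1, 2, 3 within each group
df.loc[size == 1, "COUNTER"] = 0                    # single rows get 0, as in the SPSS program
df["UNIQUE"] = (grp.cumcount(ascending=False) == 0).astype(int)  # last row of a group = unique
```

The frequency tables then match the SPSS output: 8 multiple vs. 2 single rows, 5 unique rows, and a COUNTER maximum of 3.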
The finally created variable MULTIPLES indicates whether the respective
data row occurs once or multiple times in the dataset. If several data rows
match in the list of test variables, all of these rows are given the attribute
"Multiple row". If a data row occurs only once, it receives the attribute
"Single row".
After formatting, FREQUENCIES is used to request information about the
type and number of tested data rows using the frequencies for ROW_LAST,
COUNTER and MULTIPLES. The variables can be interpreted in the order
in which they are created or in the order MULTIPLES, COUNTER and
ROW_LAST. A LIST is recommended for a more detailed check of smaller
datasets.
The program works in such a way that the output (SPSS dataset) first displays
the (possibly existing) multiple rows (in which the last row is always marked
as the unique row, see also the note below) and then the possibly existing
single data rows. Within the attribute level "Multiple row" the results are
sorted by increasing values of the variables, as are the subsequent results for
the attribute level "Unique row".
Terminology:
A data row has two different attributes
with regard to its occurrence: Frequency
and diversity. Frequency indicates whether
a particular data row occurs once or more
frequently. Diversity indicates whether a
data row is unique or also occurs in copies.
Single means that a row occurs only once,
i.e. without copies, in the dataset. Single
rows are always also unique rows, see e.g.
the bottom row ("4 b 75...").
Multiple means that a row occurs more
than once, i.e. with copies, in the dataset
(see e.g. "3 b 23..."). Multiple occurring
rows comprise one unique row and one or
more copies of it.
Unique means that a certain sequence of
values (in the sense of the row) is unique;
all further duplicates are copies of this
unique row. Unique rows correspond to
single rows only if the respective row
occurs exactly once; otherwise, the
attribute "unique" usually occurs more
often than the attribute "single".
Duplicate means that there are further
copies of a unique row in the dataset.
Duplicate rows are always also multiple
rows, but not necessarily the other way
round (see above).

In contrast to the attribute "Frequency", the assignment of the attribute
"Diversity" is not directly an empirical phenomenon; in the case of multiple
data rows it also depends on the scanning or sorting direction of the dataset.
For example, if the dataset is scanned or sorted from bottom to top, the
lowest of the "3 b 23..." rows is the unique row. However, if the dataset was
scanned from top to bottom, the topmost of the "3 b 23..." rows would receive
the attribute "unique".

From the point of view of output, the program proceeds as follows.
First, the basic question is checked whether duplicate data rows occur in the
dataset at all. The variable MULTIPLES ("Frequency" attribute), which was
created last, indicates how many single and multiple data rows occur in the
dataset in total. The program then provides differentiated information about
the type and number of single and multiple data rows (cf. the variables
ROW_LAST, attribute "Diversity", and COUNTER, attribute "Summary of
duplicate cases"). The use of this program section only makes sense if the
dataset in question contains multiple data rows.
Frequency of Rows: Once vs. multiple
Cumulative
Frequency Percent Valid Percent Percent
Valid Single row 2 20,0 20,0 20,0
Multiple row 8 80,0 80,0 100,0
Total 10 100,0 100,0

The MULTIPLES variable indicates that two data rows occur only once each,
while the other eight data rows occur more than once. MULTIPLES,
however, does not allow any indication of how many different data rows are
hidden behind the N=8. With N=8, for example, this could be one row that
occurs eight times, two different rows that occur four times each, or (as in
this case) three rows that occur two to three times.

Since data rows occur several times in the dataset, it makes sense to continue
with the next program section. The two variables ROW_LAST and
COUNTER provide detailed information about the type and number of
multiple and duplicate data rows. The variable ROW_LAST ("Diversity of
Rows: Unique vs. duplicate") refers to the frequency of unique data rows.
ROW_LAST indicates, for example, whether a data row is unique in the
sense of a unique sequence of variable values in the dataset or whether it
occurs as a duplicate of an already existing data row.
Diversity of Rows: Unique vs. duplicate
Cumulative
Frequency Percent Valid Percent Percent
Valid Duplicate row 5 50,0 50,0 50,0
Unique row 5 50,0 50,0 100,0
Total 10 100,0 100,0

According to ROW_LAST there are five different rows in the dataset; the
five further data rows are duplicates. ROW_LAST thus shows that a total of
five different sequences of variable values occur. The eight data rows that
occur multiple times (cf. MULTIPLES) comprise three unique data rows (all
others are duplicates); together with the two data rows that occur only once,
this results in five unique (different) data rows. ROW_LAST does not
reveal, however, how often the unique rows occur in detail. This information
is provided by the variable COUNTER.

The variable COUNTER ("Summary of Frequency of Duplicate Rows")
counts through the established lists of identical data rows and, in contrast to
MULTIPLES and ROW_LAST, indicates how often identical rows occur in
the checked data storage. The maximum as well as the minimum (zero) of the
table below are the easiest to interpret. The maximum indicates how often
the maximum number of identical rows occurs. According to the table, there
are two groups of three rows in this case; the list below shows that these are
the rows "2a..." and "3b...".
Summary of Frequency of Duplicate Rows
Cumulative
Frequency Percent Valid Percent Percent
Valid ,00 2 20,0 20,0 20,0
1,00 3 30,0 30,0 50,0
2,00 3 30,0 30,0 80,0
3,00 2 20,0 20,0 100,0
Total 10 100,0 100,0

The minimum (zero) indicates the number of unique rows. In this case, two
unique rows occur ("1a...", "4b..."). A zero means that the data row in
question ("1a...", "4b...") has no other copy in the dataset (see below).
The other values between the minimum and maximum of COUNTER are to
be interpreted somewhat differently, as they represent counting steps of
variables and not frequencies. All values greater than zero represent counters
of duplicate data rows.
In the example, the dataset contains two rows without duplicates (cf. the
COUNTER minimum 0).

The table shows, for example, that the COUNTER value 1 does not indicate
that only one row occurs, but rather one (not necessarily the first; the
directional interpretation depends on the sorting) of several identical rows.
ID GROUP AGE VAR_NUM1 VAR_NUM2 VAR_NUM3 VAR_CHAR1 VAR_CHAR2
VAR_CHAR3 ROW_LAST COUNTER MULTIPLES

2 a 17 0 1 0 a b c 0 1,00 1,00
2 a 17 0 1 0 a b c 0 2,00 1,00
2 a 17 0 1 0 a b c 1 3,00 1,00
5 a 65 1 0 0 c b a 0 1,00 1,00
5 a 65 1 0 0 c b a 1 2,00 1,00
3 b 23 1 1 0 a b c 0 1,00 1,00
3 b 23 1 1 0 a b c 0 2,00 1,00
3 b 23 1 1 0 a b c 1 3,00 1,00
1 a 8 1 0 1 b c a 1 ,00 ,00
4 b 75 0 1 0 b a c 1 ,00 ,00

Number of cases read: 10 Number of cases listed: 10

The COUNTER value 3 analogously conveys that this row is one of three
identical data rows (not necessarily the last; the interpretation depends on the
sorting direction), see e.g. the data rows of the IDs "2 a..." and "3 b...".
Since the frequency of the COUNTER value 3 is N=2, two groups of three
equal rows occur (each consisting of one unique sequence of values plus two
copies of it). Because each of these two groups of three also passes through
the lower counter steps 2 and 1, only one 1 and one 2 of the values between
minimum and maximum remain unexplained. And because this remaining 2
in turn encloses the remaining 1, COUNTER indicates that one group of two
remains in the dataset (see "5a...").
Summary
The sample data (N=10) contains five unique rows; the five other rows are
copies (duplicates, see ROW_LAST). The rows "1 a...", "4 b...", "3 b...",
"5 a..." and "2 a..." are unique. Only the data rows "3 b...", "5 a..." and
"2 a..." have duplicates. The two rows "1 a..." and "4 b..." represent single
data rows, which occur only once each, i.e. without duplicates (see
MULTIPLES). The remaining rows occur multiple times, but not equally
often: the rows "2 a..." and "3 b..." occur three times each, and the row
"5 a..." occurs twice (see COUNTER, "Summary of Frequency of Duplicate
Rows").
This program can also be used for matching tasks (see 5.3.). For further
approaches to "counting through" datasets, please refer to more detailed
examples in Section 10.1.

5.6 Displaying filtered and duplicate data rows
The following program makes it possible to detect filtered duplicate data
rows and output them in simple lists. For datasets with a massive amount of
duplicates (especially with extensive variable lists), it is recommended, for
safety reasons, to request only the frequency table.
data list
/ID 1-3 GROUP 5(A) AGE 7-8 VAR_NUM1 to VAR_NUM3 10-12
VAR_CHAR1 to VAR_CHAR3 14-16 (A).
begin data
001 a 8 101 bca
002 a 17 010 abc
003 b 23 110 abc
004 b 75 010 bac
002 a 17 010 abc
003 b 23 110 abc
005 a 65 100 cba
end data.
exe.
sort cases by GROUP (A) AGE (A)
VAR_NUM1 (A) VAR_NUM2 (A) VAR_NUM3 (A)
VAR_CHAR1 (A) VAR_CHAR2 (A) VAR_CHAR3 (A) .
match files
/file = *
/by GROUP AGE
VAR_NUM1 VAR_NUM2 VAR_NUM3
VAR_CHAR1 VAR_CHAR2 VAR_CHAR3
/last = NODUP .
exe.
variable labels
NODUP
'Diversity of Rows: Unique vs. duplicate'.
value labels
NODUP
0 'Duplicate row'
1 'Unique row'.
frequencies variables=NODUP.
exe.
temp.
select if (NODUP=1).
title "Filtered data rows".
list.
temp.
select if (NODUP=0).
title "Duplicate (excluded) data rows".
list.
The program consists of four sections. The section DATA LIST reads in a
manageable dataset. The rows "002" and "003" are duplicates. The program
will identify these data rows as duplicates. This section will not be explained
further. The section starting with SORT CASES prepares the identification
process. This section will be explained in detail. After the formatting, a
frequency table is requested via FREQUENCIES, and the unique and
duplicate data rows are output in list form via LIST.
The SORT CASES row specifies the variable list as the criterion according
to which data rows are checked for duplicates. SORT CASES sorts the listed
variables uniformly (in this case in ascending order). The consequence of this
uniform sorting is that the dataset is sorted according to duplicate data rows
(if duplicate data rows exist).
In the MATCH FILES section the imported dataset is merged with itself.
LAST= marks the last of the data rows sorted by BY with a 1; all preceding
rows within a group receive a 0 and are therefore duplicate data rows. The
variables specified under BY must exactly match the variables specified
under SORT CASES. (FIRST= would proceed the other way around; in that
case, the first of each group of sorted data rows would be marked with 1 and
all following rows with 0.) The markings are stored in the NODUP variable.
NODUP is formatted using VARIABLE LABELS and VALUE LABELS. A
frequency table is requested by FREQUENCIES, and a separate output of
unique and duplicate data rows in list form is produced using a temporarily
(TEMP.) filtered (SELECT IF...) LIST.
Diversity of Rows: Unique vs. duplicate
Cumulative
Frequency Percent Valid Percent Percent
Valid Duplicate row 2 28,6 28,6 28,6
Unique row 5 71,4 71,4 100,0
Total 7 100,0 100,0
According to NODUP, the dataset contains five unique rows; two additional
rows of data are duplicates. NODUP thus informs that a total of five different
data rows occur in the same sequence of variable values. All other data rows
are duplicates of these five different sequences of variable values.
Filtered data rows
ID GROUP AGE VAR_NUM1 VAR_NUM2 VAR_NUM3 VAR_CHAR1 VAR_CHAR2
VAR_CHAR3 NODUP
1 a 8 1 0 1 b c a 1
2 a 17 0 1 0 a b c 1
5 a 65 1 0 0 c b a 1
3 b 23 1 1 0 a b c 1
4 b 75 0 1 0 b a c 1

Number of cases read: 5 Number of cases listed: 5

The list after “Filtered data rows” shows that there are 5 unique data rows.

Duplicate (excluded) data rows
ID GROUP AGE VAR_NUM1 VAR_NUM2 VAR_NUM3 VAR_CHAR1 VAR_CHAR2
VAR_CHAR3 NODUP
2 a 17 0 1 0 a b c 0
3 b 23 1 1 0 a b c 0

Number of cases read: 2 Number of cases listed: 2

The list after "Duplicate (excluded) data rows" shows the two duplicate data
rows that were excluded.
Note: Please remember that "duplicate" is defined via the variables listed in
the BY statements in SORT CASES and MATCH FILES. In the above
example these are GROUP, AGE, VAR_NUM1 to VAR_NUM3 and
VAR_CHAR1 to VAR_CHAR3. From this follows:

- The definition of what is duplicate and what is unique depends on the list
of variables you choose. A different selection of variables may lead to
different selections and different numbers of duplicate or unique rows.
- You can include one (or more) IDs in this definition (the above examples
do not). If you do, make sure they are validated beforehand.
- The identification process determining which row is duplicate or unique is
objective; the specification of which variables to use may be subjective in
some respects.

5.7 Identification of duplicates when reading data rows (grouped data)
Duplicate data rows can appear in a dataset already available for analysis, but
also earlier, when several separate datasets are joined together, e.g. by using
FILE TYPE GROUPED to form a grouped dataset (see 5.8 on nested data).
The option DUPLICATE=WARN is especially recommended if the cases are
likely to contain missings or duplicates. If several absolutely identical data
rows (so-called "records") occur in a dataset, usually only the last one is read
in.
FILE TYPE GROUPED
CASE=PATIENT 1 RECORD=#ABC 3(A) DUPLICATE=WARN.
RECORD TYPE 'A'.
DATA LIST
/ IQ 5-7.
RECORD TYPE 'B' .
DATA LIST
/ AGE 5-7.
RECORD TYPE 'C' .
DATA LIST
/ SCL90 5-7.
END FILE TYPE.
BEGIN DATA
1 A 127
1 B 58
1 B 58 /* duplicate
1 C 59
2 A 98
2 B 67
2 C 75
3 A 58
3 B 59
3 C 43
END DATA.
LIST.
FILE TYPE GROUPED defines the dataset to be created as a grouped
dataset. Prerequisite for a grouped dataset is that all data is grouped and
complete. #ABC is the temporary identification variable for the grouped data
and is located in column 3; the 3 therefore does not mean that three data
groups are read. (A) means that the values of the group identification
variables are strings. The value of the case identification number (CASE) is
in position 1 (variable PATIENT). As many RECORD TYPE statements
follow as data groups are read in. Each RECORD TYPE statement contains a
string as group identification variable. DATA LIST is used to specify
variable names and their respective position in the dataset. END FILE TYPE
ends the definition of the dataset structure. Between BEGIN and END DATA
the grouped data is read. DUPLICATE=WARN prevents more than one data
row from being read in for a single case from the respective data groups and
outputs the following note:
>Warning # 517
>A duplicate record has been found while building the indicated case.
>Each occurrence of the record has been processed, and the last
>occurrence will normally take precedence.
>Command line: 495 Current case: 1 Current splitfile group: 1
>Duplicate Record ID: B
>Case ID: 1
>Start of Record: 1 B 58 /* duplicate
PATIENT IQ AGE SCL90
1 127 58 59
2 98 67 75
3 58 59 43
Number of cases read: 3 Number of cases listed: 3
If DUPLICATE=WARN is omitted, SPSS still prevents duplicate data rows
from being read in; it simply issues no warning.
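The grouped read-in with duplicate handling can be mimicked in pandas by deduplicating on the (case, record) key before pivoting wide. This is a hypothetical cross-check analogue, not SPSS functionality; the column names mirror the example:

```python
import pandas as pd

# long format: one row per (patient, record type); case 1 has a duplicate B record
raw = pd.DataFrame({
    "PATIENT": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "REC":     ["A", "B", "B", "C", "A", "B", "C", "A", "B", "C"],
    "VALUE":   [127, 58, 58, 59, 98, 67, 75, 58, 59, 43],
})

# warn about duplicate records, similar in spirit to DUPLICATE=WARN
dups = raw.duplicated(subset=["PATIENT", "REC"], keep="last")
if dups.any():
    print("warning: duplicate records:", raw.loc[dups, ["PATIENT", "REC"]].to_dict("records"))

# keep only the last occurrence of each record, then pivot to one row per patient
wide = (raw.drop_duplicates(subset=["PATIENT", "REC"], keep="last")
           .pivot(index="PATIENT", columns="REC", values="VALUE")
           .rename(columns={"A": "IQ", "B": "AGE", "C": "SCL90"}))
```

As in the SPSS listing, the result has one row per patient with the last occurrence of each record taking precedence.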
5.8 Identification of duplicates when reading data rows (nested data)
Duplicate data rows can also occur when combining so-called nested
single-case data, e.g. using FILE TYPE NESTED. Nested datasets can be
imagined in such a way that the previously read-in dataset opens a grid for
the data of the following dataset, into which they are then read. In the
example, the variables L_HHNO to INTVDATE (source: interview
questions) form the grid into which the next dataset from HOUSHOLD to
NINCOME (source: household questions) is read. The variables PERSONID
to BPLACE of the remaining dataset (source: questions about the person) are
read into this grid. DUPLICATE=WARN prevents more than one data row
from being read in for a single case from the respective data groups and
outputs a warning. DUPLICATE=WARN is especially recommended when
the cases are likely to have missings or duplicates. The value of the case
identification number (CASE) is located at positions 1 to 4 (variable
L_HHNO).
RECORD=#MY_RECS 6 means that the value of the temporary
identification variable is located at position 6. As many RECORD TYPE
statements follow as data groups are read in.
FILE TYPE NESTED
CASE=L_HHNO 1-4 RECORD=#MY_RECS 6
DUPLICATE=WARN.
RECORD TYPE 1.
DATA LIST
/HHNO 9-11 INTRVIEW 12-13 INTVCODE 15-16 (A) INTVDATE 18-27 (A).
RECORD TYPE 2.
DATA LIST
/HOUSHOLD 11-15 (A) AREA 17-19 NINCOME 21-25 .
RECORD TYPE 3.
DATA LIST
/PERSONID 11-13 (A) AGE 15-16 GENDER 18 (A) JOB 20
BPLACE 22-32 (A).
END FILE TYPE.
BEGIN DATA
0001 1 322 1 CS 11/09/2004
* Type/Source1: Details of interviewer (Number of Household, Contact,
Interviewer Code, Interview Date)
0001 2 322-1 100 60000
* Type/Source2: Questions on household 322-1
0001 3 1_1 34 F 1 Eppingen
* Type/Source3: Questions on Person
0001 2 322-2 80 50000
* Type/Source2: Questions on household 322-2
0001 2 322-2 80 50000
* Note: duplicate: Questions on household 322-2
0001 3 2_1 25 M 1 Norderney
* Type/Source3: Questions on person
0001 3 2_2 24 F 1 Heidelberg
* Type/Source3: Questions on person
0001 3 2_3 59 F 0 Mutlangen
* Type/Source3: Questions on person
0001 2 322-1 100 60000
* Type/Source2: Questions on household 322-1
0001 3 1_2 39 M 1 Datteln
* Type/Source3: Questions on person
END DATA.
sort cases by HHNO HOUSHOLD PERSONID.
exe.
list.
The comments interspersed between BEGIN DATA and END DATA
indicate the data sources as well as other special features. However, they
must be removed from real SPSS programs, because otherwise they may
cause errors during the read-in process.
Output
>Warning # 517
>A duplicate record has been found while building the indicated case. Each
>occurrence of the record has been processed, and the last occurrence will
>normally take precedence.
>Command line: 156 Current case: 2 Current splitfile group: 1
>Duplicate Record ID: 2
>Case ID: 1
>Start of Record: 0001 2 322-2 80 50000
If DUPLICATE=WARN is omitted, SPSS does not issue a warning message;
reading in duplicate data rows is still prevented.
L_HHNO HHNO INTRVIEW INTVCODE INTVDATE HOUSHOLD AREA NINCOME PERSONID AGE GENDER JOB BPLACE
1 322 1 CS 11/09/2004 322-1 100 60000 1_1 34 F 1 Eppingen
1 322 1 CS 11/09/2004 322-1 100 60000 1_2 39 M 1 Datteln
1 322 1 CS 11/09/2004 322-2 80 50000 2_1 25 M 1 Norderney
1 322 1 CS 11/09/2004 322-2 80 50000 2_2 24 F 1 Heidelberg
1 322 1 CS 11/09/2004 322-2 80 50000 2_3 59 F 0 Mutlangen
Number of cases read: 5 Number of cases listed: 5

6 Missings
Missing data: Patterns, mechanisms and reconstruction

Missings are missing values in a dataset; they are the opposite of entries
(numbers, strings). While entries indicate the presence of information,
missings represent its absence. Missings are the rule rather than the
exception in data analysis (Peng et al., 2006).
"Missings", along with "completeness", "uniformity" and "duplicates", are
among the basic criteria of the DQ Pyramid that SPSS can be used to check.
The examination of all other criteria is based on these. In turn, the check
for missing data requires completeness, uniformity, and the exclusion of
duplicates.
The extent of the occurrence of missing data determines whether a data
storage, a variable and/or a case is considered complete (without any missing
data) or is described as incomplete (with missings). If data storages, variables
and/or cases contain only missing data, they are called empty.

6.1 Causes (patterns), consequences, extent and mechanisms
The quality of research and its results is directly related to the causes or
patterns of missings, their mechanisms, and their extent (see Little &
Rubin, 2002²; Cohen et al., 2003³, Chapter 11; Schafer & Graham, 2002).
Causes and patterns are directly related to each other: causes leave patterns
in the data(sets), and patterns in the data(sets) in turn suggest causes.
Furthermore, the reconstruction of missings assumes certain mechanisms,
which are important for their interpretation and reconstruction. Depending
on the cause or pattern (6.1.1.), certain (qualitatively interpretable)
mechanisms of missingness (6.1.3.) can have a greater impact (6.1.2.) on
results than their mere (quantitative) extent.
Some publications on assessing the extent of missing data offer "rules of
thumb" for judging the proportion of missings. Such rules of thumb argue
roughly as follows:

< 1% are considered trivial.
1-5% are considered manageable (e.g. univariate imputation).
5-15% require sophisticated methods (e.g. hot deck imputation, EM,
regression).
>15% affect any kind of interpretation.
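Whether or not one trusts such thresholds, the per-variable proportion of missings is easy to determine. A minimal sketch, illustrated here in Python rather than SPSS syntax, with hypothetical data and the band labels taken from the list above:

```python
# Hypothetical survey columns; None marks a missing value.
data = {
    "age":    [34, 25, None, 59, 39],
    "income": [60000, None, None, 50000, 61000],
    "job":    [1, 1, 1, 0, 1],
}

def missing_share(values):
    """Proportion of missing entries in one variable."""
    return sum(v is None for v in values) / len(values)

def band(p):
    """Map a missing proportion onto the rule-of-thumb bands quoted above."""
    if p < 0.01:
        return "trivial"
    if p <= 0.05:
        return "manageable"
    if p <= 0.15:
        return "sophisticated methods"
    return "interpretation affected"

shares = {var: missing_share(vals) for var, vals in data.items()}
for var, p in shares.items():
    print(f"{var}: {p:.0%} missing -> {band(p)}")
```

As the following remarks stress, such a purely quantitative label is only a starting point; the causes, patterns, and mechanisms behind the proportions decide what the numbers mean.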

Any application of reconstruction methods is always preceded by an
interpretation of the problem of missings. In this respect, the sequence
"reconstruct first, then interpret" is not the actual problem. Rather,
coupling a purely formal extent of missings with a certain degree of
reconstructability or interpretability is fairly meaningless. The following
remarks will show, among other things, that missings always depend on
causes, and in the end these causes also determine the concrete extent of
missings. If, for example, values are calculated incorrectly (e.g. a division
by zero, possibly caused by technical effects or the like), their extent may
well exceed 15%. Nevertheless, this high proportion of missing values can
be reduced as soon as the incorrect calculation rule is corrected. And even
if a high level of missing data remains, there may still be information from
covariates that allows an interpretation. The problem of missings is
somewhat more complicated than such a simple, purely quantitative "rule
of thumb" suggests (i.a. Little, 2007).
The mere extent of missings, then, does not provide a reliable basis for
evaluation; causes and patterns of missing data must be considered as well.
Depending on what is revealed first, patterns of missing data often point to
causes; and if the causes of missings are known first, a closer look often
also reveals patterns. Ultimately, the underlying mechanisms determine the
interpretation and reconstructability of missings.

6.1.1 Causes and patterns of Missings


Missings or loss of data are generally an issue when sensitive information,
data suppliers and/or data generation processes are involved (high-risk,
competitive studies). This is all the more true if the data organization
expects numerous measurement points (e.g. longitudinal studies), numerous
data columns (variables, e.g. data gaps in company-wide projects) and/or
numerous data rows (cases, e.g. data gaps in nationwide surveys).
The specific causes of missings range from general causes that are not
relevant to the subject matter to individual, subject-relevant, intentional
refusal to answer, such as:

a priori (before or during the data collection process): data gaps or
data loss due to organizational, process-related or technical causes,
e.g. study design (asking agreed-upon and/or inappropriate
questions, forgetting central items during questionnaire design),
software errors (faulty derivation, transmission or storage processes,
such as defective laboratory equipment, interference during wireless
transmission, faulty read/write processes), etc.
ex post (after the data collection process): data gaps or data loss
caused by, inter alia, unprofessional data management (e.g. incorrect
combining of cases and/or variables, incorrect calculations, e.g.
caused by dividing by zero), accidental overwriting of current data
(e.g. deleting complete data storages), hardware problems (e.g.
hard disk crashes, loss or theft of unsecured data carriers), software
problems (destruction by viruses, loss of files during transmission,
checking or storage by e-mail programs, anti-virus programs, zip
programs, firewalls), etc.
Data delivery: Survey or study appears irrelevant or contrary to the
interests of the information providers: e.g.
withholding/misappropriation of data by employees, customers or
test persons etc., especially if it is a voluntary-uncontrolled as
opposed to a controlled-mandatory provision of information (e.g.
individual participation in a study, e.g. organization of data delivery
processes in networks, companies or multi-center studies) etc.
In longitudinal or panel studies mainly: change of subject (e.g.
territorial change: reunification of Germany, e.g. institutional change:
(re)formation of corporate structures, e.g. individual change:
relocation, marriage, change of profession) and/or unavailability for a
second survey or follow-up survey (relocation, motivation, mortality,
etc.), etc.
Individual or selective problems during the survey or data
collection: e.g. accidental (random) skipping of a single item in a
questionnaire (cause: e.g. attention, lack of time), technically caused
(systematic) skipping (e.g. caused by incorrectly linked online
questionnaires), factual or linguistic incomprehension of questions,
instructions and/or interviewers, and/or also intentional refusal or
skipping of a question that is too personal or factually too difficult
(cause: e.g. aversion, excessive demands) etc.

Apart from concrete causes as a direct factual explanation of missings, there
are also patterns in the data that can be indirect indications of missings and
thus, in turn, of certain causes (see Little & Rubin, 2002², 4-11). In the
following examples of frequently occurring patterns, the reader can imagine
missing data as "nonresponse", "refusal to answer", etc.

Type: Unspecific Missingness
Ideally, a dataset should be complete, i.e. it should not contain any missing
data. If, however, there are missings, the pattern of the missings
("missingness") in the rows and columns (e.g. Y1, Y2, Y3, Y4, Y5) of a
dataset should be random, non-informative and not follow any regular
pattern. However, unspecific missingness, like the following examples of
typical missing patterns, is not easily visible to the naked eye in analysis
practice. Multivariate sorting of the dataset is often necessary to reveal
such patterns.
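The multivariate sorting mentioned above can be approximated by tabulating missingness patterns per row. A minimal sketch in Python with hypothetical data ("X" marks an observed cell, "." a missing one), similar in spirit to the pattern tables of the SPSS MVA procedure:

```python
from collections import Counter

# Hypothetical rows over variables Y1..Y4; None marks a missing value.
rows = [
    (1, 7, None, 3),
    (2, None, None, 4),
    (5, 7, 2, 1),
    (1, None, None, 2),
    (4, 6, 5, None),
]

# Encode each row as a missingness pattern: 'X' = observed, '.' = missing.
patterns = ["".join("." if v is None else "X" for v in row) for row in rows]

# Counting identical patterns (akin to sorting the dataset by missingness)
# makes regularities visible that are hard to spot by eye.
pattern_counts = Counter(patterns)
for pat, n in pattern_counts.most_common():
    print(pat, n)
```

Here the pattern "X..X" occurs twice, a first hint that Y2 and Y3 tend to be missing together, as in the multivariate pattern described next.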

Type: Univariate Missings
The missings of all cases are limited to a certain variable Yx at one
measurement time. NMissings is limited to Y1 / t1. Examples: an item in a
questionnaire is not available, incomprehensible or inappropriate for all
participants of the study (or even a subset) at a measurement time;
individual data cannot be obtained in an agricultural experiment because,
for example, certain seeds do not grow in a field (the so-called missing
plot problem).

Type: Multivariate Missings
The missings of a subset of all cases are limited to several specific
variables. Example: several variables are not available, incomprehensible
or inappropriate for a subset of the study. NMissings is limited to
Y1,...,n / t1.

Type: Combination of group- and item-specific Missings
The missings of the cases are distributed over up to all variables in a
regular, group- and item-specific pattern. Example: certain variables
Y1,...,n are not available, incomprehensible or inappropriate for a first
subset of a study, while at the same time certain (possibly the same)
variables (but then in a different pattern) are also not available,
incomprehensible or inappropriate for a second subset of this study.
Under certain circumstances, this group-specific pattern may be completely
consistent with the pattern for data-source-dependent missings. If the
different sources also contain data from different groups, the specific
causes of the missings must be examined particularly carefully.
Note: The dash in Y1 is intended to indicate that the data above and below
it come from different groups.
Type: Monotonic increase of missings (decrease of completeness of
answers)
The number of missings is increasing. Longitudinal example: with
univariate repeated measurement, the number of missings in the response
variable increases, e.g. due to failures, mortality etc. during the survey.
NMissings Y1 / t1 < Y1 / t2 ... < Y1 / tn. This logic also applies to several
dependent variables.
Cross-sectional example: when filling out a questionnaire, the number of
missings increases with the number of items, e.g. due to increasing fatigue
or demotivation. NMissings Y1 / t1 < Y2 / t1 ... < Yn / t1. This logic also
applies to repeated measurements. In the case of series-like increasing
missings, the condition MAR (see below) applies in particular: the kind of
missing distribution ("missingness") always depends only on valid values
before the failure, not on those afterwards.

Type: Data-source-dependent Missings
If, when appending cases (ADD FILES), there are no values in a common
variable for a subset of cases (either already existing or newly added)
(cf. Y2, Y3), a systematic empty column is created for the subset (this
effect can also be caused by different lengths in strings). If, when adding
variables (MATCH FILES), a variable occurs only in a subgroup of cases
(either already existing or newly added), a systematic empty column for a
subset is likewise the consequence.
Both variants, which seem purely technical at first, may be compounded by
the problem of non-simultaneity (asynchronicity) of observation. If, for
example, the values from the two data sources are very far apart in time, it
can be quite problematic to relate them to each other without further ado in
a joint analysis.
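The systematic empty-column block produced by appending cases from sources with different variable sets can be sketched in a few lines of Python (hypothetical sources and values; the SPSS counterpart is ADD FILES):

```python
# Hypothetical sources: A delivers Y1 and Y2, B delivers Y1 and Y3 --
# analogous to appending cases with ADD FILES in SPSS.
source_a = [{"id": 1, "Y1": 10, "Y2": 5}, {"id": 2, "Y1": 20, "Y2": 6}]
source_b = [{"id": 3, "Y1": 30, "Y3": 7}, {"id": 4, "Y1": 40, "Y3": 8}]

# Appending cases over the union of variables fills each variable a
# source did not deliver with None -- a systematic empty column block.
variables = ["id", "Y1", "Y2", "Y3"]
stacked = [{v: row.get(v) for v in variables} for row in source_a + source_b]

missing_per_var = {v: sum(r[v] is None for r in stacked) for v in variables}
print(missing_per_var)
```

The missings in Y2 and Y3 are not scattered but aligned exactly with the source boundaries, which is what makes this pattern diagnostic of its cause.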

Further types: Completely missing ("latent") predictors
In contrast to the examples already presented, predictors play a role in this
pattern for the first time. X describes a group of independent variables
(X1, X2, X3, etc.); X is completely missing and is therefore "latent". Y
describes a group of dependent variables (Y1, Y2, Y3, etc.). Little & Rubin
(2002², 8) describe this pattern as an example of how the parameters of the
unobserved X can be estimated inferentially (with the help of some
assumptions) via the observed variables Y, e.g. by means of a factor
analysis.
Type: Deviation from reference distributions
If distributions deviate from reference distributions, e.g. the normal
distribution, a so-called drawing error may be present. Drawing errors can
lead, for example, to samples that no longer represent the population, and
conclusions can be highly questionable. A mean value based on a distorted
sample is therefore an estimator that must be corrected for the actually
undistorted population. Deviations from reference distributions are clearly
visible in histograms (see 7.1.1. on the role of perspectives and
expectations). A prerequisite is that the distribution form of the reference
distribution is known; only then can the distorted sample be adjusted for
the drawing error.
With this so-called "truncation", the values below (cf. figure), above, or on
both sides of the distribution may be cut off, which causes problems in
model specification, especially with regression approaches (e.g. Chatterjee
& Price, 1995², 190; Pedhazur, 1982², 30-32). The difference from the
previous types of missing distribution is that there, missings were
recognizable as a kind of pattern in a dataset by means of empty rows,
columns or at least several cells. In this last example, missing data is not
present in the form of empty cells (and is therefore not recognizable);
rather, data was not even included in the sample due to the sampling error.
A so-called mechanism is thus present.
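The bias introduced by truncation can be made visible in a small simulation. A sketch in Python, assuming a hypothetical normal reference distribution with mean 100, standard deviation 15, and a lower cutoff of 90:

```python
import random

random.seed(42)

# Hypothetical population: normal with mean 100, sd 15.
population = [random.gauss(100, 15) for _ in range(100_000)]

# "Truncation": values below the cutoff never enter the sample at all --
# they are not even visible as empty cells in the dataset.
cutoff = 90
truncated = [x for x in population if x >= cutoff]

pop_mean = sum(population) / len(population)
trunc_mean = sum(truncated) / len(truncated)
print(f"population mean ~ {pop_mean:.1f}, truncated-sample mean ~ {trunc_mean:.1f}")
```

The truncated sample's mean is systematically too high, and nothing in the sample itself signals this; only the known shape of the reference distribution reveals the drawing error.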

The causes or patterns of missings usually result in a different extent of
missing data. With regard to interpretability, however, the underlying
mechanisms of missings are more significant (6.1.3.).

6.1.2 Consequences of missings


What are the consequences of Missings? Missings can have far-reaching
consequences, from bias to massive material damage.
Bias
The so-called "bias" is considered the most serious problem of missings
(Rubin, 1987). Imagine a product satisfaction study in which some users
are so dissatisfied that they refuse to participate. The non-respondents are
likely to have a different assessment of the product than those who
reported their assessment. The final sample is therefore no longer a random
or representative sample from the population. The studied model thus
contains a statistical bias (e.g. in the derived estimates, including standard
errors, correlations, etc.) as well as a bias in the sense of a content-related
distortion. If conclusions are drawn solely on the basis of the information
provided, they will be clearly distorted, since they do not take into account
the assessments of dissatisfied participants and will thus be one-sidedly
positive. A manufacturer who only takes positive feedback into account
(as often happens, e.g., in internal market research) runs the risk of not
correctly assessing the market acceptance of its product on the basis of
such an information situation.
A classic example of a study with a bias in the response (but not only
there) is provided by the so-called Hite Reports on the sexuality of women
and men (e.g. Hite, 1981, 1976). In 1998, "The Times" declared the
original Hite Report one of the 100 most important books of the 20th
century. The conclusions of the Hite Reports are nevertheless still
criticized. Serious methodological critiques (e.g. Singer, 1995; Smith,
2003, 1989) raised, among other things, the following points:

Methodology: Unacceptable methodological argumentation and
documentation, e.g. possibly with regard to a strategic weighting of
the data, organization of the survey, etc.
Sampling: Some kind of self-fulfilling prophecy sampling instead of a
random sampling (and thus minimal representativeness); the sample
obtained was already so distorted by the selection of the participating
organizations that the conclusions drawn could not be representative
from the outset.
Items: Numerous questions were criticized, e.g. being traditionalist,
one-sided, suggestive or at best unclear. Questions about the
relationship with mothers, for example, were formulated differently
than questions about the relationship with fathers. As a result,
differences and evaluations were already preprogrammed in the
formulation of the questions.
Response rate: An unacceptably low response rate of between 3 and
6%. The response rate of questionnaires sent to women, for example,
was only about 3%, while the response rate of questionnaires sent to
men was about 6%. The percentage of non-responses, i.e. missings,
was at least 96%. The low response rate was defended by the
sensitive subject matter and the elaborate methodology.
Response and self-selection: In addition to the free choice to answer
only those questions one wished to answer, a self-selection effect
toward a non-response bias also took place. Accordingly, only
persons participated in the studies who wanted to communicate their
emotional (dis)satisfaction despite an average effort of about 4.5
hours per questionnaire with up to 180 open questions and
sub-questions (depending on the counting method). The research
design makes it practically impossible to investigate differences
between participants and refusers of the study.
Interpretation: The interpretation of the ultimately presented results
suffered from constantly changing N due to the aforementioned
possible answer selection and thus, depending on the item, from a
different (sub)sample.

Such a low percentage of respondents (approx. 3-6%) does not allow for any
generally valid statements. Serious survey researchers usually expect a
response rate of at least 60%, otherwise the results obtained are distorted in
favor of the few who are willing to communicate. Nevertheless, the Reports
have been published as studies on sexual dissatisfaction of women in general,
for example. In fact, a later replication by the Washington Post and the ABC
News (response rate: approx. 80%; method: short telephone interviews)
contradicted Hite's claims in almost all points.
Even with available data (albeit with missings), note that the sample
composition can additionally differ massively from one variable to the
next, depending on how the missings arise (and also, e.g., on a careless
handling of missings such as pairwise deletion). Here, too, generalizing
such results to a uniform population is no longer possible. A bias is thus
equivalent to information gaps and thus to the absence of one of the central
requirements for reliable conclusions and decisions. Bias can occur in any
field of research, e.g. in ecological studies in epidemiology; see e.g.
Cohen's ecological studies on lung cancer and radon (Jacob et al., 2005,
20-23).
Less statistical power
Missings affect the degrees of freedom (df) in inferential statistical tests, such
as the t-test, which in turn leads to less statistical power compared to
randomly drawn samples (Cohen et al., 2003³; Cool, 2000; Raymond &
Robert, 1987).
Violations of inferential statistical requirements of procedures
Missings are equivalent to a lower N of the database.
For many, particularly multivariate procedures, a reduced N is already a
sufficient violation of inferential statistical procedure requirements, e.g., in
linear regression (Green, 1991; Cohen et al., 2003³), binary or multinomial
logistic regression (Hosmer & Lemeshow, 2000, 339-346) and in time series
analyses (Yaffee & McGee, 2000, Chapter 12). In factor analyses, for
example, the number of observations should be significantly larger than the
number of variables. The most liberal criteria require at least five cases per
variable; some recommendations go as high as 1,000 cases ("excellent").
In cluster analyses, the solvability of a clustering problem depends, among
other things, on the number of cases: with few cases, value levels and
variables it is difficult to form clusters at all; with many cases, value levels
and variables it is difficult to find a few clusters.
A reduced N generally causes further problems in the form of violations of
various inferential statistical requirements of procedures. Often several
problems occur simultaneously, e.g. distribution problems (asymmetry,
discontinuity, non-normality, less than expected table cell frequencies) and/or
the incorrect calculation of descriptive or inferential statistical parameters.

Missings can lead to less than expected frequencies in cells in table
analyses (cf. Schendera, 2004); this applies without restriction to
other tests based on the Chi² method, e.g. the Hosmer-Lemeshow test.
In Kaplan-Meier's survival analysis, the Log Rank, Breslow and
Tarone-Ware tests are also based on Chi² statistics and are therefore
the less accurate the smaller the sample is. For categorically scaled
predictors, for example, if there is too little data for ordinal or
multinomial regression, there is a risk that no values of the dependent
variable are available for certain characteristics or combinations.
Missings can interrupt the continuity of (curvi)linear functions, which
is problematic in correlation, regression, reliability analysis, time
series analysis (prediction) and repeated measurement (e.g.
Fitzmaurice et al., 2004; Cohen et al., 2003³; Yaffee & McGee,
2000).
Missings can change the weighting of data: Missings can convert
asymmetrically large groups into symmetrically large groups, and
vice versa. This has e.g. an influence on the selection of the correct
statistical procedure for two or more group comparisons, e.g. t-test vs.
Mann-Whitney test; ANOVA vs. GLM or Kruskal-Wallis test (e.g.
Keppel & Wickens, 2004⁴, 138ff.).
Missings can change the distribution of data (e.g. Keppel & Wickens,
2004⁴, 146).
With multivariate procedures, e.g. the family of cluster analyses,
missings can lead to disproportionally completed data rows (and thus
to distorted clustering) resp., depending on eliminating, to a
drastically reduced data basis.
In binary logistic regression, missings can lead to a situation that the
target event of interest does not occur frequently enough or
disproportionately in the dependent variable (Hosmer & Lemeshow,
2000, 346-347). If the target event is rare, the costs of false negatives
are usually higher than the costs of false positives, suggesting cutoffs
well below 0.5.
Time series analyses generally require data series without (larger)
gaps. The more irregular or the shorter a time series is, the more
difficult it is to reconstruct missings, e.g. by using the mean value of
the series, the neighboring values before or after the data gap, or by
predicting them on the basis of the values preceding the gap (Yaffee
& McGee, 2000, 3). In forecasting, the proportion of the prediction
error increases; the predictions become less accurate and the
confidence interval subsequently becomes wider and wider.

Missings can thus limit the use of statistical methods even if they occur
completely at random. The effect is naturally all the greater if the
information represented in the smaller remaining dataset is additionally
distorted by a bias (Rubin, 1987).
Calculation errors
Missings can lead to incorrect results even in simple arithmetic operations.
The following example demonstrates an almost classical error in the
calculation of e.g. mean values.
data list list (",")
/value_1 miss_1 value_2 miss_2.
begin data
1, 1, 1, 1
1, , 1, 1
1, , 1,
1, , 1, 1
1, 1, 1, 1
end data.
exe.
A dataset is read into SPSS. In contrast to the other two variables, MISS_1
and MISS_2 contain missings. A mean value is to be calculated from these
four variables. Determining the sum variable MY_SUM for the numerator
still appears uncomplicated; for this purpose, the sum of all valid values is
established using the SUM function (there are also other possibilities, see
below). If, however, the data are incomplete, confusing the theoretical
maximum (number of variables) with the number of available data (valid
values) causes an error when determining the denominator.
For M_VALUE_1, the division is made by the number of theoretically
possible values. Interpreted as a mean value, the result is usually not
correct, since it is divided by more values than are actually present (see
M_VALUE_2). With M_VALUE_1, the denominator is a constant.
Depending on the area of application, however, dividing by a constant may
be permissible or even necessary; the calculated values are then to be
interpreted only with reservation as mean values, rather as indices of the
completeness of a data storage or delivery. Under certain circumstances,
such a value can even be interpreted directly as a completeness index.
Interpreted as an index for the completeness of a row (see also 3.4., 6.3.8.),
the result correctly expresses the degree of completeness of a data row,
where e.g. only the value 1 or a missing is allowed as an entry (if
necessary, this value can be multiplied by 100 to be interpreted as a
percentage). In this example, 0.75 or 75% would mean that there are valid
values in three of four cells.

Predominantly incorrect averaging in case of missing values


compute MY_SUM = sum(value_1, miss_1, value_2, miss_2).
exe.
compute M_VALUE_1 =(MY_SUM / 4) .
exe.
With M_VALUE_2, the division is made by the number of valid cases; this
value is determined beforehand in a separate step and stored in the variable
VALID. With M_VALUE_2, the denominator is thus a variable. If there
are exclusively missings in a data row, SPSS stores the value 0 in VALID.
However, since SPSS would abort the calculation in case of a division by
zero, the value 0 would have to be replaced by a missing in VALID in this
case. By dividing MY_SUM by VALID, the correct mean value
M_VALUE_2 is now available. M_VALUE_2 cannot be interpreted as an
index for the completeness of a line, since, if determined correctly, it
would always result in 1.0 or 100% in this example.

Correct averaging in case of missing values


compute MY_SUM = sum(value_1,miss_1,value_2,miss_2).
exe.
count VALID = value_1 miss_1 value_2 miss_2 (1).
exe.
compute M_VALUE_2 = MY_SUM / VALID .
exe.
format M_VALUE_1 M_VALUE_2 (F4.2).
list.
The difference in the logic of M_VALUE_1 compared to M_VALUE_2 can
easily be recognized by the results requested by means of LIST. Depending
on the data and the hypothesis, the COUNT variant could, for example, also
be replaced by a COMPUTE variant with the SUM option.
value_1 miss_1 value_2 miss_2 MY_SUM M_VALUE_1 VALID M_VALUE_2
1 1 1 1 4 1,00 4 1,00
1 . 1 1 3 ,75 3 1,00
1 . 1 . 2 ,50 2 1,00
1 . 1 1 3 ,75 3 1,00
1 1 1 1 4 1,00 4 1,00
Number of cases read: 5 Number of cases listed: 5

If there are no gaps in the data, M_VALUE_1 and M_VALUE_2 match. If
the variables included in the summation contain at least one missing at any
point, the calculated mean values differ markedly.
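The two denominators can also be sketched outside SPSS. A minimal Python rendering of the example above, with the same row values (None marks a missing value):

```python
# Each row mirrors the SPSS example dataset above.
rows = [
    (1, 1, 1, 1),
    (1, None, 1, 1),
    (1, None, 1, None),
    (1, None, 1, 1),
    (1, 1, 1, 1),
]

m_value_1 = []  # constant denominator: number of variables (4)
m_value_2 = []  # correct denominator: number of valid values
for row in rows:
    valid = [v for v in row if v is not None]
    my_sum = sum(valid)
    m_value_1.append(my_sum / len(row))
    m_value_2.append(my_sum / len(valid))

print(m_value_1)  # usable as a per-row completeness index
print(m_value_2)  # correct mean of the valid values
```

As in the SPSS output, the constant denominator yields 0.75 and 0.50 for the incomplete rows, while the mean over valid values stays at 1.0 throughout.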
Incommensurability of results (varying case numbers)
If data contain missings, the analysis results of different (mainly
multivariate) procedures are based on different case numbers. The obtained
results can only exceptionally be compared directly with each other, since
on closer inspection the concrete underlying data basis differs in each case.
The general reason for this is that statistical procedures, and their SPSS
implementations, handle missings differently.
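How the handling of missings drives the case numbers apart can be illustrated by comparing listwise and pairwise deletion. A small sketch in Python with hypothetical three-variable data:

```python
# Hypothetical rows over three variables; None marks a missing value.
rows = [
    (1.0, 2.0, None),
    (2.0, None, 3.0),
    (3.0, 4.0, 5.0),
    (4.0, 5.0, 6.0),
    (5.0, 6.0, None),
]

# Listwise deletion: only fully complete rows enter every analysis.
listwise_n = sum(all(v is not None for v in row) for row in rows)

# Pairwise deletion: each variable pair keeps its own complete cases,
# so different statistics rest on different N.
def pairwise_n(i, j):
    return sum(row[i] is not None and row[j] is not None for row in rows)

print("listwise N:", listwise_n)
print("N(X1,X2):", pairwise_n(0, 1), " N(X1,X3):", pairwise_n(0, 2),
      " N(X2,X3):", pairwise_n(1, 2))
```

With three pairwise N of 4, 3 and 2 against a listwise N of 2, correlations from such a matrix would rest on incommensurable data bases, which is exactly the problem described above.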
Inefficiency (resources)
Missings can also overwhelm users who misunderstand them as a simple
technical problem rather than as a complex, multivariate problem of
content and statistics. In particular, the "automatic" replacement techniques
provided in software often enough tempt users to misunderstand the
reconstruction of missings as a purely technical problem that can be solved
"at the push of a button", to underestimate the complexity of the
(statistical, content-related) problem, and even to increase the problem
(damage, bias, etc.).
Financial damage
Lost or distorted data are a lost investment at first, but often result in
additional follow-up costs. As examples, the methods of longitudinal studies
or panel analyses should be mentioned first (e.g. Buu, 1999; Roth, 1994, 538-
539). However, data often represent much more than "just" cases, time and
effort. In companies, data often simply represent money. For them, the
following applies all the more: Every lost customer is also a lost profit.
Corrupted data is a bad investment, the consequences of which can be further
costs and even damage to the company's image (see Chapter 1). In such
situations, assume that a mistake does not necessarily come alone, but brings
a few "friends" to the "worst case party". For example, in July 2006 a
computer technician from the Alaskan tax authorities accidentally deleted
both the data and the backup drive. The real drama, however, began when
they wanted to access the data on the nightly backup tapes and found that the
tapes were unreadable. 800,000 scanned claims and documents for $38
billion in disbursements from the Alaska Permanent Fund were gone and all
of them had to be rescanned, reviewed and processed for months (Sutton,
2007).
Further consequences
Experience shows that lost, deleted or even falsified data often has further
consequences. Loss of data is often followed by the accusation of intent.
Depending on the type of data involved, extensive scandals can be triggered.
The fact that data could be lost despite security measures can also provoke
mistrust towards those in whose area of responsibility such a data loss was
possible. In June 2007, for example, the program REPORT MAINZ (June 25,
2007) reported that the “Nachrichtenwesen der Bundeswehr” (Bundeswehr
Intelligence Center), built for the most sensitive data in Germany (including
intelligence reports, CIA reports and images from spy satellites) had lost
extremely sensitive data during archiving due to a technical defect in the data
backup robot. Should the data remain lost, this seemingly simple loss of data
has far-reaching consequences: The investigative committees of the German
Bundestag can no longer fulfill their mission due to a lack of sources. For law
enforcement agencies in war crimes trials, evidence is destroyed. Intelligence
findings, including those related to terrorism, have been lost. The scandal that
was unleashed concerned, among other things, the professionalism of data
backup, political responsibilities, intelligence consequences, but also the
question of potential intentionality, for example, possibly with regard to the
cover-up of another scandal, including the role of the KSK in the detention of
Germans in U.S. prison camps. At this point in time (July 2007) it is unclear
whether the data can be reconstructed.

6.1.3 Mechanisms of missings


Missings can be assessed in terms of extent, causes, and patterns (6.1.1.), and
also in terms of underlying mechanisms. These mechanisms are especially
important for the interpretation and reconstruction of missings.
If missings are distributed completely randomly, interpretation is complicated
only by a reduced N, not by an additional bias. If, however, the proportion of
missings is high and they are distributed not randomly but systematically
according to some pattern, the situation is particularly serious: the results are
distorted by a so-called bias and become massively more difficult to interpret.
It is therefore important to first determine how many missings occur, but even
more important to determine whether the missings are distributed
systematically or randomly.
How do you assess the patterns of missings ("missingness")? Assume that a
fictitious dataset contains the data of n subjects in m+1 variables, of which
one is a dependent variable Y and m are independent variables X1, X2, ..., Xm.
Generally, a distinction is made between the three possible patterns MCAR,
MAR, and NMAR. The following terminology and logic closely follow
Little & Rubin (2002²). The patterns MCAR and MAR permit further
exploration and analysis of the data; the pattern NMAR, on the other hand, is
considered non-ignorable.

With MCAR (Missing Completely at Random), missing values in Y do not
depend on observed or missing values in Y (Little & Rubin, 2002², 12).
f(M | Y, Φ) = f(M | Φ) for all Y, Φ,
where M stands for the missings (Mij), Y for the complete data (yij) and Φ for
the unknown parameters.
MCAR does not require that the pattern of the missing values itself be
random, only that the missingness does not depend on the data values. The
origin of an MCAR pattern does not need further investigation: the subsample
of complete subject data can be considered a randomly drawn subsample of
the sample from the population.
The presence of the MCAR condition can be explored using Little's MCAR
test, which is implemented in the SPSS MVA procedure. However, SPSS
does not offer a test for MCAR vs. MAR. MCAR is a more demanding
assumption than MAR (see below). Data with missings in an MCAR pattern
can therefore be considered a randomly drawn subsample of the potentially
complete sample from the population (Little & Rubin, 2002², 13).
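A minimal sketch of how Little's MCAR test can be requested (the variable
names Y, X1, and X2 are placeholders; the test statistic is printed with the
EM output of MVA):
mva variables=Y X1 X2
 /em.
A nonsignificant test result is compatible with MCAR; a significant result
speaks against it.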

With MAR (Missing at Random), the type of missing distribution
("missingness") in the variable Y depends only on the observed Y values
(Yobs) and not on the missings in Y (Ymiss).
f(M | Y, Φ) = f(M | Yobs, Φ) for all Ymiss, Φ,
where M stands for the missings (Mij), Y for the complete data (yij) and Φ for
the unknown parameters.
If the probability of Y being missing does not depend on the missing value
itself, but on observed Y values, then Y is considered MAR. A missing
Y value is therefore related only to observed values, not to a Y value that
could possibly have been collected. Data with missings in a MAR pattern
cannot be considered a randomly drawn subsample of the potentially
complete sample from the population. The emergence of the MAR pattern
should be investigated and incorporated into a statistical model. Although the
subsample of complete subject data cannot be considered a randomly drawn
subsample of the population sample, the missing Y values can be
reconstructed from the complete X1, X2, ..., Xm variables by appropriate
statistical modeling. MAR is considered less restrictive than MCAR. The
existence of the MAR condition can be explored by a simple t-test in which
the mean values of the groups with and without missing values are compared
(for categorically scaled data, a Chi²-test could be used analogously; binary
logistic regression, discriminant analysis, or analysis of variance, among
others, serve as multivariate variants).
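Such a t-test can be sketched, for example, by deriving a missing indicator
for Y and comparing an observed variable between the two groups (the
variable names Y and X1 are placeholders):
compute Y_MISS=missing(Y).
exe.
t-test groups=Y_MISS(0 1)
 /variables=X1.
A significant group difference in X1 suggests that the missings in Y are
related to X1 and are therefore not distributed completely at random.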
At this point, it should be emphasized that this kind of testing for MAR (e.g.,
by means of the t-test) subtly changes the question: it is no longer of interest
whether the data are missing non-randomly, but whether the effect of the
non-randomly missing data is large enough to have a lasting negative impact
on the results obtained. However, a nonsignificant t-test does not completely
exclude the presence of non-randomly missing data, nor does an (apparently)
significant result necessarily render the obtained results completely
meaningless (see, among other things, the aspects of effect size and alpha
error accumulation). Furthermore, the t-test is only useful if it actually
captures the missing-causing variables or the specific causes. If the t-test (or
any other method used, see above) only tests the X1, X2, ..., Xm variables, but
the missings are caused by Z1, Z2, ..., Zm or by other factors, e.g. technical
defects, a nonsignificant test result will miss the actual missing-causing
process (see also NMAR). The competent application of statistical tests
supports the assessment of the MAR assumption, but it is no substitute for a
careful examination of patterns, causes, and processes.

With NMAR (Not Missing at Random; sometimes also found as "MNAR"
or "nonignorable" in the literature, e.g. Longford, 2000), the type of missing
distribution in the variable Y depends only on the missings in Y (Ymiss) and
not on the observed Y values (Yobs). NMAR is, so to speak, the opposite of
MAR: in NMAR, Ymiss causes the "missingness", whereas in MAR, Yobs
causes the "missingness".
f(Y, M | θ, Φ) = f(Y | θ) f(M | Y, Φ) = f(yi | θ) f(Mi | yi, Φ),
where f(yi | θ) denotes the density of yi (indexed by the unknown parameters
θ), and f(Mi | yi, Φ) denotes the density of a Bernoulli distribution for the
binary indicator Mi with the probability pr(Mi = 1 | yi, Φ) that yi is missing.
With MAR, the type of missing distribution can be explained by other
variables, while with NMAR, the missingness can only be explained by the
missing variables or values themselves. MCAR and MAR are considered
ignorable; the pattern NMAR cannot be ignored.
Dealing with missings requires knowledge of the existing patterns of
missings (deleting, for example, is only appropriate under MCAR). However,
MCAR is rare; MAR occurs more frequently, but is not always tenable. With
NMAR, the probability of missings depends on the missing values
themselves. In other words: only the variables with missing values can
provide an explanation for the cause of the missings.
The mechanisms are based on a common assumption that is particularly
important for the reconstruction of missings: "Missingness indicators hide
true values that are meaningful for analysis" (Little & Rubin, 2002², 8). Only
if this assumption is true (ideally for all strata of a variable) does it make
sense to replace missing values by imputed values (see 6.4.). If this
assumption is not true, it is more appropriate to include the missings in an
analysis (see 6.5.).
Finally, if the mechanisms of the missing values are known, cause-oriented
measures for error correction can also be initiated. Missings can therefore be
dealt with in several ways: deleting (6.3.), reconstruction (incl. imputation,
6.4.), neither deletion nor reconstruction (6.2.), and inclusion in an analysis
(6.5.).

6.2 Which missings should not be replaced by values or deleted?
Missing data should not be replaced, and data should not be deleted, if the
missings result from the content, technology, logic, or varying degrees of
completeness of the data collection instruments.
Some surveys record questions of knowledge or facts; if someone does not
know an answer, it is not legitimate to replace the corresponding missings
with answers that the person questioned never gave or knew.
In many electronically based (online) surveys, certain answers cause other
(groups of) questions to be skipped, e.g. if the respondent belongs to a certain
group or the like. The answers of this group necessarily show up as missings
in the dataset, because the construction of the questionnaire technically
prevented an answer. If such a data collection technique (e.g. skipping) leads
to missings, these are technically caused missings that may not be replaced
by values.
In surveys, it often follows from the answer to a first question that there is no
need to answer a second question. If, for example, someone answers the first
question by denying ever having smoked, then it is unnecessary to answer the
question of when she started smoking. If such a questioning technique leads
to missings, these are logically caused missings, which also must not be
replaced by values.
In repeated measurements, different questionnaire versions sometimes occur
at different measurement times: compared to the first measurement, for
example, an additional question was included in the second measurement, so
its values are missing for the first measurement time. During the analysis
phase, one should not be tempted to reconstruct the missing data for the first
measurement time from the data of the second measurement time, e.g. using
a multivariate hot deck approach. The reason is that, in addition to the
multivariate intercorrelatedness, the time-dependent variability of the data
comes into play, because the biographical or socio-historical context of the
subjects is different. The same applies all the more to questionnaires that
were not filled out at all, e.g. if someone was unable to participate in a
follow-up.
The occurrence of such uncontrolled missings can be drastically reduced by
using so-called consistency codes, which are then excluded from the analysis
in the form of user-defined missings, e.g.
– "991" or also "-1" for "doesn't know answer",
– "992" or also "-2" for "denied answer",
– "993" or also "-3" for "answer skipped",
– "994" or also "-4" for "question never asked",
– "995" or also "-5" for "complete drop-out", etc.
You cannot tell from the missings in a dataset whether the values were
already missing in the questionnaire, whether they were never transferred to
the dataset at all, or whether they were transferred correctly but then got lost
due to a faulty transformation. Already during data entry, as well as later
during transformations, missing values have to be provided with a
user-defined code (see below). This procedure ensures that it is possible to
distinguish between controlled and uncontrolled missings (even for strings up
to 8 characters long).
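Such consistency codes can, for example, be declared as user-defined
missings and labeled as follows (a minimal sketch; the variable name Q1 is a
placeholder):
missing values Q1 (991 thru 995).
value labels Q1
 991 "doesn't know answer"
 992 "denied answer"
 993 "answer skipped"
 994 "question never asked"
 995 "complete drop-out".
Declared this way, the codes remain documented in the dataset but are
automatically excluded from statistical procedures.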
6.3 Deleting Missings
To state it up front: under certain circumstances it is possible to compensate
for missings through the duplication of "similar" cases, provided, among
other things, that there is a certain "similarity" with the missings to be
replaced and that a certain element of chance is taken into account when
sampling similar cases (Kish, 1990, 126).
When assessing frequencies, however, the actual (original) number of cases
must not be confused with the number of duplicated data rows (see, in a
similar context, the analysis of multiple answers; Schendera, 2005).
Removing cases, however, is technically easier (but not always statistically
better) than duplicating cases. By default, pair-wise or list-wise deletion is
practiced.

6.3.1 Deleting pair-wise vs. list-wise


If missings are to be deleted, there are two different approaches: list-wise and
pair-wise deletion (cf. Little & Rubin, 2002², 41-58). Both are fast and
uncomplicated, are preset in numerous SPSS procedures, require MCAR, and
may lead to bias, massive loss of information, and possibly further problems.
Both approaches require at least two variables; if only one variable is present,
the distinction between pair-wise and list-wise deletion is irrelevant.
List-wise deletion of cases (so-called "complete case analysis") is mainly
used in multivariate description or analysis. The results are then based only
on complete data: every case that contains a missing in any variable is
excluded from the analysis as a complete data row (i.e., including the
variables in which values are present). As a consequence, all variables
contain the same N.
An advantage of list-wise deletion is its simplicity. The common N also
facilitates the comparability of separate (also multivariate) analyses. The
disadvantage of list-wise deletion is the cumulative loss of information: the
more variables contain even only occasional missings, the more cases are
excluded across all variables. The more cases are excluded, the more
probable are bias or increasing inaccuracy. The cumulative exclusion of
incomplete cases can be considerable, up to 100% of cases. A
countermeasure to reduce the bias would be the use of weighting methods.
List-wise deletion is permissible if the number of excluded cases is kept
within a limit, if MCAR holds, and if the remaining complete cases are a
random sample of all cases. List-wise deletion of cases would also be
permissible, for example, if values are missing only in a dependent variable.
If missings concentrate on only one variable, this variable can be deleted,
unless it is the only dependent variable. Elsewhere, Little & Rubin
(2002², 3) call the "strategy" of so-called "complete-case analysis" generally
inappropriate.

In the so-called "available case analysis", an analysis is performed only with
the values available for each variable. The problem is that the N is usually
different for each variable. If one of these incomplete variables is related to a
second variable with a different type of missing distribution, pairs of values
that include missings are excluded from the analysis. In bivariate analyses
(e.g. correlation, regression, or t-test), the results are thus based only on the
pairwise common N. Depending on the bivariate analysis, a different
(pairwise) N may form the basis, making it difficult to compare separate
bivariate analyses. Pair-wise deletion of cases is appropriate when the
complete analysis is based on only two variables. With more variables,
multivariate interactions may be ignored and artificial variable relationships
may be derived. Here, too, the number of excluded cases should be kept
within a limit, MCAR must hold, and the remaining complete cases must be
a random sample of all cases.
Especially with small datasets, it is not recommended to delete cases or
variables thoughtlessly. If the MCAR assumption is not certain, the pair- or
list-wise deletion of cases or variables should be avoided. In addition to a
substantial loss of data, the information of the cases remaining in the dataset
without missings may be systematically distorted.
Different analyses (univariate, bivariate, multivariate) of data with missings
often lead to irritation due to the inconsistent default settings for deleting
missings (e.g. pair- vs. list-wise) and the resulting different N in the
individual analyses. If, for example, the same incomplete data is described
once with MEANS and once with EXAMINE, the results are generally
different. The results (e.g. the case numbers) differ because the two SPSS
procedures handle missings differently: EXAMINE, for example, uses
list-wise deletion by default, whereas MEANS uses variable-wise deletion.
The same applies to inferential statistical procedures, which are also preset
differently (regardless of whether they are parametric or non-parametric).
The default setting for statistical and graphic procedures is usually list-wise
deletion, e.g. 2SLS, CATREG, CLUSTER, EXAMINE, FACTOR, GLM,
GGRAPH, GRAPH, MANOVA, QUICK CLUSTER, REGRESSION,
XGRAPH, etc.; in contrast, the default setting for the procedures
CORRELATIONS, NONPAR CORR, or OPTIMAL BINNING is pair-wise
deletion. Thus, if the same incomplete data are subjected to inferential
statistical procedures that handle missings in different ways, the different N
and power also make the comparative interpretation of the obtained results
difficult.
When working with missing data, it is therefore important to ensure that the
treatment of missings (if missing values are present) is consistent both in the
description of the data and in the subsequent inferential statistical analysis. In
other words: missing values are to be treated in the same way in description
and inferential statistics, i.e. they are to be excluded from or included in the
analysis uniformly (e.g. via syntax: MISSING=LISTWISE; via "Options" in
the respective menus: "Exclude cases listwise" under "Missing Values"). In
order to display descriptive parameters for paired measurements, the deletion
of missings should also be standardized, e.g. to list-wise deletion.
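The standardization can be sketched, for example, by explicitly setting the
MISSING subcommand in two procedures that are otherwise preset
differently (the variable names are placeholders):
correlations
 /variables=VAR1 VAR2 VAR3
 /missing=listwise.
nonpar corr
 /variables=VAR1 VAR2 VAR3
 /missing=listwise.
Both analyses are then based on the same N, which makes their results
directly comparable.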
However, a uniform approach does not change the weaknesses of these
strategies. Neither of the two strategies can be clearly preferred over the
other; each has its own advantages and disadvantages.

6.3.2 Technical problems as a cause of missings – deleting completely empty rows
In section 6.3.1., it was assumed that missings are more or less accidental,
i.e. not caused by technical problems. An implicit assumption was also that
missings are limited to a few fields and do not extend to complete data series,
e.g. complete rows of a data table. The situation is somewhat different if
completely or almost completely empty rows of values are present. The very
first task when there is a conspicuously large number of empty data fields
and rows is to get an overview of the pattern and extent of the missing data.
Example: Pattern of data
Let's assume you take over a project from your predecessor and from now on
have to deal with business data delivered daily. A characteristic of this
business data is that sales figures are always available on weekdays (when
the stores are open, except for holidays), and no sales figures are available on
Saturdays and Sundays. For a weekend, by (suboptimal) convention, only
missings are available. You now find out that, for one distribution channel,
the missings fall on Monday and Tuesday and not, as would have been
expected, on Saturday and Sunday. A possible (but not necessary) cause
could be an incorrect merging of the original data in the DWH, or even
before the data reached the DWH. In this case it is recommended to check
the route of the data into the DWH.
Example: Extent of data
A not uncommon cause of completely empty data rows is that, after reading
datasets from a non-SPSS format (e.g. MS Excel), several hundred empty
data rows may still be attached to the actual data part. These missings must
be removed in any case, especially to save memory and to increase
computing speed. In order to be able to distinguish between "technical" and
"non-technical" missings, a little cause analysis is also necessary.
data list list (",")
/VALUE_1 MISS_1 VALUE_2 MISS_2.
begin data
1, 1, 1, 1
1, , 1, 1
1, , 1,
1, , 1, 1
, , ,,
, , ,,
1, 1, 1, 1
end data.
exe.

COMPUTE and SYSMIS can be used, for example, to set the proportion of
system-defined missings (N_MISS) in relation to the number of columns. In
the syntax below there are four variables (columns); therefore, N_MISS is
divided by 4 and the result is stored in MY_PERCENTAGE. In contrast to
the classical error of averaging, this division is permissible because the
denominator is not the number of valid values (in principle a variable) but
the number of permissible columns (in principle a constant).
compute N_MISS = sysmis(VALUE_1) + sysmis(MISS_1)
+ sysmis(VALUE_2) + sysmis(MISS_2).
exe.
compute MY_PERCENTAGE=( N_MISS / 4 )*100 .
exe.
list N_MISS MY_PERCENTAGE.
exe.
First of all, the extent of the missings should be examined. Data rows with
more than 50% missings are clearly doubtful, depending on the type and
relevance of the missing values. Data rows that are completely empty, i.e.
reach the value 100 in MY_PERCENTAGE, are of particular interest.
N_MISS MY_PERCENTAGE
,00 ,00
1,00 25,00
2,00 50,00
1,00 25,00
4,00 100,00
4,00 100,00
,00 ,00
Number of cases read: 7 Number of cases listed: 7
At first sight, only MY_PERCENTAGE values equal to 100 (e.g. if an ID
variable, where present, is also empty) suggest that technical problems could
be the cause, while MY_PERCENTAGE values below 100 could apparently
only have been caused by incomplete statements or incorrect entries. In fact,
a little more precise causal research is needed here.
select if (MY_PERCENTAGE < 100 ).
exe.
The data rows with MY_PERCENTAGE=100 can be filtered out, e.g. with a
simple SELECT IF command (see, however, the supplementary remarks).
However, even if there are strikingly high percentages of missing data,
technical problems (e.g. reading from MS Excel into SPSS) could be the
cause.
Of course, it must be emphasized that technical problems can also occur if
the missing proportions are not conspicuously high. For checking purposes it
is always advisable to carefully compare several data rows with missings
before and after reading them into SPSS, cell by cell; this comparison should
include not only the values, but also the type and format.
Frequent causes are, for example, empty rows already existing in MS Excel,
special user-defined cell formats, or locked data columns. Countermeasures
could be, for example, to delete surplus data rows already in MS Excel, to
define SPSS-compatible cell formats where possible, and to unlock affected
columns. The percentage of purely technical missings is likely to be much
lower and could also be filtered out by a simple SELECT IF command (see
above).
However, before the empty data rows are filtered out, users should pause and
check two aspects:
– Should the dataset to be filtered be merged with other datasets later?
– Does the dataset to be filtered contain a correctly assigned ID variable?

Only if the dataset in question is definitely not to be merged with other
datasets later are the following comments unnecessary. In any other case,
before filtering out the empty data rows, it must be considered whether and
how an ID variable can be taken into account.
If an ID variable exists in the dataset concerned, it must be checked whether
it is sustainable, i.e. whether it can be related in name, content, and format to
the ID variables of the datasets with which the dataset is to be merged. If
necessary, adjustments to the other datasets must be made before data is
deleted.
If there is no ID variable in the dataset concerned, the dataset and the datasets
with which it is to be combined later must first be checked for structural
equality (i.e. at least the same number of rows), with or without missings.
If the number of rows is equal, it must be ensured that the rows also originate
from the same cases (persons etc.). If the sequence and number of data rows
are equal, an ID variable must be assigned to all datasets before filtering out
the empty rows and merging them (see also Schendera, 2005).
compute ID=$CASENUM.
exe.
Only after an ID variable has been assigned may the empty data rows be
deleted. If the data rows were deleted before an ID variable was created, the
structure of the dataset would change and it would no longer be possible to
join datasets row-wise. After merging, it makes no difference where the
empty data rows are located (e.g. in the middle or only at the end of a dataset,
with or without an ID variable). The only important question is whether the
data rows are still completely empty after merging or whether they contain
data from one or more (if not all) merged datasets after all. If, and only if, the
data rows are still completely empty after the merges, deleting the respective
data rows is possible and reasonable.
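A row-wise merge by ID can be sketched, for example, with MATCH FILES
(the filename 'other_data.sav' is a placeholder; both datasets must be sorted
by the key variable):
sort cases by ID.
match files
 /file=*
 /file='other_data.sav'
 /by ID.
exe.
Only after such a merge can it be checked whether the rows that were empty
in one dataset received values from the other.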

6.4 Reconstruction and replacement of missings
There are numerous approaches for the reconstruction and replacement of
missing values (e.g. Little & Rubin, 2002²; Wothke, 1998; Roth, 1994, 539-
547). Missings can be reconstructed using logically correct data and more or
less reliable estimates. The reconstruction approach may have some
advantages over the deletion approach: information is not discarded, further
analysis is based on a complete dataset, and missings in one variable can be
explained by other variables in the dataset, which are probably the cause of
the missings. Provided that the reconstruction itself is plausible, replacing
missings is always better than deleting missings. Nevertheless, in the words
of Little & Rubin (2002², 59), fundamental risks in reconstructing missing
data should be pointed out:
„The idea of imputation is both seductive and dangerous. It is seductive
because it can lull the user into the pleasurable state of believing that
the data are complete after all, and it is dangerous because it lumps
together situations where the problem is sufficiently minor that it can
be legitimately handled in this way and situations where standard
estimators applied to the real and imputed data have substantial
biases.”
In the following, several approaches to the reconstruction of missings will be
presented, including
– the cold deck imputation (6.4.1.),
– the random-based approach (6.4.2.),
– the logical approach (6.4.3.),
– the stereotype-guided approach (6.4.4.),
– the univariate estimation (6.4.5.),
– the hot deck imputation (6.4.6.), and
– the multivariate estimation (EM, regression) (6.4.7.).

All approaches (perhaps with the exception of the random-based ones)
assume the plausibility of the other information in the dataset: if a value in a
source variable used for determining the missing value is wrong, then the
determined substitute value is wrong, too. In practice, several approaches are
often combined with each other to replace missing values with reliable
values (Little & Rubin, 2002², 61; see also there for the division into implicit
and explicit modeling strategies).

6.4.1 Cold deck Imputation


Cold deck imputation is based on a rather crude approach: all missings are
replaced by a constant. Which value is used must be taken from suitable
sources, e.g. a value from a previous or comparable survey. This rather
simple process is called cold deck imputation. The summarizing assessment
of Little & Rubin (2002², 60-61) reads accordingly: „Satisfactory theory for
the analysis of data obtained by cold deck imputation is either obvious or
lacking“. The following example exaggerates a little, replacing each missing
with a 3, to emphasize the questionable nature of this approach.
data list
/VAR1 1-2 VAR2 4-5 VAR3 7-8 VAR4 10-11 .
begin data
   11 65
 5    67 92
 7 15    94
end data.
define COLDECK (!pos=!charend('/')).
!do !i !in (!1).
if sysmis(!i) !i = 3.
!doend
!enddefine.
COLDECK VAR1 VAR2 VAR3 VAR4 /.
list.

VAR1 VAR2 VAR3 VAR4


3 11 65 3
5 3 67 92
7 15 3 94

Number of cases read: 3 Number of cases listed: 3


However, the cold deck approach can be useful in many data holdings,
especially for efficiently filling missings in constants or for roughly filling
so-called local maxima. Local maxima are variables whose values occur
with little variation and predominantly in one form, e.g. zeros (so-called
"zero spikes"); if necessary, the replacement procedure can be adapted to the
requirements and refined accordingly. The so-called hot deck imputation, by
comparison, is somewhat more demanding (6.4.6.).
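If the constant is known in advance (e.g. 0 for filling a zero spike), the
replacement can also be sketched without a macro, using RECODE (the
variable names are placeholders):
recode VAR1 VAR2 VAR3 (sysmis=0).
exe.
This replaces all system-defined missings in the listed variables with the
constant 0 in a single step.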

6.4.2 Random-based approach


The random-based approach is characterized by the use of randomness at the
moment of selecting the values to be imputed. The values themselves,
between which chance chooses, are predefined and not random. Thus the
only requirement of the approach is that the range of values is defined; in the
example these are the values 1, 2, and 3.
In the example, the variable WITH_MISS contains missings and otherwise
only values in the range from 1 to 3, whereas the variable REFERENCE has
no gaps. The missings in WITH_MISS are filled up on the basis of the
entries in the complete variable REFERENCE. For this variant, a random
variable MY_RANDOM with a range between 0 and 1 is created (via
COMPUTE); MY_RANDOM contains values of a randomly generated
uniform distribution. Using DO IF conditions, the WOUT_MISS values 1, 2,
or 3 are assigned, depending on the range of the random values of
MY_RANDOM, for all cases with a REFERENCE value greater than zero
and a system-defined missing in WOUT_MISS.
data list list (",")
/REFERENCE WITH_MISS .
begin data
21, 2,
31, ,
41, 3,
21, ,
23, 2,
12, 1,
11, ,
end data.
exe.
compute WOUT_MISS=WITH_MISS .
exe.
do if (REFERENCE gt 0 & sysmis(WOUT_MISS)).
compute MY_RANDOM=RV.UNIFORM(0,1).
if (MY_RANDOM gt 0 and MY_RANDOM le .333) WOUT_MISS=1.
if (MY_RANDOM gt .333 and MY_RANDOM le .666) WOUT_MISS=2.
if (MY_RANDOM gt .666) WOUT_MISS=3.
end if.
exe.
list.
The MISSING function can also be used to replace user-defined missings.
REFERENCE WITH_MISS WOUT_MISS MY_RANDOM
21,0 2,0 2,0 .
31,0 . 2,0 ,5
41,0 3,0 3,0 .
21,0 . 1,0 ,1
23,0 2,0 2,0 .
12,0 1,0 1,0 .
11,0 . 3,0 ,9
Number of cases read: 7 Number of cases listed: 7
In the following example, both the sampling and the imputed values are
random.
compute WOUT_MISS=WITH_MISS .
exe.
do if (REFERENCE gt 0 & sysmis(WOUT_MISS)).
compute MY_RANDOM=RV.UNIFORM(0,1).
compute MY_REPLACE=rnd(RV.UNIFORM(1,100)).
if (MY_RANDOM gt 0 and MY_RANDOM le .333)
WOUT_MISS=MY_REPLACE .
if (MY_RANDOM gt .333 and MY_RANDOM le .666)
WOUT_MISS=MY_REPLACE .
if (MY_RANDOM gt .666) WOUT_MISS=MY_REPLACE .
end if.
exe.
list.
This variant creates two random variables. MY_RANDOM determines the
moment of the sampling, MY_REPLACE the value to be imputed.
MY_RANDOM contains values of a randomly generated uniform
distribution between 0 and 1, MY_REPLACE contains rounded values of a
randomly generated uniform distribution between 1 and 100.
REFERENCE WITH_MISS WOUT_MISS MY_RANDOM MY_REPLACE
21,0 2,0 2,0 . .
31,0 . 28,0 ,5 28,0
41,0 3,0 3,0 . .
21,0 . 65,0 ,6 65,0
23,0 2,0 2,0 . .
12,0 1,0 1,0 . .
11,0 . 7,0 ,9 7,0
Number of cases read: 7 Number of cases listed: 7
The disadvantages of the random-based approach are, essentially, that the
filled variables are difficult to interpret in terms of content and that the exact
process of replacement is difficult to replicate or reconstruct, precisely
because it is random. In any case, a theory-based, e.g. logical, procedure for
replacing missings would be more helpful (see 6.4.3.).
This chapter on randomly replacing missings shall be concluded with a
multivariate approach that looks very similar to a hot deck approach (6.4.6.).
However, due to its randomness, this approach has special features that make
it advisable to treat it as one of the random-based approaches and to use it
with due caution. This approach is based on the SPSS function RMV
("replace missing values", see 6.4.5. for more details). Using RMV, missing
values can be replaced by estimated values, e.g. by the mean or median.
The approach rests on the assumption that equal sequences of values imply a
sequential determinism: a value at the end of a sequence is determined by the
preceding ones, and a value at the beginning of a sequence can, in the same
sense, be reconstructed from the following values by backward determinism.
Thus, if two rows that correspond perfectly to each other are placed next to
each other, then according to this determinism the gap can be closed simply
by taking the value from the complete row.
First, for comparison purposes, the variable COPY_ITEM4 is created as a
copy of the variable ITEM4 that is to be filled in. Then the data is sorted; the
goal is to sort the data in such a way that rows with ideally (or at least
approximately) the same values lie above or below the incomplete data
series. RMV (here with MEDIAN) is used to calculate the medians of the
values above and below the missing values, which are stored in the variable
REPLACE and inserted into ITEM4 in place of the missing values using an
IF command. The number in brackets indicates how many neighboring
values should be included in the estimate; in the example, one value each is
used from above and below.
data list
/ITEM1 1 ITEM2 3 ITEM3 5 ITEM4 7 .
begin data
1 2 3 5
1 2 4 5
1 2 3 5
4 1 2
4 1 2 3
4 1 3 3
1 2 3
end data.
exe.
compute COPY_ITEM4=ITEM4.
exe.
sort cases by ITEM1 ITEM2 ITEM3 (a) .
exe.
rmv REPLACE=median(ITEM4,1).
compute REPLACE=rnd(REPLACE).
exe.
if SYSMIS(ITEM4) ITEM4=REPLACE .
exe.
list.
ITEM1 ITEM2 ITEM3 ITEM4 COPY_ITEM4 REPLACE
1 2 3 5 5,0 5,0
1 2 3 5 5,0 5,0
1 2 3 5 . 5,0
1 2 4 5 5,0 5,0
4 1 2 4 . 4,0
4 1 2 3 3,0 3,0
4 1 3 3 3,0 3,0
Number of cases read: 7 Number of cases listed: 7
As the output shows, the first replacement value is a good match (see the
third line): the complete row corresponds to the rows above it. The second
substitute value (see the fifth line) is somewhat "speculative"; in view of the
sequence 4-1-2-3 in the neighboring lines, a 3 instead of the imputed 4 would
have been more likely.
Although this approach assumes determinism in principle, it is nevertheless
listed among the chance-based approaches, because in practice chance plays
a role that should not be underestimated. First, there is no guarantee that the
dataset can be sorted so that the rows around a missing value allow a
suitable replacement to be estimated and adopted. If the data can be sorted
accordingly, this is a (favorable) coincidence, not necessarily an inherent
property of the data(set). If missings are located at the very top or bottom of
a dataset, a replacement value cannot be determined for lack of neighboring
values. This, too, would be a coincidence (but this time an unfavorable
one). Whether the theoretical assumption of a sequential determinism can
be applied to an empirical object is thus not necessarily a property of the
object under investigation, but often also just a coincidence.
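For readers who want to trace the RMV logic outside SPSS, the neighbor-based median replacement can be sketched in a few lines of Python. This is an illustrative analogue only, not SPSS code; the function name rmv_median is invented here.

```python
import statistics

def rmv_median(values, span=1):
    """Replace None by the rounded median of up to `span` observed
    neighbors above and below (mimicking RMV MEDIAN(var, span));
    gaps without any observed neighbors are left untouched."""
    result = list(values)
    for i, v in enumerate(values):
        if v is None:
            neighbors = [x for x in values[max(0, i - span):i] if x is not None]
            neighbors += [x for x in values[i + 1:i + 1 + span] if x is not None]
            if neighbors:
                result[i] = round(statistics.median(neighbors))
    return result

# Sorted ITEM4 column from the example above (None = missing):
print(rmv_median([5, 5, None, 5, None, 3, 3]))  # [5, 5, 5, 5, 4, 3, 3]
```

Note that the second gap again receives the "speculative" 4, exactly as in the SPSS output, because the median of its neighbors 5 and 3 is 4.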

6.4.3 Logical approach


The logical approach is based on factual knowledge and logical
connections. If, for example, a person is pregnant, then her sex is female.
Given knowledge of one variable, the logical approach replaces missings in
one or more other variables.
Scenario:
In a dataset the information in the variable "gender" is missing, but there is
information about pregnancy (PREGNANT, YES/NO). If a pregnancy is
present, the missing in GENDER ("gender") can be replaced by the code "F"
(for female).
data list
/ID 1 GENDER 3 (A) PREGNANT 5-8 (A).
begin data
1 M NO
2 . YES
3 M NO
4 . YES
5 F NO
end data.
exe.
if PREGNANT="YES" GENDER="F".
exe.
list.
In this simple example with string variables, the knowledge contained in
the variable PREGNANT helps to fill the gaps in the variable GENDER.
Similarly, first names could be used to infer the gender (it is best to look for
further supporting information here).
The following example is a somewhat more complex application to
logically or temporally consistent data. In a clinical study, volunteers were
asked to indicate after how many minutes they noticed the first effect of an
attention-impairing medication in a self-experiment. Values between 0 and
60 minutes were possible in the variable WOUT_MISS. The variable
WOUT_MISS is gapless, i.e. without missings, because the test persons
were still attentive enough. Later, the test persons were asked again when
they had noticed the first effect of the medication. This time they were only
to indicate, in the categorically scaled variable ORIGINAL, whether the
effect occurred between 0 and 20, 21 and 40, or 41 and 60 minutes; only
codes 1, 2, and 3 were therefore possible in ORIGINAL. Because the test
persons were by now no longer attentive enough, the variable ORIGINAL
is incomplete, i.e. contains missings. The gaps in ORIGINAL resp. its copy
WITH_MISS can be filled in using the data in the complete variable
WOUT_MISS.
compute WITH_MISS=ORIGINAL .
exe.
do if (WOUT_MISS gt 0 and sysmis(WITH_MISS)).
if (WOUT_MISS ge 41 and WOUT_MISS le 60) WITH_MISS=3.
if (WOUT_MISS ge 21 and WOUT_MISS le 40) WITH_MISS=2.
if (WOUT_MISS ge 1 and WOUT_MISS le 20) WITH_MISS=1.
end if.
exe.
list.
COMPUTE creates a copy of the variable ORIGINAL in the form of the
variable WITH_MISS. The further steps are a condition-guided
(conditional) search and replace (via DO IF). For all cases in which
WOUT_MISS is greater than zero and WITH_MISS contains a system-
defined missing (user-defined missings can also be replaced using the
MISSING function), WITH_MISS is assigned the code 1, 2, or 3,
depending on whether WOUT_MISS lies between 1 and 20, 21 and 40, or
41 and 60.
WOUT_MISS ORIGINAL WITH_MISS
21,0 2,0 2,0
31,0 . 2,0
41,0 3,0 3,0
21,0 . 2,0
23,0 2,0 2,0
12,0 1,0 1,0
11,0 . 1,0
Number of cases read: 7 Number of cases listed: 7
The logical procedure must be differentiated from the stereotype-guided
procedure. The disadvantage of the logical procedure is that it can be
complex and presupposes that the other entries in the dataset are plausible
(see Chapter 8). If, for example, the entries in PREGNANT=YES (first
example) or the values in WOUT_MISS (second example) are incorrect,
then the inserted value is logically also incorrect.
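The mechanics of the IF rule above can be mirrored outside SPSS as well. The following minimal Python sketch (with invented field names, purely illustrative) shows the bare logic of rule-based imputation:

```python
# Logical imputation: PREGNANT = "YES" implies GENDER = "F".
records = [
    {"id": 1, "gender": "M",  "pregnant": "NO"},
    {"id": 2, "gender": None, "pregnant": "YES"},
    {"id": 3, "gender": "M",  "pregnant": "NO"},
    {"id": 4, "gender": None, "pregnant": "YES"},
    {"id": 5, "gender": "F",  "pregnant": "NO"},
]

for rec in records:
    if rec["gender"] is None and rec["pregnant"] == "YES":
        rec["gender"] = "F"   # factual knowledge closes the gap

print([rec["gender"] for rec in records])  # ['M', 'F', 'M', 'F', 'F']
```

As in the SPSS example, the quality of the result stands and falls with the plausibility of the variable the rule draws on.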

6.4.4 Stereotype-guided approach


A stereotype-guided approach differs from a logical approach in that the
value used is justified only by the assumed higher probability of one of
several assumptions, not by its actual truth. If, for example, an entry is
missing in the answer field "telephone number at work", one could
conclude that specific population groups are likely involved, such as
housewives, homeworkers, the unemployed, or pensioners; on the other
hand, it is quite possible that the respondents left the field empty because
they have no telephone at their workplace, or cannot or do not want to be
called there. Drawing on further demographic characteristics, e.g. age,
occupation, and sex, could turn a stereotype-guided procedure into a better-
secured approach with a more probable result; the truth value of such a
statement can, however, only ever be approximated.

6.4.5 Univariate estimation


A common procedure is so-called "mean substitution": from the existing
values a parameter is calculated (e.g. the mean for metric data, the mode
for categorical data) and inserted in place of the missings.
Scenario:
In a dataset, some entries are missing in the variable "age" (approx. 1%);
other variables are not used for the estimate. Strictly speaking, the
proportion of missings in the example data below is already too large.
RMV solution:
data list
/ID 1 AGE 3-4 .
begin data
1 05
2 10
3
4 60
5 75
end data.
exe.
rmv MEAN_AGE=MEAN(AGE).
rmv MEAN_LIN=LINT(AGE).
rmv MEAN_MED=MEDIAN(AGE).
exe.
list.
ID AGE MEAN_AGE MEAN_LIN MEAN_MED
1 5 5,0 5,0 5,0
2 10 10,0 10,0 10,0
3 . 37,5 35,0 35,0
4 60 60,0 60,0 60,0
5 75 75,0 75,0 75,0
Number of cases read: 5 Number of cases listed: 5
Via RMV ("replace missing values"), missings in variables can be replaced
by estimated values from different functions, e.g. the mean (MEAN), linear
interpolation (LINT), or the median (MEDIAN). For MEAN and
MEDIAN, RMV does not calculate the replacement from the complete data
series but, as in the example, by default uses only the two values before (5,
10) and after (60, 75) the missing value (LINT: one value before and one
after). If a missing is located at the beginning or end of a dataset and thus
does not have enough neighboring values before and after it, no
replacement value can be determined and inserted.
AGGREGATE solution:
compute STRUCTURE=1.
exe.
save outfile='C:\[Link]'.
compute STRUCTURE=1.
exe.
aggregate
/outfile=*
/break = STRUCTURE
/AGE_MEAN=mean(AGE).
exe.
match files
/file = 'C:\[Link]'
/table =*
/by STRUCTURE .
exe.
if sysmis(AGE) AGE=AGE_MEAN.
exe.
list.

ID AGE STRUCTURE AGE_MEAN


1 5 1,00 37,50
2 10 1,00 37,50
3 37,50 1,00 37,50
4 60 1,00 37,50
5 75 1,00 37,50
Number of cases read: 5 Number of cases listed: 5
This example uses AGGREGATE to determine the average AGE_MEAN
of the AGE values, returns it to the dataset (for which the COMPUTE trick
with the constant STRUCTURE is very helpful), and puts the calculated
AGE_MEAN value in the place of the missings.
The speculative nature of this approach can be seen in the inhomogeneity
of the source data; whether the inserted average is the correct value appears
doubtful without including the information from other variables. Using a
mean based on grouped data might be more appropriate (but this would
require a second, grouping variable).
The univariate calculation and imputation of a parameter is not
recommended for several reasons. The main reason is that the univariate
approach unrealistically excludes other variable interactions (see example).
The variance in the replaced variable and the covariance with other variables
are artificially reduced by this process. Furthermore, an estimated value is
equated with an observed value (uncertainty). Especially for metrically scaled
values with a large range, this approach must be considered speculative and
artificial, because no verification of interactions with other variables was
performed. Since speculative values can already turn out to be
counterproductive in bivariate analyses, this approach should only be used if
no other method is available or the proportion of missing values is really very
small (< 5%).
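Both the RMV and the AGGREGATE variant of mean substitution boil down to the same few arithmetic steps. The following Python sketch reproduces the value 37.5 from the example; it is illustrative only and subject to the same reservations as the SPSS versions above.

```python
# Mean substitution in Python (illustrative; the text advises against
# this approach unless the share of missings is very small).
ages = [5, 10, None, 60, 75]
observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)              # 150 / 4 = 37.5
filled = [mean_age if a is None else a for a in ages]
print(filled)  # [5, 10, 37.5, 60, 75]
```

The sketch also makes the main objection visible: the inserted 37.5 is far from both its neighbors (10 and 60), because the variable is univariately inhomogeneous.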

6.4.6 Multivariate similarity (hot deck imputation)


The hot deck imputation procedure is based on aligning data series with
missings with data series without missings. The hot deck procedure is used,
for example, for the post-processing of survey and census data (e.g. Zajac,
2003; Fay, 2000 for the Census 2000). If the values in the data series are
largely or absolutely identical (i.e. if they are "similar"), the value is taken
from the data series without missing values as a replacement value for the
data series with missing values. For variants of imputation in survey research,
for example, reference can be made to the Census 2000; here, for example, a
distinction was made between assignment (a value within a person),
allocation (a value within the same or a similar household) and substitution
(all values from a similar household) (see Zajac, 2003, vi-vii). Imputation
methods can influence standard errors and must therefore be checked (cf.
Kearney, 2002).
In contrast to probability-based (probabilistic) multivariate estimation, the hot
deck approach can be described as rather deterministic (cf. 6.4.2.): If, in the
hot deck approach, the gap (missing) corresponds to a value in a data series
without missing values, the substitute value is adopted with certainty; in the
probabilistic approach, on the other hand, the substitute value is based on
probabilities.
Scenario:
There are three cases in one dataset. In case 3, a missing appears in the
variable "Item4". The hot deck imputation now first proceeds in such a way
that it searches for a data row that is as similar as possible to the data row of
case 3.
Matrix with missing
Case Item1 Item2 Item3 Item4
1    4     1     2     3
2    1     2     3     5
3    1     2     3
The data row of case 2 is very similar to the value row of case 3. The value
"5" in "Item4" is transferred from case 2 to case 3.
Matrix without missing
Case Item1 Item2 Item3 Item4
1    4     1     2     3
2    1     2     3     5
3    1     2     3     5
Simple example:
The following example should give a first impression of what a hot deck
imputation could look like. This procedure draws on a lot of information
about the available data: for example, it is important that the set of
variables forms a plausible sequence in terms of content (if necessary, filter
variables out via KEEP or DROP). Exactly equal data rows make it
possible to derive the missing value for the last variable, ITEM4; the
variable with the missing values must therefore be at the end of the list.
Also, the dataset must not contain any unique values (e.g. ID variables), as
these would prevent the identification of duplicate data rows. Finally, the
LAG function requests, for the missing in ITEM4, the value that occurs one
row earlier among the duplicate data rows. This program works for
different duplicate data rows, but replaces only one value at a time per
characteristic sequence of values.
data list
/ITEM1 1 ITEM2 3 ITEM3 5 ITEM4 7 .
begin data
1 2 3 5
1 2 3 5
1 2 3 5
4 1 2
4 1 2 3
4 1 2 3
1 2 3
end data.
exe.
sort cases by ITEM1 ITEM2 ITEM3 ITEM4 .
exe.
match files
/file = *
/by ITEM1 ITEM2 ITEM3 ITEM4
/last =MATCH .
exe.
sort cases by ITEM1 (D) ITEM2 (D) ITEM3 (D) ITEM4 (D) MATCH (D) .
exe.
recode MATCH (0=1) (1=0).
exe.
compute MATCH2=lag(MATCH,1).
exe.
do if MATCH2=1.
compute ITEM4=lag(ITEM4,1).
end if.
exe.
list.
Hot deck imputed values (imputed ITEM4 values in brackets)
ITEM1 ITEM2 ITEM3 ITEM4 MATCH MATCH2
4 1 2 3 0 .
4 1 2 3 1 ,00
4 1 2 [3] 0 1,00
1 2 3 5 0 ,00
1 2 3 5 1 ,00
1 2 3 5 1 1,00
1 2 3 [5] 0 1,00
Number of cases read: 7 Number of cases listed: 7
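The core idea of the simple example, namely borrowing ITEM4 from a complete "donor" row with identical values on the remaining items, can also be sketched in Python. This is a simplified illustrative variant that matches on ITEM1 to ITEM3 only, not a replacement for the MATCH FILES/LAG construction above.

```python
# Hot deck sketch: rows with a missing ITEM4 borrow it from a complete
# "donor" row whose ITEM1-ITEM3 values are identical.
rows = [
    (1, 2, 3, 5),
    (1, 2, 3, 5),
    (1, 2, 3, None),
    (4, 1, 2, 3),
    (4, 1, 2, 3),
    (4, 1, 2, None),
]

# Donor pool: last seen complete ITEM4 per ITEM1-ITEM3 combination.
donors = {r[:3]: r[3] for r in rows if r[3] is not None}

imputed = [r if r[3] is not None else r[:3] + (donors.get(r[:3]),)
           for r in rows]
print([r[3] for r in imputed])  # [5, 5, 5, 3, 3, 3]
```

A row without a matching donor keeps its gap, which mirrors the situation discussed above in which no sufficiently similar data row exists.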
Complex example:
The following, somewhat more complex example was kindly provided by
Prof. Van der Weegen (Radboud University Nijmegen, Netherlands). Apart
from an adaptation to the German version of the SPSS dataset "Employee
[Link]" (cf. Schendera, 2007), I made only minor formatting changes.
* (1) Data access.
get file='C:\Employee [Link]'.
* create some missing values.
if range(id,1,40) salary=0.
exe.
rename variables
(id, educ, salary = respnr, stratum, x).
match files file=*
/keep=respnr, stratum, x.
descriptives var=x.
* (2) Data management (especially for reconstruction of the original
sequence).
recode x (missing=1)(else=0) into xnew.
exe.
recode x (missing=1)(else=0) into xmis.
exe.
compute seqnbeg=$casenum.
exe.
sort cases
by stratum, xmis,seqnbeg.
exe.
compute strat=lag(stratum).
exe.
if ($casenum=1 or strat ne stratum) seqnstr=1.
exe.
compute volg=lag(seqnstr).
exe.
do if (missing(seqnstr)).
+ compute seqnstr=volg+1.
end if.
exe.
formats seqnbeg, seqnstr(f7.0).
exe.
* (3) Count the valid codes for X within each stratum.
aggregate
outfile='C:\[Link]'
/presorted
/break=stratum
/stratn=n(stratum)
/stratm=nmiss(x).
match files table='C:\[Link]'
/file=*
/by stratum
/keep=respnr, stratum, x, seqnbeg, seqnstr, xnew, xmis, stratn, stratm.
* (4) If missing, then sequence numbers from stratum.
if (xmis=1) seqnstr=trunc(1+uniform(stratn-stratm)).
* (5) Sorting the sequence numbers.
sort cases by stratum,seqnstr,xnew.
do if (xnew=0).
+ compute xnew=x.
else if (seqnstr=lag(seqnstr) and stratum=lag(stratum)).
+ compute xnew=lag(xnew).
else.
+ compute xnew=$sysmis.
end if.
exe.
* (6) Imputing the results into the original sequence.
sort cases by seqnbeg.
compute x=xnew.
exe.
descriptives var=x.
save outfile='C:\[Link]'
/keep=respnr, stratum, x.
list.
The hot deck imputation has the advantage of being suitable for data from
the nominal scale level upwards. However, hot deck imputation also
involves the danger of a stereotype-guided approach (see 6.4.4.). Its
disadvantage is that the definition of "similarity" must be supplied by the
user via syntax, and that SPSS currently offers no built-in implementation
for this purpose; nevertheless, hot deck imputations are possible in standard
software via syntax (e.g. SAS: McNally, 1997; Iannacchione, 1982; other
commercial software: PRELIS; SOLAS: cf. critically Allison, 2000).
Passing the definition of "similarity" to an analysis program and imputing
the determined values precisely requires a certain amount of planning and
programming effort.

6.4.7 Multivariate estimation


The following estimation procedures are examples from multivariate
statistics. In contrast to the preceding (mathematical, logical) procedures,
these approaches are based on complex (statistical) assumptions about the
procedures themselves as well as the sample.
The approach of internal consistency (6.4.7.1) comes from the field of
questionnaire construction resp. reliability analysis, and can be applied to
1/0 coded data only via syntax control. The two other approaches
(regression, expectation-maximization, see 6.4.7.2) can also be applied to
other scales of measurement. These last two approaches can also be
accessed via the SPSS menu "Missing Values Analysis ...".
The so-called Missing Value Analysis (SPSS procedure MVA) is used, for
example, to detect and describe possible patterns of missing data by
including other variables. Using several approaches (listwise or pairwise
deletion, regression, expectation-maximization), frequencies (pairwise
method only), means, standard deviations, covariances, and correlations
can be determined and inserted in place of the missings. Among other
things, FIML capabilities are implemented in AMOS.
Further approaches are possible depending on the data situation, e.g. by
means of a cluster analysis. Based on complete data, a cluster analysis can
establish the smallest distance between several cases. Missings could then be
replaced by the values of the case that has the smallest distance.
6.4.7.1 Approach of internal consistency
If respondents have ticked three items in a questionnaire but not a fourth
(i.e. a missing), this approach can be used to determine a replacement value
for this gap based on the average response to the other items in the
questionnaire.
This first multivariate approach comes from the field of item construction
resp. reliability analysis and thus requires tau equivalence, exclusively 1/0
coded items and internal consistency (Cronbach's Alpha) of at least 0.7. The
more rows contain the same value, i.e. are consistent in each case, the closer
the alpha approaches the theoretical maximum of 1.0. In the example, the
internal consistency is 0.76. Tau equivalence means that the measurements
are equivalent and also have a comparable measurement accuracy. The
measurement error (error variance) does not vary.
This approach is based on the assumption that the consistency of value series
implies some kind of statistical determinism. A value at any point within a
series is determined by the consistency of all other values. Thus, if the
preconditions of tau-equivalence and internal consistency are fulfilled, a gap
within a row can be closed by inserting the rounded mean value (thus 1 or 0)
of all other values of this row. However, if the initial consistency is not at
least 0.7, then the determined values are not consistent estimates and
consequently unsuitable substitute values.
DATA LIST LIST (",")
/ NUM1 NUM2 NUM3 NUM4 .
begin data
1, 1, 1, 1,
1, 1, 1, 0,
0, 0, 0, 1,
0, 0, 1, 0,
1, 1, 1, ,
0, 0, 1, ,
0, 1, 0, 0,
1, 1, 1, 1,
0, 0, 0, 0,
end data.
exe.
reliability
/variables=NUM1 NUM2 NUM3 NUM4
/scale('NUM1 to NUM4') all
/model=ALPHA.
compute BEFORE=NUM4.
exe.
do if (sysmis(BEFORE)).
compute REPLACE=rnd(mean(NUM1 to NUM4)).
compute NUM4=REPLACE.
end if.
exe.
compute SCALE1=sum(NUM1,NUM2,NUM3,NUM4).
exe.
compute SCALE2=sum(NUM1,NUM2,NUM3,BEFORE).
exe.
list.
list var=SCALE1 SCALE2.
First, Cronbach's Alpha is determined; then a copy of the variable with the
missings is created (BEFORE). In the next step, the replacement value
REPLACE is calculated as the rounded mean of NUM1 to NUM4 and
stored as the NUM4 value wherever a system-defined missing occurs there
(see DO IF). Finally, two scale sums are calculated, one based on the
completed variable (SCALE1) and one based on the copy that still contains
the missings, BEFORE (SCALE2).
SCALE1 SCALE2
4,0 4,0
3,0 3,0
1,0 1,0
1,0 1,0
4,0 3,0
1,0 1,0
1,0 1,0
4,0 4,0
,0 ,0
Number of cases read: 9 Number of cases listed: 9
For this approach, special properties of Cronbach's Alpha have to be
considered. For example, the size of Cronbach's Alpha depends on the
number of items: the fewer items the scale contains, the lower Cronbach's
Alpha; the more items, the higher. Since the determination of reliability is
based on correlations, all requirements for a correlation also apply to
Cronbach's Alpha, e.g. a correlation that is plausible in terms of content, no
negative correlations (they lower alpha), sufficient pairs of measured
values (e.g. at least N=50), and as few missings as possible. After replacing
the missing values, Cronbach's Alpha increases to 0.78.
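The replacement step itself, i.e. the rounded row mean of the 1/0 items, is easy to sketch in Python. This illustrative snippet presupposes, as stated above, tau-equivalence and an alpha of at least 0.7; the sketch does not check these preconditions.

```python
# Rounded row-mean replacement for 1/0 items, as in the DO IF block above.
def impute_row(items):
    """Close a None gap with the rounded mean of the observed items."""
    observed = [x for x in items if x is not None]
    replacement = round(sum(observed) / len(observed))
    return [replacement if x is None else x for x in items]

# Rows 5 and 6 of the example data:
print(impute_row([1, 1, 1, None]))  # [1, 1, 1, 1]
print(impute_row([0, 0, 1, None]))  # [0, 0, 1, 0]
```

These are exactly the replacements reflected in SCALE1 for the two incomplete rows of the example.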
6.4.7.2 Missing Value Analysis (MVA): Expectation-maximization and
regression
The expectation-maximization (EM) approach is based on two steps. In the
E-step, initially predicted values are established using a suitable method (e.g.
linear regression). In the M-step, the predicted values from the E-step are
inserted instead of the missing values and subjected to an iterative estimation
function until the covariance matrices maximally correspond.
In the computationally simpler regression method, an equation is developed
for each variable with missing values, with all other variables as predictors.
Based on the available data, this equation is used to predict the unknown
substitute values for the missings.
For the sake of clarity, both methods are presented in a comparative manner.
Scenario:
In a dataset some entries are missing (approx. 1%); further variables are
used for the estimation. Using regression and expectation-maximization,
the missings are replaced by derived values. In the following, the data from
the hot deck example (6.4.6) is subjected to regression and EM, whereby it
is assumed for simplicity's sake that the data are interval-scaled.
MVA
item1 item2 item3 item4
/EM item4 WITH item1 item2 item3
(TOLERANCE=0.001 CONVERGENCE=0.0001
ITERATIONS=25 OUTFILE="C:\[Link]")
/REGRESSION item4 WITH item1 item2 item3
(TOLERANCE=0.001 FLIMIT=4.0
ADDTYPE=RESIDUAL OUTFILE="C:\[Link]") .
get file="C:\[Link]".
list.
get file="C:\[Link]".
list.
EM values (imputed ITEM4 values in brackets)
ITEM1 ITEM2 ITEM3 ITEM4
1 2 3 5
1 2 3 5
1 2 3 5
4 1 2 [3]
4 1 2 3
4 1 2 3
1 2 3 [5]
Number of cases read: 7 Number of cases listed: 7
Regression values (imputed ITEM4 values in brackets)
ITEM1 ITEM2 ITEM3 ITEM4
1 2 3 5
1 2 3 5
1 2 3 5
4 1 2 [3]
4 1 2 3
4 1 2 3
1 2 3 [5]
Number of cases read: 7 Number of cases listed: 7
The values estimated multivariately by EM and regression correspond to
each other and to the results of the hot deck imputation (see 6.4.6.).
Approaches of multivariate estimation (expectation-maximization, EM) or
the Full Information Maximum Likelihood model (FIML) are preferable to
regression, because they generally include interactions between variables
in the estimation and allow, among other things, adjusting for problems of
uncertainty, variance, and consistency. The individual procedures differ in
procedure-specific biases, such as whether random or error components are
taken into account and how variances and standard errors are estimated,
but also in their requirements, e.g. that the missings are multivariate
normally distributed, which is a problem if the data are not normally
distributed, categorically scaled, etc.
Compared to univariate imputation, regression provides the estimated
value with more uncertainty, but assumes good predictors and linearity of
the model. Even if these are fulfilled and there are no further problems (e.g.
multicollinearity), the regression-analytical approach nevertheless has
disadvantages, namely an a priori correlation of the estimated values with
the other variables (because they are derived from them) and an artificially
reduced variation of the values as well as of their probabilities (whereas, as
estimated values, they should be much less certain).
In addition, Hippel (2004) concludes in an evaluation of SPSS' MVA that
under certain circumstances the procedure may provide biased estimates.
For example, if bivariate normally distributed values are missing at random
(MAR), even mere listwise or pairwise deletion of cases can lead to biased
estimates. It must therefore be expected that the results may be biased,
since this regression variant uses pairwise exclusion.
The point estimates from MVA for means, variances, and covariances may
therefore be biased if the values are missing at random (MAR); in contrast,
the EM variant provides ML point estimates for means, variances, and
covariances and may be used to impute missings. When imputing missings,
none of the MVA variants takes the distribution of residuals into account or
allows the estimation of standard errors. Experience has shown that
imputation of missings by MVA leads to a bias in the form of limited or
underestimated variability/variance. SPSS itself recommends not using EM
values without further verification or modification. For bivariate
approaches, a simple check is to determine the covariances or variances of
the completed datasets both with a correlation approach (e.g.
CORRELATIONS) and with the EM approach: if the covariances or
variances do not differ substantially, no bias is present. Reducing the
resulting, quite considerable effort of alternative approaches, e.g. providing
the EM-imputed values additionally "by hand" with a random variability, is
the goal of further developments and extensions in the field of multiple
imputation approaches.
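To make the regression method concrete, the following pure-Python sketch fits a line of ITEM4 on ITEM1 over the complete cases and predicts the gaps. This is a strong simplification of MVA's /REGRESSION subcommand, which uses all other variables as predictors; the data and variable names follow the example above.

```python
# Pure-Python regression imputation sketch: fit ITEM4 on ITEM1 from the
# complete cases, then predict ITEM4 for the incomplete cases.
complete = [(1, 5), (1, 5), (4, 3), (4, 3)]   # (ITEM1, ITEM4) pairs
n = len(complete)
mean_x = sum(x for x, _ in complete) / n
mean_y = sum(y for _, y in complete) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in complete) /
         sum((x - mean_x) ** 2 for x, _ in complete))
intercept = mean_y - slope * mean_x

def predict(item1):
    """Predicted ITEM4 for a case whose ITEM4 is missing."""
    return intercept + slope * item1

print(round(predict(1), 3), round(predict(4), 3))  # 5.0 3.0
```

The predictions 5 and 3 match the bracketed values in the output above, and they also illustrate the criticism in the text: the imputed values are by construction perfectly correlated with the predictor.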

6.4.8 Conclusion
The deletion of missings is unsafe unless the MCAR assumption is
fulfilled. The approaches of reconstruction ensure complete and valid data,
provided that the respective procedure-specific requirements are met and
appropriate to the subject matter. Among the multivariate procedures, the
EM, hot deck imputation, and FIML methods should be preferred to linear
regression (cf. Little & Rubin, 2002², 19-20). For all reconstructing
approaches, comparative analyses should be performed on datasets with
replaced values and on datasets without missings, especially if the
proportion of replaced values is substantial in relation to the size of the
dataset. If the results differ, the reason should be investigated; if the results
agree, they support the chosen reconstruction procedure.
In principle, the reconstruction of missings is structurally identical to
ensuring the completeness of data. The reconstruction of missings caused by
variables is possible if the variables causing the missings are known in the
data(sets). The reconstruction of missings caused by processes outside of
data(sets) is possible if the systematic processes (e.g. technical defects)
causing missings are known. Similarly, the reconstruction of missings is
always a reactive method. From a proactive point of view, however, the best
solution is generally to avoid the occurrence of missings (e.g. during data
collection or transmission, etc.) as far as possible from the outset.
Missings can be reconstructed. Depending on the approach, reliable
estimates or replacement values are possible on the basis of assumptions
appropriate to the object or cause. There is, however, no universal solution:
depending on the patterns, causes, and assumptions, the appropriate
procedure must be chosen on a well-founded basis. At this point, a
transition to the plausibility of data also sets in: every approach that
replaces missings with plausible values presupposes that the data (and
assumptions) forming the actual basis for the plausibility decision are
themselves plausible.

6.5 Calculating with missings


There are numerous possibilities for performing analyses with missings (cf.
Little & Rubin, 2002², 8-11). These approaches are especially suitable
when replacing missings with imputed values is not possible, for example
because the necessary basic assumptions are not fulfilled, or are fulfilled
only for certain strata, or because the indicators for the type of missing
distribution conceal true values that are meaningful for the analysis. The
first approach describes a procedure for multivariate model specification
which could, if at all, be described as purely formal-quantitative, but which
is highly problematic if the MCAR assumption does not hold. All
subsequent approaches describe possibilities of analysis with missings.

Procedure for multivariate model specification: Multivariate model
specification (e.g. multiple linear or logistic regression) often proceeds this
way: first, delete the cases with missings in the single dependent variable
of the model; then the cases that have missings in all predictors of the
model; then all predictors that contain exclusively missings (i.e. in
principle no data at all); then all predictors with predominantly missings;
then all constants; then all cells with less-than-expected frequencies or
cells with little variation in the measured values; and so on.

Dichotomous coding: A first approach would be to assign one code to all
cases with missings and another to all cases without, and to include this
dichotomous variable as a control variable in the analysis.
This approach assumes that the missings are not distributed across case
groups of extremely different sizes, and that their multivariate distribution
is completely random, which should actually be checked first.
Stratification in analyses: Another possibility would be to assign codes to
cases with missings. The type of distribution of the missings is thus
understood as a way to stratify the data for an analysis and to include it as
a valid group (stratum), e.g. in a multinomial analysis or in a survival
analysis.
However, this approach requires that the missings are not concentrated in
extremely few cells or strata.
Pattern-mixture model: The third approach, the so-called pattern-mixture
model (Hedeker & Gibbons, 1997), summarizes different patterns of
missings in a predictor variable. If this predictor variable is then included
in the desired analysis, it can be checked whether the patterns of missings
have an effect (explanatory content).
The advantage of the pattern-mixture model is that neither MAR nor
MCAR need to be given. The procedure presupposes, however, that the
number of missing patterns resp. variables with missings stands in an
acceptable ratio to the available data volume. The disadvantage of this
procedure is that currently no standard analysis software seems to be
available for it; transferring the patterns of missings to an analysis program
requires a certain amount of programming.

7 Outliers - Identify, Understand and Handle


The inconspicuous outlier problem is considered as old as statistics itself,
since it bears the risk of massively undermining the robustness of statistical
procedures.

Checking the criterion "Outlier" requires that the criteria "completeness",
"uniformity", "duplicates" and "missings" have already been checked and
are in order. The DQ Pyramid considers outliers an advanced topic, as it
goes beyond simple data checking and may involve questioning
expectations (cf. 7.1.).
The results of data analyses can be completely distorted by a few outliers,
e.g. in linear regression, the linear model, designed experiments and time
series analyses (see e.g. Barnett & Lewis, 1994³; Cohen et al., 2003;
Hawkins, 1980; Zumbo & Jennings, 2002; Yaffee & McGee, 2000).
If outliers are present, the mean should not be calculated, because as a
measure of the location of the data it will be distorted. Even the apparently
robust t-test is distorted by outliers. In many multivariate methods, e.g. a
cluster center analysis, outliers should be excluded from the analysis.
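The claim about the mean can be verified in a few lines: a single extreme value shifts the mean drastically, while the median barely moves. A Python sketch with invented numbers:

```python
import statistics

data = [10, 11, 9, 10, 12, 10, 11]
with_outlier = data + [200]          # one outlier among eight values

print(round(statistics.mean(data), 2), statistics.median(data))        # 10.43 10
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 34.125 10.5
```

One value in eight moves the mean from about 10.4 to 34.1, i.e. far outside the range of the remaining data, while the median shifts only from 10 to 10.5.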
The cluster center analysis reacts very sensitively to outliers and
consequently distorts the clusters, because of its starting-value method and
the squaring of the deviations between the cases and the cluster centers that
is based on it. Distributions should therefore be checked for outliers before
an analysis.
In regression analysis, outliers can affect regression coefficients, their
standard errors, the R², and ultimately the validity of the conclusions reached.
In regression analysis, outliers can have two completely different "faces" and
accordingly two diametrical consequences on the estimation of the regression
line:

Outliers can lie crosswise to an actually linear distribution and thus partly
or completely undermine the estimation of such a distribution. In extreme
cases, no useful regression equation can be estimated although linearity is
present. Removing the outliers enables an optimized estimation of the
linear relationship.

Outliers may happen to be arranged linearly and suggest the presence of a
linear distribution, while the remaining data are actually diffuse, i.e.
distributed like a point cloud. The linearity is then produced by a few
outliers and not by the majority of the data. The result of such an
estimation is that a few linearly arranged outliers suffice to simulate
linearity or to conceal a missing correlation. In extreme cases, a regression
equation is estimated even though there is no linearity. Removing the
outliers makes it possible to determine that no linearity is present; a
seemingly plausible regression equation is avoided.

In both variants, very few outliers (e.g. as few as 4-5 per 1,000 values)
can be sufficient to completely distort the estimate of the actual
distribution (linear or not), all the more so, of course, if the ratio between outliers
and the remaining data is less favorable. Incidentally, during
successive checking and removal of outliers it can happen that the actual distribution is
not yet recognizable (graphically, at least with simple linear regressions) and
that the data appear nonlinear at first and only turn out to be linear after
the removal of further outliers. The opposite case can also occur.
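The distorting effect described above can be sketched numerically. The following plain-Python toy example (not from the book; the data and function name are hypothetical) fits an ordinary least-squares slope once on clean linear data and once after adding three outliers lying crosswise to the trend:

```python
# Illustrative sketch (not from the book): how a handful of outliers can
# pull a least-squares fit away from the true relation. Pure-Python OLS.

def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# A clean linear relation y = 2x ...
x = list(range(1, 21))
y = [2 * xi for xi in x]
print(ols_slope(x, y))           # exactly 2.0

# ... undermined by three outliers lying "crosswise" to the trend:
# high y-values at low x-values.
x_out = x + [2, 3, 4]
y_out = y + [40, 38, 36]
print(ols_slope(x_out, y_out))   # noticeably below 2
```

Three contaminated points out of 23 are enough to pull the estimated slope far away from the true value of 2, which is the first of the two "faces" described above.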
The proven negative influence potential of outliers is, however, contrasted by
the strange fact that many of the German and English standard works on
research methods and statistics, if they use the term "outlier" at all,
in no case mention how to handle them (see Bortz & Döring, 1995², 28;
Bortz, 1993⁴, 198; Diehl & Arbinger, 2001³; Diehl & Kohr, 1991²; Keppel &
Wickens, 2004⁴, 145; Litz, 2000, 113; Neter, Wasserman & Whitmore,
1988³, 83; Roth et al., 1999⁵, 553; Schnell et al., 1999⁶, etc.). Cleveland
(1993), for example, demonstrated that even a classical dataset, which had been
repeatedly analyzed and published, contained undetected massive errors.
But what are outliers? Barnett and Lewis (1994³, 7), for example, give the
following definition: “We shall define an outlier in a set of data to be an
observation (or subset of observations) which appears to be inconsistent with
the remainder of that set of data.”
The indeterminate formulation "appears to be inconsistent" emphasizes on
the one hand the user's discretion in deciding whether to call a value
an outlier (or not), but on the other hand also the necessity of thinking about
what distinguishes an outlier. Thus outliers are not necessarily
wrong or inaccurately captured values; they may also be values which
are correct and exact, but contrary to expectation. The former suggests
checking the measurement process, the latter suggests checking
the theory formation. This definition also does not exclude a "smooth
transition" between real outliers and normal data.

7.1 Characteristics of outliers


Outliers have several dimensions, which can occur in combination or alone:

- Outliers can occur univariate and multivariate (syn.: high-dimensional).
- Outliers can be semantically (qualitative) or formally (quantitative) striking.
- Outliers can occur in only one case, but also in certain groups.
- Outliers can occur only sporadically, but also massively.
- Outliers can be relative to the amount of data (sample size).
- Outliers can have different causes.

Outliers can therefore have several faces: Outliers can appear univariate-
qualitatively as e.g. a single value caused by the incorrect recording of a
clinical diagnosis, e.g. "hormone therapy" instead of "homeopathy".
However, outliers can also occur as (several) multivariate quantitative
outliers, caused e.g. by the simultaneous incorrect logging of several
variables. Such a case occurs, for example, when data acquisition via several
wireless ECG probes is affected by interferences in mobile communications.

7.1.1 The perspective is also decisive ("Frames")


Univariate outliers are conspicuously high (or, depending on the object, also
conspicuously low) values in a single variable, e.g. the highest water level at
high water. Such a single value simply falls out of the frame of the "usual"
values. The terms "frame" and "usual" are intentionally placed in
quotation marks; the following explanations will clarify why. A "frame"
denotes the scope (or for interval-scaled data: the range) for which certain
(qualitative) events or (quantitative) values are expected. If an event or
value lies within this "frame", it is considered "usual". If an event or value
lies outside this "frame", it is considered "unusual", i.e. an outlier. Barnett &
Lewis (1994, 4-7), for example, provide an informative example for the
interpretation of above-average pregnancy duration in humans. The "usual"
"frame" (range) extends to 48 weeks (mean value: approx. 40 weeks). Thus,
pregnancies up to 48 weeks were considered "normal". Higher values, e.g. up
to 50 weeks long pregnancies, were often interpreted as indications of
adultery (and thus as a reason for divorce) due to their deviation from "usual"
values and were only accepted in later years (also judicially) as outliers of an
empirically possible variability. What is interesting about this example is that
a court also tried to determine the maximum duration of a "valid" pregnancy.
This example illustrates that the term "outlier" is always relative to the
respective spatio-temporally situated expectations ("frame"), which do not
necessarily always coincide with "empirical normality" (perhaps better:
variability). To complicate matters, this "frame" can change, just as
empirical normality can also change (and not necessarily slowly).
When assessing events or values, such as the assessment of very long
pregnancies (see above), the associated "frame" plays a major role. A
changed viewing angle can therefore also lead to seeing apparent
abnormalities with different eyes, namely as something normal. A flood record (e.g.
10.69 m, Rhine near Cologne, Germany, 1995) is of course different from the
other water levels of a year, which can fall to just under 2.5 m or below
during low water. In relation to the annual data as a "frame", this water level
is necessarily an outlier. However, the 1995 flood record does not necessarily
differ much from other flood records: the highest water level of
the Rhine (near Cologne) in 1993 was almost identical at 10.63 m ("flood of
the century"; data source: Hochwasserschutzzentrale Köln [Flood Protection
Centre, Cologne, Germany]). A changed interpretation framework can be
sufficient to transform conspicuous outliers into inconspicuous normality.
The trick is probably also to be able to depart from one's own expectations.
Outliers can also be signs of change. The closer examination of outliers (in
this case conspicuously low values) led, for example, to climatologists
discovering the ozone hole over Antarctica.

In 1957, scientists began measuring the ozone over Antarctica. The ozone
concentrations were expected to follow a regular seasonal pattern, which they
did for over 20 years.

Afterwards, first deviations were found: every spring the ozone layer was
weaker than in the spring before. Initially the new measurement results were
interpreted and published only as unexpectedly low values, as more or less
easily explainable outliers. In 1984 it was finally clear that the Antarctic
stratosphere was gradually changing.

The values initially interpreted as single outliers were precursors of an
altered development of the ozone concentration.

Notes: For better readability the scatter plots have been provided with a
uniformly scaled x- and y-axis. In addition, they contain a reference line in
1979 and a reference line for the value (303) of that year. I owe the reference
to this example to Prof. Stephen G. West (New York, pers. communication
05.09.2006). The data itself I owe to Dr. Jonathan D. Shanklin, Head of
Meteorology & Ozone Monitoring Unit, British Antarctic Survey,
Cambridge, England. The data from the Halley monitoring station reflect the
mean annual total ozone value and are corrected according to Bass-Paur.
All measurements were performed with a Dobson ozone spectrophotometer.
As a fourth variant there are those events or values which are beyond all
measure and comparison; this does not necessarily mean that one wants to
exclude certain events or values empirically. The reason is rather of a
psychological nature: one simply does not want to think of certain events or
values. So here, too, "frames" play a role, but they are to be sought in human
nature. Just think of the number of human lives lost in the Indian Ocean
seaquake of 26 December 2004 as a sad "record". The latest estimates of the
number of victims are around 232,050 human lives (June 2005).
As these examples show, a conspicuously high value is not always an error,
but can just as well be an accurate reflection of empirical reality without fitting
into a series or "frame". This tragedy is also an example of the fact that
outliers can also occur in only one case (so far). Examples of outliers in the
form of a group (or several groups) would be if, e.g. a globally operating
company achieved its peak earnings in only a few Western industrialized
countries, or if e.g. the death rate from AIDS was concentrated in certain
geographical regions.

7.1.2 Univariate or/and multivariate


A distinction is made between univariate and multivariate outliers.
Univariate outliers are extreme values in a single variable, e.g. the highest
water level of a flood. Multivariate outliers are value combinations of
several "qualitative" variables, whereby the values of the individual
variables may be inconspicuous, but in combination are unusual, e.g. a 14-
year-old girl with an annual income of 130,000 € and three children aged 10
to 17 years. Taken individually, the data "female", "age: 14 years", "annual
income: 130,000 €", "own children: 3" are absolutely inconspicuous. These
data show their true face as outliers only in their combination. This last
example should also make it clear that outliers do not necessarily stand out
due to quantitatively high values, but also due to special semantic
characteristics (e.g. "girl with children older than herself", "pregnant men" or
even "old children"). Such semantic "qualities" only show up when you look
for them. Multivariate outliers with exclusively quantitative dimensions may
be even more difficult to detect. Several outliers in a dependent variable
(y-dimension) can, for example, arise simultaneously in completely
different ways: one outlier may arise because a single
measurement is completely faulty, another because several smaller measurement errors
systematically add up and cumulate in their effect on the y-value.
The special problem of the usual methods for outlier analysis is that they
rarely work theory-based (qualitative), but predominantly formally
(quantitative). For this reason, the checking of qualitative (but not necessarily
implausible) outliers is primarily treated in the section on plausibility
(Chapter 8 up to and including 8.2.2). The explanations of the following section therefore
refer exclusively to the checking of formal (quantitative) outliers; a more
sophisticated, genuinely multivariate approach to the purely formal
examination for so-called "anomalies" is presented in section 8.2.3.

7.1.3 The data is to blame: Which data?


Outliers can occur only sporadically, but also massively. Depending on the
object or research context, the proportion of outliers can range from 0 to 20%
(e.g. Hampel et al., 2005). The higher the proportion of outliers, the more
likely it is that false outliers (e.g. typing or transcription errors) must be assumed
instead of correct outliers. An example of single outlier values would be the
seaquake example (real value, see 7.1.1.) or the protocol example ("hormone
therapy" instead of "homeopathy", wrong value, see 7.1.). The ECG example
could be interpreted as an example of a large proportion of outliers (errors). A
predominant share of outliers in terms of correct values should therefore be
interpreted with caution. It may not be the large proportion of outliers that
could be wrong, but perhaps the interpretation frame, e.g. the increasing
number of “outliers” in the ozone hole example. The term outliers was put in
quotation marks here because, on closer examination, it was no longer
individual outliers but an unexpected but nevertheless consistent course of a
time series.
In technical terms, it can be formulated that outliers are not model-invariant:
Outliers in one model are not necessarily always outliers in another. An
outlier in one application context is not necessarily an outlier in another
context (see Barnett & Lewis, 1994³, 271, 298). The occurrence of
outliers (and partly their cause) is always also relative to the size of the data
volume, i.e. the ratio of the sample to the population. The smaller (and the less
representative) a sample is, the more likely outliers are to deviate from the
rest of the available values, e.g. because, due to too few values, gaps occur in
the data of an otherwise empirically valid variability of measured values. The
larger (independent of representativeness) a dataset is, the more susceptible it
is to outliers in terms of measurement or transcription errors.
Outliers can be errors or a reflection of empirical reality.
Thus, outliers can either be an indication of suboptimal data quality or of
interesting (i.e. also: contrary to expectations) empirical phenomena. A
problem with outliers is therefore to reliably distinguish between data errors
and "real" outliers (as e.g. in the flood example). Not every value that is
formally noticeable is automatically equally wrong. The identification of
outliers generally requires the reliability (correctness, plausibility) of the
other variables in the dataset.
Thus, the verification of outliers is at the same time also the verification of
the semantic plausibility of the data in general. Therefore, in principle, this
check can only be performed by someone with expertise in the field. Only
experts (e.g. medical professionals) can find implausibilities in specific (e.g.
medical) data. For persons without or with limited expertise, these are not
necessarily recognizable. (Medical) expertise can, however, be incorporated
into standardized (e.g. automatic) checking rules (syn.: relational schemes,
constraints), which are defined together with specialists to ensure data
quality.

7.2 Univariate Outliers


The following applies to all the following measures and tests: Not every
value that is formally noticeable is automatically wrong. There is no omnibus
measure or procedure for the identification of univariate outliers. The choice
of a measure must be made with all due care in order to avoid suspicion of
arbitrariness (see Barnett & Lewis, 1994³, 271-272). Section 7.2.1. introduces
the identification of outliers by measures (including a digression on robust
estimators, see 7.2.2.), section 7.2.3. on rules, 7.2.4. on tests, and 7.2.5. on
diagrams.

7.2.1 Identification via measures


Univariate quantitative outliers are strikingly high or low values (extreme
values) of the distribution of a single variable. For univariate as well as
multivariate checks, the COUNT function can be used exploratively. COUNT
counts the occurrences of specific values (here: 0, 1, 8, 9) or of values
outside a value range, e.g. defined using LOWEST and HIGHEST.
data list free
/ID var1 to var20 .
begin data
1 0 5 3 4 5 6 7 3 4 5 6 7 3 4 5 6 7 3 4 5
2 5 6 7 3 4 5 6 7 3 4 5 7 3 4 5 6 8 8 9 9
3 5 3 1 6 4 2 8 0 4 5 3 2 4 3 5 6 7 8 9 1
4 1 2 4 3 6 5 7 8 9 0 4 3 5 6 7 8 9 0 1 4
end data.
count OUTLIER_1=var1 to var20 (0,1,8,9) .
count OUTLIER_2=var1 to var20 (lowest thru 1, 8 thru highest) .
exe.
format ID var1 to var20 OUTLIER_1 OUTLIER_2 (F2.0) .
list variables=ID OUTLIER_1 OUTLIER_2.
ID OUTLIER_1 OUTLIER_2
1 1 1
2 4 4
3 6 6
4 8 8
Number of cases read: 4 Number of cases listed: 4
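For comparison, the same explorative count can be sketched outside SPSS, e.g. in Python. This is a hypothetical re-implementation (not part of SPSS) that reproduces the listing above:

```python
# Hypothetical Python counterpart to the SPSS COUNT step above: per case,
# count values that hit a value set (0, 1, 8, 9) or fall outside 2..7.

cases = {
    1: [0, 5, 3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 3, 4, 5],
    2: [5, 6, 7, 3, 4, 5, 6, 7, 3, 4, 5, 7, 3, 4, 5, 6, 8, 8, 9, 9],
    3: [5, 3, 1, 6, 4, 2, 8, 0, 4, 5, 3, 2, 4, 3, 5, 6, 7, 8, 9, 1],
    4: [1, 2, 4, 3, 6, 5, 7, 8, 9, 0, 4, 3, 5, 6, 7, 8, 9, 0, 1, 4],
}

def count_values(values, hits=frozenset({0, 1, 8, 9})):
    """Analogous to COUNT ... (0,1,8,9)."""
    return sum(v in hits for v in values)

def count_outside(values, low=2, high=7):
    """Analogous to COUNT ... (LOWEST THRU 1, 8 THRU HIGHEST)."""
    return sum(v < low or v > high for v in values)

for case_id, values in cases.items():
    print(case_id, count_values(values), count_outside(values))
```

The printed counts per case (1, 4, 6, 8) match the OUTLIER_1 and OUTLIER_2 columns of the SPSS listing.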
For further (univariate) checking for outliers, especially their effect, the usual
dispersion measures can be used: Range, quartile distance, the mean or
median absolute deviation from the median, variance, standard deviation and
coefficient of variation (see also Schendera, 2004).
Range R
The range R (also: variation width V) is determined by the width of the
dispersion range, more precisely: by the largest and smallest value of a
distribution.
R = xmax – xmin
Since R is determined solely by the two most extreme values of a distribution,
a single outlier is sufficient to distort
this measure of dispersion considerably. Conspicuously high R-values are
indications that outliers are present, especially if several series of measured
values with other dispersion widths are available for comparison.
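As a quick toy sketch (not from the book; the data are invented), the range reacts to a single outlier in the most drastic way possible:

```python
# Sketch: the range R = x_max - x_min is wrecked by a single outlier.

def value_range(xs):
    """Range (variation width) of a sequence of values."""
    return max(xs) - min(xs)

clean = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]
print(value_range(clean))            # about 0.4

with_outlier = clean + [50.0]        # one hypothetical transcription error
print(value_range(with_outlier))     # about 45.2
```

One contaminated value inflates R by a factor of more than 100, which is why conspicuously high R-values should always trigger an outlier check.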
Quartile distance, Q1, and Q3
Quartiles and the quartile distance also inform about the dispersion width.
Q1 and Q3 can also be distorted by outliers, although far less easily than the
range. Q1 indicates the limit of the first quarter (25% limit), Q3 the limit of
the third quarter (75% limit). The quartile distance Q3 - Q1 provides the width
of the area where about half of all observations are located.
The ratio of Q1 to Q3 can thus also give an indication of outliers. Range and
quartiles only provide information about the dispersion range, but not about
the extent of the dispersion.
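A toy sketch of this contrast (not from the book; it assumes Python 3.8+ and uses the statistics module's default "exclusive" quantile method, which differs slightly from SPSS's percentile definitions):

```python
# Sketch: quartiles and the quartile distance (IQR) react far less to a
# single outlier than the range does.
import statistics

def iqr(xs):
    """Quartile distance Q3 - Q1 (exclusive quantiles)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
dirty = clean[:-1] + [1000]   # replace the maximum by a gross outlier

print(iqr(clean), max(clean) - min(clean))   # IQR vs. range, clean data
print(iqr(dirty), max(dirty) - min(dirty))   # IQR unchanged, range explodes
```

With this data, the quartile distance stays at 5.5 in both runs, while the range jumps from 9 to 999.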
Mean absolute deviation from the median
The mean absolute deviation from the median (so-called 'MAD') measures
the dispersion based on the distances of the individual values from the
median. The sum of these distances is divided by the number of values. The
median is used as the reference value. If the number of measured values is even,
the median is the average of the two middle values. When using the median,
the following applies:

MAD = (1/n) · Σ |xi - median(x)|
The MAD statistic can be calculated in SPSS using the RATIO
STATISTICS procedure with the AAD option. Outliers can also distort this
measure of dispersion; distributions with high MAD values should be
checked for outliers. Up to and including the MAD measure, all dispersion
parameters presented are based on a frequency distribution, with the absolute
difference as distance measure. All following measures of dispersion are
based on the arithmetic mean of a measurement series, with the squared
distance as distance measure.
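The measure just described can be sketched in a few lines of Python (an illustrative stand-in for the RATIO STATISTICS AAD option; the data are invented):

```python
# Sketch: mean absolute deviation of the values from their median.
import statistics

def mad_from_median(xs):
    """Mean absolute deviation from the median, as described in the text."""
    med = statistics.median(xs)
    return sum(abs(x - med) for x in xs) / len(xs)

print(mad_from_median([2, 4, 4, 4, 6]))     # median 4 -> (2+0+0+0+2)/5 = 0.8
print(mad_from_median([2, 4, 4, 4, 600]))   # one outlier inflates the measure
```

Note that the reference value (the median) barely moves, but the averaged distances do, so a conspicuously high MAD is a useful outlier signal.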
Variance
The variance is based on the deviations of the measured values, here e.g. from the
mean value. For each measured value there is a corresponding deviation. A
deviation is positive if the measured value lies above the sample mean and negative if
it lies below. The sum of all deviations from the mean value is necessarily zero.
The variance s² is therefore the sum of all squared distances of the respective
measured values from the mean value, divided by the number of measured
values reduced by 1. The greater the variability around the mean value in the
dataset, the greater the variance. The squaring is done primarily to prevent
the mutual neutralization of positive and negative deviations. However, outliers
produce outlier-sized deviations and can distort the variance due to this
weighting, especially if several outliers occur in the data. Before interpreting
an (inconspicuous) variance, the distribution should therefore be checked for
outliers. Strikingly high variances may be distorted by outliers and should be
checked.

For the interpretation or comparison of different variances, please refer to the
explanations on standard deviations.
Standard deviation
The standard deviation (also called dispersion) is usually derived from the
variance. The standard deviation is the positive root of the variance and thus,
in contrast to the variance, has again the same dimension of the data from
which it is calculated. Again, the larger the variability around the mean, the
larger the standard deviation. The fewer extreme values occur in a dataset, the
smaller the standard deviation.

A standard deviation cannot be assessed directly; recourse to further
information or transformations is necessary. The mean value provides the
most important additional information; in addition, the empirical or
theoretically possible range of the measurements is informative. The
comparison of several standard deviations must always include the respective
mean values. However, different standard deviations are only exceptionally
based on identical mean values, so they can be compared directly with
each other only in the rarest cases. Standard deviations (also:
variances) can be compared with each other even if the mean values differ,
if the data were previously subjected to a z-transformation.
Strikingly high (z-standardized) standard deviations can be distorted by
outliers and should be checked. Another measure for comparing two
distributions based on the standard deviation is the coefficient of variation.
Coefficient of variation
The standard deviation is a measure for the absolute variability within a data
range. The relative variability, however, is a more significant measure and is
expressed by the coefficient of variation. The coefficient of variation (CV;
sometimes also called variability coefficient V) is a simple measure for the
direct comparison of two distributions. The CV is based on the relativization
of the standard deviation of a sample to the respective mean value. For the
variation coefficient, the standard deviation is inserted into the numerator, the
arithmetic mean into the denominator, and the result is multiplied by 100 (some CV
formulas do not include the multiplication):

CV = (s / x̄) · 100, resp. CV = s / x̄
The higher the CV, the greater the spread. High CV values are indications
that the distribution is distorted by outliers (especially in comparison with
other measurement series). In contrast to the standard deviation as a measure
for the absolute variability, CV indicates the relative variability within a data
range (see also Schendera, 2004). The coefficient of variation should only be
used for variables that contain only positive values. The CV cannot be
calculated for a mean value equal to zero.
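The definition above can be sketched as follows (an illustrative toy example, not from the book; both measurement series are invented):

```python
# Sketch: coefficient of variation CV = (s / mean) * 100 for comparing the
# relative spread of two measurement series on different scales.
import statistics

def cv(xs):
    """Coefficient of variation in percent; undefined for a zero mean."""
    m = statistics.mean(xs)
    if m == 0:
        raise ValueError("CV cannot be calculated for a mean of zero")
    return statistics.stdev(xs) / m * 100

series_a = [10, 11, 9, 10, 10]        # narrow spread around 10
series_b = [100, 140, 60, 100, 100]   # wider relative spread around 100
print(round(cv(series_a), 1))
print(round(cv(series_b), 1))
```

Although series_b has a ten times larger mean, the CV makes the two spreads directly comparable: series_b varies about four times as strongly relative to its mean.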

7.2.2 Identification via rules


Statistics provides some so-called rules for the evaluation of outliers, which
are, however, sometimes discussed quite critically (see Barnett & Lewis,
1994³).
Outliers can be identified e.g. by means of "confidence intervals". If a value
is outside this interval, it is eliminated as an outlier. Common ranges are e.g.
the median +/- 4 MAD, the mean +/- 2 sigma, or the so-called α%-trimmed
mean. For the α%-trimmed mean, the values of a variable
containing the outliers are sorted by size and then the middle
(100 - 2α)% of the values are averaged. Thus, with α = 20, the middle 60%
(100 - 2·20) of the sorted values are used for the calculation. This value can
be called the 20%-trimmed mean. Some methods are based on such trimmed distributions.
The Moses test for extreme reactions is based, for example, on a control
group trimmed for outliers.
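The trimming rule just described can be sketched in Python (an illustrative toy example, not from the book; data and function name are invented):

```python
# Sketch of the alpha%-trimmed mean: sort, cut alpha% of the values at each
# end, then average the middle (100 - 2*alpha)% of the values.

def trimmed_mean(xs, alpha):
    """alpha in percent, e.g. 20 -> average of the middle 60% of values."""
    xs = sorted(xs)
    k = int(len(xs) * alpha / 100)   # number of values cut at each end
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

data = [3, 4, 5, 5, 6, 6, 7, 8, 9, 500]   # one gross outlier
print(sum(data) / len(data))              # ordinary mean, badly distorted
print(trimmed_mean(data, 20))             # 20%-trimmed mean, close to the bulk
```

Here the ordinary mean (55.3) is pulled far away from the bulk of the data by a single value, while the 20%-trimmed mean (about 6.2) stays representative.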
The breakdown point concept (BP, see e.g. Hampel, 1985,
1971) is also considered promising as a global quantitative index of
robustness. The BP gives the largest proportion of outliers ("noise") in a
sample that an estimator can tolerate without breaking down.
The breakdown point is, e.g., BP = 0 for the mean, BP = 0.5 for the median,
BP = 0.25 for the interquartile range, and BP = α for the α%-trimmed mean.
A method for the calculation of the breakdown point is currently not implemented in SPSS
(but see the remarks on robust M-estimators). M-estimators have a breakdown point of
BP = 0.5 (if the respective requirements are fulfilled).
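The breakdown-point idea can be sketched with a minimal contamination experiment (an illustrative toy example, not from the book):

```python
# Sketch of the breakdown point: the mean (BP = 0) can be moved arbitrarily
# far by a single contaminated value, the median (BP = 0.5) cannot.
import statistics

clean = list(range(1, 12))            # values 1 .. 11
dirty = clean[:-1] + [10**6]          # replace one value by gross noise

print(statistics.mean(clean), statistics.median(clean))
print(statistics.mean(dirty), statistics.median(dirty))
```

One contaminated value out of eleven drags the mean into the tens of thousands, while the median does not move at all, which is exactly what BP = 0 versus BP = 0.5 expresses.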

7.2.3 Identification via tests


Statistics also provides some so-called outlier tests. However, their
unreflected application is extremely problematic. Less so because a
conspicuous value does not necessarily have to be a wrong value, but mainly
because even non-significant outlier tests do not exclude the occurrence of
outliers with certainty, but only with a certain probability. Here it is clearly
true: "The use of outlier tests is no substitute for conscientious handling of
data and plausibility checks" (Rasch et al., 1996, 571). Therefore, outlier tests
are, if at all, only to be used exploratively, e.g. to find first hints for possible
data or survey errors.
Hartung (1999¹², 343-347), for example, presents several univariate tests: the
David-Hartley-Pearson test, the Grubbs test, and the outlier tests
according to Dixon. The David-Hartley-Pearson test (often abbreviated to
David test) will be presented here as representative (see also Barnett & Lewis,
1994³).

The test value Q is calculated as the quotient of range R and standard
deviation s (Q = R / s) and compared with a tabulated critical value Qn. The David-Hartley-Pearson
test tests the null hypothesis that the (upper or lower) extreme values of the
distribution belong to the sample (i.e. are not outliers). If Q < Qn, the null
hypothesis cannot be rejected; the extreme values then belong to the sample,
i.e. they are no outliers, and the distribution can be considered
outlier-free. Like the Grubbs test and the Dixon outlier tests, the David-
Hartley-Pearson test assumes that the measured values to be tested are
realizations from a normal distribution. Since the David-Hartley-Pearson test
is not implemented in SPSS, some critical Qn values of the test shall be given here.
The data are taken from Hartung (1999¹², 344).
N     Qn, 0.95   Qn, 0.99
3     2.00       2.00
4     2.43       2.45
5     2.75       2.80
6     3.01       3.10
7     3.22       3.34
8     3.40       3.54
9     3.55       3.72
10    3.69       3.88
12    3.91       4.13
15    4.17       4.43
20    4.49       4.79
30    4.89       5.25
40    5.15       5.54
50    5.35       5.77
100   5.90       6.36
To repeat it once more in conclusion: Neither significant nor non-significant
outlier tests replace careful plausibility checks.
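As an illustrative sketch (not from the book, and strictly for exploration, in line with the warning above), the test statistic Q = R / s can be computed in a few lines and compared with the tabulated critical values; the sample below is invented and contains one suspiciously low and one suspiciously high value:

```python
# Hedged sketch of the David-Hartley-Pearson test statistic Q = R / s,
# compared against the tabulated critical values (here only n = 10).
import statistics

def david_q(xs):
    """Test statistic Q = range / sample standard deviation."""
    return (max(xs) - min(xs)) / statistics.stdev(xs)

Q_CRIT = {10: (3.69, 3.88)}   # (Qn,0.95, Qn,0.99) from the table above

# Invented sample: eight values near 5.0 plus one low and one high suspect.
sample = [1.2, 5.0, 5.1, 4.9, 5.0, 5.1, 4.9, 5.0, 5.0, 8.8]
q = david_q(sample)
q95, q99 = Q_CRIT[len(sample)]
print(round(q, 2), q > q95, q > q99)
```

Here Q is about 4.24, above even the 0.99 critical value of 3.88, so the null hypothesis of an outlier-free sample would be rejected; whether the two extreme values are errors or correct-but-unexpected values still has to be decided by a plausibility check.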

7.2.4 Identification via diagrams


Statistics provides different diagram types for the graphical exploration of
outliers. Boxplots, error bar charts, histograms (dichotomously grouped: the
so-called "butterfly plot"), and stem-and-leaf plots are the
standard diagram types implemented in SPSS for univariate exploration.
With a little trick, SPSS also provides quite illustrative "one-dimensional"
scatter plots.
(Univariate) frequency and dot plots are less suitable for the graphical
exploration of outliers: neither type of diagram shows the magnitude of the
"outlier", only the number of values; for the dot plot, moreover, the values
must first be sorted.
The graphical exploration of outliers and their (distorting) effect on the
diagram parameters is presented below using the areas in square kilometers
(variable: SQKM) of natural lakes in Germany (variable: LAKE). It is
pointed out in advance that the conspicuous value is a reflection of a really
existing entity and not an error (data source: LAWA,
Länderarbeitsgemeinschaft Wasser; German Working Group on water issues
of the Federal States and the Federal Government, represented by the Federal
Environment Ministry; date: 01.06.2003). The variable DUMMY (strictly
speaking a constant) is read in to "trick" SPSS into the exploration of a single
variable using syntax.
data list
/LAKE 1-17 (A) SQKM 19-24 DUMMY 26 .
begin data
Bodensee           571.5 1
Müritz             109.2 1
Chiemsee            79.9 1
Schweriner See      61.5 1
Starnberger See     56.4 1
Ammersee            46.6 1
Plauer See          38.4 1
Kummerower See      32.5 1
Steinhuder Meer     29.1 1
Großer Plöner See   30.0 1
Schaalsee           22.8 1
Selenter See        22.4 1
Kölpinsee           20.3 1
end data.
formats SQKM (F5.1).
exe.
variable labels
DUMMY "Natural Lakes"
SQKM "Km²".
exe.
Box plot
The box plot is a dispersion plot based on the median and quartiles. The
"box" is bounded by the first and third quartiles (25 and 75% percentile
respectively) of the distribution. This so-called interquartile range (IQR) thus
covers 50% of the values. The crossline in the box represents the median (Q2,
50%-percentile). The whiskers from the box lead to the highest or lowest
value observed within the 1.5-fold interquartile range, provided this is not an
outlier. Observed values outside the whiskers are called outliers or extreme
values. Values that lie more than one and a half box lengths (IQR) outside
(outliers) are marked with a circle in a box plot. Values that lie more than
three box lengths (IQR) outside (extreme values) are marked with an asterisk.

EXAMINE VARIABLES=SQKM BY DUMMY
 /PLOT=BOXPLOT
 /STATISTICS=NONE
 /NOTOTAL
 /ID=LAKE .

The outlier (more precisely: extreme value) "Bodensee" can be recognized in
the box plot by the fact that it is marked by an asterisk outside the so-called
whiskers. EXAMINE allows labeling the outlier (option: ID=). The y-axis
shows the absolute value of the outlier. Extreme outliers cause the "box" to
be compressed proportionally, so that it can hardly be interpreted anymore. The
other parameters of box and whiskers (quartiles, median) are not affected.
Histogram
Histograms are considered the standard for presenting frequency distributions
of a metric scaled variable and are therefore useful for detecting outliers
beyond the "actual" distribution. In a histogram, the individual cases are
categorized into classes and represented by bars. The area of each bar
corresponds to the frequency represented in each case.

GRAPH
/HISTOGRAM(NORMAL)=SQKM.

The outlier can be recognized by the fact that it lies above the "actual"
distribution (depending on the data situation, outliers can also lie below or on
both sides of a distribution). Outliers cannot be marked separately. The x-axis
of the histogram does not show the value of the outlier, but the boundaries of
the class to which it was assigned. Outliers distort the calculated parameters
(mean value, standard deviation) and the estimation of the normal
distribution. With the histogram, it should be noted that the classification of
the cases may possibly lead to "swallowing" outliers. In the case of very wide
intervals, high (low) values may fall into the neighboring category and thus
no longer stand out as outliers. For very wide intervals, a histogram is not
suitable for the identification of outliers. The same applies to the "butterfly-
plot" offered by SPSS, which is nothing else but a histogram mirrored on the
y-axis, i.e. dichotomously grouped.

XGRAPH CHART=[HISTOBAR] BY educouns[s] BY cathol[c]
 /COORDINATE SPLIT=YES
 /BIN START=AUTO SIZE=AUTO.
Error bar charts
In SPSS, error bar charts can be used to display various dispersion measures
around the mean value. As measures of dispersion the statistics of the
standard deviation, the standard errors and confidence intervals are available,
which ultimately also determine the length of the bars.

GRAPH
 /ERRORBAR(STDDEV 1)=SQKM BY DUMMY .

The outlier is not specifically recognizable (cf. box plot, histogram) and is also
not marked. However, the outlier shows itself "indirectly" in the strikingly
high standard deviation. Outliers distort the parameters (mean value, standard
deviation). What the error bar chart looks like without the outlier is shown
below; compare the "position" of the mean value and also the proportional
"length" of the standard deviation.

GRAPH
 /ERRORBAR(STDDEV 1)=SQKM BY DUMMY .

When interpreting error bar charts (especially when comparing several scales
resp. diagrams), please note the preceding remarks on the comparison of
(non)standardized standard deviations.
“One-dimensional“ scatter plot
Scatter plots normally display pairs of measured values of two metric
(interval-scaled) variables in a common coordinate system ("point cloud").
Because of the trick with the DUMMY variable, all SQKM values lie on one
line (if SQKM and DUMMY were swapped in the syntax, the line would be
displayed vertically).

GRAPH
/SCATTERPLOT(BIVAR)=
SQKM WITH DUMMY
/MISSING=LISTWISE .

The outlier can be recognized by the fact that it lies above the "actual"
distribution. Depending on the data situation, outliers can also lie below or on
both sides of a distribution. In contrast to an IGRAPH variant (see below)
outliers cannot be marked separately. The x-axis of the scatterplot shows the
absolute value of the outlier (see histogram, by contrast). Other parameters
are not calculated and are therefore not affected. With the scatter plot, one
has to check whether several identical values might lie on top of each other.

IGRAPH
 /VIEWNAME='Scatter Plot'
 /X1 = VAR(DUMMY) TYPE = SCALE
 /Y = VAR(SQKM) TYPE = SCALE
 /COORDINATE = VERTICAL
 /POINTLABEL = VAR(LAKE) ALL
 /X1LENGTH = 3.0
 /YLENGTH = 3.0
 /X2LENGTH = 3.0
 /CHARTLOOK = 'NONE'
 /SCATTER COINCIDENT = NONE.

Stem-and-leaf plot (stem-leaf plot)


Stem-and-leaf plots are basically histogram-like diagrams rotated onto their
side. In contrast to the histogram, in stem-and-leaf plots the values
themselves are stacked on top of each other like bars. The stacks of numbers
represent the values present and are therefore more informative than bar
charts (including histograms). However, unlike bar charts, stem-and-leaf
plots "swallow" possible gaps in the data and should therefore be carefully
searched for missing categories. Stem-and-leaf plots are mainly used to
identify outliers and extreme values.
As the following example will show, stem-and-leaf plots are less suitable for
small datasets, such as N=13 in the Lake example. The following example is
based on N=711 hip data.
EXAMINE VARIABLES = taille
/PLOT STEMLEAF
/COMPARE GROUP.
Waist circumference Stem-and-Leaf Plot
Frequency Stem & Leaf
19,00 6 . 0223444
55,00 6 . 555566778888899999
118,00 7 . 0000000011111122222222223333333334444444
105,00 7 . 5555556666666777777788888888999999
119,00 8 . 0000000000111111112222222333333333444444
89,00 8 . 55555566666777777788999999999
66,00 9 . 00000111111222233344444
61,00 9 . 55555566666777888899
36,00 10 . 000001233344
14,00 10 . 67789
10,00 11 . 034&
2,00 11 . &
17,00 Extremes (>=117)
Stem width: 10
Each leaf: 3 case(s)
& denotes fractional leaves.
'Stem' indicates the first digit in the value stacks; so if there is a 6 in the
'stem', a 6 always precedes the numbers in the 'leaf'. 'Stem width' indicates
the unit of the columns; a 'stem width' of 10 means that the columns contain
whole units of ten. Taken together ('stem' x 'stem width'), this means that the
(in the example two-line) stacks contain values in the 60s, 70s, 80s, etc.
'Leaf' indicates which values exactly are contained in each column. The 'leaf'
unit is always one unit below the 'stem width' unit (an even lower unit is not
displayed, so for multi-digit data the displayed values are always rounded).
For example, if the 'stem width' is two digits, e.g. 10, then the 'leafs' give
the single-digit values. 'Frequency' indicates how many values are contained
in each column. In the case of large amounts of data, each leaf may not be able
to represent each case individually; in this case the information 'Each leaf: 3
case(s)' indicates how many cases a leaf represents.
A 'stem' of 6 and a 'leaf' of 0223444555566778888899999 (combined for
illustration purposes) with a 'stem width' of 10 would mean that the two
columns (N=74, 19 plus 55) contain values from 60 to 69, namely the values
1 x 60 (6x10+0) x 3 ['Each leaf' factor], 2 x 62 (6x10+2) x 3, 1 x 63 x 3,
3 x 64 x 3, 4 x 65 x 3, 2 x 66 x 3, 2 x 67 x 3, 5 x 68 x 3 and 5 x 69 x 3.
The line for the first "6" character therefore cannot be read as meaning that
the value 0223444 is present 19 times. If extreme values are present, a
reference is made to them in a last line. In this case, there are 17 extreme
values >= 117. Extreme values in a stem-and-leaf plot are values that lie more
than 1.5 IQRs below the lower quartile or above the upper quartile. This
definition differs from the definitions for outliers (> 1.5 - 3 IQRs) and
extreme values (> 3 IQRs) in a box plot. Thus, the stem-and-leaf plot does not
distinguish between outliers and extreme values; furthermore, these are neither
listed nor marked individually. The stem-and-leaf plot only indicates that
there are conspicuously high values, but does not provide any more precise
information.
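The reading rules above can be illustrated outside SPSS. The following Python sketch (made-up waist values, stem width 10, one leaf per case, i.e. no 'Each leaf' factor) builds the stems and leaves directly:

```python
# Minimal stem-and-leaf sketch (stem width 10): the stem is the tens
# digit of a value, the leaf is its ones digit, so 117 -> stem 11, leaf 7.
from collections import defaultdict

def stem_and_leaf(values, stem_width=10):
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // stem_width].append(v % stem_width)
    # Render each stem's leaves as one digit string, mirroring the SPSS layout.
    return {s: "".join(str(leaf) for leaf in leaves) for s, leaves in stems.items()}

waist = [62, 63, 64, 65, 65, 67, 71, 72, 72, 78, 83, 85, 91, 104, 117]
plot = stem_and_leaf(waist)
for stem in sorted(plot):
    print(f"{stem:3d} | {plot[stem]}")
```

Reading the output works exactly as described: stem 6 with leaves "234557" stands for the values 62, 63, 64, 65, 65 and 67.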
QQ plots (not presented further) can also be used to analyze outliers in the
sense of deviations from a reference distribution. QQ plots are scatter plots
in which the empirically measured values of a variable (graphically
represented as points) are compared with the values to be expected according
to a reference distribution, e.g. the normal distribution (graphically
represented as a diagonal). In such diagrams, outliers appear as points off the
reference diagonal. Further diagrams especially for identifying outliers, e.g.
so-called Andrews or influence plots (cf. Schnell, 1994, 122-123; Barnett &
Lewis, 1994, 308-309), are not (yet) offered by SPSS.

7.3 Multivariate Outliers


In multivariate methods, e.g. cluster, discriminant or regression analysis,
outliers are combinations of values of the variables to be analyzed, in
contrast to e.g. the boxplot, where an outlier is a single value. Section 7.3.1
introduces the identification of outliers via measures, section 7.3.2 the
identification via rules, section 7.3.3 special features of time series data
(e.g. bivariate), and section 7.3.4 the identification of outliers via diagrams.
Also for multivariate data there is no omnibus measure or procedure for the
identification of outliers. The choice of a measure must be made with all due
care in order to avoid any suspicion of arbitrariness (see Barnett & Lewis,
1994³, 271-272).

7.3.1 Identification via measures


For the assessment of the influence of observations on multivariate models,
different measures have been developed, described as outlier (residual),
leverage, discrepancy and influence measures (see also Cohen et al., 2003³,
394f., 406ff.).
Leverage describes the influence of a case on determining the estimator at
this point. Leverage values thus measure the influence of a point in the x-
dimension on fitting the regression. A case (outlier) has high leverage if it
is located far away from the center of the rest of the distribution, regardless
of the direction.
A case has a large discrepancy if it lies far away from the (linear)
distribution of the remaining values; discrepancy is essentially the distance
between the predicted and the observed Y-value.
Influence is a consequence of leverage and discrepancy. If at least one of the
two quantities is rather small, the influence is also rather small. If both
leverage and discrepancy are large, the influence of the case is also large.
Influence measures quantify the effect that the omission of one point from the
estimation process has on the estimation itself.
These values are especially useful for assessing the model adequacy of
multivariate models, e.g. multivariate linear regression.

Dichotomized, high resp. low leverage and high resp. low discrepancy form in
their combination a 2 by 2 table, which is visualized exemplarily in the
following figure. The bright lines symbolize the comparison direction from the
respective case to the center of the distribution.
Case (a) (at the regression line) has a low leverage effect, because it is
located not far from the center of the rest of the distribution. The
discrepancy of this case is low, because it lies almost perfectly on the linear
distribution of the remaining values. The influence of this case is, if at all,
minimal. Case (a) could even be interpreted as a continuation of the rest of
the distribution.
Case (b) (bottom right) has a high leverage effect because it is located far
away from the center of the rest of the distribution. The discrepancy of this
case is high, because it is nearly orthogonal to the linear distribution of the
remaining values. The influence of this case is high.
Case (c) (bottom left) has a low leverage effect because it is located not far
from the center of the rest of the distribution. The discrepancy of this case
is high, because it is nearly orthogonal to the linear distribution of the
remaining values. The influence of this case is low.
Case (d) (top right) has a high leverage effect because it is located far away
from the center of the rest of the distribution. The discrepancy of this case
is low, because it lies almost perfectly on the linear distribution of the
remaining values. The influence of this case is rather small.
Leverage
Leverage is the influence of a case on determining the estimator at this point.
Leverage values thus measure the influence of a point in the x-dimension on
fitting the regression. A case (outlier) has high leverage if it is located far
away from the center (mean, centroids) of the rest of the distribution,
regardless of the direction. Leverage values can therefore also lie at the
extreme end of an estimated straight line and can thus exert a great leverage
effect. Parameters for the leverage effect are, among others, the Mahalanobis
distance and the so-called leverage values. The Mahalanobis distance indicates
how much an individual case differs from the average of the other cases with
respect to the explanatory variable(s). If a case has a large Mahalanobis
distance, it can be assumed that it has very high values for one or more
predictors and therefore could have a strong influence on the model equation.
Both statistics can be derived directly from each other: leverage value =
Mahalanobis distance / (N-1), and Mahalanobis distance = leverage value *
(N-1). The (centered) leverage values range from 0 to (N-1)/N; the Mahalanobis
distances scale accordingly by the factor N-1. The higher the value, the higher
the leverage. Cases (points) with high leverage values should be examined for
influence, especially if they differ visibly from the rest of the distribution,
e.g. in visualizations (see Cohen et al., 2003, 395ff.).
Checking leverage values (e.g.):
sort cases by LEV_1 (D).
exe.
temp.
list variables= ID LEV_1.
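The stated relation between (centered) leverage values and the Mahalanobis distance can be verified numerically. The following Python sketch (with simulated predictors; all variable names are my own) computes both quantities independently and confirms MD = leverage * (N-1):

```python
# Sketch (outside SPSS): centered leverage values are the diagonal of the
# hat matrix of the centered predictors; the squared Mahalanobis distance
# to the centroid equals leverage * (N - 1).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))          # 20 cases, 3 predictors (simulated)
N = X.shape[0]

Xc = X - X.mean(axis=0)               # center the predictors
H = Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T
leverage = np.diag(H)                 # centered leverage values

S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse sample covariance
mahal = np.einsum('ij,jk,ik->i', Xc, S_inv, Xc)  # squared Mahalanobis distances

assert np.allclose(mahal, leverage * (N - 1))    # the identity from the text
print(mahal.round(2))
```

Cases with the largest values on either statistic are the candidates to inspect, exactly as in the SPSS listing above.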

Discrepancy values
A case (outlier) shows a large discrepancy if it lies far away from the
(linear) distribution of the other values; discrepancy is essentially the
distance between the predicted and the observed Y-value. Among the discrepancy
statistics are residuals (Cohen et al., 2003, 398ff.). Residuals indicate model
adequacy in terms of the deviation of a single observation from its
corresponding estimated value. Related to a residuals diagram: a residual is
the distance of a point from the line of the plotted function. Chatterjee &
Price (1995², 9) recommend always checking standardized residuals. According
to the terminology of Cohen et al. (2003, 402), standardized residuals are
synonymous with internally studentized residuals. The mean value of
standardized residuals is zero and their standard deviation is 1. According to
a general convention, absolute values above 3 are clearly considered outliers
resp. influential cases; absolute values above 2 should be examined more
closely. Cohen et al. (2003, 401) recommend higher cutoffs, e.g. 3 or 4, for
larger datasets. The usefulness of residuals is limited to testing simple
(bivariate) regressions; they are less suitable for multiple regression models
(Chatterjee & Price, 1995², 86).
Checking of residuals (e.g.):
if abs(ZRE_1) >= 2 AUSREISR=2.
exe.
if abs(ZRE_1) >= 3 AUSREISR=3.
exe.
temp.
select if AUSREISR >= 2.
list variables= ID ZRE_1.
Sample output (residuals)
The cases flagged by the exemplarily requested residuals (absolute values
>= 3) will be excluded in a subsequent iterative analysis. The output and
evaluation of the other measures are not presented further.
ID ZRE_1
21,00 -4,69092
392,00 -4,69092
574,00 3,34377
Number of cases read: 3 Number of cases listed: 3
The relevance of outliers resp. influential values is assessed in practice by
first performing a regression with and then without these values. The newly
developed models are subsequently also subjected to a residual analysis. The
outliers themselves should not only be examined from a purely formal
perspective, but also from a content-related perspective. Often enough,
conspicuous values are indications of measurement errors, but they may also be
interesting from a content-related point of view.
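The residual check shown above can be mirrored outside SPSS; a minimal Python sketch, assuming the standardized residuals have already been computed and saved per case (the IDs and values below are made up for illustration):

```python
# Flag cases by the conventional residual cutoffs: |z| >= 3 clearly
# conspicuous, |z| >= 2 worth a closer look (mirrors the SPSS check above).
def flag_residuals(zres, warn=2.0, outlier=3.0):
    flags = {}
    for case_id, z in zres.items():
        if abs(z) >= outlier:
            flags[case_id] = 3
        elif abs(z) >= warn:
            flags[case_id] = 2
    return flags

# Hypothetical saved standardized residuals per case ID.
zre = {21: -4.69, 392: -4.69, 574: 3.34, 100: 1.10, 200: -2.40}
print(flag_residuals(zre))
```

Cases flagged with 3 correspond to the listing in the sample output; cases flagged with 2 would only be examined more closely.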
Influence measures
Influence is a consequence of leverage and discrepancy. If at least one of the
two quantities is rather small, the influence is also rather small. If both
leverage and discrepancy are large, the influence of the case is also large.
Influence measures quantify the effect that the omission of one point from the
estimation process has on the estimation itself. In other words, influence
measures indicate how much the residuals of all other cases would change if a
particular case were excluded from the regression function. Usually, only one
point at a time is excluded from the estimation process (Cohen et al., 2003,
402). In the following, the influence measures DfFit and Cook distance are
presented. Both measures are almost redundant; the only difference is that
Cook values cannot become negative.
The influence measure DfFit (abbreviation for "difference in fit,
standardized") describes the change in the predicted value that results from
the exclusion of a certain observation. For standardized DfFits, the
convention is to check all cases with absolute values greater than 2 times the
square root of p/N, where p is the number of independent variables in the
equation and N is the number of cases. Cohen et al. (2003, 404) recommend
cutoffs of 1 resp. 2 for medium and large datasets, and Chatterjee & Price
(1995², 89) recommend checking all cases with conspicuously high values for
this influence measure as well.
Checking of DfFit statistics (e.g.):
sort cases by DFF_1 (D).
exe.
list variables= ID DFF_1.
exe.
The Cook distance is the average squared deviation between the estimates of
the complete dataset and the dataset reduced by one observation, relative to
the mean square error of the estimated model. A high Cook distance indicates
that the exclusion of the case in question substantially alters the calculated
regression function of the remaining cases. According to a general convention,
Cook distances above 1 are considered influential (Cohen et al., 2003, 404).
Chatterjee & Price (1995², 89) recommend checking all conspicuously high Cook
distances.
Checking of Cook distances (e.g.):
temp.
select if COO_1 >= 1.
list variables= ID COO_1.
exe.
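For illustration, the Cook distance can also be computed literally by leave-one-out refitting. The following Python sketch (simulated data with one deliberately distorted case; not SPSS's implementation, which saves the same statistic as COO_1) shows that the distorted case exceeds the conventional cutoff of 1:

```python
# Cook distance by its definition: refit the regression without case i and
# measure the average squared change of all fitted values, relative to p*s^2.
import numpy as np

x = np.linspace(-2.0, 2.0, 30)
y = 2.0 + 1.5 * x                           # exact linear relation ...
y[0] += 6.0                                 # ... plus one smuggled-in data error

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
p = X.shape[1]                              # number of estimated parameters
s2 = ((y - yhat) ** 2).sum() / (len(y) - p) # mean squared error

cook = np.empty(len(y))
for i in range(len(y)):
    keep = np.arange(len(y)) != i           # drop case i and refit
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    cook[i] = ((yhat - X @ beta_i) ** 2).sum() / (p * s2)

# The manipulated case dominates and lies above the conventional cutoff of 1.
print(cook.argmax(), cook.max() > 1.0)
```

In practice, one would now rerun the model without this case and compare the two sets of estimates, as described above.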

7.3.2 Identification via rules


Further approaches are based on knowledge of absolute target values resp. of
relatively constant proportions. Both approaches have in common that they
assume certain characteristics of an object to be invariant and use them as
test criteria.
Absolute approach: Reference value
For example, the sum of the angles of a triangle is 180 degrees. This
invariant characteristic is used as a test criterion in the following. If, for
example, three variables A, B and C each contain one angle of a triangle, the
checksum of these three variables must be 180 degrees. If the checksum
deviates from 180, there is at least one outlier (downward and/or upward)
(Barnett & Lewis, 1994, 301). This check does not indicate in which of the
three variables A, B and/or C the error occurs; furthermore, this approach
does not exclude error compensation, e.g. the occurrence of values such as 0,
180 and again 0.
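A minimal sketch of such a checksum test in Python (the tolerance is an illustrative assumption):

```python
# Checksum test: the angles of a triangle must sum to 180 degrees. A
# deviating sum signals at least one erroneous value, but cannot say which
# variable is wrong; compensating errors pass unnoticed.
def angle_sum_ok(a, b, c, tolerance=1e-9):
    return abs((a + b + c) - 180.0) <= tolerance

print(angle_sum_ok(60, 60, 60))   # plausible record
print(angle_sum_ok(60, 60, 90))   # checksum violated -> outlier suspected
print(angle_sum_ok(0, 180, 0))    # compensating errors slip through
```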
Relative approach: Proportions
Many research units or objects can be described in units whose characteristics
(size, weight, price, etc.) have a relatively invariant but object-specific
relationship to each other.
A classic example from the field of medicine is the body mass index (BMI),
which indicates the relationship between weight (e.g. in kg; there are also
other calculation variants) and squared height (e.g. in m). For adults, this
quotient "usually" (frame!) ranges between about 15 and 40; values near these
limits generally represent extreme under- or overweight. Negative values or
values above 60 often indicate outliers, data and/or calculation errors.
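A plausibility filter along these lines might look as follows in Python; the metric BMI variant and the bounds 10 and 60 are illustrative assumptions, not clinical cutoffs:

```python
# Plausibility filter sketch for the BMI (metric variant: kg / m^2).
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

def bmi_suspicious(value, low=10.0, high=60.0):
    # Values outside a deliberately generous frame suggest data entry or
    # calculation errors rather than real measurements.
    return value < low or value > high

print(round(bmi(70, 1.75), 1))        # plausible value
print(bmi_suspicious(bmi(70, 1.75)))  # False
print(bmi_suspicious(bmi(70, 17.5)))  # height entered in cm-like units -> True
```

A typical error caught this way is a height recorded in centimeters instead of meters, which shrinks the quotient by a factor of 10,000.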
Many other application areas know basic constants. In astrophysics, for
example, the so-called "Hubble constant" describes the relationship
between the redshift and the distances of galaxies, which results from the
expansion of the universe. One of the practical applications of this constant is
the attempt to calculate the age of the universe. However, if the derivation of
the constant is questioned resp. corrected, the extent and age of the universe
calculated on its basis will of course change (cf. Bonanos et al., 2006).
In financial mathematics, for example, the Pay-Performance Index and a
compound interest rule, the so-called "72 rule", are known. The Pay-
Performance Index compares the salaries of managers with their performance
(measured in return on equity (percentage of company profit in relation to
equity) and total shareholder return (development of the value of the share
including dividends)) and is a (not uncontroversial) measure of the ratio
between a manager's salary and performance. The more neutral "72 rule"
indicates, by dividing the value 72 by a known return, how long it takes for
an investment to double. For example, if the annual return is 7%, the "72
rule" (72/7≈10.3) shows that it takes about ten years until the initial
investment has doubled. Conversely, the validity of a calculation can also be
checked in this way: if, for example, after ten years at 7%, the achieved
investment is not about twice the initial value, then there is a calculation
error to be checked. This rule is used for the calculation or verification of
(expected) doublings in other areas as well, e.g. in biology. For example, a
bacterial culture that grows by 3% per hour should double within one day
(24h). If the bacterial culture has not doubled after 24h, there may be
growth, measurement or calculation errors.
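The "72 rule" and its exact counterpart, the doubling time ln(2)/ln(1+r), can be compared directly with a small Python sketch:

```python
# Doubling time: approximate "72 rule" versus the exact compound formula.
import math

def doubling_time_rule72(rate_percent):
    return 72.0 / rate_percent

def doubling_time_exact(rate_percent):
    return math.log(2) / math.log(1 + rate_percent / 100.0)

print(round(doubling_time_rule72(7), 1))  # about ten years at 7% p.a.
print(round(doubling_time_exact(7), 1))   # exact value, close to the rule
print(round(doubling_time_exact(3), 1))   # a 3%-per-hour culture: roughly a day
```

The exact doubling time at 3% growth per hour is about 23.4 hours, which is why the culture in the example should have doubled "within one day".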
The total length of insects, for example, often consists of several segments,
whose respective lengths are proportional to each other as well as to the
total length. Let us now assume that there is a species of insect whose body
consists of three segments whose sizes are proportional to each other in a
ratio of, say, 3:4:5 units, i.e. a proportional total length of 12 units. If
the individual segment lengths are added up and the sum differs from the
proportionally expected total length, the deviation is an indication of an
outlier. This approach also assumes that there is no error compensation (see
above). The verification in detail, by contrast, is more difficult, because
plausible-looking totals can even be obtained from incorrect components. For
example, a value of 12 can also be obtained by adding (incorrect) ratios such
as 2:4:6. Thus, for the verification of a relative reference value, the
proportions of the individual sum elements must be checked pair-wise, e.g.
A:B, B:C and A:C.
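The pairwise check can be sketched in Python; the segment ratio 3:4:5 and the tolerance are illustrative assumptions. Verifying only the total invites error compensation, so every ratio is checked as well:

```python
# Pairwise proportion check: compare each measured ratio A:B, B:C, A:C
# against the expected ratio instead of trusting the checksum alone.
def proportions_ok(segments, expected_ratio, rel_tol=0.01):
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            measured = segments[i] / segments[j]
            expected = expected_ratio[i] / expected_ratio[j]
            if abs(measured - expected) > rel_tol * expected:
                return False
    return True

ratio = (3, 4, 5)                          # assumed segment ratio, total 12 units
print(proportions_ok((6, 8, 10), ratio))   # proportional -> True
print(proportions_ok((2, 4, 6), ratio))    # sum fits 12, but the ratios do not
```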

7.3.3 Special features of (bivariate) measurement series


The verification of time series data is in its simplest form a bivariate
verification method. The repeated measurements of the study object represent
the first variable. The time interval represents the second variable. Time
series data resp. measurement series can be checked for gross data errors by
means of simple line or scatter plots (see the following subsections). For a
more detailed analysis of a time-dependent series of measured values, e.g. in
the context of quality control, so-called control charts can be used (see
below).
Explorative line plots
Time series data generally have a special feature that makes it easier to
identify data errors using simple line or scatter plots. Time series data,
such as the annually calculated gross national product, have the property of
varying only slightly on average over time. Drastic fluctuations are the
exception rather than the rule; the gross national product, for example,
cannot easily increase by 300% in one year and then decrease by approximately
the same amount the following year. The following example is based on the SPSS
dataset "Trends chapter [Link]", into which I smuggled an artificial data
error for demonstration purposes. This outlier is immediately noticeable.

GRAPH
/LINE(SIMPLE)=MEAN(crestpr) BY week_.

However, errors in time series data are not always so obvious. If time series
data are based on complex cycles or value series of irregular periodicity, then
conspicuous values, especially if the time series are too short, are not
necessarily indications of incorrect data, but may also be indications of still
unidentified cycles etc. For example, if you find conspicuously high turnover
figures in sales data for a year around Christmas, then this could be an error
on the one hand; however, if you also check the turnover figures for previous
years around Christmas, then you might find that this is a very common (so-
called cyclical) phenomenon. Many people tend to spend money on gifts
mainly before Christmas. However, if the time series is too short or too
volatile, conspicuous values can possibly only be identified and distinguished
from "actual" errors by recourse to context knowledge and the use of
sophisticated (multivariate) time series procedures.
Line charts are suitable as a test instrument for any series of values whose
parameters (absolute value, mean value, frequency, etc.) should move within a
certain, known "frame" (see also the comments on the somewhat more
sophisticated control charts for running processes). In the section on
explorative scatter plots below, you will find variants using scatterplots. If
you now examine the values (e.g. absolute values) of these variables in a
series or even over time, e.g. in a line chart, conspicuous curve peaks are at
first to be understood less as indications of empirically relevant phenomena
than as data errors; drastically fluctuating frequencies of values (but also
of missings!) can also signal data problems. One cause can be, for example,
that the data in the individual dataset are polarized or coded the wrong way
round. Everything is possible.
Note: For explorative purposes, the time variable needs not be continuous or
regular; however, if anomalies are found in the response variables, it is
strongly recommended to also check the time variable for anomalies (gaps,
irregular intervals).
The analysis of such parameter series is especially appropriate if you compile
annually created individual data rows from a database into a time series
dataset and want to get an uncomplicated overview of whether the measurement,
logging, coding, etc. of the respective values is OK.
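The visual screening with line charts can be complemented by a simple numerical filter that flags implausibly large jumps between consecutive values; the threshold factor below is an illustrative assumption:

```python
# Non-graphical counterpart to the line-chart screening: flag points whose
# jump from the previous value is far larger than the typical (median) jump.
def flag_jumps(series, factor=5.0):
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    typical = sorted(diffs)[len(diffs) // 2]   # median absolute change
    return [i + 1 for i, d in enumerate(diffs) if d > factor * typical]

gnp = [100, 103, 105, 108, 420, 112, 115, 118]  # one smuggled-in data error
print(flag_jumps(gnp))
```

Both the jump into and the jump out of the erroneous value are flagged, so the suspicious point is bracketed by the two reported positions. As discussed above, a flagged point is only a candidate: it may be an error, but also an unidentified cyclical phenomenon.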
Control charts for quality control
Control charts describe dynamic processes over time as diagrams. The
control chart is the most important tool in “Statistical Process Control” (SPC)
to check whether a process is statistically under control. In its simplest form,
a control chart is a graphical representation of a chronologically recorded
series of measured values, around which upper and lower interval limits are
drawn, which were determined from the series of measured values (cf. the so-
called "Six Sigma" philosophy). In process control, these interval limits are
initially understood as warning thresholds and, in the case of more than
punctual overshooting or undershooting by outliers, as intervention
thresholds. The center line represents the ideal process course with zero
variation ("ideal course", often based on the values of exploratory qualifying
runs). Control charts thus allow to assess whether a process is still under
statistical control resp. to identify deviations or clear tendencies via outliers,
to intervene in time and to optimize quality, yield, or process flow.
The quality of a product or a performance feature is assessed e.g. on the basis
of certain characteristics. Each measurement must comply with previously
defined tolerances resp. tolerance ranges (so-called specification limits),
within which deviations from the target value are still accepted. In the
industrial manufacturing process, for example, the target diameter of
workpieces is specified with millimeter accuracy. Control charts can now be
used to check whether the nominal dimensions are maintained within
precisely specified tolerance limits during production, or whether there are
deviations from them. Control charts can also be used to observe other
process-relevant effects that occur in cyclical form, e.g. shift changes,
employee rotation or systematic errors. Control charts are indispensable for
controlling the adherence to tolerances and thus for ensuring process quality.
Cp and Cpk are basic parameters for the so-called process capability, Pp
values for the so-called process performance. These measures will not be
presented further here.
There are many different variants of control charts. Control charts on the
basis of the moving average also include information of the temporally
preceding individual or subgroup value and are therefore considered more
sensitive to even small changes in the process variable than control charts in
which each curve point is limited to only one individual (subgroup) value.
The data basis for the following (simplified) example comes from industrial
material control. In the example, the scattering behavior of a pH value
(variable "ph", the so-called process variable) in the manufacturing process
over time (variable "time") is investigated. The sigma level is set to 3
(compare the reference in the diagram legend "Sigma level: 3").
SPCHART
/XR=ph BY time
/STATISTICS=CP CPU CPL K CPM CZOUT PP PPU PPL PPM PZOUT AZOUT
/CAPSIGMA=RBAR
/RULES=ALL
/SIGMAS=3
/USL=5.5
/LSL=2.0
/TARGET=5.0
/MINSAMPLE=2.

For the XR subcommand together with RBAR, the SPCHART procedure actually
outputs two graphs: an X-bar chart for the mean value of the various samples
(see above) and an R chart (syn.: range chart) for their range (not shown, nor
are the various statistics). Since SPSS 15, control charts can be extended by
special rules for checking the measurement series for conspicuous values. With
the RULES option, rules for control charts can be defined, which allow for
easy identification of points that are out of statistical control. If a point
violates one or more rules (e.g. greater than +3 sigmas), it is displayed in a
control chart in a different color and with a different symbol. In the
example, the data violate the rules "6 points in a row trending up" and "6
points in a row trending down". The table "Rule Violations for X-bar" shows
the results in detail.
Rule Violations for X-bar
Date Violations for Points
01.07.1997 6 points in a row trending up
05.08.1997 6 points in a row trending up
07.01.1998 6 points in a row trending down
03.02.1998 6 points in a row trending down
06.07.1998 6 points in a row trending up
04.08.1998 6 points in a row trending up
14.01.1999 6 points in a row trending down
09.02.1999 6 points in a row trending down
03.08.1999 6 points in a row trending up
9 points violate control rules.
For details please refer to the SPSS Command Syntax Reference.
The following information can be taken from the diagram for the mean value
(X-bar chart for the location): the determined average of the measurement
series ("pH Level"), based on the respective mean values of the different
samples (minimum size: 2), lies at 3.60. The determined UCL (upper control
limit) is 7.23; the determined LCL (lower control limit) is -0.04. The upper
and lower intervention thresholds are 5.5 and 2.0, respectively (see "U Spec"
and "L Spec"). The process runs within the specified intervention limits (no
conspicuous outliers are present) and is thus generally within statistical
control. However, the periodic fluctuations are conspicuous, and their
substantive clarification would certainly contribute to the optimization of
the process quality.
Working with control charts involves various process and data requirements
(e.g. number of samples, normal distribution, appropriate tolerances), which
cannot be discussed here. In principle, however, two control charts should be
requested, a location chart and a dispersion chart, in order to obtain
differentiated information about systematic location or dispersion variations.
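The construction of the X-bar limits can be sketched numerically. The following Python example uses made-up pH subgroups of size 2 and the classical Shewhart constant A2 = 1.880 for subgroups of this size; it computes the center line and the 3-sigma control limits from the subgroup means and ranges:

```python
# X-bar chart limits from subgroup means and ranges:
# center = grand mean, UCL/LCL = center +/- A2 * mean range.
def xbar_limits(subgroups, a2=1.880):     # A2 = 1.880 for subgroup size n = 2
    means = [sum(g) / len(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    grand_mean = sum(means) / len(means)
    r_bar = sum(ranges) / len(ranges)
    return grand_mean - a2 * r_bar, grand_mean, grand_mean + a2 * r_bar

# Hypothetical pH samples of size 2, as in the SPCHART example above.
samples = [(3.4, 3.6), (3.5, 3.9), (3.2, 3.8), (3.6, 3.6), (3.3, 3.7)]
lcl, center, ucl = xbar_limits(samples)
print(round(lcl, 2), round(center, 2), round(ucl, 2))
```

Points outside [LCL, UCL], or rule violations such as six points trending in one direction, would then trigger an intervention.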

Explorative scatter plots


Scatterplots allow you to visually analyze the relationship of a time variable
with a second variable. The focus is on the identification of outliers and
anomalies, so the fact that (non-linear) associations may not be recognizable
in large datasets is not of primary relevance.
For explorative purposes, the time variable does not have to be continuous or
regular; but if you find anomalies in the response variable, it is strongly
recommended to check the time variable for anomalies as well (gaps, irregular
intervals).

The scatterplots above show the distribution of the data from the line chart
example, once with, once without outliers. Notice the effect of eliminating
the outlier, especially on the left axis:
The scale changes from 1.0 - 5.0 to 1.6 - 2.2.
The scaling unit changes from 1 to 0.2.
All in all, a more detailed view of the dispersion is obtained, here even with
a chance to identify other anomalies; note, e.g., the "dent" between weeks 75
and 100.


7.3.4 Identification via diagrams


As a first screening step, graphic procedures are considered necessary and
helpful. In contrast to measures and tests, they rarely contain assumptions
about distributions resp. about the mechanisms that might have caused the
outliers. This can be a real advantage if, for example, reliable distributional
assumptions about the (untransformed) data are missing (as long as one is not
guided too much by one's own expectations when interpreting them), or if the
various requirements (e.g. multivariate form of distribution) are not given or
unclear when determining the various measures (e.g. M-estimators etc.; Barnett
& Lewis, Chap. 7.2) or test statistics (Chap. 7.3.1-7.3.5).
However, a graphical analysis of multivariate data has its limits, in terms of
perceptual psychology as well as of outlier definition. One cannot, as in a
bivariate analysis, order the data in a manageable way on a maximum of two
dimensions and then say: "here one point is particularly far away from the
others". A graphical analysis cannot easily provide multivariate data with a
reliable order and structure, from which directly follows the problem that it
is difficult to define the nature and extent of multivariate "outlierness" in
this way (e.g. using concepts such as "extremes" or "distance"). In addition,
with multivariate data it is also necessary to distinguish between outliers
and influential observations. Influential observations can distort the
estimation of parameters without being outliers themselves. For multivariate
outliers, too, statistics provides different diagram types for graphical
exploration. According to Barnett & Lewis (1994, 308-309), Andrews plots offer
interesting possibilities for the multivariate identification of outliers.
Only a few types of multidimensional graphics are implemented in SPSS.
Multivariate means that several (n) dimensions can be displayed graphically at
the same time. A point in a multidimensional space is composed of n
coordinates. Thus, several dimensions should not be confused with the simple
display of several variables in a univariate diagram, e.g. in a box plot of
several variables.
A classic example of a multidimensional diagram is the bivariate
scatterplot. As the name suggests, the bivariate scatterplot displays two
dimensions simultaneously. A data point is composed of two coordinates,
usually value pairs of the x and y axes.

GRAPH
/SCATTERPLOT(BIVAR)=Leukocytes WITH Thrombocytes
/MISSING=LISTWISE
/TITLE='Bivariate Exploration'.

One or more outliers can always be recognized by the fact that they are
spatially distant from the other values, e.g. the two values at the top and at
the bottom right. Using a bivariate scatterplot, outliers can be described by
the three regression-analytical influence criteria (leverage, discrepancy,
influence; see 7.3.1):

Leverage: A case (outlier) has high leverage if it is located far away from
the center (mean value, centroids) of the remaining distribution, regardless
of the direction. The Mahalanobis distance is, among others, one of the
leverage statistics.
Discrepancy: A case (outlier) has a large discrepancy if it lies far away from
the (linear) distribution of the remaining values. The residual is, among
others, one of the discrepancy statistics.
Influence: Influence is a consequence of leverage and discrepancy. If at least
one of the two quantities is rather small, the influence is rather small too.
If both leverage and discrepancy are large, the influence of the case is also
large. Cook distance, DfFit and DfBeta are, among others, influence
statistics.

Therefore, a single case always has a leverage, a discrepancy and an influence at the same time (cf. also Cohen et al., 2003, 406ff.). For example, the value in the upper right corner is far from the center (mean) of the distribution; it therefore has high leverage. If, for the sake of simplicity, it is assumed that the linear trend runs approximately from the bottom left to the top right (left third), then this value also lies far from the trend of the remaining values and thus shows a high discrepancy. Because the value has large leverage as well as a large discrepancy, it has a large influence.
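The three criteria can also be sketched numerically outside SPSS. The following Python sketch (the values are invented for illustration, not taken from the book's dataset) computes leverage (hat values), discrepancy (residuals) and influence (Cook's distance) for a simple linear regression with one far-out point:

```python
import numpy as np

# Hypothetical value pairs; the last point lies far from the centre of x
# (high leverage) and far off the trend of the remaining points.
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
y = np.array([150., 160., 170., 180., 190., 200., 900.])

n, p = len(x), 2                        # p = number of regression parameters
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)               # discrepancy: residuals
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage
mse = np.sum(resid ** 2) / (n - p)
cooks_d = resid ** 2 / (p * mse) * h / (1 - h) ** 2            # influence

print("leverage:", h.round(3))
print("Cook's D:", cooks_d.round(3))
```

Note that the high-leverage point pulls the fitted line toward itself, so its raw residual can even be small; Cook's distance nevertheless flags it clearly, which is exactly why influence statistics complement a purely residual-based view.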
But SPSS also offers a true multidimensional scatterplot. A three-dimensional scatterplot can be requested by means of a second WITH. A data point is then composed of three coordinates, in this case value triplets on the x-, y- and z-axes.

GRAPH
  /SCATTERPLOT(BIVAR)=Leukocytes WITH Thrombocytes WITH pO2
  /MISSING=LISTWISE
  /TITLE='3-dimensional Exploration'.

With 3-dimensional diagrams, the "reading" of the distributions is more difficult. The 2-dimensional reproduction (e.g. on paper) of 3-dimensional distributions seduces the eye into perceiving distributions as one would like to see them. To counter such implicit interpretational seductions, it is recommended to rotate the variables across the x-, y- and z-axes of the cube, i.e. to display variable A on the x-axis first, then on the y-axis in the next diagram, and so on. This gives you the opportunity to view one and the same distribution "from several angles".

In the diagram "3-dimensional Exploration – 3", combinations of values in the upper right and middle right areas are conspicuous which could be identified in the other two diagrams only with difficulty or not at all. Because these values are spatially distant from the others, they are (in this case) 3-dimensional outliers.
SPSS offers further multidimensional diagrams for multivariate procedures, e.g. the territorial maps of discriminant analysis or the loading plots of factor analysis. The basic principle, however, is the same: the display of multivariate combinations of values (depending on the approach based on e.g. observations, discriminant scores or loadings) in n-dimensional space. Statistics offers many other methods for multivariate outlier analysis, which are based on the same principle apart from graphical design and mathematical transformations and therefore shall not be presented further (see e.g. Schnell, 1994, especially Chapter 11.3).
In conclusion, it must be said about the multidimensional diagrams presented so far that the example is based exclusively on interval-scaled data. The basic procedure can also be performed for ordinally scaled data and, with certain restrictions, even for categorically (nominally) scaled data. In the following example, leukocyte values are plotted on the x-axis and a 0/1 coding for gender on the y-axis.

GRAPH
  /SCATTERPLOT(BIVAR)=Leukocytes WITH Gender
  /MISSING=LISTWISE
  /TITLE='Bivariate Exploration – Response Category'.

This diagram clearly shows that the 0-values (for male) scatter more than the 1-values (for female). Among the 0-codes (for male), a single and distinct outlier occurs at a level of about 30; all other values are close together. Among the 1-codes (for female), a similarly clear outlier of about 25 occurs; in addition, there are some more strongly deviating values between 10 and 15.
With more than two categorically (nominally) scaled variables, the
graphically based outlier analysis in SPSS reaches its limits and conceptually
goes over to the analysis of qualitative data, which is presented in this book,
as already announced, in the unit on the more content-based plausibility
analysis.

7.4 Causal analysis: Outliers or not?


The possible causes of quantitative or even qualitative outliers range from the faithful reproduction of empirical reality, via sampling errors and incorrect data management, to errors in analyses or even in software programs. With the exception of "correct" outliers in the sense of a correct reflection of empirical reality, all other outlier variants are in principle considered erroneous values. The following examples treat outliers primarily as observation values or, more generally, as incorrect values; it is, however, difficult to speak of "observation" values if, for example, data errors are caused by technical defects. Quality measures for models or estimates, e.g. residuals or standard deviations, are only touched on.
Striking values as a reflection of empirical reality: Context
A very first interpretation possibility can be that outliers are (e.g.) empirical
reality represented in numerical form and correctly reproduced. The
introductory example with the flood levels should emphasize that the
absolute values of an "outlier" are on the one hand always relative to the
spatio-temporal located expectations ("frame") (a disregard could introduce a
bias towards a sampling error). A flood record, for example, clearly differs
from other water levels of a year, but not necessarily from other flood
records. On the other hand, it must be considered that empirical reality can
change; if, for example, one were to assume a constant flood caused by
global climate change, the daily water level data would no longer appear as
something special, but as (changed) reality. In comparing data generally, and
thus also of outliers, it applies that values as a reflection of empirical reality
always have their context, which can enhance or even relativize this value.
For example, the top speeds of the first racing cars were valid for their
technological context, but cannot be transferred to the state of vehicle and
engine technology of the present. The same applies to the field of psycho- and sociometry. A high age reached in modern times, for example, has a completely different context than the same age during the Middle Ages, e.g. due to better medical prophylaxis (hygiene, vaccinations). A high IQ at the beginning of the 20th century likewise has a completely different meaning than the same IQ (measured with the same instrument) in the 21st century; the learning context for individual further education has been revolutionized by computers and the Internet.
Sampling (external error source):
Example: Instead of young persons, older persons were inadvertently included in the sample, which distorts at least the values of the variable "age" in the analysis. In this variant, the "outlier" is caused by a sampling error; such an outlier is an incorrect value.
A completely different problem exists if the person actually comes from the
sample to be examined, but still has "extremely conspicuous" values. This
outlier variant thus has a "correct" value as an outlier, but there is no
sampling error. A first cause of this problem can be the "frame", e.g. the
theoretical model distribution leading to the judgement "extremely
conspicuous". It is possible that e.g. the normal distribution as "frame" is not
appropriate to the empirical distribution of the feature in which this outlier
occurred. This error variant would be that of the type of reference
distribution. In the case of samples (especially the smaller they are), such
errors are often also consequences of sampling errors, which can possibly be
compensated by additional surveys. However, sampling errors cannot
necessarily be assumed for comprehensive surveys.
Data input (external source of error):
Example: Instead of entering the age of "18" for a young person, the value
"81" was accidentally entered.
The problem with data entry is that only outliers immediately stand out as false values; values within a plausible range cannot be identified univariately as errors, but perhaps as multivariate outliers.
codes are assigned for different cancer therapies with varying degrees of side
effects, it is hardly noticeable during data control whether a "3" was formally
assigned instead of a "1"; however, it could very well lead to semantic
problems during analysis. If, for example, the "1" stands for chemotherapy
and the "3" for mistletoe therapy, persons with the rather strong side effects
of chemotherapy are assigned to the mistletoe therapy subsample, which at
least leads to a decrease in the separability of these two groups and thus the
effects of the therapies can no longer be clearly determined.
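The effect of such a semantic miscode on separability can be illustrated with a small sketch (the side-effect scores below are invented for illustration, not taken from any real study): a single chemotherapy case mis-entered under the mistletoe code pulls the group means toward each other.

```python
from statistics import mean

# Hypothetical side-effect scores (higher = stronger side effects)
chemo     = [8, 9, 7, 8, 9]   # code "1": chemotherapy
mistletoe = [2, 1, 2, 3, 2]   # code "3": mistletoe therapy

gap_correct = mean(chemo) - mean(mistletoe)

# One chemo case accidentally entered with code "3":
chemo_err     = chemo[:-1]
mistletoe_err = mistletoe + [chemo[-1]]
gap_error = mean(chemo_err) - mean(mistletoe_err)

print(round(gap_correct, 2), round(gap_error, 2))  # prints 6.2 4.83
```

The gap between the group means shrinks, i.e. the separability of the two therapy groups decreases, exactly as described above, although the entered value "3" is formally perfectly valid.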
Erroneous data management (internal/external sources of errors):
Numerous errors can occur in the area of data management. In general, these
causes can be assigned to the variant of the logic error. A program
implements the tasks as they were specified, but these specifications are
already wrong from the initial logic (this also includes the variant of the
software error).
Examples:

Calculating a mean value although the data are bimodally distributed. The mean value is arithmetically correct, but does not represent the empirical reality.
Accidentally not passing codes for user-defined missings to the analysis software. One consequence can be that such codes, e.g. "9999", enter the analysis undetected as real values and completely falsify the results.
Outliers can occur e.g. in the incorrect calculation of psychometric scales, if the analysis does not conform to the scale manuals, or if, in the calculation of percentages, the division is not made by a denominator corrected for the number of missings.
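The "9999" scenario above can be reproduced in a few lines (the ages are invented): if the user-defined missing code is not declared, it enters the mean as a real value.

```python
from statistics import mean

ages = [23, 31, 27, 9999, 25, 29]           # 9999 = user-defined missing code

naive = mean(ages)                          # code treated as a real value
valid = [a for a in ages if a != 9999]      # code declared/filtered first
correct = mean(valid)

print(naive, correct)
```

A single undeclared missing code inflates the mean age from 27 to well over 1,000, which is why declaring such codes (in SPSS via MISSING VALUES) must precede any analysis.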

Data transformations are a common cause of outliers. Beginners in data management are recommended to check transformations manually or on another analysis system for safety.
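Such a manual cross-check can be as simple as recomputing a derived value once by hand, here for the percentage example above (items and codes invented): a percentage score must divide by the number of answered items, not by the total number of items.

```python
# 10 scale items; None marks a missing answer
answers = [1, 0, 1, 1, None, 1, None, 0, 1, None]

answered = [a for a in answers if a is not None]
wrong   = sum(answered) / len(answers) * 100    # denominator not corrected
correct = sum(answered) / len(answered) * 100   # corrected for missings

print(round(wrong, 1), round(correct, 1))       # prints 50.0 71.4
```

If the two computations diverge, the transformation in the analysis system is worth a closer look before any results are reported.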
Model error (internal/external error source):
Numerous analyses allow model parameters to be stored as values in external datasets. Outliers in e.g. standard deviations can occur because the specification of the analyzed model is not correct (this applies to residual analysis in general). Conspicuous standard deviations are, for example, a characteristic of a suboptimal model specification in binary logistic regression. If such "warning signals" are not heeded, there is a danger of publishing unreliable results prematurely.
Software errors (internal source of error):
Many users assume that the software and algorithms they use are reliable. This leap of faith is not justified. All major analysis systems have their faults. These errors are unavoidable due to their complexity and the differing development speed of their modules; they are, however, documented to varying degrees. Outliers can therefore also arise if the software produces erroneous values by taking on an uncontrolled life of its own. This possibility should be taken into account especially if the data originate from data storages of third parties. In principle, every user should check whether an error and the corresponding solution (work-around) for the procedure just applied are documented. The fact that no error is currently known does not mean that no error exists. Examples of such errors and biases are omnipresent; awareness of the inherent danger is not.
An algorithm that was supposed to determine A-level grades for students who could not sit exams turned out to be fundamentally flawed and essentially "cheated", after which the British government decided not to use its results in favor of teacher-led assessments (BBC News, 2020a). Similarly, the Home Office scrapped a "racist" algorithm for
UK visa applicants (Guardian UK, 2020). Facial recognition technology is improving, but since its so-called training is biased, its results are biased as well, which means the results are not correct for certain groups (The New York Times, 2018; NZZ, 2019): for example, a facial recognition software used by police misidentified the faces of 28 congressmen as suspects from a database of detained Americans.
Even commercial artificial-intelligence systems have massive gender
and skin-type bias (Buolamwini & Gebru, 2018). An image-recognition photo app even mistakenly labeled black people as gorillas (The New York Times, 2015). About 400 US customers lost
their homes because their bank’s software calculated wrong values
(Süddeutsche, 2018). German bank giant Deutsche Bank transferred 6
billion by mistake; although caused by a simple fat-finger error it
managed to pass all their compliance and check systems
(Tagesanzeiger, 2015). Thousands of scientific studies could be wrong because of software bugs, after a functional magnetic resonance imaging (fMRI) scan of a dead salmon showed "brain activity" (WIRED, 2009). In 2005, for example, several software errors caused
a great deal of resentment to the German Federal Employment
Agency. At the beginning of 2005, a first larger "computer breakdown" led to the result that, of the (at that time) 2.8 million persons entitled to Hartz IV nationwide, approx. 5 per cent did not receive their unemployment benefit II (ALG II) punctually on the 1st of January (Beikler,
2005). In August of the same year, a further software error led, in several hundred thousand cases, to wrong health insurance registrations (i.a. registrations, cancellations) of "Unemployment Benefit II" (orig.: "Arbeitslosengeld II") recipients (HEISE online news, 2005). The British tax office made itself similarly unpopular in
2007. The British National Audit Office (NAO) found out that in
2006 the tax offices had miscalculated more than 1.6 million times
and had made incorrect payments about 1.04 million times. Due to
processing errors, a total of 157 million British pounds too much and
125 million too little had been paid out. These errors were more than
annoying for both sides: British taxpayers had to invest a lot of time
and effort to get everything right again (Atherton, 2007). The British
tax authorities had sought more powers in advance, e.g. to collect
taxes on their own authority. After the publication of the NAO report,
however, demands were made for better protection of taxpayers from
the tax authorities.
Such errors are technically called processing errors (see 19.1). Their common characteristic is that people who develop software, apps or algorithms usually do not talk to the people who use them, and both usually do not talk to the people about whom these algorithms make decisions; moreover, the first two groups usually do not think they might ever belong to the third (Eubanks, 2018; Fry, 2018). The inherent danger is an institutionalization of a minority-driven asymmetry in transparency, justice and risk (Taleb, 2019).

7.5 Handling of outliers


How can outliers be dealt with? This question is not easy to answer. The appropriate handling depends on how the outliers were created (see above). We can only advise against a blanket or even automated procedure, especially if it is carried out in an unmonitored or undocumented way. Keep in mind that already the differentiation between "real" outliers and outliers caused by errors requires a differentiated procedure.

Outliers can remain in the analysis if they are "real" outliers. If necessary, the expectation-based questions resp. the selected procedures and measures have to be checked as to what extent they can deal with the particularities of the detected outliers (robustness). For example, a mean value as a measure of location might have to be replaced by a more robust measure (e.g. by an M-estimator, see above). Another strategy would be to increase the sample size; small (non-representative) samples are more prone to outliers than large (representative) samples.
Outliers can be replaced by the correct values if they are "wrong" outliers, i.e. outliers caused by incorrect values. Outliers should be replaced by other values only if you are sure that these really are the right ones (e.g. using questionnaires, documentation, syntax, metadata, logical derivation, etc.).
Outliers can be replaced by estimated values, e.g. determined by RMV, MVA, hot deck or other approaches. With these procedures it is important to ensure that no bias is introduced.
Outliers can be replaced by fixed values. Outliers above a certain limit value are often set to a value below this limit in order to comply with certain process requirements. For monetary data, but also for all other types of information, it should be carefully considered beforehand whether this approach justifies forgoing the information contained in the original value.
Outliers can be bundled via a coding, i.e. further analysis is no longer
performed via the outlier value, but via the membership to a certain
group. This approach is of course only useful if the bundling coding
is possible or even appropriate to the question of the analysis.
Outliers can be deleted without exception. By deleting all outliers,
their influence is also completely eliminated. It is important to ensure
that the outliers are distributed randomly and have a small (ideally
univariate) proportion. In multivariate approaches, it is important to
ensure that outliers do not concentrate on relevant predictors (e.g., in
multiple regression) or the predictors for the rarer target event, e.g., in
binary logistic or Poisson regression. The price of deleting outliers is that, along with the sample size, the power of the procedure is also reduced.
Outliers can be reduced. For example, by not deleting all values but only values outside a certain range, the influence of outliers is not completely eliminated, but only reduced. The power of procedures is reduced less than by deleting values without exception. Common ranges are e.g. the median +/- 4 MAD, the mean +/- 2 sigma, or the so-called α%-trimmed mean. It should be noted that, depending on the definition of the range, a different number of outliers may be identified and eliminated. The Moses test for extreme reactions checks, for example, whether extreme values in the experimental group influence the range, and by default compares an observed group with a control group trimmed for outliers. Depending on the definition of the percentage or absolute number of outliers, this test may give a different result.
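The ranges mentioned above can be sketched quickly (the values are invented): median ± 4·MAD removes the outlier entirely, while a trimmed mean dampens it by dropping the extremes at both ends.

```python
import statistics as st

values = [10, 12, 11, 13, 12, 11, 14, 12, 95]        # 95: clear outlier

med = st.median(values)
mad = st.median([abs(v - med) for v in values])       # median absolute deviation
lo, hi = med - 4 * mad, med + 4 * mad
reduced = [v for v in values if lo <= v <= hi]        # outside-range values dropped

k = max(1, round(0.10 * len(values)))                 # 10%-trimmed mean:
trimmed_mean = st.mean(sorted(values)[k:-k])          # drop k values at each end

print(reduced)
print(round(trimmed_mean, 2))
```

Changing the range definition (e.g. 2 sigma instead of 4 MAD, or a different trimming percentage) can identify and eliminate a different number of outliers, which is exactly why the chosen definition should be documented and its effect compared.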
In all procedures, except when the data or values are not changed at all, a comparative approach should be chosen in order to estimate the effect of deleting, reducing, or replacing outliers. These procedures are, with few exceptions, also valid for qualitative outliers, especially if they are numerically coded. The actual check of the correctness of qualitative outliers is treated in the following section on plausibility.

8 Plausibility
Qualitative and quantitative approaches

Checking for plausibility requires adherence to all criteria presented so far. For all of them it applies that errors have no business being in the dataset anymore at the time of an analysis. The identification of their cause is central (see the chapter on missings). It is of fundamental importance whether these errors were caused by intentional incorrect information from data providers (e.g. study participants), by data entry, data migration or other factors. The type of error cause or source determines the extent, the systematics and also the (systematic) correctability, and thus the guarantee of plausibility.
If the data to be checked also contain time or date variables, these must first be checked for time- or date-related plausibility using Chapter 12. As can be seen in the DQ Pyramid, the criterion of plausibility is represented by two chapters. The common content of Chapters 8 and 9 is thus the criterion of plausibility; their formal commonality is that the operations are performed on only one dataset. Chapter 8 introduces the basic principle, first with simple (pragmatic, logical) approaches, then with the multivariate anomaly approach. Chapter 9 introduces the use of much more sophisticated screening rules that may only need to be programmed once.
The next section will first introduce the concept of plausibility. For this
purpose, test approaches, test equipment, as well as three test phases and their
specific result attributes are presented. The data quality criteria presented in
Chapters 3 to 7 (if applicable, Chapter 12) are assigned to different "degrees"
of plausibility.
8.1 Formal and content-related approach
Even if it seems to be contradictory at first: the criterion "plausibility", which
seems to be purely content-related, also has formal characteristics that have to
be met. This section will order the criteria for data quality presented in the
previous chapters in such a way that, from a different perspective, they form
different "degrees" of "plausibility". In the end, plausibility is defined as data
that are externally and internally correct and are plausible in terms of content
within a theory-based frame. In a first phase, the criteria completeness,
duplicates resp. missings are used to check the externally formal
correctness. In a second phase (II., see below), the criteria uniformity and
outliers are used to check for internal formal correctness. – If you have
checked and corrected your data according to the criteria from Chapters 3 to
7 (if applicable, Chapter 12), your data has already passed through two
phases and already shows the degree of internally formal correctness (this term is explicitly to be distinguished from "plausible as regards content"). – In
the third phase (III.), the "actual" content-related plausibility is checked; the approaches and procedures presented below and in Chapters 9 and 12 can be used for this purpose. The latter do not necessarily correspond to the "usual" hypothesis tests, in which expectations decide on (inappropriate) data; on the contrary, data can certainly decide on (inappropriate) expectations. When testing content-related plausibility, common sense combined with the courage to question critically can be a real door opener.
A. Test approaches:
Data always has both content-related and formal attributes. Therefore, for
checking and warranting of data quality within a dataset, two approaches are
derived, which have one thing in common: Both approaches follow criteria
that are explicitly formulated before the check is performed (as well as
strategies for dealing with erroneous data).
The content-related approach is based on content-related criteria and
specifies which content-related requirements must be met by the data in order
to be considered plausible and valid. The formal procedure is based on
formal criteria and defines the formal requirements for the data to be
considered plausible and valid. The formal procedure often precedes the
content-related procedure; in reality, both procedures complement each other:
Formal plausibility is only of use if there is content-related plausibility at the
same time; content-related plausibility is only of use if there is formal
plausibility simultaneously. For demanding applications, a certain iterative
approach is indispensable; there, a content-related check is followed by
another formal check, and so on.
B. Test equipment: Test and analysis variables
Both approaches agree that at least two variables must always be present when checking the plausibility of variables. The variable to be checked is
the so-called analysis variable ("analysis" in this context in the sense of
plausibility analysis or check). A second variable serves as a so-called test
variable; this is used to check whether the values in the analysis variable can
be considered plausible. The test variable does not necessarily have to be
included in the dataset itself; it can also be included in the project
documentation, the study protocol or other materials. Otherwise, if the
variables are contained exclusively in a dataset, one can get into a
questionable vicious circle, which can end up with the actual analysis
variable being used as an external criterion for testing the test variables. A
further possibility would be e.g. to supplement the active dataset with further
variables or values, either by hand (if it should concern only a few values) or
by adding a complete dataset. The plausibility of the test materials or
variables themselves must be checked and warranted. The reliability of the
test variables is an obvious and indispensable prerequisite for the test
procedure. If the test variable is not reliable as an external criterion, then
logically the result of the test procedure is also useless. A plausibility check
concentrates on the most central variable(s) of an analysis, the so-called
primary variable(s), but is usually not limited to them.
C. Test results:
The data check along the test steps takes place in three phases. In the result, a
distinction is made between three different, possibly easily confused data
attributes: externally formal correct, internally formal correct, and plausible
data.
Phase I: Correctness, externally formal
Comparison: of data storage with external original data.
Criteria: completeness, duplicates, missings.
Result: criteria met.
Conclusion: the external data, including errors, are taken over formally correctly.

Phase II: Correctness, internally formal (uni-/multivariate)
Comparison: of data storage with internal or external criterion.
Criteria: outliers, uniformity.
Result: negative.
Examples: "123-year-old person", "14-year-old girl", "pregnant men".
Conclusion: the content of the data storage is internally formally correct.

Phase III: Plausibility, content-related (esp. multivariate)
Comparison: of data storage with content-related expectations.
Criteria: outliers, content-related plausibility.
Result: positive or negative.
Example: "Ozone hole over Antarctica".
Conclusion: the data correspond to / contradict content-related expectations.

Phase I: Externally formal correctness


Externally formal correct refers to the takeover of data from external data storages. In the following, data (e.g. from patient records, examination questionnaires, but also from Excel spreadsheets or other databases) are called externally formal correct if they have been transferred (internally) into an SPSS dataset in a formally correct manner. However, the formally correctly transferred data need not themselves be correct or plausible. The correctness of the transferred data is checked in a second step.
Externally formal correct data is not necessarily reliable or valid data. There are many causes for incorrect data, from intentional to accidental, from sporadic to general. For example, a data supplier (study participant or similar) may deliberately provide incorrect data in a questionnaire. In the hustle and bustle of a clinical practice, incorrect information can inadvertently sneak into patient documentation. Under certain circumstances, certain methods, e.g. online data acquisition, generally provide relatively unreliable data if no protective measures against manipulation have been taken, e.g. by means of checking rules or authorization options. Online surveys, for example, are relatively easy to
manipulate. In February 2007, for example, the TV station Hamburg Eins
had to cancel an online survey for the election of a top candidate because it
was obviously manipulated. SPIEGEL ONLINE, for example, was also the
victim of massively falsified votes in spring 2004. One or the other reader
may still remember the downright manipulation competition between Kerry
and Bush supporters (Schendera, 2006, 429). – If such data, which actually
contains errors, is correctly taken over, the adopted value is initially only
externally formal correct (because it ideally corresponds 100% to the error).
However, the transferred value is not plausible, precisely because it is an
error.
Externally formal correctness is therefore fundamentally important for a plausibility check because it allows errors during phase I, the phase of the transfer to SPSS (e.g. reading/input, conversion, migration), to be ruled out. If the data are externally formal correct, both the transfer phase and the original data (e.g. questionnaires etc.) can be excluded as a source of error. Externally formal correct data is data that has been correctly transferred to a dataset; there is thus no external indication that the credibility of the data is doubtful. For checking external formal correctness, procedures for the criteria completeness, duplicates or missings can be used.
However, the fact that data has been correctly transferred to a data storage
does not automatically mean that it is error-free; in fact, looked at closely,
errors have also been transferred correctly. Externally formal correct data
must therefore be subjected to a second phase of plausibility checks, namely
whether the internal data is formally correct.
Phase II: Internally formal correctness (uni- and multivariate)
In the second phase, the present data is checked with the aim of eliminating
more or less obvious errors. The second phase of the plausibility check thus
checks whether data is internally formal correct. The goal of this phase is to
identify and correct possible errors (but in the sense of external correctness).
This approach is especially recommended if the reliability of the data source
(e.g. data provider, study participants) appears doubtful. The concrete
procedure usually proceeds from univariate to multivariate plausibility
checks. If, for example, a person states that he or she is 123 years old, then a
univariate analysis is sufficient to identify this as an error. However, separate
univariate analyses are usually not sufficient to identify data errors. If, for
example, the information "gender: female", "age: 14 years", "annual income:
130,000 €", "own children: 3" were each considered separately, there would
be absolutely nothing conspicuous about it. It is different, however, if these
details are unusual in their combination, e.g. a 14-year-old girl with an annual income of 130,000 € and three children aged 10 to 17 years. Multivariate outliers are, however, not always this conspicuous. Conspicuous values or combinations do not
necessarily have to be conspicuous by quantitatively high values, but can also
be conspicuous by special semantic properties (e.g. "a girl with biological
children who are older than herself"). It must be emphasized here that explicit
checking rules are employed in the second phase. In the third phase, test rules
have a special, different meaning. For the second test phase it can be stated
that the test rules applied there are rough, in order to exclude the existence of
fundamental violations of plausibility. For checking for internal formal
correctness, procedures for the criteria uniformity or outliers can be used,
among others.
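The examples above translate directly into explicit checking rules. The following sketch (field names and the age limit are assumptions for illustration, not taken from the book) flags a record whenever a univariate or a multivariate rule fires:

```python
def screen(record):
    """Return a list of plausibility violations for one record."""
    issues = []
    # univariate rule: age must lie in a plausible human range
    if not 0 <= record["age"] <= 115:
        issues.append("implausible age")
    # multivariate rule: biological children must be younger than the person
    if any(child >= record["age"] for child in record["children_ages"]):
        issues.append("child as old as or older than the person")
    return issues

print(screen({"age": 123, "children_ages": []}))
print(screen({"age": 14, "children_ages": [10, 15, 17]}))
print(screen({"age": 35, "children_ages": [3, 7]}))
```

The first record fails the univariate rule, the second only the multivariate rule, and the third passes both, mirroring the distinction between univariate and multivariate plausibility checks described in this phase.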
Phase III: Content-related plausibility (especially multivariate)
The third phase of the plausibility check is somewhat more difficult to
distinguish from the second phase. The difference is rather gradual. In the
third phase, the presumably correct data is compared with expectations
("frames"). This procedure is similar to a kind of informal hypothesis test.
However, the goal is not to accept/reject statistical hypotheses, but rather to
check the plausibility of the content (consistency) of datasets resp. storages.
If data do not meet expectations, both the data and the theory-based
framework, i.e. the applied frame, have to be questioned. What is the
(gradual) difference e.g. to the multivariate example listed under phase II?
The difference is: In phase II, when creating the test rule, the participants agree in advance which test result must clearly be wrong: the theory behind the
test rule does not need to be revised: "A 14-year-old girl cannot have
biological children who are older than herself". In phase III. however, the
test result may possibly be correct. Applied to the ozone hole example, this
could mean, for example: "The data are not outliers (rejecting the original
assumption that the ozone concentration is approximately invariant), but
reliable measurements of a continuous increase of the Antarctic ozone hole
(new assumption: the ozone concentration decreases). Here, too, data are
compared with a criterion derived from theory. The theory behind the test
rule can (and should) in any case be critically and thoroughly questioned
(and, if necessary, justifiably revised). Not everything that is conspicuous
must automatically be wrong in terms of content. Errors can be much more
subtle resp. some conspicuities may not be errors at all. As hopefully could
be shown at the ozone hole example in Chapter 7, not always an “error” (e.g.
a large percentage of outliers) must be wrong, it can also be the interpretation
frame.
This approach is by no means to be seen naively as a gateway to scientistic arbitrariness, but on the contrary as rationality-enhancing action, as a constructive-critical approach to truth through the exclusion of errors, be it on the data side or on the assumptions side. Scientific theories change, depending on the methods and data currently available. This is, incidentally, one of the essential differences between real science and dogmatism: if, for example, one concludes on the basis of new data that a previously held assumption was wrong, the theory is corrected. A prominent example is the physicist Stephen Hawking (e.g. 1988, 71-73), who first took the view that the universe arose from a big bang singularity ("Big Bang") and later became convinced that it did not; similarly, the theories on the death of the dinosaurs were refined on the basis of the respective data (e.g. Gould, 2000², Chapters 11 and 12; Brockman, 1991, 108-122). McClellan & Dorn (2001, especially chapters 10 and 11) and e.g. Feyerabend (1986) describe how, among other things, models of thought, methods (e.g. telescopes, wall quadrants, exact protocols) and (theory-guided) observational data heralded the transition from a geocentric to a heliocentric world view. The developmental history of molecular biology, from Mendel's plant experiments to the deciphering of the genome, can also be interpreted as an interplay between theories, research methods and the correspondingly acquired data. "Observation is always observation in the light of theories." (Popper, 1971⁶, 31). However, this interplay did not always proceed in an evolutionary way in the sense of optimized knowledge; sometimes it produced mistakes (e.g. Schendera, 2001). The discussion of the practical plausibility check is continued here with a further reference to the quite complex background of the theory and history of science (e.g. Kuhn, 1976; Albert, 1968; Popper, 1963, 1962; Feyerabend, 1986, 1980; Gould, 1983).
The strategy behind these last two phases is a testing logic complementary to that of research hypotheses. The "usual" research hypotheses ask where differences or correlations etc. might exist. This complementary testing logic, in contrast, excludes the possibility that the differences or correlations found have been artificially created (when tested after the actual hypothesis test) or will be created (when tested before it). In the last consequence, one tries to refute even confirmed hypotheses oneself; only when no argument can be found on the data or analysis level that speaks against these findings are the results actually valid (moreover, it is always better to perform this test oneself than to have errors or contradictions proved by third parties).
It is useful to run these checks, where e.g. no occurrence, difference or correlation may exist, before the actual hypothesis test. The advantages are manifold: one is not tempted, out of a certain fixation on results, to present significant but unchecked results as valid; and one is not disappointed if the long-awaited results turn out to be smoke and mirrors. On the other hand, one preserves the chance of converting statistically non-significant results, which might otherwise have been "swept under the carpet" unchecked, into valid and significant results through timely data verification and correction. And one saves additional work: it is quite possible that after the data corrections, other procedures are more appropriate than those originally chosen.
The approaches and procedures presented below can be used to check for content-related plausibility (see also Chapters 9 and 12). In contrast to the previous chapters, checking for content-related plausibility requires increased proximity to the subject and a close content-related dialogue with clients or experts. First of all, one's own common sense should always be consulted, and only then the support of SPSS. The following five examples illustrate this.
Example 1:
In a study with a large proportion of older to very old people, several people with high-school certificates attracted attention despite a short time at school, which seems contradictory from today's perspective. However, checks based on the project documentation showed that this was not a mistake, but the so-called "wartime high-school certificate" (orig. "Kriegsabitur"), a school-leaving certificate granted prematurely in wartime. The information was therefore internally correct and plausible despite its conspicuousness. Formally and in terms of content there were no errors, and thus no need for correction.
Example 2:
A study examined the breast cancer status of women. The data were internally completely inconspicuous, until the idea arose to check whether the dataset contained only women, since the results were to apply to women only. In fact, by using additional variables, it could be determined that several men had also been included in the breast cancer study. The values in the uncorrected variables were thus internally formally inconspicuous, but in part not plausible (acceptable) in terms of content. Had the data been included in analyses without correction, the formally correct results of a mixed sample would have been inadmissibly generalized to a uniform population (a not uncommon error). Formally, there are no errors in this example; in terms of content, the error was that a study on and for women also included data from men. This error could be corrected by filtering out the men. The example is presented in more detail in Section [Link].
Example 3:
In a study on the long-term effects of smoking, an accumulation of "99" values was observed when comparing several variables (e.g. the number of cigarettes per day). Smokers, however, are not known for smoking exactly 99 cigarettes a day. A review of the data documentation revealed that the "99" values were codes for missing data. The accumulation of "99" values was caused by the originally responsible persons forgetting to declare these values to SPSS as codes for user-defined missings. By explicitly defining the codes as user-defined missings, this error could be corrected.
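In SPSS syntax, such a correction is a one-liner; the variable name FREQCIGS is assumed here for illustration:

* Declare the code 99 as a user-defined missing value (variable name assumed).
MISSING VALUES FREQCIGS (99).
* Verify: 99 should now appear under "Missing" in the frequency table.
FREQUENCIES VARIABLES=FREQCIGS.

After this declaration, all procedures automatically exclude the "99" cases from calculations such as means or correlations.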
The ex-post identification of such codes in interval-scaled data depends on the extent to which the range of the codes differs from the range of the data. If the two overlap and the proportion of user-defined missings is small (< 5%), ex-post identification of the missings (including their differentiation from "correct" 99 values) and of the effect they cause (see the section on missings) is hardly possible. The values in the uncorrected variables are thus formally inconspicuous, because they are correct and free of contradictions, and are difficult to judge with regard to their content-related plausibility. The problem in this example is that the errors are difficult or impossible to distinguish from the correct data. Further errors from this smoking study are presented in [Link].
Example 4:
In a carcinoma study, two variables stood out completely. The variable with the number of positive lymph nodes systematically contained higher counts than the variable with the number of lymph nodes examined. A review of the original patient documentation revealed the cause: although the data had been entered correctly, variable names and labels had been confused during the data migration phase. Had these variables been included in analyses without correction, the formally correct results would have been interpreted against the background of the respective wrong construct (total vs. positive lymph nodes). The values in the uncorrected variables were thus internally formally inconspicuous, because they were correct and consistent, but completely implausible in terms of content. Formally, there are even two errors in this example, since each variable was taken for the other. A corrective swapping of variable names and labels fixed this subtle error. This example is taken up again in Section [Link].
Example 5:
When checking time series in the data warehouse of a telecommunications company, it was noticed that sales figures appeared on Sundays, although this sales channel made no sales on Sundays; it was also noticed that the subsequent Mondays contained missings, although sales figures should have been expected there. The obvious explanation was that the original sales data had entered the DWH incorrectly, which could be confirmed and corrected by consulting the responsible technicians.
Each of these five examples first examined the central variables (gender, cigarette consumption, high-school graduation, tumor parameters, sales cycle). Each example formulated one or more test rules: Is there a high-school certificate despite a short time at school? Does a study on the breast cancer status of women contain data from women only? Do smokers consume exactly 99 cigarettes a day? Etc. Common sense, proximity to the subject, and dialogue with clients or experts are indispensable. Where these are missing, a check or analysis by means of SPSS or other tools is less trustworthy, or merely pseudo-plausible.

8.2 Practical check of data plausibility


The following sections introduce the formal as well as the content-related approach on the basis of two common analysis situations: working with one dataset, and therein with one or more variables. A number of tips are provided for working with multiple datasets. The examples presented below, drawn from my analysis activities, have been simplified for better illustration; any similarity with real projects is purely coincidental and not intended. All these examples have in common that there is a test rule determining which test result is clearly wrong. "Wrong" can be defined logically or empirically.

8.2.1 Plausibility of one variable


The approaches presented for a categorical, a string and a metric variable presuppose the external formal correctness of the data and are limited to internal formal correctness and its plausibility. This section does not discuss examples with missings.
[Link] Example for a categorical variable
General:
For a categorical variable, the existing values and the corresponding missings
(if any) are requested using simple (one-dimensional) frequency tables. Most
categorical variables have a manageable number of values and can be
requested and checked in lists.

Three looks at the data


The view of the data is threefold. In a first step it is checked whether all expected values are present; this first look concerns completeness. If, for example, when checking the variable ALPHABET only the values "A" and "C" are displayed although "B" should be present, there are usually two possibilities: the "B" values are missing completely, or they have been accidentally assigned to a different category, e.g. "A" or "C". Differences in the frequencies often indicate quickly which possibility applies.
The second look examines the occurrence of values that should not occur at all (redundancy). If, for example, when checking the variable ALPHABET the values "A", "B", "C" and "D" are displayed although the value "D" should not occur at all, there is clearly a data error. Such errors are often characterized by a minimal frequency of occurrence. If, however, such errors occur to an extent that cannot be ignored, a coding error may be the cause. For example, if three codes, e.g. "0", "1" and "2", occur in the (usually dichotomous) variable GENDER, then usually two codings, e.g. "0/1" and "1/2", have been used simultaneously. If several codings are used at the same time, their consistency should also be ensured: one cannot assume that the same code systematically designates the same thing (e.g. the respective gender) across different codings. A "1" can then stand for "male" as well as for "female", which would be devastating for any analysis involving gender.
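A minimal sketch of harmonizing two codings that were used simultaneously; the helper variable SOURCE, which marks the coding scheme ("0/1" vs. "1/2") each case came from, is an assumption for illustration:

* Map the "1/2" coding back to the "0/1" coding (SOURCE is an assumed helper variable).
DO IF (SOURCE = 2).
RECODE GENDER (1=0) (2=1).
END IF.
EXECUTE.
* Afterwards only the codes 0 and 1 should remain.
FREQUENCIES VARIABLES=GENDER.

In practice, such a helper variable must first be reconstructed, e.g. from the data source or collection wave; without it, the two codings cannot be separated.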
The third look examines the absence of values, the missings. Even if no missings show up in the variable, this step is not superfluous: at this point the requested frequency table should be checked to see whether it requested the missings at all. Afterwards, it can be checked whether any user-defined codes for missings were not passed to the analysis system. Conversely, strikingly many missings are not in general an indication of suboptimal data quality. In other words, the rarity of events should not be equated with their irrelevance. A variable is not per se qualitatively unrewarding or irrelevant simply because it contains little or only marginal information. Whether the marginality of information is irrelevant should only be judged by someone with proximity to the subject. Possible causes include the data situation for valid but rare events, or errors during data transformation or migration.
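All three looks can be served by one minimal frequency request; the variable name ALPHABET is taken from the example above:

* One frequency table covers all three looks:
* completeness, values that should not occur, and missings.
FREQUENCIES VARIABLES=ALPHABET.

The resulting table is then read three times: Are all expected values present? Do unexpected values occur? Are the missings (including any user-defined codes) reported as such?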

Starting point of a realistic example:


The breast cancer status of women was studied at a women's clinic. On the basis of the data finally available, the apparently banal fact of whether only women appear in the dataset is to be checked. After all, the results of the analysis should be transferable back to women only; it is therefore necessary to ensure that the sample meets the inclusion and exclusion criteria for the study participants and their data. The sample may thus contain only women; in relation to the dataset, this means that the variable "gender" is checked. The example ultimately demonstrates how the content-related plausibility of the variable "breast cancer status" is ensured.
1. Possibility of verification: Query of the variable "gender"
Using the dataset documentation (e.g. "Display Data File Information" under
the SPSS menu “File”) you search for the variable "Gender". Let us assume
the result is that the variable "gender" is not present in the dataset, because it
was implicitly assumed that only women were examined in the course of the
study. A different procedure is therefore necessary.
2. Possibility of verification: Asking the data collectors
The persons in charge of data collection (a criterion outside the dataset) are asked whether only women were examined in the course of the study. However, not all of them are still available, and some are not sure whether only women were examined. This retrospective external criterion is too uncertain to identify the exact data rows containing men. A different approach is necessary.
3. Possibility of verification: Query of the variable "first name"
Instead of the variable "gender", other variables are checked. The result: the variable "first name" is present in the dataset (an external criterion within the dataset). Close examination shows that there are undoubtedly several unambiguously male first names, e.g. 'James', 'Robert' or 'Scott' (especially as double first names). Note, however, that there are first names that are gender-neutral, e.g. 'Riley', 'Kay' or 'Chris'. Nor can this problem be solved by sorting the data by last name and then simply identifying the bearer of a uniquely non-female name as the male half of a married couple. This procedure rests on the experience that people often visit a clinic together as a married couple; however, siblings with the same last name, as well as unmarried or lesbian couples, are also possible. Although the idea is good, it has its pitfalls, e.g. skipping married couples with different last names. The problems are comparable to the stereotype-led approach (see above).
4. Possibility of verification: Taking over other data sources
The study design provides for sending samples of the examined persons to
the laboratory for analysis (external criterion outside the analysis dataset). A
call to the laboratory reveals that the staff recorded the gender of the samples
when documenting the samples in order to be able to set the analysis
correctly. The laboratory data is merged with the analysis dataset. Using the
variable "gender" from the laboratory dataset, the male participants in the
analysis can be clearly identified and removed from the dataset.
The presented test methods can be implemented easily and straightforwardly in the form of ideally (alphabetically) sorted (SORT CASES) data lists (LIST) or frequency tables (FREQUENCIES). The selective filtering or restriction by male first names could be done e.g. via SELECT IF.
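A minimal sketch of such a screening, assuming the variable names ID and FIRSTNAME and a handful of clearly male first names; in practice the name list would of course have to be much longer:

* Sort alphabetically, then list only the cases with clearly male first names.
SORT CASES BY FIRSTNAME (A).
TEMPORARY.
SELECT IF (ANY(FIRSTNAME, 'James', 'Robert', 'Scott')).
LIST VARIABLES=ID FIRSTNAME.

TEMPORARY limits the selection to the following LIST, so the dataset itself remains complete until the cases have been verified as male.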
[Link] Example for a string variable
General:
A string variable is usually treated as a categorical variable. In this respect, what has been said about the evaluation of categorical variables essentially applies to string variables as well.

Starting point of a realistic example:


At a rehabilitation clinic, the name and dose of the post-operative pain medication are studied. Using the final string data, the by no means banal question is verified whether the drug names, documented as character strings in the dataset, were written uniformly at all (see also Rohe & Beyer, 2005). This verification is necessary because the results of a frequency analysis planned for later should be based on absolutely uniformly written strings. A single typing error (including blanks) is sufficient for misspelled names to be listed and counted in frequency analyses as separate drug names alongside the correct ones; it is therefore necessary to ensure the uniform spelling of the strings by means of an exploratory frequency analysis (FREQUENCIES) before the actual frequency analysis. Uniformity (see Chapter 4) includes, among other things, upper and lower case, umlauts, spaces, and special characters. This approach thus checks the internal consistency of a variable in the form of its (clinically) uniform spelling. Of course, it must also be checked whether the drugs listed are actually painkillers and not other drugs.
1. Possibility of verification: Query of the variable "drug"
The variable "drug" exists in the dataset because the study design explicitly provided for logging the administered pain medication. A first frequency analysis shows that the drug names were not documented uniformly: lower case alternates with upper case, and typing errors were discovered after an initial screening, which is not at all easy given the often quite exotic drug names. The drug names are subsequently standardized, e.g. via UPCASE and LTRIM.
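A minimal sketch of such a standardization, assuming the variable name DRUG:

* Standardize spelling: upper case, leading blanks stripped (variable name assumed).
COMPUTE DRUG = UPCASE(LTRIM(DRUG)).
EXECUTE.
* Re-check the result; each drug should now appear as exactly one table row.
FREQUENCIES VARIABLES=DRUG.

Remaining typing errors within the standardized strings still have to be hunted down individually in the frequency table.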
2. Possibility of verification: Checking the medication
In contrast to subjective statements, drug names are objective facts; to verify them, one does not necessarily need to fall back on the respondents. A check of the drug names given (e.g. "Vioxx", "Tramadol", "Dentocaps") against the so-called "Red List" (a drug directory, e.g. for Germany) shows that some names are not painkillers at all but laxatives (e.g. "Dulcolax"). This check must be carried out with due care. As an example of the possible confusion of unclearly written drug names, e.g. "Mevinacor" (a rather harmless cholesterol-lowering drug) with "Marcumar", which can lead to dangerous internal bleeding when overdosed, see Rohe & Beyer (2005). There are far too many similar drug names: "Cirnelin"/"Cirnedin", "Isicom"/"Insidon", etc. For dose-critical medications it is also essential to note the exact unit: the same number can mean completely different doses for different units, e.g. 15 mg (= 1.5 ml) compared to 15 ml (= 150 mg). A closer examination in the example shows the following picture: on the one hand, painkillers were correctly indicated but misspelled ("Dentrocaps"); on the other hand, laxatives ("Dulcolax") were sometimes clearly indicated instead of painkillers.
3. Possibility of verification: Checking the false negatives
A laxative is not an analgesic; therefore, for the recorded laxatives it can be checked whether they were in fact misstated by mistake, or whether the laxative was deliberately prescribed as a light analgesic (painkiller) medication. For this purpose, the prescribing doctors can be interviewed in addition to consulting the "Red List".
The presented test methods can be implemented easily and straightforwardly in the form of ideally (alphabetically) sorted (SORT CASES) data lists (LIST) or frequency tables (FREQUENCIES). Values that are often misspelled in the same way can be corrected e.g. via IF or INDEX statements. An automatic correction via syntax has the advantage that no sporadic errors sneak in; and should an error sneak in at this point, it shows up so clearly that it can be easily recognized and corrected.
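A minimal sketch of such a rule-based correction of a recurring misspelling, assuming the variable DRUG and the misspelling "DENTROCAPS" discussed above:

* Correct a systematically recurring misspelling (variable and value assumed).
IF (INDEX(DRUG, 'DENTRO') > 0) DRUG = 'DENTOCAPS'.
EXECUTE.

Every such rule should be derived from the frequency table of the actually occurring spellings, not guessed; a rule that is too broad would silently overwrite correct values.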
[Link] Example for a metric variable
General:
A metric variable also subsumes the nominal level. In this respect, what has been said about the assessment of categorical variables applies without restriction to metric variables as well.

Starting point of a realistic example:


At a clinic, children and adolescents are examined for certain cardiovascular parameters. The data finally available are to be subjected to a first plausibility check. In the following, the variables "age" and "blood pressure" serve as examples.
1. Possibility of verification: Boxplots
The range of the variable "age" is checked graphically with a box plot. In SPSS, outliers can be identified graphically, quickly and straightforwardly by means of an ID variable. If boxplots indicate that outliers may be present, these are examined more closely in a further step to determine whether they are plausible outliers or clearly incorrect data. One could also imagine a LIST output, filtered e.g. by SELECT IF (ALTER > 18).
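A minimal sketch of both checks, assuming the variable names AGE and ID:

* Boxplot with outliers labeled by an ID variable (variable names assumed).
EXAMINE VARIABLES=AGE
  /ID=ID
  /PLOT=BOXPLOT
  /STATISTICS=NONE.
* List the suspicious cases directly; the cutoff 18 follows the study design.
TEMPORARY.
SELECT IF (AGE > 18).
LIST VARIABLES=ID AGE.

The boxplot gives the overview; the filtered LIST then names the exact cases to be followed up in the patient documentation.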
2. Possibility of verification: Descriptive statistics
The range of the variable "age" is examined using DESCRIPTIVES. This approach only allows an analysis for outliers, not their identification by means of an ID variable. In the result, the maxima for "age" lie above 40. These values are clearly not plausible: 40-year-old children and adolescents are a contradiction in terms.
3. Further possibilities: Reconstruction and documentation
If further patient data are available, the children's actual age at the time of the examination could be reconstructed from their date of birth. Another possibility is to check the incorrect data for systematic entry mistakes, e.g. frequently occurring transposed digits such as "14" instead of "41". If a tendency towards transposed digits can be identified as a possible cause of incorrect data, then apparently correct data should also be checked to see whether they are not actually wrong (e.g. "21" instead of "12").
In the case of data stores with unspecific specifications, the entries must be checked to see whether they were entered in the same unit, e.g. time durations uniformly in seconds, minutes or hours. Drug doses should be checked carefully for use of the same unit, e.g. 15 mg compared to 15 ml.

8.2.2 Plausibility of two or more variables: "Qualitative" approach


The following examples are limited to two variables, but can be extended to several variables.
These variants are called the "qualitative approach" because, qualitatively speaking, cases can be assessed directly and immediately, without transformation of the source variables (however, see 8.2.3).
Here, too, the goal is to check first the internal formal correctness and then the content-related plausibility. For reasons of clarity and efficiency, numerous examples from the previous sections are repeated here; the detailed explanations can be found there. Examples with missings are also discussed in this section.
[Link] Example for metric variables
For two metric variables, bivariate scatter plots including an ID variable are useful. The interactive scatter plots are recommended (the IGRAPH syntax used is not shown). Outliers are easily identifiable via POINTLABEL; however, IDs that lie close together can become unreadable (see bottom left).

[Figure: Scatter plot. The cases with the IDs 239 and 331 are immediately conspicuous and should be examined more closely.]

For grouped metric variables, grouped box plots are considered particularly suitable because they allow easy identification of conspicuous cases by means of an ID variable or the line number (if no ID variable is available). The formally conspicuous cases are then checked for content-related plausibility.

[Figure: Grouped box plot. The cases with the line numbers (if no ID is available) 331 etc. must be examined more closely.]

Pay attention to missings in the graphical check as well: neither the analysis variables (e.g. in a bivariate scatterplot) nor the ID variables necessary for identification should contain missings.
Also note that formally inconspicuous values are not necessarily plausible (see below).
[Link] Example for categorical variables
For two categorical variables, grouped bar charts or crosstabs (not shown) are useful.

[Figure: Grouped bar chart. For the variable "Tumor degree", the "99" values have no label; for the variable "Histological type", the label for the value "4.00" is missing.]
Particularly with large amounts of categorical, but also metric, data, formal inconspicuousness is often equated with content-related plausibility, which is premature. The reason is that in grouped graphical or tabular analyses the data are checked group-wise but not case-wise. Errors that affect all cases, e.g. coding errors (see above), stand out immediately. Case-wise errors, e.g. contradictory information from study participants, cannot easily be identified this way; other approaches must be chosen.
Example:
In one study, the participants were asked, among other things, whether they are or ever have been smokers (SMOKING), whether they have stopped smoking (SMOKSTOP), and how many cigarettes they smoked per day (FREQCIGS). From a formal or univariate perspective, the values in the dataset appear completely inconspicuous.
data list
/ID 1 SMOKING 3-6 (A) SMOKSTOP 8-11 (A) FREQCIGS 13-14.
begin data
1 NO NO 20
2 NO YES 30
3 YES NO 40
4 YES YES 20
5 NO NO 20
6 NO YES 30
7 YES NO 40
8 YES YES 20
9 NO NO 30
end data.
exe.
However, the picture changes when the data are cross-tabulated:

CROSSTABS
  /TABLES=SMOKING BY SMOKSTOP
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT
  /COUNT ROUND CELL.

SMOKING * SMOKSTOP Crosstabulation (Count)

                    SMOKSTOP
                    NO    YES   Total
SMOKING   NO         3      2       5
          YES        2      2       4
Total                5      4       9

For example, someone who has never smoked (SMOKING=NO) cannot stop
smoking (SMOKSTOP=YES).
MEANS TABLES=FREQCIGS BY SMOKING
  /CELLS MEAN COUNT STDDEV.

Report: FREQCIGS

SMOKING   Mean    N   Std. Deviation
NO        26,00   5    5,477
YES       30,00   4   11,547
Total     27,78   9    8,333

People who have never smoked cannot smoke an average of 26 cigarettes per day.
In all these bivariate contradictions (inconsistencies), it is not clear which of the variables involved carries the error, i.e. which variable has to be corrected to establish plausibility. Since in principle both variables can contribute an error component, other criteria within or outside the dataset must be used for clarification.
Other typical test questions are, for example:
If someone claims not to know a certain product, this person should neither own the product in question nor have any experience with it; there must be no correlation between these statements (test e.g. using a cross table).
If someone performs a survival analysis, the dataset must not contain a date of death for patients recorded as alive; this information must not occur simultaneously (logical test e.g. via an IF condition).
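A minimal sketch of such a logical test for the smoking example above (variable names as in the dataset shown; the flag variable ERROR is introduced here for illustration):

* Flag never-smokers who nevertheless report having stopped smoking
* or report a daily cigarette consumption (ERROR is an assumed flag variable).
COMPUTE ERROR = 0.
IF (SMOKING = 'NO' & SMOKSTOP = 'YES') ERROR = 1.
IF (SMOKING = 'NO' & FREQCIGS > 0) ERROR = 1.
EXECUTE.
TEMPORARY.
SELECT IF (ERROR = 1).
LIST VARIABLES=ID SMOKING SMOKSTOP FREQCIGS.

The flagged cases are then clarified against criteria within or outside the dataset; the flag alone does not tell which of the variables is wrong.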
[Link] Example for string variables
The following problem checks missing values for plausibility. In a dataset, the information in the variable "gender" is missing, but there is information about pregnancy (PREGNANCY, YES/NO). If a pregnancy is present, the missing in GENDER could be replaced by the code "F" (for female). However, the possibility of PREGNANCY=YES and GENDER="M" must first be excluded.
data list
/ID 1 GENDER 3 (A) PREGNANCY 5-8 (A).
begin data
1 M NO
2 M YES
3 M NO
4 . YES
5 F NO
end data.
exe.
compute ERROR=0.
exe.
if PREGNANCY="YES" & GENDER="M" ERROR=1.
exe.
if PREGNANCY="YES" & ERROR=0 GENDER="F".
exe.
list.
ID GENDER PREGNANCY ERROR
1 M NO ,00
2 M YES 1,00
3 M NO ,00
4 F YES ,00
5 F NO ,00
Number of cases read: 5 Number of cases listed: 5
This example replaces missings using the information in the variable PREGNANCY and the explicit exclusion of data errors; data errors are flagged in the variable ERROR.
The previous sections introduced plausibility checks for one or two variables. These seemingly simple bivariate questions can be extended, e.g. by a further variable. If a woman is 81 years old, this is nothing special; but if this woman is 81 years old and pregnant, then something may be wrong. Either the number "81" contains transposed digits or the indication "pregnant" is wrong (or all three indications are wrong). In principle, a fourth (sic) variable would now have to be used, e.g. the date of birth, to shed light on the matter.
Such a differentiated procedure is unavoidable, but for complex variable configurations or for large or multiple datasets it can be extremely costly. The following Section 8.2.3 therefore presents a first approach to a multivariate screening for (also qualitative) anomalies. Further examples of multivariate screening are given in Chapters 9 and 10, as well as for work with several datasets simultaneously in Chapter 11. Note, however, that these rather automated, quantitative approaches can also be interpreted and justified in a theory-guided (qualitative) way.
In the anomaly approach, the values identified as potentially unusual also have to be checked content-wise in a concluding step, e.g. whether they are really anomalous against the chosen criterion or whether they are permissible values. Not every formally conspicuous value is also conspicuous in terms of content.

8.2.3 Multivariate data plausibility (detection of anomalies): "Quantitative" approach
Since version 14, SPSS offers the procedure DETECTANOMALY under the "Data" menu item "Identify Unusual Cases..." for the (seemingly) uncomplicated exploration of "unusual" cases (so-called anomalies) in a dataset. This procedure is based on a cluster-analytical approach (TwoStep) and identifies unusual cases (anomalies) by their deviation from the norms of their respective cluster groups. DETECTANOMALY was developed for a general, application-independent detection of data anomalies, i.e. the definition of a certain case as an anomaly is of a general statistical nature and not application-dependent or theory-guided. DETECTANOMALY quantifies qualitative as well as quantitative differences or similarities between cases. The procedure is therefore also suitable for exploring categorical data as well as mixed datasets; string variables can be included in the analysis as categorical variables.
This approach is called "quantitative" because cases can only be assessed after a complex, multilevel transformation of the source variables, i.e. directly only in quantitative terms, but qualitatively speaking only indirectly and mediated (cf. however 8.2.2). In the qualitative, theory-guided approaches, by contrast, qualitative or quantitative differences or similarities between cases can be assessed directly, and knowledge of statistics is not necessarily required.
The advantage of the DETECTANOMALY procedure is that it simplifies the
exploration of larger sets of variables and, in contrast to the theory-guided
approaches, requires less planning and programming effort. Against this
stand the procedural complications, in particular the multilevel mathematical
processing and the subsequent transfer of the generated statistics back to the
original values (interpretation of plausibility). To evaluate whether the
procedure ran correctly (e.g. whether the statistical requirements are
fulfilled), as well as to evaluate its results, knowledge of methods and
statistics is required; otherwise an "anomaly" would be judged only 'per fiat'.
A statistical anomaly remains a potential anomaly until the plausibility of this
assumption is confirmed by a more theory-guided approach.
Assumptions and algorithm
The DETECTANOMALY procedure can be used for continuous and
categorically scaled variables (also mixed), even if missings are present.
If an ID variable is not available, the procedure uses the row numbers of
the active dataset instead. Weight variables are ignored. The
DETECTANOMALY procedure assumes that various prerequisites are
given:

The sequence of cases in the dataset is random. The identification


of a case as a potential anomaly can be influenced by the order of the
cases (rows) in the dataset. In order to exclude a possible sequence
effect resp. to ensure the stability of the findings, it is recommended
to perform the analysis several times with differently randomly sorted
cases.
- The variables are in columns, cases in rows. The TwoStep procedure
(like the cluster center analysis) is able to group cases as well as
variables. It must therefore be ensured that the dataset structure is
correct, especially when analyzing time series data.
- Continuous variables are normally distributed; categorically scaled
variables are multinomially distributed. The algorithm considers only
nominally and metrically scaled variables, not ordinal ones; users must
decide for themselves how to include ordinally scaled variables in the
analysis. The normal distribution of a continuous variable can be checked
by explorative data analysis using a histogram or a Kolmogorov-Smirnov
goodness-of-fit test. According to SPSS Technical Support (2004), a
multivariate distribution of the continuous variables is not required.
- The likelihood distance measure assumes that the variables in the cluster
model are nonconstant and independent. Independence refers both to the
variables and to the cases. The independence of metric variables can be
checked using bivariate Pearson correlations; the independence of
categorical variables using table measures, e.g. Phi, Cramer's V, etc. The
independence between a continuous and a categorical variable can be
checked by a statistical comparison of mean values (possibly also the Eta
measure). For these checks, the requirements of the respective procedures
apply in turn (e.g. Pearson correlation, Eta measure, Chi² test). According
to SPSS Technical Support (2004), the algorithm is generally considered
robust against a violation of these prerequisites; internal tests show that
the procedure is comparatively insusceptible to violations of the
independence and distribution assumptions. Nevertheless, attention
should be paid to whether and how exactly these prerequisites are
fulfilled and whether TwoStep as a procedure meets the requirements of
the data and the classification.
- Cases (rows) contain values in the explored variables, i.e. cases are not
completely empty.
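The Pearson-correlation check of the independence assumption can also be scripted outside SPSS. The following is a minimal Python sketch with hypothetical variable names and simulated data (not taken from the study); it flags variable pairs whose absolute correlation exceeds a chosen threshold as candidates for a violation of independence:

```python
from math import sqrt
import random

def pearson(x, y):
    """Bivariate Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(variables, threshold=0.8):
    """Flag variable pairs with |r| >= threshold --
    candidates for a violation of the independence assumption."""
    names = list(variables)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(variables[a], variables[b])) >= threshold]

# Hypothetical data: food_b is almost an exact multiple of food_a.
random.seed(1)
food_a = [random.gauss(100, 10) for _ in range(200)]
data = {"food_a": food_a,
        "food_b": [2 * v + random.gauss(0, 1) for v in food_a],
        "food_c": [random.gauss(50, 5) for _ in range(200)]}
print(correlated_pairs(data))   # [('food_a', 'food_b')]
```

In practice one would, as in the example below, simply request CORRELATIONS in SPSS; the sketch only makes the screening rule explicit.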

The TwoStep algorithm underlying DETECTANOMALY may not give equal
weight to the influence of continuous and categorically scaled variables (see
the MLWEIGHT option).
The finding of anomalies therefore depends not only on the particular
compilation of variables but also on their respective scaling. If a variable is
interval-scaled in one run but, after recoding, included as categorically scaled
in another run, DETECTANOMALY may flag different cases as anomalies.
According to Bacher et al. (2004), because of this peculiarity of the
underlying TwoStep algorithm, the variables to be tested should not have
mixed scale levels. The algorithm has not been changed since SPSS 12;
however, since version 14 it is possible to specify a relative weight using
MLWEIGHT and thus compensate for the differing influence of the scale
levels. Possibly the algorithm was changed in SPSS 16 (SPSS Technical
Support, 2006).
The algorithm of the procedure DETECTANOMALY runs in the phases
Modeling, Scoring and Reasoning. These three stages each pass through
further sub-steps:
I Modeling
This step creates a cluster-based model that identifies "natural" case groups
resp. clusters within a dataset based on a specified set of input variables. The
resulting cluster model and related statistics, e.g. for norm values of cluster
groups, are stored for later use.

1. Creating the training dataset: Removing cases with extreme values in
continuous variables (if specified); list-wise removing of cases with
missings, if a certain handling of missings (e.g. replacing) is not
specified; removing variables with only constant values or only
missings. The remaining cases and variables are used to find the
potential anomalies. Statistics displayed in tables are based on this
(possibly reduced) training dataset; statistics stored as variables in the
dataset are calculated for all cases.
2. Handling of missings (optional): If input variables are continuous, a
mean value is calculated for each input variable based on the valid
(existing values) and used instead of the missings. If input variables
are categorically scaled, all missings are combined into one
"Missings" category and included as a valid category in further
transformations and calculations.
3. Creating the variable "Missing Value Pct" (optional): A new
continuous variable "Missing Value Pct" is determined, which
reflects for each case the percentage of variables (continuous and
categorically combined) with missing values. If "Missing Value Pct"
is selected for the Modeling phase, the calculation of these variables
is also specified for the Scoring phase.
4. Finding clusters: The TwoStep cluster algorithm creates a cluster
model based on the prepared input variables.

Saving of statistics: The resulting cluster model and related statistics, e.g. for
norm values of the cluster groups, are stored for the Scoring step.
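The handling of missings in step 2 above (mean substitution for continuous variables, a combined "Missings" category for categorical ones) amounts to the following; a simplified Python sketch of the logic, not SPSS's internal implementation:

```python
def impute(values, categorical=False):
    """Replace missings (represented as None here): mean substitution
    for a continuous variable, a combined "Missings" category for a
    categorical one."""
    if categorical:
        return [v if v is not None else "Missings" for v in values]
    valid = [v for v in values if v is not None]
    mean = sum(valid) / len(valid)
    return [v if v is not None else mean for v in values]

print(impute([10.0, None, 14.0]))        # [10.0, 12.0, 14.0]
print(impute(["a", None, "b"], True))    # ['a', 'Missings', 'b']
```

Note that in SPSS this behavior is only activated via /HANDLEMISSING APPLY=YES; with APPLY=NO, cases with missings are excluded instead.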
II Scoring
In the Scoring phase, the cluster model from the Modeling phase is used to
identify the cluster membership (cluster group) for each case. Some indices
are created to measure the "unusualness" (anomaly) of the case with respect
to its cluster group. All cases are sorted by the amount of the anomaly values;
the (top) cases with the highest values are considered potential anomalies.
1. Screening for new valid categories: The scoring data should contain
the same input variables from the training dataset of the Modeling
phase; also the format of the variables should be the same. Cases in
the scoring data are excluded if they contain new (valid) categories in
categorical variables that do not appear in the training data during the
Modeling phase.

2. Handling of missings (optional): If input variables are continuous, a
mean value is calculated for each input variable based on the valid
(existing) values and used instead of the missings. If input variables
are categorically scaled, all missings are combined into one
"Missings" category and included as a valid category in further
transformations and calculations.
3. Creating the variable "Missing Value Pct" (depending on the
Modeling phase): A new continuous variable "Missing Value Pct" is
determined, which reflects for each case the percentage of variables
(continuous and categorically combined) with missing values.
4. Assigning a case to the nearest cluster: The cluster model from the
Modeling phase is applied to the prepared variables of the scoring
dataset in order to identify the cluster membership for each case and
to provide it with a cluster ID. Cases in noise clusters are assigned to
the nearest non-noise cluster.
5. Calculation of variable deviation indices (VDI).
6. Calculation of an index for the group deviation (GDI).
Calculation of an anomaly index and a measure of the contribution of a
variable: The Anomaly Index (AI) of a case is defined as the ratio of the GDI
of the respective case to the mean GDI of the cluster to which it belongs. The
higher the individual deviation from the average, the higher the anomaly
index. The measure of the contribution of a variable is defined as the ratio of
the VDI of the variable in question to the GDI of the case in question and
reflects the proportional contribution of that variable to the deviation of the
case. The higher a variable's contribution to the deviation, the higher this
measure. At the end of the Scoring phase, each case has a GDI, an AI, and
the corresponding VDI for each input variable.
III Reasoning
The purpose of the Reasoning phase is to rank the potential anomalies and
provide the reasons why they are considered a potential anomaly. For each
"unusual case" (anomaly) the variables are sorted according to the level of
their respective VDI values. The highest variables, their values and the
corresponding norm values are given as reasons why a case was identified as
a potential anomaly.

7. Identify the most unusual cases: The cases are sorted in descending
order according to their AI values. The highest AI values (optionally
by % or N) form the anomaly list. If an AI limit value
(ANOMALYCUTPOINT) is specified, cases less than or equal to this
value are not considered unusual.
8. Give reasons why a case may be an anomaly. For each potentially
unusual case, the variables are sorted in descending order according
to their VDI values. The top variables, their values and the
corresponding norm values are given as reasons why a case is
considered a potential anomaly.
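The arithmetic behind the Scoring and Reasoning phases can be made concrete: the anomaly index is the ratio of a case's GDI to the mean GDI of its cluster, the impact of a variable is its VDI relative to the case's GDI, and the reasons are the variables with the highest impact. A simplified Python sketch with hypothetical index values (not the SPSS implementation):

```python
def anomaly_report(gdi, vdi, mean_gdi, num_reasons=3):
    """gdi: group deviation index (GDI) of one case;
    vdi: dict variable -> variable deviation index (VDI) for that case;
    mean_gdi: mean GDI of the case's cluster."""
    ai = gdi / mean_gdi                              # Anomaly Index
    impact = {v: d / gdi for v, d in vdi.items()}    # proportional contribution
    top = sorted(impact, key=impact.get, reverse=True)[:num_reasons]
    return ai, [(v, round(impact[v], 3)) for v in top]

# Hypothetical case: GDI 8.0 in a cluster whose mean GDI is 2.0.
ai, reasons = anomaly_report(8.0, {"food17": 4.0, "food2": 2.4,
                                   "food23": 1.6}, 2.0, num_reasons=2)
print(ai, reasons)   # 4.0 [('food17', 0.5), ('food2', 0.3)]
```

The case would appear in the anomaly list with AI=4.0, with food17 reported as the dominant reason, mirroring the "Anomaly Case Reason List" output discussed below.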

Example: Computation and interpretation


A syntax version of DETECTANOMALY is presented in the following for a
statistical reason: only via syntax, with the option MLWEIGHT= in the
CRITERIA subcommand, does SPSS offer the possibility to compensate for
the potentially different influence of continuous and categorical variables.
A. Example:
In a study on the relationship between dietary habits and breast cancer in
women (N=663), various data on dietary components (including fatty acids,
saccharides, cholesterol, etc.) were screened for unusual cases. The aim was
to detect and correct data errors (including input errors and erroneous
information) before a complex multivariate prognosis model was analyzed.
All variables included in DETECTANOMALY (FOOD2, FOOD4 to
FOOD29) are interval-scaled and normally distributed; the results of testing
these assumptions are not presented here.
If an analysis also includes categorically scaled variables, the SPSS output
does not deviate substantially from the output described below.
B. Syntax:
First the continuous variables are tested for normal distribution (macro
MAC_CORR) and then for independence (CORRELATIONS). Before the
actual analysis for unusual cases, a randomly generated variable
MY_RANDOM is created and the dataset is sorted by it. Subsequently,
DETECTANOMALY is used to perform the analysis for unusual cases.
Finally, a bivariate scatterplot is requested for the first reason. Of the
procedures used, only DETECTANOMALY will be explained.
DEFINE mac_corr (!POS!CHAREND('/')).
!DO !i !IN (!1).
GRAPH
/HISTOGRAM(NORMAL)=!i .
!DOEND
!ENDDEFINE.
mac_corr food2 food4 food5 food6 food13 food14 food17 food18
food19 food20 food21 food22 food23 food27 food28 food29 /.
CORRELATIONS
/VARIABLES= food2 food4 food5 food6 food13 food14 food17
food18 food19 food20 food21 food22 food23 food27
food28 food29
/PRINT=TWOTAIL NOSIG
/MISSING=PAIRWISE .
compute MY_RANDOM= RV.UNIFORM(0,1000).
exe.
sort cases by MY_RANDOM (A).
exe.
DETECTANOMALY
/VARIABLES
SCALE=food2 food4 food5 food6 food13 food14 food17
food18 food19 food20 food21 food22 food23 food27
food28 food29
ID=patnr
/HANDLEMISSING APPLY=NO
/CRITERIA ANOMALYCUTPOINT=2 PCTANOMALOUSCASES=5
MINNUMPEERS=1 MAXNUMPEERS=15 NUMREASONS=3
/PRINT CPS ANOMALYLIST ANOMALYSUMMARY NORMS
REASONSUMMARY
/SAVE ANOMALY(Anomaly_Index) PEERID(Group_ID)
PEERSIZE(Group_Size) PEERPCTSIZE(Group_Size_Pct)
REASONVAR(Reason_Var) REASONMEASURE(Reason_Measure)
REASONVALUE(Reason_Value) REASONNORM(Reason_Norm) .
GRAPH
/SCATTERPLOT(BIVAR)=Reason_Measure_1 WITH Anomaly_Index
BY Group_ID BY PatNr (NAME)
/MISSING=LISTWISE .
GRAPH
/SCATTERPLOT(BIVAR)=Reason_Measure_2 WITH Anomaly_Index
BY Group_ID BY PatNr (NAME)
/MISSING=LISTWISE .
GRAPH
/SCATTERPLOT(BIVAR)=Reason_Measure_3 WITH Anomaly_Index
BY Group_ID BY PatNr (NAME)
/MISSING=LISTWISE .
C. Explanation of the DETECTANOMALY syntax:
The DETECTANOMALY command requests that a variable set or dataset be
checked for the presence of unusual cases. By default, the procedure outputs
three tables: the list of indices of anomalous cases (potentially unusual cases
and their anomaly values), the list of group IDs of potentially anomalous
cases (unusual cases and information about their groups), and the list of
reasons for potentially anomalous cases (ID, reason variables, impact, value
and norm of the variable, etc.).
Via /VARIABLES, variables are passed to the DETECTANOMALY
procedure. After SCALE= the continuous variables are specified, e.g.
FOOD2, FOOD4 etc. After CATEGORICAL=, categorically scaled variables
or strings can be specified (not in the example). After ID=, an ID variable
was specified, e.g. PATNR. With EXCEPT it would be possible to exclude
variables from the analysis, if they are not already specified under SCALE=
or CATEGORICAL= (only possible via syntax).
The /HANDLEMISSING subcommand is used to specify whether and how
DETECTANOMALY should deal with missings during the Modeling phase:
After APPLY=YES a mean value is used for each continuous input variable
instead of the missings. For categorically scaled input variables, all missings
are combined into a separate "Missings" category and included as a valid
category in further transformations and calculations. After APPLY=NO, all
cases with missings are excluded from the analysis. With
CREATEMISPROPVAR=YES (NO) (not in the example) an additional
variable "Missing Proportion Variable" can be created and included in further
analyses. The "Missing Proportion Variable" indicates for each case the
number of variables with missings in the dataset.
After the /CRITERIA subcommand, various settings are passed to
DETECTANOMALY.
ANOMALYCUTPOINT=number (default: 2) causes DETECTANOMALY
to identify cases as potential anomalies only if their anomaly index is greater
than or equal to the specified limit. The value must be a positive integer. The
threshold can be used in conjunction with PCTANOMALOUSCASES
(percentage of cases; maximum: 100; in the example: 5) or
NUMANOMALOUSCASES (fixed number of cases; maximum: N rows of
the active dataset; not used in the example). If ANOMALYCUTPOINT=3
and NUMANOMALOUSCASES=100, the anomaly list can contain at most
100 cases with an AI greater than or equal to 3.
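The combined effect of the cutpoint and the case limits can be pictured as a simple filter: sort by anomaly index, drop everything below the cutpoint, then cap the length of the list. A Python sketch of that logic (the anomaly indices are taken from the example output below; the function itself is only illustrative):

```python
def anomaly_list(ai_by_case, cutpoint=2.0, max_cases=None, max_pct=None):
    """Return case IDs sorted by descending anomaly index, keeping
    only AI >= cutpoint and at most max_cases cases (or max_pct
    percent of all cases)."""
    ranked = sorted(ai_by_case, key=ai_by_case.get, reverse=True)
    kept = [c for c in ranked if ai_by_case[c] >= cutpoint]
    if max_pct is not None:
        max_cases = max(1, int(len(ai_by_case) * max_pct / 100))
    return kept[:max_cases] if max_cases is not None else kept

# First anomaly indices from the example, keyed by PATNR.
ai = {399: 7.93, 527: 5.83, 450: 3.95, 178: 3.01, 451: 1.9}
print(anomaly_list(ai, cutpoint=2.0, max_cases=3))   # [399, 527, 450]
```

The count limit only caps the list; it never adds cases below the cutpoint.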
Using the inputs MINNUMPEERS=number (default: 1) and
MAXNUMPEERS=number (default: 15), DETECTANOMALY searches for
the best number of groups between the specified minimum and maximum
value. The values must be positive integers. The value under
MINNUMPEERS= must not be greater than that under
MAXNUMPEERS=. If the values are equal, the procedure assumes a fixed
number of groups. Depending on the data variation, even fewer groups may
be determined, i.e. fewer than specified under MINNUMPEERS=. The
number of clusters should be adjusted to characteristics of the data basis
(complexity of the cluster procedure depending on the number of variables,
number of cases and value variation). With few cases, value characteristics
and variables it is difficult to form clusters at all; with many cases, value
characteristics and variables it is difficult to find few clusters.
NUMREASONS=number is used to preset the maximum number of reasons
to be displayed in the anomaly list (in the example: 3; default: 1; maximum:
N variables in the active dataset). A reason consists of the VDI (variable
impact measure), the variable name for this reason, the value of the variable,
and the value of the corresponding group. The value must be a positive
integer. If NUMREASONS=0, the PRINT option REASONSUMMARY is
suppressed.
With MLWEIGHT=number (syntax only; not in the example) the
potentially different influence of continuous and categorical variables can be
compensated (default: 6). The value must be a positive integer.
The /PRINT subcommand allows you to customize the SPSS output. For
example, NONE (not in the example) suppresses all output except warnings
and notes.
CPS requests the “Case Processing Summary” table, including counts and
percentages for all cases in the active dataset, for the cases included in and
excluded from the analysis, and for the cases in each group.
ANOMALYLIST requests the display of three tables: the "Anomaly Case
Index List" table (IDs of potentially unusual cases and their anomaly values),
the "Anomaly Case Peer ID List" table (IDs of unusual cases and information
about their groups), and the "Anomaly Case Reason List" table (ID, reason
variables, impact, value, and norm of the variable, etc.). All lists are sorted
according to the AI; if no ID is specified under VARIABLES ID=, the line
number is used.
ANOMALYSUMMARY requests the table "Anomaly Index Summary".
This summary contains descriptive statistics for the Anomaly Indexes of the
cases identified as most unusual.
NORMS requests the table "Scale Variable Norms" (if metrically scaled
variables were specified under SCALE=; see example), as well as the table
"Categorical Variable Norms" (if categorically scaled variables were
specified under CATEGORICAL=; not in the example). The table "Scale
Variable Norms" shows mean values and standard deviations of each metric
variable per group. The table "Categorical Variable Norms" shows most
popular categories (modes), frequencies and percentages of each categorical
variable per group. Mean values of metric variables and modes of categorical
variables are used as norm values in the analysis.
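The norm values can be reproduced with elementary statistics: the mean per group for a metric variable, the mode per group for a categorical one. A minimal Python sketch with hypothetical values:

```python
from collections import Counter

def group_norm(values, categorical=False):
    """Norm value of one variable within one peer group:
    mode for a categorical variable, mean for a metric one."""
    if categorical:
        return Counter(values).most_common(1)[0][0]
    return sum(values) / len(values)

print(group_norm([4.0, 7.0, 13.0]))                # 8.0 (group mean)
print(group_norm(["low", "low", "high"], True))    # 'low' (group mode)
```

These are exactly the per-group values reported later in the "Scale Variable Norms" (means) and "Categorical Variable Norms" (modes) tables.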
REASONSUMMARY requests n separate tables for the analysis of the n
reasons. N is specified by the CRITERIA setting NUMREASONS=. If
NUMREASONS=0, REASONSUMMARY is ignored and an error message
is output (in the example: 3; default: 1; maximum: N variables in the active
dataset). The n tables "Reason n" indicate per reason the frequencies and
percentages of the occurrence of each input variable as possible reason. The
tables also contain statistics of the variable impact (among other things
minimum, maximum etc.).
The /SAVE subcommand can be used to specify under which variable names
the variables calculated by DETECTANOMALY are to be stored in the
active dataset. The SPSS conventions for the assignment of valid variable
names apply here. If no variable names are assigned by the user, a default
variable name is assigned. If variables already exist in the active dataset
whose names correspond to the user-defined or system-defined variable
names, DETECTANOMALY prevents accidental overwriting of the existing
data by appending a suffix to the variables to be stored. The values in these
additional variables are assigned to all cases, even if they are not included in
the anomaly list. The SAVE subcommand is independent of settings in the
CRITERIA options PCTANOMALOUSCASES,
NUMANOMALOUSCASES and ANOMALYCUTPOINT. The settings
mean in detail:

“ANOMALY(Anomaly_Index)”: The anomaly index is stored in a variable
named "Anomaly_Index".
The following three variables are created for each group. The names can be
assigned individually; in contrast to operation via the dialog boxes, no
common root name needs to be used.

"PEERID(Group_ID)": The group ID is stored in a variable named
"Group_ID".
"PEERSIZE(Group_Size)": The group size (N) is stored in a variable
named "Group_Size".
"PEERPCTSIZE(Group_Size_Pct)": The group size (in percent) is stored
in a variable named "Group_Size_Pct".

The following four variables are created for each reason; for n reasons, n
variables of each type are created. With NUMREASONS=0, this setting is
ignored. The names can be assigned individually; in contrast to operation via
the dialog boxes, no common root name needs to be used.

"REASONVAR(Reason_Var)": The variable associated with a reason is
stored in a variable named "Reason_Var". Depending on the number n of
reasons set under NUMREASONS=, a maximum of n variables with the
names "Reason_Var_1", "Reason_Var_2", ..., "Reason_Var_n" are output.
"REASONMEASURE(Reason_Measure)": The extent to which a variable
is associated with a reason (impact measure) is stored in a variable named
"Reason_Measure". A maximum of n variables with the names
"Reason_Measure_1", ..., "Reason_Measure_n" are output.
"REASONVALUE(Reason_Value)": The value of the reason variable is
stored in a variable named "Reason_Value". A maximum of n variables with
the names "Reason_Value_1", ..., "Reason_Value_n" are output.
"REASONNORM(Reason_Norm)": The norm value of the group is stored
in a variable named "Reason_Norm". A maximum of n variables with the
names "Reason_Norm_1", ..., "Reason_Norm_n" are output.

D. Results:
Detect anomaly
After the heading "Detect Anomaly", the tables of "Case Processing
Summary", "Anomaly Case Index List", "Anomaly Case Peer ID List", and
"Anomaly Case Reason List" follow first. These lists are sorted according to
the anomaly index. If no ID was specified under VARIABLES ID=, the line
number is used. Finally, the table "Scale Variable Norms" and the table
"Anomaly Index Summary" follow.
Case Processing Summary

              N     % of Combined   % of Total
Peer ID 1      86       13,0%          13,0%
        2     250       37,7%          37,7%
        3     327       49,3%          49,3%
Combined      663      100,0%         100,0%
Total         663                     100,0%

The "Case Processing Summary" table shows how the cases are assigned to
groups of "similar" cases. In the example, three groups were identified,
containing N=86 (13.0%), N=250 (37.7%), and N=327 (49.3%) of the cases,
respectively. The table "Anomaly Case Index List" shows a preset proportion
(in the example: 5%, with an anomaly index >= 2) of the cases with the
highest anomaly indices, including line number and ID variable.
Anomaly Case Index List

Case   patnr   Anomaly Index
399      399       7,934
527      527       5,831
450      450       3,952
178      178       3,009
451      451       2,688
511      511       2,670
546      546       2,623
 24       24       2,494
539      539       2,355
596      596       2,340
245      245       2,169
106      106       2,144
128      128       2,140
497      497       2,132
107      107       2,095
602      602       2,093
466      466       2,089
536      536       2,086
313      313       2,080
657      657       2,074
488      488       2,035
478      478       2,007

Values are displayed from the highest anomaly index downwards, including
the ID (PATNR). The higher an anomaly index, the more unusual a case is
compared to its group. Of the 22 values displayed, the top ten stand out
clearly, of which the four top values (PATNR 399, 527, 450, 178) are most
likely anomalous. The values from PATNR 451 downwards are increasingly
homogeneous and should be examined in detail.

The "Anomaly Case Peer ID List" gives for the potentially anomalous cases
(see also the table "Anomaly Cases Index List") additional information about
the respective associated groups.
Anomaly Case Peer ID List

Case   patnr   Peer ID   Peer Size   Peer Size Percent
399      399      3         327           49,3%
527      527      3         327           49,3%
450      450      3         327           49,3%
178      178      2         250           37,7%
451      451      3         327           49,3%
511      511      3         327           49,3%
546      546      3         327           49,3%
 24       24      2         250           37,7%
539      539      1          86           13,0%
596      596      2         250           37,7%
245      245      2         250           37,7%
106      106      3         327           49,3%
128      128      2         250           37,7%
497      497      2         250           37,7%
107      107      3         327           49,3%
602      602      2         250           37,7%
466      466      3         327           49,3%
536      536      1          86           13,0%
313      313      2         250           37,7%
657      657      3         327           49,3%
488      488      2         250           37,7%
478      478      2         250           37,7%

Of the top four cases (PATNR 399, 527, 450, 178), three belong to group 3
(N=327, 49.3%; see also the table "Case Processing Summary"), while the
case PATNR 178 belongs to group 2. The next three cases again belong to
group 3.

For the potentially anomalous cases, the table "Anomaly Case Reason List"
shows information about the corresponding reasons (ID, reason variable, and
impact, value and norm of the variable). Depending on the number of
requested reasons (see NUMREASONS), n lists are displayed (see the note
"Reason 1", ..., "Reason n" in the top left corner). For an analysis of the
reason variables, please refer to the tables "Reason 1" to "Reason n". Reason
variables are those variables that contribute most to the identification of a
case as "unusual". For PATNR 399, this would be e.g. the input variable
FOOD17.
Anomaly Case Reason List
Reason: 1

Case   patnr   Reason Variable   Variable Impact   Variable Value   Variable Norm
399      399       food17             ,140           234460,43        56735,0749
527      527       food2              ,230               56,40           11,8224
450      450       food5              ,222              898,17          172,6728
178      178       food22             ,301             4579,21         1948,7974
451      451       food22             ,211             7965,02         2862,3449
511      511       food17             ,231           185770,84        56735,0749
546      546       food18             ,413           212520,44        50787,8672
 24       24       food29             ,366             2057,28          621,9725
539      539       food18             ,496            88621,03        24007,2788
596      596       food29             ,324             1922,64          621,9725
245      245       food29             ,338             1899,33          621,9725
106      106       food2              ,306               42,01           11,8224
128      128       food29             ,359             1932,73          621,9725
497      497       food28             ,253                6,03            2,2217
107      107       food22             ,634            10981,42         2862,3449
602      602       food22             ,565             4975,79         1948,7974
466      466       food6              ,185              898,27          342,4960
536      536       food13             ,331             5813,37         2382,3558
313      313       food22             ,520             4835,35         1948,7974
657      657       food6              ,184              895,23          342,4960
488      488       food29             ,437             2039,19          621,9725
478      478       food28             ,328                6,44            2,2217

For a better understanding, this table is described from right to left below;
the final summary then reads from left to right.

For each group, a norm value is determined for each reason variable, shown
under “Variable Norm”. In this example the norm values represent mean
values of the input variables, i.e. directly interpretable empirical values. For
PATNR 399 the norm value of the corresponding group 3 is e.g. 56735,08.
This value corresponds to the mean value of the disaccharides (see the table
"Scale Variable Norms" below). If m groups are given
in the "Case Processing Summary", up to (m groups x n reasons) norm values
can be shown in the table "Anomaly Case Reason List". The "Variable
Value", on the other hand, indicates the individual value for the respective
case. For PATNR 399, the variable value is e.g. 234460,43. The individual
variable value (234460,43) deviates clearly from the norm value (56735,08)
of the corresponding group. The column "Variable Impact" indicates the
proportional contribution of a reason variable to the deviation of the
respective case from its group. The expected value is 1/N of the respective
group; since group 3 contains N=327 cases, the expected value is about
0.003. The actual variable impact is 0.140, many times higher than expected.
The column "Reason Variable" finally indicates which input variable exerts
this impact on PATNR 399. In summary, the variable FOOD17 exerts a
much higher impact on PATNR 399 (see "Variable Value" resp. "Variable
Impact") than could have been expected (see "Variable Norm" resp. 1/N).
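The comparison of observed and expected impact is plain arithmetic and can be verified quickly; a short check with the figures from the example:

```python
group_size = 327            # N of group 3 (see "Case Processing Summary")
expected_impact = 1 / group_size
observed_impact = 0.140     # "Variable Impact" of FOOD17 for PATNR 399

print(round(expected_impact, 4))                     # 0.0031
print(round(observed_impact / expected_impact, 1))   # 45.8, i.e. roughly 46 times higher
```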
Per case, all other n reasons can also be displayed if the reasons are moved
from "Layer" to "Row" via Pivoting Trays → Pivot. Such a view makes it
easier to compare the impact of the n reasons per case. For space reasons, this
table is not shown completely.
Anomaly Case Reason List – By Reason [abbreviated]

Case   Reason   patnr   Reason Variable   Variable Impact   Variable Value   Variable Norm
399       1       399       food17             ,140           234460,43        56735,0749
          2       399       food2              ,124               49,56           11,8224
          3       399       food23             ,106                7,42            3,1622
527       1       527       food2              ,230               56,40           11,8224
          2       527       food14             ,159             4982,84         1487,7781
          3       527       food13             ,156            14956,79         5170,7182
450       1       450       food5              ,222              898,17          172,6728
          2       450       food4              ,215              287,39           51,4581
          3       450       food6              ,194             1154,46          342,4960
178       1       178       food22             ,301             4579,21          …
          2       178       food13             ,103              …                …
          3       178       food21              …                …                …

PATNR 399 is also characterized by a much higher impact of the input
variables FOOD2 and FOOD23; their variable impact values are 0.124 and
0.106, respectively. The central information of this table, however, is that
different impact variables dominate in the scatter diagrams for the reason
variables presented at the end.
The table "Scale Variable Norms" is in principle nothing other than the
output of the mean values and standard deviations of each metric input
variable per group. The "Categorical Variable Norms" table is not output,
since no categorically scaled variables were included in the analysis. Mean
values of metric variables and modes of categorical variables are used as
variable norms in the analysis.
Scale Variable Norms

                                                         Peer ID
                                                 1             2             3
Linoleic acid [g]            Mean             4,8447        7,2959       11,8224
                             Std. Deviation   2,58184       2,59944       6,13842
Eicosatrienoic acid (mg)     Mean            22,1112       34,1825       51,4581
                             Std. Deviation  12,22562      18,07146      42,20711
Eicosatetraenoic acid/       Mean            75,3609      118,0614      172,6728
arachidonic acid (mg)        Std. Deviation  35,65714      55,86954     127,45775
Cholesterol (mg)             Mean           150,7978      245,4070      342,4960
                             Std. Deviation  51,56659      76,00221     151,23608
Cellulose (mg)               Mean          2382,3558     3223,6624     5170,7182
                             Std. Deviation 931,35131     797,19311    1618,08314
Lignin (mg)                  Mean           610,7591      830,6855     1487,7781
                             Std. Deviation 300,93520     282,46025     574,97284
Disaccharides (mg)           Mean         24376,7874    34652,7113    56735,0749
                             Std. Deviation 15198,28417  12057,81199   26964,73170
Monosaccharides (mg)         Mean         24007,2788    31297,3410    50787,8672
                             Std. Deviation 13466,38388  14319,18286   24844,26971
Polysaccharides (mg)         Mean         55464,2030    88326,4866   125719,4042
                             Std. Deviation 14600,05901  22535,22338   38958,75634
Starch [g]                   Mean            54,3481       87,0415      124,3179
                             Std. Deviation  14,52278      22,54739      38,85230
Calcium [100 mg]             Mean             4,3531        7,3317       11,7775
                             Std. Deviation   1,84595       2,94678       5,53611
Copper (myg)                 Mean          1357,3269     1948,7974     2862,3449
                             Std. Deviation 304,19898     525,31363    1115,95784
Copper [mg]                  Mean             1,8772        2,3359        3,1622
                             Std. Deviation    ,44534        ,46437        ,73218
Vitamin A - retinol          Mean           705,8472     1026,1352     1729,1481
equivalent (myg)             Std. Deviation 283,72940     329,56463     667,60671
Beta-carotene [mg]           Mean             1,8572        2,2217        4,2582
                             Std. Deviation   1,19097        ,95196       2,27872
Retinol (myg)                Mean           365,9673      621,9725      944,4827
                             Std. Deviation 185,40915     308,14736     562,01236

The only "difficulty" is to correctly match the values, which are output here
with a variable label, to the values output above with a variable name only.
The norm value 56735,08 in group 3, however, can be clearly assigned to the
disaccharides; seen in this light, PATNR 399 differs from the other members
of its group by its unusually high disaccharide values.
The table of the "Scale Variable Norms" makes it easier to compare the norm
values per group and thus to understand which variables (possibly)
contributed to which group formation. If mean values over all three groups
are approximately constant, a substantial impact of the respective variables
on the group formation cannot be assumed. If the mean values differ across
the groups, an impact of the respective variables on the group formation can
be assumed.
One should not be tempted to "check" these differences by means of
significance tests. Inferential statistical tests can only be used for descriptive
purposes and not for testing the null hypothesis that there are no differences
between the groups in the variables concerned, because the cases have been
cluster-analytically assigned to the groups in such a way that the differences
in the variables are maximized resp. the variances within the groups are
minimized. Furthermore, the alpha would have to be adjusted with respect to
the number of tests performed.
If the output variable lists are to some extent manageable, the determined
groups can be described analogously to the factor analysis according to the
mean values of the variables comparable to a factor loading. From a certain
amount of variables and/or groups these descriptions, all the more content-
wise plausible (causal) interpretations, become more extensive and
sophisticated (particularly with categorically scaled variables).

The table "Anomaly Index Summary" presents descriptive statistics for the
anomaly index (e.g. minimum, maximum, mean and standard deviation) for
the potentially anomalous cases (N=22) compiled at the beginning.
Anomaly Index Summary
              N in the Anomaly List   Minimum   Maximum   Mean    Std. Deviation
Anomaly Index 22 2,007 7,934 2,774 1,438
N in the Anomaly List is determined by the specification: anomaly percentage is 5% and anomaly
index cutpoint is at least 2
The legend shows the criteria for compiling the 22 cases (see
ANOMALYCUTPOINT resp. PCTANOMALOUSCASES).
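The two criteria named in the legend correspond to the keywords documented for DETECTANOMALY. A minimal sketch of a matching call might look as follows; the variable list food1 TO food29 and the ID variable patnr are assumptions based on this example.

```spss
* Sketch of a DETECTANOMALY call matching the legend above;
* variable list and ID name are assumed from the example data.
DETECTANOMALY
  /VARIABLES SCALE=food1 TO food29 ID=patnr
  /CRITERIA PCTANOMALOUSCASES=5 ANOMALYCUTPOINT=2 NUMREASONS=3.
```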
Reason Summary
After the heading "Reason Summary", n tables with the headings „Reason 1“,
„Reason 2“, ... „Reason n“ follow, depending on the number of reasons (see
NUMREASONS=n). For space reasons only one table is shown and
explained.
Reason 1
Occurrence as Reason Variable Impact Statistics
Frequency Percent Minimum Maximum Mean
Linoleic acid [g] 2 9,1% ,230 ,306 ,268
Eicosatrienoic acid (mg) 0 0,0% . . .
Eicosatetraenoic acid/arachidonic acid (mg) 1 4,5% ,222 ,222 ,222
Cholesterol (mg) 2 9,1% ,184 ,185 ,184
Cellulose (mg) 1 4,5% ,331 ,331 ,331
Lignin (mg) 0 0,0% . . .
Disaccharides (mg) 2 9,1% ,140 ,231 ,186
Monosaccharides (mg) 2 9,1% ,413 ,496 ,454
Polysaccharides (mg) 0 0,0% . . .
Starch [g] 0 0,0% . . .
Calcium [100 mg] 0 0,0% . . .
Copper (myg) 5 22,7% ,211 ,634 ,446
Copper [mg] 0 0,0% . . .
Vitamin A - retinol equivalent (myg) 0 0,0% . . .
Beta-carotene [mg] 2 9,1% ,253 ,328 ,291
Retinol (myg) 5 22,7% ,324 ,437 ,365
Overall 22 100,0% ,140 ,634 ,335

The "Reason 1" table indicates for each variable in the analysis how many
times it has been named as "Reason 1" in the "Anomaly Case Reason List"
table. Here, too, there is a certain difficulty in correctly matching the
information now output with a variable label to the reasons previously output
with a variable name only. In the "Anomaly Case Reason List" for reason 1,
for example, FOOD29 appears five times and corresponds to the N=5 entries
for "Retinol (myg)" in the "Reason 1" table. However, the layout of the
current output carries a certain risk of confusion, e.g. with "Copper (myg)"
at the very bottom. The remedy is a reassuring look into the active dataset.
For each variable as reason 1, minimum, maximum, and mean are displayed
alongside frequency and percentage. If a variable occurs only once as reason
1, no standard deviation can be calculated. If a variable does not occur as a
reason, no descriptive statistics are displayed.
With these tables the actual output of DETECTANOMALY is completed.
Finally, the additionally requested bivariate scatter plots shall be explained.

Graph
On the x-axis, the impact measure for reason variable 1 is plotted; on the y-
axis, the anomaly index. The cases are marked with their PATNR ID and
colored according to their group membership.

As an additional interpretation aid, a reference line was drawn at the level
of the anomaly index = 2. When interpreting the plots, note that the x-axes
have different ranges: in the graph for reason variable 1 the range extends to
0.7, whereas in the graph for reason variable 3 it extends only to approx.
0.2.

If only the x-axis (impact measure) is taken into account when interpreting
the reason variable 1, the impact of the individual variable on the respective
case can be described as rather inconspicuous. If, however, the y-axis
(anomaly index) is also taken into account, PATNR 399, 527 and 450 are
clearly unusual cases. All three cases belong to the same group (Group
ID=3). In the two further scatter plots for the reason variables 2 and 3,
PATNR 399, 527 and 450 are also identified as unusual cases.
When interpreting these scatter diagrams, it should be noted that, per case,
different impact variables dominate in the specified reason variables. For
illustration purposes, the above already given Anomaly Case Reason List,
grouped by reasons, is repeated here.
Anomaly Case Reason List – By Reason [abbreviated]
Case  Reason  patnr  Reason Variable  Variable Impact  Variable Value  Variable Norm
399 1 399 food17 ,140 234460,43 56735,0749
2 399 food2 ,124 49,56 11,8224
3 399 food23 ,106 7,42 3,1622
527 1 527 food2 ,230 56,40 11,8224
2 527 food14 ,159 4982,84 1487,7781
3 527 food13 ,156 14956,79 5170,7182
450 1 450 food5 ,222 898,17 172,6728
2 450 food4 ,215 287,39 51,4581
3 450 food6 ,194 1154,46 342,4960
178 1 178 food22 ,301 4579,21 …
2 178 food13 ,103 … …
3 178 food21 … … …
In the first scatter diagram (Reason 1), the variables FOOD17
(Disaccharides), FOOD2 (Linoleic acid) and FOOD5 (Eicosatetraenoic
acid/arachidonic acid) are particularly influential for PATNR 399, 527 and
450. In the second scatter diagram (Reason 2), the variables FOOD2 (Linoleic
acid), FOOD14 (Lignin) and FOOD4 (Eicosatrienoic acid) are particularly
influential. In the third scatterplot (Reason 3), the variables FOOD23
(Copper), FOOD13 (Cellulose) and FOOD21 (Calcium) are particularly
influential.
Concluding remarks:
The anomaly approach, like cluster analysis, is not an inferential statistical
procedure; a transfer of the determined results to units other than the
examined cases is not easily possible. Cases identified as "unusual" on the
basis of purely formal criteria should be stable with respect to minor
variations (e.g. the sequence of cases in the dataset, variable scaling,
MLWEIGHT, etc.). In a final step, the values that are ultimately consistently
identified as potentially unusual should also be checked in terms of content,
e.g. whether they are actually incorrect values caused by measurement or
input errors, substantively genuine anomalies, or permissible values
according to the content criteria.
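The stability with respect to the case sequence mentioned above can be probed by randomly reordering the dataset and re-running the procedure. A minimal sketch; the helper variable RND is an arbitrary name:

```spss
* Sketch: randomly reorder cases to check the stability of the anomaly results.
COMPUTE rnd = RV.UNIFORM(0,1).
SORT CASES BY rnd.
* ... then re-run DETECTANOMALY and compare the flagged cases.
```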
When assessing whether conspicuous values are actual anomalies or not, it
must also be considered whether the examined data represent a full survey or
a sample. The smaller a sample is, the more likely individual values are to
appear conspicuous. Conversely, values that appear conspicuous in a sample
do not necessarily represent anomalies in the population (full survey). As
the chapter on outliers showed, not every formally conspicuous value is also
conspicuous in terms of content.

9 More Efficiency
Checking multiple variables and criteria using multiple
checking rules
Validate data through a central screening program (menu
item "Validation")

Since version 14, SPSS has offered, with the menu item "Validation" resp. the
procedure VALIDATEDATA, several possibilities to check variables and
values for correctness. The name of the module providing the menu item
"Validation" resp. the SPSS procedure VALIDATEDATA differs depending on
the SPSS version: "Data Preparation" in SPSS V15, "Data Validation" in
SPSS V14.

Checking for plausibility assumes that all criteria presented so far have been
met (see Chapter 8 for the reference to the possible relevance of Chapter 12).
As can be seen in the DQ Pyramid, the criterion of plausibility is covered by
two chapters. Chapter 8 introduced the basic principle, first simple
approaches, and the multivariate anomaly approach. This chapter introduces
the use of much more sophisticated screening rules, which need to be
programmed only once. It should be pointed out from the start that SPSS
syntax is more flexible to program than it might seem from this chapter on
VALIDATEDATA and the screening of at most two variables (see Chapters
10 and 11).
The quality criteria that can be checked uni- as well as multivariately
(depending on the checking rules) are missings, uniformity, outliers and
plausibility for analysis variables, as well as duplicates for ID variables resp.
case identifiers.
Under the "Basic Checks" tab, analysis variables can be subjected to basic
checks, e.g. a maximum percentage of missing values, a maximum percentage
of categories with a count of 1, or a minimum variance (coefficient of
variation, standard deviation). These checks are preset and are applied to
several selected variables (automatically grouped by measurement level)
simultaneously (see 9.1.). Depending on the requirements and the amount of
data, checking rules can be used to check a) one criterion per variable, b)
several criteria per variable, or c) several criteria for several variables at the
same time.
Under the tab "Single-Variable Rules", pre-installed checking rules can be
applied to individual variables. The checking rules differ according to the
type (numeric, string) of variables and check e.g. for compliance with a 0/1
dichotomy etc. The section will additionally show how to develop and apply
own validation rules, e.g. also for date variables (see 9.2.).
Under the "Cross-Variable Rules" tab, system or user-defined validation
rules can be applied to multiple variables. Test rules for multiple variables
are not pre-installed, but must first be developed from test rules for individual
variables. The section 9.3. will show additionally, how own test rules can be
developed and applied among other things also for several date variables (see
9.3.). In 9.4., the creation and execution of own rules for several variables
(cross-variable rules, e.g. for date variables) with VALIDATEDATA syntax
is introduced. Under 9.5., you will find further (tested) examples of check
rules, e.g. for the detection of certain characters within a string (e.g.
STRING). These example programs have been tested, but are not explained
further.
The following chapter will first introduce the mouse control of the "Data"
menu item "Validation"; finally, it will move on to the extension by self-
written test programs in SPSS syntax. To enable users to understand the
functionality of this versatile validation tool in detail, the following examples
are demonstrated using the two SPSS datasets "Mouse [Link]" and
"Employee [Link]".
The menu item "Validation" resp. the SPSS procedure VALIDATEDATA
are only available with the module "Data Preparation" (designation in SPSS
V15) or "Data Validation" (designation in SPSS V14). Both accesses also
require that the measurement level has been correctly communicated to
SPSS; this can be checked in the active dataset, e.g. in the Variable View in
the "Measure" column. To guard against an inadvertently misassigned
measurement level (and thus suboptimal data checks), it is recommended to
supplement the descriptive checks of the analysis variables with simple
frequency analyses, even for variables on a metric scale level. The
measurement level can also be adjusted afterwards via syntax, e.g. via
VARIABLE LEVEL (see the example under 9.1.).
Example
VARIABLE LEVEL
List of metric variables (SCALE)
/ List of nominal variables (NOMINAL)
/ List of ordinal variables (ORDINAL).

The validation process ignores weighting variables and treats them like any
other analysis variable. The default values set by SPSS are sometimes a bit
generous; depending on the requirements of data quality or analysis
relevance, they could be set much more restrictively.
A first advantage of the menu item "Validation" is the gradation from basic
checks to increasingly complex checks with one or more rules. A further
advantage is that, independently of mouse or syntax access, rules once
created can automatically be used as menu entries for mouse control. For
syntax access in particular, this means that additional, customized single-
and cross-variable rules can be bundled over time. Scattered, already existing
SPSS check programs can be centralized in a single screening program that
becomes more powerful over time (provided that their check logic can be
integrated into the VALIDATEDATA syntax).
Larger research departments would thus have the possibility, for example, to
proceed in such a way that a first person or department would program (or, if
necessary, assemble from other departments) and test the required rules via
syntax, while a second person or department would use the tested rules via
mouse clicks for practical data validation. For multiple or multivariate checks
of large amounts of data which may be delivered repeatedly, the regular use
of a check program in VALIDATEDATA syntax is highly recommended.
Not a disadvantage as such, but at least a challenge related to complexity:
depending on the check logic, programming via mouse or syntax access can
be quite complex in individual cases and thus error-prone, especially when it
comes to implementing a not always simple check logic in SPSS syntax as
positively or negatively formulated validation rules.
Also, the programming environment of VALIDATEDATA is currently not
flexible enough to include many of the other approaches from the later
chapters. For simple or one-time checks the menu item "Validation" resp. the
SPSS procedure VALIDATEDATA might be too complex in individual
cases. In this case it is possible to use approaches from other chapters.
In any case it must be clearly pointed out from the beginning that the menu
item "Validation" resp. the SPSS procedure VALIDATEDATA serve only
for the identification of possible errors, but not for their interpretation resp.
correction.
The interpretation of the results should always consider the two-faced
problem of false positives or false negatives. False positives are e.g. error
messages, which turn out to be false alarms on closer inspection. False
negatives are e.g. apparently absent error messages.
A very simple cause for false positives or false negatives can be e.g. the
wrong programming of own test rules. This can happen when test rules are
supposed to indicate a deviation from criteria as an error, but due to an
incorrectly implemented test logic they inadvertently indicate the compliance
with certain criteria (see 9.2.3. and 9.3.).
With VALIDATEDATA, checking rules issue an error message if certain
criteria are met (see 9.3.); VALIDATEDATA always interprets a match of
the rule as an error. Rules can, however, be formulated positively or
negatively, e.g. with respect to reference values like 95, 96, 97. With a
negative rule (the value is not one of the reference values), any other value
(i.e. a deviation, e.g. 91, 92, or 93) triggers the error message. With a
positive rule (the value is one of the reference values), matching a reference
value (e.g. 95, 96, or 97) triggers the error message. Deviation from a
negative rule and compliance with a positive rule thus equally result in the
code "1" (depending on the further settings, additionally 0 and Missing).
When defining rules, programming them, and interpreting the result, it is
essential to keep the difference between positive and negative rules in mind.
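This rule logic can be imitated outside VALIDATEDATA with plain COMPUTE statements. The variable name V and the flag names below are hypothetical; the reference values 95-97 are taken from the text above:

```spss
* Positive rule: matching a reference value flags the case (code 1).
COMPUTE flag_pos = ANY(v, 95, 96, 97).
* Negative rule: any deviation from the reference values flags the case (code 1).
COMPUTE flag_neg = NOT ANY(v, 95, 96, 97).
EXECUTE.
```

If V is missing, both flags are missing, which corresponds to the "1, 0 and Missing" coding mentioned above.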
Because of possible false positives, e.g. a feedback of indicated possible
errors should be interpreted with a certain caution at first. A high proportion
of missings does not necessarily have to be an error, but can be due to
technical reasons. In the case of online surveys, for example, questions that
are not applicable may have been skipped. Even seemingly absent error
messages should be interpreted rather cautiously, as there is the possibility of
false negatives. Such a result does not necessarily mean that the tested
variable does not contain any (further) errors, e.g. with differently (more
strictly) formulated test rules, settings (limit values) or criteria (e.g.
duplicates). Results of the menu item "Validation" resp. the SPSS procedure
VALIDATEDATA are therefore always to be seen in the context of the
examined variables, performed checks and specified criteria. This central
aspect will be pointed out explicitly in the description of the exemplary
performed checks.

9.1 Validate data: Basic Checks


Open the SPSS dataset „ Mouse [Link] “. Create a variable "ID", e.g.
via $CASENUM. Then follow the further instructions:
Path: Data → Validation → Validate Data…
"Variables" tab:
Drag the variable ID into the "Case Identifier Variables" area. Only if case
identifier variables are selected can they be subjected to validation checks
later.
Drag all remaining variables into the "Analysis Variables" area. Only if
analysis variables are selected can validation checks subsequently be
performed.
"Basic Checks" tab:
Use this tab to select basic checks for analysis variables, case identifiers and
specific cases. The preset checks can be (de)activated by ticking the
checkbox.
Analysis Variables:
To be able to perform checks on analysis variables, the box "Flag variables
that fail any of the following checks" must be ticked.
Maximum percentage of missing values (applies to all variables): Returns
analysis variables for which the percentage of missing values exceeds the
specified value (default: 70). The specified value must be a positive number
less than or equal to 100.
Maximum percentage of cases in a single category (applies to categorical
variables only): Returns categorical analysis variables where the percentage
of cases in a single non-missing category exceeds the specified value
(default: 95). The specified value must be a positive number less than or
equal to 100. The percentage value is based on cases of the variable without
missings.
Maximum percentage of categories with count of 1 (applies only to
categorical variables): Returns categorical analysis variables where the
percentage of categories of the variable containing only one case exceeds the
specified value (default: 90). The specified value must be a positive number
less than or equal to 100.
These two check criteria each represent a check for (near-)constants, i.e. for
individual cells with possibly more cases than expected, or for multiple cells
with fewer cases than expected. The maximum percentage of cases in a
single level of a categorical variable is in principle a screening for constants,
i.e. for extremely frequent single levels (possibly unexpectedly frequent
single cells). The maximum percentage of categories with a single case, in
comparison, represents the opposite screening for extremely rarely occurring
levels (possibly unexpectedly infrequent multiple cells).
Minimum coefficient of variation (applies to metric variables only): Returns
metric analysis variables where the absolute value of the coefficient of
variation is less than the specified value (default: 0.001). This option only
affects variables with a non-zero mean value. The specified value must be a
non-negative number; 0 disables the check of the variation coefficient.
Minimum standard deviation (applies to metric variables only): Returns
metric analysis variables whose standard deviation is less than the specified
value (default: 0). The specified value must be a non-negative number; 0
disables the check of the standard deviation.
In contrast to the standard deviation, the coefficient of variation (see also
7.2.1.) is a measure of the relative variability within a data range and thus
also a suitable measure for the direct comparison of two distributions. The
higher the coefficient of variation, the greater the dispersion. High values are
indications that the distribution is distorted by outliers (especially in
comparison with other measurement series). The same applies to the standard
deviation: the fewer extreme values occur in a dataset, the smaller the
standard deviation. A standard deviation cannot be assessed directly,
however, as it only reflects the absolute variability within one data range; for
a comparison, recourse to further information or transformations is necessary
(e.g. z-transformation).
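Since the basic checks only use the coefficient of variation as an internal threshold, it can be helpful to compute it explicitly. A sketch using AGGREGATE, assuming a recent SPSS version in which the BREAK subcommand may be omitted; the variable X is a placeholder:

```spss
* Sketch: compute the coefficient of variation (SD / mean) for a variable X.
* Without a BREAK subcommand, all cases form a single group.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /x_mean = MEAN(x)
  /x_sd   = SD(x).
COMPUTE x_cv = x_sd / x_mean.
EXECUTE.
```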
Case identifier:
Case identifiers are IDs or other variables which (possibly in combination)
allow a single case (row) to be identified unambiguously, e.g. name, place of
residence and street. If case identifiers contain missings, they are only of
limited use for the identification of cases.
Activate "Flag incomplete IDs" (default): This will output cases (rows) with
incomplete case identifiers. A case identifier is considered incomplete if a
value is empty or missing.
Activate "Flag duplicate IDs" (default): This will output cases (rows) with
duplicate case identifiers. Incomplete case identifiers are excluded from the
set of possible duplicates.
Empty cases:
Activate „Flag empty cases": With this option, those cases will be returned
where all relevant variables are empty or missing. To identify empty cases,
either all variables in the file (except ID variables) can be used or only the
analysis variables specified on the "Variables" tab. Select "All variables in
dataset except ID variables".
Start the data check with "OK".
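The check for empty cases can also be approximated directly in syntax. A sketch with hypothetical analysis variables v1 to v5:

```spss
* Sketch: flag cases in which all analysis variables v1 to v5 are missing.
COUNT nvalid = v1 TO v5 (LO THRU HI).
COMPUTE empty_case = (nvalid = 0).
EXECUTE.
```

COUNT with (LO THRU HI) counts the valid values per case, so empty_case is 1 exactly when all of the listed variables are missing.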
Syntax
GET
FILE='C:\Programme\SPSS\Mouse [Link]'.
compute ID=$casenum.
exe.
VARIABLE LEVEL
icrf hxm time icrfsq hmxsq icrfhilo (SCALE)
/ status hxmd icrfd (NOMINAL) .
* Validate data .
VALIDATEDATA
VARIABLES=icrf hxm status time hxmd
icrfd icrfsq hmxsq icrfhilo
ID=ID
/VARCHECKS STATUS=ON
PCTMISSING=20 PCTEQUAL=25
PCTUNEQUAL=10 CV=0.5 STDDEV=0.5
/IDCHECKS INCOMPLETE DUPLICATE
/CASECHECKS REPORTEMPTY=YES SCOPE=ALLVARS
/CASEREPORT DISPLAY=YES MINVIOLATIONS=1
CASELIMIT=FIRSTN(100).
Note: The VALIDATEDATA command initializes the SPSS procedure for
checking data.

After VARIABLES=, all analysis variables to be checked are specified.


Instead of the names of the individual variables, ALL can also be specified.
Rule result variables are ignored.
After ID= all ID variables to be checked or other case identifier variables are
specified. ID variables are used to designate the case-wise output of cases
(rows). Rule result variables are ignored. If two or more ID variables or case
identifier variables are specified, their combination is used as case identifier.
After VARCHECKS the individual settings for checking the analysis
variables are made. If no analysis variables are specified under
VARIABLES=, VARCHECKS will be ignored.
STATUS=ON causes SPSS to perform validation checks. With
STATUS=OFF further settings are ignored and no validation checks are
performed.
PCTMISSING=20 Maximum percentage of missing values (applies to all
variables): Reports analysis variables where the percentage of missing values
exceeds the specified value (default: 70). The specified value must be a
positive number less than or equal to 100.
PCTEQUAL=25 Maximum percentage of cases representing a single
category (applies to categorical variables only): Reports categorical analysis
variables where the percentage of cases in a single non-missing category
exceeds the specified value (default: 95). The specified value must be a
positive number less than or equal to 100. The percentage value is based on
cases of the variable without missings.
PCTUNEQUAL=10 Percentage of categories containing only one case in a
categorical variable (applies only to categorical variables): Reports
categorical analysis variables where the percentage of categories of the
variable containing only one case exceeds the specified value (default: 90).
The specified value must be a positive number less than or equal to 100.
CV=0.5 Minimum absolute coefficient of variation (applies to metric
variables only): Reports metric analysis variables where the absolute value of
the coefficient of variation is less than the specified value (default: 0.001).
This option only affects variables with a non-zero mean value. The specified
value must be a non-negative number; 0 disables the check of the coefficient
of variation.
STDDEV=0.5 Minimum standard deviation (applies to metric variables
only): Reports metric analysis variables whose standard deviation is less than
the specified value (default: 0). The specified value must be a non-negative
number; 0 disables the check of the standard deviation.
After IDCHECKS, settings are made for checking case identifiers or ID
variables. If no case identifiers are specified under ID=, IDCHECKS will be
ignored.
NONE No check is performed on case identifiers or ID variables.
INCOMPLETE (default) reports cases (rows) with incomplete case
identifiers. A case identifier is considered incomplete if a value is empty or
missing.
DUPLICATE (default) reports cases (rows) with duplicate case identifiers.
Incomplete case identifiers are excluded from the set of possible duplicates.
After CASECHECKS, settings are made for checking empty cases (rows) in
the active dataset.
REPORTEMPTY=YES (default) checks and flags all empty cases.
REPORTEMPTY=NO suppresses the check for empty cases. If
REPORTEMPTY=NO, SCOPE is also ignored.
SCOPE=ANALYSISVARS uses all specified analysis variables to check for
empty cases (default, if analysis variables are specified). SCOPE=ALLVARS
uses all variables to check for empty cases (default, if no analysis variables
are specified). When checking for empty cases, the SCOPE option
ALLVARS ignores ID variables, SPLIT FILE variables and rule result
variables. A case (row) is considered empty if all (analysis) variables used for
the check are empty or missing.
After CASEREPORT, settings can be made for the output of a report on rule
violations of individual cases (rows). CASEREPORT is only valid if settings
have been made on the "Single-Variables Rules" resp. "Cross-Variable
Rules" tabs and is used to display rule violations for single resp. multiple
variables. CASEREPORT is ignored if no settings have been made on these
tabs (as in the example above). CASEREPORT is described in more detail in
sections [Link].
Output:
Validate Data
The result of the Basic Checks is displayed after the heading "Validate Data".
Warnings
Some or all requested output is not displayed because all cases, variables, or data values passed the
requested checks.

This warning is actually a desirable result: it indicates that all or some of the
checked variables have passed all basic checks. In the example, this is the
variable ICRF, which does not appear in the "Variable Checks" table even
though it has been checked. However, this result does not mean for ICRF that
this variable does not contain (further) errors (e.g. duplicates).
Variable Checks
Categorical   Cases Missing > 20                hxmd
              Cases Constant > 25               status, hxmd, icrfd
Scale         Cases Missing > 20                icrfhilo
              Coefficient of Variation < 0.5    time
Each variable is reported with every check it fails.

Note: Not all applied checks are displayed. The standard deviation is not
displayed, nor is the maximum percentage of categories with count 1, even
though they were specified and applied. The reason is that all checked
variables comply with these two checking rules, among others. Only the test
rules that have been violated are listed, together with the variables that have
violated them.
The table "Variable Checks" shows which variables were subjected to the
checks (e.g. HXMD), which scale level SPSS assumes (e.g. categorical for
HXMD) and which other settings the basic checks contained. For example,
the "Cases Missing" check for the variable HXMD, which is interpreted as
categorical, was set in such a way that an error message is output if the
proportion of missings exceeds 20%.
This result thus indicates that the dataset contains potential errors. The
variable HXMD, for example, has a proportion of missing values above 20%.
The variable STATUS has a single category covering more than 25% of cases. ICRF is
not listed because this variable complies with the checking rules; however,
this does not mean that this variable is 100% error-free. ICRF, as well as the
other variables, may also contain types of errors other than those checked
(e.g. duplicates). The "Variable Checks" table must therefore always be
interpreted within the context of the variables examined, the checks made
and the criteria specified.

9.2 Loading and applying predefined


validation rules for single variables
SPSS provides access to numerous predefined validation rules, which must
first be loaded (at least on first mouse access) from the file "Predefined
Validation Rules SPSS [Link]" [even in v15; slightly different
names in later SPSS versions] in the SPSS installation directory. SPSS
distinguishes between so-called single-variable rules (rules to check single
variables) and cross-variable rules (rules to check several variables). With
cross-variable rules, several variables must meet certain conditions at the
same time in order not to trigger an error message. With single-variable rules,
only one variable has to fulfill certain conditions.
Under 9.2.1., the access by mouse is explained, the access by
VALIDATEDATA syntax is explained under 9.2.2. resp. 9.2.3. Creating and
executing your own rules for individual variables (single-variable rules, e.g.
for date variables) by syntax is introduced in 9.3. The creating and executing
of own rules for several variables (cross-variable rules, e.g. for date
variables) with VALIDATEDATA syntax is introduced in 9.4.
Note: The (accidental) access to other validation rules overwrites already
existing links between variables and rules. An obvious protection against this
is the creation of your own check program. The "Predefined Validation Rules
[Link]" until SPSS v22 did not contain any validation rules for date
variables, nor any predefined validation rules for multiple variables (cross-
variable rules).

9.2.1 By mouse
Open the SPSS dataset „ Mouse [Link] “. Create a variable "ID", e.g.
via $CASENUM. Then follow the further instructions:
Path: Data → Validation → Load Predefined Rules …
Just click on OK in the dialog box that appears.
SPSS then loads the predefined validation rules from the file "Predefined
Validation Rules SPSS [Link]". Alternatively, the Copy Data Properties
Wizard (Data → Copy Data Properties…) for copying data properties can
be used to load rules from any data file.
Path: Data → Validation → Validate Data …
SPSS may prompt you to choose whether to use the present measurement
level of your variables or to adjust it individually.
Tab „Variables“:
Drag the ID variable into the field “Case Identifier Variables”.
Drag all remaining variables into the field “Analysis Variables”.
Tab „Basic Checks“:
Do not perform basic checks on analysis variables. Uncheck the "Flag
variables that fail any of the following checks" field. Uncheck the "Flag
empty cases" field.
Tab „Single-Variable Rules“:
In the "Analysis Variables" field below left, click the variables HXM or
ICRF. The histogram resp. minimum and maximum are supporting
information to help you assign a suitable rule to the distribution shape of
HXM or ICRF (see below).
This information can even be dispensed with if the user has a codebook that
indicates which distributional properties the variables to be analyzed should
have. Under "Rules", the number of rules already applied to the variable in
question is displayed. If a variable still contains the value "0", no rule has yet
been assigned to this variable (possibly even accidentally). User- and system-
defined missing values are not included in the summaries.
Using the "Display" selection field below "Analysis Variables", the display
of variables can be limited to numeric, string or date variables, which
provides a much better overview and a noticeable simplification of work for
very large variable sets. Of course, this only works if a dataset contains
mixed variable types at all. The type of the variables in the dataset also has an
influence on the displayed rules. Since the dataset "Mouse [Link]"
contains only numerical variables, rules for string variables are automatically
not displayed in the selection window "Rules". – Until v22, SPSS does not
yet offer rules for date variables. However, later chapters will show how
users can program their own rules for date variables. Rules can therefore not
be displayed for two reasons: Either the rules are not (no longer) available in
the selection. Or rules are present, but data do not have the required
measurement level. – The same applies to "Display": Depending on the
display, only rules that are suitable for the selected analysis variables are
listed. For example, if "Numeric Variables" was selected, only rules for
numerical variables are displayed. If, for example, "String variables" was
selected, only rules for string variables are displayed in the selection window.
The output shows all existing rules, including those not requested. Such an
output seems to be rather confusing at first; therefore only the SPSS output is
presented before the VALIDATEDATA syntax is explained.
Example
Variable HXM: From a codebook or the information from the histogram,
as well as minimum and maximum, you assume that the variable HXM could
be a positive, metric, and possibly integer variable. Therefore, you do not
assign the rules "0,1 dichotomy" or "1,2 dichotomy" to this variable, but
instead "1 to 10 integer". As soon as you place a check mark in the "Apply"
column (to the left of the name of the validation rule), the number of rules
under "Rules" is increased by 1, in this case from 0 to 1.
Variable ICRF: From the codebook or the information from the
histogram, minimum and maximum, you assume that the variable ICRF
could be a positive, metric, and possibly integer variable. You therefore
assign the rule "Nonnegative integer" to this variable.
Variable STATUS: You assign the rule "0,1 dichotomy" to the variable
STATUS.
Variable TIME: You assign the "Nonnegative integer" rule to the TIME
variable.
Variables HXMD and ICRFD: You assign the "0,1 dichotomy" rule to these
variables. Variables ICRFSQ, HMXSQ and ICRFHILO: You assign the
"Nonnegative integer" rule to these variables.
Tab "Cross-Variable Rules":
We don’t apply cross-variable rules.
Tab "Output":
If validation rules have been assigned to one or more variables, the Output
tab can be used to request a report that logs validation rule violations by case,
analysis variable or validation rule.
Under "Casewise Report":
Tick "List validation rule violations by case".
"Minimum Number of Violations for a Case to be included": This specifies
the minimum number of validation rule violations for a case to be included in
the report. Enter "1".
"Maximum Number of Cases in Report": Enter "10". For very large data
volumes and frequent validation rule violations, the output may otherwise
become unmanageable. Setting these parameters to suitable values is
therefore recommended.
"Single-Variable Validation Rules":
Tick "Summarize violations by analysis variable". This will display all
violated validation rules per analysis variable, as well as the number of values
that violated the individual rules and the total number of rule violations (in
case of several rules).
Tick "Summarize violations by rule". This will output all analysis variables
per validation rule that violated the rule, as well as the number of invalid
values per variable and the total number of all rule violations (in case of
multiple variables).
Do not tick "Display descriptive statistics for analysis variables".
Do not tick "Move cases with validation rules violations to the top of the
active dataset".
Start the data check with "OK".

Output:
Validate Data
After the heading "Validate Data", the result of applying rules to individual
variables is displayed.
Single-Variable Rules
The "Rule Descriptions" table describes the rules that have been violated.
The table "Variable Summary" shows the variables that violated at least
one rule. The table "Case Report" lists the cases that violated at least one
rule.
Rule Descriptions
Rule Description
1 to 10 integer Type: Numeric
Domain: Range
Flag user-missing values: No
Flag system-missing values: No
Minimum: 1
Maximum: 10
Flag unlabeled values within range: No
Flag noninteger values within range: Yes
$[Link][4]: Rule
Nonnegative integer Type: Numeric
Domain: Range
Flag user-missing values: No
Flag system-missing values: No
Minimum: 0
Flag unlabeled values within range: No
Flag noninteger values within range: Yes
$[Link][6]: Rule
Rules violated at least once are displayed.

The "Rule Descriptions" table describes the rules that have been violated.
The advantage of this presentation is, of course, that you quickly get a first
overview of the quality of the examined data. The "Rule Descriptions" table
does not show the rules that were followed; the disadvantage of this form of
output is therefore that you do not see which rules were complied with.
Rules that have been followed, and thus are not displayed, do not necessarily
mean that the data are error-free (see above).
In the following, the properties of a rule are explained using the first rule,
$[Link][4], as an example; for details, please refer to the explanation of
the VALIDATEDATA syntax for rules under 9.2.3. Type: Numeric assigns
this rule only to numeric variables. Domain: Range checks the target values
in the form of a range. Flag user-missing values: No and Flag system-missing
values: No prevent user- and system-defined missings from being displayed
as invalid. Minimum: 1 and Maximum: 10 specify the extreme points of the
permissible range. Flag unlabeled values within range: No prevents values
without labels from being flagged as invalid. Flag noninteger values within
range: Yes causes non-integer values to be flagged as invalid.
$[Link][4]: Rule returns the name of this rule; on the left, the label is
indicated, "1 to 10 integer".
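To make the rule's semantics concrete, here is a minimal sketch in Python (not SPSS code; the function name is illustrative) of the check that "1 to 10 integer" performs. As in the rule definition, missing values are never flagged:

```python
def violates_1_to_10_integer(value):
    """Return True if a value violates the '1 to 10 integer' rule.

    Mirrors the rule settings: missing values (None) are not flagged,
    and non-integer values within the range are flagged as invalid.
    """
    if value is None:  # Flag user-/system-missing values: No
        return False
    return not (1 <= value <= 10 and float(value).is_integer())
```

For example, 250 and 2.5 are flagged as invalid, while 7 and a missing value pass.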

Variable Summary
Variable    Rule                    Number of Violations
icrf        Nonnegative integer     40
            Total                   40
hxm         1 to 10 integer         214
            Total                   214
icrfsq      Nonnegative integer     40
            Total                   40
hmxsq       Nonnegative integer     40
            Total                   40
icrfhilo    Nonnegative integer     40
            Total                   40

The table "Variable Summary" shows the variables that violated at least
one rule. In addition, it indicates which rule they violated and how often the
variable in question violated it. For example, the variable HXM violated the
rule "1 to 10 integer" (for the definition, see the "Rule Descriptions" table).
Under "Number of Violations", the number of violations of the respective
rule is shown: HXM, for example, violates the rule "1 to 10 integer" 214
times. If several checking rules have been applied to one variable, the sum of
the variable's violations against all applied rules is indicated under "Total".
The explanation of this finding for HXM is simple: Although the variable
HXM is an integer, the values lie exclusively outside the range from 1 to 10.
One could therefore also use such a result to counter-check the
appropriateness of the applied rule. Seen in this light, applying the rule "1 to
10 integer" to the variable HXM does not seem appropriate.
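The counting logic behind such a summary can be sketched outside SPSS as well. The following Python snippet (all names and data are illustrative, not taken from the Mouse dataset) tallies rule violations per variable, analogous to the "Variable Summary" table:

```python
def count_violations(values, check):
    """Count how many values of one variable violate a rule.

    values: list of values, None representing a missing value.
    check:  predicate returning True for a violating value.
    """
    return sum(1 for v in values if check(v))

# Illustrative rule: '1 to 10 integer' (missing values pass).
def violates_1_to_10_integer(v):
    return v is not None and not (1 <= v <= 10 and float(v).is_integer())

hxm = [250, 312, 7, None, 2.5]  # toy data
summary = {"hxm": count_violations(hxm, violates_1_to_10_integer)}
# summary is {"hxm": 3}: 250, 312 and 2.5 violate the rule
```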

Case Report(b)

Case    Single-Variable Validation Rule Violations(a)    Identifier: ID
1 1 to 10 integer (1) 1,00
2 1 to 10 integer (1) 2,00
3 1 to 10 integer (1) 3,00
4 1 to 10 integer (1) 4,00
5 1 to 10 integer (1) 5,00
6 1 to 10 integer (1) 6,00
7 1 to 10 integer (1) 7,00
8 1 to 10 integer (1) 8,00
9 1 to 10 integer (1) 9,00
10 1 to 10 integer (1) 10,00
a. The number of variables that violated the rule follows each rule.
b. There were more than 10 cases with rule violations. Only the first 10 are
displayed.

The table "Case Report" lists the cases that violated at least one rule,
together with the checking rule that the listed case violated. By "case", the
row number is meant; since an ID variable was specified, it is explicitly
output as a column with the heading "Identifier". After the respective
validation rule, the number of times the examined variables violated the
respective rule is indicated in brackets. For the first row, this means that in
case 1 the examined variables violated the rule "1 to 10 integer" only once.
From the table "Variable Summary", it is known that this can only be the
variable HXM. If several examined variables violated the rule "1 to 10
integer", the table would be less easy to interpret. Due to the settings made,
the "Case Report" table lists only the first ten cases that violated at least one
rule; thus, more than ten cases with rule violations have occurred.
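The selection logic of the casewise report, including the "minimum number of violations" and "maximum number of cases" parameters, can be sketched as follows (a Python illustration with made-up data and names, not SPSS code):

```python
def case_report(cases, checks, min_violations=1, max_cases=10):
    """Return (case_id, n_violations) rows, analogous to the Case Report.

    cases:  list of (case_id, record) pairs, record mapping variable -> value.
    checks: dict mapping variable -> predicate that flags an invalid value.
    Only cases with at least min_violations are listed, and the report
    is truncated after max_cases rows.
    """
    rows = []
    for case_id, record in cases:
        n = sum(1 for var, check in checks.items() if check(record.get(var)))
        if n >= min_violations:
            rows.append((case_id, n))
            if len(rows) == max_cases:
                break
    return rows

checks = {"hxm": lambda v: v is not None and not (1 <= v <= 10)}
cases = [(1, {"hxm": 250}), (2, {"hxm": 7}), (3, {"hxm": 312})]
# case_report(cases, checks) lists cases 1 and 3, each with one violation
```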

9.2.2 By syntax
For better clarity, the corresponding program sections are numbered; the
sections so designated are explained in section 9.2.3. For programming tips,
please refer to 9.3, 9.4, and 9.5.
Syntax
* (1) Open the SPSS dataset "Mouse [Link]". *.
GET
FILE='C:\Programs\SPSS\Mouse [Link]'.
* (2) Prepare further variables. *.
compute ID=$casenum.
exe.
VARIABLE LEVEL
icrf hxm time icrfsq hmxsq icrfhilo (SCALE)
/ status hxmd icrfd (NOMINAL) .
* (3) Load the checking rules from "Predefined Validation Rules SPSS [Link]". *.
* Process: Load predefined rules *.
APPLY DICTIONARY FROM
'C:\Programs\SPSS\Predefined Validation Rules SPSS [Link]'
/FILEINFO ATTRIBUTES=MERGE
/VARINFO.
* (4) Delete existing links between variables and rules. *.
VARIABLE ATTRIBUTE VARIABLES=ALL DELETE=$[Link].
* (5) Apply the checking rules assigned by mouse. *.
* Process: Validate data … *.
* (6) Delete existing validation rules for a variable. *.
DATAFILE ATTRIBUTE DELETE=$[Link].
* (7) Delete existing links between variables and rules. *.
VARIABLE ATTRIBUTE VARIABLES=ALL DELETE=$[Link].
* (8) Define validation rules for a variable (again). *.
* Displays all currently implemented rules ($[Link][1] to [24]). *.
DATAFILE ATTRIBUTE ATTRIBUTE=
$[Link][1]("Label='0,1 dichotomy', Type='Numeric',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No', List='0' '1' ")
$[Link][2]("Label='1,2 Dichotomy', Type='Numeric',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No', List='1' '2' ")
$[Link][3]("Label='1 to 5 integer', Type='Numeric',
    Domain='Range', Minimum='1', Maximum='5',
    FlagUserMissing='No', FlagSystemMissing='No',
    FlagBlank='No', FlagNoninteger='Yes',
    FlagUnlabeled='No' ")
$[Link][4]("Label='1 to 10 integer', Type='Numeric',
    Domain='Range', Minimum='1', Maximum='10',
    FlagUserMissing='No', FlagSystemMissing='No',
    FlagBlank='No', FlagNoninteger='Yes',
    FlagUnlabeled='No' ")
$[Link][5]("Label='Nonnegative number', Type='Numeric',
    Domain='Range', Minimum='0', Maximum='',
    FlagUserMissing='No', FlagSystemMissing='No',
    FlagBlank='No', FlagNoninteger='No',
    FlagUnlabeled='No' ")
$[Link][6]("Label='Nonnegative integer', Type='Numeric',
    Domain='Range', Minimum='0', Maximum='',
    FlagUserMissing='No', FlagSystemMissing='No',
    FlagBlank='No', FlagNoninteger='Yes',
    FlagUnlabeled='No' ")
$[Link][7]("Label='0 to 100 number', Type='Numeric',
    Domain='Range', Minimum='0', Maximum='100',
    FlagUserMissing='No', FlagSystemMissing='No',
    FlagBlank='No', FlagNoninteger='No',
    FlagUnlabeled='No' ")
$[Link][8]("Label='Flag system-missing values', Type='Numeric',
    Domain='Range', Minimum='', Maximum='',
    FlagUserMissing='No', FlagSystemMissing='Yes',
    FlagBlank='No', FlagNoninteger='No',
    FlagUnlabeled='No' ")
$[Link][9]("Label='Flag user-missing values', Type='Numeric',
    Domain='Range', Minimum='', Maximum='',
    FlagUserMissing='Yes', FlagSystemMissing='No',
    FlagBlank='No', FlagNoninteger='No',
    FlagUnlabeled='No' ")
$[Link][10]("Label='Flag missing values', Type='Numeric',
    Domain='Range', Minimum='', Maximum='',
    FlagUserMissing='Yes', FlagSystemMissing='Yes',
    FlagBlank='No', FlagNoninteger='No',
    FlagUnlabeled='No' ")
$[Link][11]("Label='Flag noninteger values', Type='Numeric',
    Domain='Range', Minimum='', Maximum='',
    FlagUserMissing='No', FlagSystemMissing='No',
    FlagBlank='No', FlagNoninteger='Yes',
    FlagUnlabeled='No' ")
$[Link][12]("Label='Flag unlabeled values', Type='Numeric',
    Domain='Range', Minimum='', Maximum='',
    FlagUserMissing='No', FlagSystemMissing='No',
    FlagBlank='No', FlagNoninteger='No',
    FlagUnlabeled='Yes' ")
$[Link][13]("Label='Sex (1 char.)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No', List='M' 'F' ")
$[Link][14]("Label='Sex (full)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No', List='Male' 'Female' ")
$[Link][15]("Label='Day of week (3 char.)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No', List='Mon' 'Tue' 'Wed' 'Thu' 'Fri' 'Sat' 'Sun' ")
$[Link][16]("Label='Day of week (full)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No',
    List='Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Saturday'
    'Sunday' ")
$[Link][17]("Label='Month (3 char.)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No',
    List='Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec' ")
$[Link][18]("Label='Month (full)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No',
    List='January' 'February' 'March' 'April' 'May' 'June' 'July' 'August'
    'September' 'October' 'November' 'December' ")
$[Link][19]("Label='U.S. states (2 char.)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No',
    List='AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'GA' 'HI' 'ID' 'IL' 'IN' 'IA'
    'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE' 'NV' 'NH' 'NJ'
    'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD' 'TN' 'TX' 'UT' 'VT'
    'VA' 'WA' 'WV' 'WI' 'WY' ")
$[Link][20]("Label='U.S. states (full)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No',
    List='Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
    'Connecticut' 'Delaware' 'Florida' 'Georgia'
    'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas'
    'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan'
    'Minnesota' 'Mississippi' 'Missouri' 'Montana'
    'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York'
    'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania'
    'Rhode Island' 'South Carolina'
    'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia'
    'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' ")
$[Link][21]("Label='Canadian provinces (2 char.)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No',
    List='AB' 'BC' 'MB' 'NB' 'NL' 'NT' 'NS' 'NU' 'ON' 'PE' 'QC' 'SK' 'YT' ")
$[Link][22]("Label='Canadian provinces (full)', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No',
    List='Alberta' 'British Columbia' 'Manitoba' 'New Brunswick'
    'Newfoundland and Labrador' 'Northwest Territories'
    'Nova Scotia' 'Nunavut' 'Ontario' 'Prince Edward Island'
    'Quebec' 'Saskatchewan' 'Yukon' ")
$[Link][23]("Label='UK post codes', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No',
    List='AB' 'AL' 'B' 'BA' 'BB' 'BD' 'BH' 'BL' 'BN' 'BR' 'BS' 'BT' 'CA' 'CB' 'CF'
    'CH' 'CM' 'CO' 'CR' 'CT' 'CV' 'CW' 'DA' 'DD' 'DE' 'DG' 'DH' 'DL' 'DN' 'DT'
    'DY' 'E' 'EC' 'EH' 'EN' 'EX' 'FK' 'FY' 'G' 'GL' 'GU' 'GY' 'HA' 'HD' 'HG' 'HP'
    'HR' 'HS' 'HU' 'HX' 'IG' 'IM' 'IP' 'IV' 'JE' 'KA' 'KT' 'KW' 'KY' 'L' 'LA' 'LD'
    'LE' 'LL' 'LN' 'LS' 'LU' 'M' 'ME' 'MK' 'ML' 'N' 'NE' 'NG' 'NN' 'NP' 'NR' 'NW'
    'OL' 'OX' 'PA' 'PE' 'PH' 'PL' 'PO' 'PR' 'RG' 'RH' 'RM' 'S' 'SA' 'SE' 'SG' 'SK'
    'SL' 'SM' 'SN' 'SO' 'SP' 'SR' 'SS' 'ST' 'SW' 'SY' 'TA' 'TD' 'TF' 'TN' 'TQ' 'TR'
    'TS' 'TW' 'UB' 'W' 'WA' 'WC' 'WD' 'WF' 'WN' 'WR' 'WS' 'WV' 'YO' 'ZE' ")
$[Link][24]("Label='UK social class designation', Type='String',
    Domain='List', FlagUserMissing='No',
    FlagSystemMissing='No', FlagBlank='No',
    CaseSensitive='No', List='A' 'B' 'C1' 'C2' 'D' 'E' ").
* (9) Define links between variables and rules (again); only the *.
* selected rules are updated with the corresponding analysis variables. *.
* Depending on the SPSS version, OutcomeVar may have a trailing underscore ["_"]. *.

VARIABLE ATTRIBUTE
VARIABLES=status
ATTRIBUTE=$[Link][1]("Rule='$[Link][1]',
OutcomeVar='@01dichotomy_status'")
/VARIABLES=icrfhilo
ATTRIBUTE=$[Link][1]("Rule='$[Link][6]',
OutcomeVar='Nonnegativeinteger_icrfhilo' ")
/VARIABLES=hxmd
ATTRIBUTE=$[Link][1]("Rule='$[Link][1]',
OutcomeVar='@01dichotomy_hxmd' ")
/VARIABLES=hxm
ATTRIBUTE=$[Link][1]("Rule='$[Link][4]',
OutcomeVar='@1to10integer_hxm' ")
/VARIABLES=hmxsq
ATTRIBUTE=$[Link][1]("Rule='$[Link][6]',
OutcomeVar='Nonnegativeinteger_hmxsq' ")
/VARIABLES=icrfsq
ATTRIBUTE=$[Link][1]("Rule='$[Link][6]',
OutcomeVar='Nonnegativeinteger_icrfsq' ")
/VARIABLES=time
ATTRIBUTE=$[Link][1]("Rule='$[Link][6]',
OutcomeVar='Nonnegativeinteger_time' ")
/VARIABLES=icrfd
ATTRIBUTE=$[Link][1]("Rule='$[Link][1]',
OutcomeVar='@01dichotomy_icrfd' ")
/VARIABLES=icrf
ATTRIBUTE=$[Link][1]("Rule='$[Link][6]',
OutcomeVar='Nonnegativeinteger_icrf' ") .
TEMPORARY.
* (10) 1 to 10 Integer. *.
COMPUTE @1to10integer_hxm =
NOT(VALUE(hxm)>=1 AND VALUE(hxm)<=10 AND
VALUE(hxm)=TRUNC(VALUE(hxm)) OR MISSING(hxm)).
* (11) 0,1 Dichotomy. *.
COMPUTE @01dichotomy_hxmd =
NOT(ANY(VALUE(hxmd),0,1) OR MISSING(hxmd)).
COMPUTE @01dichotomy_icrfd =
NOT(ANY(VALUE(icrfd),0,1) OR MISSING(icrfd)).
COMPUTE @01dichotomy_status =
NOT(ANY(VALUE(status),0,1) OR MISSING(status)).
* (12) Non-negative integer. *.
DO REPEAT
#OV=Nonnegativeinteger_hmxsq
Nonnegativeinteger_icrfhilo
Nonnegativeinteger_icrf
Nonnegativeinteger_time
Nonnegativeinteger_icrfsq
/#IV=hmxsq icrfhilo icrf time icrfsq.
COMPUTE #OV=NOT(VALUE(#IV)>=0 AND
VALUE(#IV)=TRUNC(VALUE(#IV)) OR MISSING(#IV)).
END REPEAT.
* (13) Flag result variables for rules in the SPSS data dictionary
accordingly *.
VARIABLE ATTRIBUTE
VARIABLES=@1to10integer_hxm TO
Nonnegativeinteger_icrfsq
ATTRIBUTE=$[Link]("Yes").
* (14) Validate data *.
VALIDATEDATA
VARIABLES=icrf hxm status time hxmd icrfd
icrfsq hmxsq icrfhilo
/VARCHECKS STATUS=OFF
/CASECHECKS REPORTEMPTY=NO
/CASEREPORT DISPLAY=YES MINVIOLATIONS=1
CASELIMIT=FIRSTN(10)
/RULESUMMARIES BYVARIABLE BYRULE.

9.2.3 Explanation of VALIDATEDATA syntax


Section (1):
Open the SPSS dataset "Mouse [Link]".
Section (2):
Prepare further variables.
An ID variable is created. Using VARIABLE LEVEL, the variables ICRF
etc. are defined as metric and the variables STATUS, HXMD and ICRFD as
nominal.
Section (3):
Load the checking rules from "Predefined Validation Rules SPSS [Link]".
Using APPLY DICTIONARY FROM, the validation rules from "Predefined
Validation Rules SPSS [Link]" in the SPSS installation directory are loaded
into the opened dataset "Mouse [Link]" [available even in v15; slightly
different file names in later SPSS versions]. FILEINFO
ATTRIBUTES=MERGE merges global attributes of the file definition with
those of the active dataset "Mouse [Link]". VARINFO applies attributes of
the variable definition (e.g. for strings) to matching (string) variables in the
active dataset.
Section (4):
Delete existing links between variables and rules.
The attribute list "$[Link]" is deleted for all variables
(VARIABLES=ALL) using VARIABLE ATTRIBUTE and the keyword
DELETE. "$[Link]" does not designate a single attribute, but an
attribute list, in this case an already possibly existing list of links of rules to
single variables. All attributes with the name part "$[Link]" are
deleted, e.g. $[Link][1], $[Link][2], etc. The beginning of the
name with a $ character shows that the attribute list is reserved for internal
SPSS use.
Section (5):
Apply the mouse-controlled applied checking rules.
Process: Validate data …
Section (6):
Delete existing validation rules for a variable.
DATAFILE ATTRIBUTE DELETE= deletes the $[Link] attribute list
which was previously assigned to the active dataset. $[Link] denotes a list
of single-variable rules. All attributes with the name part "$[Link]" are
deleted, e.g. $[Link][1], $[Link][2], etc. (cf. Section 8).
Section (7):
Delete existing links between variables and rules.
The attribute list "$[Link]" is deleted for all variables
(VARIABLES=ALL). SPSS issues this statement a second time (cf. Section 4).
Section (8):
Define validation rules for a variable (again).
Displays all currently implemented rules ($[Link][1] to [24]).
DATAFILE ATTRIBUTE ATTRIBUTE= is used to define the concrete rules
$[Link][1] to $[Link][24], assigned to the active dataset, as attribute
lists again. Identically named later attribute lists take over the elements of
earlier lists; the later lists therefore always include the elements of the earlier
ones. The checking rules are reproduced in detail so that users can judge
whether this method of data validation meets their requirements. In addition,
the rules reproduced in this way can stimulate the adaptation or development
of one's own checking rules.
Representative of all 24 rules, only four rules (attribute lists) are presented in
detail. For metric variables, the rules $[Link][1] and $[Link][3] are
introduced; for string variables, the rules $[Link][16] and
$[Link][18]. The file "Predefined Validation Rules SPSS [Link]" does
not contain validation rules for date variables.
Rule example: "0,1 Dichotomy"
The "0,1 Dichotomy" rule checks a numerical variable to see if it contains
only the values 0 and 1; any other value is output as an error.
$[Link][1]("Label='0,1 Dichotomy', Type='Numeric',
Domain='List', FlagUserMissing='No',
FlagSystemMissing='No', FlagBlank='No',
CaseSensitive='No',List='0' '1' ")
"$[Link][1]" is the attribute name for a first rule for single variables. The
concrete rule is specified in parentheses and within apostrophes. The
sequence of the attributes is irrelevant. Label= denotes the rule: '0,1
Dichotomy'. Type='Numeric' assigns this rule only to numerical variables.
Domain='List' allows the reference values for the check domain to be
specified in the form of a list, in the example as a listing of 0 and 1 after
List=. FlagUserMissing='No' prevents user-defined missings from being
displayed as invalid. FlagSystemMissing='No' prevents system-defined
missings from being displayed as invalid. FlagBlank='No' prevents missing
values of strings ('blanks') from being displayed as invalid.
CaseSensitive='No' means that the data check is not case-sensitive. After
List=, the reference values are listed (the reference values for rules for metric
variables are also put in quotes).
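The behavior of a Domain='List' rule, including the CaseSensitive option, can be sketched like this (an illustrative Python re-implementation, not SPSS code):

```python
def violates_list_rule(value, allowed, case_sensitive=False):
    """Domain='List' check: flag values outside the allowed list.

    Missing values (None) are never flagged, mirroring
    FlagUserMissing='No', FlagSystemMissing='No' and FlagBlank='No'.
    """
    if value is None:
        return False
    if isinstance(value, str) and not case_sensitive:
        return value.lower() not in {str(a).lower() for a in allowed}
    return value not in allowed

# '0,1 Dichotomy': any value other than 0 or 1 is flagged.
# With case_sensitive=False, 'monday' would match 'Monday' in a string list.
```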
Rule example: "1 to 5 integer"
The “1 to 5 integer” rule checks a numeric variable to see if it contains only
integer values between 1 and 5. Any deviation from this rule (non-integer
values or values smaller than 1 or larger than 5) is output as an error.
$[Link][3]("Label='1 to 5 integer', Type='Numeric',
Domain='Range', Minimum='1', Maximum='5',
FlagUserMissing='No', FlagSystemMissing='No',
FlagBlank='No', FlagNoninteger='Yes',
FlagUnlabeled='No' ")
"$[Link][3]" is the attribute name for the third single-variable rule.
Label= denotes the rule with '1 to 5 integer'. Type='Numeric' assigns this rule
only to numeric variables. Domain='Range' allows the target values to be
specified in the form of a range. Minimum= and Maximum= denote the
extreme points of the permissible range, in the example from 1 to 5, each
inclusive. FlagUserMissing='No' and FlagSystemMissing='No' prevent user-
and system-defined missings from being displayed as invalid.
FlagBlank='No' prevents missing values of strings ('blanks') from being
displayed as invalid. FlagNoninteger='Yes' causes non-integer values to be
flagged as invalid. FlagUnlabeled='No' causes values without labels not to be
flagged as invalid.
Rule example: "Day of week (full)"
The rule "Day of week (full)" checks a string variable to see if it contains
only weekdays in the form of "Monday", "Tuesday", etc.; any different
spelling is output as an error.
$[Link][16]("Label='Day of week (full)', Type='String',
Domain='List', FlagUserMissing='No',
FlagSystemMissing='No', FlagBlank='No',
CaseSensitive='No',
List='Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Saturday' 'Sunday'
")
"$[Link][16]" is the attribute name for the 16th rule for single
variables. Label= denotes the rule with 'Day of week (full)'. Type='String'
assigns this rule only to string variables. Domain='List' allows the reference
values for the check to be specified in the form of a list, in the example as a
listing of the individual weekdays. FlagUserMissing='No' and
FlagSystemMissing='No' prevent user- and system-defined missings from
being displayed as invalid. FlagBlank='No' prevents missing values for
strings ('blanks') from being displayed as invalid. CaseSensitive='No' means
that the data check is not case-sensitive. After List=, the reference values are
listed (for rules for strings, the reference values are put in quotation marks).
Rule example: "Month (full)"
The "Month (full)" rule checks a string variable to see if it contains only
month names in the form of "January", "February", etc.; any different
spelling is output as an error.
$[Link][18]("Label='Month (full)', Type='String',
Domain='List', FlagUserMissing='No',
FlagSystemMissing='No', FlagBlank='No',
CaseSensitive='No',
List='January' 'February' 'March' 'April' 'May' 'June' 'July'
'August' 'September' 'October' 'November' 'December' ")
"$[Link][18]" is the attribute name for the 18th rule for single
variables. Label= denotes the rule with 'Month (full)'. Type='String' assigns
this rule only to string variables. Domain='List' allows the reference values
for the check to be specified in the form of a list, in the example as a listing
of the month names. FlagUserMissing='No' and FlagSystemMissing='No'
prevent user- and system-defined missings from being displayed as invalid.
FlagBlank='No' prevents missing values of strings ('blanks') from being
displayed as invalid. CaseSensitive='No' means that the data check is not
case-sensitive. After List=, the reference values are listed.
Section (9):
Define links between variables and rules (again); only the selected rules are
updated with the corresponding analysis variables.
VARIABLE ATTRIBUTE is used to (re)define the links between variables
(VARIABLES=) and rules ("Rule='$[Link][n]"). In contrast to Section 8,
only the rules assigned to the respective analysis variables by mouse click
($[Link][1], $[Link][4] and $[Link][6]) are listed and updated.
The first section, up to TEMPORARY, assigns the rules $[Link][1] and
so on to the variables STATUS through ICRF. The sequence of rule
assignment by mouse clicks does not correspond to the sequence in the
syntax.
For example, the STATUS variable receives the $[Link][1] rule
assignment as an attribute (ATTRIBUTE=). The rule assignment contains
the assigned rule and the variable in which the check result is to be stored:
the attribute contains the rule ("Rule=") "$[Link][1]" in parentheses and
quotation marks, and the result variable (OutcomeVar=) after a comma.
Result variables (e.g. "@01dichotomy_status") are required because the
result of the rule application must be stored in them.
The link between a rule and respective further variables is set according to
the same scheme, with the difference that VARIABLES= is preceded by a
slash ("/"). After /VARIABLES=, e.g., the variable ICRFHILO is also
assigned the rule assignment "$[Link][1]" as attribute. The attribute
contains the rule "$[Link][6]" in brackets and quotation marks, and after
a comma the result variable ("Nonnegativeinteger_icrfhilo"). The links for
the other variables and rules are not explained further.
Because of TEMPORARY, all further calculations are temporary, i.e. the
variables created or modified by COMPUTE are also temporary.
Section (10):
1 to 10 integer.
This section describes the concrete application of rule "$[Link][4]" ("1
to 10 integer", see above) to the values of the HXM variable. COMPUTE
creates the result variable "@1to10integer_hxm"; the result of the HXM
check is stored in this variable. Formulated as a negative checking rule, the
value 1 is stored in @1to10integer_hxm for each invalid value (formulated as
a positive checking rule, the value 1 would be stored for each valid value).
The checking rule itself is self-explanatory and is not explained further.
Section (11):
0,1 Dichotomy.
This section describes the concrete application of rule "$[Link][1]" ("0,1
Dichotomy", see above); this rule is applied separately to the variables
HXMD, ICRFD and STATUS. COMPUTE stores the respective check result
in the result variables "@01dichotomy_hxmd", "@01dichotomy_icrfd" and
"@01dichotomy_status". Formulated as a negative checking rule, the value 1
is stored for each invalid value, and 0 or missing otherwise. Formulated as a
positive checking rule, the value 1 would be stored for each valid value, and
so on. The checking rule itself is self-explanatory and will not be explained
further.
Section (12):
Nonnegative integer.
This section describes the concrete application of rule "$[Link][6]"
("Nonnegative integer", see above). This rule is applied separately to the
variables HMXSQ, ICRFHILO, ICRF, TIME and ICRFSQ. Using a
combination of DO REPEAT and COMPUTE, the check result is stored in
the respective result variable "Nonnegativeinteger_hmxsq" etc. Formulated
as a negative checking rule, the value 1 is stored for each invalid value, and 0
or missing otherwise. Formulated as a positive checking rule, the value 1
would be stored for each valid value, and so on. The checking rule itself is
self-explanatory and will not be explained further.
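The effect of this DO REPEAT loop, applying the same negative checking rule to several variables and storing 0/1 flags in matching result variables, can be sketched in Python as follows (toy data; the variable names echo the example, everything else is illustrative):

```python
def violates_nonnegative_integer(value):
    """Negative checking rule: 1 flags an invalid value, 0 a valid one;
    missing values (None) are passed through unchanged."""
    if value is None:
        return None
    return int(not (value >= 0 and float(value).is_integer()))

# Apply the same rule to several variables, as DO REPEAT does in SPSS.
data = {"hmxsq": [4, -1, 2.5], "time": [10, 0, 3]}
flags = {f"Nonnegativeinteger_{var}": [violates_nonnegative_integer(v)
                                       for v in values]
         for var, values in data.items()}
# flags["Nonnegativeinteger_hmxsq"] is [0, 1, 1]
```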
Section (13):
Flag result variables for rules in the SPSS data dictionary accordingly.
Using the "$[Link]" attribute and the "Yes" characteristic,
VARIABLE ATTRIBUTE flags the created variables @1to10integer_hxm
(Section 10) to Nonnegativeinteger_icrfsq (Section 12) for SPSS especially
as result variables.
Section (14):
Validate data
The VALIDATEDATA command initializes the SPSS procedure for
checking data.
After VARIABLES=, all analysis variables to be checked are specified.
After VARCHECKS, STATUS=OFF causes SPSS not to perform basic
checks. After CASECHECKS, the setting REPORTEMPTY=NO suppresses
the check for empty cases. With REPORTEMPTY=NO, SCOPE is also
ignored.
CASEREPORT is used to report rule violations for single or multiple
variables by cases, analysis variables or validation rules. CASEREPORT
only works if settings have been made on the "Single-Variable Rules" or
"Cross-Variable Rules" tabs. CASEREPORT is ignored if no settings have
been made on these tabs (see 9.1.).
DISPLAY=YES causes violations of validation rules to be output per case
(default: YES). A case can be identified by a specified ID variable, but also
by a case number. With DISPLAY=NO, MINVIOLATIONS and
CASELIMIT are ignored. MINVIOLATIONS=1 specifies the minimum
number of violations for a case to be listed in the report (default: 1).
CASELIMIT=FIRSTN(10) specifies the maximum number of cases to
include in the report (default: 500; NONE disables any upper limit).
RULESUMMARIES is used to request summaries of violations of the
validation rules. BYVARIABLE requests an output broken down by
variables. The table "Variable Summary" shows the variables that violated at
least one rule. In addition, the table shows which rule they violated and how
often this variable violated the rule in question. BYRULE requests an output
broken down by rules. The table "Rule Summary" shows for the applied rules
how often the checked variables violated the respective rule.
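The BYRULE summary groups the same violation counts by rule rather than by variable; a minimal Python sketch of that grouping (illustrative names and data, not SPSS output):

```python
def rule_summary(data, assignments):
    """Group violation counts by rule, analogous to the 'Rule Summary' table.

    data:        dict mapping variable -> list of values (None = missing).
    assignments: list of (rule_label, variable, predicate) triples,
                 where the predicate flags an invalid value.
    """
    summary = {}
    for rule, var, check in assignments:
        n = sum(1 for v in data[var] if check(v))
        if n:
            summary.setdefault(rule, {})[var] = n
    return summary

def nonneg_int(v):
    # 'Nonnegative integer' logic: missing passes, negatives and
    # non-integers are flagged.
    return v is not None and not (v >= 0 and float(v).is_integer())

data = {"icrf": [3, -1, 2], "time": [5, 1.5, 0]}
result = rule_summary(data, [("Nonnegative integer", "icrf", nonneg_int),
                             ("Nonnegative integer", "time", nonneg_int)])
# result: {"Nonnegative integer": {"icrf": 1, "time": 1}}
```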
Note: For details please refer to the example in 9.1.

9.3 Creating and executing custom rules for single variables
SPSS also allows you to create and execute your own rules.
VALIDATEDATA distinguishes between so-called single-variable rules and
cross-variable rules. With cross-variable rules, several variables must comply
with certain conditions simultaneously in order not to trigger an error
message; with single-variable rules, only one variable has to fulfill certain
conditions. Section 9.3. introduces single-variable rules (including
two examples for date variables). Section 9.4. introduces rules for several
variables (e.g. for date variables). In these sections, only access via
VALIDATEDATA syntax is presented; however, it is explained which steps
of mouse (menu) operation the corresponding syntax sections correspond to.
Contents of Section 9.3:
- Creating and executing your own rules
- Among others, two examples for date variables
- Separate access of two rules to the same date variable (not: access of one rule to two different date variables; see 9.4.)
- Description compared to mouse access
- Tips for programming

GET FILE loads the SPSS dataset "Employee [Link]".

GET
FILE='C:\Programs\SPSS\Employee [Link]'.
Starting with DATAFILE ATTRIBUTE, the validation rules for the variables
are (re)defined. Only the attributes of the rules are listed; they are not yet
assigned to the variables to be checked. Under "Validate data…", the
validation rules are displayed in the "Single-Variable Rules" tab, but still
without a check mark in the "Rules" window on the right.
DATAFILE ATTRIBUTE ATTRIBUTE=
$[Link][1]("Label='Value list (99-95), List variant',
Type='Numeric', Domain='List',FlagUserMissing='No',
FlagSystemMissing='No', FlagBlank='No',
CaseSensitive='No', List='99' '98' '97' '96' '95' ")
$[Link][2]("Label='m,w Dichotomy', Type='String',
Domain='List',FlagUserMissing='No',
FlagSystemMissing='No',FlagBlank='No',
CaseSensitive='Yes',List='m' 'w' ")
$[Link][3]("Label='Checking a date', Type='Date',
Domain='List', Format='edate8',FlagUserMissing='Yes',
FlagSystemMissing='Yes',FlagBlank='No',
CaseSensitive='No', List='03.02.52' '15.04.47' '18.07.62' '1.11.63'
'16.09.42' '10.01.07' ")
$[Link][4]("Label='Checking a date 2', Type='Date',
Domain='List', Format='edate8',FlagUserMissing='Yes',
FlagSystemMissing='Yes',FlagBlank='No',
CaseSensitive='No', List='01.01.00' ") .
Starting with VARIABLE ATTRIBUTE, the validation rules are assigned
to the variables to be checked. Under "Validate data", the validation rules in
the "Single-Variable Rules" tab on the right now additionally appear with set
check marks at "Rules", and the selected variables on the left with the number
of assigned rules. For "Date of Birth" (BDATE) the value 2 is displayed
under "Rules", because two rules refer to this variable.
Tip: SPSS may respond to breaks within these program lines by not executing
the validation check. There are two remedies: either write the statements on a
single line in the syntax window, or pass the line break explicitly to SPSS
with "+".
VARIABLE ATTRIBUTE
VARIABLES=jobtime ATTRIBUTE=$[Link][1]
("Rule='$[Link][1]', OutcomeVar='Values_9995' ")
/VARIABLES=gender ATTRIBUTE=$[Link][1]
("Rule='$[Link][2]', OutcomeVar='m_w_Dicho' ")
/VARIABLES=bdate ATTRIBUTE=$[Link][1]
("Rule='$[Link][3]', OutcomeVar='Check_Date' ")
/VARIABLES=bdate ATTRIBUTE=$[Link][2]
("Rule='$[Link][4]', OutcomeVar='Check_Date2' ") .
TEMPORARY.
From COMPUTE on, the logic of the checking rules is spelled out as SPSS
actually applies it to the values in the active file. These statements let the
user judge whether, and to what extent, the data validation meets his or her
requirements. If a checking process is unreliable, or its results at least
doubtful, this is the first place to look.
Tip: With VALIDATEDATA, a validation rule issues an error message when
its condition is met: if a rule evaluates to true, the result is always to be
interpreted as an error. Rules can be formulated positively or negatively;
negatively formulated rules contain e.g. a NOT. In negative rule 1 (incl.
NOT), for example, deviations from the reference values 95 to 99 trigger
error messages; positive rule 3 (without NOT) triggers an error message
whenever a date matches one of the specified values. Whether a checking
rule is formulated positively or negatively does not change its logic in
principle, but it can significantly ease an efficient implementation in SPSS
syntax.
* Rule 1: 95 to 99 (integer). *.
compute Values_9995 = not(any(value(JOBTIME),95,96,97,98,99)
| missing(JOBTIME)).
* Rule 2: m/w-Dichotomy. *.
compute m_w_Dicho = not(any(lower(gender),'m','w')
| missing(gender) | gender='').
* Rule 3: Checking a date. *.
compute Check_Date = BDATE eq [Link](03,02,52) |
BDATE eq [Link](15,04,47) |
BDATE eq [Link](18,07,62) |
BDATE eq [Link](01,11,63) |
BDATE eq [Link](16,09,42) |
BDATE eq [Link](10,01,07) |
missing(BDATE).
* Rule 4: Checking a date (2). *.
compute Check_Date2 = BDATE gt [Link](01,01,65) .
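Rules 1 to 4 above can be sketched outside SPSS as plain Python predicates. This is an illustrative analogy, not SPSS code; the function names, the use of None for missings, and the century expansion of the two-digit years are assumptions:

```python
from datetime import date

# Each function returns True when a value VIOLATES its rule
# (mirroring VALIDATEDATA, where "rule true" means "error").

def rule1_values_9995(jobtime):
    # Negative rule: any value other than 95-99 (missing excluded) is an error.
    return not (jobtime in (95, 96, 97, 98, 99) or jobtime is None)

def rule2_m_w_dichotomy(gender):
    # Negative rule: anything but 'm'/'w' is an error; missing/blank excluded.
    return not (gender is None or gender == '' or gender.lower() in ('m', 'w'))

# Reference dates of rule 3, written as full dates (assumed century window).
REFERENCE_DATES = {date(1952, 2, 3), date(1947, 4, 15), date(1962, 7, 18),
                   date(1963, 11, 1), date(1942, 9, 16), date(2007, 1, 10)}

def rule3_check_date(bdate):
    # Positive rule: a match with the reference dates (or a missing) IS the error.
    return bdate is None or bdate in REFERENCE_DATES

def rule4_check_date2(bdate):
    # Positive rule: any birth date after 1 Jan 1965 is flagged.
    return bdate is not None and bdate > date(1965, 1, 1)

print(rule1_values_9995(42))               # True: 42 lies outside 95-99
print(rule2_m_w_dichotomy('f'))            # True: 'f' is not in {'m', 'w'}
print(rule3_check_date(date(1980, 5, 5)))  # False: no match, no error
```

Framing every rule so that true means "violation" keeps positively and negatively formulated rules directly comparable.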
Good programming practice assigns format, measurement level, and length
explicitly, especially to user-defined result variables.
FORMAT
Values_9995 m_w_Dicho
Check_Date Check_Date2 (F4.0).
VARIABLE WIDTH
Values_9995 m_w_Dicho Check_Date Check_Date2 (4).
VARIABLE LEVEL
Values_9995 m_w_Dicho
Check_Date Check_Date2 (nominal).
VARIABLE ATTRIBUTE is used to flag the result variables in the SPSS
data dictionary.
VARIABLE ATTRIBUTE
VARIABLES=Values_9995 m_w_Dicho Check_Date Check_Date2
ATTRIBUTE=$[Link]("Yes").

Finally, data validation is requested for basic and rule-based checks. The
section of the VALIDATEDATA syntax is not further explained.
VALIDATEDATA
VARIABLES=gender bdate jobtime
ID=id
/VARCHECKS STATUS=ON PCTMISSING=20 PCTEQUAL=75
PCTUNEQUAL=10
CV=0.5 STDDEV=0.5
/IDCHECKS INCOMPLETE DUPLICATE
/CASECHECKS REPORTEMPTY=YES SCOPE=ALLVARS
/CASEREPORT DISPLAY=YES MINVIOLATIONS=1
CASELIMIT=FIRSTN(10)
/RULESUMMARIES BYVARIABLE BYRULE.

Validate data
After the heading "Validate data" the result of the basic checks is displayed.
Warnings
Some or all requested output is not displayed because all cases, variables, or data values passed the
requested checks.

This warning is a desirable result. It means that all or some of the checked
variables passed all basic checks; in the example this is the variable
GENDER, which does not appear in the "Variable Checks" table although it
was checked. However, this does not mean that GENDER contains no
(further) errors (e.g. duplicates).
Variable Checks
Measure | Check | Variables
Scale | Coefficient of Variation < 0.5 | Date of Birth; Months since Hire
Each variable is reported with every check it fails.

Note: Not all applied checks are displayed. For example, the standard
deviation is not displayed although it was specified and applied. The reason is
that all variables comply with this checking rule, among others. Only the test
rules that have been violated are listed, together with the variables that have
violated them.
The table "Variable Checks" shows which variables were subjected to the
checks (e.g. "Date of Birth" and "Months since Hire"), which scale level
SPSS assumes (e.g. "Measure", i.e. Scale) and which other settings the basic
checks contained. The "Coefficient of Variation" check, for example, was set
so that an error message is output if the coefficient of variation is less than
0.5. This result therefore indicates that the dataset contains errors. For
example, the variables "Date of Birth" and "Months since Hire" each have a
variation coefficient below 0.5. The variable “Gender” is not listed because
this variable complies with the checking rules; however, this does not mean
that this variable is 100% error-free. “Gender” as well as the other variables
may also contain other error types than those tested (e.g. duplicates). The
"Variable Checks" table must always be interpreted within the context of the
variables being checked, the checks made and the criteria specified.
Single-Variable Rules
The "Rule Descriptions" table describes the rules that have been violated. The
table "Variable Summary" shows the variables that violated at least one rule.
The table "Rule Summary" shows for the applied rules how often the
variables violated the respective rule. The table "Case Report" lists the cases
that violated at least one rule.
Rule Descriptions
Rule Description
Value list (99-95), List variant Type: Numeric
Domain: List
Flag user-missing values: No
Flag system-missing values: No
List: 99; 98; 97; 96; 95
$[Link][1]: Rule
m,w Dichotomy Type: String
Domain: List
Flag user-missing values: No
Flag blank values: No
List: m; w
Case sensitive: Yes
$[Link][2]: Rule
Checking a date Type: Date
Domain: List
Flag user-missing values: Yes
Flag system-missing values: Yes
List: 03.02.52; 15.04.47; 18.07.62; 1.11.63;
16.09.42; 10.01.07
$[Link][3]: Rule
Checking a date 2 Type: Date
Domain: List
Flag user-missing values: Yes
Flag system-missing values: Yes
List: 01.01.00
$[Link][4]: Rule
The "Rule Descriptions" table describes the rules that have been violated,
but not the rules that have been satisfied. The drawback is that you cannot
see which rules were satisfied and thus which potential errors you were
implicitly willing to accept. That a rule was satisfied, and therefore not
displayed, does not necessarily mean that the data are free of errors. In the
following, the properties of a user-defined rule are explained using the
$[Link][3] rule as an example; for details please refer to the
explanation of the VALIDATEDATA syntax for rules under 9.2.3.

- Type: Date. This rule applies to date variables only.
- Domain: List. The target values are checked against a value list.
- Flag user- or system-missing values: Yes. User- or system-defined missings are flagged as invalid.
- List: the reference values. Any match with the reference values 03.02.52, 15.04.47, etc. triggers an error message.
- $[Link][3]: Rule. Returns the name of this rule; on the left, the label "Checking a date" is given.
Variable Summary
Variable | Rule | Number of Violations
Gender | m,w Dichotomy | 216
Gender | Total | 216
Date of Birth | Checking a date | 5
Date of Birth | Checking a date 2 | 135
Date of Birth | Total | 140
Months since Hire | Value list (99-95), List variant | 427
Months since Hire | Total | 427

The table "Variable Summary" lists the variables that violated at least one
rule. In addition, it specifies which rule they violated and how often each
variable violated the rule in question. For example, the variable "Date of
Birth" violated the rule "Checking a date" 5 times (for its definition see the
table "Rule Descriptions"). "Total" shows the sum of each variable's
violations against all rules applied to it. The result of the "Variable
Summary" table should always also be used to check the appropriateness of
the applied rule(s). The "m,w Dichotomy" turns out not to be appropriate:
the data are coded as "m"/"f", not as "m"/"w".
Rule Summary
Rule | Variable | Number of Violations
Value list (99-95), List variant | Months since Hire | 427
Value list (99-95), List variant | Total | 427
m,w Dichotomy | Gender | 216
m,w Dichotomy | Total | 216
Checking a date | Date of Birth | 5
Checking a date | Total | 5
Checking a date 2 | Date of Birth | 135
Checking a date 2 | Total | 135

The "Rule Summary" table shows for the applied rules how often the listed
variables violated the respective rule. For example, the variable "Date of
Birth" violated the rule "Checking a date 2" 135 times. The table "Rule
Summary" lists only the rules that were violated.
Case Report(b)
Case | Single-Variable Validation Rule Violations(a) | Identifier: Employee Code
1 | Checking a date (1) | 1
3 | m,w Dichotomy (1) | 3
4 | m,w Dichotomy (1); Checking a date (1) | 4
8 | m,w Dichotomy (1); Checking a date 2 (1) | 8
9 | m,w Dichotomy (1) | 9
10 | m,w Dichotomy (1) | 10
11 | m,w Dichotomy (1) | 11
12 | Checking a date 2 (1) | 12
14 | m,w Dichotomy (1) | 14
17 | Checking a date (1) | 17

a. The number of variables that violated the rule follows each rule.
b. There were more than 10 cases with rule violations. Only the first 10 are
displayed.

The "Case Report" table lists the cases that violated at least one rule and the
checking rule(s) that each listed case violated. Under "Identifier" the variable
"Employee Code" is specified as ID variable. After each checking rule, the
number of times the examined variables of that case violated the rule is given
in brackets. For Case 4, for example, this means that the examined variables
violated the rules "m,w Dichotomy" and "Checking a date" one time each.
The "Case Report" table lists only the first ten cases that violated at least one
rule; thus, more than ten cases with rule violations occurred.

9.4 Programming and executing rules for multiple variables
The file "Predefined Validation Rules [Link]", for example, contains neither
validation rules for date variables nor predefined validation rules for
multiple variables (cross-variable rules). Section 9.4. therefore introduces the
programming and execution of rules for multiple variables (e.g. for date
variables) using VALIDATEDATA syntax. With cross-variable rules,
several variables must comply with certain conditions at the same time in
order not to trigger an error message; with single-variable rules, only one
variable has to fulfill certain conditions (see 9.3.).
Contents of Section 9.4:
- Programming and executing your own rules
- Two examples for date variables, among others
- Access of a rule to a metric and a date variable
- Access of a rule to two different date variables simultaneously
- More tips for programming

GET FILE loads the SPSS dataset "Employee [Link]".

GET
FILE='C:\Programs\SPSS\Employee [Link]'.
VARIABLE ATTRIBUTE VARIABLES=ALL DELETE=$[Link].
DATAFILE ATTRIBUTE DELETE=$[Link].
compute MY_SYSDATE=[Link]($time).
formats MY_SYSDATE (edate8).
exe.
VARIABLE ATTRIBUTE and DATAFILE ATTRIBUTE delete existing
validation rules and their assignments to variables, respectively. For
validation purposes, the variable MY_SYSDATE is created; the current
system time is stored in MY_SYSDATE.
DATAFILE ATTRIBUTE ATTRIBUTE=
$[Link][1]("Label='Comparison with System Date',
OutcomeVar='Comparison', Expression='BDATE < MY_SYSDATE' ")
$[Link][2]("Label='Linking of Job Time and Birthdate',
OutcomeVar='Comparison2', Expression='JOBTIME >= 70 &
BDATE > 01.01.1965' ").
Starting with DATAFILE ATTRIBUTE, the validation rules for the variables
are (re)defined. With cross-variable rules, several specific variables must
meet predefined conditions in order not to trigger an error message. Because
cross-variable rules reference several variables at once, the VARIABLE
ATTRIBUTE section used to assign single-variable rules (9.3.) is omitted.
"$[Link][1]" denotes the attribute name of a first cross-variable rule.
The concrete rule is specified in parentheses, within quotation marks; the
sequence of the attributes is irrelevant. Label= denotes the rule "Comparison
with System Date". After OUTCOMEVAR=, the result variable to be created
(e.g. "Comparison") is specified; result variables are required because the
COMPUTE section stores the result of the rule application there.
EXPRESSION= specifies a label describing the logic of the validation rule,
but not the specific validation rule itself (see COMPUTE). If EXPRESSION
does not correctly reflect the checking rule, this inevitably leads to an
incorrect interpretation of the check result; incorrect test results may remain
undetected.
Tips: SPSS responds to identical labels by not performing the validation
check; the remedy is to specify distinct labels. Because EXPRESSION= is
essential for a correct interpretation of the data check, its agreement with the
corresponding COMPUTE statement must be ensured, by multiple checks if
necessary, especially if mouse users only have access to the front-end menu
and no possibility to check the correctness of the back-end validation
process.
TEMPORARY.
COMPUTE Comparison = BDATE < MY_SYSDATE.
COMPUTE Comparison2 = JOBTIME >= 70
& BDATE gt [Link](01,01,65) .
From COMPUTE on, the logic of the checking rules is spelled out as SPSS
actually applies it to the values in the active file. These statements let the
user judge whether, and to what extent, the data validation meets his or her
requirements. If a checking process is unreliable, or its results at least
doubtful, this is the first place to look.
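The two cross-variable rules can be sketched in Python as well. This is an illustration, not SPSS code; the function names and the use of None for missings are assumptions:

```python
from datetime import date

# A cross-variable rule fires (True = violation) only when all of its
# conditions hold for the same case at the same time.

def comparison(bdate, sysdate):
    # Rule 1: the birth date lies before the system date.
    return bdate is not None and bdate < sysdate

def comparison2(jobtime, bdate):
    # Rule 2: JOBTIME >= 70 AND birth date after 1 Jan 1965.
    return (jobtime is not None and bdate is not None
            and jobtime >= 70 and bdate > date(1965, 1, 1))

today = date(2024, 1, 1)  # stand-in for MY_SYSDATE
print(comparison(date(1960, 5, 5), today))  # True: 1960 precedes the system date
print(comparison2(80, date(1970, 3, 3)))    # True: both conditions hold
print(comparison2(60, date(1970, 3, 3)))    # False: JOBTIME below 70
```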
FORMAT Comparison Comparison2 (F4.0).
VARIABLE WIDTH Comparison Comparison2 (4).
VARIABLE LEVEL Comparison Comparison2 (nominal).
Good programming practice assigns format, measurement level, and length
explicitly, especially to user-defined result variables.
VARIABLE ATTRIBUTE
VARIABLES= Comparison Comparison2
ATTRIBUTE=$[Link]("Yes").
VARIABLE ATTRIBUTE is used to flag the result variables in the SPSS
data dictionary. Finally, data validation is requested for basic and rule-based
checks. The section of the VALIDATEDATA syntax is not further explained.
VALIDATEDATA
VARIABLES=MY_SYSDATE BDATE
ID=id
CROSSVARRULES=$[Link][1] $[Link][2]
/VARCHECKS STATUS=ON PCTMISSING=70 CV=0.001 STDDEV=0
/IDCHECKS INCOMPLETE DUPLICATE
/CASECHECKS REPORTEMPTY=YES SCOPE=ALLVARS
/CASEREPORT DISPLAY=YES MINVIOLATIONS=1
CASELIMIT=FIRSTN(10).

Validate Data
After the heading "Validate Data" the result of the basic checks is displayed
first.
Warnings
Some or all requested output is not displayed because all cases, variables, or data values passed the
requested checks.
This warning is a desirable result. It means that all or some of the checked
variables passed all basic checks; in the example this is the variable BDATE,
which does not appear in the "Variable Checks" table although it was
checked. However, this does not mean that BDATE contains no (further)
errors (e.g. duplicates).
Variable Checks
Measure | Check | Variables
Scale | Coefficient of Variation < 0.001 | MY_SYSDATE
Each variable is reported with every check it fails.

Note: Not all applied checks are displayed. The maximum proportion of
missing values, for example, is not shown although it was specified and
applied, because all variables comply with this checking rule. Only the
checking rules that have been violated are listed, together with the variables
that violated them.
The table "Variable Checks" shows which variables were subjected to the
checks (e.g. MY_SYSDATE), which scale level SPSS assumes (e.g. "Scale")
and which other settings the basic checks contained. The "Coefficient Of
Variation" check, for example, was set so that an error message is output if
the coefficient of variation is less than 0.001. This result therefore indicates
that the dataset contains errors. For example, the variable MY_SYSDATE
has a variation coefficient below 0.001; the explanation is simple:
MY_SYSDATE is a constant, therefore the coefficient of variation is zero.
The variable BDATE is not listed because this variable complies with the
checking rules; however, this does not mean that this variable is 100% error-
free. BDATE, as well as the other variables, may also contain types of errors
other than those checked (e.g. duplicates). The "Variable Checks" table must
therefore always be interpreted within the context of the variables examined,
the checks made and the criteria specified.

Identifier Checks

Cross-Variable Rules
Rule | Number of Violations | Rule Expression
Comparison with System Date | 473 | BDATE < MY_SYSDATE
Linking of Job Time and Birthdate | 96 | JOBTIME >= 70 & BDATE > 01.01.1965

Following the "Identifier Checks" header, the "Cross-Variable Rules" table
lists the cross-variable rules, the number of violations against them, and the
checking logic of the variables.
When interpreting the "Rule Expression" field, please note that this field
does not reflect the concrete checking rule itself (cf. COMPUTE), but only
the label for the checking rule. If the label does not correctly reflect the
validation rule, this inevitably results in incorrect checks or interpretations.

Case Report(a)
Case | Cross-Variable Validation Rule Violations | Identifier: Employee Code
1 | Comparison with System Date | 1
2 | Comparison with System Date | 2
3 | Comparison with System Date | 3
4 | Comparison with System Date | 4
5 | Comparison with System Date | 5
6 | Comparison with System Date | 6
7 | Comparison with System Date | 7
8 | Comparison with System Date; Linking of Job Time and Birthdate | 8
9 | Comparison with System Date | 9
10 | Comparison with System Date | 10
a. There were more than 10 cases with rule violations. Only the first 10 are displayed.

The table "Case Report" lists the cases that violated at least one cross-
variable rule and the checking rule that each listed case violated (in the
example only for several variables, the so-called cross-variable rules). Under
"Identifier" the variable "Employee Code" is specified as ID variable. After
cross-variable validation rules, the number of times the examined variables
violated the respective rule is not indicated in brackets. The first line can
therefore be understood to mean that Case 1 (Employee Code: 1) violated the
cross-variable rule "Comparison with System Date". Case 8 (Employee Code:
8) violated two cross-variable rules: "Comparison with System Date" and
"Linking of Job Time and Birthdate". The violation of the rule "Linking of
Job Time and Birthdate" can be understood as a deviation from the
specifications in JOBTIME and/or BDATE. Further exploration will provide
more precise information about the nature and type of the detected deviations.
The "Case Report" table lists only the first ten cases that violated at least one
cross-variable rule; thus, more than ten cases with rule violations have
occurred.

9.5 Further examples of check rules (uncommented)
The following sample programs have been tested, but are not explained
further. Knowledge of programming with VALIDATEDATA syntax is
assumed at this point. Some of the examples scan for German-language
terms.
Example for single-variable rule:
Function:
Reveal all entries that differ from the following spellings of German federal
states (negative rule): "Baden-Württemberg", "Bayern", "Berlin",
"Brandenburg", "Bremen", "Hamburg", "Hessen", "Mecklenburg-
Vorpommern", "Niedersachsen", "Nordrhein-Westfalen", "Rheinland-Pfalz",
"Saarland", "Sachsen", and "Schleswig-Holstein".
GET
FILE='C:...sav'.
VARIABLE ATTRIBUTE VARIABLES=ALL DELETE=$[Link].
DATAFILE ATTRIBUTE DELETE=$[Link].
DATAFILE ATTRIBUTE ATTRIBUTE=
$[Link][1]("Label='Deutsche Bundesländer', Type='String',
Domain='List', FlagUserMissing='No', FlagSystemMissing='No',
FlagBlank='No', CaseSensitive='Yes',
List='Baden-Württemberg' 'Bayern' 'Berlin' 'Brandenburg'
'Bremen' 'Hamburg' 'Hessen' 'Mecklenburg-Vorpommern'
'Niedersachsen' 'Nordrhein-Westfalen' 'Rheinland-Pfalz'
'Saarland' 'Sachsen' 'Schleswig-Holstein' ") .
VARIABLE ATTRIBUTE
VARIABLES=B_LAND
ATTRIBUTE=$[Link][1]("Rule='$[Link][1]',
OutcomeVar='B_LAENDER_' ") .
TEMPORARY.
* Rule 1: Federal States *.
compute B_LAENDER_ = not(any(B_LAND,'Baden-Württemberg',
'Bayern', 'Berlin', 'Brandenburg', 'Bremen',
'Hamburg', 'Hessen', 'Mecklenburg-Vorpommern',
'Niedersachsen', 'Nordrhein-Westfalen', 'Rheinland-Pfalz', 'Saarland',
'Sachsen', 'Schleswig-Holstein')
| missing(B_LAND) | B_LAND='').
FORMAT B_LAENDER_ (F4.0).
VARIABLE WIDTH B_LAENDER_ (4).
VARIABLE LEVEL B_LAENDER_ (nominal).
VARIABLE ATTRIBUTE
VARIABLES=B_LAENDER_
ATTRIBUTE=$[Link]("Yes").
VALIDATEDATA
VARIABLES=B_LAND
/VARCHECKS STATUS=ON PCTMISSING=20 PCTEQUAL=75
PCTUNEQUAL=10
/CASECHECKS REPORTEMPTY=YES SCOPE=ALLVARS
/CASEREPORT DISPLAY=YES MINVIOLATIONS=1
CASELIMIT=FIRSTN(10)
/RULESUMMARIES BYVARIABLE BYRULE.
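The membership logic of this negative rule can be sketched in Python. This is an illustration, not SPSS code; the function name is hypothetical, and the handling of missing and blank values is left to the rule's Flag… settings:

```python
# Accepted spellings of the German federal states from the rule's value list.
BUNDESLAENDER = {
    'Baden-Württemberg', 'Bayern', 'Berlin', 'Brandenburg', 'Bremen',
    'Hamburg', 'Hessen', 'Mecklenburg-Vorpommern', 'Niedersachsen',
    'Nordrhein-Westfalen', 'Rheinland-Pfalz', 'Saarland', 'Sachsen',
    'Schleswig-Holstein',
}

def flag_bundesland(value):
    # True = violation: a non-blank entry whose spelling is not on the list.
    # Comparison is case-sensitive and exact, like CaseSensitive='Yes'.
    return value is not None and value != '' and value not in BUNDESLAENDER

print(flag_bundesland('Bayern'))     # False: accepted spelling
print(flag_bundesland('bayern'))     # True: case-sensitive comparison fails
print(flag_bundesland('Thüringen'))  # True: not on the reference list
```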
Example for cross-variable rule:
Function:
Reveal all postal codes (PLZ) that do not belong to the corresponding federal
state "Bayern", "Berlin", or "Brandenburg" (negative rule).
GET
FILE='C:...sav'.
DATAFILE ATTRIBUTE ATTRIBUTE=
$[Link][1]("Label='Abgleich PLZ Bayern',
OutcomeVar='PLZ_Bay', Expression='PLZ Bayern' ")
$[Link][2]("Label='Abgleich PLZ Berlin',
OutcomeVar='PLZ_Ber', Expression='PLZ Berlin' ")
$[Link][3]("Label='Abgleich PLZ Brandenburg',
OutcomeVar='PLZ_Bran', Expression='PLZ Brandenburg' ").
TEMPORARY.
COMPUTE PLZ_Bay = B_LAND="Bayern" & not(any (PLZ_BL,
86169,95447,96450,91058,97437,95032,86899,81929,
82194,85737,80802,90491,83209,83026)).
COMPUTE PLZ_Ber = B_LAND="Berlin" & not(any (PLZ_BL,
14195,14089,10969,13439,10178,14163,12439,14165)).
COMPUTE PLZ_Bran = B_LAND="Brandenburg" & not(any (PLZ_BL,
3048,15236,14532,14478)).
* Explicitly assign format, measurement level, and length to result variables
*.
FORMAT PLZ_Bay PLZ_Ber PLZ_Bran (F4.0).
VARIABLE WIDTH PLZ_Bay PLZ_Ber PLZ_Bran (4).
VARIABLE LEVEL PLZ_Bay PLZ_Ber PLZ_Bran (nominal).
VARIABLE ATTRIBUTE
VARIABLES=PLZ_Bay PLZ_Ber PLZ_Bran
ATTRIBUTE=$[Link]("Yes").
VALIDATEDATA
VARIABLES=B_LAND
ID=id
CROSSVARRULES=$[Link][1] $[Link][2] $[Link][3]
/VARCHECKS STATUS=ON PCTMISSING=70
/IDCHECKS INCOMPLETE DUPLICATE
/CASECHECKS REPORTEMPTY=YES SCOPE=ALLVARS
/CASEREPORT DISPLAY=YES MINVIOLATIONS=1
CASELIMIT=FIRSTN(10).
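The state-to-postal-code logic of these cross-variable rules can be sketched in Python. This is an illustration, not SPSS code; the dictionary and function names are hypothetical:

```python
# Reference postal codes per federal state, from the COMPUTE statements above.
PLZ_BY_STATE = {
    'Bayern': {86169, 95447, 96450, 91058, 97437, 95032, 86899,
               81929, 82194, 85737, 80802, 90491, 83209, 83026},
    'Berlin': {14195, 14089, 10969, 13439, 10178, 14163, 12439, 14165},
    'Brandenburg': {3048, 15236, 14532, 14478},
}

def flag_plz(state, plz):
    # True = violation: the state is covered by a rule, but the PLZ
    # is not on that state's list. States without a rule are never flagged.
    return state in PLZ_BY_STATE and plz not in PLZ_BY_STATE[state]

print(flag_plz('Berlin', 10178))   # False: PLZ matches Berlin
print(flag_plz('Berlin', 80802))   # True: a Bavarian PLZ recorded under Berlin
print(flag_plz('Hamburg', 20095))  # False: no rule defined for Hamburg
```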
Examples for single characters or strings:
Function:
Reveal certain characters within a given string variable (e.g. STRING). Every
searched (and found) character (e.g. "1", "2", "3", "4", etc.) triggers an error
message (positive rule). When scanning for certain characters or strings, it is
helpful to first create auxiliary variables (e.g. ERROR_C1, etc.).
GET
FILE='C:...sav'.
COMPUTE ERROR_C1 = index(STRING,'0123456789',1) .
COMPUTE ERROR_C2 = index(STRING,'Begriff',7) .
COMPUTE ERROR_C3 = index(STRING,'.;,:',1) .
exe.
VARIABLE ATTRIBUTE VARIABLES=ALL DELETE=$[Link].
DATAFILE ATTRIBUTE DELETE=$[Link].
DATAFILE ATTRIBUTE ATTRIBUTE=
$[Link][1]("Label='Scan_1(Ziffern)', Type='Numeric',
Domain='List',FlagUserMissing='No',
FlagSystemMissing='No', FlagBlank='No',
CaseSensitive='No',List='0' ")
$[Link][2]("Label='Scan_2(Begriff)', Type='Numeric',
Domain='List',FlagUserMissing='No',
FlagSystemMissing='No', FlagBlank='No',
CaseSensitive='No',List='0' ")
$[Link][3]("Label='Scan_3(Interpunktion)', Type='Numeric',
Domain='List',FlagUserMissing='No',
FlagSystemMissing='No', FlagBlank='No',
CaseSensitive='No',List='0' ") .
VARIABLE ATTRIBUTE
VARIABLES=ERROR_C1 ATTRIBUTE=$[Link][1]
("Rule='$[Link][1]', OutcomeVar='ERROR_S1' ")
/VARIABLES=ERROR_C2 ATTRIBUTE=$[Link][1]
("Rule='$[Link][2]', OutcomeVar='ERROR_S2' ")
/VARIABLES=ERROR_C3 ATTRIBUTE=$[Link][1]
("Rule='$[Link][3]', OutcomeVar='ERROR_S3' ") .
TEMPORARY.
* Implementing the first scan *.
COMPUTE ERROR_S1=
NOT(ANY(VALUE(ERROR_C1),0) OR MISSING(ERROR_C1)).
* Implementing the second scan *.
COMPUTE ERROR_S2=
NOT(ANY(VALUE(ERROR_C2),0) OR MISSING(ERROR_C2)).
* Implementing the third scan *.
COMPUTE ERROR_S3=
NOT(ANY(VALUE(ERROR_C3),0) OR MISSING(ERROR_C3)).
FORMAT ERROR_S1 ERROR_S2 ERROR_S3 (F4.0).
VARIABLE WIDTH ERROR_S1 ERROR_S2 ERROR_S3 (4).
VARIABLE LEVEL ERROR_S1 ERROR_S2 ERROR_S3 (nominal).
VARIABLE ATTRIBUTE
VARIABLES=ERROR_S1 ERROR_S2 ERROR_S3
ATTRIBUTE=$[Link]("Yes").
VALIDATEDATA
VARIABLES=ERROR_C1 ERROR_C2 ERROR_C3
/VARCHECKS STATUS=ON PCTMISSING=20 PCTEQUAL=75
PCTUNEQUAL=10
/IDCHECKS INCOMPLETE DUPLICATE
/CASECHECKS REPORTEMPTY=YES SCOPE=ALLVARS
/CASEREPORT DISPLAY=YES MINVIOLATIONS=1
CASELIMIT=FIRSTN(10)
/RULESUMMARIES BYVARIABLE BYRULE.
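The three scans can be sketched in Python. This is an illustration, not SPSS code, and the helper names are hypothetical; like the SPSS INDEX function, each helper reports whether a searched character or substring occurs in the string:

```python
def scan_digits(s):
    # Scan 1: does the string contain any digit 0-9? (INDEX with divisor 1)
    return any(ch in s for ch in '0123456789')

def scan_term(s, term='Begriff'):
    # Scan 2: does the string contain the term? (INDEX with divisor 7
    # treats the 7-character needle as one chunk, i.e. a substring search)
    return term in s

def scan_punctuation(s):
    # Scan 3: does the string contain . ; , or : ?
    return any(ch in s for ch in '.;,:')

print(scan_digits('Abc123'))      # True: digits found
print(scan_term('Ein Begriff'))   # True: term found
print(scan_punctuation('Hallo'))  # False: no punctuation
```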

9.6 Checking of rules (conditions)


Checking rules (conditions) are used in various contexts, e.g. for data access,
for checking or replacing data, or for determining derived measures and
values (calculations). The reliability of checking rules is therefore of
fundamental importance, since further values, analyses, or results depend on
them in each case: a faulty validation rule leads to faulty values, analyses, or
results. The problem with dysfunctional checking rules and accesses (e.g.
SQL queries) is that the data themselves may be fine, while the data quality
problems are caused by the incorrect rules or accesses. Checking rules and
accesses (queries) should therefore be examined from several angles, ideally
before programming them in SPSS syntax. These angles include semantics,
logic, and implementation:

- Is the rule formulated as a positive or a negative checking rule?
- For complex checking rules: Are all logically realizable "true" and "false" events covered completely? Or only a justified subset of "relevant" events?
- For complex checking rules: Verify the logically correct rules against the complexity and dynamics of empirical reality. Is everything that is logically possible also empirically meaningful or probable?
- Which (SPSS) syntax or procedure allows the rules (conditions) to be programmed? How do the syntax options differ in detail? Here, among other things, the currency of the syntax (releases), the associated functional range and its changes, as well as the smooth interaction of the data with software and hardware of different manufacturers, systems, and versions must be taken into account.
- Last but not least: Is the programming correct? Is the checking rule (condition) e.g. negatively formulated as planned, and does it take into account the relevant, logically realizable "true" and "false" events, including a check for possible programming errors?
Checking rules or conditions should always be tested first on manageable test
data before being integrated into a workflow. Otherwise, incorrect results
caused by untested conditions are either not detected at all or are discovered
only by chance, because they happen to be implausible in a certain context.
An often overlooked source of error is simple programming mistakes, e.g.
the wrong use of EXE. after transformation commands such as LAG. The
two apparently identical programming variants of COMPUTE LAG_VAR
lead to different results because of the EXE. (see also Schendera, 2005,
144-146).
Example 1: Without EXE.
compute LAG_VAR = lag(VAR_1).
compute VAR_1 = VAR_1*2.

Example 2: With EXE.
compute LAG_VAR = lag(VAR_1).
exe.
compute VAR_1 = VAR_1*2.
In example 1, the transformed value of VAR_1 is used. In example 2, the
original value of VAR_1 is used.
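The effect of the EXE. can be mimicked in a small Python sketch (an analogy, not SPSS code; the list names are hypothetical). The loop imitates SPSS's case-by-case transformation pass:

```python
values = [1, 2, 3, 4]  # original values of VAR_1

# Example 2 (with EXE.): the lag is computed on the ORIGINAL values
# before the transformation doubles VAR_1.
lag_with_exe = [None] + values[:-1]
values_doubled = [v * 2 for v in values]

# Example 1 (without EXE.): both COMPUTEs run in one pass, case by case,
# so each lag sees the previous case AFTER it was already doubled.
lag_without_exe = []
transformed = []
for v in values:
    lag_without_exe.append(transformed[-1] if transformed else None)
    transformed.append(v * 2)

print(lag_with_exe)     # [None, 1, 2, 3]
print(lag_without_exe)  # [None, 2, 4, 6]
```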
Chapter 8 introduced the concept of plausibility and its initially simple check
on up to two variables; subchapter 8.2.3. on anomalies introduced a
multivariate check using the DETECTANOMALY procedure. Chapter 9
presented, among other things, validation checks with the screening
application VALIDATEDATA on up to two variables. VALIDATEDATA
poses at least one challenge in connection with complexity: depending on the
checking logic, programming via mouse or syntax access may be quite
complex and thus error-prone, especially with respect to translating a not
always simple checking logic into positively or negatively formulated
validation rules. Also, the programming environment of VALIDATEDATA
is currently not flexible enough to include alternative approaches (see
Chapter 10). For simple or one-time checks, the menu item "Validation" or
the SPSS procedure VALIDATEDATA may in individual cases be too
complex. In that case, approaches from other chapters can be used. SPSS
syntax allows far more flexible programming than the chapter on
VALIDATEDATA and the checking of at most two variables may suggest.

10 More Flexibility: Screening and more
Working basis: One dataset
Review and edit multiple values, rows and columns all at
once

What Chapter 10 and Chapters 8 and 9 have in common is that the operations
are performed on only one dataset (cf. Chapter 11 for working with several
datasets). However, Chapter 10 differs from Chapters 8 and 9 in several
ways.
Working within a dataset requires, with one exception, that all criteria
presented so far are met. The one permissible exception is that working
within a dataset can itself include checking steps. It is therefore necessary
that, at the latest after working within a dataset, all criteria presented so far
(including Chapter 12) have been checked and found to be in order. Under
special circumstances, comparisons with other datasets may become
necessary (cf. Chapter 11).
A first difference is the flexibility of the SPSS syntax. While in previous
chapters one piece of SPSS syntax was usually reserved for one application,
this chapter will show (rather incidentally, and perhaps a bit irritatingly at
first) how one and the same SPSS function can be used for several purposes.
For example, the LAG function can be used for "normal" counting of cases
or duplicates, for group-wise counting of cases, or even for row-wise
replacement of missing entries. On the one hand, flexibility means the
versatility of SPSS syntax that at first glance appears merely simple. On the
other hand, flexibility of course means that SPSS can be used for many more
purposes via syntax than e.g. pure mouse navigation would suggest. In
addition, other examples will show that SPSS often offers several syntax
routes to the same destination: if, for example, a number of cases is to be
counted, SPSS can do this with the functions CASENUM, MOD, or even
LAG. A third difference can be the level of difficulty of the programming
itself. The book was planned so that the chapters are arranged, as far as
possible, by degree of difficulty; towards the end of this chapter there are
some examples that users might find a bit more demanding than the previous
ones. Chapter 10 is structured so that it gradually, step by step, brings you up
to the level of the macro applications in Chapter 11.
With SPSS syntax it is possible to program far more flexibly than the
chapter about VALIDATEDATA and the checking of at most two variables may
suggest. With SPSS syntax much more is possible, for example:

Screening of rows (also group-wise etc.); for the numerous possibilities of
"counting through" a dataset see 10.1.
Screenings within a column (10.2.).
Differentiated screenings within several columns (10.3.), including, among
other things, the counting of certain values, strings or missings (10.3.1),
the analysis of the levels in several variables (10.3.2) or the analysis for
absolute match of several columns (10.3.3).
Column-wise and row-wise analysis of several numerical data simultaneously
(10.3.4.).
Operations on several variables or values simultaneously, e.g. the recoding
of values and missings in several variables (10.3.5), the uniform "filling"
of several data rows (LAG function) (10.3.6) or also the renaming of
numerous variable names (prefixes, suffixes) (10.3.7).

At this point, the possibilities of global index variables for the completeness
or plausibility (data quality) of entire data storages, which have been hinted at
in earlier chapters, should also be pointed out (cf. 3.4., 6.1.2.).

10.1 Counting and ID Variables: Options for "Counting" through a Dataset
To assess the quality of a dataset and its entries, it is essential to have an
overview of the completeness of variables (columns), rows (cases) and
values. Counting variables (counters) are extremely versatile. They not only
allow a quick overview of the completeness of a dataset or subsets of it
(rows, columns); because counters also support the user in identifying
duplicates and even in segmenting a dataset, they were given their own
section (10.1.) due to their versatile functionality and were not scattered
over the individual application areas (completeness, duplicates, etc.).
One of the best known counting variables is the so-called ID variable, which
serves to identify cases, rows or observations. But ID variables can also be
used profitably in other contexts.
Overview:

Creating a new ID variable: Assigning a counter (increment) per data row
Creating a new ID variable: Loop-like assignment of a number (segmentation)
Creating a new ID variable: Assigning a number per existing group
Working with an existing ID variable: Checking for duplicate values
Storing the number of rows per existing group as a constant

Creating a new ID variable: Assigning a counter (increment) per data row
The following, first approach assigns a unique number to each data row. The
$CASENUM approach counts the number of data rows by assigning each
row a value increasing by 1, starting from 1 and ending at the last row.
$CASENUM is therefore often used to add an ID variable to a dataset at a
later time.
compute ID=$CASENUM.
exe.
$CASENUM is a system variable which represents the respective internal
number of a dataset row. COMPUTE simply copies this number into the ID
variable. The SPSS Command Syntax Reference points out that the
$CASENUM value does not necessarily correspond to the row number of the
Data View. The fact that $CASENUM assigns a value does not mean that the
data rows contain other values at all. $CASENUM also assigns a value to
completely empty data rows.
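The same idea, stamping each row with a sequential number regardless of its content, can be sketched in Python (a hypothetical parallel for illustration only, not part of SPSS; the row structure is an assumption):

```python
# Assign a 1-based ID to every row, including completely empty ones,
# mirroring the behavior of $CASENUM.
rows = [
    {"value": 54},
    {},            # an empty row still receives an ID
    {"value": 65},
]

for i, row in enumerate(rows, start=1):
    row["ID"] = i

print([row["ID"] for row in rows])  # -> [1, 2, 3]
```

As with $CASENUM, the counter says nothing about whether a row contains any data at all.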
The next approach counts a dataset section-wise and assigns regular number
ranges, e.g. from 1 to 50. This approach assigns values only depending on a
row number. If, on the other hand, values are to be assigned depending on the
dataset content, the LAG approach presented afterwards should be used.
Create a new ID variable: Loop-like assignment of a number
(segmentation)
This second, extended $CASENUM variant counts the number of data rows
by assigning a value increasing by 1 to each row from the first to the 50th
row. This process is continued until the end of this dataset is reached.
Depending on how long the dataset is, the very last row does not necessarily
receive the value 50. This $CASENUM approach can be used, for example,
to regularly group (cluster) resp. regularly segment data ex post. In
conjunction with a random function, several randomly based drawings could
also be created in this way.
compute COUNT1_50 = (mod($CASENUM-1, 50) + 1) .
exe.
The repeatedly assigned values from 1 to 50 are stored in the variable
COUNT1_50. What happens in detail is a little more complicated. As in the
example above, SPSS assigns a unique value to each data row. From the
originally assigned $CASENUM values, the difference $CASENUM minus 1 is
divided by 50 (without this correction by 1, the values would start at 2).
The decisive factor here is the indivisible "remainder": MOD returns only
this remainder, so the results are values from 0 to 49. Adding 1 then shifts
the numbers so that they always run from 1 to 50. If only a certain subset
of data rows is to be selected, a scratch variable #FALL can be created and
included in a SELECT IF filter together with MOD.
compute #FALL=#FALL + 1.
select if (mod(#FALL, 7) =0 ).
exe.
This SELECT IF filter selects cases based on their sequence in the dataset.
If the divisor specified after #FALL in MOD equals 1, all cases are
selected; if it is 2, only every second case is retained; if 3, every third;
and so on.
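The two MOD-based techniques, segmenting rows into repeating blocks and keeping every n-th row, can be sketched in Python (an illustrative parallel under the assumption of a 120-row dataset; not SPSS code):

```python
# Segment rows into repeating blocks 1..50, like mod($CASENUM-1, 50) + 1,
# and keep every 7th row, like select if (mod(#FALL, 7) = 0).
n_rows = 120
block = [((i - 1) % 50) + 1 for i in range(1, n_rows + 1)]

every_seventh = [i for i in range(1, n_rows + 1) if i % 7 == 0]

print(block[0], block[49], block[50])   # -> 1 50 1
print(every_seventh[:3])                # -> [7, 14, 21]
```

The subtraction of 1 before the modulo and the addition of 1 afterwards reproduce exactly the shift discussed above: without them the blocks would run from 0 to 49 or start at 2.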
Creating a new ID variable: Assign a number per existing group –
Work with an existing ID variable: Check for duplicate values
The third approach assigns values depending on the dataset content. In
contrast to the above approach, which assigned ranges of values
independently of the dataset content (i.e. also for empty datasets) and only
depending on a row number, this approach assigns values depending on the
levels of a variable, for example.
data list list
 / GROUP NUMERICS .
begin data
1 54
1 65
1 53
1 64
2 65
2 12
2 63
end data.
exe.

* (a) Application as grouping increment with respect to GROUP *.
sort cases by GROUP.
compute NUMBER=1.
if (GROUP=lag(GROUP)) NUMBER=lag(NUMBER)+1.
exe.
list var=GROUP NUMERICS NUMBER .

* (b) Application as test program regarding NUMERICS *.
sort cases by NUMERICS.
compute NUMBER_N=1.
if (NUMERICS=lag(NUMERICS)) NUMBER_N=lag(NUMBER_N)+1.
exe.
list var=GROUP NUMERICS NUMBER_N.

(a) Application as grouping increment:

GROUP NUMERICS NUMBER
1,0   54,0    1,0
1,0   65,0    2,0
1,0   53,0    3,0
1,0   64,0    4,0
2,0   65,0    1,0
2,0   12,0    2,0
2,0   63,0    3,0
Number of cases read: 7 Number of cases listed: 7

(b) Application as test program:

GROUP NUMERICS NUMBER_N
2,0   12,0    1,0
1,0   53,0    1,0
1,0   54,0    1,0
2,0   63,0    1,0
1,0   64,0    1,0
1,0   65,0    1,0
2,0   65,0    2,0
Number of cases read: 7 Number of cases listed: 7
If a dataset contains a grouped variable GROUP, then all cases within the
respective levels are numbered consecutively starting with 1 until the last
case within such a level is reached (cf. the result NUMBER for GROUP in
approach (a)). This approach only makes sense if a variable actually has
several items with the same level; if not, only the value 1 would be
assigned each time. As a second application (b), this approach can be used
to check whether there are any duplicates in an at least interval-scaled
variable. Each value unequal to 1 is an indication of multiply occurring
values (cf. the result NUMBER_N for NUMERICS in approach (b)).
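The LAG-based logic of both applications can be sketched in Python on the same sample data (an illustrative parallel only; the tuple representation of the dataset is an assumption):

```python
# Group-wise running count (approach a) and duplicate check (approach b),
# mirroring the LAG-based SPSS logic on the sample data.
data = [(1, 54), (1, 65), (1, 53), (1, 64), (2, 65), (2, 12), (2, 63)]

# (a) number cases within each GROUP (data assumed sorted by GROUP)
number = []
for i, (group, _) in enumerate(data):
    if i > 0 and group == data[i - 1][0]:
        number.append(number[-1] + 1)   # same group as previous row: count up
    else:
        number.append(1)                # new group: restart at 1

# (b) duplicate check: sort by NUMERICS and count repeated values
by_value = sorted(data, key=lambda row: row[1])
number_n = []
for i, (_, value) in enumerate(by_value):
    if i > 0 and value == by_value[i - 1][1]:
        number_n.append(number_n[-1] + 1)
    else:
        number_n.append(1)

has_duplicates = any(n > 1 for n in number_n)  # 65 occurs twice
print(number)    # -> [1, 2, 3, 4, 1, 2, 3]
print(number_n)  # -> [1, 1, 1, 1, 1, 1, 2]
```

As in SPSS, comparing each row with its predecessor only works on sorted data, which is why the sketch sorts before counting.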
Storing the number of rows per existing group as a constant
The final approach creates a grouped counting variable and, within it, a
constant representing the number of cases (rows).
data list list
 / GROUP NUMERICS .
begin data
1 54
1 65
1 53
1 64
2 65
2 12
2 63
end data.
exe.

sort cases by GROUP.
compute NUMBER=1.
if (GROUP=lag(GROUP)) NUMBER=lag(NUMBER)+1.
exe.
aggregate
 /outfile='C:\[Link]'
 /presorted
 /break=GROUP
 /CONSTANT=n.
match files
 /file=*
 /table='C:\[Link]'
 /by GROUP.
exe.
list var= GROUP NUMBER CONSTANT.

GROUP NUMBER CONSTANT


1,0 1,0 4
1,0 2,0 4
1,0 3,0 4
1,0 4,0 4
2,0 1,0 3
2,0 2,0 3
2,0 3,0 3
Number of cases read: 7 Number of cases listed: 7
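The effect of the AGGREGATE/MATCH FILES combination, attaching the group size to every row of its group, can be sketched in Python (an illustrative parallel, not SPSS; the data representation is an assumption):

```python
# Store the number of rows per group as a constant on every row,
# mirroring AGGREGATE /BREAK=GROUP /CONSTANT=n followed by MATCH FILES.
from collections import Counter

data = [(1, 54), (1, 65), (1, 53), (1, 64), (2, 65), (2, 12), (2, 63)]
sizes = Counter(group for group, _ in data)      # {1: 4, 2: 3}
constant = [sizes[group] for group, _ in data]
print(constant)  # -> [4, 4, 4, 4, 3, 3, 3]
```

The Counter plays the role of the aggregated table file, and the list comprehension plays the role of the table lookup in MATCH FILES.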

10.2 Screenings within a column (variable)


A common feature of the screening approaches described below is that they
serve to test a single variable. The approaches differ, among other things,
according to the type of variable (numeric, string), what the variable in
question is to be checked for (uniqueness of values, certain strings, missings,
etc.), special features of the test procedure itself, and how the test result is to
be output (per row or also aggregated). First, approaches for numerical
variables are presented, then for strings and then for missings, among others.
Overview:

Overview of the levels of a variable
Counting and consecutively numbering the number of different elements within
a numerical variable
Counting the frequency of an element within a numeric variable
Aggregated output: One frequency value for each element level
Counting the number of different elements within a string variable
Counting individual characters within a string variable
Counting multi-figure character strings within a string variable
Number of row-wise elements within a string variable
Counting specific strings within a numeric variable
Counting missings within a variable (numeric, string)
Counting the different elements within a string variable (row- and
column-wise)

Overview of the levels of a variable


Counting method and output:
Result: One (aggregating) value per n levels of a variable.
– Counter value for each individual element level
– Same level = same output row
– Number of data rows (elements) not changed (frequency output only)
– Value = frequency, not a counter (see below)

Data:

data list
 / VALUES (f8).
begin data
11111122
11111122
22223333
22223333
33112233
11223311
22222233
end data.
exe.
save outfile = "C:\[Link]".

Program:

FREQUENCIES
 VARIABLES=VALUES
 /BARCHART FREQ
 /ORDER=ANALYSIS .

Output:

VALUES
                 Frequency  Percent  Valid Percent  Cumulative Percent
Valid 11111122       2       28,6        28,6            28,6
      11223311       1       14,3        14,3            42,9
      22222233       1       14,3        14,3            57,1
      22223333       2       28,6        28,6            85,7
      33112233       1       14,3        14,3           100,0
      Total          7      100,0       100,0
For each characteristic of the VALUES variable, the frequency of its
occurrence is stored as the result in the table. Since, for example, the value
"11111122" appeared twice in the source list, the table shows the frequency 2
behind "11111122". This sample data can now be analyzed in a variety of
ways using different approaches (e.g. COMPUTE/lag function,
AGGREGATE). Compared to the FREQUENCIES example, the following
approaches have one thing in common despite all their differences: They
modify the analysis dataset.
The FREQUENCIES approach only describes the content of a dataset without
changing it. All other approaches modify their data basis, e.g. by adding a
counting variable or changing the number of remaining data rows (e.g. via
aggregation). The example "Aggregated output: One frequency value for each
element level" (see below, AGGREGATE example) appears to produce exactly the
same output as the FREQUENCIES table shown above. In fact, there are two
major differences in the underlying dataset: the output dataset contains
only five instead of seven rows, and it now additionally contains the count
variable NUMBER, which holds the frequency of multiply occurring rows.
The following examples will show that counting and outputting elements
within a variable is by no means as trivial as it may appear on the surface.
The dataset "[Link]" read in at the beginning is now examined in a
variety of ways. To make it easier for you to recognize the differences, the
counting method and output are preceding the results of the respective
application in keywords as a short description (see also the FREQUENCIES
example, see above).
Counting and consecutively numbering the number of different elements
within a numerical variable
Counting method and output:
– counter value for each individual element
– same level=same counter value
– number of data rows (elements) unchanged
– value only counter within dataset
– not frequency

get file="C:\[Link]".
compute COUNTER=0.
exe.
sort cases by VALUES.
compute COUNTER=COUNTER+1.
if (VALUES=lag(VALUES)) COUNTER=lag(COUNTER).
if (VALUES ne lag(VALUES)) COUNTER=lag(COUNTER)+1.
exe.
list.

VALUES   COUNTER
11111122  1,00
11111122  1,00
11223311  2,00
22222233  3,00
22223333  4,00
22223333  4,00
33112233  5,00
Number of cases read: 7 Number of cases listed: 7

Counting the frequency of an element within a numeric variable


Counting method and output:
– counter value for each individual element
– counting method: from 1 for each element
– same level=same counter value
– number of data rows (elements) unchanged
– value only counter within level
– not frequency (except for N=1)
get file="C:\[Link]".
sort cases by VALUES.
compute NUMBER=1.
if (VALUES=lag(VALUES)) NUMBER=lag(NUMBER)+1.
exe.
list.

VALUES   NUMBER
11111122  1,00
11111122  2,00
11223311  1,00
22222233  1,00
22223333  1,00
22223333  2,00
33112233  1,00
Number of cases read: 7 Number of cases listed: 7

Aggregated output: One frequency value for each element level


Counting method and output:
– counter value for each individual element
– each characteristic only once
– number of data rows (elements) changed (aggregated output)
– Value= frequency, no counter (see above)

get file="C:\[Link]".
AGGREGATE
 /OUTFILE='C:\[Link]'
 /BREAK=VALUES
 /NUMBER=N.
get file = 'C:\[Link]'.
list.

VALUES   NUMBER
11111122   2
11223311   1
22222233   1
22223333   2
33112233   1
Number of cases read: 5 Number of cases listed: 5
Counting the number of different elements within a string variable
Counting method and output:
Result: One value per variable
The following approach counts the present number of different values of a
variable. This approach is useful e.g., for counting values for control
purposes (e.g., if the number of expected levels is known and thus must not
exceed a certain number), as well as for estimating the number of rows and
columns in crosstabs, etc.
The approach works for numeric as well as string variables (see Schendera,
2005).
get file="C:\[Link]".
aggregate outfile=*
 /break=VALUES
 /freq=NU.
compute X=1.
compute Y=1.
exe.
aggregate outfile=*
 /break=X
 /NUMBER=sum(Y).
exe.
list variables=NUMBER.

NUMBER
 5,00
Number of cases read: 1 Number of cases listed: 1
The value "5" is written into the SPSS output window. Thus, the variable
VALUES has five different levels.
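The same question, how many distinct levels a variable has, reduces to a one-liner in Python (an illustrative parallel to the double-AGGREGATE approach; the list of values is the sample data from above):

```python
# Count the number of distinct levels of a variable, mirroring the result
# of the double AGGREGATE (first collapse to one row per level, then count).
values = ["11111122", "11111122", "22223333", "22223333",
          "33112233", "11223311", "22222233"]

n_levels = len(set(values))
print(n_levels)  # -> 5
```

The set corresponds to the first aggregation (one row per level); its length corresponds to the second aggregation (counting those rows).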
Counting individual characters within a string variable
The following approach counts certain one-figure characters within a string
variable.
data list
 / STRING 1-11 (a).
begin data
AA AA AA BB
BB BB CC CC
CC AA BB CC
AA BB CC AA
BB BB BB CC
end data.
exe.

compute Freq_A = 0.
loop #i=1 to 11.
compute Freq_A = Freq_A + index(substr(STRING,#i,1),'A').
end loop.
exe.
list var= Freq_A.
The variable FREQ_A shows how often the searched characters occur in the
examined variable.
Freq_A
6
0
2
4
0

Number of cases read: 5 Number of cases listed: 5


The created counting variable FREQ_A is first set to the initial value 0.
The repeated loop scans the string with SUBSTR: it starts at the first
position of the string variable, moves on by 1 to the next position, and so
on. Each time the desired character (e.g., "A") is found, the counter
increases by 1; if the desired character is not found, FREQ_A is increased
by zero, i.e., it remains unchanged. The number of characters found can
therefore not exceed the maximum length of the scanned string (STRING),
e.g., 11 in the example (if "A" occurred at all 11 positions).
When programming, make sure that LOOP #i=... is matched to the length of
the variable to be checked. If a loop value is specified after TO that is greater
than the length of the string (e.g. 15 instead of 11), this leads to an abort and
an error message. If a loop value is specified after TO which is smaller than
the length of the string (e.g. 5 instead of 11), the string is not completely
scanned, but only up to the fifth position (provided the loop starts at the first
position).
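The position-by-position scan of the SUBSTR loop corresponds to a simple character count in Python (an illustrative parallel only; `str.count` performs the scan internally, so no explicit loop bound has to be matched to the string length):

```python
# Count occurrences of a single character per row, mirroring the
# SUBSTR/INDEX loop over positions 1..11 of each string.
rows = ["AA AA AA BB", "BB BB CC CC", "CC AA BB CC",
        "AA BB CC AA", "BB BB BB CC"]

freq_a = [row.count("A") for row in rows]
print(freq_a)  # -> [6, 0, 2, 4, 0]
```

The results match the FREQ_A column of the SPSS output above.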
Counting multi-figure character strings within a string variable
The following approach counts certain multi-figure strings within a string
variable.
data list
 / STRING 1-12 (a).
begin data
AAAAAABBAAAA
BBBBCCCCAAAA
AAAACCAABBCC
AABBCCAAAAAA
AAAABBBBBBCC
end data.
exe.

compute FREQ_AA= 0.
loop #i=1 to 12 by 2.
compute FREQ_AA = FREQ_AA + INDEX(SUBSTR(STRING,#i,2),'AA').
end loop.
exe.
list.

The variable FREQ_AA shows how often the searched strings occur in the
searched string variable STRING.
STRING FREQ_AA
AAAAAABBAAAA 5,0
BBBBCCCCAAAA 2,0
AAAACCAABBCC 3,0
AABBCCAAAAAA 4,0
AAAABBBBBBCC 2,0
Number of cases read: 5 Number of cases listed: 5
The difference to searching for single-figure strings is the BY option in the
LOOP statement. When programming, make sure to add a BY to the LOOP
statement with the length of the multi-figure string. Thus, if a two-figure
string is to be searched, BY 2 is specified; if a three-figure string is to be
searched, BY 3 is specified, and so on. The next special feature derives from
the search behavior for multi-figure strings. Here, too, the loop starts at the
first position of the string variable, but then moves on by 2 to the next
position and on in steps of two, and so on, until the last character. With even
strings (as in the example above) no remainder is left. The string to be
searched (e.g. 12) corresponds to the multiple of the searched multi-figure
string.
However, if the length of the string to be scanned is not a multiple of the
length of the searched multi-figure string, a remainder is left when
scanning from left to right. If, for example, the string to be scanned is 11
characters long and the searched string is 2 characters long, one character
remains, which leads to an abort and an error message. In such a case, the
value after TO in the LOOP statement can be reduced to the next lower
multiple of the searched string, i.e., to 10 in this example. The last
character is then no longer scanned; scanning it is not necessary anyway,
because a single character cannot match the longer searched string.
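The stepped, non-overlapping scan including the "next lower multiple" correction can be sketched in Python (an illustrative parallel; the helper function `count_stepped` is hypothetical, not part of SPSS or any library):

```python
# Non-overlapping scan in steps of len(target), mirroring
# LOOP #i=1 TO 12 BY 2; the loop bound is reduced to the next lower
# multiple of the search-string length so no remainder causes an abort.
def count_stepped(s, target):
    step = len(target)
    limit = (len(s) // step) * step          # next lower multiple
    return sum(1 for i in range(0, limit, step)
               if s[i:i + step] == target)

rows = ["AAAAAABBAAAA", "BBBBCCCCAAAA", "AAAACCAABBCC",
        "AABBCCAAAAAA", "AAAABBBBBBCC"]
freq_aa = [count_stepped(row, "AA") for row in rows]
print(freq_aa)  # -> [5, 2, 3, 4, 2]

# An 11-character string is handled gracefully: only positions 1..10
# are scanned, as recommended above.
print(count_stepped("AAAAAAAAAAA", "AA"))  # -> 5
```

The results match the FREQ_AA column of the SPSS output above.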
If the strings to be searched contain blanks (e.g. "AA "), they are treated like
a third, "invisible" character. The further programming then follows the logic
of the examples already shown above.
data list
 / STRING 1-12 (a).
begin data
AA AA AA BB
BB BB CC CC
CC AA BB CC
AA BB CC AA
BB BB BB CC
end data.
exe.

compute FREQ_AA= 0.
loop #i=1 to 12 by 3.
compute FREQ_AA = FREQ_AA + INDEX(SUBSTR(STRING,#i,3),'AA ').
end loop.
exe.
compute FREQ_BB= 0.
loop #i=1 to 12 by 3.
compute FREQ_BB = FREQ_BB + INDEX(SUBSTR(STRING,#i,3),'BB ').
end loop.
exe.
list.
STRING FREQ_AA FREQ_BB
AA AA AA BB 3,0 1,0
BB BB CC CC ,0 2,0
CC AA BB CC 1,0 1,0
AA BB CC AA 2,0 1,0
BB BB BB CC ,0 3,0
Number of cases read: 5 Number of cases listed: 5
These approaches can also be applied to numeric characters.
Number of row-wise elements within a string variable
The following approach counts how many elements occur within a string
row-wise. This approach is based on counting the delimiters between the
actual elements, e.g., usually (but not exclusively) a space (this approach,
however, is able to consider any other character as a spacing symbol as long
as it does not occur in the elements to be scanned themselves).
The further logic of this approach is uncomplicated: COMPUTE is used to
create a counting variable FREQ_ELEM starting at 1; each time a delimiter is
found, this counter is incremented by 1. After passing through LOOP, the
FREQ_ELEM value corresponds to the number of elements in a string. If
FREQ_ELEM were created with 0, the final FREQ_ELEM value would only
correspond to the number of delimiters but not to the number of elements.
The variable to be scanned may be empty, i.e., a missing (cf. row 3). In this
case, a FREQ_ELEM value that corresponds to the total length of the
variable to be scanned plus 1 would not indicate the number of elements
contained in it, but that this entry is completely empty.
data list
 / STRING 1-11 (a).
begin data
AA AA AA BB
BB BB CC CC

C A B C C C
AA BB CC AA
BB BB BB CC
end data.
exe.

compute freq_elem = 1.
loop #i=1 to 11.
compute freq_elem = freq_elem + index(substr(STRING,#i,1),' ').
end loop.
exe.
list.
STRING freq_elem
AA AA AA BB 4,0
BB BB CC CC 4,0
12,0
C A B C C C 6,0
AA BB CC AA 4,0
BB BB BB CC 4,0
Number of cases read: 6 Number of cases listed: 6
This approach is based on an important assumption: The distance between the
elements is uniform, e.g. one space. If the distance is not uniform, it would
have to be standardized before using the following approach.
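The "delimiters plus one" idea can be sketched in Python on the same data, including the all-blank row whose count of length + 1 flags it as empty (an illustrative parallel; the single-space delimiter is the assumption stated above):

```python
# Count row-wise elements by counting delimiters plus one, assuming a
# uniform single-space delimiter; an all-blank 11-character row yields 12.
rows = ["AA AA AA BB", "BB BB CC CC", "           ",
        "C A B C C C", "AA BB CC AA", "BB BB BB CC"]

freq_elem = [1 + row.count(" ") for row in rows]
print(freq_elem)  # -> [4, 4, 12, 6, 4, 4]
```

The value 12 in the third position corresponds to the SPSS result for the empty row: length of the scanned field (11) plus 1.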
The information about the number of elements within a string variable does
not say anything about the equality or difference of the elements. At this
point we refer to an approach already introduced above, which allows
searching a string row-wise for several known elements. The following
approach, in contrast, is able to log in detail the differences between
unknown or very many, even differing elements within a string variable
across all rows.
Counting of specific strings within a numeric variable
The following approach counts certain strings within a numeric variable. This
approach requires that the sequence of numeric characters is read in as a
string or converted into an alphanumeric format using STRING (see below).
data list
 / VALUES (f8).
begin data
11111122
22223333
33112233
11223311
22222233
end data.
exe.

string STRING (a8).
compute STRING=string(VALUES,f8.0).
exe.
compute freq_2 = 0.
loop #i=1 to 8.
compute freq_2 = freq_2 + index(substr(STRING,#i,1),'2').
end loop.
exe.
list.
The variable FREQ_2 shows how often the searched character occurs in the
examined variable.
VALUES STRING freq_2
11111122 11111122 2,00
22223333 22223333 4,00
33112233 33112233 2,00
11223311 11223311 2,00
22222233 22222233 6,00
Number of cases read: 5 Number of cases listed: 5
If a one-figure value is specified, the error message that may occur can be
ignored.
This example works in principle like the examples already presented for
string variables. The counting variable FREQ_2 is set to the initial value
0. As soon as the desired character is found in the created string STRING,
the counter is incremented by 1. Here again, the number of characters found
cannot exceed the maximum length of the scanned string.
The above instructions for programming the loop also apply here. The number
of iterations of the loop (LOOP #i=...) is matched to the length of the
variable to be checked. If a value is entered after TO that is greater than
the length of the string (e.g. 10 instead of 8), this will lead to an abort
and an error message. If a value is specified after TO that is smaller than
the length of the string (e.g. 5 instead of 8), the string is not scanned
completely, but only up to the fifth position (provided the loop starts at
the first position).
Counting of missings within a variable (numeric, string)
The following approach counts the number of missings in a column (variable).
The first variant is only valid for numeric variables; the AGGREGATE
function can be used for univariate-aggregated counting of missings.
For a single variable, the SPSS function COUNT cannot be used as an
aggregating function. Although COUNT counts (only) numerical missings, it
counts only row-wise, i.e., each numerical missing would be assigned the
value 1. COUNT does not, however, determine the total sum of the assigned 1s
and thus of the missings.
data list
 / VALUES (f8).
begin data
11111122

22223333
22223333

11223311
22222233
end data.
exe.
SAVE OUTFILE="C:\[Link]".

GET FILE="C:\[Link]".
AGGREGATE
 /OUTFILE='C:\[Link]'
 /BREAK=VALUES
 /N_missing=NUMISS(VALUES).
GET FILE='C:\[Link]'.
select if SYSMIS(VALUES).
list.

VALUES N_missing
   .       2
Number of cases read: 1 Number of cases listed: 1
The variable N_missing indicates the number of missing values in the
numerical variable VALUES (N=2). AGGREGATE per se is not able to aggregate
strings. Using a small detour (COMPUTE), the required information is passed
to AGGREGATE in the form of a numerical variable.
data list
 / STRING (A6).
begin data
ABCDEF

ABCDEF
ABCDEF

ABCDEF
ABCDEF
end data.
exe.
SAVE OUTFILE="C:\[Link]".

GET FILE='C:\[Link]'.
if STRING=" " N_missing=1.
exe.
AGGREGATE OUTFILE=*
 /BREAK=N_missing
 /FREQ=NU.
COMPUTE MISSING=1.
COMPUTE N_MISS=1.
exe.
AGGREGATE OUTFILE=*
 /BREAK=MISSING
 /NUMBER=SUM(N_MISS).
exe.
list variables=NUMBER.
The value of the variable NUMBER indicates the number of missing values
in the string variable STRING (N=2).
Counting the different elements within a string variable (row- and
column-wise)
This approach counts how many elements occur within a string; it can also be
used to analyze complex text responses (see Schendera (2005) for an
explanation of this approach). Due to technical changes in IBM SPSS, the
following version differs from the original version published in Schendera
(2007) for SPSS v15. The original string MY_STRING is transformed into a
vector (NEW_STR), which in turn is split into n variables NEW_STR1 to
NEW_STRn, which are then transposed into the result variable STRING. Without
adaptations, this program processes a string of max. 97 characters width
(fixed data list type); e.g. even if a fourth character column in MY_STRING
ends half-way, as at ID 06.
DATA LIST FIXED
/ID 1-2 MY_STRING 4-100 (A).
begin data
01 AA BB DD
02 AA EE CC
03 AA BB CC
04 DD EE GG
05 AA BB CC
06 AA DD EE
07 AA BB CC
08 FF GG CC
09 AA BB CC
10 AA HH KK
end data.
exe.
string NEW_STR (A100).
compute NEW_STR=rtrim(ltrim(MY_STRING)).
compute N_DISTANCE=( (length(NEW_STR)) -
 (length(replace(NEW_STR,' ',''))) )
 + 1.
exe.
vector NEW_STR (100,A100) .
loop #NEXTS = 1 to N_DISTANCE by 1.
compute DISTANCE = INDEX(NEW_STR,' ') .
do if (DISTANCE ne 0).
compute NEW_STR(#NEXTS) = substr(NEW_STR,1,DISTANCE-1).
compute NEW_STR = substr(NEW_STR,DISTANCE+1).
else.
compute NEW_STR(#NEXTS) = NEW_STR .
end if.
end loop .
exe.
varstocases
/make STRING from NEW_STR1 to NEW_STR100
/index = order (100)
/keep = ID
/null = keep .
variable labels STRING
"Scanning all elements of a string".
select if (STRING ne "").
exe.
frequencies
/variables= STRING.

Frequencies
Scanning all elements of a string

            Frequency  Percent  Valid Percent  Cumulative Percent
Valid AA         8       26,7       26,7            26,7
      BB         5       16,7       16,7            43,3
      CC         6       20,0       20,0            63,3
      DD         3       10,0       10,0            73,3
      EE         3       10,0       10,0            83,3
      FF         1        3,3        3,3            86,7
      GG         2        6,7        6,7            93,3
      HH         1        3,3        3,3            96,7
      KK         1        3,3        3,3           100,0
      Total     30      100,0      100,0

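The effect of the VECTOR/VARSTOCASES program, splitting each row's string into elements, stacking them into one long column, and tabulating frequencies, can be sketched in Python (an illustrative parallel on the same sample data; not SPSS code):

```python
# Split each row's string into elements, stack them into one long column,
# and tabulate frequencies, mirroring the VECTOR/VARSTOCASES program.
from collections import Counter

rows = ["AA BB DD", "AA EE CC", "AA BB CC", "DD EE GG", "AA BB CC",
        "AA DD EE", "AA BB CC", "FF GG CC", "AA BB CC", "AA HH KK"]

elements = [element for row in rows for element in row.split()]
freq = Counter(elements)
print(freq["AA"], freq["CC"], len(elements))  # -> 8 6 30
```

The counts match the frequency table above (AA 8, CC 6, 30 elements in total); `str.split` also absorbs non-uniform spacing, which the SPSS version handles via RTRIM/LTRIM.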
Up to this point the screenings were limited to one variable. The following
approaches can be applied to several variables (data columns)
simultaneously.

10.3 Screenings within several columns (variables)
With the screening approaches presented in the following, several variables
can be checked simultaneously. If sections present several variants, the
respective subsections are preceded by summary overviews.
The first approaches under 10.3.1. are based exclusively on the SPSS
function COUNT and are quite easy to use. The further approaches differ,
among other things, according to the type of the variable (numeric, string),
what the respective variable is to be checked for (uniqueness of values,
certain strings, missings, etc.), special features of the test procedure itself and
how the test result is to be output (per row or also aggregated). Above all, the
other approaches are a bit more demanding to program.
Under 10.3.2., two approaches for counting multivariate value combinations
are presented: A DO IF approach for the combinatorics of dichotomous
numerical values, as well as a CONCAT/SUBSTR combination for counting
multivariate combinatorics also of strings with more than two levels.
Under 10.3.3., an approach is presented with which sequential numerical or
alphanumeric data can be checked for absolute agreement. This approach is
based on programming with VECTOR.
Under 10.3.4., an approach is presented that carries out a column- and
row-wise analysis of several numerical variables and thus allows checking
compliance with certain parameters such as outliers, maxima and missings in
the form of a multivariate data screening. This approach is based on a
combination of DO IF, LOOP and VECTOR programming.
Under 10.3.5., three approaches are presented for the recoding of values and
missings in several variables. The approaches are based on a DO REPEAT -
END REPEAT command or on macros.
Under 10.3.6., a further variant of replacing missings is introduced, this time
a standardizing "filling" of even several data rows using the LAG function,
demonstrated on numeric and string variables.
Finally, 10.3.7. introduces two variants for renaming numerous variable
names, e.g., if uniform prefixes or suffixes are to be assigned.

10.3.1 Counting specific Values, Strings or Missings


Overview:

Counting individual strings in several string variables
Counting multiple strings in several string variables
Counting missings or valid values in several string variables
Counting values and missings in several numeric variables
Counting values and strings in multiple variables
data list
/ STRING1 1-2 (a) STRING2 4-5 (a) STRING3 7-8 (a)
STRING4 10-11 (a) NUM1 13 NUM2 15 NUM3 17 NUM4 19 .
begin data
AA AA AA BB 1 2 3 4
BB BB CC CC 5 5 3 8
CC AA BB CC 1 2 3 4
AA CC AA 5 9 3 7
BB BB BB CC 4 4 8 2
end data.
exe.
The TO option can also be used for longer variable lists. The advantage is
that the programming is less complex: instead of longer variable lists, only
the first and last variable of the list need be specified.

Without TO: count MYSUM = NUM1 NUM2 NUM3 NUM4 (1).
With TO:    count MYSUM = NUM1 to NUM4 (1).

Using the TO option thus addresses all variables in COUNT that are located
between the first and last variable of the list. The arrangement of the
variables in the dataset is decisive here, which can also be a disadvantage:
if there is a variable in the middle of this list that does not belong to
it, COUNT will still include it in the count and consequently (and often
unnoticed) output incorrect results. If there is a variable in the middle of
this list that cannot be evaluated for other reasons (e.g. a string variable
in the midst of numerical variables), no false results will be obtained, but
the processing of the COUNT statement is blocked. The reduced programming
effort is therefore offset by the necessary care in checking the correct
variable sequence in the dataset.
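The row-wise counting that COUNT performs across a variable list can be sketched in Python on the sample data above (an illustrative parallel; the dictionary representation of rows is an assumption of this sketch):

```python
# Row-wise counting of a target value across several columns,
# mirroring: count MYSUM = NUM1 to NUM4 (1).
rows = [
    {"NUM1": 1, "NUM2": 2, "NUM3": 3, "NUM4": 4},
    {"NUM1": 5, "NUM2": 5, "NUM3": 3, "NUM4": 8},
    {"NUM1": 1, "NUM2": 2, "NUM3": 3, "NUM4": 4},
    {"NUM1": 5, "NUM2": 9, "NUM3": 3, "NUM4": 7},
    {"NUM1": 4, "NUM2": 4, "NUM3": 8, "NUM4": 2},
]
cols = ["NUM1", "NUM2", "NUM3", "NUM4"]   # explicit list instead of TO

mysum = [sum(1 for c in cols if row[c] == 1) for row in rows]
print(mysum)  # -> [1, 0, 1, 0, 0]
```

Listing the columns explicitly avoids the TO pitfall described above: a stray variable between NUM1 and NUM4 in the dataset cannot silently enter the count.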
Counting individual strings in several string variables
The following approach allows checking several string variables for the
frequency of occurrence of a single string.
count STRING_AA = STRING1 STRING2 STRING3 STRING4 ('AA') .
exe.
COUNT counts the occurrence of certain strings (e.g. "AA") in the string
variables STRING1 to STRING4 and stores the result in the variable
STRING_AA.
Counting multiple strings in several string variables
The following approach checks several string variables for the frequency of occurrence of several strings at once.
count STRING_AABB = STRING1 STRING2
STRING3 STRING4 ('AA','BB') .
exe .
COUNT counts the occurrence of certain strings (e.g. "AA" and "BB") in the
string variables STRING1 to STRING4 and stores the result in the variable
STRING_AABB.
Counting missings or valid values in several string variables
count MISSINGS = STRING1 STRING2 STRING3 STRING4 (' ') .
exe.
compute VALIDS= 4 - MISSINGS.
exe.
COUNT counts the missings in the string variables STRING1 to STRING4
and stores the result in the variable MISSINGS.
COUNT itself cannot output the number of valid values in STRING1 to STRING4. A detour via COMPUTE delivers the desired result: the difference between the number of examined variables (here STRING1 to STRING4, i.e. N=4) and the number of determined missings (the result variable MISSINGS) is stored in the variable VALIDS.
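The same detour can be sketched outside SPSS. The following Python snippet is a hypothetical illustration (not part of the book's SPSS workflow) of counting blank entries per row and deriving the number of valid values as the difference:

```python
# Hypothetical illustration of the COUNT + COMPUTE detour:
# count blank entries per row, then derive valids as N minus missings.
rows = [
    ["AA", "AA", "AA", "BB"],
    ["BB", "BB", "CC", "CC"],
    ["AA", "", "CC", "AA"],   # one missing (blank) entry
]

def count_missings_and_valids(row):
    """Return (missings, valids) for one data row of string variables."""
    missings = sum(1 for value in row if value.strip() == "")
    valids = len(row) - missings   # the COMPUTE detour
    return missings, valids

results = [count_missings_and_valids(row) for row in rows]
print(results)  # [(0, 4), (0, 4), (1, 3)]
```

As in the SPSS version, the number of valid values is never counted directly; it always falls out of the subtraction.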
Counting values and missings in several numeric variables
count NUMERIC_4 = NUM1 NUM2 NUM3 NUM4 (4) .
exe.
COUNT counts certain values (e.g. 4) in the numerical variables NUM1 to
NUM4 and stores the result in the variable NUMERIC_4.
count NUMERIC_4_1 = NUM1 NUM2 NUM3 NUM4 (4, 1) .
exe.
COUNT counts certain values (e.g. 4 and 1) in the numerical variables
NUM1 to NUM4 and stores the result in the variable NUMERIC_4_1.
count NUM_MISS = NUM1 NUM2 NUM3 NUM4 (SYSMIS) .
exe.
compute VALIDS= 4 - NUM_MISS.
exe.
COUNT counts the system-defined missings in the numerical variables
NUM1 to NUM4 and stores the result in the variable NUM_MISS. For
counting the system- and user-defined missings, MISSING can be used
instead of SYSMIS.
Counting values and strings in multiple variables
COUNT counts certain numeric values or strings and stores the result in the variable AABB_1_4.
count AABB_1_4 =NUM1 NUM2 (1, 3)
STRING1 STRING2 ('AA', 'BB') .
exe .
Output:

list var= STRING_AA NUMERIC_4 MISSINGS VALIDS.

STRING_AA NUMERIC_4 MISSINGS VALIDS
      3,0       1,0       ,0    4,0
       ,0        ,0       ,0    4,0
      1,0       1,0       ,0    4,0
      2,0        ,0      1,0    3,0
       ,0       2,0       ,0    4,0

Number of cases read: 5 Number of cases listed: 5

list var= STRING_AABB NUMERIC_4_1 AABB_1_4.

STRING_AABB NUMERIC_4_1 AABB_1_4
        4,0         2,0      3,0
        2,0          ,0      2,0
        2,0         2,0      2,0
        2,0          ,0      1,0
        3,0         2,0      2,0

Number of cases read: 5 Number of cases listed: 5
The respective sums indicate how often the values or strings occur in the checked variables. COUNT can therefore be applied to numeric variables, to strings, or to strings and numeric variables simultaneously.
In the case of several COUNT applications, the total frequency of the searched-for values could be determined by a final summation of the individual results (e.g. FREQ_AA plus FREQ_BB etc.; not shown further).
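The row-wise logic of COUNT over mixed target lists can be sketched as follows; this Python snippet is a hypothetical illustration of the idea, not the book's SPSS code:

```python
# Sketch of COUNT over mixed numeric and string variables: per row,
# count how often any of the target values or target strings occurs.
def count_targets(numeric_values, string_values, num_targets, str_targets):
    hits = sum(1 for v in numeric_values if v in num_targets)
    hits += sum(1 for s in string_values if s in str_targets)
    return hits

# One row analogous to count AABB_1_4 = NUM1 NUM2 (1, 3)
#                                       STRING1 STRING2 ('AA', 'BB').
row_hits = count_targets([1, 2], ["AA", "AA"], {1, 3}, {"AA", "BB"})
print(row_hits)  # 3: NUM1=1 matches, plus the two 'AA' strings
```

The numeric and string targets are checked against their own variable lists, and the hits are summed into a single result per row, just as COUNT does.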
10.3.2 Counting the Combinatorics of several Variables – Analysis of the Levels in several Variables
Overview:
 For two levels, numerical: DO IF approach
 For two or more levels, numeric or string: CONCAT/SUBSTR combination
Often it is of interest not only how often single values occur somewhere across certain variable lists, but rather how often characteristics occur in two or more variables at the same time. Such a question examines the combinatorics of the levels of variables.
In the following, two approaches are presented: first a DO IF approach using the example of dichotomous numerical values, then an approach for strings with more than two levels, based on a combination of CONCAT and SUBSTR commands.
The DO IF approach examines the combinatorics of the joint occurrence of
certain values or missings, not their individual frequency. The following
example can be used e.g. if the data are dichotomously coded, e.g. 1/0, 1/2
etc. This approach is based on the simple distinguishing and counting of
existing vs. non-existing values. Subsequently, an approach is presented
based on CONCAT and SUBSTR commands, which has further advantages.
DATA LIST LIST
/ NUM1 NUM2 NUM3 .
begin data
1 1 1
1 1 0
1 0 0
0 0 1
1 1 0
0 0 0
0 1 0
0 1 1
0 1 1
end data.
exe.
do repeat VARLIST=NUM1 to NUM3 .
recode VARLIST (0=SYSMIS).
end repeat.
exe.
compute COMBI=999.
exe.
do if (nvalid(NUM1) and nvalid(NUM2) and nvalid(NUM3)).
compute COMBI = 1.
else if (nvalid(NUM1) and nmiss(NUM2) and nmiss(NUM3)).
compute COMBI = 2 .
else if (nmiss(NUM1) and nvalid(NUM2) and nmiss(NUM3)).
compute COMBI = 3.
else if (nmiss(NUM1) and nmiss(NUM2) and nvalid(NUM3)).
compute COMBI = 4.
else if (nvalid(NUM1) and nvalid(NUM2) and nmiss(NUM3)).
compute COMBI = 5 .
else if (nvalid(NUM1) and nmiss(NUM2) and nvalid(NUM3)).
compute COMBI = 6.
else if (nmiss(NUM1) and nvalid(NUM2) and nvalid(NUM3)).
compute COMBI = 7.
else if (nmiss(NUM1) and nmiss(NUM2) and nmiss(NUM3)).
compute COMBI = 8.
end if.
exe.
variable labels COMBI "Combinatorics".
value labels
/COMBI
1 'Codes 1, 2, 3'
2 'only Code 1'
3 'only Code 2'
4 'only Code 3'
5 'Codes 1 and 2'
6 'Codes 1 and 3'
7 'Codes 2 and 3'
8 'No Codes'
999 '!! Error!!'.
list.
Result:
NUM1 NUM2 NUM3 COMBI
1,0 1,0 1,0 1,0
1,0 1,0 . 5,0
1,0 . . 2,0
. . 1,0 4,0
1,0 1,0 . 5,0
. . . 8,0
. 1,0 . 3,0
. 1,0 1,0 7,0
. 1,0 1,0 7,0
Number of cases read: 9 Number of cases listed: 9
The DO REPEAT section standardizes the dichotomous coding from 1 and 0 to 1 ("present") and system-defined missing ("not present"). The DO IF - ELSE IF sections check the respective combinations occurring row-wise. COMPUTE COMBI=999 is a safety measure: if the combinatorics have not been programmed exhaustively, the code "999" remains for the missing combination and flags the error. VARIABLE and VALUE LABELS assign the correct interpretations to the combinations.
The DO IF approach assumes that the data is strictly complete and genuinely dichotomous; only the two event possibilities "present" and "not present" count. This approach is not suitable for three data levels, e.g. 1, 2 and missings. For more than two value levels, a combination of CONCAT and SUBSTR commands can be used instead, as presented e.g. in Section 4.12. The only condition to be considered here is that numeric variables must be converted into strings beforehand. If the variables were converted to strings, an alternative programming of the combinatorics could look like this:
string COMBI2 (a3).
compute COMBI2 = concat
(substr(NUM1,1,1),substr(NUM2,1,1),substr(NUM3,1,1)).
exe.
variable labels COMBI2 "Combinatorics".
value labels
/COMBI2
"111" 'Codes 1, 2, 3'
"100" 'only Code 1'
"010" 'only Code 2' . etc.
This approach has several advantages. First, it is possible to work with
numeric, string, and both data formats at the same time (provided, of course,
that they have all been converted to strings beforehand). Second, there is no
need for recoding, e.g. from zeros to missings. And third, there is no need to
program the combinatorics per se (DO IF etc.). For an explanation of
programming with combinations of CONCAT and SUBSTR commands,
please refer to Section 4.12.
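The CONCAT/SUBSTR idea can be sketched in a few lines; the following Python snippet is a hypothetical illustration of the pattern-key approach, not the SPSS implementation itself:

```python
# Sketch of the CONCAT/SUBSTR idea: concatenate the first character of
# each (string-converted) variable into a pattern key such as "110",
# then tally how often each combination of levels occurs.
from collections import Counter

rows = [
    ["1", "1", "1"],
    ["1", "1", "0"],
    ["0", "1", "1"],
    ["0", "1", "1"],
]

patterns = Counter("".join(value[0] for value in row) for row in rows)
print(patterns)  # Counter({'011': 2, '111': 1, '110': 1})
```

Because the key simply concatenates whatever characters the variables hold, the same sketch works for more than two levels and for arbitrary string codes without enumerating the combinations in advance.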
10.3.3 Column-wise analysis for absolute match
The following program makes it possible to check a direct sequence of numeric or alphanumeric data for an absolute match. The first variable (after the ID variable) must be the common reference variable (REFERENCE) for all subsequent variables (CMPARSN1, CMPARSN2, etc.). This program is particularly suitable for checking the repeated, column-wise, absolutely correct spelling of complicated technical codes, such as those for persons or products. A conceivable application would be, for example, the use of 100% error-free reference tables against which entries in other data stores are checked.
data list
/ ID 1 REFERENCE 3-9 (A) CMPARSN1 11-17 (A)
CMPARSN2 19-25 (A) CMPARSN3 27-33 (A) CMPARSN4 35-41 (A).
begin data
1 141B85B 141B85B 141B85B 141B85B 141B85B
2 241B85B 241BX5B 241B85B 241B85B 241B85B
3 341B85B 341B85B 341X85B 341B85B 341B85B
4 441B85B 441B85B 441B85B 441BX5B 441B85B
end data.
vector CORRECT (4) .
vector CMPARSN = CMPARSN1 to CMPARSN4.
Note: In earlier SPSS versions there was a different VECTOR specification.
loop #i=1 to 4.
compute CORRECT(#i)=(REFERENCE=CMPARSN(#i)).
end loop.
exe.
compute FILTRE=sum(CORRECT1 to CORRECT4).
select if (FILTRE < 4).
save outfile = 'C:\MyData\[Link]'
/keep ID CORRECT1 to CORRECT4.
exe.
get file = 'C:\MyData\[Link]'.
exe.
list variables ID CORRECT1 to CORRECT4.
The output is limited to the rows without an absolute match in all values; rows with perfect matches throughout (e.g. row 1) are suppressed by the FILTRE filter. In the output, the code "0" indicates that the value of the variable concerned differs from that of the reference variable.
ID CORRECT1 CORRECT2 CORRECT3 CORRECT4
2 ,00 1,00 1,00 1,00
3 1,00 ,00 1,00 1,00
4 1,00 1,00 ,00 1,00
Number of cases read: 3 Number of cases listed: 3
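The reference-match logic can be sketched compactly; this Python snippet is a hypothetical illustration of the same check, not the book's SPSS program:

```python
# Sketch of the column-wise absolute-match check: compare each
# comparison column against the reference and flag 1/0 per cell;
# rows with a perfect match in every column are filtered out.
rows = [
    (1, "141B85B", ["141B85B", "141B85B", "141B85B", "141B85B"]),
    (2, "241B85B", ["241BX5B", "241B85B", "241B85B", "241B85B"]),
]

mismatched = []
for row_id, reference, comparisons in rows:
    correct = [int(value == reference) for value in comparisons]
    if sum(correct) < len(correct):      # analogous to FILTRE < 4
        mismatched.append((row_id, correct))

print(mismatched)  # [(2, [0, 1, 1, 1])]
```

Row 1 matches perfectly and is suppressed; for row 2 the 0 flag marks which comparison column deviates from the reference, mirroring the CORRECT1 to CORRECT4 output.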
10.3.4 Column-wise and row-wise analysis of several numerical data
The following program makes it possible to analyze a sequence of numerical data column- and row-wise. Strictly speaking, this analysis is a multiple univariate formal analysis and does not allow deriving bi- or multivariate plausibility. However, in the form of a data screening it allows checking numerous numerical variables for compliance with certain parameters, e.g. outliers, maxima and missings.
input program.
vector VAR(20).
loop #Z = 1 to 1000.
loop #V = 1 to 20.
compute VAR(#V) = normal(1).
end loop.
end case.
end loop.
end file.
end input program.
title "Column-wise analysis (display of 20 variables)".
descriptives variables=all.
exe.
title "Row-wise analysis (display of 1000 rows)".
compute MAXVALUE=max(VAR1 to VAR20).
exe.
compute VALCOUNT=0.
exe.
vector VAR_LIST=VAR1 to VAR20.
loop #cnt=20 to 1 by -1.
do if MAXVALUE=VAR_LIST(#cnt).
compute VARNUMBR=#cnt.
compute VALCOUNT=VALCOUNT+1.
end if.
end loop.
exe.
compute
MAXVALUE2=max(VAR1 to VAR20).
exe.
list variables=VARNUMBR MAXVALUE VALCOUNT MAXVALUE2 .
In the section from INPUT PROGRAM to END INPUT PROGRAM, a
random dataset with 1000 rows and 20 variables ("VAR1" to "VAR20") is
created. The values of the variables VAR1 to VAR20 are normally
distributed. Using the procedure DESCRIPTIVES, all variables are analyzed
column by column; in this case, the parameters of 20 variables are displayed.
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
VAR1 1000 -3,03 2,89 -,0231 1,01520
VAR2 1000 -3,18 3,82 -,0077 1,02171
VAR3 1000 -3,38 3,21 -,0047 1,02458
VAR4 1000 -3,60 2,99 -,0272 ,98769
VAR5 1000 -3,06 2,95 ,0031 1,01111
VAR6 1000 -3,57 3,01 -,0707 ,95463
VAR7 1000 -3,29 3,48 -,0611 ,99149
VAR8 1000 -3,21 3,30 ,0351 ,95719
VAR9 1000 -3,17 2,97 -,0141 ,95985
VAR10 1000 -3,24 3,53 -,0012 ,97541
VAR11 1000 -3,81 3,42 ,0321 ,98711
VAR12 1000 -3,56 3,51 ,0180 ,99674
VAR13 1000 -3,39 3,31 -,0230 1,00341
VAR14 1000 -2,87 3,10 ,0230 ,99288
VAR15 1000 -3,91 2,99 -,0148 ,99417
VAR16 1000 -2,99 3,62 -,0101 1,00645
VAR17 1000 -2,91 3,47 -,0269 ,96262
VAR18 1000 -3,57 3,03 -,0458 1,00807
VAR19 1000 -3,03 3,61 ,0261 1,02311
VAR20 1000 -2,89 3,80 ,0485 ,99904
Valid N (listwise) 1000
For the row-wise analysis of VAR1 to VAR20, the variables MAXVALUE, VALCOUNT and VARNUMBR are created via the vector VAR_LIST. The largest value of the variable list is stored in MAXVALUE, the frequency of its occurrence in VALCOUNT, and the number of the variable in which this value occurs for the first time in VARNUMBR.
VARNUMBR MAXVALUE VALCOUNT MAXVALUE2
3,00 1,70 1,00 1,70
4,00 2,07 1,00 2,07
15,00 1,32 1,00 1,32
10,00 2,64 1,00 2,64
2,00 1,34 1,00 1,34
11,00 2,45 1,00 2,45
11,00 1,17 1,00 1,17
18,00 1,20 1,00 1,20
15,00 2,03 1,00 2,03
9,00 2,15 1,00 2,15
6,00 1,96 Output abbreviated
MAXVALUE works accurately: values that are not stored uniformly in SPSS's internal representation are not output as equal. MAXVALUE2 is intended to illustrate that this multivariate screening program can be supplemented by numerous other functions (e.g. for missings), e.g.
compute
SYSMISUM=SYSMIS(VAR1)+SYSMIS(VAR2)+SYSMIS(VAR3).
exe.
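The row-wise part of the screening can be sketched as follows; this Python snippet is a hypothetical illustration of the MAXVALUE/VALCOUNT/VARNUMBR logic, not the SPSS loop itself:

```python
# Sketch of the row-wise screening: for each row, find the maximum,
# how often it occurs, and the (1-based) index of its first occurrence.
def screen_row(values):
    max_value = max(values)
    val_count = values.count(max_value)
    var_number = values.index(max_value) + 1  # first variable holding the max
    return var_number, max_value, val_count

print(screen_row([0.3, 1.7, 1.7, -0.2]))  # (2, 1.7, 2)
```

As in the SPSS program, further row-wise statistics (e.g. a count of missings per row) could be added to the same pass.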
Finally, a simple variant for column- and row-wise analysis of missings is
presented; note also the effect of the MISSING VALUES statement on the
output of FREQUENCIES.
data list
/ID 1 VAR1 3-6 (A) VAR2 8-11 (A) VAR3 13-16 (A).
begin data
1 YES NO NO
2 YES NO
3 NO
4 YES NO
5 NO YES
end data.
exe.
title "Row-wise analysis of missings".
count TEXTMISS = VAR1 VAR2 VAR3 (" ") .
variable labels
TEXTMISS 'Sum of row-wise missings' .
exe .
list variables=all.
title "Column-wise analysis of missings ".
compute misvar1 = (length(rtrim(var1)) = 0).
compute misvar2 = (length(rtrim(var2)) = 0).
compute misvar3 = (length(rtrim(var3)) = 0).
exe.
descriptives
variables = misvar1 misvar2 misvar3
/ statistics= sum.
* Output in joint table *.
frequencies
variables=VAR1 VAR2 VAR3
/format=notable
/order=analysis .
* Output in separate tables *.
missing values VAR1 VAR2 VAR3 (" ").
frequencies
variables=VAR1 VAR2 VAR3.
10.3.5 Recoding of values and missings in several variables
Overview:
 Mixed variable lists: DO REPEAT - END REPEAT approach
 Sorted variable lists: Macro approaches

Whereas the "multivariate" approaches presented so far served primarily to analyze data, e.g. counting values (combinations), missings and strings, the two following programs are mainly used for recoding values and missings in several variables. Different codes for missings often have to be unified after merging datasets. The following approaches, DO REPEAT - END REPEAT and the two MACMISS macros, are able to standardize values in hundreds of variables and thousands of data rows at one time.
The DO REPEAT - END REPEAT program, as well as the two macros below, each make it possible to unify values in hundreds of numeric variables at once. The DO REPEAT - END REPEAT program works "blind": it processes all variables between the beginning and the end of a list, regardless of whether they can be processed at all.
The two macros are recommended for programming environments in which macros are used or in which variable lists of a uniform type are available. The first of the macros introduced below processes individually specified variables; the second macro processes variables specified in list form.
data list
/ID 1 VAR1 3-5 VAR2 7-9 (A) VAR3 11-14 .
begin data
1 123 AAA 857
2 BBB 857
3 857
4 CCC 857
5 857 323
end data.
exe.
do repeat VARLIST=VAR1 to VAR3.
recode VARLIST (123=999).
end repeat.
exe.
list.

ID VAR1 VAR2 VAR3
1   999 AAA   857
2     . BBB   857
3   857         .
4     . CCC   857
5   857       323

Number of cases read: 5 Number of cases listed: 5
The variable list VARLIST of the DO REPEAT - END REPEAT program also contains a string variable (cf. VAR2). SPSS issues an error message (see below) because the recoding can only be applied to numeric values. This error message can therefore be ignored; the values of all numeric variables are unified without exception.
>Error # 4654 in column 17. Text: 123
>The RECODE command attempts to test a string variable for >having a
numeric value. Note that LOWEST, HIGHEST, and
>SYSMIS are considered to be numeric values.
>This command not executed.
So the advantage of DO REPEAT - END REPEAT programming is that it
can also be applied to lists of variables of mixed types.
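The tolerant behavior on mixed lists can be sketched as follows; this Python snippet is a hypothetical illustration of the idea, not the SPSS mechanism:

```python
# Sketch of the DO REPEAT recode over a mixed variable list: numeric
# entries equal to 123 are recoded to 999; string entries are skipped,
# mirroring SPSS's ignorable error for string variables in the list.
def recode_row(row, old=123, new=999):
    return [new if isinstance(v, (int, float)) and v == old else v
            for v in row]

print(recode_row([123, "AAA", 857]))  # [999, 'AAA', 857]
```

The string entry passes through untouched while every numeric match is unified, which is exactly why the mixed-list error message can be ignored.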
The two following MACMISS macros likewise make it possible to recode missings or values in several variables.
The first macro unifies e.g. missings in several variables individually.
data list
/VAR1 1-2 VAR2 4-5 VAR3 7-8 VAR4 10-11 .
begin data
6 12 66 98
14 97 27 29
end data.
define MACMISS (!pos!charend('/')).
!do !i !in (!1).
missing values !i (99).
if (!i = 97) or (!i = 98) !i=99.
!doend
!enddefine.
MACMISS VAR1 VAR2 VAR3 VAR4 /.
list.
The second macro unifies missings in several variables list-wise.
define !MACMISS (arg1=!tokens(4)).
missing values !arg1 (99).
recode !arg1 (97 98=99).
exe.
!enddefine.
!MACMISS arg1 = VAR1 VAR2 VAR3 VAR4.
list.
The outputs of the two macros are identical.
VAR1 VAR2 VAR3 VAR4
6 12 66 99
14 99 27 29
Number of cases read: 2 Number of cases listed: 2
The advantage of macros in general is that they can be extended into very
powerful applications by further statements. The disadvantage of both
macros, however, is that they cannot be applied to mixed variable lists
without further ado. Either the MACMISS macros can be extended by further
statements and/or the variable lists can be sorted.
10.3.6 Uniform "filling" of several data rows (LAG function)
If a dataset contains several data rows for the same person, these rows may differ in their completeness. The following program makes it possible to fill gaps in numeric and string variables: the values are copied downward from the uppermost row of a person (e.g. SMITH) until the next person in the dataset (e.g. JONES) is reached.
data list
/NAME 1-5 (A) FNAME 7 (A) PNUMBER 9-11 DPTMENT 13 (A) .
begin data
SMITH J 131 M
SMITH
JONES K 234 M
JONES
JONES
end data.
exe.
do if NAME = lag(NAME).
compute FNAME = lag(FNAME).
compute PNUMBER = lag(PNUMBER).
end if.
exe.
list.

NAME  FNAME PNUMBER DPTMENT
SMITH J     131     M
SMITH J     131
JONES K     234     M
JONES K     234
JONES K     234

Number of cases read: 5 Number of cases listed: 5
The prerequisite is that the topmost data row in each case contains "filling
material" at all. In the example, the values of the variable DPTMENT are
intentionally not filled.
Please refer to Schendera (2005, 89f.) for special features when using the
LAG function (sorting of data, interaction with row-selecting statements such
as SELECT IF or SAMPLE).
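The LAG-based fill-down can be sketched in plain Python; this is a hypothetical illustration of the cascading copy, not the SPSS code:

```python
# Sketch of the LAG-based fill-down: if a row belongs to the same NAME
# as the previous row, copy selected fields from that previous row.
rows = [
    {"NAME": "SMITH", "FNAME": "J", "PNUMBER": 131},
    {"NAME": "SMITH", "FNAME": "",  "PNUMBER": None},
    {"NAME": "JONES", "FNAME": "K", "PNUMBER": 234},
    {"NAME": "JONES", "FNAME": "",  "PNUMBER": None},
]

for previous, current in zip(rows, rows[1:]):
    if current["NAME"] == previous["NAME"]:       # NAME = lag(NAME)
        current["FNAME"] = previous["FNAME"]      # copy from the row above
        current["PNUMBER"] = previous["PNUMBER"]

print([(r["NAME"], r["FNAME"], r["PNUMBER"]) for r in rows])
```

Because each filled row immediately serves as the "previous" row for the next comparison, the values cascade down through any number of rows per person, just as LAG does after the row above has been updated.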
10.3.7 Renaming numerous variable names (prefixes, suffixes)
Overview:
 Assigning prefixes
 Assigning suffixes
The following two programs are very useful for renaming several variable names at once, e.g. adding or changing uniform prefixes or suffixes. A uniform renaming of variable names may be necessary, e.g., if variables were named incorrectly and received the wrong prefixes. Changing an incorrect "T1" prefix into a correct "T2" would be an example of a uniform change of several variable names.
If variables from two datasets are to be compared with each other by calculating differences, at least the variable names of one of the datasets must be different, ideally renamed uniformly. As an example, imagine a survey with two measurement points in time. The variables at the first measurement point all received the prefix "T1" in the dataset (e.g. T1AGE, T1CITY, T1VOTE, etc.; not shown). The values from the second measurement point were stored in a separate dataset, but under the same variable names (see below). To calculate the difference between the values of the two measurement points, identical variable names cannot be used; the calculation requires two systematically different designations. This is an example of a uniform addition to several variable names, e.g. adding a pre- or suffix to a variable name. The following program renames the "T1" variables to "T2" (e.g. T2AGE, T2CITY, T2VOTE, etc.). If there is a first (not displayed) dataset with a "T1" prefix, as well as a second dataset with a "T2" prefix, then after merging the data the variables can easily be compared with each other pairwise via differences, e.g. compute AGEDIFF = T2AGE - T1AGE.
data list free
/ T1DAY, T1AGE, T1CITY, T1LAKE, T1FOOD, T1FISH.
begin data
12 31 4 2 34 5
end data.
list.
save outfile='C:\[Link]'.
flip.
string NEWNAME(A8).
compute NEWNAME=concat("T2",substr(CASE_LBL,3)).
write outfile 'C:\[Link]'
/"rename variable ("CASE_LBL"="NEWNAME").".
exe.
get file='C:\[Link]'.
include 'C:\[Link]'.
list.
Note: The CASE_LBL variable is created by the SPSS FLIP process and
contains the old variable names. The variables in dataset "DATA2" now have
the following names:
T2DAY T2AGE T2CITY T2LAKE T2FOOD T2FISH
12,00 31,00 4,00 2,00 34,00 5,00
Number of cases read: 1 Number of cases listed: 1
Minor changes to this program make it possible to assign uniform suffixes
resp. other names. The following variant assigns suffixes, for example.
string NEWNAME(A8).
compute NEWNAME=CASE_LBL.
exe.
compute POSITN=index (NEWNAME,'T1').
exe.
string STRING (A8).
compute STRING=substr(NEWNAME,POSITN+2,7-POSITN).
exe.
string STRINGWITH (A8).
compute STRINGWITH=concat ((rtrim(STRING,' ')), "T1").
exe.
write outfile 'C:\[Link]'
/"rename variable ("CASE_LBL"="STRINGWITH").".
exe.
get file='C:\[Link]'.
include 'C:\[Link]'.
list.
DAYT1 AGET1 CITYT1 LAKET1 FOODT1 FISHT1
12,00 31,00 4,00 2,00 34,00 5,00
Number of cases read: 1 Number of cases listed: 1
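The core of both renaming variants is a simple string operation on the variable names; this Python snippet is a hypothetical illustration of that operation, not the FLIP/WRITE/INCLUDE mechanics:

```python
# Sketch of uniform renaming: swap a "T1" prefix for "T2", or move it
# to the end as a suffix, for a whole list of variable names.
names = ["T1DAY", "T1AGE", "T1CITY"]

prefixed = ["T2" + n[2:] for n in names if n.startswith("T1")]
suffixed = [n[2:] + "T1" for n in names if n.startswith("T1")]

print(prefixed)  # ['T2DAY', 'T2AGE', 'T2CITY']
print(suffixed)  # ['DAYT1', 'AGET1', 'CITYT1']
```

The SPSS programs above achieve the same effect indirectly: FLIP turns the names into data, CONCAT/SUBSTR build the new names, and the generated RENAME VARIABLE syntax is executed via INCLUDE.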
11 Working with several (separate) datasets
The previous chapters introduced the review of quality criteria on one dataset
at a time. Of course, working with only one dataset has its advantages.
However, the practice of data analysis and data management is diverse
enough that working with several datasets at the same time can become
necessary, e.g. splitting a dataset into several subsets, merging several files or
even comparing several datasets.
Working with several datasets requires, with one exception, that all criteria presented so far are met. The exception: such work can itself include checking steps. In this single permissible case, it is necessary that at least after this work all criteria presented so far have been checked and found to be okay.
The following sections will therefore cover selected facets of working with
multiple datasets:
Overview:
 Chapter 11.1 discusses basic checking rules for merging several datasets.
 Chapter 11.2 introduces approaches for checking continuously or segmented stored data; the conceptual approach is the main focus.
 Chapter 11.3 demonstrates the screening of separate datasets by means of a macro.
 Chapter 11.4 demonstrates the combining of several datasets (macro).
 Chapter 11.5 presents the comparison of two datasets for absolutely identical content.
 Chapter 11.6 covers standardizing values in separate datasets.
 Chapter 11.7 covers splitting a dataset into several subsets, either divided by categories or uniformly filtered.
 Chapter 11.8 introduces the meaning and limits of working with DATASET; since version 14, this command makes it possible to have several files open at the same time.
 Finally, Chapter 11.9 shows the advantages of the FILE HANDLE command.
11.1 Checking rules for joining
This chapter introduces checking rules for merging several datasets. These
rules are essentially not checking rules that SPSS can control, but rather
criteria that users must verify compliance with. The contents of several
datasets can be easily checked for internal formal correctness and plausibility
of content, especially when they are combined into a single analysis dataset.
To ensure correct joining, several check criteria are listed below to help rule
out the possibility that data errors were caused by incorrect joining.
1. Each step in the procedure must be carefully documented in a protocol or audit trail; before checking several datasets, it must be checked whether their documentation is already available. The documentation must also be carefully checked and questioned. Back-up copies of the datasets must be made.
2. All datasets contain the required data; the datasets have different
names.
3. All datasets have the same structure and format (e.g. *.sav).
4. All datasets contain the required key variable(s); depending on the
complexity of the data management, these can be much more than
just one or two variables.
5. All key variable(s) are uniform (variable name, label and format).
6. If all or several datasets contain several identical variables, name,
label and format must be identical, which is especially true for
grouping and date variables. For grouping variables, the coding and
value labels must be identical.
7. The data rows have values in all key variables.
8. The datasets, sorted by the keys, do not contain duplicate data rows
(cf. 5.2.1. and 5.2.2.).
9. In the case of string keys, e.g. product or person names, a uniform
and absolutely identical spelling must be ensured in all datasets.
Semantically identical, but syntactically inconsistently written keys (e.g. "MÜLLER" vs. "MUELLER", "CHRISTOF" vs. "CHRISTOPH") result in differently combined datasets.
10. The datasets are sorted uniformly (only upwards or only
downwards) before merging.
11. If necessary: The datasets are in the chronologically correct
order. It is helpful to include the storage date in the dataset name
"backwards" in the form YYYYMMDD (e.g. [Link]
etc.).
It is helpful to prepare the actual combining process (joining, merging, match-merging, etc.) by means of a sketch and to carefully compare the result
of the combining process (e.g. at least in the Variable View) with the plan,
especially with regard to the number of rows (cases) and columns (variables).
For very large datasets, this process should be tested on subsamples. It is
recommended that the datasets are merged one after the other and not all at
once. After merging, the date variables of the different datasets should be
compared; dates from chronologically earlier datasets should always be
smaller than dates from chronologically later datasets. Once the data have
been combined, they can easily be subjected to the checks for univariate to
multivariate data quality described above.
Depending on the structure and size of the data storage, merging data can
often have physical limits in terms of memory or processor capacity. When
merging data, users should therefore consider (a) whether this step is
necessary and (b) if so, whether they want to merge complete datasets or only
partial datasets. Especially with large amounts of data, (c) the timing of
combining the files should also be considered. It makes a difference whether
the data screenings are applied before or after the merge. Once the partial
datasets have been merged into a “big whole”, the screenings run slower
overall (because they always "drag along" the rest of the unscreened dataset)
than if the screenings are applied to the partial datasets (if possible).
So the recommendation is: If the size of the partial or even the targeted
complete file is essential, screenings should be performed on the partial files
before merging them. Only multivariate screenings resp. screenings that span
the partial files should be performed on the complete file.
11.2 Checking several datasets for completeness
In business or larger research projects, data is often created via distributed transaction data, POS triggers, or entry masks, collected as separate but structurally identical datasets via parallel threads or channels, and then merged for further processing. Entry masks (especially web-based ones) ensure uniform input, e.g. via the former SPSS products Data Entry Builder or Data Entry Station 4.0. If applied, plausibility checks can already avoid a large number of input errors during online entry.
The type of plausibility check for several datasets depends on the structure in
which the data was repeatedly saved, not the time increment, e.g. daily,
hourly or more frequent. From the numerous ways to structure and save data,
the segmented and the continuous storage method will be discussed next.
These two methods are so different that they are ideal for drawing attention to
different approaches to checking. Both methods have their merits; the
important point is that checking methods should always fit the data
philosophy. The technical difference between these two methods is that the
second approach has, due to its nature, overlaps in data rows (in a sense,
duplicates). Often continuously stored datasets are a temporary pre-stage of
segmented datasets.
With the segmented storage method, the dataset is saved in such a way that
it contains only new data. If the datasets saved in this way are combined
together, they are subsets of a complete dataset. The following table
schematically shows a daily saving (e.g. on April 2, 2006, April 3, 2006 etc.)
in segmented form. In order to facilitate the identification of the storage, the
date of storage was included in the dataset name. Note that the storage date
has been recorded "backwards" in the form YYMMDD (e.g.
[Link] etc.). This simplified structure is typical for transaction or production datasets, which may even be stored or buffered within fractions of a second.
1. Storage 2. Storage 3. Storage 4. Storage 5. Storage
DATA060402 DATA060403 DATA060404 DATA060405 DATA060406
ID VAR1 VAR2 ID VAR1 VAR2 ID VAR1 VAR2 ID VAR1 VAR2 ID VAR1 VAR2
01 03 66 04 13 54 07 03 43 10 02 98 14 94 45
02 07 54 05 34 65 08 07 73 11 06 98 14 94 45
03 14 54 06 94 64 09 14 22 12 34 98 16 51 55
etc.
Note: Please note that the last dataset (DATA060406) contains three errors.
The IDs 13 and 15 are missing (error type: gap) and the values of ID 14 are
contained multiple times (error type: double).
With the continuous storage method, the dataset is stored in such a way that
the dataset contains the new data together with the previously entered data.
Later versions of datasets always contain the data from previously saved
datasets. The last saved dataset corresponds to a complete dataset. This simplified structure is typical for backup datasets that capture data up to a preset time limit.
1. Storage 2. Storage 3. Storage 4. Storage 5. Storage
DATA060402 DATA060403 DATA060404 DATA060405 DATA060406
ID VAR1 VAR2 ID VAR1 VAR2 ID VAR1 VAR2 ID VAR1 VAR2 ID VAR1 VAR2
01 03 66 01 03 66 01 03 66 01 03 66 01 03 66
02 07 54 02 07 54 02 07 54 02 07 54 02 07 54
03 14 54 03 14 54 03 14 54 03 14 54 03 14 54
04 13 54 04 13 54 04 13 54 04 13 54
05 34 65 05 34 65 05 34 65 05 34 65
06 94 64 06 94 64 06 94 64 06 94 64
07 03 43 07 03 43 07 03 43
08 07 73 08 07 73 08 07 73
09 14 22 09 14 22 09 14 22
10 02 98 11 06 98
11 06 98 12 34 98
12 34 98 13 65 54
14 94 45
15 51 55
etc.
Note: Please note that the last dataset (DATA060406) contains one error. The ID 10 is missing (error type: gap).
So much for the theory. In practice, however, data may be missing, e.g. ID 10
is missing in DATA060406; this can happen, for example, if the data of ID
10 was accidentally overwritten when entering the data of Case 11. How can
it be ensured that no data rows have been lost, e.g. during continuous
storage?
A note in advance: The following two approaches assume that the ID has
been assigned correctly, is not a missing value itself and can therefore be
considered representative for a data row. It must also be emphasized that only
the completeness of data already entered is checked at this point. If already
entered data do not contain any possibility to identify data gaps, e.g. if the ID
is incomplete or missing completely, the only remaining option is to resort to
external checking criteria, e.g. other data sources or data documentation,
which may have logged the completeness of the data in parallel, e.g. in the
form of logging a response, the number of completely filled out
questionnaires, the completeness of laboratory data, etc.
11.2.1 Checking segmented stored data
There are many ways to check segmented stored data, ranging from basic options like counters (row numbers, record counts) to somewhat more sophisticated approaches like check sums (batch total, hash total; cf. 3.1, 3.2).
For example, you could calculate a hash total for each dataset and add them
up, and also a hash total for the complete dataset. If both values match, then
the whole dataset equals the sum of its parts. Mismatches indicate that there
may be differences somewhere.
The following approach focuses on the IDs of the datasets and, despite its simplicity, is a bit more detailed (it does not assume that the data has remained unsorted since the last storage). The IDs of the segmented saved datasets (and only (sic) the IDs) are combined into a joint dataset TESTDATA using ADD FILES (shown in abbreviated form; see also the macro under 11.3). Because cases are added (appended), it is not necessary to rename the IDs.
get file='D:\My personal data\[Link]'.
add files /file=*
/file='D:\My personal data\[Link]'.
exe.
...
save outfile='D:\My personal data\[Link]'
/compressed.
The duplicates of the IDs are first queried via a simple FREQUENCIES and are easily recognizable, especially in bar charts. The gaps can be spotted in the sorted sequence of the table; however, this procedure can be cumbersome and error-prone for large datasets with thousands of data rows.
FREQUENCIES
VARIABLES=ID
/BARCHART FREQ
/ORDER= ANALYSIS .

ID
                                  Valid    Cumulative
            Frequency   Percent   Percent    Percent
Valid   1           1       6,7       6,7        6,7
        2           1       6,7       6,7       13,3
        3           1       6,7       6,7       20,0
        4           1       6,7       6,7       26,7
        5           1       6,7       6,7       33,3
        6           1       6,7       6,7       40,0
        7           1       6,7       6,7       46,7
        8           1       6,7       6,7       53,3
        9           1       6,7       6,7       60,0
       10           1       6,7       6,7       66,7
       11           1       6,7       6,7       73,3
       12           1       6,7       6,7       80,0
       14           2      13,3      13,3       93,3
       16           1       6,7       6,7      100,0
    Total          15     100,0     100,0

In the case of extensive datasets, an additional summary count of the (possibly multiply occurring) IDs by means of AGGREGATE is a further verification option.
GET FILE='D:\My personal data\[Link]'.
compute IDX=ID.
exe.
AGGREGATE
/OUTFILE='D:\My personal data\[Link]'
/break=ID
/IDx = NU(IDx)
/N_BREAK=N.
GET FILE='D:\My personal data\[Link]'.
FREQUENCIES
VARIABLES=idx
/BARCHART FREQ
/ORDER= ANALYSIS .
The variable IDx contains the frequency of occurrence of each ID and is queried via FREQUENCIES. Logically, each ID may only occur once; values greater than 1 are therefore an indication that duplicates occur in the dataset.
IDx
                                  Valid    Cumulative
            Frequency   Percent   Percent    Percent
Valid   1          13      92,9      92,9       92,9
        2           1       7,1       7,1      100,0
    Total          14     100,0     100,0

The "2" indicates that an ID occurs twice; ID 14 does in fact occur twice in the dataset DATA0604.
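The duplicate-and-gap inspection supported here by FREQUENCIES and AGGREGATE can also be sketched outside SPSS, e.g. in Python (illustrative only; the ID list reproduces the example above):

```python
from collections import Counter

# IDs of the combined dataset from the example: 13 and 15 are missing
# (gaps), 14 occurs twice (duplicate).
ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 14, 16]

counts = Counter(ids)
duplicates = sorted(v for v, n in counts.items() if n > 1)
gaps = sorted(set(range(min(ids), max(ids) + 1)) - set(ids))

print(duplicates)  # [14]
print(gaps)        # [13, 15]
```

As in the SPSS approach, this detects gaps only inside the observed ID range; rows missing beyond the largest ID still require an external criterion.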
This approach is suitable for identifying duplicate data rows in complete data and, to a limited extent, also for identifying data gaps in merged segmented data or since the last continuous storage. A check for data gaps can, strictly speaking, only be performed with an approach that takes an external criterion into account. The following approach can be used to check structurally identical datasets for absolutely identical contents.

11.2.2 Checking continuously stored data


Since this storage method deliberately saves redundantly (i.e. it aims for overlaps), options such as counters (e.g. record counts) or checksums (e.g. batch totals, hash totals; cf. 3.1, 3.2) may not be straightforward, or may even be of limited use, depending on the data or error pattern. You could compute counters or checksums, but you would need to account for the overlaps between the individual datasets.
Usually a simple strictly monotonically increasing sequence of the computed sums is not enough: You may not really be satisfied with the information that the more recent a data storage is, the more data rows it contains, e.g. the first dataset has 3 rows, the second 6 rows, the third 9, and so on.
In this case, you may consider splitting the continuously stored datasets into segments, i.e. the differences between their overlaps. The first dataset, e.g., is the first segment (IDs 1 to 3), the second minus the first dataset is the second segment (IDs 4 to 6), and so on. These actual segments should add up to the whole dataset as the sum of its parts. A related way to check continuously stored data is to use SQL to identify the set differences or intersections between several datasets using EXCEPT or INTERSECT (e.g. Schendera, 2011, 2012). Unfortunately, SPSS' SQL statement, e.g. in GET CAPTURE or GET DATA, does not offer the EXCEPT or INTERSECT options.
The following audit approach assumes that the data were not changed in any way after saving (e.g. by sorting), and thus are still in the original storage condition (possibly also recognizable by the date of the last storage).
In the first step, the IDs of the datasets are renamed via RENAME VARIABLES (not displayed) to indicate to which dataset they belong (e.g. the ID from DATA060402 becomes ID060402). In the next step, the individual IDs (and only the IDs) are merged via MATCH FILES (shown abbreviated) into a joint dataset TESTDATA.
get file='D:\My personal data\[Link]'.
match files /file=*
/file='D:\My personal data\[Link]'.
exe.
...
save outfile='D:\My personal data\[Link]'
/compressed.
This is where the checks start. The check variable CHECK1 uses the COMPUTE function NVALID to determine the number of valid cases across a variable series, e.g. from ID060402 to ID060406. From the frequencies of these values, it can be read off how many new cases each consecutive storage contributed. The check variable CHECK2 returns the difference between the minimum and the maximum of this variable list. Since a continuous storage without gaps always contains the previous data in the same sequence, the difference must be zero. If the difference is not zero, there is either a data gap or a segmented stored dataset.
get file='D:\My personal data\[Link]'.
compute
CHECK1=nvalid(ID060402,ID060403,ID060404,ID060405,ID060406).
exe.
compute MINI=min(ID060402,ID060403,ID060404,ID060405,ID060406).
exe.
compute MAXI=max(ID060402,ID060403,ID060404,ID060405,ID060406).
exe.
compute CHECK2=MAXI-MINI.
exe.
list.
Output:
ID060402 ID060403 ID060404 ID060405 ID060406 CHECK1 MINI MAXI CHECK2
1,00 1,00 1,00 1,00 1,00 5,00 1,00 1,00 ,00
2,00 2,00 2,00 2,00 2,00 5,00 2,00 2,00 ,00
3,00 3,00 3,00 3,00 3,00 5,00 3,00 3,00 ,00
. 4,00 4,00 4,00 4,00 4,00 4,00 4,00 ,00
. 5,00 5,00 5,00 5,00 4,00 5,00 5,00 ,00
. 6,00 6,00 6,00 6,00 4,00 6,00 6,00 ,00
. . 7,00 7,00 7,00 3,00 7,00 7,00 ,00
. . 8,00 8,00 8,00 3,00 8,00 8,00 ,00
. . 9,00 9,00 9,00 3,00 9,00 9,00 ,00
. . . 10,00 11,00 2,00 10,00 11,00 1,00
. . . 11,00 12,00 2,00 11,00 12,00 1,00
. . . 12,00 13,00 2,00 12,00 13,00 1,00
. . . . 14,00 1,00 14,00 14,00 ,00
. . . . 15,00 1,00 15,00 15,00 ,00

Number of cases read: 14 Number of cases listed: 14

The check variable CHECK1 displays the row-wise valid cases; the missing values here are expected and not in themselves conspicuous. CHECK2, however, indicates unambiguously via a nonzero check value that the storages, i.e. the datasets DATA060405 or DATA060406, contain data gaps, recognizable from the first value unequal to zero.
This approach is suitable for identifying duplicate data rows in complete data. It is less suitable for identifying data gaps. For example, this program does not allow checking whether data rows are missing among the new cases of the last storage, e.g. in the dataset DATA060406 from ID 13 onwards. To identify systematically missing data rows, an approach must be chosen that takes an external criterion into account.
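The CHECK1/CHECK2 logic itself is language-independent and can be sketched in Python for illustration (the two rows are invented examples of a consistent and an inconsistent case):

```python
# CHECK1 = number of non-missing IDs in a row; CHECK2 = max - min of the
# non-missing IDs. With gap-free continuous storage CHECK2 must be zero.
rows = [
    (1, 1, 1),       # the same ID in three consecutive storages: consistent
    (None, 10, 11),  # the storages disagree: a gap (ID overwritten or lost)
]

results = []
for row in rows:
    valid = [v for v in row if v is not None]
    results.append((len(valid), max(valid) - min(valid)))

print(results)  # [(3, 0), (2, 1)]
```

A nonzero second component flags the row in which the storages diverge, just like CHECK2 in the SPSS output above.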

11.3 Screening of separate consecutively named datasets (macro)
The following macro has been developed for screening single consecutively named datasets, e.g. if the datasets were named according to the scheme YYMMDD as in 11.2.2 (e.g. [Link]). The advantage of such a "backward" date is that the datasets can be sorted, e.g. by storage day. With a designation of the form DDMMYY, sorting at the level of day values, and thus also the application of the following macro, would not be possible.
define !SCREENING
(MY_PATH=!tokens(1)
/MY_DATA=!tokens(1)
/MY_FROM=!tokens(1)
/MY_TO=!tokens(1))
!do !MY_NUMBER=!MY_FROM !to !MY_TO.
get file=!quote(!concat(!unquote(!MY_PATH),"\",
!MY_DATA,!MY_NUMBER,".SAV")).
title Lists of the dataset !concat(!MY_DATA,!MY_NUMBER).
list var = ID VAR1 VAR2.
!doend.
!enddefine.
!SCREENING MY_PATH="C:\" MY_DATA=DATA06040 MY_FROM=2
MY_TO=6.
MY_PATH is used to pass the joint storage location of all files to the macro, MY_DATA passes the joint prefix (in this case "DATA06040"), and MY_NUMBER is a placeholder for the incrementing suffix, which is ultimately defined by MY_FROM and MY_TO.
Thus, the SCREENING macro successively accesses the datasets DATA060402 through DATA060406 in the storage location "C:\" and lists the contents of the variables ID, VAR1 and VAR2 after an automatically assigned heading (cf. TITLE). The actual analysis (in the example only a simple LIST VAR = ID VAR1 VAR2) can of course be made as complex as desired, including the addition of further macro variables. The goal of this example is first and foremost to demonstrate how a macro with only a few lines of code can easily and efficiently screen the contents of hundreds of datasets, provided they fulfill a few requirements: same storage location, same file names (except for the suffix), same variable names and same variable formats.
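The looping idea behind the SCREENING macro can be illustrated outside SPSS, e.g. in Python (a sketch only; the path and prefix follow the example, and the actual per-file processing is merely stubbed):

```python
# Build the names DATA060402.SAV ... DATA060406.SAV from a joint path,
# a joint prefix, and an incrementing suffix, as the SCREENING macro does.
def screen(path, prefix, start, stop):
    names = []
    for n in range(start, stop + 1):
        name = f"{path}\\{prefix}{n}.SAV"
        names.append(name)  # here one would open the file and list ID, VAR1, VAR2
    return names

print(screen("C:", "DATA06040", 2, 6))
```

The same prerequisites apply as in the macro: the files must share the storage location, the name scheme, and the variable layout.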

11.4 Combining of consecutively named datasets (macro)
The following macro was developed for merging consecutively named
datasets, e.g. if the datasets are named [Link],
[Link], [Link] etc.
define !ADDING
(MY_PATH=!tokens(1)
/MY_DATA=!TOKENS(1)
/MY_FROM=!TOKENS(1)
/MY_TO=!TOKENS(1))
!do !MY_NUMBER= !MY_FROM !to !MY_TO.
!if (!MY_NUMBER= !MY_FROM) !then
get file=
!quote(!concat(!unquote(!MY_PATH),"\",!MY_DATA,!MY_NUMBER,".SAV")).
compute !concat(!MY_DATA,!MY_NUMBER)= 1.
!else
add files
/file=*
/file=!quote(!concat(!unquote(!MY_PATH),
"\",!MY_DATA,!MY_NUMBER,".SAV"))
/in=!concat(!MY_DATA,!MY_NUMBER).
!ifend
!doend
save outfile=!quote(!concat(!unquote(!MY_PATH),
"\","ADDED","_",!MY_FROM,"_",!MY_TO,".SAV"))
/drop=!concat(!MY_DATA,!MY_FROM) to
!concat(!MY_DATA,!MY_TO).
get file=!quote(!concat(!unquote(!MY_PATH),
"\","ADDED","_",!MY_FROM,"_",!MY_TO,".SAV")).
list var=ID VAR1 VAR2.
!enddefine.
!ADDING MY_PATH="C:\" MY_DATA=DATA06040 MY_FROM=2
MY_TO=6.
The ADDING macro accesses the datasets DATA060402 to DATA060406 in the storage location "C:\" one after the other and merges them successively using ADD FILES. The values of ID, VAR1 and VAR2 are appended to each other. In addition, the final SAVE OUTFILE and GET FILE commands assign a dynamic dataset name "ADDED_!MY_FROM_!MY_TO.SAV". The name of the dataset is automatically adapted to the range of the merged data by the macro variables !MY_FROM and !MY_TO. If the datasets DATA060402 to DATA060406 are merged, the resulting dataset is called "ADDED_2_6.SAV"; if the datasets DATA060402 to DATA060410 are merged, the resulting dataset is called "ADDED_2_10.SAV", etc.
Here, too, there are few but central prerequisites for the datasets to be merged: same storage location, same file names (except for the suffix), same variable names and same variable formats.
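The ADDING logic, appending consecutively numbered datasets and deriving a dynamic output name from the merged range, can be sketched in Python for illustration (the dataset contents are invented stand-ins):

```python
# Stand-ins for the consecutively named datasets DATA060402 ... DATA060404.
datasets = {
    "DATA060402": [1, 2, 3],
    "DATA060403": [4, 5, 6],
    "DATA060404": [7, 8, 9],
}

def adding(prefix, start, stop):
    # Append the segments in suffix order and derive the dynamic name
    # ADDED_<from>_<to>.SAV, as the ADDING macro does via !MY_FROM/!MY_TO.
    combined = []
    for n in range(start, stop + 1):
        combined.extend(datasets[f"{prefix}{n}"])
    return f"ADDED_{start}_{stop}.SAV", combined

name, rows = adding("DATA06040", 2, 4)
print(name)  # ADDED_2_4.SAV
print(rows)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```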
Section 11.7. presents two approaches that allow a dataset to be split into several single files.

11.5 Comparing structurally identical datasets on absolutely identical content
With SPSS, two structurally identical datasets can be compared to see whether they contain absolutely identical values. "Structurally identical" means the same number and sequence of columns and rows. If necessary, the rows or columns of the datasets should first be uniformly ordered (sorted). The following program is able to check numeric and alphanumeric values for an absolute match. If strings are written with inconsistent upper ("DD") or lower ("Dd") case, this also leads to an error flag (ERROR). In the following example, the two datasets DATA1 and DATA2 are checked to see whether their values match absolutely.
data list
/ID 1 VAR1 3-4 VAR2 6-7 VAR3 9-10 (A) VAR4 12-13 .
begin data
1 6 12 AA 99
2 13 95 BB 29
3 6 12 CC 96
4 14 27 DD 29
end data.
save outfile='[Link]'.
exe.
data list
/ID 1 VAR1 3-4 VAR2 6-7 VAR3 9-10 (A) VAR4 12-13 .
begin data
1 6 12 AA 99
2 13 96 BC 29
3 6 12 CC 96
4 14 27 Dd 28
end data.
save outfile='[Link]'.
exe.

get file='[Link]'.
match files file=*
/file='[Link]'
/rename (VAR1 to VAR4 = VAR5 to VAR8)
/by ID.
exe.
do repeat
X=VAR1 to VAR4
/Y=VAR5 to VAR8
/ERROR=ERROR1 to ERROR4.
if (X ne Y) ERROR=1.
end repeat.
exe.
compute FILTER=sum(ERROR1 to ERROR4).
exe.
select if (FILTER>=1).
save outfile = '[Link]'
/keep ID ERROR1 to ERROR4.
exe.
get file '[Link]'.
exe.
list.
The output is limited to the detected differences (ERRORn) in rows and
columns between DATA1 and DATA2.
ID ERROR1 ERROR2 ERROR3 ERROR4
2 . 1,00 1,00 .
4 . . 1,00 1,00
Number of cases read: 2 Number of cases listed: 2
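The cell-by-cell comparison can be expressed outside SPSS as well; the following Python sketch reproduces the example data and reports the same mismatches as the ERRORn variables, including the case-sensitive string check:

```python
# Compare two structurally identical tables cell by cell and report
# (row ID, column number) for every mismatch, as the ERRORn flags do.
data1 = {1: (6, 12, "AA", 99), 2: (13, 95, "BB", 29),
         3: (6, 12, "CC", 96), 4: (14, 27, "DD", 29)}
data2 = {1: (6, 12, "AA", 99), 2: (13, 96, "BC", 29),
         3: (6, 12, "CC", 96), 4: (14, 27, "Dd", 28)}

errors = [(id_, col + 1)
          for id_ in data1
          for col, (a, b) in enumerate(zip(data1[id_], data2[id_]))
          if a != b]  # "DD" != "Dd": case differences count as mismatches

print(errors)  # [(2, 2), (2, 3), (4, 3), (4, 4)]
```

As in the SPSS output, only rows 2 and 4 are flagged, in columns 2/3 and 3/4 respectively.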

11.6 Macro for standardizing values in separate datasets
The following program allows the code for missings to be standardized across separate datasets. The prerequisite is that the datasets contain the same variables with the same names and formats, and in the same sequence (cf. [Link] and [Link]).
data list
/VAR1 1-2 VAR2 4-5 VAR3 7-8 VAR4 10-11 .
begin data
6 12 66 98
14 97 27 29
end data.
save outfile='[Link]'
/keep VAR1 VAR2 VAR3 VAR4.
save outfile='[Link]'
/keep VAR1 VAR2 VAR3 VAR4.
define FILEMAC (!pos !charend('/')).
!do !i !in (!1).
get file=!quote(!i).
missing values VAR1 VAR2 VAR3 VAR4 (99).
do repeat VARLIST=VAR1 to VAR4.
recode VARLIST (97=99) (98=99).
end repeat.
exe.
list.
!doend.
!enddefine.
filemac [Link] [Link] /.
The first output belongs to [Link], the second to [Link].
VAR1 VAR2 VAR3 VAR4
6 12 66 99
14 99 27 29
Number of cases read: 2 Number of cases listed: 2
VAR1 VAR2 VAR3 VAR4
6 12 66 99
14 99 27 29
Number of cases read: 2 Number of cases listed: 2
For many or extensive datasets, the LIST statement should possibly be
removed or restricted.
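The recode performed per file by the macro, mapping the legacy missing codes 97 and 98 to the unified code 99, can be sketched in Python for illustration (the example rows mirror the data above):

```python
# Recode the legacy missing codes 97 and 98 to the unified code 99
# in every row of a dataset, as FILEMAC does per file via RECODE.
def unify_missings(rows, old=(97, 98), new=99):
    return [[new if v in old else v for v in row] for row in rows]

data = [[6, 12, 66, 98],
        [14, 97, 27, 29]]

print(unify_missings(data))  # [[6, 12, 66, 99], [14, 99, 27, 29]]
```

Applied to each dataset in turn, this yields the identical standardized output shown twice above.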

11.7 Splitting a dataset


This section presents two approaches which allow to split a dataset into
several single files. Both approaches are based on AGGREGATE; the first
approach is based on a macro for numerical variables, the second on a macro
for strings.

11.7.1 Splitting a dataset by categories (e.g. IDs) (macro)
The following approach allows a dataset to be split by means of a numeric split variable and stored, filtered accordingly, in separate datasets. The only requirement for such a dataset is that it contains a so-called split variable. A split variable is nothing more than a variable with at least two nonmissing levels, which thus allows the dataset to be divided. The term split variable derives from the function of such a variable, namely to divide a dataset. In Section 11.4., the dataset "ADDED_2_6.SAV" was created. Let us assume that this dataset additionally contained a discretely scaled variable PAT_CODE (see below). The following approach divides the dataset according to the levels of PAT_CODE and stores the contents, filtered accordingly, in separate datasets.
data list list
/ID PAT_CODE VAR1 VAR2.
begin data
1 2 5 3
2 2 2 3
3 9 3
4 5 4 8
5 1 2 3
end data.
save outfile='C:\ADDED_2_6.SAV'.
get file='C:\ADDED_2_6.SAV'.
define !MY_SPLITTER
(DSNAME=!tokens(1)
/SPLITTER=!tokens(1))
temp.
select if (!DSNAME=!SPLITTER).
save outfile= !quote(!concat('C:\PAT_CODE',!SPLITTER,'.sav')).
!enddefine.
sort cases by PAT_CODE.
aggregate outfile=*
/presorted
/break=PAT_CODE
/notused = N.
format PAT_CODE(F4.0).
write outfile='C:\Split_Macro.sps'
/'!MY_SPLITTER DSNAME=PAT_CODE SPLITTER=' PAT_CODE '.'.
exe.
get file ='C:\ADDED_2_6.SAV'.
insert file='C:\Split_Macro.sps'.
The approach splits the dataset according to the levels of PAT_CODE and stores the content, filtered accordingly, in separate datasets, e.g. [Link], [Link], etc. The suffix of the respective dataset name indicates which data for which PAT_CODE were stored in it. The calls of the macro !MY_SPLITTER are written into the syntax file Split_Macro.sps and executed via INSERT FILE.
!MY_SPLITTER DSNAME=PAT_CODE SPLITTER= .
!MY_SPLITTER DSNAME=PAT_CODE SPLITTER= 1.
!MY_SPLITTER DSNAME=PAT_CODE SPLITTER= 2.
!MY_SPLITTER DSNAME=PAT_CODE SPLITTER= 5.
If a PAT_CODE value does not occur, no corresponding dataset is created; e.g., because PAT_CODE 3 does not occur, the dataset PAT_CODE3.SAV is not created. For the missing value, however, a dataset PAT_CODE.SAV is created, which contains all data rows of the source dataset "ADDED_2_6.SAV". Missings in the split variable thus lead to incorrectly created subsets. If a split value occurs more than once, all data rows belonging together are stored in one file; the dataset PAT_CODE2.SAV therefore contains two data rows.
Dataset name      Contents (ID PAT_CODE VAR1 VAR2)
PAT_CODE.sav      1,0   2,0   5,0   3,0
                  2,0   2,0   2,0   3,0
                  3,0    .    9,0   3,0
                  4,0   5,0   4,0   8,0
                  5,0   1,0   2,0   3,0
PAT_CODE1.sav     5,0   1,0   2,0   3,0
PAT_CODE2.sav     1,0   2,0   5,0   3,0
                  2,0   2,0   2,0   3,0
PAT_CODE5.sav     4,0   5,0   4,0   8,0
This approach assumes that the split variable is numeric, has at least two nonmissing levels, and contains no missings. If there are missings, these must first be converted into a numeric code, e.g. 99999, before this program can be used. If the split variable is a string, either the string can be converted into a numeric variable, or the macro can be adapted to the particularities of the split string.
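The splitting logic can be illustrated outside SPSS, e.g. in Python; this sketch also shows how rows with a missing split value can be collected separately instead of producing an incorrect subset (the rows reproduce the example data):

```python
from collections import defaultdict

# Split rows by the split variable PAT_CODE (column index 1); rows with a
# missing split value go into a separate bucket instead of a wrong subset.
rows = [(1, 2, 5, 3), (2, 2, 2, 3), (3, None, 9, 3),
        (4, 5, 4, 8), (5, 1, 2, 3)]

subsets, missing = defaultdict(list), []
for row in rows:
    (missing if row[1] is None else subsets[row[1]]).append(row)

print(sorted(subsets))  # [1, 2, 5] -> PAT_CODE1.sav, PAT_CODE2.sav, PAT_CODE5.sav
print(len(subsets[2]))  # 2 (two rows share PAT_CODE 2)
print(len(missing))     # 1 (the row with the missing split value)
```

Each subset would then be saved under a name derived from its split value, as the macro does.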

11.7.2 Splitting a dataset into uniformly filtered subsets
The beginning of the following program should seem familiar; it was already presented as an approach to unifying text strings by acronyms. This version, however, does two more things: The data are stored in separate datasets for each acronym, and the datasets are additionally named after the acronyms themselves. The dataset "[Link]" thus contains only the values "IBM".
data list
/ CORPORATE 1-50 (a).
begin data
IBM
Industrial Business Machines
IBM Ltd.
Industrial Business Machines International
MB
Daimler-Benz
DaimlerChrysler
Mercedes-Benz
Daimler-Chrysler
end data.
save outfile '[Link]'.
GET file='[Link]'.
define MY_FINDER (!pos !charend('/') / !pos !tokens(1)).
!do !i !in (!1).
string ACRONYM (A20).
if (index(upcase(CORPORATE), (!quote(!upcase(!i)))) ne 0)
ACRONYM = (!quote(!2)).
exe.
!doend.
!enddefine.
MY_FINDER IBM Industrial Business Machines / IBM.
MY_FINDER MB Daimler Benz Chrysler Mercedes / DC.
save outfile '[Link]'.
get file='[Link]'.
aggregate
/outfile='[Link]'
/break=ACRONYM
/N_break=N.
get file='[Link]'.
string NAME (a10).
compute NAME=CONCAT(ltrim(rtrim(ACRONYM)),".SAV").
write outfile='[Link]' /'get file="[Link]".'
/'select if ACRONYM="' ACRONYM'".'
/'save outfile=' NAME'.'.
exe.
include '[Link]'.
The log contains information about which datasets were created:

835 save outfile=[Link] .

839 save outfile=[Link] .

11.8 Working with several files (SPSS command DATASET)
Since version 14, SPSS offers the possibility of having several data sources open at the same time. SPSS thus responded to the request of many users to be able to open and view multiple datasets simultaneously. Up to and including version 13, only one working file could ever be open in SPSS; if a second, new working file was opened, the already open file was closed.

11.8.1 Meaning and limits of the DATASET approach
DATASET, especially in combination with the options ACTIVATE, NAME and COPY, enables the convenient import and merging of several (third-party) files, as well as simpler interfaces to the SPSS commands AGGREGATE, MATCH/ADD FILES, or OMS.
DATASET differs in the data basis from most of the other SPSS applications
and commands. DATASET is based on the processing of all possible data
formats as temporary SPSS files. A data file is therefore not a permanent
SPSS file (*.sav) but a temporary file (*.tmp). Whenever we talk about files
in the following (especially in connection with the DATASET options
ACTIVATE, COPY, DECLARE, etc.), we always mean temporary and not
permanent SPSS files. Files processed by DATASET must be saved as
permanent files; otherwise all transformations are lost at the end of the
session (unless they are logged in syntax).
The DATASET command allows, especially compared to earlier SPSS versions:
- Simultaneous display of multiple data sources (via DATASET NAME). Each new data source can be displayed in a separate window. Each previously opened data source remains open and thus available for further use.
- Control of the functionally correct processing of transformations, e.g. in interaction with AGGREGATE, the joining of files (MATCH/ADD FILES), or OMS programming (OMS).
- Easier joining of files from different source formats (e.g. EXCEL files, databases, text files) via MATCH/ADD FILES or UPDATE, because each data source no longer has to be saved in SPSS format first.
- Copying and pasting of variables, cases and/or variable properties between two or more open data sources (using DATASET DECLARE).
- Creating subsets of cases and/or variables for analysis (e.g. using a combination of DATASET ACTIVATE and SELECT IF).
- Switching between datasets or sources (e.g. via DATASET ACTIVATE).
Experienced SPSS syntax users might not easily recognize what is special about DATASET, i.e. what the already known SPSS syntax was not able to do. The DATASET options refer to the processing of all possible data formats as temporary SPSS files. In addition to the speed advantage, there is also easier programmability, and there are simplified interfaces to other SPSS commands (e.g. AGGREGATE, 6a/b under 11.8.2.). DATASET allows programming without long paths or cumbersome permanent (but safer) buffering.
In order to evaluate the DATASET possibilities realistically, possible limitations should be pointed out:
- The simultaneous display of several data sources, quite practical depending on the application, is relativized by three factors: the size of the screen, human attention, and the available working memory of the computer. Depending on the application, it may not always make sense to have the whole screen full of files. Especially if the screen is small and the opened files are wide (i.e. have many variables), the simultaneous display of many files quickly reaches the limits of what is manageable. It is only moderately helpful that the activated file is marked with a green cross in the lower menu bar.
- SPSS cannot currently be set to display only one file. A switch-off mechanism is planned for SPSS version 16; it allows only one file at a time to be opened or viewed (often an implicit requirement of many user-defined SPSS programs).
- DATASET knows no upper limit in terms of the possible number of open windows. However, SPSS points out that the number of open windows may well be limited by the available resources (SPSS Command Syntax Reference 15.0, 2006, 487).
- Even if several files can be opened and viewed at the same time, this does not mean that they all have the status of an active working file. Only one file at a time can be defined as the active working file. Opening several files is therefore not the same as defining several working files. Only the variables in the working file are available to SPSS commands for transformation or analysis; variables that are not in the active working file cannot be accessed.
- In order to access a certain file, it must be explicitly referenced. In mouse control, this is done with a single click; in syntax control, this referencing is done by explicitly assigning a name already when opening the desired data sources, e.g. via GET FILE or GET DATA (via DATASET NAME). SPSS users who work with syntax are familiar with explicit referencing of files as a necessary prerequisite for secure, because controlled, data management.
- If you work with many open single files, the individual referencing (e.g. via DATASET ACTIVATE) may become a bit tedious.
- Some of the DATASET options conflict with other possibilities of the SPSS syntax. For example, DATASET ACTIVATE cannot be used together with DO IF, DO REPEAT or LOOP.
- Experience also shows that some DATASET options do not yet work quite as described in the technical documentation. Activating a file using DATASET ACTIVATE triggers pending transformations on the previously active file; the "Open Transformations" note in the lower menu bar, though, falsely suggests that they have not yet been executed.
- DATASET may get into a reference conflict with FILE HANDLE (see 11.9.). If a dataset and a "pointer" in a program have the same name, DATASET prefers the dataset over the "pointer".

The following section introduces and explains various possible applications of the DATASET options in the form of a summarizing SPSS program.

11.8.2 Examples of common applications


The following pages present the numerous possible applications of the
DATASET options in the form of a single SPSS program. The following
overview compiles the applications explained.
Overview:
(1) Assigning file names
(2a) Displaying the currently available files
(3a) Application I: Copying variables, values and variable properties
(4) Activate a file
(3b) Application I: Copying variables, values and variable properties
(5) Application II: Working with file names (e.g. merging files)
(6a) Application III: Creating an empty file (e.g. aggregating files)
(6b) Application IV: Creating an empty file (e.g. OMS programming)
(6c) Application V: Joining of third-party files (e.g. EXCEL)
(2b) Displaying the currently available files
(6d) Working with file names (e.g. regarding LIST)
(7a-c) Closing files
The numbers assigned in the comments correspond to the above overview
and should help to facilitate the comparison between the syntax itself and the
subsequent explanations.
It is important to know that, with the exception of one example (6c), a
permanent file is only accessed once, before (1). From this moment on, all
further applications are based on the temporary files created from it. The
example (6c) demonstrates the uncomplicated joining of several third-party
files and will read in two EXCEL files for this purpose.
Syntax:
GET
FILE='C:\Programs\SPSS\Employee [Link]'.
*** (1) Assigning file names ***.
dataset name EMPLOYED.
*** (2a) Displaying the currently available files ***.
dataset display.
*** (3a) Copying variables, values and variable properties ***.
dataset copy OFFICE.
dataset activate OFFICE.
select if JOBCAT=1.
compute INDEX_B = (SALARY/SALBEGIN).
exe.
format INDEX_B (F8.2).
*** (4) Activate a file ***.
dataset activate EMPLOYED.
*** (3b) Copying variables, values and variable properties ***.
dataset copy MANAGMT.
dataset activate MANAGMT.
select if JOBCAT=3.
compute INDEX_M = (SALARY/SALBEGIN).
exe.
format INDEX_M (F8.2).
*** (5) Working with file names (e.g. merging files) ***.
dataset activate OFFICE .
add files
/file=*
/file='MANAGMT'.
exe.
dataset name OFFC_MAN.
dataset activate OFFC_MAN.
*** (6a) Creating an empty file (DECLARE with AGGREGATE) ***.
dataset activate EMPLOYED .
dataset declare DATAGG.
aggregate
/outfile='DATAGG'
/break=JOBCAT
/SALBEGIN2= mean(SALBEGIN).
*** (6b) Creating an empty file (DECLARE with OMS) ***.
dataset activate EMPLOYED.
dataset declare R_TABLES.
oms
/select tables
/if commands = ['Regression']
subtypes = ['Coefficients']
/destination format = SAV
outfile = 'R_TABLES'.
regression
/missing listwise
/statistics COEFF OUTS R ANOVA
/criteria=PIN(.05) POUT(.10)
/noorigin
/dependent SALARY
/method=enter SALBEGIN .
omsend.
dataset activate R_TABLES.
list.
*** (6c) Joining of EXCEL files (DECLARE, NAME and ACTIVATE)
***.
*** Assumption: There are two EXCEL files [Link] and [Link].
***.
GET DATA /TYPE=XLS
/FILE='C:\[Link]'.
DATASET NAME XLdata1.
GET DATA /TYPE=XLS
/FILE='C:\[Link]'.
DATASET NAME XLdata2.
DATASET ACTIVATE XLdata1.
ADD FILES
/FILE=*
/FILE='XLdata2'.
EXE.
dataset name XLSDATA .
DATASET ACTIVATE XLSDATA.
*** (2b) Displaying the currently available files ***.
dataset display.
*** (6d) Working with file names (e.g. regarding LIST) ***.
dataset activate DATAGG.
list.
*** (7a) Closing files (list-wise CLOSE does not work) ***.
dataset close OFFICE OFFC_MAN MANAGMT DATAGG.
*** (7b) Targeted closing of single files (works) ***.
dataset close OFFICE .
dataset close OFFC_MAN .
dataset close MANAGMT .
dataset close DATAGG .
*** (7c) Closing all files except the active file ***.
dataset close ALL .
Explanations:

*** (1) Assigning file names ***.


The name "EMPLOYED" is assigned to the temporary version of the SPSS
sample file "Employee [Link]". The file EMPLOYED is automatically the
active file, because it was last opened by a GET command (in the example:
GET FILE).

*** (2a) Displaying the currently available files ***.


DATASET DISPLAY is used to display a list of the currently available
datasets. An output can look like this:
Dataset Display
Datasets     Creation Timestamps      Associated Files
EMPLOYEDa    10-AUG-2018 [Link].00    C:\Programs\SPSS\Employee [Link]
a. Active dataset
Note: Earlier SPSS versions may generate different output, e.g. without timestamps and associated files.

*** (3a) Application I: Copying variables, values and variable properties ***.
DATASET COPY creates a new file, OFFICE, as a copy of the active dataset EMPLOYED. OFFICE corresponds to the status of EMPLOYED at the time of the copy operation. OFFICE is activated and filtered as desired using SELECT IF. Note that the copy operation also triggers pending transformations, such as a COMPUTE without a following EXECUTE.

*** (4) Activate a file ***.


Now another copy is to be created from EMPLOYED. However, since OFFICE is still the active dataset due to the DATASET ACTIVATE under (3a), EMPLOYED must first be made the active dataset again by explicitly submitting the DATASET ACTIVATE command.

*** (3b) Application I: Copying of variables, values and variable properties ***.
DATASET COPY creates another new file, MANAGMT, as a copy of the
now again active dataset EMPLOYED (cf. further notes under 3a).

*** (5) Application II: Working with file names (e.g. merging files) ***.
In (5), the two files OFFICE and MANAGMT are merged by means of ADD
FILES and stored in the temporary file OFFC_MAN (the name was assigned
by DATASET NAME). For viewing, OFFC_MAN is defined as the active
dataset.

*** (6a) Application III: Creating an empty file (e.g. aggregating files) ***.
Under (6a), certain data in the EMPLOYED dataset are to be aggregated. To make EMPLOYED the active dataset again, the DATASET ACTIVATE command is submitted first. DATASET DECLARE is then used to define the dataset (DATAGG) in which the data from AGGREGATE are to be stored via OUTFILE=. Then the desired aggregation is performed. The example is continued under (6d).

*** (6b) Application IV: Creating an empty file (e.g. OMS programming) ***.
EMPLOYED is explicitly made the active dataset again via DATASET
ACTIVATE. DATASET DECLARE then defines the file in which the results
of the REGRESSION procedure are to be stored (R_TABLES). OMS is used
to instruct the subsequently executed REGRESSION procedure to store the
calculated coefficients into the predefined R_TABLES file using OUTFILE=.
DATASET ACTIVATE makes R_TABLES the active dataset and can be
conveniently checked via LIST. An OMS output obtained in this way can
look like this, for example:
Command_ Subtype_ Label_ Var1 Var2 B [Link] Beta t
Sig
Regression Coefficients Coefficients 1
(Constant) 1928,206 888,680 . 2,170 ,031
Regression Coefficients Coefficients 1 Beginning Salary 1,909 ,047 ,880
40,276 ,000
Number of cases read: 2 Number of cases listed: 2
Note: The regression is only performed for demonstration of OMS
programming in this way and does not correspond to a professional
calculation of a simple linear regression.

*** (6c) Application V: Joining of third-party files (e.g. EXCEL) ***.


The two EXCEL files [Link] and [Link] are to be merged in SPSS. After reading the first file with GET DATA, it was assigned the name XLdata1 via DATASET NAME and is automatically the active dataset; XLdata1 is no longer in EXCEL format, but in the format of a temporary SPSS file. After reading in the second EXCEL file (GET DATA), it becomes the temporary SPSS file XLdata2 via DATASET NAME. Because of this last import process, XLdata2 is the currently active dataset. In order to append XLdata2 to XLdata1, XLdata1 is defined again as the active dataset via DATASET ACTIVATE. Using ADD FILES, both files are appended one to the other: first the file XLdata1 is extended by XLdata2, and the result is then renamed to XLSDATA. For review purposes, XLSDATA is then defined as the active dataset.

*** (2b) Displaying the currently available files ***.


Using DATASET DISPLAY, a list of the currently available files is
displayed. An output can now look like this:

Dataset Display
Datasets   Creation Timestamps     Associated Files
EMPLOYED   10-AUG-2014 [Link].00   C:\Programs\SPSS\Employee [Link]
XLSDATAa   10-AUG-2014 [Link].00
XLdata2    10-AUG-2014 [Link].00
a. Active dataset

Note: XLdata1 is missing because it was renamed to XLSDATA after taking
over the data from XLdata2. Earlier SPSS versions may generate different
output, e.g. without timestamps and associated files.

*** (6d) Working with file names (e.g. regarding LIST) ***.
Finally, the temporary file DATAGG (cf. 6a) is to be accessed again and
output in list form. After having made DATAGG the active dataset by
submitting the DATASET ACTIVATE command, the contents of the file can
be easily accessed, e.g. using LIST. An output of aggregated data may look
like this:
JOBCAT SALBEGIN2
1 14096,05
2 15077,78
3 30257,86
Number of cases read: 3 Number of cases listed: 3
*** (7a) Closing files (list-wise CLOSE does not work) ***.
*** (7b) Targeted closing of single files (works) ***.
*** (7c) Closing all files except the active file ***.
Notes (7a) to (7c) are self-explanatory.

11.8.3 An overview of DATASET syntax


The DATASET approach is therefore particularly suitable when only a few,
easily manageable files are open at the same time. This manageability,
however, stands and falls with human attention: different open files can only
be compared visually, not directly by technical means. If, for example, two
files are to be compared directly with each other (cf. Section 11.3), they first
have to be combined into a single working file.
The following is an overview of the most important options of the
DATASET command. For further information about DATASET, please refer
to the SPSS syntax documentation.
DATASET ACTIVATE
The DATASET ACTIVATE command makes the referenced file the active
working file. Transformations on the active file (before or after assigning a
dataset name) are retained during the session. The referenced file does not
necessarily have to be an SPSS file, but can also be an EXCEL file, for
example. If another working file was previously active and has a dataset
name (DATASET NAME), it remains available; if it has no name, it is no
longer available. The DATASET ACTIVATE command assumes that the
referenced file really exists; if it does not, an error message is displayed.
DATASET ACTIVATE triggers pending transformations. The activation of a
data source possibly triggers pending transformations on the already active
working file. If no transformations are pending, activating another data
source does not trigger any transformations either.
The active file is always the file last opened by a GET command (e.g. FILE,
DATA, SAS, STATA or TRANSLATE) or the file activated by DATASET
ACTIVATE. Opening a new file automatically defines a new active file.
DATASET CLOSE
The DATASET CLOSE command closes the referenced file. The command
is executed immediately. If the referenced file is not the active working file,
the file is closed and is no longer available. If, however, the active working
file is referenced via the concrete name specification or via a so-called
wildcard (e.g., *, ALL), only the referencing itself is cancelled. The file is
still actively available, but no longer has its own name. DATASET CLOSE
does not trigger any pending transformations.
DATASET COPY
The DATASET COPY command can be used to create one or more subsets
of already opened files. After DATASET COPY only the name of the file to
be created needs to be specified. The name of the file must conform to the
conventions for SPSS datasets, e.g., up to 64 bytes long (32 characters in
double-byte language), in one word, beginning with a letter, etc. The
DATASET COPY command then reads the contents of the currently active
file, creates a copy of the read data, and triggers the execution of any pending
commands (e.g., for filtering or more complex transformations). These
commands can, among other things, filter the contents of the read and/or
stored file, e.g. SELECT IF. If no commands are pending for execution, exact
copies of the active file are stored under the new, specified name. If the
specified file name has already been assigned and it is not the name of the
active dataset, the content of the already existing file is overwritten. In this
case, SPSS returns a message. If the specified dataset name is the name of the
active file, the name is assigned to the new copy and SPSS keeps the content
of the active file as an unnamed dataset. Moreover, DATASET COPY
behaves like an EXECUTE (EXE): it triggers pending transformations, i.e. if
transformations are pending before the copy is created, these transformations
are applied to the original as well as to the copy/copies.
DATASET DECLARE
The DATASET DECLARE command defines a new, still empty dataset,
e.g. as a target into which subsequently executed procedures can write their
output. After DATASET DECLARE only the name of the file to be
created needs to be specified. The name of the file must correspond to the
conventions for SPSS files (see above). The command is executed
immediately. The created file corresponds to an empty temporary SPSS file.
Unlike DATASET COPY, this file is not associated with the active file or
any other file. An association is only established at the moment when other
procedures (e.g. via OUTFILE= in AGGREGATE or REGRESSION) or
commands store data etc. in this file. DATASET DECLARE does not trigger
any pending transformations.
DATASET DISPLAY
The DATASET DISPLAY command can be used to request a list of the
currently available datasets. The command is executed immediately. The
DATASET DISPLAY. statement is sufficient; it is not necessary to specify a
file name. DATASET DISPLAY does not trigger any pending
transformations.
DATASET NAME
The DATASET NAME command assigns a file its own name. This name can
be used in subsequent DATASET commands and/or most commands
referencing SPSS datasets. Furthermore, the designated file remains available
even if other files have been opened and/or activated.
After DATASET NAME only the name of the file to be created needs to be
specified. The name of the file must correspond to the conventions for SPSS
datasets (see above).
If the active file already has a name, it will be overwritten. If the specified
name is already assigned to another file, this association is dissolved and the
old file is closed. In any case the new name is now associated with the active
file. The command is executed immediately. DATASET NAME does not
trigger any pending transformations. It is strongly recommended to assign
separate names to the files used; the automatically assigned names
("DataSetn") can all too easily be confused.
WINDOW options
Most of the presented DATASET commands can be supplemented by the
WINDOW keyword. The exceptions are CLOSE and DISPLAY. With this
keyword it can be specified after WINDOW= whether the window of the
Data View of the respective file is to be displayed in the foreground
(FRONT), reduced in size (MINIMIZED) or not at all (HIDDEN). ASIS
leaves the Data View as it is. The following overview shows the differences
of the DATASET options in the scope of functions of the WINDOW
keyword.
ASIS FRONT HIDDEN MINIMIZED
ACTIVATE X X
CLOSE - - - -
COPY X X X
DECLARE X X X
DISPLAY - - - -
NAME X X

11.9 Digression: Working with FILE HANDLE


In addition to DATASET, the FILE HANDLE option (available since
SPSS version 13) is also interesting for working with several datasets. While
DATASET is practical for opening several datasets at the same time, FILE
HANDLE facilitates the programming of data access. FILE HANDLE is,
simply put, the abbreviation of sometimes very long path and dataset names,
a so-called "pointer". In many programs, the storage location of a dataset is
indicated by a path in the following form, whereby the path itself indicates
the storage location, and the designation at the end of the path indicates the
name of the file itself.
e.g. C:\Programs\SPSS\World 95 for Missing [Link]
The link between path and file name allows any system to access the required
data. However, as you can see in the example, long path names can lead to
error-prone typing, especially if the writing of this path should be repeated
several times. Since version 13, SPSS has offered the possibility of
abbreviating long path names.
FILE HANDLE assigns an own temporary pointer resp. reference (so-called
"file handle") to the specified path (optionally with or without dataset). In
further processing steps the defined pointer is accessed as if it were the actual
SPSS dataset. The advantage of this is not only that there is no need to type
long path names several times. The one-time assignment of a path is
sufficient. If the same datasets have to be referenced at different storage
locations, the calculation program does not have to be completely changed
each time. It is sufficient to adapt the referencing once.
FILE HANDLE is a requirement for datasets with record lengths exceeding
8192 characters. Using the MODE option, FILE HANDLE is able to
reference unformatted Binary / FORTRAN data, so-called Column Binary
data, or EBCDIC data. For further data types see the SPSS Command Syntax
Reference Version 15. FILE HANDLE is only active during the current
session. FILE HANDLE can get into reference conflicts with DATASET (cf.
11.8.1): if a dataset and a "pointer" have the same name in a program, the
dataset takes precedence over the "pointer".
The following sections introduce the two basic FILE HANDLE options:

FILE HANDLE only replaces the path to the location of the desired
file.
FILE HANDLE replaces the path to the location including the desired
file.
The practical use of FILE HANDLE is demonstrated by working with the
string function VALUELABEL.
Non-specific pointer:
FILE HANDLE replaces only the path to the location of the desired file.
FILE HANDLE PathWithoutDS
/NAME="C:\Programs\SPSS\".
Later accesses to one or more files at the desired location do not have to
spell out the long path each time, but only the abbreviating pointer (e.g.
PathWithoutDS). This unspecific variant abbreviates only the notation of the
path to a storage location; if there are several files in this location, they can
all be accessed via the abbreviation (see examples 3 and 4).
GET FILE "PathWithoutDS\Employee [Link]".

Specific pointer:
FILE HANDLE replaces the path to the storage location including the desired
file (see examples 1 and 2).
FILE HANDLE PathinclW95
/NAME="C:\Programs\SPSS\World 95 for Missing [Link]".
GET FILE "PathinclW95".
Later accesses to the file at the desired location need only specify the
"shortcut". This specific variant abbreviates both the specification of the path
and the dataset at this location. If there are further files at this location, they
cannot be accessed using this abbreviation variant. In this case, either
unspecific "pointers" or the required specific "pointers" would have to be
defined for an access.
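The distinction between unspecific and specific "pointers" has counterparts in other environments. As a hedged analogy only, the following Python sketch uses pathlib with hypothetical POSIX-style paths and file names (the originals above are Windows paths; none of the names below come from the SPSS examples):

```python
from pathlib import PurePosixPath

# Unspecific "pointer": abbreviates only the path to a storage location
# (hypothetical location, playing the role of PathWithoutDS above).
path_without_ds = PurePosixPath("/Programs/SPSS")

# Any file at that location can then be referenced via the abbreviation.
employee_file = path_without_ds / "Employee.sav"   # hypothetical file name

# Specific "pointer": abbreviates the path *including* one file name,
# playing the role of PathinclW95 above. Other files at the same location
# cannot be reached through this reference.
path_incl_w95 = PurePosixPath("/Programs/SPSS/World95.sav")

print(employee_file)        # /Programs/SPSS/Employee.sav
print(path_incl_w95.name)   # World95.sav
```

As with FILE HANDLE, changing the storage location then requires adapting only the one base-path definition, not every access.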
If a system does not successfully access data using such specifications, the
error can be caused by the path specification, the path itself, the specified file
name, or even by the fact that the data is not located at the specified location
at all.
Examples for FILE HANDLE with the function VALUELABEL:
The VALUELABEL string function can be used to convert value labels to
string data. This function is useful if you need to continue working with the
labels (now as string entries) instead of the original value labels. The
VALUELABEL function can convert value labels from string and numeric
variables to string entries (and also simultaneously, see Example 4).
Example 1 (numeric variable, univariate):
Converting labels of a numeric variable to string entries
GET FILE "PathinclW95".
string REGION_LB (A12).
* Note: If the variable length is too short (see STRING), the created entries
are truncated *.
compute REGION_LB=valuelabel(REGION).
exe.
freq REGION_LB.
Example 2 (numeric variables, multivariate):
Convert labels of multiple numeric variables to string entries
GET FILE "PathinclW95".
do repeat VAR_LIST=REGION, PAT_LITF
/ALL_LABL=REGION_LB, PATF_LB.
string ALL_LABL (A12).
compute ALL_LABL=valuelabel(VAR_LIST).
end repeat.
freq REGION_LB PATF_LB.
Example 3 (string variable, univariate):
Convert labels of a string variable to string entries
GET FILE "PathWithoutDS\Employee [Link]".
string GENDER_LB (A12).
compute GENDER_LB=valuelabel(GENDER).
exe.
freq GENDER_LB.
Example 4 (numeric and string variables, multivariate):
Converting labels of several numeric and string variables into string entries
GET FILE "PathWithoutDS\Employee [Link]".
do repeat VAR_LIST=GENDER, MINORITY
/MIXDLABL=GENDER_LB, MINORITY_LB .
string MIXDLABL (A10).
compute MIXDLABL=valuelabel(VAR_LIST).
end repeat.
freq GENDER_LB MINORITY_LB.
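What VALUELABEL does, looking up the label defined for a value, can be sketched in Python with an ordinary dictionary (the labels and codes below are hypothetical; VALUELABEL returns an empty string where no label is defined, which dict.get reproduces here):

```python
# Hypothetical value-label dictionary for a numeric variable such as REGION,
# standing in for what VALUELABEL reads from the SPSS data dictionary.
region_labels = {1: "OECD", 2: "East Europe", 3: "Pacific/Asia"}

regions = [1, 3, 2, 1, 4]  # 4 has no label defined

# An empty string is returned for unlabeled values,
# mirroring VALUELABEL's behaviour.
region_lb = [region_labels.get(v, "") for v in regions]
print(region_lb)  # ['OECD', 'Pacific/Asia', 'East Europe', 'OECD', '']
```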
12 Time and date related problems
– Detect and resolve
Time- and date-related problems can also affect the quality of data and its
analysis. Frequently occurring areas of application are, for example, the
checking and correction of incorrect date and time information (12.1, 12.2),
compliance with the ISO 8601 standard and resolving the "Year 2000"
problem (12.3), or the advantages of working with a so-called time stamp
(12.4).

Checking for possible time- and date-related problems requires that the
criteria "completeness", "uniformity", "duplicates", "missings" and "outliers"
have already been checked and are OK. When checking entries in time and
date variables, special attention must be paid to correct punctuation, see
Chapter 4 ("Uniformity"). Depending on the context, Chapter 12 also covers
to a certain extent a special form of time- and date-related plausibility. In
certain circumstances, criteria from Chapter 12 have to be applied before
checking the general plausibility (sensu Chapters 8 and 9).
These are general recommendations for working with time and date values:

In general: Time and date values should always be stored in the
format of SPSS time and date variables. Pass the correct time or
date format explicitly to SPSS.
Scenario 1: The time and date values are available as raw data, e.g. a
date in the format DDMMYYYY distributed on the variables DD,
MM and YYYY: The variable DD has to be checked for integer
values less than or equal to 31. The variable MM is to be checked for
integer values that are less than or equal to 12. The variable YYYY
should have four digits. For two-digit values, it has to be clarified
which century ("start of the calendar") they refer to (see the example
of the € introduction). If supposedly two-digit values exceed 99,
there is an error. Two-digit year values should be converted to
four-digit year values. Depending on the application area, special care must be taken
with historical dates (e.g. older than 1582) due to possible calendar
changes. The plausibility of the converted time and date values must
be checked.
Scenario 2: The time and date values are available e.g. as strings in
the format DDMMYYYY. These values have to be converted and
stored in the format of SPSS time and date variables (see 4.11).
Check the plausibility of the converted time and date values.
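The component checks of Scenario 1 can be sketched as follows in Python (the function names and the pivot year 50 are illustrative assumptions and not part of the SPSS examples; the actual century assignment must follow the documented "start of the calendar"):

```python
def check_date_components(dd, mm, yyyy):
    """Return a list of problems found in raw DD/MM/YYYY components,
    following the checks described in Scenario 1."""
    problems = []
    if not (1 <= dd <= 31):
        problems.append("DD out of range")
    if not (1 <= mm <= 12):
        problems.append("MM out of range")
    if 0 <= yyyy <= 99:
        problems.append("two-digit year - convert to four digits")
    return problems

def widen_year(yy, pivot=50):
    """Convert a two-digit year to four digits (assumed pivot rule:
    values >= pivot belong to 1900-1999, smaller values to 2000+)."""
    return 1900 + yy if yy >= pivot else 2000 + yy

print(check_date_components(32, 13, 1898))  # ['DD out of range', 'MM out of range']
print(widen_year(97))   # 1997
print(widen_year(4))    # 2004
```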

The test steps presented in the following chapters provide further insights.

12.1 Insights by time differences


If dates are collected at least in pairs, e.g. date of birth and date of death, start
and end time of a treatment, admission and discharge dates, or other
measurements in the form of "before"/"after" (there can be more than two
measurement times), simple differences are very effective for finding errors,
but are nevertheless error-prone despite their apparent simplicity.
In the following SPSS syntax, three exemplary data pairs are read in, as they
often occur in analysis practice:

START and FINISH are a pair of variables in TIME format for the
start and end time of a treatment or similar in the unit
hours:minutes:seconds.
BIRTHDAY and DEATHDAY contain the date of birth and death,
respectively.
RECEPTION and DISCHARGE contain reception and discharge
dates (the last four variables are in European date format).
For further time-related applications, please refer to Schendera (2005).
DATA LIST
/START 1-8 (TIME) FINISH 10-19 (TIME) BIRTHDAY 20-28 (EDATE)
DEATHDAY 30-38 (EDATE)
RECEPTION 39-47 (EDATE) DISCHARGE 48-56 (EDATE).
BEGIN DATA
[Link] [Link] 01/10/45 21/10/90 11/12/97 11/11/97
[Link] [Link] 02/11/45 21/10/45 11/12/97 11/12/97
[Link] [Link] 03/10/54 21/10/90 11/12/97 13/12/97
END DATA.
COMPUTE DIFF_1=FINISH-START.
COMPUTE NDAYSB=[Link](BIRTHDAY).
COMPUTE NDAYSDE=[Link](DEATHDAY).
COMPUTE NDAYSR=[Link](RECEPTION).
COMPUTE NDAYSDI=[Link](DISCHARGE).
exe.
COMPUTE DIFF_2=NDAYSDE-NDAYSB .
COMPUTE DIFF_3=NDAYSDI-NDAYSR + 1 .
exe.
FORMATS DIFF_1 (TIME15) DIFF_2 DIFF_3 (F8.0) .
LIST VARIABLES=DIFF_1 DIFF_2 DIFF_3 .
The variable DIFF_1 is the difference between end and start time (FINISH,
START) in the unit hours:minutes:seconds. The variable DIFF_2 is the
difference between day of death and birthday (for demonstration purposes);
previously, the number of days since October 14, 1582 ("day 0" of the
Gregorian calendar) was determined for BIRTHDAY and DEATHDAY
using XDATE. The unit of the variable DIFF_2 is therefore in days. The
variable DIFF_3 is the difference between the DISCHARGE and
RECEPTION dates; the calculation is analogous to BIRTHDAY and
DEATHDAY. The unit of the variable DIFF_3 is therefore also in days. The
only difference here is that DIFF_3 is adjusted by 1 in contrast to DIFF_2.
This peculiarity is explained further below. In all three difference
determinations, the earlier time value (which must therefore be the smaller
time value) is subtracted from the later time value (which must therefore be
the larger time value). If the data is correct, the differences determined can
therefore only be positive (or 0, if permissible in the respective context). Each
negative value is an indication of an error in the date values.
DIFF_1 DIFF_2 DIFF_3
[Link] 16436 -29
[Link] -32 1
-[Link] 13147 3
Number of cases read: 3 Number of cases listed: 3
As the output values show, negative differences indicate incorrect date or
time values (cf. also the raw data in the DATA LIST step). Simple difference
formations should therefore not result in negative values for successive time
or date values: if the earlier time value is subtracted from the later one,
positive values should result without exception (possibly also 0, if allowed).
Despite this rather simple procedure, there are several pitfalls:

If several differences are computed simultaneously, later and earlier
values may not be consistently "polarized" correctly. To avoid
confusion, earlier values should either be subtracted from later time
values throughout, or uniformly vice versa.
The time values are in different units, either already at measurement
or by formatting in the dataset. Before the difference is calculated, the
time resp. date units should be checked and, if necessary, adjusted.
Unequal time units lead to meaningless differences.
The decisive factor in forming differences is how the applicable
protocol defines a difference. For example, the simple differences of
START and FINISH (DIFF_1) or BIRTHDAY and DEATHDAY
(DIFF_2) contrast with the adjusted difference of RECEPTION
and DISCHARGE (DIFF_3). The reason is that in the
administration of e.g. hospitals, a patient who is admitted on one day
administration of e.g. hospitals, a patient who is admitted on one day
and discharged on the same day is often counted as one day present
(DIFF_3). However, a pure (unadjusted) difference between
RECEPTION and DISCHARGE of the same date would result in the
value 0 (in case of correct data; e.g., if there is a difference between
two treatment times) and must therefore be corrected by 1 in order to
be correctly billed. However, this adjustment has the consequence
that the value 0 must not be interpreted as a valid value. In contrast to
the non-adjusted difference, the value 0 in the adjusted difference is
an incorrect value. – In hospital administration, the billing of patients
is done by means of the so-called "midnight statistics". Patients who
are present until 24:00 of the same day are e.g. billed with one day.
Patients who are still present after 0:00 o'clock the following day are
billed with two days.
A conspicuous (e.g. negative) difference value does not directly
indicate the source of the error: the error may lie in START or
FINISH alone (if only one value is wrong) or in both START and
FINISH (if both values are wrong). Difference values which are not
conspicuous (e.g. not negative) are no guarantee for a complete
exclusion of errors. Non-negative difference values only guarantee
that later time values are higher than earlier time values; however,
despite non-negative difference values, wrong differences could be in
a grey area of apparent plausibility. Such differences can, however,
be checked more precisely by, for example, comparing the START
and FINISH values with other variables.

Data from production lines may be a special case where later
measurements do not automatically mean more recent data. I once had
data from a sensor chain where an earlier sensor had conspicuously more
recent data than a sensor that, according to the documentation, operated
later in the chain. An inspection of the production line itself revealed that
these two sensors in fact operated alternately: in each turn, the "earlier"
sensor provided more recent data than the "later" sensor.
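The difference checks above, including the adjustment by 1 ("midnight statistics"), can be sketched in Python as follows (the function names are illustrative; the sample dates mirror the first data row of the DATA LIST step, where the discharge date lies before the reception date):

```python
from datetime import date

def day_difference(earlier, later):
    """Plain difference in days, as in DIFF_2: a negative value
    indicates an error in the date values."""
    return (later - earlier).days

def stay_days(reception, discharge):
    """Adjusted difference, as in DIFF_3: a same-day stay counts as
    one day ("midnight statistics"); 0 is therefore NOT a valid value."""
    return (discharge - reception).days + 1

# First data row above: DISCHARGE (11/11/97) lies before RECEPTION (11/12/97).
diff = day_difference(date(1997, 12, 11), date(1997, 11, 11))
print(diff)                                               # -30 -> error
print(stay_days(date(1997, 12, 11), date(1997, 11, 11)))  # -29, as in DIFF_3
print(stay_days(date(1997, 12, 11), date(1997, 12, 11)))  # 1: same-day stay
```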
Please refer to Section 4.11.2 for the problem of non-uniform dates, i.e.
non-zero differences between apparently identical date values.

12.2 Checking date entries (transposed digits)
If dates are taken manually from paper questionnaires, entered via data masks
(CATI, CAPI) or directly into SPSS data tables, transposed digits can quickly
sneak in. If dates (or other time units) are central to the data analysis, it is
recommended that dates (e.g. month, day and year) are created in three
variables (e.g. MY_DAY, MY_MONTH and MY_YEAR) from the outset,
and that these are subjected to a check program in order to be able to identify
and correct any transposed numbers. This seemingly cumbersome procedure
has the advantage that the checks and, if necessary, corrections for the time
units (e.g. day, month and year) can be carried out separately and specifically.
Once the variables have passed through the check program, they can later be
aggregated to the desired date and viewed e.g. via LIST.
Checking date specifications within a complete date, e.g. "32.13.1898" in the
European date format EDATE, fails already during reading or input: values
outside the plausible range are not even accepted by SPSS, so no
differentiated error message is possible for the day and month specifications;
only the year specification can then still be verified.
* A. Reading an example date *.
data list
/MY_DAY 1-2 MY_MONTH 4-5 MY_YEAR 7-10 .
begin data
16 09 1942
01 11 1963
32 13 1898
23 03 2005
end data.
formats MY_DAY MY_MONTH MY_YEAR (F4.0).
In section A, dates are read in, including an incorrect date (line 4).
* B.1 Viewing the original input (e.g.)*.
list variables=MY_DAY MY_MONTH MY_YEAR .

* B.2 Automated error output (e.g.)*.


if MY_YEAR < 1900 ERROR_Y1= 1 .
exe.
if MY_YEAR > 2005 ERROR_Y2= 1 .
exe.
if MY_MONTH > 12 ERROR_M1= 1 .
exe.
if MY_MONTH < 0 ERROR_M2= 1 .
exe.
if MY_DAY > 31 ERROR_D= 1 .
exe.
The check program can be designed as strict and complex as required: from
simply inspecting clearly arranged data records (e.g. B.1, by means of LIST)
up to concrete test criteria (e.g. B.2, by means of IF), which rule out that an
error is overlooked (but not that this feedback is skipped). At this point,
comparisons with already existing date specifications (mutual validation) or
recourse to external check criteria would also be possible.
* C. Corrections (e.g.)*.
list variables=ERROR_Y1 ERROR_Y2 ERROR_M1 ERROR_M2
ERROR_D .
compute MY_YEARC=MY_YEAR.
exe.
compute MY_MONTHC=MY_MONTH.
exe.
compute MY_DAYC=MY_DAY.
exe.
if MY_YEAR < 1900 MY_YEARC= MY_YEAR + 100 .
exe.
if MY_MONTH = 13 MY_MONTHC = 12.
exe.
if MY_DAY = 32 MY_DAYC=23.
exe.
list variables=MY_DAYC MY_MONTHC MY_YEARC .
Under C., copies of the original variables are created and the necessary
corrections are stored in them. Depending on the required precision of the
corrections, they can be made on a case-by-case basis, but also group-wise.
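The idea behind checking for transposed digits can be sketched in Python: if a day value is out of range, test whether swapping its two digits yields a valid day (a heuristic only; the function name is illustrative, and a suggested swap such as 32 -> 23 must still be confirmed against the source documents):

```python
def suggest_day_swap(day):
    """If DAY is out of range, check whether the transposed digits
    give a valid day (heuristic only - a human must confirm)."""
    if 1 <= day <= 31:
        return day, None            # nothing to do
    swapped = int(str(day).zfill(2)[::-1])
    return day, swapped if 1 <= swapped <= 31 else None

print(suggest_day_swap(32))  # (32, 23) - the case corrected above
print(suggest_day_swap(16))  # (16, None) - valid, no suggestion
print(suggest_day_swap(40))  # (40, 4) - "40" reversed is "04"
```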
* D. Summarize the corrected values *.
compute
DATE_OK=[Link](MY_DAYC,MY_MONTHC,MY_YEARC).
exe.
compute DATE_NOK=[Link](MY_DAY,MY_MONTH,MY_YEAR).
exe.
formats DATE_OK(DATE11) DATE_NOK(DATE11) .
Under D., the corrected values are summarized in a variable (DATE_OK)
with the desired date format (DMY). The DATE_NOK variable is created
with the uncorrected values to show that values outside the day and month
range will trigger an error message in SPSS.
>Warning # 613
>One of the arguments to the DATE function is out of range or is not an
>integer. The result has been set to the system-missing value.
Finally, in E. only those cases are requested via the SELECT IF filter for
which a correction has been made. The output of DATE_NOK is done to
show that values outside the expected range lead to a missing in the affected
case.
* E. Filtering and displaying only cases with corrections *.
select if ERROR_Y1= 1 | ERROR_Y2= 1 | ERROR_M1= 1 | ERROR_M2=
1 | ERROR_D= 1 .
exe.
list variables=DATE_OK DATE_NOK.
List
DATE_OK DATE_NOK
23-DEC-1998 .
Number of cases read: 1 Number of cases listed: 1

12.3 Variants for solving the "Year 2000" problem (ISO 8601, Y2K)
The year 2000 is particularly important for data storage. (Numeric) dates
that were previously given with only two digits, e.g. to save storage space
by omitting the "19", or simply because the meaning of the two-digit code
used to be clear, can no longer be distinguished from two-digit dates from
the year 2000 onwards. For example, "04" can mean both "1904" and
"2004" in a data table. In order to ensure unambiguous information, uniform
formatting (e.g. F4.0) and uniform handling (retrieval of four- vs. two-digit
values), data storage has been and still is being standardized from two-digit
to four-digit year dates, also in order to implement certain ISO
requirements, e.g. ISO 8601, the European standard EN 28601, and the DIN
regulation DIN 5008 (rev. 1996).
ISO 8601 is an international standard that describes date formats and time
specifications and makes recommendations for the international use of date
and time specifications, e.g. in the international form "YYYY-MM-DD".
The following standardizing variants are relatively straightforward
approaches that can be applied to ISO standardization ("Y2K") and similar
problems, e.g. unifying or extending old serial numbers, encodings, etc.
Numeric values
Syntax:
data list free
/MY_YEAR .
begin data
95 96 97 98 99 2000 2001 2002 2003 2004
2005
end data.
exe.
compute MY_YEAROLD=MY_YEAR.
exe.
formats MY_YEAR MY_YEAROLD (F4.0).
do if (MY_YEAR < 100).
compute MY_YEAR=MY_YEAROLD+1900.
end if.
exe.
list.

List:
MY_YEAR MY_YEAROLD
1995 95
1996 96
1997 97
1998 98
1999 99
2000 2000
2001 2001
2002 2002
2003 2003
2004 2004
2005 2005
Number of cases read: 11 Number of cases listed: 11
DATA LIST FREE is used to read in values of the variable MY_YEAR.
Then a copy (MY_YEAROLD) is created via COMPUTE and formatted
accordingly via (F4.0). The DO IF - END IF statement overwrites all (two-
digit) year specifications (whereby it should be ensured that these values
actually stand exclusively for values between 1900 and 1999) with four-digit
year specifications. LIST is used to list the determined results (see above).
In this example, numeric year specifications were standardized; the
following example standardizes string year specifications. However, the
output values (YEARNUM) are numeric.
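The DO IF logic of the numeric variant, adding 1900 to every value below 100, corresponds to the following Python sketch (it likewise assumes that two-digit values stand exclusively for the years 1900 to 1999):

```python
years = [1995, 95, 96, 1999, 2000, 4]

# Values below 100 are assumed to stand exclusively for 1900-1999,
# as required for the DO IF step above; note what happens to the 4.
widened = [y + 1900 if y < 100 else y for y in years]
print(widened)  # [1995, 1995, 1996, 1999, 2000, 1904]
```

The last value shows why the assumption matters: a stray "4" meant as 2004 would silently become 1904.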
Alphanumeric characters (Strings) I
Syntax:
DATA LIST FREE
/YEARSTRG (A4) .
BEGIN DATA
95 96 97 98 99 2000 2001 2002 2003 2004
2005
END DATA.
compute YEARNUM=number(YEARSTRG, F4).
exe.
do if (YEARNUM < 100).
compute YEARNUM=YEARNUM+1900.
end if.
formats YEARNUM (F4.0).
list.

List:
YEARSTRG YEARNUM
95 1995
96 1996
97 1997
98 1998
99 1999
2000 2000
2001 2001
2002 2002
2003 2003
2004 2004
2005 2005
Number of cases read: 11 Number of cases listed: 11
With DATA LIST FREE, values of the variable YEARSTRG are read in
first; since in the example this variable is to be a character string, these are
defined as strings with length 4 when they are read in with (A4). The
NUMBER function converts the string numbers into numeric values
(variable YEARNUM); a slightly modified DO IF - END IF statement then
overwrites the two-digit values with four-digit values. LIST is used to list
the determined results (see above). In this example, string year
specifications were standardized in such a way that they are stored as
uniform numeric values. The following two examples store unified string
specifications as strings again.
Alphanumeric characters (Strings) II
Example String II is based on the same data as the above example. The
DATA LIST FREE step can be taken from the example String I and is also
explained there. The output also corresponds to the data from example string
I and is not shown here. However, the output standardized values
(YEAR_STR) are now strings.
compute YEAR_I=number(YEARSTRG,F4).
exe.
if YEAR_I < 2000 INDEX= 1.
exe.

string YEAR_STR (A4) .

compute YEAR_STR=YEARSTRG.
exe.
do if (INDEX=1).
compute YEAR_STR=replace(YEARSTRG,"9","199",1).
end if.
exe.
list variables= YEARSTRG INDEX YEAR_STR.
The NUMBER function converts the strings into numeric values (variable
YEAR_I). YEAR_I is required for the creation of an index, which outputs the
index=1 for each value smaller than 2000. Within a DO-IF condition, the
string "9" is replaced by the string "199" via REPLACE for each INDEX=1
(which stands for two-digit year specifications). The parameter "1" is
necessary (for explanation of REPLACE see 4.3). Each two-digit
YEARSTRG value is thus overwritten by a correct four-digit value. LIST
lists the determined results (not shown further). The standardized values of
YEAR_STR are in the string format.
Alphanumeric characters (Strings) III
Example String III is based on slightly different data than the above examples
(cf. the 2011 values etc.). The output standardized values (YEAR2) are
strings. The difference to the two approaches above is the check range.
While examples I and II use values smaller than 2000 as test criteria, this
example uses values greater than or equal to 2000 as test criteria (cf. creating
the INDEX variable). At this point, advanced users surely have some ideas
how to program Example III more elegantly. Keywords could be: IF ANY,
DO IF-ELSE etc.

Syntax:
DATA LIST FREE
/YEAR (A4) .
BEGIN DATA
95 96 97 98 99 2000
2001 2002 2003 2011 2014 2015
END DATA.
string YEAR1 (A4) .
compute YEAR1=YEAR.
exe.
compute INDEX=
rindex(YEAR1, "200")
or rindex(YEAR1, "201")
or rindex(YEAR1, "202")
or rindex(YEAR1, "203")
or rindex(YEAR1, "204")
or rindex(YEAR1, "205")
or rindex(YEAR1, "206")
or rindex(YEAR1, "207")
or rindex(YEAR1, "208")
or rindex(YEAR1, "209").
exe.
string YEAR2 (A4) .
compute YEAR2=YEAR1.
exe.
do if INDEX = 0 .
compute YEAR2=
concat("19",substr(YEAR1,1,2)).
end if.
list variables= YEAR1 INDEX YEAR2.

List:
YEAR1 INDEX YEAR2
95 ,00 1995
96 ,00 1996
97 ,00 1997
98 ,00 1998
99 ,00 1999
2000 1,00 2000
2001 1,00 2001
2002 1,00 2002
2003 1,00 2003
2011 1,00 2011
2014 1,00 2014
2015 1,00 2015
Number of cases read: 12 Number of cases listed: 12
Using DATA LIST FREE, the values are read in and formatted as strings. A
copy of the original variable YEAR is created (YEAR1) and worked with. The
RINDEX section creates an index (INDEX) to determine for which character
strings no transformation should be performed, i.e. for all years containing
the strings "200", "201", etc. (here the years 2000, 2001, 2002, 2003 and so
on), in total all years from 2000 to 2099.
The second COMPUTE creates a copy of YEAR1 (YEAR2). Filtered via INDEX, the
two-digit values from YEAR1 are appended to the string "19" using a
CONCAT-SUBSTR combination, stored in YEAR2 and output via LIST. For this
program to work correctly, it must be clarified for which values no
transformation is allowed (see INDEX), that the specified string ("19") can
indeed be assigned uniformly (and no further string, e.g. "18", is
required), that the spelling is uniform (no typing errors or special
characters) and that the length of the strings is constant (e.g. two or four
digits).
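Outside SPSS, the same check-then-prefix logic can be sketched in Python. This is an illustrative sketch, not part of the original program; the century prefix "19" and the two-digit length check are the same assumptions discussed above:

```python
def standardize_year(year: str, prefix: str = "19") -> str:
    """Return a four-digit year string.

    Two-digit values are assumed to belong to the 1900s and receive the
    prefix "19"; values already starting with "20" are left unchanged.
    This mirrors the INDEX / DO IF logic of the SPSS example.
    """
    year = year.strip()
    # INDEX = 1 in the SPSS example: value already contains "20x" -> keep.
    if len(year) == 4 and year.startswith("20"):
        return year
    # INDEX = 0: two-digit value -> prepend the century prefix.
    if len(year) == 2 and year.isdigit():
        return prefix + year
    # Anything else violates the stated preconditions (uniform spelling,
    # constant length) and is returned unchanged for manual inspection.
    return year

values = ["95", "96", "99", "2000", "2011", "2015"]
standardized = [standardize_year(v) for v in values]
```

As in the SPSS program, the sketch only works if the preconditions listed above (uniform spelling, constant length, a single valid prefix) actually hold.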

12.4 Time stamps


Datasets or rows can be assigned a time stamp, e.g. when merging individual
datasets into one working resp. master dataset (or during any other
operation). In the following example, the date resp. the system time is
stored via COMPUTE, optionally in the variables MY_SYSDATETIME (date and
time), MY_SYSDATE (date only) or MY_SYSTIME (time only).
get file='C:\[Link]'.
compute MY_SYSDATETIME=$time.
compute MY_SYSDATE=xdate.date($time).
compute MY_SYSTIME=xdate.time($time).
exe.
formats MY_SYSDATETIME(datetime25).
formats MY_SYSDATE(edate8).
formats MY_SYSTIME(time20).
add files /file=*
/file='C:\[Link]'.
exe.
save outfile='C:\[Link]'.
get file='C:\[Link]'.
The advantage of the timestamp is that, in the finally created dataset, the
last time data was added can be traced via these variables down to hours,
minutes or seconds. In an alternative approach (ODBC access via GET
CAPTURE), SPSS currently provides the system-internal time in a single
variable with a resolution of days only. Using an additional OMS program, it
would be possible to store this information in a separate file (despite the
extension, "C:\my_protocol.sav" is not an SPSS dataset).
OMS
/select all
/destination format=TEXT outfile = "C:\my_protocol.sav" .
OMSINFO.
OMSEND.
If, in addition, the synchronization of different data deliveries or
connected workstations is to be controlled, it must first be ensured that
the data suppliers resp. connected workstations have a uniformly set and
formatted system time. For geographically distributed systems, agreement on,
or at least unifying adjustment to, a common standard such as GMT (Greenwich
Mean Time) is recommended because of time zones and time changes
(summer/winter time).
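The timestamp idea can also be sketched outside SPSS. The following Python sketch is illustrative only (the variable names mirror the SPSS example, the rest is invented); it stamps each appended record with a UTC timestamp, which sidesteps the time-zone issue just mentioned:

```python
from datetime import datetime, timezone

def stamp_records(records):
    """Append a UTC timestamp to each record at merge time.

    Using UTC (close in spirit to the GMT recommendation above) avoids
    ambiguity from time zones and summer/winter time changes.
    """
    now = datetime.now(timezone.utc)
    stamped = []
    for rec in records:
        rec = dict(rec)  # do not mutate the caller's data
        rec["MY_SYSDATETIME"] = now.isoformat(timespec="seconds")
        rec["MY_SYSDATE"] = now.date().isoformat()
        rec["MY_SYSTIME"] = now.time().isoformat(timespec="seconds")
        stamped.append(rec)
    return stamped

batch = stamp_records([{"ID": 1}, {"ID": 2}])
```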
13 Further criteria for data quality
The DQ Pyramid presented the following first criteria for the quality of data
in this book:

- Completeness (Chapter 3) resp. controlled missings (Chapter 6),
- Avoiding duplicate data (Chapter 5),
- Uniformity (Chapter 4),
- Evaluation of outliers (Chapter 7) and
- Plausibility (interpretability) (Chapter 8)
These quality criteria are not the only ones. Numerous other criteria are
discussed in the literature, especially on data warehouses, but also in survey
research, among other things with regard to structurability, hierarchization
and formalization (measurability) (e.g. Lee et al., 2006, Chapters 3, 4, and
Appendix 3; Batini & Scannapieco, 2006, Chapter 7.2.; Long et al., 2004,
201-203; Gackowski, 2004; Redman, 2004, 2001; Fugini et al., 2002; Pernici
& Scannapieco, 2002; Helfert et al., 2001; Naumann & Rolker, 2000;
Brackstone, 1999; Chapman et al., 1999; Garvin, 1998; Wang & Strong,
1996). The essential difference is that the verification of the following criteria
is not so much performed by SPSS on the basis of formal validation rules, but
rather by the user resp. client on the basis of precisely formulated (semantic)
objectives. For example, SPSS cannot judge the accuracy, unambiguity or
relevance of information, only users resp. clients can do that. The various
approaches in computer science to formalize also semantic aspects in such a
way that they can be validated by (automatic) checking rules are not yet
powerful enough to be implemented in a statistics software such as SPSS.
Interested readers are referred to the current data warehouse literature.
For the further content-related evaluation of data and especially for
preventive measures to ensure the highest possible data quality, the following
criteria can be considered, for example:

- Quantity: Is sufficient information resp. are sufficient indices available
for the central constructs (cf. relevance, granularity and timeliness)?
- Unambiguity: Is the data semantically unambiguously interpretable
(cf. also comprehensibility)?
- Relevance: Do data fields cover knowledge-relevant characteristics or not?
An early restriction to a few, but central, knowledge-relevant pieces of
information allows a drastic reduction of the data volume to be analyzed and
of the required resources on the one hand. On the other hand, datasets
planned this way may not always cover the specifically required demand and
may thus have gaps. Certain information that only becomes relevant at a
later date is therefore not always available and has to be supplemented by
further surveys or other data sources. Relevance can also change (see also
timeliness): Are the existing data fields still relevant or already
outdated? This aspect is especially important in rapidly changing market,
competitive and business situations. A few years ago, for example, in sales
(and thus in the corresponding databases) the information that a computer
should have a 3.5-inch floppy disk drive was still important. In the
meantime, this information has become technologically obsolete, and with it
the corresponding entry in the databases of purchasing and sales. However,
relevance can also refer to inherent bias: Does a data storage contain any
information at all about the actual target group, or only about
non-customers (the decisive factor here is not relevance, but bias!)?
Missing relevant characteristics for data fields resp. groups should be
collected additionally. Irrelevant variables that are difficult to fill out
often cause intentional misstatements in order to save time and work.
Relevance is often equated with correctness; this equation applies, if at
all, only to the querying of information, but not to its definition.
- Accuracy (granularity, level of detail): Strictly speaking, accuracy has
two meanings: a) Does the content of the dataset reflect empirical reality
accurately enough? For example, is the year sufficient as a date, or should
month and year be recorded as well? b) How high is the proportion of exact
values in the dataset compared to non-precise values? Meaning a) thus refers
to the external accuracy of data, b) to the extent of the internal accuracy
of data. b) presupposes a); therefore, only a) will be discussed here.
Accuracy (in the sense of a) is always "accuracy in the context of a
specific question" and can therefore be relative. Finely recorded
information can always be coarsened resp. compressed, but roughly recorded
information can never be refined afterwards. Here you can ask yourself: Are
certain codings or abbreviations of information still acceptable for the
purpose of information compression, or do they unnecessarily coarsen the
desired level of detail? In case of doubt, data should always be recorded as
accurately as possible. Accuracy is related to comprehensibility, but also
needs to be delimited from it.
- Comprehensibility: Comprehensibility is always "comprehensibility in the
context of a specific question for a specific user, decision-maker or
customer" and can therefore also be relative. When it comes to
comprehensibility, it is particularly important to ensure that data, codes,
formats, etc. are understandable not only for one user working alone, but
ideally also for all others. Comprehensibility is difficult to vary; once
data, codes, formats, etc. are declared comprehensible, they remain so, but
only within a context resp. within a certain time. Comprehensibility can
also include aspects such as flags, formats or factors; these
technical-methodological nuances are usually less important for a more
content-oriented definition. Comprehensibility is a central criterion: data
that is not understandable is not useful. Comprehensibility often goes along
with timeliness.
- Timeliness: Timeliness, in terms of the average age of data, is always
"timeliness in the context of a specific question". Timeliness here means
the time interval between the last update and the time of the question to be
answered. "Timeliness" and its relevant time unit are always relative: in
clinical applications, data may need to be updated in real time; stock
market data may already be offset by minutes, etc. For (old) data storages,
the actuality of the entries must therefore be checked at regular intervals.
Timeliness is not an end in itself, but is also relevant in practice, e.g.
in marketing: is the meanwhile successful entrepreneur not contacted because
she is still listed as a student in the data? If time-related data above all
is no longer up to date, it is (negatively) described as "expired" or
"invalid". Typical examples are legal deadlines or guarantees, the validity
of documents (e.g. ID cards) or the durability of goods (stock).
- Documentation: Is the documentation of the derivation and archiving of
data resp. of measures to ensure data quality exhaustive, structured and
comprehensible? Or have the employees in charge left the project, gone on
vacation resp. changed so often that nobody knows where and how the data and
its quality are documented? The criterion of verifiability is closely
related to documentation, as are other types of data (metadata) and their
documentation (see below).
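Two of the criteria above translate directly into checkable rules: accuracy/granularity (finely recorded values can be coarsened, but not the reverse) and timeliness (age of an entry versus a question-specific limit). The following Python sketch is purely illustrative; the function names and the threshold are invented, not taken from the book:

```python
from datetime import date

def coarsen_to_year(date_str: str) -> str:
    """Coarsen an ISO-style date ("2006-02-28" or "2006-02") to the year.

    The reverse (recovering month/day from a bare year) is impossible,
    which is why data should be recorded as finely as the question needs.
    """
    return date_str.split("-")[0]

def is_expired(last_update: date, today: date, max_age_days: int) -> bool:
    """Flag an entry whose last update is older than the allowed age.

    max_age_days encodes the question-specific notion of timeliness:
    it could stand for minutes (stock data) or years (address data).
    """
    return (today - last_update).days > max_age_days

year_only = coarsen_to_year("2006-02-28")
expired = is_expired(date(2020, 5, 1), date(2024, 1, 1), max_age_days=365)
```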

For strategy-relevant measures, the following criteria could be of interest,
for example:
- Credibility: Can the data be regarded as credible by the decision-maker,
user resp. customer? Because credibility is influenced by other factors,
such as independence, integrity, respectability, level of standards, etc.,
this term goes far beyond "mere" accuracy (see above). The more credible
data is, the higher its weight in a competition-relevant comparison with
other data. Merely trusting data, especially sensitive data, is not a
recommendable approach; precisely in this case a validation option is
particularly important, because it allows information to be identified as
incorrect.
- Objectivity: Can the data be regarded by the user resp. customer as
objective, free from external influence, and verifiable? Objectivity is
important if users resp. clients want to avoid data-related risks and
negative effects in the context of sensitive questions and decisions, or at
least want to be able to check the (data-related) cause of unexpected
results. Objectivity goes along with transparency, accountability,
verifiability and thus concrete control of the data, but also with the
positioning of the user resp. client with regard to the importance of data
quality. The more comprehensible data is, the more objective it is and the
more likely it is to be recognized and used as relevant for decision-making.
- Status (added value, reputation): Credibility and objectivity combine into
one criterion, the status of data. Data with a higher status is preferred to
other data. A high level of user, decision-maker or customer satisfaction
with data and its quality results in a long-term, marketing-relevant high
status of data.
- Verifiability: Verifiability increases the credibility of data.
Verifiability contributes to a higher status, also because from the outset
it excludes the possibility of users, customers or decision-makers becoming
suspicious due to non-transparent, incomprehensible data or results.
Verifiability refers to several levels of the data collection process
simultaneously: definition, operationalization/coding, collection,
verification (cleansing) and analysis resp. presentation. All levels of the
data collection process should therefore be carefully documented and
recorded to ensure verifiability. The maxim "trust is good - control is
better" applies precisely because a (naive) trust in data quality believes
data without checking it (for safety's sake). Verifiability, however, makes
it possible to find potential errors in data that may have been trusted too
hastily. Verifiability is essential for third-party data; exactly this
increases the credibility of a (data) source. It should be clear that the
comparison of internal company resp. project data with external data for
verification and possible enrichment requires the highest possible quality
of the external (reference) data.

This list cannot be concluded without a brief reference to rather
informatics-oriented criteria which, when looked at more closely, have less
of a content-semantic and more of a material resp. temporal relation to data
quality, e.g.

- Compatibility (portability): Are the (new) data compatible with all
applications within a more complex system resp. project? The definition of
compatibility refers to two levels: the level of defining the information
(e.g. is a single currency acceptable in all application areas?), as well as
compatibility with other data formats or hardware requirements. For example,
can all applications handle variable names longer than 8 characters or
strings with a length of 255 characters?
- Availability: How quickly can high-quality data be provided? In addition
to genuine processes of data collection and cleansing, this question is also
related to aspects of hardware, software, technical networking and human
resources. More specific aspects, e.g. reaction times, provision times or
data protection, will not be presented further here.
- Resource consumption (environmental protection): If the definition of the
quality of data (storages) is extended to include efficiency, then
accelerated loading times or generally "leaner" data (generally more
efficient work processes) go along with lower power consumption and thus
indirectly contribute to environmental protection. The quality of data
(storages) should be attractive in this respect as well.
- Price / costs: How much do high-quality data bring resp. cost the user,
client resp. decision-maker? The calculation is based on the number,
complexity and extent of the considered levels of the data collection
process, e.g. definition, operationalization/coding, collection, review
(cleansing) and analysis resp. presentation. More sophisticated data is more
profitable due to its higher quality and therefore justifies its
understandably higher price. When making a cost-benefit calculation, it must
be taken into account that data quality does not allow for compromises in
two respects. "Cheap" data of questionable quality can only have a
suspected, speculative benefit; however, to speculate one does not
necessarily have to spend money. And when checking data quality, one cannot
skip one or the other check with the "argument": "There won't be any
mistakes in it". It may well be that the omitted check criterion is exactly
the one that would have identified the most massive errors resp. biases. The
decision in assessing the efficiency of the cost-benefit factor should
always be in favor of data quality. High-quality data is a competitive
advantage in every respect.

The verification of data quality is always possible only after the related
investments. One should not succumb to the temptation to make the assessment
of data quality dependent on the expected contribution of the data to the
achievement of the analytical goals of the analysis or data mining project.
Data quality is independent of the content of data, the reflection of
empirical or economic realities. It is unrealistic to assume that
investments in data quality will help to enforce one's own expectations, as
a kind of instrumentalization. Ideally, data quality is credible, verifiable
and objective, and can provide a solid foundation for relevant decisions,
but not necessarily always in a direction that initially meets your
expectations. Data quality allows a clear view of empirical or economic
realities and is therefore an important protection against potential errors
and wrong decisions.
On the basis of the preceding remarks, a three-part differentiation of data
and their quality can now be woven in at this point:

- Documentation of data (see above): Documentation of the types and levels
of formatted information, e.g. datasets (files), variables (data fields) and
values (characters).
- Documentation of data quality (see above): Documentation of the complete
process of data collection and of the concrete criteria and measures of its
review or, if necessary, correction.
- Documentation of metadata (cf. also ISO/IEC 11179-1; cf. also Chapter
19.1): Metadata or meta information are data about data (including their
quality), i.e. information about information; they describe data by means of
a standardized documentation, e.g. file name, format and size, names and
formats of variables, the date of the last change, etc. (partly also
including their collection, presentation and distribution, cf. e.g. United
Nations, 2003, 2002, 1995). These (technical) metadata are usually not
stored in the data file itself, but in a so-called 'data dictionary' (a
somewhat fuzzy synonym is 'data directory').

If metadata are defined, recorded and stored in such a structured way, they
enable convenient archiving, retrieval and maintenance of various data
storages (e.g. datasets, databases). Note that changes to the data do not
automatically cause changes to the metadata. Metadata often (but not always)
require much less storage space than the actual data storage. The
documentation of metadata has the same relevance as the documentation of
data quality, especially in the case of larger data storages.
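A minimal data dictionary entry of this kind could be sketched as follows. This is illustrative Python, not a standard; the fields shown (file name, size, last change) are examples of typical technical metadata:

```python
import os
import tempfile
from datetime import datetime, timezone

def describe_file(path: str) -> dict:
    """Collect simple technical metadata for a data dictionary entry.

    File name, size and last-change date are typical entries; names and
    formats of variables would be added from the file's own dictionary
    (e.g. the header of an SPSS .sav file) in a real implementation.
    """
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "last_modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(timespec="seconds"),
    }

# Demo: describe a small temporary file standing in for a dataset.
with tempfile.NamedTemporaryFile(suffix=".sav", delete=False) as f:
    f.write(b"demo")
    demo_path = f.name
entry = describe_file(demo_path)
os.remove(demo_path)
```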
In addition, it is possible to evaluate data by means of data quality key
figures and to assign certain error codes or quality levels by means of
so-called 'flags', e.g.

- the data can be used (e.g. code "green"),
- the data can be used with restrictions (code "yellow"),
- the data are not to be used (e.g. code "red"),

and also to implement e.g. appropriate blocking options or access rights.
However, this approach will not be pursued further at this point.
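Such a traffic-light flag could be derived from a simple data quality key figure, e.g. the share of passed checks per record. The following Python sketch is illustrative only; the thresholds are invented, not taken from the book:

```python
def quality_flag(checks_passed: int, checks_total: int) -> str:
    """Map a pass rate to a traffic-light quality flag.

    Thresholds are illustrative: >= 95 % passed -> "green" (usable),
    >= 80 % -> "yellow" (usable with restrictions), else "red" (do not use).
    """
    if checks_total <= 0:
        raise ValueError("checks_total must be positive")
    rate = checks_passed / checks_total
    if rate >= 0.95:
        return "green"
    if rate >= 0.80:
        return "yellow"
    return "red"
```

A flag like this could then drive blocking options or access rights, as mentioned above.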

14 A little exercise
In this book you have learned about several criteria for checking data quality.
In the final example several errors and implausibilities are hidden. Check this
dataset by sight or by syntax and find these errors.
data list
/ID 1-3 GENDER 5(A) AGE 7-9 MARRIED 11-14(A)
FIRSTNAME 16-23(A) INSR_NO 25-37(A).
begin data
001 f 28 yes Caroline 8940233093
002 f 19 yes Sibylle 5914874441
003 m 99 je Volker 3645651657
004 m 75 no Rudolf 4361324654
005 m 7 yes Carl 6541980814
006 m 23 no Hank 6516968984
007 m 999 no Marcus 6105198454
008 f 34 yes Laura 3614984414
008 f 63 yes Lilli 691444O414
009 f 22 yes hTeresia 3949912351
010 f 65 yes Roland 9465914512
end data.
exe.
Two implausibilities (which should at least attract your attention) and seven
clear errors were hidden among these values. Can you find them? To enable
you to check the data for plausibility, here is some background
information: Each data row contains values of one person, namely ID,
gender (variable GENDER), age (AGE), whether this person is married
(MARRIED), first name (FIRSTNAME) and an insurance number (which
consists of a sequence of digits, INSR_NO). Once you have found the errors
and implausibilities (one of the errors is extremely nasty), you are ready to
tackle real data.
Take any (preferably: small) dataset, a section of a large dataset, or, perhaps
even better, ask someone to hide five errors in a (hopefully error-free) dataset
and systematically check this dataset against the following criteria:
Completeness, uniformity, duplicates, missings, outliers and plausibility. Let
your imagination run free when it comes to creatively developing test rules
for finding error sources or inconsistencies. Just think, for example, of
pregnant men, children of retirement age, or even women named "Paul".
Finally, compile an action catalog for the prevention of data errors, in which
you include, for example, the presented criteria relevance, accuracy or
unambiguity.
Readers may find the solution for this little exercise on the website of the
author.
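Test rules of the kind just described can be written down very compactly. The following Python sketch is illustrative only and deliberately generic, so it does not give the solution away; the concrete ranges and codes are assumptions, not the complete rule set needed to find every hidden error:

```python
def check_record(rec: dict) -> list:
    """Return a list of rule violations for one record.

    The rules mirror the criteria discussed above: controlled codes,
    value ranges and format checks. They are examples only.
    """
    problems = []
    if rec["GENDER"] not in ("f", "m"):
        problems.append("GENDER: unknown code")
    if not 0 <= rec["AGE"] <= 110:
        problems.append("AGE: outside plausible range")
    if rec["MARRIED"] not in ("yes", "no"):
        problems.append("MARRIED: unknown code")
    if not rec["INSR_NO"].isdigit():
        problems.append("INSR_NO: non-digit characters")
    return problems

ok = check_record(
    {"GENDER": "f", "AGE": 28, "MARRIED": "yes", "INSR_NO": "8940233093"}
)
bad = check_record(
    {"GENDER": "x", "AGE": 150, "MARRIED": "??", "INSR_NO": "12A4"}
)
```

Cross-field rules (pregnant men, married small children, typically male first names with GENDER "f") would be added in the same style.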

15 A program example for a first strategy
The literature provides quite a few strategies that users could apply to their
data. In fact, they often remain at the general level for the good reason that,
depending on the requirements and characteristics of the data and data
storages, each required strategy may be completely different.
The following is a simplified example of how one's own data could be
checked and prepared for analysis. This program consists of several steps that
clarify both the required criteria and the sequence of operations. The
advantage of such an approach is of course the immediate transparency, as
well as accountability and verifiability of the steps performed.
The (fictitious) program consists of three steps, e.g. "(1) Verification and
assurance of data quality", "(2) Standard analysis according to document
'Specification [Link]'" and "(3) Analysis variants according to
'Specification-2- [Link]'". These steps are preceded by a
recommended section to ensure uniform analyses (cf. "(0) Settings"), in
which the format of values, diagrams, tables, etc. is uniformly specified.
The program presented below is called "Project 'Prospective Market
Segmentation 2006'". It begins with a so-called "header", in which you can
enter project-relevant information, e.g. program name, person responsible
for the project, date of the last change, required datasets and, most
importantly, modifications made or notes on work still to be done, e.g.
problems to be solved. This header is comparatively short; in professional
programs, such headers can be several pages long.
The further structure of the following program is a logical and content-related
sequence of analysis steps (including checking for data quality) and
(successive) dataset accesses (cf. the change from [Link] to
MARKET_FILT.SAV). Depending on your data quality and analysis
requirements, you can include i.a. all possible examples from this book under
the respective program steps and let SPSS process them one after the other,
e.g. with the help of macro loops (cf. also Schendera, 2005).
When writing the program, it is helpful to provide certain sections with
explanations ("comments"), e.g. if you have noticed during programming that
unexpected peculiarities occurred in a certain analysis phase (data errors,
duplicates, etc.). Not only will your (future) staff know what you did here;
you yourself will not need to think twice at a later stage. You only need to
read the logged notes in the syntax. This makes things much easier for
yourself, your colleagues and the whole project.
The content of such an analysis sequence corresponds to a so-called "stream",
as it is known from data mining systems like IBM SPSS Modeler (cf. Chapter
16). In a "stream" the data passes through a sequence of access, filter and
analysis processes that build on each other. In the context of these successive
program steps, "streams" form a stream in two different ways: on the level of
the successive processes (in data mining comparable to "nodes"), as well as
on the level of the corresponding states of the data passing through. The
relation of processes to data is mutual: processes imply data in certain states,
as well as data with certain properties (states) imply certain processes.
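The stream idea, data passing through successive access, filter and analysis steps, can be sketched generically. This is an illustrative Python sketch (the mini-stream and its field names are invented), not Modeler code:

```python
def run_stream(data, steps):
    """Pass data through a sequence of processing steps (a 'stream').

    Each step is a function taking and returning the data, mirroring
    the node-by-node flow described above: each process implies data in
    a certain state, and each state implies the next process.
    """
    for step in steps:
        data = step(data)
    return data

# Hypothetical mini-stream: filter on ID, then filter on completeness.
rows = [
    {"ID": 1, "ITEM1": 5},
    {"ID": None, "ITEM1": 7},
    {"ID": 3, "ITEM1": None},
]
result = run_stream(
    rows,
    [
        lambda d: [r for r in d if r["ID"] is not None],     # ID present
        lambda d: [r for r in d if r["ITEM1"] is not None],  # complete
    ],
)
```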

*************************************************************.
*************************************************************.
*************************************************************.
*************************************************************.
*Program name:
*Project "Prospective Market Segmentation 2006"
*Author: CFG Schendera, Method Consult
*For: Sommer, Elke (Hamburg, Germany) 040-1234567-89
*Created on: 29.02.2006
*Dataset (size, date): "[Link]"
* (45.983 KB, 28.02.2006)
*Path: "E:\Clients\"
*Program units:
*(1) Verification and assurance of data quality
*(2) Standard analysis according to specification
* Document "Specification [Link]"
*(3) Analysis variants according to
* "Specification-2- [Link]"
*Last change: 30.02.2007

*************************************************************.
*************************************************************.
*************************************************************.
*************************************************************.

***********************************************************.
* (0) Settings *.
***********************************************************.
SET FORMAT=F8.2.
SET EPOCH=1950.
SET DECIMAL=DOT.
SET JOURNAL "C:\Programs\SPSS\project17_2006.jnl".
* Note: Does not apply to version SPSS 15 or later. *.
SET BLANKS=sysmis.
SET TLOOK "C:\Biz_Data\Market Segmentation [Link]" .
SET CTEMPLATE
"C:\Biz_Data\SPSS\Looks\[Link]" .

***********************************************************.
* (1) Verification and assurance of data quality *.
***********************************************************.

GET FILE="E:\Clients\[Link]".
* At this point insert the required operations and add explanatory notes, e.g *.
* (a) Dealing with missings *.
define macsymis (!POS !CHAREND('/')).
!do !i !in (!1).
if (!i = 999) !i = $SYSMIS.
exe.
!doend
!enddefine.
macsymis ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 /.
* (b) Filtering using ID *.
select if (not missing(ID)).
exe.
* (c) Filtering based on completeness *.
compute SYSMISUM
= SYSMIS(ITEM1) + SYSMIS(ITEM2) + SYSMIS(ITEM3)
+ SYSMIS(ITEM4) + SYSMIS(ITEM5) + SYSMIS(ITEM6).
exe.
select if (SYSMISUM < 2).
exe.
* (d) Standardization of central text entries *.
string ACRONYM (A20).
define FINDING (!pos !charend('/') / !pos !tokens(1)).
!do !i !in (!1).
if (index(upcase(COMPANY),
(!quote(!upcase(!i)))) ne 0) ACRONYM = (!quote(!2)).
exe.
!doend.
!enddefine.
FINDING IBM Industrial Business Machines / IBM.
FINDING MB Daimler Benz Chrysler Mercedes / DC.
list variables = ACRONYM.
SAVE OUTFILE="E:\Clients\MARKET_FILT.SAV".
*************************************************************.
* (2) Standard analysis according to document *.
* "[Link]" *.
*************************************************************.
title "Standard analysis according to specification document 'Specification-[Link]'" .

GET FILE="E:\Clients\MARKET_FILT.SAV".
MEANS
TABLES=ITEM1 ITEM2 BY PRODUCT
/CELLS MEAN COUNT STDDEV .
GRAPH
/LINE(MULTIPLE)=MEAN(ITEM1) MEAN(ITEM2) BY PRODUCT
/MISSING=VARIABLEWISE REPORT
/TITLE= 'Product Comparison'
/SUBTITLE= "Project 'Prospective Market Segmentation 2006'".
FACTOR
/VARIABLES ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6
/MISSING LISTWISE
/ANALYSIS ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6
/PRINT UNIVARIATE INITIAL CORRELATION
SIG KMO INV REPR AIC EXTRACTION
/PLOT EIGEN
/CRITERIA FACTORS(2) ITERATE(50)
/EXTRACTION PAF
/ROTATION NOROTATE
/METHOD=CORRELATION .

*************************************************************.
* (3) Analysis variants according to "Specification-2- *.
* [Link]" *.
*************************************************************.

title "Analysis variants according to 'Specification-2- [Link]'" .


DEFINE !LOOP (MY_LIST= !CHAREND ("/")).
GET FILE="E:\Clients\MARKET_FILT.SAV".
SELECT IF (not missing(CITY)).
EXE.
!DO !PRICE !IN (!MY_LIST).
SELECT IF (PRICE > !PRICE).
SORT CASES BY city .
SPLIT FILE
SEPARATE BY city .
MEANS
TABLES=ITEM1 ITEM2 BY PRODUCT
/CELLS MEAN COUNT STDDEV .
GRAPH
/LINE(MULTIPLE)=MEAN(ITEM1) MEAN(ITEM2) BY PRODUCT
/MISSING=VARIABLEWISE REPORT
/TITLE= 'Product Comparison'
/SUBTITLE= "Project 'Prospective Market Segmentation 2006'".
FACTOR
/VARIABLES ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6
/MISSING LISTWISE
/ANALYSIS ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6
/PRINT UNIVARIATE INITIAL CORRELATION
SIG KMO INV REPR AIC EXTRACTION
/PLOT EIGEN
/CRITERIA FACTORS(2) ITERATE(50)
/EXTRACTION PAF
/ROTATION NOROTATE
/METHOD=CORRELATION .
SPLIT FILE OFF .
!DOEND.
!ENDDEFINE.
!LOOP MY_LIST=10 50 100 150 /.
It is recommendable to clearly structure the individual (content-related)
program sections by using headings, frames and spacing. Clear spacing
improves the readability of the program and makes it easier, for example, to
find certain sections of the program more quickly. The contents of the
sections themselves should always build on each other logically and be
bundled according to individual analysis goals or higher-level questions. If
the analysis allows or provides for it, an increase from simple to complex
(e.g. by scale level and/or descriptive, then inferential statistical analysis) or
from univariate to multivariate procedures (e.g. univariate description, t-test,
multiple regression) is recommended.
For reasons of clarity and completeness, it is generally not recommended to
perform analysis steps in separate syntax files. A syntax file can also be
understood as a kind of "externalized brain". If you have several syntax
programs and at some point have to check whether and what you did (or did
not do) at an earlier analysis step, you can assume that searching around is
more cumbersome than if you had all the information and comments in one
working syntax. As a consequence, it is not unusual for such programs to
become very extensive (no wonder: as an "externalized brain" it contains
everything you have thought of so far, and from now on you no longer have to
think about it). Syntax programs should, if at all, be split up only
exceptionally and, if possible, only project-specifically.
16 Notes on IBM SPSS Modeler
IBM is regarded as one of the market leaders in the data mining sector, with
visionary strengths in analysis and modeling, for example in the Magic
Quadrant for Data Science Platforms (formerly the Gartner Magic Quadrant for
Advanced Analytics) (e.g. Gartner, 2017).
IBM SPSS Modeler (formerly known as "Clementine") is considered one of the
most popular data mining and text analytics software applications. IBM SPSS
Modeler, in short "Modeler", has a visual interface which allows users to
leverage statistical and data mining algorithms without programming (cf.
SPSS, 2018, 2017c,d).
Modeler is used in application areas as diverse as CRM, fraud detection and
prevention, risk management (credit scoring), insurance, health care,
demand/sales forecasts (pricing, response modeling), telco, entertainment, or
government and law enforcement. Modeler supports the entire data mining
process including CRISP DM methodology (cf. SPSS, 2000; Chapman et al.,
1999).
Modeler is available in two editions with different features (v18.2.1); both
editions are available as desktop and as server configurations:

- SPSS Modeler Professional: For structured data, such as databases,
mainframe data systems, flat files or BI systems.
- SPSS Modeler Premium: Includes all the features of Modeler Professional,
plus Text Analytics.

In addition to the traditional IBM SPSS Modeler desktop installations, IBM
now offers the SPSS Modeler interface as an option in the Watson Studio
product line.
The history of Modeler starts back in 1989 with the formation of Integral
Solutions Limited (ISL). ISL released Clementine 1.0 in 1994 (Khabaza,
1999). In 1998, ISL was acquired by SPSS Inc.; SPSS released Clementine 12.0
in 2008, and rebranded it as PASW Modeler in 2009. After SPSS Inc. was
acquired by IBM in 2010, the product formerly known as Clementine was
released as IBM SPSS Modeler.
16.1 Nodes Palette
Most of the data and modeling tools in IBM SPSS Modeler are provided by
so-called nodes, available from the Nodes Palette below the stream window.
Each tab of the palette contains several nodes used for different operations,
for example (recently used nodes are shown under the “Favorites” tab):

- Sources: Read data into IBM SPSS Modeler.
- Record Ops: Operations on data records, such as selecting, merging,
  and appending.
- Field Ops: Operations on data fields, such as filtering, deriving new
  fields, etc.
- Graphs: Display data before and after modeling, e.g. using plots,
  histograms, web nodes, evaluation charts and t-SNE.
- Modeling: Algorithms such as neural networks, decision trees,
  clustering, association and data sequencing.
- IBM SPSS Text Analytics: Approaches such as Text Link Analysis or
  Text Mining.
- Python: Run Python algorithms.

Depending on the Modeler version, you can either use an “IBM SPSS
Statistics” tab, or use the Statistics File node from “Sources” or the Statistics
Export node from “Export” to read/write data from IBM SPSS Statistics.
Later sections will highlight various options among these nodes for data
preparation and data quality, including the cleansing and transformation
of data for analysis as well as the identification of statistical outliers and
dubious fields.
IBM SPSS Statistics and IBM SPSS Modeler operate within the so-called
CRISP-DM methodology, a standard framework both in data analysis and
in data quality; data quality is, e.g., an element of phase 2 ("Data
understanding") of CRISP-DM (see SPSS, 2000; Chapman et al., 1999). In
this respect, this book may also be of some interest to Modeler users; it
gives a conceptual background to selected criteria, deviations, possible
causes, and their evaluation.

- You learn criteria for data quality and their relationships.
- You recognize data problems and the possible risks they pose.
- You understand your data problems, and their consequences.
- You solve data problems = establish data quality.
- You are enabled to communicate data quality.
These general objectives and principles also apply to Modeler. You can
identify data problems by using Modeler nodes directly, or indirectly by
using R, Python, or even SPSS syntax, depending on the Modeler/Clementine
version. The above discussions may draw Modeler users' attention to the
fact that some of the methods implemented in Modeler (e.g. for handling
missings) are not without problems and could possibly be replaced by
alternative approaches.

Nodes and streams
Nodes form a so-called "stream", a sequence of Source, Record/Field Ops,
Modeling and other nodes that build on each other; streams correspond to the
functional structure of an SPSS syntax program (see Chapter 15). Streams
form a process flow on two levels: on the level of the successive nodes
(processes) and on the level of the data states. The relation of nodes to data is
mutual: nodes imply data in certain states, and data states with certain
properties imply certain nodes. Streams are explained below and by means of
the next figure.
In the left area of this figure, you see an arrow diagram, this is a so-called
stream. Below the stream window, you see the Nodes Palette. You can also
request these nodes from the “Insert” menu (third menu from the left). The
above screenshot actually shows two simple streams.

- The first stream (starting on the very left) reads in the SPSS dataset
  “[Link]” (Statistics File node from “Sources”). Because you want to
  sample a subset, you may choose e.g. the Sample node (from “Record
  Ops”), and then save/export the resulting subset as “loan - [Link]”
  using the Statistics Export node from ”Export” to save it as an IBM
  SPSS .sav file. A second arm of the stream branches off downwards
  and requests the Data Audit node (from “Output”) for the original
  “[Link]”.
- The second stream, starting with the subset “loan - [Link]” created in
  the first stream, requests three analytics nodes (Random Trees, C5.0,
  C&R Tree) from “Modeling”, and the Statistics Export node from
  ”Export”. The Random Trees node builds an ensemble model that
  consists of multiple decision trees. The C5.0 node creates either a
  decision tree or a rule set. C&R Tree denotes the "Classification and
  Regression" node, which helps to predict and classify future
  observations.

Complex processing may lead to complex and nested streams of nodes. You
have at least two graphical, non-programming options to streamline complex
streams. You can split a complex stream into several separate streams,
branches or phases; in the above graph, the first stream creates a subset that
the second stream uses as input. You can also create SuperNodes, which
combine several consecutive nodes into a single node. For example, you
could integrate both streams above into a single SuperNode (not shown). The
advantage of SuperNodes is that they make streams more manageable by
combining several nodes.
Users should not confuse the usability of the interface with the complexity
and power of the processes behind it and the knowledge required. As for
SPSS, this applies all the more to Modeler: Modeler does not replace any
statistical or informatics knowledge, Modeler requires it.

16.2 Nodes for Data Preparation and Data Quality
IBM SPSS Modeler offers several nodes for data preparation and data
quality. The checks for individual data quality criteria are still distributed
over several nodes, e.g.:

- Completeness (Chapter 3): "Data Audit" node, "Auto Data Prep" node
- Controlled missings (Chapter 6): "Data Audit" node, "Auto Data Prep"
  node, “Feature Selection” node
- Assessment of outliers (Chapter 7): "Data Audit" node, "Auto Data
  Prep" node, “Anomaly” node
- Identifying and avoiding duplicates (Chapter 5): "Distinct" node

Overview
The overview lists selected nodes and follows the structure of the Nodes
Palette.

- Distinct (“Record Ops”): Identifies and removes duplicate data rows
  (cf. 16.2.3).
- Extension Transform (“Record Ops”): Performs data transformations
  using Python or R syntax.
- Auto Data Prep (“Field Ops”): Improves the data quality for modeling
  (cf. 16.2.2), especially after explorations using Data Audit.
- Anomaly (“Modeling”): Constructs a model to identify data
  anomalies, e.g. outliers or unusual cases.
- Feature Selection (“Modeling”): Identifies important input fields for
  predictive modeling using the feature selection algorithm.
- Data Audit (“Output”): Summary display of statistics; allows handling
  of outliers, and extreme and missing values (cf. 16.2.1).
- Statistics (“Output”): Calculates univariate descriptive statistics and
  pairwise correlations.
- Transform (“Output”): Applies different types of transformations, e.g.
  inverse, logn, log10, exponential, or square root.
- Extension Output (“Output”): Performs data analysis using Python or
  R syntax.

Modeler users may integrate separate nodes into a superordinate SuperNode.
Each stream should start with a data quality SuperNode, or at least include
one before any relevant modeling.

16.2.1 Data Audit node
Main purpose: The Data Audit node helps to get a first insight into the data
“as is”. You find the Data Audit node under “Output”.
Requirements:

- The dataset has to be successfully accessed and imported, e.g. the
  SPSS dataset “[Link]”.
- The Data Audit node has to be connected to the data source to be
  checked.
- You need to click on the connected Data Audit node and then on
  Run [Ctrl+E].
Results:
Audit tab:
The Audit tab lists graphs and descriptive statistics for each of the selected
fields in the data source, e.g. name, distribution graphs, measurement level,
maximum, minimum, unique and valid.

- Unique: Displays the number of unique levels of a field; e.g. ed has
  five different levels. You can double-check by looking at maximum
  and minimum values.
- Valid: Displays the number of available cases. System-defined
  missings, user-defined missings, null (undefined) values, blank
  values, white spaces and empty strings are treated as invalid values.
  The field default, e.g., has 700 valid values; judging from the
  consistent 850 valid values in the other fields, we may expect 150
  invalid values. Actually, default contains N=183 1s (“Yes”), N=517
  0s (“No”), and 150 system-defined missings. The Audit tab shows
  only the number of valid, not the number of invalid values.

Quality tab:
The Quality tab lists field names, measurements and more data quality
related information of the audited data.

- Extremes: Number of extreme values, definition-dependent.
- Outliers: Number of outliers, definition-dependent.
- Action: Actions in connection with outliers and extreme values
  (including deleting, replacing with missings or with non-extreme
  values).
- Impute Missing: You can set whether and how Modeler should
  replace certain values with calculated values. The options include
  "Never", "Null Values" (system-defined numerical missings), "Blank
  Values" (user-defined missings), "Blank and Null Values",
  “Conditions…” (define the values to be imputed), as well as the
  methods (cf. "Methods") and “Specify…”.
- %Complete, Valid Records, Null Value: Describe the completeness
  of the respective field. %Complete is defined as 100 minus the
  percentage of missings. Valid Records displays the number of
  available cases. Null Value displays the number of system-defined
  missings. Other fields display information about other types and
  numbers of missings ("Empty String", "White Space"). Valid Records
  and the information in the missing data fields on the right should add
  up to the total number of rows in the audited file. In default, the 700
  valid values plus the 150 invalid resp. null values add up to 850 data
  rows in total. According to this information, this field is 82.353%
  complete.
- Complete Fields (top left) indicates the average completeness of all
  fields (columns) in percent (example: 91.67%).
- Complete Records indicates the percentage of complete cases (rows).
  If 100% is displayed, all data rows are complete. If 0% is displayed,
  each data row has at least one empty cell. The value in the example
  says that 82.35% are complete; because this value exactly matches
  the %Complete of default, it also says that the incompleteness is
  caused by that field alone. This information can be used e.g. in
  "Missing Values" nodes ("Filter", "Select") to exclude the least
  complete variables.
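The completeness figures reported by the Data Audit node can also be reproduced outside Modeler. The following Python sketch (a simplified analogue, not Modeler's implementation; the records and field names are invented for illustration) computes %Complete per field, the Complete Fields average, and the Complete Records percentage for a small table in which None plays the role of a null value:

```python
# Sketch: reproduce the Data Audit completeness figures for a tiny table.
# None stands in for a system-defined (null) missing value.
records = [
    {"age": 41, "ed": 3, "default": 1},
    {"age": 27, "ed": 1, "default": 0},
    {"age": 35, "ed": 2, "default": None},   # one incomplete row
    {"age": 52, "ed": 5, "default": 0},
]
fields = ["age", "ed", "default"]
n = len(records)

# %Complete per field: 100 minus the percentage of missings.
pct_complete = {
    f: 100.0 * sum(r[f] is not None for r in records) / n for f in fields
}

# Complete Fields: average completeness over all fields (columns).
complete_fields = sum(pct_complete.values()) / len(fields)

# Complete Records: percentage of rows without any missing cell.
complete_records = 100.0 * sum(
    all(r[f] is not None for f in fields) for r in records
) / n

print(pct_complete["default"])   # 75.0
print(complete_fields)           # about 91.67
print(complete_records)          # 75.0
```

As in the bankloan example above, Complete Records equals the %Complete of the single incomplete field, because that field alone causes the incompleteness.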

16.2.2 Auto Data Prep node
Main purpose: Improve the data quality for modeling, especially after
explorations using Data Audit indicate peculiarities detrimental to statistical
modeling. You find the Auto Data Prep node under “Field Ops”.
Requirements:

- The dataset has to be successfully accessed and imported, e.g. the
  SPSS dataset “[Link]”.
- The Auto Data Prep node has to be connected to the data source to be
  checked.
- You need to click on the connected Auto Data Prep node, choose
  your settings and then click on Run.
- You may want to save the transformations in a new file.

Only Prepare Inputs & Target of the Settings tab will be presented. You
can select the following settings for inputs, target, or both:

- Adjust the type of numeric fields: convert Ordinal to Continuous, or
  vice versa.
- Reorder nominal fields, e.g. to have the smallest category first, the
  largest last.
- Replace outlier values in continuous fields.
- Continuous fields: replace missing values with the mean.
- Nominal fields: replace missing values with the mode.
- Ordinal fields: replace missing values with the median.

The benefits and limits of these approaches for missings and outliers are
discussed in Chapters 6 and 7.
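As a rough analogue of these imputation rules (a sketch, not Modeler's actual implementation), the following Python function replaces missings with the mean, median, or mode depending on the declared measurement level; the sample values are invented:

```python
import statistics

def impute(values, level):
    """Replace None depending on the measurement level.

    A simplified analogue of Auto Data Prep's rules:
    'continuous' -> mean, 'ordinal' -> median, 'nominal' -> mode.
    """
    observed = [v for v in values if v is not None]
    if level == "continuous":
        fill = statistics.mean(observed)
    elif level == "ordinal":
        fill = statistics.median(observed)
    elif level == "nominal":
        fill = statistics.mode(observed)
    else:
        raise ValueError("unknown measurement level: %s" % level)
    return [fill if v is None else v for v in values]

print(impute([1.0, 3.0, None], "continuous"))    # [1.0, 3.0, 2.0]
print(impute(["a", "a", "b", None], "nominal"))  # ['a', 'a', 'b', 'a']
```

Note that such single-value imputation shrinks the variance of the imputed field, one of the problems discussed in Chapter 6.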

16.2.3 Distinct node
Main purpose: Identify and remove duplicate data rows before data analysis
or use in business applications. You find the Distinct node under “Record
Ops”.
Requirements:

- The dataset has to be successfully accessed and imported, e.g. the
  SPSS dataset “[Link]”.
- The Distinct node has to be connected to the data source to be
  checked.
- You need to add an output format, e.g. the Table node from
  “Output”.
- Choose your settings in Distinct. Click either on Distinct or Table
  and then on Run [Ctrl+E].
Only the Settings tab will be presented.

Choose a mode:

- Create a composite record for each group. If you choose this option
  for non-numeric fields, then you can choose in “Composite” how to
  construct the composite value. For string or typeless fields you can
  choose e.g. the first or last alphanumeric value, for date fields e.g.
  the oldest or most recent date, and for numeric fields e.g. first, last,
  total, mean etc. Per default a record count is included.
- Include only the first record in each group. Selects only the first row
  from each group of duplicate rows and discards the rest. The first
  row is determined by the sort order defined below.
- Discard only the first record in each group. Discards the first row
  from each group of duplicate rows and keeps the rest. The first row
  is determined by the sort order defined below.

Key fields for grouping: Lists one or more fields to determine when rows
are considered to be identical. Typical fields are names, IDs, and the like.
The more fields you select, the fewer rows escape being identified as
identical. We choose age from the SPSS dataset “[Link]” for demonstration
purposes.
Choose a sorting variable and a sort order: The sort order determines
which row is considered the first within a group. Otherwise, all duplicates
are considered to be interchangeable and any record might be selected. We
choose ed from “[Link]”. Sort order is ascending.

Examples:

- Settings: age + ed (include the first row only). Result: 37 rows.
- Settings: age + ed (discard the first row only). Result: 813 rows.
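The Distinct logic (group by key fields, sort within each group, then keep or discard the first row) can be mimicked in plain Python; the rows below are invented for illustration and are not the bankloan data:

```python
from itertools import groupby

def distinct(rows, keys, sort_key, keep="first"):
    """Keep (or discard) the first row per group of duplicates.

    A simplified analogue of the Distinct node's include/discard modes:
    rows are dicts, keys are the grouping fields, and sort_key defines
    which row counts as 'first' within a group.
    """
    group_of = lambda r: tuple(r[k] for k in keys)
    ordered = sorted(rows, key=lambda r: (group_of(r), r[sort_key]))
    out = []
    for _, grp in groupby(ordered, key=group_of):
        grp = list(grp)
        out.extend(grp[:1] if keep == "first" else grp[1:])
    return out

rows = [
    {"age": 30, "ed": 2},
    {"age": 30, "ed": 1},   # duplicate of the row above w.r.t. age
    {"age": 41, "ed": 3},
]
print(len(distinct(rows, keys=["age"], sort_key="ed", keep="first")))    # 2
print(len(distinct(rows, keys=["age"], sort_key="ed", keep="discard")))  # 1
```

As with the 37 + 813 = 850 rows above, the two modes always partition the dataset: included and discarded rows add up to the total.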

16.2.4 Other nodes

- Extension Transform (“Record Ops”): Performs data transformations
  using Python or R syntax.
- Anomaly (“Modeling”): Constructs a model to identify data
  anomalies, e.g. outliers or unusual cases.
- Feature Selection (“Modeling”): Identifies important input fields for
  predictive modeling using the feature selection algorithm. Includes
  removing inputs or records that do not contribute to improving the
  modeling of the input/target relationship, for example fields with a
  large percentage of missing values, a large percentage of records in a
  single category, or fields with too many categories. Records that have
  missing values for the target field or missing values for all inputs are
  automatically excluded from all computations.
- Statistics (“Output”): Calculates univariate descriptive statistics and
  pairwise correlations on selected fields. Statistics: Count, Mean, Sum,
  Min, Max, Range, Variance, Std Dev, Std Error of Mean, Median,
  and Mode. Correlations: Pearson's correlation coefficient.
- Transform (“Output”): Applies different types of transformations, e.g.
  inverse, logn, log10, exponential, or square root.
- Extension Output (“Output”): Performs data analysis using Python or
  R syntax.

16.3 Modeler and SPSS Syntax: An old story …
In earlier Clementine/Modeler versions, users could apply SPSS syntax and
procedures (such as VALIDATEDATA), e.g. to ensure the quality of the data
provided. Unfortunately, you cannot import and run scripts created in IBM
SPSS Statistics within IBM SPSS Modeler. At the moment it is not clear
whether or when IBM may reintroduce the benefits of SPSS syntax
programming, e.g. by “legacy” options or the like. The following sections
are therefore written for users of older Clementine/Modeler versions; for
newer versions the content may unfortunately not apply.
The figure shows a Clementine 11.0 screenshot. The stream in the upper left
area starts with the access to the SPSS file "bankloan_estimate.sav". The
"Statistics Transform" node is placed at the beginning of the stream and
connected with the other nodes; this node is used for filtering, sorting and
data checking (cf. the SPSS syntax on the right). The "Type" node defines
the type of the variables (fields), e.g. Range, Set or Flag (not to be confused
with the storage format). After that, four different nodes branch off from
"Type" (from top right to bottom left): "Default" denotes the "Binary
Classifier" node, which allows the user to create and compare different
models for binary events. "C5 1" denotes the "C5.0" node; this node creates
either a decision tree or a rule set. "C&R Tree 1" denotes the "Classification
and Regression" node; with this node, future observations can be predicted
and classified. The "Statistics Output" node allows the user to include and
execute SPSS syntax for data analysis (see the SPSS syntax below left).
You can see two syntax windows: the one on the right belongs to the node
"Statistics Transform"; the second, smaller window below belongs to the
"Statistics Output" node.
Earlier Clementine/Modeler versions provided at least two nodes for using
SPSS syntax:

- The "Statistics Transform" node (names may vary) was intended to
  perform complex data management tasks and to pass the prepared
  data to Modeler for further processing. This node essentially
  supported the SPSS syntax for data management tasks, e.g. ADD
  VALUE LABELS, APPLY DICTIONARY, COMPUTE, DO IF/END
  IF, INCLUDE, INPUT PROGRAM-END INPUT PROGRAM,
  VECTOR, etc., as well as syntax for programming with the SPSS
  macro language, e.g. DEFINE-!ENDDEFINE. In the "Statistics
  Transform" node the entered/loaded syntax can be subjected to a
  validation; possibly incorrect syntax is reported in a dialog box.
- The "Statistics Output" node was designed to perform challenging
  data analysis tasks and to pass the prepared data to Modeler for
  further processing. This node primarily supported the SPSS syntax
  for data analysis tasks, e.g. procedures such as 2SLS, ALSCAL,
  ANOVA, CATPCA, CATREG, CLUSTER, CORRELATIONS,
  CORRESPONDENCE, COXREG, CROSSTABS,
  DETECTANOMALY, FREQUENCIES, GLM, MEANS etc.

SPSS syntax could be integrated in at least two ways: either by accessing
SPSS syntax programs, or by directly writing SPSS syntax into the
programming window ("Syntax" tab) of the corresponding nodes.
Syntax programming is generally interesting for SPSS Statistics users as
well as Modeler users. SPSS Statistics users qualify themselves via syntax
not only for SPSS but also for Modeler. Modeler users, on the other hand,
who have been working with the mouse on the graphical user interface, will
appreciate the possibilities of syntax in Modeler. The scope of performance
of Modeler as well as SPSS can be significantly extended by programming
languages such as Python, R or Spark. Modeler and SPSS can be customized,
controlled and automated through syntax. At the moment, this seems to be
no longer possible for genuine SPSS syntax. For future developments, please
refer to the latest releases and technical documentation of Modeler (SPSS,
2017b,e, 2016) and the IBM SPSS Command Syntax Reference (SPSS,
2017a, 2011, 2006).

This chapter will conclude with a summary of three benefits of syntax
programming for Modeler users:

- Clementine/Modeler can be extended by using programming
  languages, depending on the version (SPSS syntax, SPSS macros,
  and external languages such as R, Spark or Python; SPSS, 2017e).
  Modeler therefore has all the advantages of programming like SPSS
  (see Chapter 2).
- Modeler can be extended in regard to criteria-led checking and
  improving of data quality, by using R or Python syntax in the
  "Extension Transform" and "Extension Output" nodes, e.g. by
  integrating functions, user-written programs or special algorithms.
- Since created data, results and models are based on a systematic
  criteria canon for data quality (which can be verified at any time), this
  will also promote the transparency, credibility and professionalism of
  data mining with IBM SPSS Modeler.

17 Notes for Macintosh Users
SPSS also offers a version for the Macintosh. The latest version is currently
SPSS 26.0 for Mac OS X 10.7 (32-/64-bit) and Mac OS X 10.8 (64-bit only);
some older versions are only available in English, which is not relevant for
syntax and macro programming. Further system requirements are: Mac OS
X version 10.3.9 with an Intel® or AMD x86 processor with 1 GHz or more,
at least 1 GB RAM, and a minimum of 800 MB free space on the hard disk.
Depending on the Mac version, a network installation may not be possible
yet; not all modules are available.
In general: SPSS syntax and programmed SPSS macros developed on
Windows systems can also be used on Macs. For practical work only three
essential aspects have to be considered:

- SPSS for Windows usually has a wider range of add-ons and
  functions than SPSS for Mac, e.g. SPSS Exact Tests (in v26) or
  Maps (in earlier versions). Windows programs or macros will work
  on a Macintosh as long as the Mac version has licensed the modules
  that programs taken from SPSS for Windows try to access. If the
  programs try to access modules that are not (yet) available for the
  Mac, then these programs will not work.
- The second important aspect is the path. The programs developed in
  SPSS for Windows can be run unchanged on a Macintosh with one
  exception: the only concession to the other operating system is to
  adapt the path specifications.

  Windows (e.g.):
  GET FILE='C:\My_data\Path\Project\[Link]'.

  Macintosh (e.g.):
  GET FILE='Macintosh HD:Users:chris:chris_010507:My_data:Path:[Link]'.

  The required Mac paths can easily be taken from the log after a
  mouse operation and inserted at the corresponding position of the
  Windows path information.
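As a small convenience (not an SPSS feature, just an illustration of the conversion rule), a Python helper can translate a Windows path into the classic Mac colon notation; the volume name and home folders below are assumptions supplied by the caller, since they cannot be derived from the Windows path itself:

```python
def win_to_mac_path(win_path, volume="Macintosh HD", prefix_folders=()):
    """Translate 'C:\\a\\b\\file.sav' into classic Mac colon notation.

    volume and prefix_folders (e.g. the user's home folders) must be
    supplied by the caller; they are not part of the Windows path.
    """
    parts = win_path.split("\\")
    if parts and parts[0].endswith(":"):   # drop the drive letter 'C:'
        parts = parts[1:]
    return ":".join([volume, *prefix_folders, *parts])

print(win_to_mac_path("C:\\My_data\\Path\\Project\\data.sav",
                      prefix_folders=("Users", "chris")))
# Macintosh HD:Users:chris:My_data:Path:Project:data.sav
```

In practice, taking the path from the SPSS log after a mouse operation (as described above) remains the most reliable way to obtain the Mac path.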

- A last aspect is the language setting of Macs. This may result in
  labels with special characters (umlauts etc.) that were assigned
  correctly in SPSS for Windows but are now reproduced completely
  wrong (e.g., the correct spelling of VAR2 would be “Körpergrösse”
  in German):
  VAR1 "Beurteilung KigŠ Frage 1 / PostT"
  VAR2 "Kšrpergršsse in cm"
  In this case, either the language setting on the Mac has to be
  adjusted, or the placeholders (such as Š or š) have to be replaced in
  the program editor via "Search" and "Replace" by the correct special
  characters or at least by umlaut substitutes, e.g. "ae" instead of an
  "ä" etc.

18 Checklist: Test documentation

Name Dataset (incl. Extender): ……………………..
Project (Department): ……………………..
Company (Department): ……………………..
Last editor: ……………………..
Last storage date (DD-MM-YYYY-HH-MM-SS-etc.): ……………………..
Exact size (in GB-MB-KB-Byte): ……………………..
SPSS Version (incl. OS): ……………………..
Storage location (server, CD, etc.): ……………………..
Storage method (compression, etc.): ……………………..
Storage duration (expiry date): ……………………..
Documentation available: Yes ⃝  No ⃝
Filing documentation: ……………………..
Relevant numbers: Rows: …………  Cases: …………  Columns: …………
Contact details re. Processing: ……………………..
Contact details re. Archiving: ……………………..
Begin Audit (DD-MM-YYYY): ……………………..
End Audit (DD-MM-YYYY): ……………………..
Contact details re. Audit: ……………………..
Place / Date (DD-MM-YYYY): ……………………..
Signature: ……………………..

Completeness
- Datasets (partial datasets, subsets)
- Cases (data rows)
- Variables (data columns)
- Values
- Controlled missings

Uniformity
- Datasets (e.g. variables)
- Numerical variables (e.g. codes, value labels, special characters,
  currency and measurement units)
- String variables (e.g. name, type, label and length)
- Labels: correctly written?
- Unique information (e.g. names of people, places and products or
  even dates, telephone or personnel numbers)
- Special characters (e.g. symbols for currencies or measurement units)

Duplicate data rows
- Data record may only contain one case per line
- Dataset may contain several cases per line (e.g. repeated
  measurements)
- Checking for intra-group and inter-group duplicates?
- Cause(s) for non-permitted duplicate data rows can be identified?

Missings
- Proportion of missings checked?
- Patterns identifiable?
- Cause(s) for missings identifiable?
- Need for action?
- Reason for action (e.g. deletion, reconstruction, or inclusion in an
  analysis)?
- Review of the specific procedure, e.g. reconstruction (logic, hot deck,
  MVA)?

Outliers
- Check for univariate outliers
- Check for multivariate outliers
- Cause(s) for outliers identifiable (e.g. sampling error)?
- Differentiation of outliers from data errors possible?
- Need for action?
- Reason for action (e.g. delete, reconstruct, set to missing, or include
  in an analysis)?
- Review of the specific procedure, e.g. reconstruction (logic, hot deck,
  MVA)?

Plausibility
- External formal correctness given
- Internal formal correctness given
- Plausibility of content given

Documentation
- Audit documentation drawn up
- Checks and documentation checked for completeness
- Checks and documentation checked for correctness
- Documentation created / saved under the following name

19 Communication of quality
Criteria for successfully communicating the quality of data,
analyses and results you can trust

What is communication of quality? The purpose of communication is to
lead. In the context of data analysis and data quality, communication first
leads you to how you can ensure the quality of your data by applying
criteria, methods and key figures. Communication also enables you to lead
your audience from your analytical results to the methods by which you
obtained them and to the data quality methods used to ensure them. In the
end, your communication builds up trust. In your data. In your results. This
chapter will show you how you can improve and even enrich the
communication about quality.
This book attempts to make the essential need for data quality transparent and
to provide SPSS users with solutions to common problems. However, this
book cannot be concluded without referring to other aspects related to data
quality, analysis, and communication of results (e.g. Spiess, 2006; Hager,
2005; DeGEval, 2004, 2002; Schendera et al., 2003; Wright, 2003; American
Psychological Association, 2001; Wilkinson & APA Task Force on
Statistical Inference, 1999; Deutsche Forschungsgemeinschaft, 1998).
This book does not artificially divide research methods into fields of
application; it is always science and business, without any further judgement.

Ensuring data quality is only one link in a coherent chain of complex process
phases; however, it is the first and fundamental one, since all other phases of
scientific work and communication are based on it. Not only must the phase
of assuring data quality be executed; compliance with all criteria presented
so far must also be proven. Each phase of the business or research process
must be made transparent and verifiable already during the planning phase.
Careful planning before the data collection (e.g. survey) saves many
difficulties in evaluation and interpretation. Before data collection, for
example, the quality of the research design and especially of the survey
instruments should be thoroughly checked. Even if the data are already
available, the quality of operationalization and experimental design should
not be assumed without checking.
Data analysts cannot be held responsible for failures in the collection or
coding of data and can only rarely "rescue" such data. The number of
unreported cases of suboptimal, incomplete, or clearly incorrectly published
results should not be underestimated. Many publications that examine the
quality of published studies confirm these cautious critical assessments.
Authors such as Hager (2005), Ioannidis (2005), Diekmann (2002) or
Nuovo, Melnikow & Chang (2002), who critically examined, among other
things, designs of experiments (DOE) and research in publications, in part
speak of a large proportion of deficient studies. The often deplored deficient
communication of (often equally deficient) research results often enough
reflects the lack of a methodological basis of scientific thinking and working
(e.g. Sarris, 1992, Chapt. 19).

The complexity of data analysis should be adequately assessed. One cannot
warn often enough of the dangers of methodological half-knowledge
(Kromrey, 1999). Depending on design, data and analysis, knowledge of
many disciplines, approaches and theories may be required, including the
theory of the subject matter and sciences, measurement and test theory,
statistics and computer science. All computational methods require statistical
prerequisites, which the data must fulfill in order to be analyzed at all. Those
responsible in the individual phases of the business or research process have
the task of ensuring that the statistical requirements of the respective
statistical procedures are met. No statistics program, not even SPSS, can
check these prerequisites; only the respective users can do this.

The analysis of data is a complex, iterative process and is literally called an
art in some statistics books (e.g. Menard, 2001²; Chatterjee & Price, 1995²).
This fact is countered by an ever-increasing trivialization of, among other
things, statistics and research methodology. Being able to control a software
program is equated with statistical expertise, for example. However,
statistics is a full-fledged university course of studies. The assumption that a
complex multivariate analysis can be carried out "just like that" with a few
mouse clicks illustrates phenomenal "prophetic" abilities: it assumes that the
data can be considered 100% error-free and therefore ideal even without
verification, and that all necessary statistical requirements can be considered
given without further tests. Or it illustrates simple ignorance of the
complexity of research methods, statistics, data quality and the
responsibility towards clients, affected persons and resources. Research is a
matter of attitude. Towards quality, precision and responsibility. Results
from science and research flow back into everyday reality, where they form
the basis for future actions and decisions. For SPSS as well as for all other
program packages the following applies: SPSS does not replace statistical
knowledge. SPSS requires statistical knowledge.
Ensuring data quality is the first link in a coherent chain of complex process
phases, upon which further phases of scientific work and communication are
built. The communication of research results also follows certain structures
(e.g. abstract, introduction, method etc.), standards (e.g. correctness, clarity
etc.) and rules (e.g. citation conventions, decision parameters etc.). The
concluding sections summarize some of the most important criteria for data
quality (19.1.), data analysis (19.2.) and the communication of results (19.3.).
This compilation does not claim to be exhaustive, and given the complexity
of the subject matter, this should not be attempted at all. However, the central
importance of some selected criteria, e.g. the logic of the significance test and
its correct interpretation, is especially emphasized.

19.1 Criteria for data quality


The following criteria specify central aspects of the data definition and thus
in the long run also co-determine characteristics of the analysis results, e.g.
reliability, representativeness etc. This also applies to official statistics (e.g.
Voas, 2007; Schulz, 2007; Radermacher & Körner, 2006). Using data from
the British Office for National Statistics, Voas (2007, 4-8) found that
massive cheating had taken place in England and Wales during the 2001
census. The questionnaires had often been filled out by one person for all
persons living in the household. This could be recognized by coincidentally
identical birth dates, genders, zodiac signs or other conspicuous features;
e.g., the first of the month was often given. Of the 10.3 million married-
couple records in total, about 800,000 (approx. 7.8%) were affected by this
cheating. Felix Schulz (2007, 25ff., 193ff.) examined e.g. the reliability of
the German PKS statistics and concludes that unreliable data are more
likely to be assumed once suspects are registered with the police. As a
consequence, unreliable studies must also be assumed with regard to the
reliability of dark-field studies.

Data must be carefully documented (metadata).
Metadata is data about data. Metadata is therefore information about context
and circumstances that influence the handling of data. In the following, four
areas will be roughly outlined:

- the definition (collection) of data,
- the quality of data,
- the handling (processing) of data, and
- the storage (archiving) of data

(for the presentation and further distribution of data, see e.g. United Nations,
2003, 2002, 1995).
The information on the collection or definition of data includes aspects such as semantic definitions (including inclusion and exclusion criteria), information on the data source (location, number), the mode of collection (customer or household surveys, sales figures, weekly reports, etc.), units (currencies, quantities, or geographical data such as countries, cities, regions, longitudes and latitudes), the number of included or excluded cases, and also time-related information such as the time or phases of the survey. Depending on the type of data, indications of bias, of data or survey quality (reliability), or of data suppliers (in the case of third-party providers) must also be provided.
Information on the quality of data includes the tools or applications used, the type and number of criteria checked, the methods used to check them (approaches, rules), the tolerances/limits applied in the process, and the results of the measures taken (e.g. in the form of completeness metrics). Any exclusion of criteria, any omission of test methods, as well as any tolerance criteria applied must be made explicitly transparent and justified.
The information on saving or archiving data includes aspects such as file name and format, the codebook (documentation of the definition, coding, and any transformation or updating of the data), documentation of the variables (number, designations, labels, attributes, etc.) and of the cases (number, weighting), the archiving or storage location (country, location, department, server/computer, paths), storage format (file format, file size), storage time (date of archiving or updating, expiry date), storage type (protection, compression, sorting, encoding), and responsibility (person, department, access rights). Further, more technical information can include the software version (release), the operating system, the size and number of data pages, the maximum number of cases per data page, the number of dataset repairs, the type and number of index variables, or even the character set used.
The management (handling, processing) of data includes aspects such as
conversions, filters, transformations, calculations, definitions of index
variables, creation of new variables, derivation of weightings, handling of
missings (imputation), exclusion of outliers and duplicates, standardization of
strings, etc. In the context of data warehouses, there are many other types of
metadata, e.g. analysis, authorization, business information, grouping, etc.
The management of data is ideally documented in the form of syntax
programs that are in principle self-explanatory (see Schendera, 2005, 2004).
Specification of inclusion/exclusion criteria (survey units)
Inclusion and exclusion criteria of the cases studied must be explicitly stated.
The catalog of the included or excluded characteristics of the test persons
determines the population to which the results can be generalized via the
sample. Because of these far-reaching consequences for the applicability of results, a justification of the inclusion and exclusion criteria is highly recommended. In clinical research, for example, such a catalog is usually compiled before data collection. This catalog, and any violations of the inclusion criteria, are recorded "a priori" in a so-called study protocol. So-called equivalence studies in particular are very sensitive to violations of inclusion criteria. The same applies analogously to surveys: the exact survey units, strata, etc. must be specified.

Example: Breast cancer study: Inclusion criteria for a breast cancer study can be, for example, (a) female, (b) histologically confirmed diagnosis of a newly occurring mamma carcinoma, (c) sufficient language skills to understand and complete the questionnaires, (d) patient consent. Exclusion criteria can be e.g.: (a) age (e.g. younger than 20 years or older than 75 years), (b) recurrence of mamma carcinoma, (c) drug abuse, (d) known contraindications, (e) non-compliance, etc.
Example: In database marketing, vast amounts of data are often
available in data warehouses. The catalog of criteria is therefore
operationalized "ex post" in the form of SQL queries. For
example, if a telecommunications provider investigates the
telephoning behavior of its current cell phone users, a cell phone
contract is an inclusion criterion. The non-use of a cell phone
(e.g. if a customer is with several providers at the same time) or
an already terminated contract can be examples of exclusion
criteria.
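The ex post operationalization of such criteria can be sketched as a filter query. The following Python/SQLite illustration uses an invented toy schema (table and column names are ours, not any real provider's):

```python
import sqlite3

# Toy customer table (invented schema, for illustration only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers "
            "(id INTEGER, has_mobile_contract INTEGER, contract_terminated INTEGER)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, 1, 0),    # active cell phone contract -> included
                 (2, 1, 1),    # contract already terminated -> excluded
                 (3, 0, 0)])   # no cell phone contract      -> excluded

# Inclusion criterion: active mobile contract; exclusion: terminated contract.
included = con.execute("SELECT id FROM customers "
                       "WHERE has_mobile_contract = 1 "
                       "AND contract_terminated = 0").fetchall()
# included -> [(1,)]
```

The point is that the catalog of criteria is expressed as an explicit, reproducible query rather than as an undocumented manual selection.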
The statements made on the basis of the sample drawn apply only to a population with corresponding characteristics. An extrapolation to other populations is at least very questionable (e.g. generalizing the results of the breast cancer study to men, or generalizing the mobile telephoning results to landline users).
Complete survey or sample: Calculation of the required sample size
If your goal is to examine all feature carriers (the population), you will conduct a complete survey. In this case, the following remarks on calculating the required sample size or drawing a sample are irrelevant; a comment on statistical significance, however, is not. Significance is a sampling concept: for a complete survey, a generalization from a subset (sample) to the total (population) is not necessary at all. Instead of significance, it is recommended, for example, to specify confidence intervals. For analyses based on data warehouses, it should be checked a priori whether they contain complete surveys or only samples (e.g. internal company data).
However, if your goal is to investigate a subset of the feature carriers, you will examine a sample instead of the population. The size of the sample influences the scientific work: if too little data is available, the power of a test may be too small, and no reliable conclusions can be drawn because the sensitivity may be too low to detect differences that actually exist. If more than enough data is available, adequate power is given and reliable conclusions can be drawn thanks to sufficiently high sensitivity, but at the price of additional work that could have been saved had the required sample size been determined beforehand, not to mention ethical considerations. Especially in experiments, the required sample size should ideally be determined prior to the survey by means of a sample size calculation or power analysis (cf. Bock, 1998). Since the required sample size often depends on only approximately known parameters, this is referred to as sample size estimation. If the data are already available, an ex post analysis may prevent overinterpretation, e.g. if too little data has failed to yield substantial effects. The required ('true') sample size can ultimately only be derived from the most accurate preliminary information possible (Bock, 1998).
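For a rough pre-study orientation, the required sample size per group for a two-sided two-group comparison can be approximated with the usual normal-approximation formula. The following Python sketch (function name and defaults are ours; this is an illustration, not an SPSS feature) assumes a standardized effect size d:

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-sample comparison,
    using the normal approximation n = 2 * ((z_(1-a/2) + z_power) / d)**2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# A medium effect (d = 0.5) at alpha = 0.05 and 80% power:
n = sample_size_per_group(0.5)   # about 63 per group
```

The exact t-test result is slightly larger; for real studies, dedicated power-analysis software should be used.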
Measures for surveys
Surveys generally provide measures for the quality of coverage, including the respective definition as well as the specific formula for its calculation (if necessary, with weightings). Especially for random samples, the so-called sampling errors have to be specified. It is well known that estimates based on samples are not exact, but only estimates, because they are based on only a part of the data (see Chambers, 1999a,b; Draper & Smith, 1999a). The so-called sampling errors indicate the uncertainty of the sample-based estimates. Sampling errors are not relevant for complete surveys. The other measures are usually formulated as errors (positively formulated, they are expressed as a "rate" or "efficacy") and are generally not further distinguished between complete surveys (censuses) and samples. The following non-sampling errors have to be specified: coverage errors, non-response errors, measurement errors, and processing errors. For details and further types of errors (e.g. model assumption errors), please refer to the survey literature (e.g. Office of Management and Budget, 2006a,b; DESTATIS, 2005; Eurostat, 2003, 2002, 1999/1998; Statistics Canada, 2003; Kostanich & Haines, 2003; Kostanich, 2003; US Census Bureau, 2000; Federal Statistical Office, 2004, 1993; Davies & Smith, 1999a; United Nations, 1995).
Coverage errors (syn.: coverage deficiency, frame errors) describe the relationship between the survey frame and the cases (units) that were actually surveyed (see above all Elvers, 1999, 88). Coverage errors are thus measures of the accuracy with which the surveyed units reflect
the target population. The survey framework defines the totality of the cases
to be investigated theoretically, e.g. an address database for an online survey
on singles. Undercoverage means that certain cases are not recorded due to
this framework, e.g. singles without online access. Overcoverage means that
the frame (address database) contains more cases than belong to the target
population, e.g. married people with online access. Duplicates are also
possible (see also 5.1), e.g. singles with two or more online accesses or
technically caused duplicates (see Särndal, 1992).
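With the frame and the target population given as sets of unit identifiers, the two coverage measures can be computed directly. The following function is an illustrative Python sketch (our own naming), not a standard implementation:

```python
def coverage_rates(frame, target):
    """Undercoverage: share of target units missing from the frame.
    Overcoverage: share of frame units outside the target population."""
    frame, target = set(frame), set(target)
    under = len(target - frame) / len(target)   # missed target units
    over = len(frame - target) / len(frame)     # out-of-scope frame units
    return under, over

# frame = address database, target = actual target population
under, over = coverage_rates(frame={1, 2, 3, 4}, target={3, 4, 5, 6})
# under = 0.5 (units 5 and 6 missed), over = 0.5 (units 1 and 2 out of scope)
```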
Measures for non-responses (missings, see Chapter 6) are measures of the completeness of the data. Non-responses are distinguished according to whether they occur at the unit level (unit non-response, e.g. not encountered, language problems, answer refused) or at the level of individual questions or features (item non-response, e.g. technically caused) (e.g. Skinner, 1999a).
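Both levels can be quantified with simple completeness rates. In this Python sketch (our own conventions), a unit that returned nothing is represented as None, and an unanswered item as a None value:

```python
def nonresponse_rates(records, items):
    """Unit nonresponse: share of sampled units without any usable record.
    Item nonresponse: per item, share of responding units with a missing value."""
    responded = [r for r in records if r is not None]
    unit_rate = 1 - len(responded) / len(records)
    item_rates = {item: sum(r.get(item) is None for r in responded) / len(responded)
                  for item in items}
    return unit_rate, item_rates

records = [None,                            # not encountered / answer refused
           {"age": 34, "income": None},     # item nonresponse on income
           {"age": 29, "income": 48000},
           {"age": None, "income": 51000}]  # item nonresponse on age
unit_rate, item_rates = nonresponse_rates(records, ["age", "income"])
# unit_rate = 0.25; item nonresponse is 1/3 for each of the two items
```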
Measurement errors are measures of the deviation of the collected data
from the "true" data (e.g. Skinner, 1999b). On closer inspection, the term
"measurement error" is a generic term for a wide range of different sources
and types of error. Depending on the cause of the error (e.g. respondent,
interviewer, survey instrument etc.), the errors can result in different patterns,
dimensions, as well as solution steps. For example, respondents may not be
able to understand questions, instructions and/or interviewers or may refuse
to answer certain questions. Interviewers can e.g. influence interviewees.
With survey instruments, e.g. a faulty technique can lead to systematically incorrect data (see also 6.1.1). Processing errors (see examples in 7.4; see e.g. Davies, 1999) are errors that occur after data collection and before the actual statistical analysis, i.e. errors in the context of data management, e.g. incorrect data entry, incorrect combination of cases and/or variables, or incorrect calculations (e.g. deviations from SOPs, or from manuals for scales, testing, or scoring). Reliable measures for the type and extent of possible measurement or processing errors often have to be formulated by the users themselves, carefully tailored to the specific data situation.
Model assumption errors concern errors in inferential statistical analyses based on models (see Draper & Smith, 1999b). Such errors can be distinguished as rather theory-based errors during the phase of assumption-based model specification (theory-based modeling), and as statistically defined errors or measures of goodness (e.g. AIC, RMSE, residuals, etc.) in the inferential-quantitative phase of evaluating the quality of the specified model (inferential modeling). Statistical models must be specified carefully and iteratively, and reported in the end together with the model-specific measures of goodness or error. If this is not done, there is a risk of being accused of fraud because insufficient information was provided on the design, execution, and evaluation of a study. This happened, for example, to Burnham et al. (2006) with their massively criticized extrapolation of the number of war victims in Iraq (see, for example, Boseley, 2006; Bohannon, 2006; Fischer, 2006).
Drawing the sample (random sample)
If no complete survey is performed, the method of obtaining (drawing) the data has to be specified. A random sample (probabilistic sample) is ideal; besides the simple random sample, there are many other variants of true random sampling (including stratified sampling, multi-stage sampling, cluster sampling, etc.). Non-probabilistic drawing of data (e.g. the ad hoc or quota sample) must be justified or, ideally, avoided. Avoid bias both in the drawing of the sample and in the return of the data. This applies regardless of the study plan and the survey procedure (paper & pencil, CATI/CAPI, online, etc.).
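As one of the probabilistic variants mentioned, proportional stratified sampling can be sketched in a few lines. This is an illustrative Python function (names and seed are ours), not a recipe for any particular software:

```python
import random

def stratified_sample(population, stratum_of, fraction, seed=1):
    """Proportional stratified random sample: a simple random sample of
    the given fraction is drawn within each stratum."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for units in strata.values():
        sample.extend(rng.sample(units, max(1, round(fraction * len(units)))))
    return sample

population = [{"id": i, "region": "north" if i % 2 else "south"} for i in range(100)]
sample = stratified_sample(population, lambda u: u["region"], fraction=0.1)
# 5 units from each of the two strata of 50 -> 10 units in total
```

Fixing and documenting the random seed is itself part of the metadata: it makes the draw reproducible.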
Otherwise, you may be treated like the so-called Hite Reports after a careful
methodological assessment: "In the marketplace of scientific ideas, Hite's
work would be found in the curio store of the bazaar of pop and
pseudoscience" (Smith, 1989, 546). The SPIEGEL university ranking from
2004 (McKinsey & Company, 2004; Friedmann et al., 2004) also had to put
up with massive criticism of the research methodology (e.g. Schendera, 2006;
Alsleben & Richter, 2005; Grözinger & Matiaske, 2005; Höding et al., 2005;
Liebeskind & Ludwig-Mayerhofer, 2005; Claus, 2004).
Specification of the data basis (sample size) in N
The data basis (sample size) must always be specified in N. Please refrain
from an exclusive specification in percentage values, since these can conceal
a sometimes changing basis (see Klimanskaya et al., 2006).
Checking for data quality
Only in exceptional cases are data immediately ready for analysis. Check your data, and especially data from third parties. Let the criteria compiled in this manual help you: completeness, duplicates, outliers, missings, plausibility, etc. Specify the quality criteria that have been checked, the measures applied, and the results obtained.
Effects and effect sizes
On the one hand, effects are descriptive parameters such as the mean and standard deviation for continuous variables, and N or percentages for categorical variables. On the other hand, effect sizes are measures of "practical significance" that describe the size of the difference between groups or variables (see also the following notes on descriptive statistics).
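A common effect size for a two-group mean difference is Cohen's d with pooled standard deviation. A minimal, self-contained Python sketch (invented example data):

```python
import math

def cohens_d(a, b):
    """Cohen's d: standardized mean difference between two groups,
    using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

d = cohens_d([5, 6, 7, 8], [3, 4, 5, 6])   # about 1.55, a large effect
```

Unlike a p-value, d does not grow with the sample size, which is why it complements significance statements.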

19.2 Criteria for the quality of data analysis


Documentation of data preparation
Do not be fooled by the predominantly ideal analysis situations in textbooks on SPSS or statistics. Data are rarely available in a state ready for analysis. The process of preparing data for analysis, the so-called data management, should be logged, ideally in the form of a syntax program (e.g. Schendera, 2005). Do not rule out that data management may be more complex than the data analysis itself. Data preparation can also include, for example, the analysis-preparatory or result-relevant (dummy) coding of variables.
Derived data
If data are not collected directly but are derived, e.g. via scale construction, clustering, or segmentation, the concrete procedure must be made transparent in order to avoid artificial results and to ensure comparability with other publications. Psychometric scales, for example, must be analyzed exactly according to the specifications in the corresponding manuals. Also specify the test quality criteria and, where applicable, the measurement- or object-theoretical background of the scales used. This applies all the more to self-developed scales. If several scales with equivalent content exist, always use the scale with the highest professional and methodological level (cf. Rost, 2005). Conversely, frequent errors should be avoided, such as so-called atheoretical ("naive") clusters without stability and validity checks (cf. Bacher, 2002²), the uncritical further processing of "factors" despite minimal explained variance, the (tautological) testing of cluster solutions by means of a subsequent discriminant analysis (cf. Wiedenbeck & Züll, 2001, 17), or the testing for significant differences between the groups found by means of an analysis of variance.
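One of the standard test quality criteria for such scales is internal consistency (Cronbach's alpha). As a hedged illustration of how this criterion is defined (the item data below are invented), a plain-Python version:

```python
def cronbachs_alpha(items):
    """Cronbach's alpha for a scale: items is a list of item-score lists,
    one list per item, with respondents in the same order."""
    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    k = len(items)
    n_resp = len(items[0])
    totals = [sum(item[i] for item in items) for i in range(n_resp)]
    return k / (k - 1) * (1 - sum(var(item) for item in items) / var(totals))

# three items, four respondents (invented data)
alpha = cronbachs_alpha([[2, 4, 3, 5], [3, 4, 2, 5], [2, 5, 3, 4]])  # about 0.89
```

In practice, such criteria are of course computed with the procedures prescribed by the scale's manual; the sketch only shows what the reported figure measures.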
Specification of the study design
A study design (experimental design) is the definition, structure, and set of instructions for implementing a methodology of detection or proof. In practice, an experimental design therefore always takes precedence over data collection, inferential statistics, and result interpretation (see e.g. Rasch et al., 1999, 1996; Sarris, 1992; Hager, 1987, 55ff.; Meinert, 1986). An experimental design should be defined in such a way that it is capable of detecting what the investigator claims to be able to detect.
The experimental design transfers a hypothesis into a practical testing
process. Definition and structure reflect the hypothesis under investigation
(including question, elements, direction of question). By instructions for
action we mean, among other things, processes of sampling and assigning
cases (subjects) to groups (e.g. randomization, parallelization, matching). The
concrete procedure of the experiment can be carried out in detail in many
variants (cf. Sarris, 1992, Chapter 13.1). There are different criteria to
classify experimental designs in types of designs (experimental, repeated
measurement, time-series, block, factorial, etc.). In clinical trials, for
example, a similar distinction is made between phase I-IV studies, in
epidemiology, for example, between cohort, case-control or intervention
studies (see DFG, 2005; ICH Guidelines E6(R1), E8 and E9; Cleophas et al.,
2002; Rasch et al., 1999, 1996; Sarris, 1992; Meinert, 1986).
An experimental design generally describes the allocation of subjects to
factors (e.g. treatments). By varying the independent variables (IV, factors) in
levels, the aim is to control and investigate their influence on so-called
dependent variables [DV]. The factors of an experimental design are given by
the hypothesis under investigation, the most important factors being, for
example:

Treatment factors: e.g. treatment A vs. B vs. C, with vs. without medication, etc. (application: comparison between two or more hypothesis-relevant factor levels).
Experimental factors: e.g. treatment at t1, t2, ..., tn (application: repeated measurement in the sense of a before-and-after comparison; syn.: time factors, trial factors); repeated measurements of trials therefore also involve person factors.
Person factors: e.g. person at t1, t2, ..., tn (application: repeated measurement; eliminating individual interference (noise) variance when the same persons are measured repeatedly).
Block factors: application: elimination of interference (noise) by homogenizing cases into blocks using variables that correlate with the DV of the experiment. Depending on the research question, these can be variables such as age, gender, income, etc. Baseline differences between the experimental groups are thus reduced, and effects in the DV become more evident, since the block factors adjust them for pre-experimental differences (interference, noise).
With the respective (factorial) experimental design, a first interpretation framework is thus also defined for the validity and reliability of the obtained results, and in principle for the assessment of whether the applied statistical methods are permissible at all (although the latter should be clarified and defined a priori in the form of the so-called study protocol).
For the verification of a condition factor (e.g. in a two-group comparison: treatment A vs. B), a t-test or an ANOVA is often used as the inferential statistical test procedure. But what is the prerequisite for the treatment to be interpreted as the only causal factor? The often overlooked prerequisite is that all other interfering (noise) factors are excluded or absolutely identical in both groups. Thus, statistically significant differences between two groups may well be caused by effects other than those investigated. Multifactorial
experimental designs have the additional advantage that they allow, among
other things, to model interactions. If the individual factors of different
design types are mixed, one speaks of so-called “mixed designs”. The
increasing complexity of experimental designs is often accompanied by a
larger number of required test persons or an increasing complexity in the
interpretation of the (content-related, statistical) results (Sarris, 1992,
Chapters 13, 14).
An experimental design thus supports the correctness of the selected statistical test procedure, of the parameters determined, and of the assessment of the appropriateness of inferential statistical interpretations. Documentation of the experimental design is therefore essential, especially because the experimental design must be absolutely consistent with the structures and results of other studies (e.g. effect sizes, see below) in a (meta-analytical) comparison.
The less well controlled or controllable designs include the so-called quasi-experimental design (not to be confused with the semi-experimental design), the correlational design, and the ex post facto design (cf. Sarris, 1992, Chapter 15). When interpreting results in the framework of a quasi-experimental design, confounded variables and possible validity-impairing effects of interfering (noise) factors (e.g. time, maturation, test practice, selection, and dropout effects) must be taken into account.
With ex post facto designs (and thus the non-random allocation of cases to groups and the absence of experimental variable manipulation), there is a risk of causally overinterpreting the correlational data basis, also because confounding factors cannot be (completely) controlled statistically after the fact; this risk exists e.g. in database marketing by means of data warehouses or in epidemiology in so-called retrospective or prospective studies.
Correlational designs likewise do not support causal theoretical statements. So-called "pre-experimental designs" are not suitable for a valid hypothesis test, since interfering (noise) variables are not controlled and therefore effects other than those investigated can be responsible for the results (cf. Sarris, 1992, Chapter 12).
Justified specification of decision parameters
Depending on the research question and the experimental design, decision parameters include general specifications such as the hypotheses themselves, possible test distributions, the distinction between nondirectional and directional hypotheses, and parameters such as the (usually single) alpha level fixed before the hypothesis test. The hypotheses themselves must be formulated and applied in such a way that they actually test what they claim to test. In order to avoid a significance-relevance problem when interpreting an inferential statistical result, parameters for the alternative hypothesis that are of practical relevance in the given situation must be defined already at the time of hypothesis formulation.
For example, in the case of a Pearson correlation, an expected coefficient >=
0.4 or, in the case of a difference (depending on the measurement unit and
expected magnitude), a value such as >= 15.00 can be specified. At this point,
we would also like to refer to the further explanations on the function and
interpretation of the classical hypothesis test on the following pages.
Depending on the method used, further specifications must be made, e.g. for an analysis of variance the type of sum-of-squares calculation (type I, II, III, or IV), or for a binary logistic regression which group is the comparison group and which is the reference group. In the case of multiple tests, alpha corrections may need to be made (e.g. in correlation analysis).
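Such an alpha correction can, for example, follow Holm's step-down variant of the Bonferroni procedure. A minimal sketch (our own function, illustrative Python rather than an SPSS routine):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down correction: returns reject/retain decisions for H0,
    in the original order of the p-values."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m = len(p_values)
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # once one test fails, all larger p-values fail as well
        reject[i] = True
    return reject

decisions = holm_bonferroni([0.001, 0.040, 0.030])
# -> [True, False, False]: only the smallest p-value survives the correction
```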
Selection, justification and documentation of the chosen statistical
method
The choice of the appropriate statistical procedure (and thus of the model) does not depend on the available SPSS menus or procedures or on "recipes", but specifically on content-related and methodological aspects, e.g. on the research question (hypothesis; there are different types, e.g. difference vs. relationship), but also on definitions to be specified, such as the measurement level of the data, distributions, transformations, the (in)dependence of the data, the modeling of main and interaction effects, and much more. The selection should be made in consultation with an experienced methodologist or statistician (for a first rough conceptual orientation see e.g. Schendera, 2004, 101). For special questions, it is possible that the required statistical procedures are not implemented in the standard software; in this case, special analysis software should be used. The procedure for model specification and inferential statistical hypothesis testing usually corresponds to a stepwise increase in complexity.
Three inadequate approaches are to be avoided (they also count as unprofessional conduct or even scientific misconduct, see 19.4):

Inadequate complexity reduction ("too little"): A complex statistical (multivariate) problem is analyzed only with the simplest or readily available means. Working only with (apparently) more easily interpretable bivariate (possibly even only descriptive) procedures, due to a (perhaps even openly admitted) lack of understanding of the complexity of multivariate inferential statistical procedures and parameters, demonstrates ignorance both of the possibilities of multivariate statistics and of the limits of (apparently) more easily interpretable bivariate approaches (e.g., in multivariate variable relations, the problem of artifactual results). This behavior also includes limiting the "assessment" of the adequacy of statistical procedures to their availability in software (menus, procedures) instead of carrying out software-independent evaluations based on content-related and statistical criteria, as should actually be done. Wilcox (1999) wonders how many phenomena have not been discovered simply because users were and are not able to exploit and interpret the possibilities of (not necessarily complex multivariate) statistics.
Inappropriate complexity enrichment ("too much"): A simple question is analyzed only with the most sophisticated methods. This approach poses several problems. No added value in knowledge: with the same data and constellations of variables, sophisticated procedures necessarily come to the same result, e.g. a multinomial regression compared to a Chi² approach. Waste of resources: the consequence is an effort that is not justified by the result and even carries certain risks. Susceptibility to errors: interpreting the parameters (and checking the prerequisites) of complex procedures is more error-prone than for simpler procedures. The view that statistics is limited to clicking on menus, in particular, commits the cardinal mistake of not checking the quite complex requirements of sophisticated procedures and of directly presenting the results obtained as reliable and significant. With initially simpler procedures, a user would still have the opportunity to detect data phenomena showing at an early stage of the analysis that no substantial results are to be expected. I am aware of a case in which a complex path model was published as significant, although ex post explorations showed that not even substantial correlations or partial correlations were present. The use of "exotic" special procedures, above all while suppressing any statement of reasons and requirements with the sole aim of being able to present a significance at any price, is to be avoided altogether. The latter expressly does not mean the appropriate and transparent application of special procedures and their prerequisites, but a "running for significance" wrapped in pseudo-complexity with the aim of deliberately misleading the scientific public.
Unknown complexity ("blind flight"): A question is analyzed without checking whether the procedure is even appropriate. There are several variants of this approach, e.g. (a) confusing the ease of use of a statistical software with statistics (e.g. because of the assumption that statistics corresponds to the unrealistic, ideal-typical analyses presented in predominantly "naive" software manuals) and, above all, (b) the unchecked adoption of procedures from third-party projects (e.g. internet, institute, team). Doctoral and diploma students, for example, often orient themselves on the work of their predecessors. There is nothing wrong with this in itself. You should not immediately assume that the approach of your predecessors or colleagues is wrong (but you should check it as a precaution). Keep in mind, however, that your predecessors chose their procedures for their own data with their own particular distributions. What may have been correct for your predecessors may be completely wrong for the characteristics of your own data. This behavior also includes the tendency to (c) evaluate questions only by means of the statistical procedures provided in SPSS procedures or menus, which do not necessarily include the appropriate procedure; SPSS syntax, however, basically puts users in the position of programming the actually appropriate statistical procedure for the given question themselves.
Preferable to all of this is the appropriate application of (perhaps even) simpler procedures, carried out with all the care the craft demands.
When specifying the statistical procedure, state not the name of the SPSS procedure (menu) used, e.g. "GLM", but the concrete statistical method, e.g. "univariate analysis of variance", "multivariate analysis of variance", or "repeated measures ANOVA".
Documentation of the tested assumptions (requirement testing)
The requirements of the statistical methods used must be checked and met, e.g. via the results of the goodness-of-fit tests returned by SPSS. It should be noted that multivariate data analysis in particular is a complex, iterative process (Menard, 2001²; Chatterjee & Price, 1995²).
Even with fundamentally correct procedures, suboptimally specified and/or computed models or values cannot be avoided in the beginning. Numerous program packages (such as SPSS) are designed in such a way that, in many cases, they return the results of built-in requirement tests only in the form of key figures (e.g. residuals) after an inferential statistical analysis, which indicate only ex post (sic!) whether the analysis should have been performed in this way at all (and even this is not the case with every inferential statistical procedure).
Moreover, not all necessary requirement tests are computed automatically when a procedure is requested: some must be calculated separately, or are not even implemented in the respective program package. The use of special procedures must be specially justified. The necessary requirements have to be checked, and the results have to be reported in detail, in order to exclude a procedure-based "running for significance".
Specification and verification of possible interference (noise) effects
The goal of inferential statistics is not to calculate results into existence, but to subject them to critical examination. The effect of possible interference (noise) variables must therefore be modeled and tested. Hypothesis testing, or the safeguarding of results, also includes the consideration of possible interference (noise) effects, e.g. in the form of main and interaction effects. A suitable approach is a stepwise increase of complexity in model specification and hypothesis testing.
Significance is not the measure of all things
The measure of significance is the consequence of a schematized "either/or" decision and is highly problematic when applied without reflection. Treat non-significant results initially with the same importance as significant results. The 'running for significance' in research and publication, lamented for decades, is based on an incorrectly interpreted logic of the significance test (e.g. Gigerenzer, 1999, 607-618; Carver, 1993; Begg & Berlin, 1988; Witte, 1980, 51-59; Kriz, 1973, Chapt. 5; Bredenkamp, 1972).
The classical significance test has several statistical-conceptual peculiarities
that impair its practical usefulness (Beaulieu-Prévost, 2006, 14-15; Cumming
& Finch, 2005; Nickerson, 2000) and consequently encourage erroneous
interpretations:

Dependence on sample size: If the sample is large enough, a


statistically significant result will always result (with the one
exception that the effect size is zero), no matter how small resp.
irrelevant the empirical difference.
The null hypothesis is unrealistic: The null hypothesis (against
which is being tested) assumes that there is absolutely no relationship
or difference. Especially in the biological and social sciences, the
variables under investigation are always related in some way or differ
from one another. The practical consequence is this: because at least a small
effect could have been expected, since variables correlate or differ from each
other a priori, a statistically significant result cannot necessarily be
considered to support a hypothesis. In a first, empirical sense, the null hypothesis
(because it is unrealistic) is therefore unlikely, but also in a second,
statistical sense:
The null hypothesis is logically less likely from the outset: The
hypothesis that something is equal to 0 thus denotes a single point (a so-called
point hypothesis) on a theoretically infinite continuum. Thus
H0 is statistically less probable. All other points on the continuum
except 0 are covered by the alternative hypothesis, H1, a range
hypothesis. Thus the inherent test logic is to compare all logically
possible points for H1 with the only point for H0. The consequence of
this is that the alternative hypothesis H1 is more probable than the
null hypothesis from the beginning (theoretically always).
The alternative hypothesis is not falsifiable: The alternative
hypothesis that is actually (predominantly, but not exclusively) of
interest is therefore in principle not falsifiable by means of the
significance test. The significance test is directed against the null
hypothesis; it was not developed to prove the alternative hypothesis.
"We can never prove a theory although we can refute it (…). Using
significance testing to appraise the validity of a scientific hypothesis
implies using a decision criterion (i.e. the p value) that confounds
effect size and sample size to test a hypothesis already known to be
false and unrealistic" (Beaulieu-Prévost, 2006, 15).

Precisely because the significance test has fundamental conceptual problems,
it should be applied all the more carefully and interpreted correctly within the
permissible framework.
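The dependence of the p-value on sample size described above can be sketched numerically. The following minimal Python example (our illustration, not part of the book's own SPSS material; all numbers are assumed) holds a tiny observed effect fixed and lets only N grow:

```python
# Illustration (assumed numbers): how a fixed, practically irrelevant
# observed effect becomes "statistically significant" purely through
# sample size, using a one-sample z-test with SD 1.
from math import erf, sqrt

def two_sided_p(d, n, sd=1.0):
    """Two-sided p-value of a one-sample z-test for mean difference d."""
    z = abs(d) / (sd / sqrt(n))           # the test statistic grows with sqrt(n)
    phi = 0.5 * (1.0 + erf(z / sqrt(2)))  # standard normal CDF
    return 2.0 * (1.0 - phi)

d = 0.02  # a difference of only 0.02 standard deviations
for n in (100, 10_000, 1_000_000):
    print(n, round(two_sided_p(d, n), 4))
```

With N = 100 the same observed difference is far from significance; with N = 10,000 it already falls below 0.05, and with N = 1,000,000 the p-value is indistinguishable from zero, although the effect never changed.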

19.3 Criteria for the quality of communicating results
Avoid misleading illustrations
Statistical diagrams must be appropriate to the scale level (e.g. box plots for
continuous data), to the measurement variation of the data (e.g. pie charts),
and to the number of variables to be represented (uni-, bi-, multivariate) (see
Wilkinson, 2005; Schendera, 2004, Ch. 23).
In individual diagrams, much can be done right (and possibly wrong). For
example, a y-axis should start at 0. If several illustrations of the same type,
e.g. bar charts, are positioned next to each other, then the y-axes should at
least each start at the same value (offset) and have the same scaling. For
examples of the more sophisticated design of presentations (so-called
information design) in general, we refer to the work of Edward R. Tufte (e.g.
2003, 2001, 1997); for maps in particular, we refer to Monmonier (1996). In
"Visual and statistical thinking" Tufte (1997) demonstrates, for example, how
a suboptimal visual presentation ("chart junk") could not adequately convey
the significance of a known technical defect and subsequently allowed
Challenger mission STS 51-L to take off. The space shuttle broke apart
shortly after launch precisely because of this defect.
Descriptive statistics
The descriptive parameters to be specified must fit the measurement unit
under investigation, but also the hypothesis. In any case, the unit of the
displayed variables must be clear from the surrounding text, e.g. €, km/h, etc.
For inconspicuously (approximately normally) distributed metric variables
(e.g. in the unit "kg"), an arithmetic mean can be given; if, however, the
temporal development of a variable is to be described, the geometric mean
should be used (cf. Schendera, 2004, Chapter 11.3). In general, when
specifying location measures (e.g. arithmetic mean), a dispersion measure
(e.g. standard deviation) has to be added; when specifying frequencies, the
corresponding percentage values. Basically: no inferential statistics without
systematic and exhaustive descriptive statistics!
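The distinction between the arithmetic and the geometric mean for temporal developments can be illustrated with a minimal Python sketch (the annual growth factors are hypothetical):

```python
# Sketch (assumed annual growth factors): for temporal developments the
# geometric mean, not the arithmetic mean, gives the correct average
# growth per period.
from math import prod

growth_factors = [1.10, 0.90, 1.20]  # +10 %, -10 %, +20 % per year

arithmetic = sum(growth_factors) / len(growth_factors)
geometric = prod(growth_factors) ** (1 / len(growth_factors))

# Compounding the geometric mean reproduces the actual total development;
# compounding the arithmetic mean overstates it.
total_actual = prod(growth_factors)                   # 1.188
total_from_geo = geometric ** len(growth_factors)     # 1.188
total_from_arith = arithmetic ** len(growth_factors)  # > 1.188
print(round(total_actual, 3), round(total_from_geo, 3), round(total_from_arith, 3))
```

The arithmetic mean of the three factors (about 1.067 per year) would imply a total growth of about 21 %, although the quantity actually grew by only 18.8 %; the geometric mean reproduces the true development exactly.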
Inference statistics (model)
The presented inferential statistics (model) must fit the research question
(hypothesis). The statistically tested hypothesis has to be described clearly.
Decisive parameters, such as measurement level, distributions, parameters
of the hypothesis, alpha, or the sum of squares type must also be given and, if
necessary, the iterative nature of the approach in the analysis (e.g. in
regression analysis, before and after the removal of cases with high leverage
and influence values). For reasons of space, we cannot go into further procedure-
specific necessary checks. Keywords are, among others, the variation of
dummy codes, checking for multicollinearity or heteroscedasticity in multiple
linear regressions (if necessary, also including transformations towards
linearity of the relationships), checking for stability and validity of cluster
solutions, etc. Ingeborg Stelzl (1982) demonstrates in her still highly
recommended publication that even apparently simple inferential statistical
methods can harbor various errors and traps.
Correct specification of p-values, test values, confidence intervals and
effect sizes
The general dynamics of the data situation is more important than an isolated
p-value. If something is real, this is also noticeable in the form of other
parameters. Give your audience the opportunity to independently assess these
parameters. Provide exact p-values instead of an unspecific "< 0.05", or
(even worse) only in the form of *, ** or ***. The result should be indicated
with the expression "statistically significant" instead of just "significant".
This formulation makes it easier to distinguish "statistically significant" from
"practically significant". Statistical significance does not necessarily imply
practical (e.g. economic, clinical, etc.) significance. Also provide the test
values calculated (e.g., F, Chi² values, etc.). Also provide confidence
intervals (e.g. Beaulieu-Prévost, 2006). Confidence intervals are often more
informative than significances (see Cumming & Finch, 2005, Appendix),
which may sometimes be the reason why they are not provided (Reichardt &
Gollob, 1997). Statements based on confidence intervals are mathematically
equivalent to significance tests. If the confidence interval excludes the value
0, this corresponds to statistical significance (Beaulieu-Prévost, 2006, 11-12;
Schendera, 2004, 420-421). However, in contrast to significance tests, the
usefulness of confidence intervals increases with sample size, whereas that of
significance tests decreases; with large amounts of data, a significance test
becomes significant even with very small differences and is therefore
practically useless (e.g. Ioannidis, 2005, 700; Quatember, 2005, 128-150).
Confidence intervals are essential for
large and very large samples. If your hypothesis allows it, you should also
include measures of the so-called "practical significance" (syn.: effect sizes)
(for (methodological) alternatives see also Nickerson, 2000, 274-277;
Prentice & Miller, 1992).
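The mathematical equivalence of confidence intervals and significance tests noted above can be sketched as follows. This is a Python illustration with assumed numbers; `ci_and_p` is a hypothetical helper, not an SPSS procedure:

```python
# Sketch (assumed numbers): a 95 % confidence interval for a mean
# difference, and its equivalence to a two-sided z-test at alpha = 0.05.
# The interval excludes 0 exactly when p < 0.05.
from math import erf, sqrt

def ci_and_p(diff, se, z_crit=1.959964):
    """Return the 95 % CI and the two-sided z-test p-value for a difference."""
    lower, upper = diff - z_crit * se, diff + z_crit * se
    z = abs(diff) / se
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2))))
    return (lower, upper), p

# Example 1: observed difference 2.5 kg, standard error 1.0 kg
(lo, hi), p = ci_and_p(2.5, 1.0)
print((round(lo, 2), round(hi, 2)), round(p, 4))   # CI excludes 0, p < 0.05

# Example 2: same difference, larger standard error 1.5 kg
(lo2, hi2), p2 = ci_and_p(2.5, 1.5)
print((round(lo2, 2), round(hi2, 2)), round(p2, 4))  # CI includes 0, p > 0.05
```

Beyond the mere yes/no decision, the interval also communicates the precision of the estimate, which is exactly the additional information the text recommends providing.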
Statistical significance only states whether an effect occurs; measures of
practical significance, however, indicate how large resp. strong this effect is
(hence the synonymous terms effect strength, effect size). However, the
measures of practical significance do not replace the content-related
definition of what constitutes "significance". Measures of practical
significance thus describe the size of a possible difference or relationship,
e.g., independently of the sample size (Keppel & Wickens, 2004⁴, 159).
Depending on the research question (type of hypothesis), scale level, and
application area, different measures are possible resp. necessary (Olejnik &
Algina, 2003, 2002; Richardson, 1996). A distinction is made between effect
sizes for the population or for samples, between effect sizes for difference or
correlation hypotheses, or between non-standardized and standardized effect
sizes. The specific measure should be appropriate to the research question.
Effect sizes are usually only stated for the central results; when specifying the
effect size, the study design should always be included. Keppel & Wickens
(2004⁴, 166-167) recommend Cohen's d, for example, for the comparison of
several groups, while the correlation ratio, e.g. Eta², is recommended for
correlation hypotheses (see, however, the restrictive notes below).
Eta² can be requested via ANOVA, GLM or MEANS; Cohen's d is not
calculated by SPSS, but can be determined manually, e.g. from the partial
Eta² (option ETASQ in GLM) (see Cohen, 1988, 276-280). The advantage of non-
standardized measures (such as non-standardized regression coefficient B,
the mean difference for continuous variables or the odds ratio for category
variables) is that they can be interpreted directly. The advantage of
standardized measures is that they can be directly compared with each other
within a sample (not necessarily between different studies or study designs,
see below); these include the standardized regression coefficient beta,
Cohen's d (defined as the standardized mean difference between the means of
two continuous variables relative to the common standard deviation) or e.g.
R² / Eta² (the latter are synonymous in certain contexts, see Hays, 1988⁴,
369). Furthermore, a distinction must be made between statements about
samples or populations. Cohen (1988), for example, originally developed d
only for the population and not for the sample. Similarly, Epsilon² and
Omega² are related to the population, whereas Eta² is related to the sample
(see also Diehl & Arbinger, 2001³, 650; Keppel & Wickens, 2004⁴, 163).
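As a complement to the SPSS options mentioned above, the hand calculation of Cohen's d (standardized mean difference with pooled standard deviation) and Eta² (between-groups share of the total sum of squares) can be sketched in Python. The data are hypothetical, and this is our illustration rather than the book's own syntax:

```python
# Sketch (hypothetical two-group data): hand calculation of Cohen's d
# and Eta-squared, the kind of computation the text says SPSS does not
# provide directly for d.
from statistics import mean

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va = sum((x - mean(a)) ** 2 for x in a) / (na - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (nb - 1)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd

def eta_squared(a, b):
    """Between-groups sum of squares divided by the total sum of squares."""
    grand = mean(a + b)
    ss_between = len(a) * (mean(a) - grand) ** 2 + len(b) * (mean(b) - grand) ** 2
    ss_total = sum((x - grand) ** 2 for x in a + b)
    return ss_between / ss_total

g1 = [5.0, 6.0, 7.0, 8.0]
g2 = [3.0, 4.0, 5.0, 6.0]
print(round(cohens_d(g1, g2), 3), round(eta_squared(g1, g2), 3))
```

Note that, as discussed in the text, these are sample-based values; for small hypothetical samples like these, both measures can substantially overestimate the corresponding population effects.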
However, there are also some particularities to be considered with regard to
the measures of practical significance. Depending on the type of hypothesis,
the scale level, or the scope of statements, the appropriate measures for
specifying effect strengths resp. practical significance must be applied. d or
Eta² can be impaired by sampling errors, for example, and thus overestimate
effects in the population (Keppel & Wickens, 2004⁴, 164). Epsilon² and
Omega² are considered unbiased estimators, provided that balanced sample
sizes are available and that the k populations are normally distributed and
show the same variance. According to Diehl & Arbinger (2001³, 673-676), Eta² should
also hardly differ from Epsilon² or Omega² and provide sufficient protection
against a hasty overrating of statistical significance in large samples.
Accordingly, determining effect sizes is only recommended for large
samples, but is not very informative for small samples (Olejnik & Algina,
2000, 280). In the case of statistical significance in small samples, Diehl &
Arbinger (2001³, 675) even explicitly warn against prematurely equating high
point estimators for effect sizes with real effects in the population. Even for
effect sizes it holds that they do not really solve the "significance/relevance"
problem. Effect sizes, as the statistical reflection of practical effects, are only
useful if the size of an effect corresponds to its practical relevance. In fact
(analogous to statistical significance), there can be a gap between practical
"significance" and practical relevance even for effect sizes. Olejnik & Algina (2000,
277) cite, for example, a study by Feldt (1973) according to which practically
relevant high effects, if they had been interpreted against the background of
Cohen's conventions, would have turned out to be medium or even low
effects.
Effect sizes from different studies cannot necessarily be easily compared
with each other (Olejnik & Algina, 2000, 280-282); their calculation is
influenced by parameters such as study design (random vs. fixed effects,
unequal number resp. spacing of factor levels, different operationalization,
different model specification, etc.), sampling errors, sample size, and
possibly variants in their calculation (cf. Maier-Riehle & Zwingmann, 2000).
The calculation and comparability of effect sizes can be massively impaired
by different reliability of the central dependent variables or by a different
heterogeneity of the population. Most of the effect measure variants
(depending on the study design) cannot be determined directly in SPSS, but
by syntax. Only with substantial knowledge of meaning and limits, the
specification of an effect measure in addition to the test value (e.g. F-value,
Chi²-value etc.) and statistical significance (p-value) contributes to a better
communication of a research result.
Correct interpretation of significance: Significance is not equal to
significance
In interpreting the statistical significance test, so many errors are made
(Gigerenzer, 1999, 612-614; Witte, 1980, 51-60) that Gigerenzer (ibid., 612)
even speaks of "collective illusions". The reasons for this may be found in
the area of tension between serious gaps in the foundations of statistics on the
one hand (Beaulieu-Prévost, 2006; Wainer & Robinson, 2003; Nickerson,
2000; Gigerenzer, 1999; Krantz, 1999; Cohen, 1994) and insufficient training
in research methods and statistics to be able to critically evaluate them on the
other hand (Spiess, 2006; Krämer & Gigerenzer, 2005; Rost, 2005; Ludwig-
Mayerhofer, 2003; Pötschke & Simonson, 2003; Rohwer & Pötter, 2002;
Schnell, 2002; Haller & Kraus, 2002; Müller, 2001; von der Lippe, 1998;
Krämer, 1995; but also already e.g. Kriz, 1981, 1973). On the history, logic
and terminology of the significance tests of Fisher and of Neyman and
Pearson, we would therefore recommend Gigerenzer (1999) and Witte (1980).
The scheme of the "classical" test of significance is technically (sic)
roughly as follows: Based on the question (hypothesis) and the scale level, a
theoretical distribution form is specified; on this basis, SPSS internally
calculates a reference value. The value based on the observed distribution,
e.g. F or Chi² value (indicated by SPSS), is now compared with this reference
value (not indicated by SPSS). If the observed test value differs from the
reference value, the null hypothesis is usually rejected and the alternative
hypothesis is accepted. Significance is the probability that random errors
cause the difference between the expected and observed values of the test
distribution.
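The scheme just described, i.e. comparing the observed test value with the internal reference (critical) value of the theoretical distribution, can be made concrete with a hypothetical 2×2 chi-square example. This is a Python sketch with assumed counts; for df = 1 the critical value follows directly from the standard normal distribution:

```python
# Sketch (hypothetical 2x2 counts): the mechanics of a "classical"
# significance test. Compute the observed test statistic and compare it
# with the critical (reference) value of the theoretical distribution.
# For df = 1, a chi-square variable is the square of a standard normal z,
# so the critical value is 1.959964**2 and p = 2 * (1 - Phi(sqrt(chi2))).
from math import erf, sqrt

observed = [[30, 20], [20, 30]]  # assumed 2x2 contingency table

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)

chi2 = sum(
    (observed[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
    for i in range(2) for j in range(2)
)
critical = 1.959964 ** 2  # alpha = 0.05, df = 1
p = 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(chi2) / sqrt(2))))

print(round(chi2, 2), round(critical, 2), round(p, 4))
# H0 is rejected exactly when chi2 > critical, i.e. exactly when p < 0.05.
```

SPSS reports the observed chi² and the p-value but not the critical value; the sketch makes the hidden comparison explicit.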
However, significance tests were not designed to answer the question of
whether a hypothesis is right or wrong. Significance tests therefore do not
prove a hypothesis. Significance tests do not themselves test hypotheses, but
only state the probability (p-value, exact significance level, after the test) that
random errors in a given hypothesis, e.g. H0, will cause the difference
between the expected and observed values of a test distribution. Alpha
(conventional significance level, before the test) is thus used to decide "only"
the difference between observed and expected values of a test distribution,
not between the hypotheses themselves. Thus, most interpretation errors
occur when transferring the result of the test distributions to the hypotheses.
The hypotheses themselves cannot as easily be rejected or kept on the basis
of p-values for test distributions, as the conventional decision scheme
suggests (see also Nickerson, 2000, 242-244).

Test decision vs. the unknown reality of H0 resp. H1:

                   H0 not rejected               H0 rejected
                   (H1 rejected)                 (H1 not rejected)

H0 is true         Correct decision.             Type I Error (α): rejecting a
                   Specificity of a test: 1 − α  null hypothesis when in
                                                 reality it is true.

H1 is true         Type II Error (β): keeping a  Correct decision.
                   null hypothesis when in       Power of a test: 1 − β
                   reality it is false.
A significant p-value says nothing about the probability of H0, H1 or the size
of an effect. For example, a significant p-value is not (Gigerenzer, 1999, 612-
614; Dar et al., 1994, 76; Witte, 1980, 51-60):

A probability that the null hypothesis (H0) is correct. Thus, in case of
an "accidental" rejection of H0 (Type I Error), you do not know the
probability that this decision was wrong.
A measure for the alternative hypothesis (H1), e.g. the probability of
its correctness, but always against the (normally less interesting) H0.
H1 is not proven by a significance. Thus, one cannot conclude from a
significant p-value that it is a non-random deviation from H0.
Therefore, it is not possible to deduce from a significant p-value how
probable H1 is.
A measure for the effect of a variable in an experiment (this depends,
among other things, on the specified model, power of the test and/or
also on N).
A measure of confidence for the repeatability of the observed effect
under the same experimental conditions; e.g. you cannot deduce from
p=0.05 that you will get a significant result in 95% of the repetitions.
An indication that a result corresponds to the "standard" (however
defined) and should therefore be preferred to a non-significant result
at publication.

A significant p-value is in principle only very rough, not to say misleading,
information. Significance is directed against the null hypothesis. The
probability for accepting the actually (predominantly, but not exclusively)
interesting H1 remains undetermined. Significance can even be caused by
sufficient sample size alone (Bortz & Lienert, 1998, 40-41). Especially in the
case of large to very large data volumes, a significant result can be caused by
the sample size alone and is therefore hardly a valid indication of a more or
less small difference or relationship (Witte, 1980, 55-56). Significances can also
occur by chance (see alpha error inflation in multiple testing, e.g. in
correlation). More helpful are confidence intervals or effect sizes. In the new
APA Publication Manual (5th Edition), the specification of effect sizes has
been mandatory for several years (American Psychological Association,
2001). As already explained above, it is highly problematic to infer the
theoretical and practical significance of findings from their statistical
significance. However, even alternative approaches such as measures of
effect size resp. practical significance (see Bredenkamp, 1969) are purely
statistical procedures that cannot replace an inevitable theoretical and
practical evaluation of a finding (partly because of certain limitations, e.g. the
dependence on the spread). By the way, a significant result with an incorrect
direction of a difference or relationship is also called a Type III Error.
A non-significant p-value also says nothing about the probability of H0, H1
or the size of an effect. A non-significant p-value, for example, is not
necessarily:

A probability that the null hypothesis (H0) is wrong. Thus, if you
"accidentally" keep H0 (Type II Error), you do not know the
probability that this decision was wrong.
A measure against the alternative hypothesis (H1) e.g. the probability
of H1 being false, but always for H0. H0 is also not proved by a non-
significance. The unsuccessful rejection of H0 does not imply the
confirmation of H0. Thus, it is not possible to conclude from a
nonsignificant p-value that it is a random deviation from H0. It is also
not possible to deduce from a nonsignificant p-value, how probable
H1 is.
No absolute exclusion of the validity of the actually (mainly, but not
exclusively) less interesting null hypothesis (depending on the effect
size).
An indication that no effect is present. A non-significant p-value is
not necessarily an indication that no effect exists. On the contrary:
In small samples, non-significant results are often reported, especially
when real, relevant effects are involved (cf. Bortz & Lienert, 1998).

For a hypothesis test, this means that a justified statement can be made using
a significance if the hypothesis is well-founded, the finding is ideally
confirmed i.a. by a confidence interval and a substantive effect is present.
A non-significance is also revealing, since it does not exclude plausibility
resp. a statistical effect (especially in the case of a too small sample), and
possibly not even the validity of the null hypothesis itself. From both it can
be concluded that two effects can be quite similar to each other: one effect
reaches significance (possibly because of a larger sample), the other does not.

Note: Please note in this figure that no explicit boundaries have been drawn
between the individual decision options; this is to illustrate that the transitions
between the individual decision options are not always simply disjunctive,
but can often be complex and high-dimensional. Therefore, even in scientific
decision making a situation can occur where arguments for a decision against
(or even for) H0 are not clear.
The direct comparison of non-significant with significant p-values within and
between separate analyses is not possible using the output significance value
alone. The comparison of a non-significant p-value (e.g. 0.06, e.g. in Study
A) with a significant p-value (e.g. 0.04, e.g. in Study B) is, despite the
absolutely identical operationalization of the two studies (except for the
underlying sample itself, e.g. with regard to N), only possible by considering
further parameters such as effect size or N.
An obvious approach is to check the statistical significance of the difference
(e.g. Study B-A) instead of a direct, purely descriptive comparison of both
significance values. This applies to the comparison of significances between
studies (assuming identical operationalization) as well as to the comparison
of significances within an analysis, e.g. of predictors in multiple linear or
logistic regressions.
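Such a test of the difference can be sketched as a z-test for two independent estimates. This is a Python illustration; all effect estimates and standard errors are assumed, and identical operationalization of the two studies is presupposed:

```python
# Sketch (assumed numbers): "significant vs. non-significant" is not
# itself a significant difference. Two studies with p = 0.04 and p = 0.06
# can have nearly identical effects; test the DIFFERENCE of the
# estimates instead (z-test for two independent estimates).
from math import erf, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2))))

# Effect estimates and standard errors from two hypothetical studies
b_a, se_a = 1.88, 1.0  # Study A: z = 1.88 -> p ~ 0.06 (non-significant)
b_b, se_b = 2.06, 1.0  # Study B: z = 2.06 -> p ~ 0.04 (significant)

p_a = two_sided_p(b_a / se_a)
p_b = two_sided_p(b_b / se_b)
# z-test of the difference B - A of the two independent estimates
p_diff = two_sided_p((b_b - b_a) / sqrt(se_a**2 + se_b**2))

print(round(p_a, 3), round(p_b, 3), round(p_diff, 3))
```

Although one study is nominally "significant" and the other is not, the difference between the two estimates is nowhere near significance, which is exactly the point of the comparison recommended above.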
Any empirical reality, especially as a research object in the biological and
social sciences, is defined by the criteria "difference" vs. "non-difference"
resp. "relationship" vs. "non-relationship" between entities. Since empirical
reality is never absolutely invariant, but is always syn- and diachronically in a
dynamic state of change, it would therefore be expected with some
plausibility that the research situation does not run in a one-way direction of
exclusively "difference" resp. "relationship". When interpreting a
significance test, this means that non-significant as well as significant results
have the same value as equivalent manifestations of empirical reality and
should be treated as such, with all due care, in interpretation and
publication (cf. also Nida-Rümelin, 2005). For example, the DFG
recommends that (clinical) studies be published regardless of their result
(DFG, 2005, 4).
Presentation of the hypothesis
In the case of the hypotheses examined, it must be demonstrated in terms of
content that they also test what they claim to test. This, in turn, requires that
an alternative hypothesis is present before each hypothesis test and that this
alternative hypothesis is precisely formulated (Nickerson, 2000, 276). Thus,
H1 would only be accepted if a statistically significant effect is realized
whose explanatory value is also practically significant.
Finally: Do not interpret your results retrospectively using several alphas
(Dar et al., 1994, 76; Witte, 1980, 53), e.g.
0.05 < p < 0.10: "tendentially significant" or "marginally
significant"
0.01 < p < 0.05: "significant" [*]
0.001 < p < 0.01 for e.g. "highly significant" [**], or even
p < 0.001: "extremely significant" [***].
This approach is not correct (see also Rasch et al., 2004). Statistical
significance is only defined a priori using one threshold. Therefore, it is also
a mistake to assume that the smaller a p-value is, the "better" it is in terms of
a larger effect (e.g. Nickerson, 2000, 257). Multiple ex-post levels are not
permitted. SPSS itself assigns such marks, e.g. by means of the procedure
CORRELATIONS. Although such markings may facilitate the readability of
extensive correlation tables, this should not be confused with the ex-post
definition of alphas, as well as a simple interpretation of the significances
(alpha error inflation, effect size). In general, be cautious with the scientific
"language game". Only use terms if you know exactly what they mean. For
example, the term "substantial effect" generally means "practically significant
effect" in research, but can also mean more precisely, depending on the
research context, that an effect that has occurred is at least as great as a
previously defined minimum effect.
Refrain from a biased misleading presentation
Express yourself clearly and distinctly. Be suspicious of any unclear, vague
and all the more so of any biased, misleading presentation. Someone who
expresses himself or herself unclearly either does not know what he or she is
talking about or, much worse, knows exactly and deliberately wants to hide it.
For example, the US government was accused in early 2006 of manipulating
reports on climate change. Scientists working in i.a. the US Environmental
Protection Agency and the space agency Nasa had been systematically
pressured for years to remove terms such as "climate change" or "global
warming" from reports (NCAC, 2007). Because of this and other political
influence, the Bush administration was awarded the "Muzzle" prize by the
renowned Thomas Jefferson Center in 2007.
Another example is a publication in the journal "Nature" by Klimanskaya et
al. (2006, August), a research group led by stem cell researcher Robert
Lanza. In it, Klimanskaya et al. claimed that stem cells can be obtained without
destroying the human embryo. What was hastily hailed as a sensation turned
out, on closer inspection, to be skillful juggling with inaccurate data (not N,
but percentages), misleading illustrations and an overall biased presentation
(e.g. Stollorz, 2006). In fact, the embryos were dissected. Klimanskaya et al.
(2006) had in fact failed to provide credible evidence for their assertion,
instead fueling (additional) distrust of stem cell research. Strictly speaking,
this publication was probably about money: Lanza was affiliated with the
company Advanced Cell Technology (ACT), whose continuously falling
stock prices rose again for a short time after the publication of the article.
Do not use the terms "extremely significant" or "highly significant" either
(see above). Even rhetorical figures, such as the so-called "omnipotent
anecdote" (see the general presentation in Friedmann et al., 2004) have no
place in communicating results in words, text or pictures.
Correct interpretation of results
Results can and may only be interpreted within the limits of the given
theoretical and empirical framework. The framework includes theory,
hypothesis, operationalization, experimental design and the range of data
obtained. However, an interpretation can and should go beyond the simple
"hypothesis accepted"/"hypothesis rejected" scheme. The art is to avoid
overinterpretation (speculation) at the same time (see next point).
Discussion of potential interference (noise) effects
You have to consistently doubt all your results yourself. The goal of a
discussion is not to talk results into being, but to subject them to critical
examination (see also DFG, 1998). Only when they stand up to critical
examination they can and should be communicated as relevant, stable and
reliable. The result of checking the effect of potentially relevant confounding
variables should be discussed.
The previous sections mainly compiled points that should be observed. –
Necessarily, individual basic concepts (e.g. measurement and scaling, cf. e.g.
Velleman & Wilkinson, 1993; Gigerenzer, 1981) could only be touched
upon, but by no means should the conclusion be drawn that these are less
relevant.
The points compiled below should in any case be avoided.

19.4 Criteria for "mortal sins" of professional work
As violations of professional scientific practice, criteria of the Office of
Research Integrity of the U.S. Department of Health and Human Services
resp. the German Research Foundation (DFG, 1998) are listed below; these
include the

invention of data and results,
manipulation of data,
plagiarism, and
other violations of standards relating to the planning, conduct, or
publication of scientific studies.
These criteria are often used to determine scientific misconduct ("fraud"), but
are in principle to be seen as general standards of professional conduct. The
point of this chapter is not to string together scandals that are certainly not
rare when looking at the sciences more closely, for example: Hwang Woo-
Suk, Jan Henrik Schön, the „Sündenfall“ [“fall of man”] Hermann/Brach or,
among others, Cyril Burt's invented twins (see BBC News, 2005; SPIEGEL
ONLINE, 2004; Abschlussbericht [Final Report] 2001; Di Trocchio, 1999;
Finetti & Himmelreich, 1999, especially Erster Teil [First Part]; for further
examples see also Monmonier, 1996; Broad & Wade, 1984, e.g. Chapter 11;
Gould, 1983).
The purpose of this chapter is to raise awareness of problem areas that need
to be dealt with within the challenging and strenuous scientific work
(although one should not actually exclude the media (but not only) and their
often unprofessional handling of scientific reports, see Krämer & Gigerenzer,
2005; Best, 2001; Krämer, 2000; Müller-Ullrich, 1998, especially the chapter
"Auf Klärung aus"; Dewdney, 1994; Huff, 1993/1954). First of all: Work
like a pro. Laliberté et al (2004, 16), for example, explicitly consider
professionalism a basic requirement of data integrity: Develop values,
standards and criteria for yourself and demand them from others. Develop
for yourself the claim to be able to explain, justify and prove every aspect of
your work to yourself and others. Demand this duty of proof from others as
well. First and foremost from those who are supposed to support you. Learn
to distinguish between those who support you in fact and those who are not
able to do so. Always assume (not only in your scientific work) the following
scenario: You must be able to take responsibility for everything you do in
your name and you are capable of doing so. Others should also be able to do
so. Therefore, for these reasons, do not get involved in so-called honorary
authorships, which, by the way, the DFG (1998) also declared "excluded" in
Empfehlungen [recommendations] 11 and 12, albeit for slightly different
reasons. Professional scientific work is generally based on several basic
principles (e.g. DeGEval, 2004, 2002; Deutsche Forschungsgemeinschaft,
1998).
If you are in a more managerial position, you may want to follow some of
the FDA’s recommendations. Above all, create an atmosphere of trust,
provide the necessary resources and support, make responsibilities crystal
clear, and communicate the rules and relevance of trusted data:

Don’t shoot the messenger: Listen and learn.
Expect fraud: Start from assuming fraud and then go into details.
Get technical: Dive deep into data, systems and equipment.
Fill the void: Question missing data, information, etc.
Be curious: Also when it comes to blame shifting.
Cultivate whistleblowers: Whistleblowers are good for you and your
team. Protecting them means protecting others.
Don’t be intimidated: Tell the emperor he has no clothes.

Above all, however, there is honesty (integrity) towards oneself and all
others. Thus, strict honesty and transparency must be guaranteed with regard
to the collection, processing, analysis and publication resp. archiving of data,
including: You work according to all rules of the art ("lege artis"). You
document your results and discuss them critically. Archive your data and
syntax programs and make them available for review or even meta-analysis,
if necessary.
Inventing data or results
Recording, using or publishing data or results that have been deliberately
made up. This includes the replacement of "unwanted" results with desired
results. Depending on the circumstances, these practices may fall within the
context of fraud and/or falsification of documents.
Data manipulation
Data manipulation includes, among other things, the arbitrary

modification,
falsification,
exclusion or suppression (concealment, deletion)

of data(sets), procedures (processes), material and technical equipment (e.g.
quality, calibrations) and/or results, with the intention of presenting a study
or its results differently. Regardless of their format (measurements,
photographs, tape logs, etc.), data may not be manipulated in any way.
Data manipulation also includes serious violations of regulations in the
conduct of experiments, for example

a distorting performance of an experiment with the aim of
obtaining desired results, and/or
the misuse of statistical procedures with the intention of
interpreting data in an unjustified way, and/or
the distorting presentation and interpretation of results and
unjustified conclusions towards desired results.

Depending on the circumstances, these practices may fall within the context
of fraud, breach of privacy policies and/or falsification of documents.
Violations of data protection
Violations of data [privacy] protection include, but are not limited to,

unauthorized use of confidential information (e.g., publication of
third-party data),
falsification of documents (e.g., forging of documents or
signatures), and
falsification of qualifications or references.

Depending on the circumstances, these practices may fall into the context
[examples from the German Criminal Code, StGB: Strafgesetzbuch] of

fraud (§ 263 StGB),
violation of privacy policies (§§ 303a, 303b StGB) and/or
falsification of documents (§ 267 StGB; see also § 268 StGB for
the falsification of technical records, and § 269 StGB for the
falsification of evidential data).
Plagiarism
Plagiarism is generally the appropriation and use of the intellectual property
of third parties, for example

thoughts,
ideas,
observations/results,
(un)published texts,
techniques/processes or data of others,

in order to pass them off as one's own work without appropriate
identification.
Practices of unjustified authorship include, among others,

copying or appropriating text passages of others without proper
citation of the author and the text,
incomplete citation of literature sources in the case of text
passages paraphrased from other authors, as well as
failing to credit collaborators as co-authors despite their
contributions to the publication.

Depending on the context, these practices may fall into the category of
intellectual property theft and/or copyright infringement.
Any other violation of accepted standards
For example, inciting unprofessional behavior (scientific misconduct)
including covering up or concealment, as well as retaliation against third
parties who draw attention to such behavior.
Misconduct and consequences
The respective misconduct may be defined differently depending on country,
jurisdiction, and research context. Especially scientists working
internationally should be aware of the fact that misconduct, depending on the
country-specific jurisdiction, may violate applicable law and may result in
consequences under private and criminal law.
The German Criminal Code (StGB), for example, defines fraud in § 263
StGB as follows: "Anyone who, with the intention of obtaining an unlawful
pecuniary advantage for himself or a third party, damages the property of
another by causing or maintaining an error through the presentation of false
facts or through the distortion or suppression of true facts, shall be punished
with imprisonment for up to five years or with a fine. The attempt is
punishable."
Orig.: "Wer in der Absicht, sich oder einem Dritten einen rechtswidrigen
Vermögensvorteil zu verschaffen, das Vermögen eines anderen dadurch
beschädigt, dass er durch Vorspiegelung falscher oder durch Entstellung oder
Unterdrückung wahrer Tatsachen einen Irrtum erregt oder unterhält, wird mit
Freiheitsstrafe bis zu fünf Jahren oder mit Geldstrafe bestraft. Der Versuch ist
strafbar."
In 1986, the German Penal Code introduced §§ 303a (data modification) and
303b (computer sabotage). According to § 303a StGB, anyone who illegally
deletes, suppresses, makes unusable or alters data is liable to prosecution;
the attempt is also punishable. According to § 303b StGB, anyone who
disrupts data processing of essential importance for a business, company or
authority through changes to the data mentioned in § 303a StGB or through
physical damage or destruction is liable to prosecution; here, too, the attempt
is punishable. Depending on the severity and/or criminal intent, one or more
offences against physical integrity (§§ 223–231 StGB) could also be
punishable, e.g. (negligent, dangerous) bodily injury due to falsified study
results, incorrect therapy recommendations, etc.
Possible reasons for misconduct (be it unintentional misconduct
[malpractice] or intentional fraud) are not considered excuses, but sources of
error that need to be changed, including

(apparent) ignorance or personal attitude,
(apparently) unclearly formulated expectations or non-binding
standards,
behavior or processes (seemingly) tolerated by convention,
a (seeming) lack of penal consequences,
(seemingly) ineffective controls (journals, peer reviews / audit
trails, companies/universities, etc.),
(seemingly) unalterable working or competitive conditions
(deadline / career constraints, etc.),
(apparent) acceptance or coverage of the fraud "from above" (e.g.
by superiors, committees),
etc.

Clichés such as "data always contain errors" or "others do it just as
[wrong]" are system-preserving statements and attitudes. They are to be
rejected because they only serve to distract from the consequences of one's
own incompetent work, or to irresponsibly encourage the production of
erroneous work, data and results.
The scope of the consequences can often hardly be overestimated.
Consequences can range from unpleasant attention from superiors, top
management, sponsors, the media or the public, termination of a job or
contract, and compensatory damages, to a loss of confidence in one's own
work in general and in the scientific community in particular. At this point, it
is also worth remembering the special consequences (not only) in clinical
research (see Chapter 1), for example, when drugs are approved on the basis
of false or falsified data.
The term "fraud" for scientific misconduct may sound harmless, but it is
interpreted by U.S. courts, for example, as the theft of funding. Depending
on the misconduct, sanctions can include

recording of a breach of official duty,
warning,
suspension of doctoral procedures etc.,
revocation of titles,
claims for damages (on the part of sponsors, employers or
publishers),
reimbursement of research funds or grants,
exclusion from review panels,
dismissal, and even
prison sentences lasting several years (e.g. Office of Research
Integrity, 2006, 2; DFG, 1998).
Depending on the country, legal system and research area, one must also
expect to be pilloried. The FDA (USA), for example, collects and publishes
fraud cases in a publicly accessible "misconduct" database. Numerous
institutions also maintain more or less official blacklists or corresponding
databases. In Germany, for example, the so-called "Golden Memory" of the
German Society for Pharmaceutical Medicine collects information on
misconduct and suspected or actual fraud in clinical research, and is often
consulted accordingly before clinical studies are conducted.
Everyone represents the scientific system through their professional activities
in companies, public or academic institutions, etc. Scientific standards must
be adhered to, and exculpatory or pseudo-legitimizing behavior must be
avoided.
Only professional work and communication represent science, and they are
the sole basis for trustworthy research as a system.
Being allowed to do research is a privilege. Being able to do research is a
value. Grant both sustainability through quality.

20 Literature
Abschlussbericht (2001). Abschlussbericht der Task-Force F.H. (Fassung
vom 08.02.2001).
Albert, Hans (1968). Traktat über kritische Vernunft. Tübingen: Mohr.
Allison, Paul D. (2000). Multiple Imputation for Missing Data: A Cautionary
Tale. University of Pennsylvania.
Alsleben, Christoph & Richter, Wolfram F. (2005). Deutschlands
uninformierte Elite: Über das aktuelle Spiegel-Ranking. Forschung & Lehre,
2005, 2, 80–81.
American Psychological Association (2001). Publication Manual.
Washington, DC: American Psychological Association (5th ed.).
Atherton, Mark (2007). Errors up as taxman seeks more power. TIMES
ONLINE (report of 06.07.2007).
Bacher, Johann (2002²). Clusteranalyse. München Wien: [Link]
Verlag.
Bacher, Johann; Wenzig, Knut & Vogler, Melanie (2004). SPSS Two-Step
Cluster – A First Evaluation. Friedrich-Alexander-Universität Erlangen-
Nürnberg. Sozialwissenschaftliches Institut. Lehrstuhl für Soziologie.
Arbeits- und Diskussionspapiere 2004-2 (2. Auflage).
Bange, Carsten & Schinzer, Heiko (2001). Am Anfang steht die
Datenqualität. Computerwoche, 44, 56–58 (02.11.2001).
Barnett, Vic & Lewis, Toby (1994³). Outliers in statistical data. New York:
John Wiley & Sons.
Batini, Carlo & Scannapieco, Monica (2006). Data Quality: Concepts,
Methodologies and Techniques. New York: Springer.
BBC News (2020a). A-levels: Ofqual's 'cheating' algorithm under review.
BBC News (Wakefield, Jane, 20.08.2020).
BBC News (2020b). Excel: Why using Microsoft's tool caused Covid-19
results to be lost. BBC News (Kelion, Leo, 05.10.2020).
BBC News (2005). S Korea cloning research was fake. BBC News
(23.12.2005).
Beaulieu-Prévost, Dominic (2006). Confidence Intervals: From tests of
statistical significance to confidence intervals, range hypotheses and
substantial effects. Tutorial in Quantitative Methods for Psychology, Vol 2
(1), 11–19.
Begg, Colin B. & Berlin, Jesse A. (1988). Publication bias: A problem in
interpreting medical data. Journal of the Royal Statistical Society, A, 151, (3),
419–463.
Beikler, Sabine (2005). Hartz-IV-Panne: Berlins Arbeitsämter zahlen
Abschlag. Der Tagesspiegel (report of 02.01.2005).
Berry, Michael J.A. & Linoff, Gordon S. (2000). Mastering Data Mining:
The Art and Science of Customer Relationship Management. New York: John
Wiley & Sons.
Best, Joel (2001). Damned Lies and Statistics: Untangling Numbers from the
Media, Politicians and Activists. University of California Press.
Bettschen, Patrick (2005). Master Data Management (MDM) enables IQ at
Tetra Pak, 1–31. In: (eds.) Proceedings of the Tenth International Conference
on Information Quality (ICIQ 2005). Boston, MA: MIT Sloan School of
Management – Total Data Quality Management Program.
Blanc, Michel; Radermacher, Walter & Körner, Thomas (2001). Qualität und
Nutzer – Grundlagen und Instrumente der Nutzerorientierung in der
amtlichen Statistik. Statistisches Bundesamt – Wirtschaft und Statistik, 10,
799–807.
Bock, Jürgen (1998). Bestimmung des Stichprobenumfangs. München Wien:
[Link] Verlag.
Bohannon, John (2006). Iraqi Death Estimates Called Too High; Methods
Faulted. In: Science, 314, 5798, 396 – 397 (20 October 2006).
Bonanos, Alceste Z., Stanek, Kris Z.; Kudritzki, Rolf P.; Macri, Lucas M.;
Sasselov, Dimitar D.; Kaluzny, Janusz; Stetson, Peter B.; Bersier, David;
Bresolin, Fabio; Matheson, Tom; Mochejska, Barbara J.; Przybilla, Norbert;
Szentgyorgyi, Andrew H.; Tonry, John; Torres, Guillermo (2006). The first
DIRECT distance determination to a detached eclipsing binary in M33. The
Astrophysical Journal, Vol. 652, 313–322.
Bortz, Jürgen & Lienert, Gustav A. (1998). Kurzgefaßte Statistik für die
klinische Forschung: Ein praktischer Leitfaden für die Analyse kleiner
Stichproben. Heidelberg: Springer.
Bortz, Jürgen (1993⁴). Statistik für Sozialwissenschaftler. Heidelberg:
Springer.
Bortz, Jürgen & Döring, Nicola (1995²). Forschungsmethoden und
Evaluation für Sozialwissenschaftler. Heidelberg: Springer.
Boseley, Sarah (2006). UK scientists attack Lancet study over death toll. The
Guardian,
Tuesday October 24, 2006.
Brackstone, Gordon (1999). Managing data quality in a statistical agency.
Survey Methodology, 25, 2, 129–149.
Bredenkamp, Jürgen (1972). Der Signifikanztest in der psychologischen
Forschung. Frankfurt/M.: Akademische Verlagsanstalt.
Bredenkamp, Jürgen (1969). Über Maße der praktischen Signifikanz.
Zeitschrift für Psychologie, 177, 310–317.
Broad, William & Wade, Nicholas (1984). Betrug und Täuschung in der
Wissenschaft. Basel: Birkhäuser.
Brockman, John (1991). Einstein, Gertrude Stein, Wittgenstein und
Frankenstein: Die Geburt der Zukunft: Die Bilanz unseres
naturwissenschaftlichen Weltbildes. München: Goldmann.
Bundesärztekammer, Kassenärztliche Bundesvereinigung &
Arbeitsgemeinschaft der Wissenschaftlichen Medizinischen
Fachgesellschaften (2003). „Curriculum Qualitätssicherung / Ärztliches
Qualitätsmanagement“. Reihe: Texte und Materialien der
Bundesärztekammer zur Fortbildung und Weiterbildung. Band 10:
Curriculum Qualitätssicherung / Ärztliches Qualitätsmanagement. 3.
überarbeitete Auflage.
Buolamwini, Joy & Gebru, Timnit (2018). Gender Shades: Intersectional
accuracy disparities in commercial gender classification. Proceedings of
Machine Learning Research 81:1-15 [2018 Conference on Fairness,
Accountability, and Transparency].
Burnham, Gilbert; Lafta, Riyadh; Doocy, Shannon et al. (2006). Mortality
after the 2003 invasion of Iraq: A cross-sectional cluster sample survey. In:
The Lancet, 368(9545), 1421–1429.
Buu, Yuh-Pey Anne (1999). Analysis of longitudinal data with missing
values: A methodological comparison. Unpublished doctoral dissertation,
Indiana University.
Cabena, Peter; Hadjinian, Pablo; Stadler, Rolf; Verhees, Jaap & Zanasi,
Alessandro (1998). Discovering data mining – from concept to
implementation. Upper Saddle River.
Calvert, William S. & Ma, Meimei J. (1996). Data Management: Concepts
and Case Studies. Cary, NC: SAS Institute Inc.
Carson, Carol S. & Liuksila, Claire (2001). Further steps toward a framework
for assessing data quality. International Monetary Fund (IMF): Statistics
Department. Washington DC.
Carson, Carol S. (2000). What Is Data Quality? A Distillation of Experience.
International Monetary Fund (IMF): Statistics Department. Washington DC.
Carver, Ronald P. (1993). The case against statistical significance testing,
revisited. Journal of experimental education, 61 (4), 287–292.
Chambers, Ray (1999a). Probability sampling: Basic methods, 7–39. In:
Davies, Pam & Smith, Paul (eds.). Model Quality Report in Business
Statistics – Vol. 1: Theory and Methods for Quality Evaluation. UK Office
for National Statistics. London.
Chambers, Ray (1999b). Probability sampling: Extensions, 40–64. In:
Davies, Pam & Smith, Paul (eds.). Model Quality Report in Business
Statistics – Vol. 1: Theory and Methods for Quality Evaluation. UK Office
for National Statistics. London.
Chapman, Pete; Clinton, Julian; Khabaza, Thomas; Reinartz, Thomas &
Wirth, Rüdiger (1999). The CRISP-DM Process Model. Discussion Paper.
The CRISP-DM consortium: NCR Systems Engineering Copenhagen
(Denmark), DaimlerChrysler AG (Germany), Integral Solutions Ltd.
(England) and OHRA Verzekeringen en Bank Groep B.V. (The Netherlands).
Chatterjee, Samprit & Price, Bertram (1995²). Praxis der Regressionsanalyse.
München Wien: [Link] Verlag.
Claus, Prof. Dr. Volker (2004). Spiegel Rangliste vom November.
Universität Stuttgart. Email, 16.12.2004.
Cleophas, Ton J., Zwinderman, Aeilko H. & Cleophas, Toine F. (2002).
Statistics applied to Clinical Trials. Dordrecht: Kluwer Academic Publishers.
Cleveland, William S. (1993). Visualizing data. Summit, NJ: Hobart Press.
Cohen, Jacob et al. (2003³). Applied Multiple Regression/Correlation
Analysis for the Behavioral Sciences. Mahwah, NJ: Lawrence Erlbaum Ass.
Cohen, Jacob (1994). The earth is round (p < .05). American Psychologist,
49, 997–1003.
Cool, Angela L. (2000). A review of methods for dealing with missing data.
Paper presented at the Annual Meeting of the Southwest Educational
Research Association, Dallas, TX. (ERIC Document Reproduction Service
No. ED 438 311).
Copas, John B. & Li, H.G. (1997). Inference for non-random samples (with
discussion). Journal of the Royal Statistical Society, Series B, 59, 55–96.
Cumming, Geoff & Finch, Sue (2005). Inference by eye: Confidence
intervals and how to read pictures of data. American Psychologist, 60, 2,
170–180.
Dar, Reuven; Serlin, Ronald C. & Omer, Haim (1994). Misuse of Statistical
Tests in Three Decades of Psychotherapy Research. Journal of Consulting
and Clinical Psychology, 62, 75–82.
Davies, Pam (1999). Processing errors, 16–24. In: Davies, Pam & Smith,
Paul (eds.). Model Quality Report in Business Statistics – Vol. 1: Theory and
Methods for Quality Evaluation. UK Office for National Statistics. London.
Davies, Pam & Smith, Paul (1999a) (eds.). Model Quality Report in
Business Statistics – Vol. I: Theory and Methods for Quality Evaluation. UK
Office for National Statistics. London.
Davies, Pam & Smith, Paul (1999b) (eds.). Model Quality Report in
Business Statistics – Vol. II: Comparison of Variance Estimation Software
and Methods . UK Office for National Statistics. London.
Davies, Pam & Smith, Paul (1999c) (eds.). Model Quality Report in Business
Statistics – Vol. III: Model Quality Reports. UK Office for National
Statistics. London.
Davies, Pam & Smith, Paul (1999d) (eds.). Model Quality Report in Business
Statistics – Vol. IV: Guidelines for Implementation of Model Quality
Reports. UK Office for National Statistics. London.
DeGEval (2004). Deutsche Gesellschaft für Evaluation: Empfehlungen für
die Aus- und Weiterbildung in der Evaluation: Anforderungsprofile an
Evaluatorinnen und Evaluatoren. Köln: Z.B.!Kunstdruck GmbH.
DeGEval (2002). Deutsche Gesellschaft für Evaluation: Standards für
Evaluation. Köln: Zimmermann-Medien.
Dewdney, Alexander K. (1994). 200 Prozent von Nichts: Die geheimen
Tricks der Statistik und andere Schwindeleien mit Zahlen. Basel: Birkhäuser.
DFG / Deutsche Forschungsgemeinschaft (2005). Grundsätze und
Verantwortlichkeiten bei der Durchführung klinischer Studien. Bonn:
Deutsche Forschungsgemeinschaft (12. Juli 2005).
DFG / Deutsche Forschungsgemeinschaft (1998). Vorschläge zur Sicherung
guter wissenschaftlicher Praxis. Bonn: Deutsche Forschungsgemeinschaft.
Diehl, Joerg M. & Arbinger, Roland (2001³). Einführung in die
Inferenzstatistik. Eschborn: Verlag Dietmar Klotz.
Diehl, Joerg M. & Kohr, Heinz U. (1999¹²). Deskriptive Statistik. Eschborn:
Verlag Dietmar Klotz.
Diekmann, Andreas (2002). Diagnose von Fehlerquellen und methodische
Qualität in der sozialwissenschaftlichen Forschung. Wien: Österreichische
Akademie der Wissenschaften, Institut für Technikfolgen-Abschätzung (ITA)
(06/2002, ITA-02-04).
Di Trocchio, Federico (1999). Der große Schwindel: Betrug und Fälschung in
der Wissenschaft. Reinbek [Link]: Rowohlt.
Draper, David & Bowater, Russell (1999a). Sampling errors under non-
probability sampling, 65–81. In: Davies, Pam & Smith, Paul (eds.). Model
Quality Report in Business Statistics – Vol. 1: Theory and Methods for
Quality Evaluation. UK Office for National Statistics. London.
Draper, David & Bowater, Russell (1999b). Model assumption errors, 40–64.
In: Davies, Pam & Smith, Paul (eds.). Model Quality Report in Business
Statistics – Vol. 1: Theory and Methods for Quality Evaluation. UK Office
for National Statistics. London.
Dravis, Frank (2004). Data Quality Strategy: A step-by-step approach, 27–
43. In: Proceedings of the Ninth International Conference on Information
Quality (ICIQ 2004). Boston, MA: MIT Sloan School of Management –
Total Data Quality Management Program.
Eckerson, Wayne W. (2002). Data quality and the bottom line: Achieving
Business Success through a Commitment to High Quality Data. The Data
Warehousing Institute: TDWI Report Series.
Elvers, Eva (1999). Frame errors, 82–85. In: Davies, Pam & Smith, Paul
(eds.). Model Quality Report in Business Statistics – Vol. 1: Theory and
Methods for Quality Evaluation. UK Office for National Statistics. London.
Elvers, Eva & Rosen, Bengt (1999). Quality Concept for Official Statistics,
621–629. In: Kotz, Samuel; Read, Campbell B. & Banks, David L. (eds.).
Encyclopedia of Statistical Sciences, Vol. 3 – Update. New York: John
Wiley & Sons.
English, Larry (2002). Process Management and Information Quality: How
improving information production processes improves information (product)
quality, 206–209. In: Proceedings of the Seventh International Conference on
Information Quality (ICIQ 2002). Boston, MA: MIT Sloan School of
Management – Total Data Quality Management Program.
English, Larry (1999). Improving Data Warehouse and Business Information
Quality. New York: John Wiley & Sons.
Eppler, Martin J. & Helfert, Markus (2004). A classification and analysis of
data quality costs, 311–325. In: Proceedings of the Ninth International
Conference on Information Quality (ICIQ 2004). Boston, MA: MIT Sloan
School of Management – Total Data Quality Management Program.
Eubanks, Virginia (2018). Automating Inequality. How High-Tech Tools
Profile, Police, and Punish the Poor. Maidenhead: Melia Publishing Services.
Eurostat (2004). European Conference on Quality and Methodology in
Official Statistics (Q2004). Mainz, Germany, Castle of the Prince Elector
(24–26 May 2004). Federal Statistical Office Germany / Statistisches
Bundesamt, Wiesbaden.
Eurostat (2003). “How to make a quality report”. Working Group
“Assessment of quality in statistics”. 6th meeting. Eurostat, Luxembourg. 2–3
October 2003.
Eurostat (2002). Definition of quality in statistics. Working Group
“Assessment of quality in statistics”. 5th meeting. Eurostat, Luxembourg. 2–3
May 2002.
Eurostat (1999). Quality work and quality assurance within statistics.
Luxembourg: Eurostat.
Eurostat (1998). Qualitätsarbeit und Qualitätssicherung in der Statistik.
Luxembourg: Eurostat.
Experian (2019). Global data management research: Taking control in the
digital age (Benchmark report). Experian, Nottingham.
Fay, Robert E. (2002). Probabilistic models for detecting census person
duplication. American Statistical Association: Joint Statistical Meetings –
Section on Survey Research Methods, 746, 969–974.
Fay, Robert E. (2000). Theory and application of nearest neighbour
imputation in Census 2000. US Census Bureau, Decennial Statistical Studies
Division, Washington DC.
Feldt, Leonard S. (1973). What size samples for methods/materials
experiments? Journal of Educational Measurement, 10, 221–231.
Feyerabend, Paul (1986). Wider den Methodenzwang. Frankfurt a.M.:
Suhrkamp.
Feyerabend, Paul (1980). Erkenntnis für freie Menschen. Veränderte
Ausgabe. Frankfurt a.M.: Suhrkamp.
Finetti, Marco & Himmelreich, Armin (1999). Der Sündenfall: Betrug und
Fälschung in der deutschen Wissenschaft. Stuttgart: [Link] Raabe.
Fischer, Hannah (2006). Iraqi Civilian Deaths Estimates. CRS Report for
Congress, Congressional Research Service, The Library of Congress
(November 22, 2006).
Fitzmaurice, Garrett; Laird, Nan & Ware, James (2004). Applied
Longitudinal Analysis. New York: John Wiley & Sons.
Forster, Jonathan J. & Smith, Peter W.F. (1998). Model-based inference for
categorical survey data subject to non-ignorable nonresponse (with
discussion). Journal of the Royal Statistical Society, Series B, 60, 57–70.
Friedmann, Jan; Hackenbroch, Veronika; Hipp, Dietmar; Klawitter, Nils;
Koch, Julia; Lakotta, Beate; Mohr, Joachim; Schmitz, Christoph; Thimm,
Katja & Wüst, Christian (2004). Die Elite von Morgen – Wo studieren die
Besten? DER SPIEGEL, 48, 178–200.
Fry, Hannah (2018). Hello World: How to be Human in the Age of the
Machine. New York: [Link].
Fugini, Maria Grazia; Mecella, Massimo; Plebani, Pierluigi; Pernici, Barbara;
Scannapieco, Monica (2002). Data Quality in Cooperative Web Information
Systems. Kluwer Academic Publishers.
Gackowski, Zbigniew J. (2004). Logical interdependence of data/information
quality dimensions – A purpose-focused view on IQ, 126–140. In:
Proceedings of the Ninth International Conference on Information Quality
(ICIQ 2004). Boston, MA: MIT Sloan School of Management – Total Data
Quality Management Program.
Gartner (2018). How to Create a Business Case for Data Quality
Improvement (Moore, Susan, 19.06.2018).
Gartner (2017). Magic Quadrant for Data Science Platforms. Gartner RAS
Core Research Note.
Garvin, David A. (1998). What does ‘Product Quality’ really mean? Sloan
Management Review, Fall, 25–43.
Gassman, Jennifer J.; Owen, W.W.; Kuntz, T.E.; Martin, J.P. & Amoroso,
W.P. (1995). Data quality assurance, monitoring, and reporting. Controlled
Clinical Trials, 16, 104–136 (Supplement).
Gigerenzer, Gerd (1999⁵). Über den mechanischen Umgang mit statistischen
Methoden, 607–618. In: Roth, Erwin; Heidenreich, Klaus & Holling, Heinz
(Hrsg.). Sozialwissenschaftliche Methoden: Lehr- und Handbuch für
Forschung und Praxis. München Wien: [Link] Verlag.
Gigerenzer, Gerd (1981). Messung und Modellbildung in der Psychologie.
München: UTB Reinhardt.
Goerk, Manfred (2005). Data Quality@SAP for Field Processes Excellence.
First Information Quality Forum. Dublin City University (February 24th,
2005)
Goerk, Manfred (2004). Data Quality in Practice @ SAP AG – an enterprise
wide approach. CAiSE Workshop on Data and Information Quality (DIQ),
Riga, Latvia.
Gould, Stephen J. (2000²). Ein Dinosaurier im Heuhaufen: Streifzüge durch
die Naturgeschichte. Frankfurt a.M.: [Link].
Gould, Stephen J. (1983). Der falsch vermessene Mensch. Basel: Birkhäuser.
Graham, David J; Campen, David; Hui, Rita; Spence, Michele; Cheetham,
Craig; Levy, Gerald; Shoor, Stanford; Ray, Wayne A (2005). Risk of acute
myocardial infarction and sudden cardiac death in patients treated with cyclo-
oxygenase 2 selective and non-selective non-steroidal anti-inflammatory
drugs: Nested case-control study. Lancet 2005, 365: Early Online
Publication.
Green, Sam (1991). How many subjects does it take to do a regression
analysis? Multivariate Behavioral Research, 26, 455–510.
Grözinger, Gerd & Matiaske, Wenzel (2005). Ein „Montags“-Ranking. Über
die Spiegel/McKinsey-Umfrage unter Studenten. Forschung & Lehre, 2005,
2, 82–83.
Guardian UK (2020). Home Office to scrap a “racist” algorithm for UK visa
applicants. The Guardian UK (McDonald, Henry, 04.08.2020).
Hager, Willi (2005). Vorgehensweisen in der deutschsprachigen
psychologischen Forschung: Eine Analyse empirischer Arbeiten der Jahre
2001 und 2002. Psychologische Rundschau, 56 (3), 191–200.
Hager, Willi (1987). Grundlagen einer Versuchsplanung zur Prüfung
empirischer Hypothesen in der Psychologie, 43–264, In: Lüer, Gerd (ed.).
Allgemeine experimentelle Psychologie, Stuttgart: Fischer.
Haller, Heiko & Kraus, Stefan (2002). Misinterpretations of significance: A
problem students share with their teachers? Methods of Psychological
Research Online, 7, 1, 1–20.
Hampel, Frank R., Ronchetti, Elvezio M., Rousseeuw, Peter J. & Stahel,
Werner A. (2005). Robust Statistics: The Approach Based on Influence
Functions. Wiley Series in Probability and Mathematical Statistics. New
York: Wiley.
Hampel, Frank R. (1985). The breakdown points of the mean combined with
some rejection rules. Technometrics, 27, 95–107.
Hampel, Frank R. (1971). A general qualitative definition of robustness.
Annals of Mathematical Statistics, 42, 1887–1896.
Hartung, Joachim (1999¹²). Statistik. München Wien: [Link] Verlag.
Hartung, Joachim & Elpelt, Bärbel (1999⁶). Multivariate Statistik: Lehr- und
Handbuch der angewandten Statistik. München Wien: [Link] Verlag.
Haughton, Dominique; Robbert, Mary Ann; Senne, Linda P.; Gada, Vismay
(2003). Effect of dirty data on analysis results, 64–79. In: Proceedings of the
Eighth International Conference on Information Quality (ICIQ 2003).
Boston, MA: MIT Sloan School of Management – Total Data Quality
Management Program.
Hawking, Stephen W. (1988). Eine kurze Geschichte der Zeit: Die Suche
nach der Urkraft des Universums. Reinbek b. Hamburg: Rowohlt.
Hawkins, Douglas M. (1980). Identification of outliers. Kluwer Pub.
Monographs on Statistics and Applied Probability.
Hays, William L. (1988⁴). Statistics. Fort Worth: Harcourt Brace Jovanovich
College Publishers.
Hedeker, Donald & Gibbons, Robert D. (1997). Application of random-
effects pattern-mixture models for missing data in longitudinal studies.
Psychological Methods, 2(1), 64–78.
HEISE online news (2005). Falsche Storno-Meldungen an Krankenkassen
durch Fehler in Hartz-IV-Software (report of 08.08.2005).
Helfert, Markus (2000). Maßnahmen und Konzepte zur Sicherung der
Datenqualität. In: Jung, Reinhard & Winter, Robert (Hrsg.). Data
Warehousing Strategie: Erfahrungen, Methoden, Visionen. Berlin: Springer,
61–77.
Helfert, Markus; Herrmann, Clemens & Strauch, Bernhard (2001).
Datenqualitätsmanagement. Universität St. Gallen – Hochschule. Institut für
Wirtschaftsinformatik.
Hippel, Paul T. von (2004). Biases in SPSS 12.0 Missing Value Analysis.
The American Statistician, May 2004, Vol. 58, No. 2, 160–164.
Hite, Shere (1976). The Hite Report: A nationwide study on female
sexuality. New York/London: Macmillan.
Hite, Shere (1981). The Hite Report on Male Sexuality. New York: Alfred A.
Knopf.
Höding, Maia; Michalke, Meik & Nass, Oliver (2005). Die Magie der Zahlen
– Die Hochschulverrenkungen von AOL, McKinsey und Spiegel. Marburg:
Philipps-Universität Marburg, Fachschaftsrat Psychologie.
Horton, Nicholas J. & Lipsitz, Stuart R. (2001). Multiple imputation in
practice: Comparison of software packages for regression models with
missing variables. The American Statistician, 55(3):244–254.
Hosmer, David W. & Lemeshow, Stanley (2000). Applied Logistic
Regression. Second Edition. New York: John Wiley & Sons.
Huff, Darrell (1993/1954). How to lie with statistics. New York:
[Link] & Company.
Iannacchione, Vincent G. (1982). Weighted sequential hot-deck imputation
macros. Proceedings of the SAS Users Group International Conference, Vol.
7, 759–763.
ICH E6(R1). ICH Harmonised Tripartite Guideline: Guideline for Good
Clinical Practice E6 (R1), (10 June 1996).
ICH E8. ICH Harmonised Tripartite Guideline: General Considerations for
Clinical Trials E8 (17 July 1997).
ICH E9. ICH Harmonised Tripartite Guideline: Statistical Principles for
Clinical Trials E9 (5 February 1998).
IMF Survey (2004). Why statistics matter: African seminars raise awareness,
February 16, 1–3.
Infeld, Eric & Sebastian-Coleman, Laura (2004). Galaxy’s Data Quality
Program: A Case Study, 84–88. In: Proceedings of the Ninth International
Conference on Information Quality (ICIQ 2004). Boston, MA: MIT Sloan
School of Management – Total Data Quality Management Program.
Ioannidis, John P.A. (2005). Why most published research findings are false.
PLoS Medicine (2) 8, e124, 696–701.
Jacob, Peter; Kaiser, Jan C.; Blettner, Maria; Bertelsmann, Hilke;
Kutschmann, Marcus; Likhtarev, Ilya; Kovgan, Lina; Vavilov, Sergej;
Tronko, Mykola D.; Bogdanova, Tetyana I. (2005). Anwendungsbereich von
epidemiologischen Studien mit zusammengefassten Daten zur Bestimmung
von Risikofaktoren. Bonn: Bundesministerium für Umwelt, Naturschutz und
Reaktorsicherheit (ed.), Schriftenreihe: Reaktorsicherheit und Strahlenschutz.
Juran, Joseph M. & Godfrey, A. Blanton (1999⁵). Juran's Quality Handbook.
New York: McGraw-Hill.
Kearney, Anne T. (2002). A.C.E. Revision II: Missing Data Evaluation. US
Census Bureau, Planning, Research, and Evaluation Division, Washington
DC (December 31, 2002).
Keeling, Kellie B. & Pavur, Robert J. (2007). A comparative study of the
reliability of nine statistical software packages. Computational Statistics &
Data Analysis, 51 (8), 3811–3831.
Keppel, Geoffrey & Wickens, Thomas D. (2004⁴). Design and analysis: A
researcher’s handbook. Upper Saddle River, NJ: Pearson / Prentice Hall.
Khabaza, Tom (1999). The Story of Clementine. Integral Solutions Limited,
1999 (March).
Kimball, Ralph & Merz, Richard (2000). The Data Webhouse Toolkit:
Building the web-enabled Data Warehouse. New York: John Wiley & Sons.
Kish, Leslie (1990). Weighting: Why, when, and how? American Statistical
Association, 18, 121–130.
Klimanskaya, Irina; Chung, Young; Becker, Sandy; Lu, Shi-Jiang & Lanza,
Robert (2006). Human embryonic stem cell lines derived from single
blastomeres. In: Nature, 444, 481–485. Plus two corrigenda (23 Nov 2006,
15 Mar 2007).
Klotz, Karlhorst (2005). Fehler im System: Der Traum von Software ohne
Bugs. Technology Review, July.
Knüsel, Leo (2005). On the accuracy of statistical distributions in Microsoft
Excel 2003. Computational Statistics & Data Analysis, 48(3), 445–449.
Konno, Sayako (2006). The Bank of Japan’s Basic Principles Toward Further
Improvement of Financial and Economic Statistics. Paper presented at the
European Conference on Quality in Survey Statistics in Cardiff, UK. Session
3: Improving Survey Data (24 – 26 April 2006).
Körner, Thomas & Schmidt, Jürgen (2006). Qualitätsberichte – ein neues
Informationsangebot über Methoden, Definitionen und Datenqualität der
Bundesstatistiken. Statistisches Bundesamt – Wirtschaft und Statistik, 2,
109–117.
Kostanich, Donna (2003). A.C.E. Revision II: Design and Methodology,
Memorandum Series #PP-30. US Census Bureau, Decennial Statistical
Studies Division, Washington DC (March 11, 2003).
Kostanich, Donna & Haines, Dawn E. (2003). Census 2000 Accuracy and
Coverage Evaluation Revision II. American Statistical Association: Joint
Statistical Meetings – Section on Survey Research Methods, 388, 2229–2235.
Krämer, Walter & Gigerenzer, Gerd (2005). How to confuse with statistics
or: The use and misuse of conditional probabilities. Statistical Science, 20, 3,
223–230.
Krämer, Walter (2000). So lügt man mit Statistik. München: Piper.
Krämer, Walter (1995). Was ist faul an der Statistik-Grundausbildung an
deutschsprachigen Wirtschaftsfakultäten? Allgemeines Statistisches Archiv,
79, 196–211.
Krantz, David. H. (1999). The null hypothesis testing controversy in
psychology. Journal of the American Statistical Association, 94, 1372–1381.
Kriz, Jürgen (1981). Methodenkritik empirischer Sozialforschung. Stuttgart:
B.G. Teubner.
Kriz, Jürgen (1973). Statistik in den Sozialwissenschaften. Reinbek b.
Hamburg: Rowohlt.
Kromrey, Helmut (2005). "Qualitativ" versus "quantitativ" – Ideologie oder
Realität? Symposium: Qualitative und quantitative Methoden in der
Sozialforschung: Differenz und/oder Einheit? 1. Berliner Methodentreffen
Qualitative Forschung, 24.–25. Juni 2005.
Kromrey, Helmut (1999). Diskussion: Von den Problemen
anwendungsorientierter Sozialforschung und den Gefahren methodischer
Halbbildung. SuB 1/99: Anwendungsorientierte Sozialforschung – Bl.1–16.
Kuhn, Thomas S. (1976). Die Struktur wissenschaftlicher Revolutionen.
Frankfurt a.M.: Suhrkamp. Zweite revidierte und um das Postskriptum von
1969 ergänzte Auflage.
Laliberté, Lucie; Grünewald, Werner & Probst, Laurent (2004). A
comparison of IMF’S Data Quality Assessment Framework (DQAF) and
EUROSTAT’s quality definitions. International Monetary Fund (IMF):
Statistics Department. Washington DC (January 2004).
Lee, Yang W.; Pipino, Leo L.; Funk, James D. & Wang, Richard Y. (2006).
Journey to Data Quality. Cambridge, Massachusetts: MIT Press.
Levesque, Raynald (2007⁴). SPSS Programming and Data Management.
Chicago: SPSS Inc.
Liebeskind, Uta & Ludwig-Mayerhofer, Wolfgang (2005). Auf der Suche
nach der Wunsch-Universität – im Stich gelassen: Anspruch und Wirklichkeit
von Hochschulrankings, Soziologie, 4, 442–462.
Little, Roderick J.A. (2007). Personal communication, 16 Apr 2007.
Little, Roderick J.A. & Rubin, Donald B. (2002²). Statistical Analysis with
Missing Data. New York: John Wiley & Sons. Second Edition.
Little, Roderick J.A. & Rubin, Donald B. (1989). The Analysis of Social
Science Data with Missing Values. Sociological Methods and Research, 18,
292–326.
Litz, Hans Peter (2000). Multivariate statistische Methoden. München:
Oldenbourg.
Long, Jennifer; Seko, Craig; Robertson, Chris; Morrison, Laurie J. (2004).
Where to start? A preliminary data quality checklist for emergency medical
services data, 197–210. In: Proceedings of the Ninth International Conference
on Information Quality (ICIQ 2004). Boston, MA: MIT Sloan School of
Management – Total Data Quality Management Program.
Longford, Nicholas T. (2000). Multiple imputation in an international
database of social science surveys. ZA-Information, 46, 72–95.
Ludwig-Mayerhofer, Wolfgang (2003). Zur Qualität der
sozialwissenschaftlichen Methodenausbildung – am Beispiel statistischer
Datenanalyse. ZA-Information, 53, 144–155.
Maier-Riehle, Brigitte & Zwingmann, Christian (2000). Effektstärkevarianten
beim Eingruppen-Prä-Post-Design: Eine kritische Betrachtung.
Rehabilitation, 39, 189–199.
Mariotte, Henri (1999). Estimating the number of falsely active legal units in
the French Business Register. Paper presented at the 13th International
Roundtable on Business Survey Frames. Session No. 6, Paper No. 2. Paris,
France, September 27 – October 1, 1999.
Mayring, Philip (1990). Einführung in die qualitative Sozialforschung.
München: Psychologie Verlags Union.
McClellan, James E. & Dorn, Harold (1991). Werkzeuge und Wissen:
Naturwissenschaft und Technik in der Weltgeschichte. Frankfurt a.M.:
Rogner & Bernhard bei Zweitausendeins.
McCullough, Bruce D. (1999). Assessing the Reliability of Statistical
Software: Part II. The American Statistician, May, 53, 2, 149–159.
McCullough, Bruce D. (1998). Assessing the Reliability of Statistical
Software: Part I. The American Statistician, November, 52, 4, 358–366.
McCullough, Bruce D. & Wilson, Berry (2005). On the accuracy of statistical
procedures in Microsoft Excel 2003. Computational Statistics & Data
Analysis, 49(4), 1244–1252.
McCullough, Bruce D. & Wilson, Berry (2002). On the accuracy of statistical
procedures in Microsoft Excel 2000 and Excel XP. Computational Statistics
& Data Analysis, 40, 713–721.
McCullough, Bruce D. & Wilson, Berry (1999). On the accuracy of statistical
procedures in Microsoft Excel 97. Computational Statistics & Data Analysis,
31, 27–37.
McFadden, Eleanor (1998). Management of Data in Clinical Trials. New
York: John Wiley & Sons.
McKeon, Adrian (2003). Barclays Bank Case Study: Using Artificial
Intelligence to Benchmark Organizational Data Flow Quality, 1–31. In:
Proceedings of the Eighth International Conference on Information Quality
(ICIQ 2003). Boston, MA: MIT Sloan School of Management – Total Data
Quality Management Program.
McKinsey & Company (2004). Methodik des Studentenspiegels. [Link]
[Link]/[Link].
McNally, James W. (1997). Generating Hot-Deck Imputation Estimates:
Using SAS for Simple and Multiple Imputation Allocation Routines. Brown
University: Population Studies and Training Center. Working Paper Series.
PSTC Working Paper #97-12 (also: SUGI 1997, Paper 239–26).
Meinert, Curtis L. (1986). Clinical Trials: Design, Conduct and Analysis.
New York: Oxford University Press.
Menard, Scott (2001²). Applied Logistic Regression Analysis (Series:
Quantitative Applications in the Social Sciences). Thousand Oaks: Sage
Publications.
Monmonier, Mark (1996). Eins zu einer Million: Die Tricks und Lügen der
Kartographen. Basel: Birkhäuser.
Mule, Thomas (2002). A.C.E. Revision II Results: Further Study of Person
Duplication. US Census Bureau, Decennial Statistical Studies Division,
Washington DC (Dec 31, 2002).
Mule, Thomas (2001). Person Duplication in Census 2000. US Census
Bureau, Decennial Statistical Studies Division, Executive Steering
Committee on Accuracy and Coverage Evaluation Policy (ESCAP) II,
Washington DC. Report Number 20 (Oct. 11, 2001).
Müller, Walter (2001). Konsequenzen für die Methodenausbildung aus dem
Gutachten der Kommission zur Verbesserung der informellen Infrastruktur
zwischen Wissenschaft und Statistik (KVI). ZUMA-Nachrichten, 49, Jg. 25,
November, 81–99.
Müller-Ullrich, Burkhard (1998). Medienmärchen: Gesinnungstäter im
Journalismus. München: Siedler / Goldmann.
Naumann, Felix & Rolker, Claudia (2000). Assessment methods for
information quality criteria. In: Proceedings of the International Conference
on Information Quality (ICIQ 2000). Boston, MA: MIT Sloan School of
Management – Total Data Quality Management Program, October 2000,
148–162.
NCAC / National Coalition Against Censorship (2007). Political Science: A
Report on Science & Censorship. The Knowledge Project At The National
Coalition Against Censorship. New York, NY.
Neter, John, Wasserman, William & Whitmore, George A. (1988³). Applied
Statistics. Boston: Allyn and Bacon.
Nickerson, Raymond S. (2000). Null hypothesis significance testing: A
review of an old and continuing controversy. Psychological Methods, 5, (2),
241–301.
Nida-Rümelin, Julian (2005). Wider die Schmalspur-Wissenschaften. Neue
Zürcher Zeitung (28.12.2005).
Nuovo, Jim, Melnikow, Joy & Chang, Denise (2002). Reporting number
needed to treat and absolute risk reduction in randomized controlled trials,
JAMA, 287, 2813–2814.
NZZ (2019). Evidenzbasierter Rassismus – Algorithmen sind immer nur so
schlau wie die Daten, auf denen sie beruhen – Gastkommentar. Neue
Zürcher Zeitung (Kaeser, Eduard 06.07.2019).
O’Brien, James A. & Marakas, George (2003). Introduction to Information
Systems. McGraw-Hill/Irwin. Twelfth Edition.
OCC (2020). News Release 2020-132 (07.10. 2020). OCC Assesses $400
Million Civil Money Penalty Against Citibank.
OECD (2003). Quality framework and guidelines for OECD statistical
activities. Organisation for Economic Co-operation and Development,
STD/QFS, JT00151752 (17-Oct 2003).
Office of Management and Budget (2007). Draft 2007 Report to Congress on
the Costs and Benefits of Federal Regulations. Executive Office of the
President of the United States: Office of Management and Budget,
Washington, D.C.
Office of Management and Budget (2006a). Standards and Guidelines for
Statistical Surveys. Executive Office of the President of the United States:
Office of Management and Budget, Washington, D.C. (September 2006).
Office of Management and Budget (2006b). Questions and Answers When
Designing Surveys for Information Collections. Executive Office of the
President of the United States: Office of Management and Budget,
Washington, D.C. (January 2006).
Office of Research Integrity (2006). Newsletter, Volume 14, No. 2, March.
Ofori-Kyei, Mark; Lister, Jimmie; Mobisson, Geoffrey (2002). Evolution of a
Data Quality Strategy, 102–105. In: Proceedings of the Seventh International
Conference on Information Quality (ICIQ 2002). Boston, MA: MIT Sloan
School of Management – Total Data Quality Management Program.
Olejnik, Stephen & Algina, James (2003). Generalized Eta and Omega
Squared Statistics: Measures of Effect Size for Some Common Research
Designs. Psychological Methods, 8 (4), 434–447.
Olejnik, Stephen & Algina, James (2000). Measures of Effect Size for
Comparative Studies: Applications, Interpretations and Limitations.
Contemporary Educational Psychology, 25, 241–286.
Oliveira, Paulo; Rodrigues, Fátima; Henriques, Pedro (2005). A formal
definition of data quality problems, 102–105. In: Proceedings of the Tenth
International Conference on Information Quality (ICIQ 2005). Boston, MA:
MIT Sloan School of Management – Total Data Quality Management
Program.
Pedhazur, Elazar J. (1982²). Multiple Regression in Behavioral Research:
Explanation and Prediction. Fort Worth: Holt, Rinehart and Winston Inc.
Peng, Chao-Ying Joanne; Harwell, Michael; Liou, Show-Mann & Ehman,
Lee H. (2006). Advances in missing data methods and implications for
educational research, 31–78. In: Sawilowsky, Shlomo S. (ed.). Real data
analysis. Greenwich, CT: Information Age Publishing.
Pernici, Barbara & Scannapieco, Monica (2002). Data Quality in Web
Information Systems, 397–413. In: Spaccapietra, Stefano; March, Salvatore
T. & Kambayashi, Yahiko (eds.). Conceptual Modeling – ER 2002:
Proceedings of the 21st International Conference on Conceptual Modeling.
Tampere, Finland, October 7–11, 2002. London: Springer (Series: Lecture
Notes In Computer Science, Vol. 2503).
Peterson, Tommy (2003). Data Scrubbing by the Numbers (February 10,
2003).
[Link]
Popper, Karl R. (1971⁶). Logik der Forschung. Tübingen: Mohr.
Popper, Karl R. (1963). Conjectures and refutations. London: Routledge and
Keagan Paul.
Popper, Karl R. (1962). Some comments on truth and the growth of
knowledge, 285–292. In: Nagel, Ernest; Suppes, Patrick; Tarski, Alfred
(eds). Logic, methodology, and philosophy of science. Stanford: Stanford
University Press.
Pötschke, Manuela & Simonson, Julia (2003). Konträr und ungenügend?
Ansprüche an Inhalt und Qualität einer sozialwissenschaftlichen
Methodenausbildung. ZA-Information, 52, 72–92.
Prentice, Deborah A. & Miller, Dale T. (1992). When small effects are
impressive. Psychological Bulletin, 112, 1, 160–164.
Prewitt, Kenneth (2000). Accuracy and Coverage Evaluation: Statement on
the feasibility of using statistical methods to improve the accuracy of Census
2000. US Bureau of the Census. Washington DC (June 2000).
Quatember, Andreas (2005). Das Signifikanz-Relevanz-Problem beim
statistischen Testen von Hypothesen. ZUMA-Nachrichten, 57, Jg. 29,
November, 128–150.
Radermacher, Walter & Körner, Thomas (2006). Fehlende und fehlerhafte
Daten in der amtlichen Statistik. Neue Herausforderungen und
Lösungsansätze. Allgemeines statistisches Archiv, 90, 4, 553–576.
Rasch, Dieter; Kubinger, Klaus D.; Schmidtke, Jörg & Häusler, Joachim
(2004). The misuse of asterisks in hypothesis testing. Psychology Science,
46, 2, 227–242.
Rasch, Dieter; Verdooren, Rob L. & Gowers, Jim I. (1999). Fundamentals in
the design and analysis of experiments and surveys. München Wien:
[Link] Verlag.
Rasch, Dieter; Herrendörfer, Günter; Bock, Jürgen; Victor, Norbert &
Guiard, Volker (eds.) (1996). Verfahrensbibliothek: Versuchsplanung und -
auswertung. Band I. München Wien: [Link] Verlag.
Raymond, Mark R. & Roberts, Dennis M. (1987). A comparison of methods
for treating incomplete data in selection research. Educational and
Psychological Measurement, 47, 13–26.
RTÉ (2019) State fund for jobs loses €750k due to 'human error'. RTÉ
(Murphy, David, 26.11.2019).
Redman, Thomas C. (2004). Data: An Unfolding Quality Disaster. DM
Review Magazine, August, 57, 22–23.
Redman, Thomas C. (2001). Data Quality: The Field Guide. Woburn, MA:
Butterworth-Heinemann.
Reichardt, Charles S. & Harry F. Gollob (1997). When confidence intervals
should be used instead of statistical significance tests, and vice versa, 259–
284. In: Harlow, Lisa L.; Mulaik, Stanley A. & Steiger, James H. (eds.), What
if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum.
REPORT MAINZ (2007). Die verschwundenen Geheimdienstakten. Broadcast of
25 June 2007, 9:45 p.m. Presenter: Fritz Frey.
Richardson, DeJuran & Chen, Shande (2001). Data quality assurance and
quality control measures in large multicenter stroke trials: the African-
American Antiplatelet Stroke Prevention Study experience. Current
Controlled Trials in Cardiovascular Medicine, 2(3): 115–117.
Richardson, John T.E. (1996). Measures of effect size. Behavior Research
Methods, Instruments & Computers, 28, 12–22.
RiskNet (2020). Data error inflated Wells Fargo’s op risk capital by $5
billion. [Link] (Woodall, Louie, 03.06.2020).
Rohe, Julia & Beyer, Martin (2005). Schönschrift lohnt sich. Der Hausarzt, 5,
73.
Rohwer, Götz & Pötter, Ulrich (2002). Methoden sozialwissenschaftlicher
Datenkonstruktion. Weinheim und München: Juventa.
Rost, Jürgen (2005). Messen wird immer einfacher! ZA-Information, 56,
Mai, 6–7.
Roth, Erwin; Heidenreich, Klaus & Holling, Heinz (eds.) (1999⁵).
Sozialwissenschaftliche Methoden: Lehr- und Handbuch für Forschung und
Praxis. München Wien: [Link] Verlag.
Roth, Philip L. (1994). Missing data: A conceptual review for applied
psychologists. Personnel Psychology, 47, 537–560.
Rothhaas, Rob (2002). American Statistical Association: Joint Statistical
Meetings – Section on Survey Research Methods, 211, 2984–2989.
Rubin, Donald B. (1996). Multiple imputation after 18 years. Journal of the
American Statistical Association, 91, 473–489.
Rubin, Donald B. (1988). An Overview of Multiple Imputation. Proceedings
of the Survey Research Methods Section of the American Statistical
Association, 79–84.
Rud, Olivia (2001). Data Mining Cookbook. New York: John Wiley & Sons.
Särndal, Carl-Erik (1992). Methods for estimating the precision of survey
estimates when imputation has been used. Survey Methodology, 18, 241–252.
Sarris, Viktor (1992). Methodologische Grundlagen der
Experimentalpsychologie. Band 2: Versuchsplanung und Stadien des
psychologischen Experiments. München: UTB Reinhardt.
Schafer, Joseph L. & Graham, John W. (2002). Missing data: Our view of the
state of the art. Psychological Methods, 7, 147–177.
Schendera, Christian F.G. (2015). SPSS 回归分析 -Regression Analysis
with SPSS. Beijing: Broadview.
Schendera, Christian F.G. (2015). Deskriptive Statistik. München: UTB
Lucius.
Schendera, Christian F.G. (2014) Regressionsanalyse mit SPSS (2.A.).
München: DeGruyter.
Schendera, Christian F.G. (2012). SQL mit SAS: Band 2: PROC SQL für
Fortgeschrittene. München: Oldenbourg.
Schendera, Christian F.G. (2011). SQL mit SAS: Band 1: PROC SQL für
Einsteiger. München: Oldenbourg.
Schendera, Christian F.G. (2010). Clusteranalyse mit SPSS: Mit
Faktorenanalyse. München: Oldenbourg.
Schendera, Christian F.G. (2009). Regressionsanalyse mit SPSS. München:
Oldenbourg.
Schendera, Christian F.G. (2008). Vertrauen ist gut, Kontrolle ist besser: Die
Qualität von Daten in Unternehmen auf dem Prüfstand. Economag, 5, 96, 1–6.
Schendera, Christian F.G. (2007). Datenqualität mit SPSS. München:
Oldenbourg.
Schendera, Christian F.G. (2006). Analyse einer Hochschulevaluation: Der
Studentenspiegel 2004 – Die Qualität von Studie, Daten und Ergebnissen.
Zeitschrift für Empirische Pädagogik, 20, 4, 421–437.
Schendera, Christian F.G. (2005). Datenmanagement mit SPSS. Springer:
Heidelberg.
Schendera, Christian F.G. (2004). Datenmanagement und Datenanalyse mit
dem SAS System. München: Oldenbourg.
Schendera, Christian F.G., Janz, Frauke, Klauß, Theo und Wolfgang Lamers
(2003). Die Umsetzung von Evaluationskriterien im Projekt „Perspektiven
der schulischen Erziehungs- und Bildungsrealität von Kindern und
Jugendlichen mit schwerer und mehrfacher Behinderung“ (BiSB-Projekt).
Zeitschrift für Evaluation, 2/2003, 223–232.
Schendera, Christian F.G. (2001). Methodik und Technik der Forschung –
Teil I: Die Geschichte der Molekularbiologie als eine Geschichte ihrer
Methoden. In: klinBiol, 5, Mai, 22–28.
Schnell, Rainer (2002). Ausmaß und Ursachen des Mangels an quantitativ
qualifizierten Absolventen sozialwissenschaftlicher Studiengänge, 35–44, in:
Engel, Uwe (ed.). Praxisrelevanz der Methodenausbildung. Bonn:
Informationszentrum Sozialwissenschaften. Reihe: Sozialwissenschaftliche
Tagesberichte Band 5.
Schnell, Rainer; Hill, Paul B. & Esser, Elke (1999⁶). Methoden der
empirischen Sozialforschung. München Wien: [Link] Verlag.
Schnell, Rainer (1994). Graphisch gestützte Datenanalyse. München Wien:
[Link] Verlag.
Schulz, Felix (2007). Entwicklung der Delinquenz von Kindern,
Jugendlichen und Heranwachsenden in Deutschland: Eine vergleichende
Analyse von Kriminalstatistiken und Dunkelfelduntersuchungen zwischen
1950 und 2000. Berlin: LIT-Verlag Dr. [Link]. Reihe:
Kriminalwissenschaftliche Schriften (eds.: Schöch, Heinz;
Dölling, Dieter; Meier, Bernd-Dieter & Verrel, Torsten), Band 11.
Singer, Eleanor (1995). The Professional Voice 3: Comments on Hite's
Women and Love, 132–136. In: Rubenstein, Sondra Miller. Surveying Public
Opinion. Belmont, CA: Wadsworth Publishing.
Skinner, Chris (1999a). Nonresponse errors, 25–39. In: Davies, Pam &
Smith, Paul (eds.). Model Quality Report in Business Statistics – Vol. 1:
Theory and Methods for Quality Evaluation. UK Office for National
Statistics. London.
Skinner, Chris (1999b). Measurement errors, 6–15. In: Davies, Pam & Smith,
Paul (eds.). Model Quality Report in Business Statistics – Vol. 1: Theory and
Methods for Quality Evaluation. UK Office for National Statistics. London.
Smith, Tom W. (2003). American Sexual Behavior: Trends, Socio-
Demographic Differences, and Risk Behavior. University of Chicago
National Opinion Research Center. GSS Topical Report No. 25 (Updated,
April, 2003, Version 5.0).
Smith, Tom W. (1989). Sex Counts: A Methodological Critique of Hite's
Women and Love, 537–547, in: Turner, Charles F., Miller, Heather G. &
Moses, Lincoln E. (Eds.). AIDS: Sexual Behavior and Intravenous Drug Use.
Washington, DC: National Academy of Sciences Press.
SPIEGEL ONLINE (2007). Falsche Statistik: Bundesagentur meldet zu
geringe Arbeitslosenzahlen (31.05.2007).
SPIEGEL ONLINE (2004). Forschungsbetrüger Schön: Doktortitel futsch
(Meldung vom 11.06.2004).
Spiess, Martin (2006). Wozu ein tieferes Verständnis von Statistik? Ein
Kommentar zu Hager (2005). Psychologische Rundschau, 57 (1), 43–46.
SPSS (2018). IBM SPSS Modeler v18.2 User's Guide. IBM Corporation.
SPSS (2017a). IBM SPSS v25.0 Command Syntax Reference. IBM Corporation.
SPSS (2017b). IBM SPSS Modeler v18.1 Modeling Nodes. IBM Corporation.
SPSS (2017c). IBM SPSS Modeler v18.1 Algorithms Guide. IBM Corporation.
SPSS (2017d). IBM SPSS Modeler v18.1 Applications Guide. IBM Corporation.
SPSS (2017e). IBM SPSS Modeler v18.1.1 Python Scripting and Automation.
IBM Corporation.
SPSS (2016). IBM SPSS Modeler v18.0 Source, Process, and Output Nodes.
IBM Corporation.
SPSS (2011). IBM SPSS v20.0 Command Syntax Reference. IBM Corporation.
SPSS (2006). SPSS 15.0 Command Syntax Reference. Chicago: SPSS Inc.
SPSS (2000). CRISP-DM 1.0. Chicago: SPSS Inc.
SPSS Technical Support (2006). Personal communication, 12 Sep 2006.
SPSS Technical Support (2004). Personal communication, 14 Oct 2004.
Statistics Canada (2003). Quality Guidelines. Statistics Canada, Methodology
Branch, Ottawa, Ontario (4th ed.).
Statistics Canada (2000). Policy on Informing Users of Data Quality and
Methodology. Policy Manual. Statistics Canada, Methodology Branch,
Ottawa, Ontario (Approved March 31, 2000).
Statistische Ämter des Bundes und der Länder (2003) (eds.).
Qualitätsstandards in der amtlichen Statistik. Erarbeitet von der Bund-
Länder-Arbeitsgruppe „Qualitätsleitlinien für die Produkte der amtlichen
Statistik“. Herausgeber: Statistische Ämter des Bundes und der Länder. April
2003.
Statistisches Bundesamt (2004). Zensustest. Statistisches Bundesamt,
Wirtschaft und Statistik, 8, 813–833.
Statistisches Bundesamt (1993) (ed.). Qualität statistischer Daten. Beiträge
zum wissenschaftlichen Kolloquium am 12./13. November 1992 in
Wiesbaden. Wiesbaden: Statistisches Bundesamt. Schriftreihe Forum der
Bundesstatistik, Band 25.
Stelzl, Ingeborg (1982). Fehler und Fallen der Statistik für Psychologen,
Pädagogen und Sozialwissenschaftler. Bern: Huber.
Stollorz, Volker (2006). Frankfurter Allgemeine Sonntagszeitung, Nr. 34,
Seite 57 (27.08.2006).
Süddeutsche (2018). Hunderte verlieren wegen Softwarefehler ihre Häuser,
Süddeutsche (Dornis, Valentin, 06.08.2018).
Süddeutsche (2014). Behörden-Panne: Fiskus verteilte
Steueridentifikationsnummern falsch; Süddeutsche (Bohsem, Guido,
13.02.2014).
Sutton, Anne (2007). Permanent Fund backup for backup fails in data glitch.
The Anchorage Daily News (March 19, 2007).
Tagesanzeiger (2015). Deutsche Bank überweist sechs Milliarden Dollar –
ungewollt. Tagesanzeiger (20.10.2015).
Taleb, Nassim Nicholas (2019). Skin in the Game: Hidden Asymmetries in
Daily Life. London: Penguin Books.
The New York Times (2018). Facial Recognition Is Accurate, if You’re a
White Guy. The New York Times (09.02.2018, Lohr, Steve).
The New York Times (2015). Google Photos Mistakenly Labels Black
People ‘Gorillas’. The New York Times (01.07.2015, Dougherty, Conor).
Totterdell, Nigel (2005). How to make a start: DQ Health Check, 1–6. In:
Proceedings of the Tenth International Conference on Information Quality
(ICIQ 2005). Boston, MA: MIT Sloan School of Management – Total Data
Quality Management Program.
Tufte, Edward R. (2003). The Cognitive Style of Power Point. Cheshire,
Conn.: Graphics Press.
Tufte, Edward R. (2001²). The Visual Display of Quantitative Information.
Cheshire, Conn.: Graphics Press.
Tufte, Edward R. (1997). Visual and statistical thinking: Displays of evidence
for making decisions. Cheshire, Conn.: Graphics Press.
United Nations (2003). Handbook of Statistical Organization: The Operation
and Organization of a Statistical Agency. United Nations, Department of
Economic and Social Affairs, Statistics Division (Studies in Methods, Series
F No. 88, Third Edition). New York: United Nations.
United Nations (2002). United Nations and Economic Commission for
Europe: Terminology on Statistical Metadata. Conference of European
Statisticians, Statistical Standards and Studies – No. 53. Geneva: United
Nations.
United Nations (1995). United Nations and Economic Commission for
Europe: Guidelines for the Modeling of Statistical Data and Metadata.
Conference of European Statisticians, Methodological Material. Geneva:
United Nations.
United Nations (1983). United Nations and Economic Commission for
Europe: Guidelines for Quality Presentations that are prepared for Users of
Statistics. Conference of European Statisticians. Methodological Material.
Geneva: United Nations.
US Census Bureau (2006). Census Bureau Principle: Definition of Data
Quality. Census Bureau Methodology & Standards Council, Washington DC.
Issued: 14 Jun 06. Version 1.3.
US Census Bureau (2000). Census 2000: A.C.E. Methodology, Volume 1.
U.S. Bureau of the Census: Decennial Statistical Studies Division,
Washington DC.
Velleman, Paul F. & Wilkinson, Leland (1993). Nominal, Ordinal, Interval,
and Ratio Typologies are Misleading. American Statistician, 47, 1, 65–72.
Voas, David (2007). Ten million marriages: A test of astrological ‘love
signs’. University of Manchester: Cathie Marsh Centre for Census and
Survey Research (25.03.2007).
Von der Lippe, Jürgen (1998). Mit Mikro-Daten einen Makro-Wirbel
machen. Allgemeines Statistisches Archiv, 82, 380–386.
Wainer, Howard & Robinson, Daniel H. (2003). Shaping up the practice of
null hypothesis significance testing. Educational Researcher, 10, 22–30.
Wan, Ivy; Guess, Frank & Bates, Rodney (2002). FedEx & Information
Quality, 84–88. In: Proceedings of the Seventh International Conference on
Information Quality (ICIQ 2002). Boston, MA: MIT Sloan School of
Management – Total Data Quality Management Program.
Wang, Richard Y. & Strong, Diane M. (1996). Beyond accuracy: What data
quality means to data consumers. Journal of Management Information
Systems, 12, 4, 5–33.
Wiedenbeck, Michael & Züll, Cornelia (2001). Klassifikation mit
Clusteranalyse: Grundlegende Techniken hierarchischer und K-means-
Verfahren, ZUMA How-to-Reihe, Nr. 10, 1–18.
Wilcox, Rand R. (1998). How many discoveries have been lost by ignoring
modern statistical methods? American Psychologist, 53, 3, 300–314.
Wilcox, Rand R. (1997). Introduction to robust estimation and hypothesis
testing. San Diego, Calif.: Academic Press.
Wilkinson, Leland (2005). The grammar of graphics. New York: Springer.
Wilkinson, Leland & The APA Task Force on Statistical Inference, APA
Board of Scientific Affairs (1999). Statistical methods in psychology
journals: Guidelines and explanations. American Psychologist, 54, 8, 594–604.
Willeke, Caroline; Salou, Gérard; Ahsbahs, Catherine & Damia, Violetta
(2006). Statistical Quality in the European Central Bank: Experience so Far
and Way Forward. Paper presented at the European Conference on Quality in
Survey Statistics in Cardiff, UK. Session 15: Quality Management (24 – 26
April 2006).
Wilson, Thomas P. (1981). Qualitative versus quantitative methods in social
research, 37–69. In: ZUMA-Arbeitsbericht No. 1981/19: Integration von
qualitativen und quantitativen Forschungsansätzen. Arbeitstagung 13.–
17.07.1981. November 1981.
WIRED (2009). Scanning Dead Salmon in fMRI Machine Highlights Risk of
Red Herrings, WIRED (18.09.2009, Madrigal, Alexis).
Witte, Erich H. (1980). Signifikanztest und statistische Inferenz: Analysen,
Probleme, Alternativen. Stuttgart: Enke.
Wothke, Werner (1998). Longitudinal and multi-group modeling with
missing data. In T.D. Little, Kai U. Schnabel, & Jürgen Baumert (Eds.)
Modeling longitudinal and multiple group data: Practical issues, applied
approaches and specific examples. Mahwah, NJ: Lawrence Erlbaum
Associates.
Wright, Daniel B. (2003). Making friends with your data: Improving how
statistics are conducted and reported. British Journal of Educational
Psychology, 73, 123–136.
Yaffee, Robert & McGee, Monnie (2000). Time Series Analysis and
Forecasting. Orlando/Fl.: Academic Press.
Zajac, Kevin J. (2003). Analysis of Imputation Rates for the 100 Percent
Person and Housing Unit Data Items from Census 2000: Final Report. US
Census Bureau: Decennial Statistical Studies Division, Washington DC
(September 25, 2003).
Zumbo, Bruno D. & Jennings, Martha J. (2002). The Robustness of Validity
and Efficiency of the Related Samples t-Test in the Presence of Outliers.
Psicológica, 23, 415–450.
21 Your opinion about this book
My intention was to make this book as comprehensive, understandable,
error-free, and up-to-date as possible, but some inaccuracies or
misunderstandings may have escaped the numerous checks. Any errors and
inaccuracies that are discovered should ideally be corrected in future
editions. SPSS, too, will certainly have undergone technical or
statistical-analytical developments that should be taken into account.
I would therefore like to take this opportunity to invite you to help make
this book on SPSS even better. If you have any suggestions for additions or
improvements to this book, please send an e-mail to the following address:
SPSS2@[Link]
Enter the keyword "Feedback SPSS book" in the "Subject" field, and make
sure to include at least the following information:

1. Edition
2. Page
3. Keyword (e.g. "typo")
4. Description (e.g. in the case of statistical analyses)

In the case of program code, please include comments.
Many thanks!
Dr. Christian Schendera
22 Author
Knowledge and insight are method-dependent. In order to be able to assess
knowledge and insight, and also to estimate the consequences and quality of
decisions based on it, it must be transparent which (research) methods were
used to obtain it.
Dr. CFG Schendera is a scientific data analyst. His main interest is the
rational (re)construction of knowledge, i.e. the influence of (non)scientific
(research) methods of all kinds, including statistics, on the construction
and reception of knowledge. He has published on data analysis (including
data quality), statistics, research methods, evaluation, and statistical
systems (SPSS, SAS) in English, German, and Chinese. In 2016, he was
invited to revise Chapter 11, "Data Preparation", of the well-known German
standard work, Bortz & Döring's "Forschungsmethoden und Statistik" (transl.:
"Research Methods and Statistics"; 2016, 5th ed.; explicit acknowledgement
on page VII). Dr. Schendera is a member of several academic societies.
Dr. CFG Schendera is also managing director of Method Consult
Switzerland, [Link]. Method Consult offers, among other things,
professional data analysis/mining, consulting in scientific methods, and
training on SPSS or SAS (including data quality, research methods, SQL, and
multivariate statistics). Method Consult's clients and projects span
business, research, and government in and beyond Germany, Austria, and
Switzerland. Further information can be found at [Link].

Common questions

Powered by AI

Involving domain experts in the process of outlier identification and treatment is critical as they possess specialized knowledge that enables the correct interpretation of the data within its specific context. Outliers might superficially appear incorrect but could represent valid, albeit unusual, real-world phenomena that require expert insight to understand. For example, in medical datasets, experts can differentiate between plausible patient outcomes versus data errors. Domain expertise ensures that outlier treatment decisions elevate data quality by avoiding the erroneous exclusion or alteration of meaningful data and devising context-aware strategies, such as modeling outliers or applying domain-specific corrections .

Software errors in data analysis systems can significantly impact data quality by introducing biases, inaccuracies, and misinterpretations into the dataset. These errors might manifest as incorrect calculations, mishandling of missing values, or faulty data transformations. They compromise the reliability and validity of the analysis outcomes, which can lead to misguided decision-making. Specifically, software errors can cause flawed models and analyses that produce erroneous conclusions, adversely affecting strategic decisions ranging from policy formulations to business strategies. Therefore, rigorous validation of software and regular testing against known datasets are essential steps to mitigate these risks.

The MATCH FILES command in SPSS offers two approaches for handling duplicate data: keeping the first duplicate row and keeping the last duplicate row. The first approach retains only the initial occurrence of each duplicate, which is useful when earlier data is considered more reliable. Keeping the last duplicate row, on the other hand, may be preferred when later entries are more likely to reflect amendments or verification. Both approaches can significantly affect data interpretation because they determine which version of the duplicate data is analyzed and reported, potentially altering conclusions: keeping the first occurrence can neglect more recent corrections, while keeping the last can overlook initial, possibly correct data.
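The keep-first/keep-last logic can be illustrated without SPSS. The following is a plain-Python sketch (not MATCH FILES syntax) with invented customer records, showing how the two retention policies select different versions of a duplicated ID:

```python
# Hypothetical records: customer ID 1 was captured twice,
# with a later spelling correction of the city.
rows = [
    {"id": 1, "city": "Bern",  "entry": "2021-03-01"},
    {"id": 2, "city": "Basel", "entry": "2021-03-02"},
    {"id": 1, "city": "Berne", "entry": "2021-04-15"},  # later correction
]

def dedupe(rows, key, keep="first"):
    """Keep one row per key value: the 'first' or the 'last' occurrence."""
    seen = {}
    for row in rows:
        k = row[key]
        if keep == "first":
            seen.setdefault(k, row)   # ignore later duplicates
        else:
            seen[k] = row             # later duplicates overwrite earlier ones
    return list(seen.values())

first = dedupe(rows, "id", keep="first")  # ID 1 -> "Bern"
last = dedupe(rows, "id", keep="last")    # ID 1 -> "Berne"
```

Either way, one row per ID survives; the policy decides whether the original entry or the later amendment is the one that reaches the analysis.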

Different data cleaning strategies play varied roles in maintaining data quality by addressing outliers in specific contexts. Deletion of outliers completely eliminates their influence, which is suitable when outliers are confirmed errors with no informational value; however, this approach reduces sample size and can weaken statistical power. Bundling involves grouping outliers into categories, allowing them to be analyzed in generalized terms, which is useful when the presence of outliers is itself informative. The reduction technique (akin to winsorizing) mitigates the extreme influence of outliers by adjusting their values to lie within a certain range, retaining some of their influence without letting them overwhelm the analysis. Each strategy should be chosen based on the data context, research objectives, and domain knowledge, so that quality improves without compromising the integrity of the analyses.

Ignoring outliers in datasets is risky because they might indicate underlying data errors, such as measurement or transcription mistakes, or reveal important empirical phenomena that deviate from expectations. Routinely omitting outliers assumes that all such points are irrelevant or incorrect, yet they may represent genuine variation that points to new insights or trends within the data. For instance, large datasets are more susceptible to outliers that reflect data entry errors (e.g., swapped digits), but may also harbor interesting phenomena not captured by the bulk of the distribution. Careful evaluation of the causes and effects of outliers is therefore necessary to distinguish between data quality issues and significant observations that deserve further analysis.

Assigning attributes such as frequency and diversity to detect duplicates enhances dataset management and cleaning by effectively categorizing duplicate records. 'Frequency' indicates how often a data row is repeated, while 'diversity' captures how many distinct versions exist among those repeats when additional variables are compared. These attributes enable precise evaluation of duplicates beyond a mere ID match, allowing a nuanced approach to deduplication in which retention decisions can draw on additional criteria. Managing these attributes helps ensure that duplicates are not prematurely removed and that important patterns or errors are not overlooked, ultimately maintaining data integrity for accurate analysis.
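Frequency and diversity can be computed with two passes over the data. The sketch below uses invented records: ID "A" occurs three times (frequency 3), but its duplicates carry two distinct name spellings (diversity 2), which signals a conflict rather than a harmless exact repeat:

```python
from collections import defaultdict

rows = [
    {"id": "A", "name": "Meier"},
    {"id": "A", "name": "Meier"},
    {"id": "A", "name": "Mayer"},   # same ID, conflicting spelling
    {"id": "B", "name": "Huber"},
]

freq = defaultdict(int)       # frequency: how often each ID occurs
variants = defaultdict(set)   # distinct payloads observed per ID
for row in rows:
    freq[row["id"]] += 1
    variants[row["id"]].add(row["name"])

# diversity: number of distinct versions among an ID's duplicates
diversity = {k: len(v) for k, v in variants.items()}
```

A frequency above 1 with diversity 1 is an exact duplicate that can usually be collapsed safely; frequency above 1 with diversity above 1 marks records that need inspection before any row is discarded.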

Data aggregation can potentially compromise dataset integrity by masking the variability and granularity of the original data, leading to loss of important information. Aggregation processes tend to sum, average, or count elements, which might obscure data subtleties and outlier effects that could be critical in some analyses. These effects can be mitigated by careful planning and testing of the aggregation processes, ensuring that they maintain key data attributes and relationships relevant to the research question. Using techniques like weighted aggregation or stratification can also help preserve some nuances. Ensuring transparency in the aggregation criteria and involving stakeholders in setting these can further uphold dataset integrity.
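A minimal numeric illustration of this masking effect, with invented group values: two groups with identical means but entirely different spreads are indistinguishable after aggregation to the mean alone, unless a dispersion measure is carried along:

```python
from statistics import mean, stdev

group_a = [10, 10, 10, 10]   # no variability
group_b = [1, 19, 2, 18]     # large variability, same mean

# Both groups aggregate to the same mean of 10 ...
assert mean(group_a) == mean(group_b) == 10

# ... so a mean-only summary hides the difference. Keeping a spread
# measure alongside the mean preserves it.
summary = {
    "a": {"mean": mean(group_a), "sd": stdev(group_a)},
    "b": {"mean": mean(group_b), "sd": stdev(group_b)},
}
```

This is the simplest case of the general point: any aggregate should be accompanied by whatever dispersion or count information the downstream research question still depends on.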

A 'negative rule' in data validation specifies conditions under which data entries are considered invalid, effectively listing unacceptable rather than acceptable values. This approach is beneficial for identifying entries that deviate from predefined standards or expected formats. For instance, a negative rule might be applied to validate postal codes, flagging all that do not belong to their corresponding state designation, such as 'Bayern' or 'Berlin', thus catching errors where postal codes are mismatched with their regions. Implementing negative rules helps ensure data integrity by systematically identifying and correcting mismatches and format deviations.
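A negative rule can be expressed as a predicate that matches the *invalid* cases. The sketch below is a deliberately simplified assumption (Bavarian postal codes are treated as starting with 8 or 9, which is only approximately true) applied to invented records:

```python
# Hypothetical negative rule: for state == "Bayern", any postal code
# NOT starting with 8 or 9 is flagged as invalid. The rule describes
# what is unacceptable instead of enumerating every acceptable code.
records = [
    {"plz": "80331", "state": "Bayern"},
    {"plz": "10115", "state": "Bayern"},   # Berlin code, mismatched state
    {"plz": "10115", "state": "Berlin"},
]

def violates_negative_rule(rec):
    """True for rows where a Bavarian record carries a non-Bavarian code."""
    return rec["state"] == "Bayern" and rec["plz"][0] not in ("8", "9")

flagged = [r for r in records if violates_negative_rule(r)]
```

The advantage is brevity: one condition catches every mismatch of this type, whereas a positive rule would have to list all valid Bavarian codes.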

Single-variable validation rules focus on ensuring that each individual data field adheres to specified criteria, such as a correct format or value range. These rules handle straightforward validation issues and inconsistencies at the column level, like ensuring all entries for a date follow a particular format. Cross-variable rules, however, are needed when data integrity depends on relationships between multiple fields. For example, a cross-variable rule might ensure that an employee's leaving date is after the joining date. The distinction is crucial for addressing different levels of complexity in data validation: single-variable rules handle isolated checks, while cross-variable rules address interdependencies in datasets.
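The distinction is easy to demonstrate: a record can pass every single-variable check and still be invalid across variables. In this invented example, both dates are individually plausible, but the leaving date precedes the joining date:

```python
from datetime import date

employee = {"joined": date(2020, 5, 1), "left": date(2019, 1, 1)}

# Single-variable rule: each date field checked in isolation against
# an assumed plausible range.
def valid_date(d):
    return date(1950, 1, 1) <= d <= date.today()

single_ok = valid_date(employee["joined"]) and valid_date(employee["left"])

# Cross-variable rule: the two fields checked against each other.
cross_ok = employee["left"] >= employee["joined"]
```

Here `single_ok` holds while `cross_ok` fails, which is exactly why column-level checks alone cannot guarantee record-level consistency.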

Suboptimal data quality in large datasets can arise from various issues such as completeness (missing records), duplicates, uniformity (inconsistent units), missings (missing values), correctness (e.g., typographical errors), and plausibility (implausible data). These issues can skew analysis outcomes by introducing biases, reducing the power of statistical tests, and potentially leading to incorrect conclusions. For instance, mixing English and metric units in calculations (as occurred with the loss of NASA's Mars Climate Orbiter) can produce vastly different outcomes. Such errors can be corrected during the data preparation stage or with data quality tools like IBM SPSS Modeler's data preparation nodes, which include the Data Audit and Auto Data Prep nodes.
