Social Quantitative Research Methods

This document presents a summary of chapter 2 of part 3 of the book "Methodology of Quantitative Social Research" by Pedro López-Roldán and Sandra Fachelli. The chapter describes the processes of preparing data for analysis, including data creation and identification, data transformation, and data file processing. It explains how to carry out these tasks using the SPSS and R programs.
METHODOLOGY OF QUANTITATIVE SOCIAL RESEARCH

Pedro López-Roldán
Sandra Fachelli

Autonomous University of Barcelona
Bellaterra (Cerdanyola del Vallès) | Barcelona


Dipòsit Digital de Documents
Autonomous University of Barcelona

Creative Commons BY-NC-ND

This digital book is published under a Creative Commons license. Anyone is free to copy, distribute or publicly
communicate the work, under the following conditions:

Attribution. You must give appropriate credit, provide a link to the license, and indicate if
changes have been made. You may do so in any reasonable manner, but not in any way that suggests
that the licensor endorses you or your use.
NonCommercial. You may not use the material for commercial purposes.

NoDerivatives. If you remix, transform or build upon the material, you may not distribute the
modified material.

No additional restrictions. You may not apply legal terms or technological measures that legally
restrict others from doing anything the license permits.

Pedro López-Roldán
Center for Sociological Studies on Daily Life and Work ( http://quit.uab.cat )
Institute of Work Studies ( http://iet.uab.cat/ )
Department of Sociology. Autonomous University of Barcelona
[email protected]

Sandra Fachelli
Department of Sociology and Analysis of Organizations
University of Barcelona
Grup de Recerca en Educació i Treball ( http://grupsderecerca.uab.cat/gret )
Department of Sociology. Autonomous University of Barcelona
[email protected]

Digital edition: http://ddd.uab.cat/record/129382

1st edition, February 2015

Building B · UAB Campus · 08193 Bellaterra (Cerdanyola del Vallès) · Barcelona · Spain
Tel. +34 93 581 1676
General index

PRESENTATION

PART I. METHODOLOGY
I.1. METHODOLOGICAL FOUNDATIONS
I.2. THE RESEARCH PROCESS
I.3. METHODOLOGICAL PERSPECTIVES AND MIXED DESIGNS
I.4. CLASSIFICATION OF RESEARCH TECHNIQUES

PART II. PRODUCTION
II.1. THE MEASUREMENT OF SOCIAL PHENOMENA
II.2. DATA SOURCES
II.3. THE SOCIAL SURVEY METHOD
II.4. THE SAMPLE DESIGN
II.5. EXPERIMENTAL RESEARCH

PART III. ANALYSIS
III.1. SOFTWARE FOR DATA ANALYSIS: SPSS, R AND SPAD
III.2. PREPARATION OF DATA FOR ANALYSIS
III.3. DESCRIPTIVE DATA ANALYSIS WITH A VARIABLE
III.4. FUNDAMENTALS OF INFERENTIAL STATISTICS
III.5. CLASSIFICATION OF DATA ANALYSIS TECHNIQUES
III.6. ANALYSIS OF CONTINGENCY TABLES
III.7. LOG-LINEAR ANALYSIS
III.8. ANALYSIS OF VARIANCE
III.9. REGRESSION ANALYSIS
III.10. LOGISTIC REGRESSION ANALYSIS
III.11. FACTOR ANALYSIS
III.12. CLASSIFICATION ANALYSIS
PART III. ANALYSIS

Chapter III.2
Data preparation
for analysis

How to cite this chapter:

López-Roldán, P.; Fachelli, S. (2015). Preparation of data for analysis. In P. López-Roldán and S. Fachelli,
Methodology of Quantitative Social Research. Bellaterra (Cerdanyola del Vallès): Dipòsit Digital de
Documents, Universitat Autònoma de Barcelona. Chapter III.2. 1st edition.
Digital edition: http://ddd.uab.cat/record/129381

Chapter written in February 2015


Contents

Preparing data for analysis
1. Creation and identification of data
1.1. Creation and identification of data with SPSS
1.2. Creation and identification of data with R
2. Data transformation
2.1. Data transformation with SPSS
2.2. Data transformation with R
3. Bibliography
PART III. Chapter 2
Preparing data for analysis

The data handled in social research usually need to be prepared for analysis. This need can arise at the outset or
during the analysis and interpretation of the information. By data preparation we mean a set of data processing tasks
that range from recording and identifying the data on a computer medium, through cleaning them, to transforming
them, which includes modifying the original information, creating new information from existing variables, and
processing data files.

Preparing data for analysis is surely one of the least recognized and, at the same time, one of the most important tasks
in research, perhaps because it is a rather technical task that is usually left in the hands of specialists skilled
in managing computer programs. But the quality of the data depends enormously on these tasks, in
interrelation with the other phases of the research process.

The original data matrix obtained in a research process is therefore raw material that has to be adapted and
conditioned to the needs of data exploitation and analysis. These operations are carried out with the help of the
specific data processing and analysis software being used. Graph III.2.1 presents the data processing flowchart,
which summarizes the general work dynamics with the software to perform the different tasks of preparing the data
for analysis. It refers in particular to data matrices and syntax programs in SPSS, but the same dynamics apply to
working with R or SPAD.

Data processing involves four fundamental tasks:

1) Creating and identifying the data, either by entering (keying in) them ourselves 4 or by importing them, reading
external data files in flat format ( TXT , DAT ) or in the formats of other systems ( XLS , SAS , R , …). This
generates the active file of the system, which we will save on the hard drive with an identifying name.

4 There is specific software for this task, such as Data Entry in SPSS, which allows you to create templates for
entering, identifying and controlling data recording.

Graph III.2.1 Flowchart of data processing with SPSS
[Flowchart: research design; data collection; data registration (tabulation); creation of the original data file in the Data Editor, or reading (DAT, TXT) and importing (XLS, R, SAS, ...) of external data into the active file; data identification, verification and debugging (Name.SPS / Name.SPV); data file (Name.SAV); data transformation (Name.SPS / Name.SPV); expanded data file (Name.SAV); data exploitation and analysis (Name.SPS / Name.SPV); results.]

2) Verifying the correctness of the data and of their identification in order to clean (correct) them if we detect
errors 5 . Different commands can be used first to detect and then to correct erroneous data.
3) Transforming the original data in order to condition the variables for exploitation and analysis, a task that
usually involves generating new variables that expand the original data matrix 6 . Data transformation also
covers the manipulation of the data matrix as a whole (weighting, selecting, sorting, aggregating, ... the data)
and the merging of one data file with others.
4) Analyzing the data, exploiting them with the different tabulation and statistical analysis procedures
(univariate, bivariate and multivariate), guided by the objectives of the research and the analysis model.

In this chapter we explain the first three tasks. With them we will be able to know the quality, structure and
properties of the data we handle. From the next chapter onwards we will see the different analysis procedures,
bearing in mind that they usually also require new transformations of the data, as the flowchart illustrates.
We will see these tasks with SPSS and R, after presenting their characteristics, and we will illustrate them
with different practical data processing exercises.

5 Much of the cleaning can, or should, also be carried out in the previous phase of fieldwork, as in the case of a survey.
Computer-assisted data collection systems greatly reduce this work.
6 In some survey research processes, the number of variables in the original data matrix may have doubled by the end of the process.

1. Creation and identification of data

As mentioned, data can be created by two basic procedures: we enter them or we import them. The data
thus created make up the data matrix, a set of rows and columns organized according to information coding
criteria. These criteria, and other aspects that characterize the data, allow us to identify them and to generate
what we call the data dictionary 7 .

We will carry out a practical exercise of creating a simple data matrix by entering the data, followed by other
exercises that involve importing existing data from other applications or formats.

For the first exercise we use the answers to the questions of the survey questionnaire shown in Table III.2.1.
The exercise involves coding, recording and identifying the data. The following sections detail how to carry out
the recording and identification tasks with SPSS and R. Below we present the questionnaire and a data coding
exercise based on a specific case.

The questionnaire gives rise to 16 variables: 15 derived from the questions, plus a first additional variable
that identifies the questionnaire number assigned to each respondent. We name these variables
ID, P1, P2, P3_1, P3_2, P3_3, P4, P5, P6_1, P6_2, P6_3, P6_4, P6_5, P6_6, P6_7 and P7 .

The data matrix will therefore have 16 columns containing the responses of each individual. These responses are
coded with numerical or textual values depending on the type of variable.

7 For more information, see chapters 3, 4, 5 and 6 of the core system manual (IBM Corporation, 2015).

Table III.2.1. Questionnaire for the exercise of creating a data matrix

Questionnaire number ___ ___ ___

1. How old are you? ______________________   Does not answer □ (999)

2. What is your sex?
   Male □ (1)
   Female □ (2)

3. Can you tell me the highest level of education you have completed, as well as that of your parents?

                                                           Ego     Father  Mother
   No studies, unfinished primary                          □ (1)   □ (1)   □ (1)
   EGB, elementary baccalaureate, ESO                      □ (2)   □ (2)   □ (2)
   Higher baccalaureate, BUP, COU                          □ (3)   □ (3)   □ (3)
   Vocational training: first grade, officers              □ (4)   □ (4)   □ (4)
   Vocational training: second grade, industrial master's  □ (5)   □ (5)   □ (5)
   University studies                                      □ (6)   □ (6)   □ (6)
   Does not know                                           □ (8)   □ (8)   □ (8)
   Does not answer                                         □ (9)   □ (9)   □ (9)

4. What was your work situation last week?
   Had a job □ (1)
   Did not work □ (2)
   Does not answer □ (9)

5. How many hours did you work? ____________ hours
   Does not answer □ (99)
   Not applicable (did not work) □ (97)

6. In relation to the following statements, indicate your degree of agreement or disagreement:
   Response options for each statement: Totally agree □ (1), Agree □ (2), Neither agree nor disagree □ (3),
   Disagree □ (4), Totally disagree □ (5), Does not know □ (8), Does not answer □ (9)

   1. Immigration is one of the main problems in Europe today
   2. If we do not control Europe's borders, our Welfare State will be unsustainable
   3. Immigration has increased insecurity on the streets
   4. The settlement of non-EU immigrants is causing a loss of the labor rights acquired until now
   5. It is necessary to implement cooperation policies with countries of origin
   6. Immigrants should have the right to vote
   7. Immigrants must adapt to the culture of the country where they settle

7. In politics we usually talk about left and right. On this card there is a series of boxes that go from left to
   right. In which box would you place yourself? (SHOW CARD)

   Left | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Right

   Does not know □ (98)
   Does not answer □ (99)

Let's imagine the case of the first questionnaire, answered by a person who says:
“I am 35 years old, I am a man, I have completed university studies, my father has not completed primary school and
my mother has not completed elementary school; I am working, 40 hours a week. I completely agree that
‘Immigration is one of the main problems in Europe today’, I agree that ‘If we do not control Europe's borders, our
Welfare State will be unsustainable’, I completely disagree that ‘Immigration has increased insecurity on the streets’,
I disagree that ‘The settlement of non-EU immigrants is causing a loss of the labor rights acquired until now’, I completely
disagree that ‘It is necessary to implement cooperation policies with countries of origin to reduce the entry of non-EU
immigrants’, I agree that ‘Immigrants should have the right to vote’, and I completely agree that ‘Immigrants must
adapt to the culture of the country where they settle’. I place myself in box 3 between left and right.”

The coding of their responses is shown in Table III.2.1:

We have followed a double criterion, first using only numerical codes and then combining numerical
codes with text. The first coding will serve to create and identify the data in SPSS (section 1.1), where all the
information can be coded numerically and a label assigned to the codes whose meaning needs to be made
explicit, which is the case of qualitative variables. The second coding is the one needed in R (section
1.2), where numerical codes are kept for the quantitative variables and short textual codes are used for the
qualitative variables, since R does not distinguish between values (codes) and their labels.
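As an illustration of the two coding criteria, the following Python sketch (Python is used here only for illustration; the chapter itself works with SPSS and R) stores the first respondent's answers as one row of the data matrix. The numeric codes follow the questionnaire and the answers described above, except P3_3, where the narrative is ambiguous and the code is an assumption. The short text codes in TEXT_CODES are hypothetical and are not the ones used in section 1.2.

```python
# Numeric coding (SPSS-style): every answer is stored as a number.
numeric_row = {
    "ID": 1, "P1": 35, "P2": 1,
    "P3_1": 6, "P3_2": 1,
    "P3_3": 1,  # assumption: the described answer is ambiguous; coded 1 here
    "P4": 1, "P5": 40,
    "P6_1": 1, "P6_2": 2, "P6_3": 5, "P6_4": 4,
    "P6_5": 5, "P6_6": 2, "P6_7": 1,
    "P7": 3,
}

# Hypothetical short text codes for two qualitative variables.
# Only P2 and P4 are mapped to keep the sketch short.
TEXT_CODES = {
    "P2": {1: "Male", 2: "Female"},
    "P4": {1: "Worked", 2: "NotWorked"},
}


def to_text_coding(row, text_codes):
    """Return a copy of row with mapped qualitative codes replaced by text."""
    return {
        name: text_codes.get(name, {}).get(value, value)
        for name, value in row.items()
    }


mixed_row = to_text_coding(numeric_row, TEXT_CODES)
print(mixed_row["P2"], mixed_row["P1"])
```

Quantitative variables such as P1 (age) and P5 (hours) keep their numeric values in both codings; only the mapped qualitative variables change.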

1.1. Creation and identification of data with SPSS

1.1.1. Entering data in SPSS

We start with the task of entering the data; later we will see how to import them. On opening the application
we can go directly to the data editor to enter the information. Remember that if a data matrix is already
open in the editor and we want to create a new one, we first open a new blank data editor window:
File / New / Data . The data editor lets you create or browse a data matrix from two tabs: Data View and
Variable View .

In the data view we enter the data themselves, that is, the codes or values of the variables, while in the variable
view we identify their characteristics, their dictionary. We can start either by entering the data or by
creating the dictionary.

We first enter the data of the first individual in the data view.

A name is automatically generated for each variable and the default format is assigned: numeric type with width 8
and 2 decimal places, without labels, missing values or measurement level. The initial appearance of the variable
view is the following:

Name | Type | Width | Decimals | Label | Values | Missing | Columns | Align | Measure | Role
VAR00001 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00002 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00003 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00004 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00005 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00006 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00007 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00008 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00009 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00010 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00011 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00012 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00013 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00014 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00015 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input
VAR00016 | Numeric | 8 | 2 | | None | None | 7 | Right | Unknown | Input

We then identify the data and create the data dictionary in the variable view, which involves specifying the
following information for each variable, one per row:

- The name of the variable ( Name ): it can be up to 64 characters long and must begin with a letter of the
alphabet ( A - Z ) or with the sign @ (also # for a temporary variable and $ for a system variable);
the remaining characters can also be numbers, a " . " or a " _ ". It cannot end in a period, and spaces and
special characters such as ! , ? , ' or * are not allowed. Upper and lower case letters are treated as
equivalent, although the form chosen is preserved. The reserved keywords ALL, AND, BY, EQ, GE, GT, LE,
LT, NE, NOT, OR, TO and WITH cannot be used.

- The format type of the variable ( Type ): each variable has a data type, chosen from the following:
numeric (values are numbers in standard format), comma and dot (numeric types that use the comma or the
period as the separator every three positions), scientific notation (numeric values displayed with an
embedded E and a signed exponent representing a power of base 10), date (numeric variable displayed in one
of several calendar-date or clock-time formats), dollar or custom currency (numeric variable displayed with
a leading dollar sign ($) or in the formats defined in the options), string (textual values with any
characters) and restricted numeric (non-negative integer values) 8 .

- The width ( Width ) is the number of positions occupied by the variable, part of which corresponds to the
number of decimals ( Decimals ). Both can be specified in the type dialog box or in their own columns. For
variables of type string, date and restricted numeric, the number of decimal places is always 0.
It is generally advisable to use the standard numeric format, as it makes the variables easier to process.
With greater mastery of the software, or for specific needs, all formats are of course valid. The
standard numeric format is defined by default as F8.2, that is, 8 positions of width with 2
decimal places, arranged as follows: 5 positions for the integer part, one position for
the decimal point and 2 positions for the decimals: _ _ _ _ _ . _ _ . So, for example, the value 1 of the
variable number of children corresponds to 00001.00 and is displayed as 1.00. If we change the variable
to F1.0 format it is displayed simply as 1. In either case the format affects only how the value is displayed.

8 To access the dialog box that defines the type of variable, click on the right side of the cell containing Numeric .

- The variable label ( Label ) assigns a text identifying the content of the variable, with a maximum length of
256 characters. However, many results do not show the label in full; in general, 36
characters may be sufficient 9 . The label is typed directly in the cell.

- The value labels ( Values ) assign to each value a text identifying its meaning, with a maximum length of 120
characters, although a maximum of 16 characters may be sufficient. To enter them, click on the right side of the
cell to open a dialog box, type each value with its label and click on “Add”.

9 The \n symbols can be inserted in variable and value labels to force a line break.

- The missing values declared by the user ( Missing ). We often lack values, that is, we have no information
for some cases or individuals on one or more variables. The system nevertheless needs to identify these
situations with a certain value. These values are called missing values, and there are two types:

  - User missing values. These are values that imply a lack of information (for example, "does not know",
  "does not answer" or "not applicable"), are coded with a specific value (for example, 8, 9 and 0), and are
  declared by the user as missing when identifying the variables so that they are treated differently; by
  default they are excluded from calculations.
  - System missing values. They also correspond to a lack of information, but the software generates them
  automatically when it finds an empty cell in the data matrix, or when we generate a new
  variable and no specific value is assigned to one or more cases. They are displayed in the editor
  as a dot (" . ") and appear in tables with the label " System Missing ".

  The user missing values are those identified in the data dictionary. To declare them, click on the
  right side of the cell to open the dialog box where discrete values (up to 3) or ranges of values are specified.

- The column width displayed in the data editor ( Columns ).

- The alignment ( Align ) controls the presentation of data values and/or value labels in the data view:
left, right or center.

- The measurement level of each variable ( Measure ) is unknown by default, and it is advisable to define
it because some procedures take it into account to decide the type of analysis or graph. In most other
cases the procedure accepts any level of measurement, so as users we must be aware of
which measurement scale is appropriate in each case. In SPSS there are three levels of
measurement: Nominal, Ordinal and Scale.

- The variable role ( Role ) identifies a particular type of variable with a specific, predefined function
and allows variables to be preselected for analysis, only in dialog boxes. The available roles are:
Input (the variable is used as an independent variable; the default option), Target (output or dependent
variables), Both (double role of input and target), None (no role assigned), Partition (variable used to
partition the data) and Split (for compatibility with IBM SPSS Modeler ).

Each of the attributes that make up the dictionary of a variable can be copied and pasted
into the definition of one or more other variables. Entire variables can also be copied (and deleted) by
selecting a row 10 .
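The naming rules listed above for the Name attribute can be sketched as a small check. The following Python function is only an illustration of the rules as stated in this section (length, first character, allowed characters, no final period, reserved keywords); it is not SPSS's own validation and may be stricter or looser than the program in details not covered here. The function name is_valid_name is hypothetical.

```python
import re

# Reserved keywords listed in the text; names are compared case-insensitively.
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

# First character: a letter, @, # or $; the rest: letters, digits, "." or "_".
NAME_PATTERN = re.compile(r"[A-Za-z@#$][A-Za-z0-9._]*")


def is_valid_name(name):
    """Check a variable name against the rules described in the text."""
    if not 1 <= len(name) <= 64:
        return False
    if NAME_PATTERN.fullmatch(name) is None:
        return False
    if name.endswith("."):
        return False
    return name.upper() not in RESERVED


for candidate in ["P3_1", "ID", "3P", "P6.", "TO", "hours worked"]:
    print(candidate, is_valid_name(candidate))
```

The names used in the exercise (ID, P1, P3_1, ...) pass the check, while a name that starts with a digit, ends in a period, contains a space or coincides with a reserved keyword is rejected.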

With these indications we proceed to identify the data with the particular properties of each of the variables.
The final result appears in Graph III.2.2.

Graph III.2.2 Identification of survey data: variable view

The identification carried out from the data editor window can also be done with the SPSS command language. The
Surveys.sps syntax file contains this information.
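To give an idea of what such identification commands look like, the following Python sketch builds the text of four SPSS commands (VARIABLE LABELS, VALUE LABELS, MISSING VALUES and VARIABLE LEVEL) from a dictionary entry for P2. It is a hedged illustration: the actual contents of Surveys.sps are not reproduced in this section, and the helper names spss_quote and dictionary_syntax, as well as the structure of the p2 entry, are hypothetical.

```python
def spss_quote(text):
    """Wrap text in single quotes, doubling any embedded apostrophes."""
    return "'" + text.replace("'", "''") + "'"


def dictionary_syntax(name, spec):
    """Build SPSS identification commands for one variable as text."""
    lines = [f"VARIABLE LABELS {name} {spss_quote(spec['label'])}."]
    if spec.get("values"):
        pairs = " ".join(
            f"{code} {spss_quote(label)}"
            for code, label in spec["values"].items()
        )
        lines.append(f"VALUE LABELS {name} {pairs}.")
    if spec.get("missing"):
        codes = ", ".join(str(code) for code in spec["missing"])
        lines.append(f"MISSING VALUES {name} ({codes}).")
    lines.append(f"VARIABLE LEVEL {name} ({spec['level']}).")
    return "\n".join(lines)


# Dictionary entry for P2, using the properties given in this section.
p2 = {
    "label": "Sex of the person interviewed",
    "values": {1: "Male", 2: "Female"},
    "missing": [8, 9],
    "level": "NOMINAL",
}

print(dictionary_syntax("P2", p2))
```

Keeping the dictionary as data and generating the commands from it is one possible design; writing the same commands by hand in a syntax file is equivalent.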
Once the dictionary, that is, the properties of the variables, has been defined, it remains to complete the
data matrix by entering the values in the data view 11 . In our case we have
entered 9 more cases, which gives the image shown in Graph III.2.3.

Entering the data simply involves placing the cursor on the corresponding cell, typing the value and
pressing <Enter> or moving to another cell. When entering data for a qualitative variable, if we have not
pre-coded the data separately and therefore have to choose the code, we need to look it up. For this, SPSS has a
very useful display option. First activate the Value Labels button. Then click on the right side of the cell
where the value is to be entered: a drop-down list opens from which we choose the label that corresponds to the
value. Displaying the value labels is equally useful in routine data analysis, since the variables that appear
with labels are the qualitative or categorical ones (nominal and ordinal), while for quantitative variables the
numerical value speaks for itself and needs no identifying label.

10 The attribute columns can be rearranged from the menu View / Customize Variable View . Custom attributes can
also be created from the menu Data / New Custom Attribute .
11 You can go from the variable view to the data view by double-clicking on a variable row. Equivalently, you can
go from the data view to the variable view by double-clicking on the column name of a variable.

Graph III.2.3 Identification of survey data: data view

In the data editor you can:

- Insert rows (cases) or columns (variables) by first selecting a row or column to determine the insertion
point and then, through the context menu, clicking on Insert case or Insert variable . These actions can also be
executed through the "Data" menu or through the icons on the toolbar.
- Delete rows (cases) or columns (variables) by selecting one or more rows or columns and pressing
<DEL> or <CTRL>+<X> (or through the "Edit" menu or the context menu).
- Copy rows (cases) or columns (variables) with <CTRL>+<C>, the "Edit" menu or the context menu.
- Paste rows (cases) or columns (variables) with <CTRL>+<V>, the "Edit" menu or the context menu.
- Undo or redo actions through the corresponding toolbar icons.
- Search for values through the toolbar icon or through the "Edit" menu.

Once the data have been entered, or while we are recording them so as not to lose the work done, we must save
them as an SPSS system file, for example with the name Survey.sav 12 . To save a data file:
- Through the menu: File / Save or File / Save As
- With the keyboard: Ctrl+S
- By clicking on the “Save this document” button.

Once the data matrix has been created, we can ask SPSS for the data dictionary information through the menu
File / Display Data File Information , choosing the working file, since we can choose between this file (the one
open in the editor) and an external file saved on disk (Graph III.2.4). This procedure corresponds to the
SPSS syntax command DISPLAY DICTIONARY .
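The missing values recorded in the dictionary determine which data enter the calculations. As a minimal sketch in Python (not SPSS), the function below averages a variable while excluding both user missing values (declared codes) and system missing values, represented here by None. The list of hours is hypothetical and only imitates P5, whose declared missing values are 97 and 99; the function names are also hypothetical.

```python
def valid_values(values, user_missing):
    """Keep values that are neither system missing (None) nor user missing."""
    return [v for v in values if v is not None and v not in user_missing]


def mean_excluding_missing(values, user_missing):
    """Mean of the valid values, or None when no valid value remains."""
    kept = valid_values(values, user_missing)
    return sum(kept) / len(kept) if kept else None


# Hypothetical hours worked (P5): 97 = not applicable, 99 = does not answer,
# None = empty cell (system missing).
hours = [40, 97, 35, 99, None, 20]

print(mean_excluding_missing(hours, {97, 99}))
```

If the codes 97 and 99 were not declared as missing, they would be treated as real numbers of hours and would inflate the average, which is why declaring them in the dictionary matters.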

12 This data matrix can be found on the chapter website.


Graph III.2.4 List of the data dictionary of the survey data matrix

Variable information

Variable  Position  Measurement level  Role   Column width  Alignment  Print format  Write format  Missing values  Label
ID        1         Nominal            Input  2             Right      F3            F3                            Questionnaire number
P1        2         Scale              Input  2             Right      F2            F2            99              Age
P2        3         Nominal            Input  4             Right      F1            F1            8, 9            Sex of the person interviewed
P3_1      4         Ordinal            Input  8             Right      F1            F1            8, 9            Ego's studies
P3_2      5         Ordinal            Input  8             Right      F1            F1            8, 9            Father's studies
P3_3      6         Ordinal            Input  8             Right      F1            F1            8, 9            Mother's studies
P4        7         Nominal            Input  8             Right      F1            F1            9               Last week's employment situation
P5        8         Ordinal            Input  3             Right      F2            F2            97, 99          Hours worked
P6_1      9         Ordinal            Input  6             Right      F1            F1            8, 9            Immigration is one of the main problems in Europe today
P6_2      10        Ordinal            Input  6             Right      F1            F1            8, 9            If we do not control Europe's borders, our Welfare State will be unsustainable
P6_3      11        Ordinal            Input  6             Right      F1            F1            8, 9            Immigration has increased insecurity on the streets
P6_4      12        Ordinal            Input  6             Right      F1            F1            8, 9            The settlement of non-EU immigrants is causing a loss of the labor rights acquired until now
P6_5      13        Ordinal            Input  6             Right      F1            F1            8, 9            It is necessary to implement cooperation policies with countries of origin
P6_6      14        Ordinal            Input  6             Right      F1            F1            8, 9            Immigrants should have the right to vote
P6_7      15        Ordinal            Input  6             Right      F1            F1            8, 9            Immigrants must adapt to the culture of the country where immigration is established
P7        16        Scale              Input  2             Right      F2            F2            98, 99          Left-right ideology

Variables in the working file

Digital Document Dipòsit | UAB http://ddd.uab.cat/record/129381



Variable values

Variable  Value  Label
P1        99a    Does not answer
P2        1      Male
          2      Woman
P3_1      1      No studies, unfinished primary
          2      EGB, elementary baccalaureate, ESO
          3      Higher baccalaureate, BUP, COU
          4      First-grade FP, officers
          5      Second-grade FP, industrial master's degree
          6      University studies
          8a     Does not know
          9a     Does not answer
P3_2      1      No studies, unfinished primary
          2      EGB, elementary baccalaureate, ESO
          3      Higher baccalaureate, BUP, COU
          4      First-grade FP, officers
          5      Second-grade FP, industrial master's degree
          6      University studies
          8a     Does not know
          9a     Does not answer
P3_3      1      No studies, unfinished primary
          2      EGB, elementary baccalaureate, ESO
          3      Higher baccalaureate, BUP, COU
          4      First-grade FP, officers
          5      Second-grade FP, industrial master's degree
          6      University studies
          8a     Does not know
          9a     Does not answer
P4        1      Had a job
          2      Did not work
          9a     Does not answer
P5        97a    Not applicable: did not work
          99a    Does not answer
P6_1      1      Completely agree
          2      Agree
          3      Neither agree nor disagree
          4      Disagree
          5      Completely disagree
          9a     Does not answer
P6_2      1      Completely agree
          2      Agree
          3      Neither agree nor disagree
          4      Disagree
          5      Completely disagree
          9a     Does not answer
…
P6_7      1      Completely agree
          2      Agree
          3      Neither agree nor disagree
          4      Disagree
          5      Completely disagree
          9a     Does not answer
P7        1      Left
          10     Right
          98a    Does not know
          99a    Does not answer

a. Missing value

Likewise, the Codebook procedure (SPSS CODEBOOK command), which is executed from the menu Analyze /
Reports / Codebook, allows us to obtain the dictionary information and summary statistics for the variables
that we choose: counts and percentages for nominal and ordinal variables; and mean, standard deviation and
quartiles for scale variables.

Graph III.2.5 Codebook of some variables of the survey matrix

P2
Standard attributes   Position      3
                      Label         Sex of the person interviewed
                      Type          Numeric
                      Format        F1
                      Measurement   Nominal
                      Role          Input

                      Value  Label                                          Count  Percentage
Valid values          1      Male                                           5      50.0%
                      2      Woman                                          5      50.0%

P3_1
Standard attributes   Position      4
                      Label         Ego's studies
                      Type          Numeric
                      Format        F1
                      Measurement   Ordinal
                      Role          Input

                      Value  Label                                          Count  Percentage
Valid values          1      No studies, unfinished primary                 0      0.0%
                      2      EGB, elementary baccalaureate, ESO             2      20.0%
                      3      Higher baccalaureate, BUP, COU                 3      30.0%
                      4      First-grade FP, officers                       1      10.0%
                      5      Second-grade FP, industrial master's degree    1      10.0%
                      6      University studies                             3      30.0%
Missing values        8      Does not know                                  0      0.0%
                      9      Does not answer                                0      0.0%

P7
Standard attributes   Position      16
                      Label         Left-right ideology
                      Type          Numeric
                      Format        F2
                      Measurement   Scale
                      Role          Input
N                     Valid         10
                      Missing       0
Central tendency      Mean                  5.50
and dispersion        Standard deviation    1.958
                      25th percentile       4.00
                      50th percentile       5.50
                      75th percentile       6.00

                      Value  Label                                          Count  Percentage
Labeled values        1      Left                                           0      0.0%
                      10     Right                                          0      0.0%
                      98     Does not know                                  0      0.0%
                      99     Does not answer                                0      0.0%

The data dictionary, in addition to the variable view tab and the previous procedures, can be consulted at any
time through the Variables icon on the toolbar. Clicking it opens a box that reports the main properties of each
variable.


Finally, the information of a variable can also be consulted within a menu dialog box by right-clicking on the
variable and then clicking on Information about the variable . For example from the Frequencies menu:

[Screenshot: Frequencies dialog box with the context menu of the variable list open; the Variable information option shows the name, label and value labels of the selected variable, here "Last week's work situation".]

Once the data has been identified, one way to check that the work has been done correctly is to request the
frequency tables through the Analyze / Descriptive statistics / Frequencies menu. We select the variables and
move them to the Variables box by clicking on the arrow icon. Finally we run the procedure to obtain the
frequencies by clicking on OK.

Finally, note that the dictionary of one variable can be applied to others through the Data / Copy data
properties menu (SPSS APPLY DICTIONARY command), either from an external data file or from an open
data set.

► Exercise 1. Proposed
From the data matrix created Survey.sav, obtain the frequency tables of the different variables and check the
correct identification of the data.

► Exercise 2. Proposed
With the CIS3041.sav data matrix obtain the data dictionary and the code book for the variables: CCAA,
TAMUNI, P3, P901, P1001, P1101, P1301, P15, P1601, P1701, P18, P2013, P23, P25 , P28, P29, P31, P32,
P46, VOTOSIM, REMEMBRANCE, STUDIES, OCUMAR11, CONDITION and STATUS , which allow us
to recognize the main types of variables and questions of the CIS Barometer.
You can also request the frequency tables of all of them.
Remember that it is useful to have the “Names and labels” option activated for the variables and the “Values
and labels” option activated for the values under “Pivot table labeling”.

To conclude this section, Graph III.2.6 shows the syntax file that carries out the different identification steps
that we have been discussing. This syntax can be found in the Survey.sps file on the website. We briefly
comment on the syntax used.


At the beginning, some comments are introduced that are indicated in the syntax by starting the comment text
with an asterisk ( * ). Before proceeding with the identification, the options that we discussed in the previous
chapter for activating names and labels of the variables and values and labels of the values of the variables are
activated.

If we first enter the data without naming the variables, we have seen that the SPSS system assigns a name by
default. The RENAME VARIABLES command changes the original name to the one we agreed upon.

Graph III.2.6 Syntax for identifying survey data. Survey.sps

Labels are then assigned to the variables (VARIABLE LABELS command) and also to the values of the
variables (VALUE LABELS command). The FORMATS command determines the format of the variables; in
our case all the variables are numeric and are defined with three different widths and without decimals: F1.0,
F2.0 and F3.0. Missing values are specified with the MISSING VALUES command by giving in parentheses,
after each group of variables, the values that the user defines as missing. The measurement level is set with the
VARIABLE LEVEL command: we group the variables into three blocks and assign the three possible levels in
parentheses. VARIABLE WIDTH specifies the width of the column in the data editor and VARIABLE
ALIGNMENT the alignment of the cell values. Finally, the identification of the dictionary is completed with
the role that is assigned to the variables (VARIABLE ROLE command). The syntax program ends with three
more instructions intended to obtain the frequency tables of all the variables (FREQUENCIES command), to
list the dictionary of the variables that we have created (DISPLAY DICTIONARY command) and to produce
the codebook (CODEBOOK command).

1.1.2. Import and export data in SPSS

Data files created in other software with a defined format (SPSS, SAS, Excel,…) or in unformatted plain text
(DAT, TXT) can easily be imported into SPSS. Through the File / Open / SPSS Data menu, with the
<CTRL>+<O> keys, or with the Open button of the data editor, we access a dialog box that allows us
to open a file choosing from a variety of formats:
SPSS Statistics (*.sav)
SPSS Statistics compressed (*.zsav)
SPSS/PC+ (*.sys)
Systat (*.syd, *.sys)
Portable (*.por)
Excel (*.xls, *.xlsx, *.xlsm)
Lotus (*.w*)
Sylk (*.slk)
dBase (*.dbf)
SAS (*.sas7bdat, *.sd7, *.sd2, *.ssd01, *.ssd04, *.xpt)
Stata (*.dta)
Text (*.txt, *.dat, *.csv, *.tab)
All files (*.*)

On the one hand, there are three SPSS formats in addition to the usual sav: one that compresses the data
(zsav), another that opens the old format of the version of the software called SPSS/PC+ (sys), and the
portable format (por), which allows data to be moved between operating systems where SPSS is installed. The
remaining formats refer to other statistical packages such as Systat, SAS or Stata, spreadsheets such as Excel,
Lotus or Sylk, database managers such as dBase, as well as plain text formats, that is, unformatted files where
the data is separated by commas, tabs, spaces,… (txt, dat, csv, tab).

On the web page of this chapter there are the files Data.xlsx , Data.csv and Data.dat , which we will use to
carry out an import exercise. They can be imported directly by opening them and completing the dialog boxes
that appear. In all cases it is the data matrix that we have identified above and saved as Survey.sav , with all
the information coded numerically.

In the case of opening or importing the Excel file Data.xlsx , a dialog box appears to define the data sheet, the
range of the data and to report the existence of a first line with the name of the variables:


After accepting, the data appears in the editor with the names of the variables and the numerical format for all
of them. Therefore, it will be necessary to complete the data dictionary with all the information on labels,
missing values and other formats.

In the case of the Data.csv and Data.dat files, both correspond to a delimited data format, by semicolons in the
first case and by tabs in the second. The import process is similar, we will see it with the first of the files. Once
it opens, this dialog box appears, the first of six:


It displays the layout of the data and determines if it corresponds to any format that we have predefined. We
click on next and the second dialog box appears:

It is determined whether the data is delimited, as in our case, or whether it is arranged in aligned columns with
a fixed width¹⁰. It is also indicated whether the variable names appear in the first row of the file. We go to the
next window:

10 Later (section 1.1.3) we will present the example of importing and identifying data from the Barometer and other CIS surveys whose data
are presented in text format with a fixed column layout.


In this case we configure the import indicating that the data begins in row 2, that each record (row)
corresponds to a case and that it imports all cases. We go to the fourth window:

Here we specify the delimiter, in our case the semicolon, and whether we have textual data that is delimited
between particular characters. Next in the fifth step:

We can change the name of the variables and the type of data format of each of the variables (numeric,
string,...).


Finally we arrive at the sixth and final stage:

In this last box of the wizard we can save the format used for another occasion and choose to immediately
execute the import or convert that action into SPSS command language that will be attached in a syntax
window. To finish, click on Finish .

As in the previous case, we have only imported the data, the names of the variables and part of the possible
formats have been defined. The rest, such as labels or missing values, must be completed immediately.


On the other hand, we may need to export our data from SPSS to other applications. We can also save (export)
our data in different formats. When we do Save or Save As we have these alternatives available in the Save as
type drop-down:

SPSS Statistics (*.sav)
SPSS Statistics compressed (*.zsav)
SPSS 7.0 (*.sav)
SPSS/PC+ (*.sys)
Portable (*.por)
Tab delimited (*.dat)
Comma delimited (*.csv)
ASCII in fixed format (*.dat)
Excel 2.1 (*.xls)
Excel 97 to 2003 (*.xls)
Excel 2007 to 2010 (*.xlsx)
1-2-3 version 3.0 (*.wk3)
1-2-3 version 2.0 (*.wk1)
1-2-3 version 1.0 (*.wks)
SYLK (*.slk)
dBASE IV (*.dbf)
dBASE III (*.dbf)
dBASE II (*.dbf)
SAS v6 for Windows (*.sd2)
SAS v6 for UNIX (*.ssd01)
SAS v6 for Alpha/OSF (*.ssd04)
SAS Version 7-8 for Windows, short extension (*.sd7)
SAS Version 7-8 for Windows, long extension (*.sas7bdat)
SAS Version 7-8 for UNIX (*.sas7bdat)
SAS version 9+ for Windows (*.sas7bdat)
SAS version 9+ for UNIX (*.sas7bdat)
SAS Transport (*.xpt)
Stata versions 4-5 (*.dta)
Stata version 6 (*.dta)
Stata version 7 Intercooled (*.dta)
Stata version 7 SE (*.dta)
Stata version 8 Intercooled (*.dta)
Stata version 8 SE (*.dta)

1.1.3. Import and identification of CIS survey data

Since January 1, 2009, the Center for Sociological Research (http://www.cis.es/)¹³ has made the data files of
its surveys available to interested parties free of charge. The data files are in ASCII format (flat format, TXT or
DAT) and can be downloaded from the CIS website, together with the syntax files for the SPSS and SAS
statistical packages, the questionnaire, the technical sheet, the codebook and the cards, through the address:
http://www.cis.es/cis/opencms/CA/2_bancodatos/. In this manual we use this source of information, which
we consider essential for knowledge of Spanish political and social reality, in addition to being an invaluable
resource for teaching and learning the methodology of quantitative research. For this reason, it is of interest to
know in greater detail the procedure for importing and identifying CIS data in SPSS. We will do so by
presenting the syntax language that executes this task.

Once the file of the data of interest (MDxxxx.zip) has been downloaded (in our case we will refer to study
number 3041, corresponding to the Barometer for October 2014), it is necessary to unzip it and select two of
the files included in the zip file. On the one hand, the DA file followed by the study number contains the raw
data; it can be opened with Notepad or Excel to view its contents. On the other hand, the ES file followed by
the study number is the SPSS syntax file. You can rename it from ESnº to ESnº.sps to open it directly with the
SPSS software and execute the syntax. Finally, the whole syntax is selected and executed, and the data file that
is generated is saved; in our case we save it with the name CIS3041.sav.

13 The Center for Sociological Research (CIS) is an autonomous body dependent on the Ministry of the Presidency of Spain, with the main
function of contributing to the scientific knowledge of Spanish society.

The CIS data are arranged in a fixed column format, that is, each variable is located in specific columns that
affect all individuals and align all the data vertically. The columns occupied by each variable are specified in
the questionnaire by a number in parentheses on the right side of the response categories and in the codebook.


On the website of this chapter you can find the ES3041.sps file provided by the CIS and partially reproduced
in Graph III.2.7. The instruction program can be selected and executed taking care to place the DA3041 data
file in the same working folder of the software.

Graph III.2.7 CIS syntax files for data identification

Alternatively, we have two options to ensure that the data will be located. On the one hand, we can use the CD
command (change directory), which tells the system which is the default working folder (for example
C:\Data), placing it in the first line of the syntax file:

CD 'C:\Data'.

On the other hand, we can specify the file path in the DATA LIST command:

DATA LIST FILE 'C:\Data\DA3041'.


To identify this information we could use the previous procedure applied to Data.csv with the import wizard.
Using the syntax that we discussed, the DATA LIST command is used, intended to define the data, adapting
to its layout and assigning a name and type of format. In the fixed column format, the name of each variable
and the numbers of the columns it occupies are placed. Additionally, the type of format (type, width and
decimals) can be assigned. In this case, the width is given by the columns that each variable occupies and
numerical format is assigned by default to all variables. If we had decimals or the variable had a different
format, it would need to be detailed in the command.

The syntax program is completed by assigning labels to variables (VARIABLE LABELS command),
labels to values (VALUE LABELS command), assigning missing values (MISSING VALUES
command) and requesting frequency tables for all variables (FREQUENCIES command)¹⁴.

1.2. Creation and identification of data with R

1.2.1. Data entry in R

Our first task will be to enter the data and later we will see how to import it into R. We will carry out this task
with Deducer, which will facilitate the creation and identification work in a window environment. To create a
data matrix, if we have just entered Deducer, we will have the option of clicking on New Data in the initial
Data Viewer window, a box will then appear to give it a name that does not contain accents or spaces. We
can name it Survey :

The blank data editor will open :

14 In the CIS3041.sav data matrix we have incorporated a more complete identification of the data, since some variables are not identified
with variable and value labels. In addition, the definition of missing values can be expanded to also consider the “does not know” and
“does not answer” responses, and the level of measurement of the variables has also been defined.


If we were working with other data, from the open editor we will proceed to open a new blank data editor
window using: File / New Data / Data , or with the <CTRL>+<N> keys.

The data files we will usually work with, our data matrices, are identified in R as data frames.

The data editor, which opens at startup or from the console menu, allows you to create or examine a data
matrix from two tabs:

Data View Variable View

In the Data View we will enter the data itself, that is, the codes or values of the variables, while in the
Variable View we will identify their characteristics, their dictionary. We could choose either to start entering
the data or to create the dictionary, but it is advisable to proceed first by entering the data, as they will help us,
in the case of qualitative variables, to automatically generate the dictionary of their values.

In the data viewer, if we click the right mouse button on any row, in addition to copying, cutting and pasting,
we can: insert a new row ( Insert New Row ), delete it ( Remove Row ) and change the name of the row (
Edit Row Name ). From the moment we create a new line, it appears with the value NA ( Not Available ) in
each box that identifies the absence of a value (blank box).
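As a minimal sketch of this behaviour in base R (outside Deducer), appending an empty row to a data frame fills every cell of the new case with NA. The data frame name demo and its two columns are illustrative assumptions, not part of the survey file.

```r
# Illustrative data frame with a single case (names are hypothetical)
demo <- data.frame(ID = 1, P2 = "Male", stringsAsFactors = FALSE)

# Appending an empty row: every cell of the new case is NA
demo[nrow(demo) + 1, ] <- NA

print(demo)
is.na(demo[2, ])   # TRUE for every variable of the new row
```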

If we click the right mouse button on any column, in addition to copying, cutting and pasting, we can: insert a
new empty column (Insert Empty), delete it (Remove), or duplicate it (Duplicate), as well as sort the column
data in ascending or descending order (Sort: Increasing / Decreasing).

Let's consider the responses of the first individual that we suggest in Table III.2.1: 1, 35, Male, University,
EGB, High School, Works, 40, CDagree, Agreement, CDagree, Disagree, CDagree, Agreement, CDagree, 3,
and we will literally enter them in the data viewer, in row 1, as follows:

A name is automatically generated for each variable and the default format is assigned according to the value
we have entered. If we go to the variable viewer, the initial image of the Variable View tab is the following:


The values that we have entered with numerical codes have the Double format while the values with textual
code are identified with the Character format.
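The same distinction can be checked from the R console. The following sketch, in base R, builds a one-row data frame with part of the first respondent's answers quoted above and asks for the storage type of each column; the choice of columns and the use of data.frame() rather than the Deducer editor are assumptions made only for illustration.

```r
# First respondent as quoted in the text (only some variables are shown)
Survey <- data.frame(
  ID = 1, P1 = 35, P2 = "Male", P3_1 = "University",
  P4 = "Works", P5 = 40, P7 = 3,
  stringsAsFactors = FALSE
)

# Storage type of each column: numbers are double, text is character
sapply(Survey, typeof)
```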

We then proceed to modify this identification information and create the data dictionary. To do this, we will
specify the following information in relation to each variable that is available in the rows:

- The name of the variable (Variable): it must begin with a letter or a period; names are case-sensitive, so a
  name written in uppercase is different from the same name in lowercase; they cannot contain accents, ñ, ç,
  blank spaces or any character outside the English standard, nor the symbols of arithmetic operators.

- The type of variable format (Type): the variables of an R data frame can be of different types. In
  particular, we can make the fundamental distinction between:
  - Qualitative or categorical: text or label values (numeric or textual) that represent the group or category
    to which the case belongs. They can be differentiated between nominal (for example sex) and ordinal
    (level of education). In R they are called factors, and in the case of the ordinal level, ordered factors.
  - Quantitative: numerical values with which it makes sense to perform arithmetic. They can be
    differentiated between continuous (body mass index) and discrete (number of children). In R they are
    called double if they have decimals and integer if they represent discrete data.
  When we click on each box in the Type column, a drop-down menu opens that allows us to define the
  format of the variable. Thus, the Deducer statistical package classifies the types of variables into:
  - Character: string (text) variables.
  - Factor: categorical variables that can be nominal or ordinal.
  - Double: continuous quantitative variables.
  - Integer: discrete quantitative variables.
  - Logical: logical or dichotomous variables.
  - Date: date variables.
  - Time: time variables.
  - Other: other types of variables.

- The values of the factor variables (Factor Levels): here we detail the labels or values of the variables that
  we treat as qualitative, with a nominal or ordinal level of measurement, where each label or value of the
  variable must be specified. Labels can be defined and edited by clicking on the cell itself.
  When we create a data matrix it is not necessary to define the factor variable labels beforehand. As we
  will see, the labels are incorporated automatically as the data is entered.

  Each label or value of a qualitative variable that is entered is a text that identifies a category of the
  variable, and the set of labels is ordered according to the order of introduction, either in the factor editor
  or in the data view. This order may be relevant to the characteristics of the variable, and the order in which
  the labels were introduced may not be the one we want. With the arrow buttons we can reorder them; we
  can also add or remove labels with the corresponding buttons.

  When the categories of the variable (levels) can be ordered following a pre-established scale (ordinal
  variable) we will check the Ordered box. They can also be modified through the console in the Data / Edit
  Factor menu.

  Finally, it should be noted that each label is identified in the R system with a consecutive integer value that
  appears in parentheses in each cell of the variable, numerically specifying the order.
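The relationship between labels, their order and the internal integer codes can be sketched in base R as follows. The three answers and the short labels are illustrative, and the code uses factor() directly instead of the Deducer factor editor.

```r
# Answers to the level-of-studies question for three hypothetical cases
studies <- c("University", "EGB", "High School")

# Declare the categories in their substantive order and mark them as ordered
studies_f <- factor(studies,
                    levels = c("EGB", "High School", "University"),
                    ordered = TRUE)

levels(studies_f)       # the labels, in the declared order
as.integer(studies_f)   # internal consecutive codes: 3 1 2
studies_f > "EGB"       # comparisons are meaningful for ordered factors
```

Declaring the levels explicitly avoids depending on the order in which the labels happened to be typed.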

An important aspect in the identification and subsequent treatment and analysis of the data is the absence of
values, the so-called missing values. It is common not to have information about some cases or individuals in
relation to one or more variables, for example in the cases of does not know, does not answer or not
applicable; this is information that is not usually processed. Therefore, in order to carry out the analyses and
their interpretation correctly, missing values must be treated specifically. Unlike other statistical packages,
where specific values can be assigned to each situation and treated in different ways, in R the solution is
drastic: any value that is considered missing is not coded and is treated in a unified way, identified with the
symbol NA (Not Available). In R it is not necessary to assign them any particular value; they simply
correspond to a “hole” of information in the matrix, cells that are left blank and that we recognize because
the letters NA appear¹⁵.
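A minimal sketch of what this means in practice, assuming a variable that arrives coded as in the SPSS dictionary (8 = does not know, 9 = does not answer). The five values are invented for illustration, and keeping the coded original alongside a copy with NA is only one possible way of working.

```r
# Hypothetical coded answers: 8 = does not know, 9 = does not answer
p6_1 <- c(1, 2, 9, 3, 8)

# Keep the original codes and create a copy where 8 and 9 become NA
p6_1_na <- replace(p6_1, p6_1 %in% c(8, 9), NA)

mean(p6_1)                        # misleading: 8 and 9 count as real values
mean(p6_1_na, na.rm = TRUE)       # mean of the valid answers only
table(p6_1_na, useNA = "ifany")   # frequency table including the NA count
```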

With these criteria we proceed to identify the data with the particular properties of each of the variables. The
final result of the data dictionary appears in Graph III.2.8 and the data can be visualized in Graph III.2.9. To
reach this result we have first changed the name of the variables, then we have specified their type and finally
we have coded the data of the factor variables. For coding, the codes available in the image of the variable
viewer tab illustrated in Graph III.2.9 can be used¹⁶.

Graph III.2.8 Identification of survey data: variable view

15 An alternative way to treat these missing values differentially in R is to (1) encode them with a distinct value, (2) create a copy of the
original variable in which the corresponding missing values are blank (NA), and (3) perform the analyses selecting the version of the variable
that is most useful in each case, with or without NA, or combining the information from both.
16 In the case of the factor variables, we follow the criterion of using a synthetic code of a single word, being able to use accents. However,
working with accents in R is problematic and requires giving up the specificity of one's own language in favor of the Anglo-Saxon one, an
aspect that should be reviewed. In the case of the variables, we have taken the name assignment criterion as the question number of the
questionnaire, but the criterion of using a synthetic name that refers to its content can also be followed.

The values or categories of the qualitative variables do not necessarily have to be entered from the variable
viewer: the system can create them automatically as we enter the data in the data viewer tab, and it also
internally assigns them a numerical value that indicates the position of each category of the variable. When
entering data in the Data View, Deducer interprets the type of the variable according to the information
provided, and even changes the type without warning. This can cause problems: if we define a variable as
integer but enter a number with decimals, 2.0 for example, it converts it to double; if we enter a decimal
number with a comma (2,3) instead of a point (2.3), it converts it to character. In R, and therefore in Deducer,
the decimal separator is the point, not the comma. An entered value that contains a comma is not treated as
numeric, but as text.
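These conversions can be reproduced at the console. The sketch below is base R rather than Deducer's editor, and the values are illustrative; it also shows one possible way of recovering numbers that were typed with a decimal comma.

```r
typeof(2L)    # "integer"
typeof(2.0)   # "double": a value written with decimals is stored as double

# Values typed with a decimal comma can only be stored as text
x <- c("2,3", "4,1")
is.character(x)

# One way to recover the numbers: swap the comma for a point, then convert
x_num <- as.numeric(sub(",", ".", x, fixed = TRUE))
x_num
```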

It must also be kept in mind that each value (called level ) of a qualitative variable (which will be of type
factor ) will be each set of different characters introduced. For example, if we write Woman as the value of the
variable Sex for one individual and woman for another, they will be considered different and we will have 2
codes to identify women.
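A short sketch of the problem and of one possible remedy, using illustrative entries; converting everything to lowercase is an assumption made here for simplicity, and any consistent recoding would serve.

```r
sex <- c("Woman", "woman", "Male")   # illustrative entries

# Labels differing only in capitalisation become separate levels
nlevels(factor(sex))

# Harmonising the text before creating the factor merges them
sex_clean <- factor(tolower(sex))
levels(sex_clean)
```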

If the factor variable is measured at the ordinal level (ordered factor), the order of the categories is
important when viewing the information. When the codes are generated automatically as we enter them in the
matrix, the resulting order of the values does not necessarily respect the desired one, and we need to edit the
factor levels to order them according to the direction of each variable.
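Reordering can also be done in code. In this sketch the labels are those used for the agreement items in the example row, the entry order is invented, and levels = unique() is used only to imitate levels created in order of appearance (base R's factor() would otherwise sort them alphabetically).

```r
# Labels in the order in which they were typed (illustrative)
agree <- c("Disagree", "CDagree", "Agreement", "CDagree")

# Levels in order of first appearance, as when typing data into the editor
f_entry <- factor(agree, levels = unique(agree))
levels(f_entry)

# Re-declare the levels in the direction of the scale and mark as ordered
f_scale <- factor(f_entry,
                  levels = c("CDagree", "Agreement", "Disagree"),
                  ordered = TRUE)
levels(f_scale)
as.character(f_scale)   # the data themselves are unchanged
```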

On the other hand, we must bear in mind that if we edit the Factor levels of a qualitative variable and delete
one of the levels by mistake , we will delete the corresponding data from the matrix and they will become NA
(missing values).
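The effect can be imitated in base R by rebuilding a factor without one of its levels. This is a sketch with illustrative values, not the code that Deducer runs internally.

```r
sex <- factor(c("Male", "Woman", "Male"))   # illustrative values

# Rebuilding the factor without the level "Woman" mimics deleting it
sex_dropped <- factor(sex, levels = "Male")

sex_dropped               # the case coded "Woman" is now NA
sum(is.na(sex_dropped))   # number of values lost
```

Keeping a copy of the original variable before editing its levels is a simple safeguard.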

In the case of responses corresponding to missing values, we have followed the criterion of considering the
categories “does not know”, “does not answer” and “not applicable” together and not assigning them a
specific code, which is why they appear without distinction with the symbol NA in the data matrix.

Figure III.2.9 Identification of survey data: data view


Thus, first we enter the data in the Data View as it appears in Graph III.2.9 where 10 cases have been
recorded. Entering the data only involves standing on the corresponding box and entering the data value and
clicking <Enter> or going to another box. Next we modify the name of the variables, we define their types and
in the case of the factor variables we adjust the order of the categories and determine if they are ordinal.

In the data editor you can:


- Copy rows (cases) or columns (variables) with <CTRL>+<C>, with the Edit menu or with the context
menu.
- Cut rows (cases) or columns (variables) with <CTRL>+<X>, with the Edit menu or with the context
menu. This does not delete the row or column.
- Paste rows (cases) or columns (variables) with <CTRL>+<V>, with the Edit menu or with the
context menu. It is necessary to have previously created an empty space if you do not want to overwrite
other cases or variables. The case or variable name is not pasted.
- In the editor we cannot undo or redo any action (if any information is deleted, for example, it cannot be
recovered).
- We cannot perform searches either.

Once the data has been entered, or while we are recording it so as not to lose the work done, we must save it
as a file in the R format, for example with the name Survey.rda 17 . To save a data file, use the File / Save
Data menu, click on the save button, or press <CTRL>+<S>. When saving the data, the dialog opens in the
default working folder ( My Documents ) or in the one we have defined through the File / Set Working
Directory menu (<CTRL>+<D>). It is important to remember that neither the folder path nor the data file name
may contain accented characters.
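The same save operation can be done from the console. This is a minimal sketch with base R; the small data frame is a hypothetical stand-in for the Survey matrix, and a temporary folder is used instead of the working directory.

```r
# Small hypothetical data frame standing in for the Survey matrix.
Survey <- data.frame(ID = 1:3, P1 = c(28, 35, 43))

# Save to an .rda file (temporary folder; no accents in the path or name).
path <- file.path(tempdir(), "Survey.rda")
save(Survey, file = path)

# Remove the object from the workspace and restore it from the file.
rm(Survey)
load(path)
nrow(Survey)     # 3
```

In regular work, getwd() and setwd() play the role of the File / Set Working Directory menu.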

The data identified in a matrix corresponds to cases (rows) and variables (columns). The latter are treated as
objects of the R system workspace . Objects can be viewed through the menu Packages & Data / Object
Browser or by pressing <CTRL>+<B> from the console. This option also allows you to view and edit the
variables, list the data with Print , request summary statistics with Summary or create graphs with
Plot . We can do this for all the variables in the matrix or one by one.

In the case of requesting a summary of the entire Survey data matrix, this result is obtained in the console 18 :

> summary(Survey)

       ID              P1            P2                      P3_1                     P3_2
 Min.   : 1.00   Min.   :28.00   Male :5   Without            :0   Without            :1
 1st Qu.: 3.25   1st Qu.:30.75   Woman:5   EGB                :2   EGB                :3
 Median : 5.50   Median :35.00             FP1                :1   FP1                :1
 Mean   : 5.50   Mean   :34.90             Baccalaureate      :3   Baccalaureate      :3
 3rd Qu.: 7.75   3rd Qu.:39.25             FP2                :1   FP2                :2
 Max.   :10.00   Max.   :43.00             University students:3   University students:0

                 P3_3       P4            P5                 P6_1               P6_2               P6_3
 Without            :2   Works:5   Min.   :10.00   CDisagreement:0   CDisagreement:0   CDisagreement:1
 EGB                :3   No   :5   1st Qu.:16.25   Disagreement :3   Disagreement :1   Disagreement :2
 FP1                :1             Median :22.50   NiAniD       :2   NiAniD       :2   NiAniD       :3
 Baccalaureate      :1             Mean   :23.33   Agreement    :3   Agreement    :4   Agreement    :2
 FP2                :3             3rd Qu.:28.75   CAgreement   :2   CAgreement   :3   CAgreement   :0
 University students:0             Max.   :40.00                                       NA's         :2
                                   NA's   :4

            P6_4               P6_5               P6_6               P6_7             P7
 CDisagreement:0   CDisagreement:1   CDisagreement:0   CDisagreement:0   Min.   :3.00
 Disagreement :3   Disagreement :0   Disagreement :1   Disagreement :1   1st Qu.:4.25
 NiAniD       :2   NiAniD       :1   NiAniD       :3   NiAniD       :3   Median :5.50
 Agreement    :4   Agreement    :3   Agreement    :2   Agreement    :1   Mean   :5.50
 CAgreement   :1   CAgreement   :5   CAgreement   :4   CAgreement   :5   3rd Qu.:6.00
                                                                         Max.   :9.00

17 This data matrix can be found on the chapter website.

18 It corresponds to the summary command that we saw in the previous chapter.

Once the data has been identified, one way to check the correctness of the work carried out is to request the
frequency tables through the Analysis / Frequencies menu. We select the variables and pass them into the Run
Frequencies On box by clicking on the icon. Finally we execute the procedure that extracts the frequencies by
clicking on OK .
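A frequency table can also be requested directly from the console. The sketch below uses only base R functions; the variable p2 and its values are hypothetical.

```r
# Hypothetical qualitative variable (not the book's data).
p2 <- factor(c("Man", "Woman", "Woman", "Man", "Woman"))

freq <- table(p2)                          # absolute frequencies
pct  <- round(100 * prop.table(freq), 1)   # percentages
cbind(Frequency = freq, Percent = pct)     # both columns side by side
```

A level that appears with an unexpected spelling or an unexpected count in such a table usually points to a data entry error.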

- Exercise 3. Proposed
From the data matrix created Survey.rda obtain the frequency tables of the different variables and check the
correct identification of the data.

- Exercise 4. Proposed
With the CIS3041.rda data matrix, obtain the data dictionary and the code book for the variables: CCAA,
TAMUNI, P3, P901, P1001, P1101, P1301, P15, P1601, P1701, P18, P2013, P23, P25 , P28, P29, P31, P32,
P46, VOTOSIM, REMEMBRANCE, STUDIES, OCUMAR11, CONDITION and STATUS , which allow us
to recognize the main types of variables and questions of the CIS Barometer.
You can also request the frequency tables of all of them.

1.2.2. Import and export data in R

If we have data already created by other software with a defined format (SPSS, SAS, Excel,...) or in plain
text without format (DAT, TXT), they can be easily imported into R. Through the File / Open Data menu of
Deducer, with the <CTRL>+<L> keys, or with the open button of the Data Viewer , we access a dialog box that
allows us to open a file choosing from a variety of formats:


All files
R (*.rda *.rdata)
R dput() (*.robj)
Comma separated (*.csv)
Text file (*.txt)
SPSS (*.sav)
SAS export (*.xpt)
DBase (*.dbf)
Stata (*.dta)
Systat (*.sys *.syd)
ARFF (*.arff)
Epiinfo (*.rec)
Minitab (*.mtp)
S data dump (*.s3)
Excel (*.xls *.xlsx)

On the web page of this chapter there are the files Survey.xlsx , Survey.csv , Survey.sav and Survey.txt . If we
open them from Deducer we will see how each one is imported. In the case of the file in Excel format, it will
ask us which spreadsheet to import and then it will create a new data matrix with the name Survey1 19 . The
names of the variables are assigned from the first line of the Excel sheet, and the data encoded as text are
treated as character variables. When we convert them into factor-type variables, the levels or categorical
values will be generated automatically.

Secondly, we can import a csv file, that is, a format where the data is separated by a comma. When you open
the Survey.csv file, this import dialog box appears:

19 If we are in a workspace with the Survey matrix that we have identified.


When loaded into R, the Survey2 data matrix is generated 20 with the data and the names of the variables; the
qualitative variables are already incorporated as factor-type variables with their corresponding values.

If we import the SPSS file Survey.sav , which differs in the way the values of the qualitative variables have
been labeled, we see how the Survey3 matrix is generated. In this case, as in the previous one, the names of
the variables are imported and the qualitative variables are imported as factors with their values 21 .

20 This will be the case if we are in a workspace with the Survey matrix that we identified at the beginning and we have also imported the
Survey.xlsx file from Excel, which was renamed Survey1 .
21 Importing date type variables from SPSS generates problems, so it is better to convert it to Excel format and import it from there.

Finally we can import a plain text file like Survey.txt where the data is separated by tabs. The results are
similar to those of the imported matrix Survey2.

We can also save (export) our data in different formats. In this case, the available format options are fewer but
sufficient to take them to any other application:
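Import and export of plain-text formats can also be done from the console with base R. The following is a minimal sketch: the data frame is hypothetical, the files are written to a temporary folder, and the object name Survey4 is an assumption made for illustration. Formats such as SPSS or Excel require additional packages and are not shown.

```r
# Hypothetical data written to a temporary CSV file and read back.
df <- data.frame(ID = 1:3, P2 = c("Man", "Woman", "Woman"))
csv_path <- file.path(tempdir(), "Survey.csv")
write.csv(df, csv_path, row.names = FALSE)            # export

# stringsAsFactors = TRUE converts text columns into factors on import.
Survey2 <- read.csv(csv_path, stringsAsFactors = TRUE)
str(Survey2)

# Tab-separated plain text works the same way.
txt_path <- file.path(tempdir(), "Survey.txt")
write.table(df, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
Survey4 <- read.delim(txt_path, stringsAsFactors = TRUE)
```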

2. Data transformation

The task of data transformation is intended to adapt the data to the needs of the analysis. It involves
modifying, correcting and changing the initially existing information, either in relation to the variables of
a single data file or in relation to the joint treatment of several files, and generating new variables based
on existing ones: groupings, typologies, indices, etc. As in the previous section, we will present the
transformation procedures for SPSS and R in two different subsections.

2.1. Data transformation with SPSS

We will comment on the different procedures presented in two SPSS menus: Data , intended for processing files,
either internally or in combination with others, and Transform , intended for the transformation of variables
and the creation of new ones.


2.1.1. File processing with SPSS

We will distinguish two types of file management and transformation procedures, those intended for the
processing of data within a file and the processing of data between files that are related. The SPSS commands
that we will discuss are those in Table III.2.2.

Table III.2.2 File treatment procedures


Data menu                        SPSS commands

Data processing procedures inside a file
  Sort variables                 SORT VARIABLES
  Sort cases                     SORT CASES
  Select cases                   FILTER, SELECT IF, SAMPLE
  Segment file                   SPLIT FILE
  Weight cases                   WEIGHT
  Aggregate                      AGGREGATE
  Transpose                      FLIP
  Restructure                    CASESTOVARS, VARSTOCASES

Data processing procedures between related files
  Split into files               SPSSINC SPLIT DATASET
  Merge files                    MATCH FILES, ADD FILES


2.1.1.1. Data processing inside a file

Sort variables

The SORT VARIABLES command ( Data menu / Sort Variables ) sorts the variables in the data matrix based
on the values of any of the variable attributes in the data dictionary, in ascending or descending order:

It is advisable to save the previous order of the variables first, because that order usually does not follow
any pre-established criterion and could be difficult to restore.

Sort cases

The SORT CASES command ( Data menu / Sort cases ) allows the reordering of the cases in the active file
according to the values of one or more variables (up to 10), numerical or alphanumeric (string; for
these the order is alphabetical). Cases can be reordered in ascending (the default) or descending order.

With the CIS3041.sav data matrix we see that the cases are initially ordered according to the questionnaire
number ( CUES variable). As an exercise we can organize the file according to the location of the interview. A
first criterion would be, for example, ordering the file according to the Autonomous Community (variable
CCAA ) in ascending order:


Note the changes in the data file. If we want to be more precise, we can put, in addition to the CCAA variable,
the variable of the province ( PROV ) and the municipality ( MUN ), all in ascending order. We will introduce
them in this order:

There is an option to save the reordered cases to a different file, with the possibility of creating an index.
Sorting a small file is instantaneous, but with files with millions of records it can take minutes. In this sense, it
is very useful to have the database sorted according to a criterion if it is used regularly. We will also see that
the organization of a file is a necessary prior step in various data processing procedures.
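For comparison, sorting by several keys can be sketched in base R with order(). The variable names CUES, CCAA, PROV and MUN follow the text; the values are hypothetical and do not come from the CIS3041 file.

```r
# Hypothetical cases; only the variable names follow the text.
cases <- data.frame(
  CUES = 1:5,
  CCAA = c(9, 1, 9, 1, 13),
  PROV = c(8, 41, 8, 4, 28),
  MUN  = c(19, 91, 15, 13, 79)
)

# order() returns the row permutation; ties on CCAA are broken by PROV, then MUN.
sorted <- cases[order(cases$CCAA, cases$PROV, cases$MUN), ]
sorted$CUES      # 4 2 3 1 5
```

The original data frame is left untouched, so the initial order (by CUES) can always be recovered.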

Select cases

Often, when working with a database we are interested in obtaining information about individuals who satisfy
certain conditions. We may be interested, for example, in studying various variables but only for individuals
with certain characteristics: women, those who plan to vote, those who have a low level of income, etc. SPSS
allows us to select individuals that satisfy a certain condition so that, from that moment on and as long as we do
not undo the selection, all the procedures we apply will refer only to the selected individuals. This is what
happens by default when we choose If condition is satisfied with the Discard unselected cases option in the
Data / Select cases dialog:


This operation corresponds to the FILTER command. In addition to this procedure, it is possible to extract a
random sample of cases ( SAMPLE command), select a range of cases ( USE command), and use filter
variables. In any of these cases we can choose to:
- Discard unselected cases: the data is filtered, that is, the unselected cases remain in the file but are
excluded from the analysis and can be recovered. This is the usual way of working.
- Copy the selected cases to a new data file.
- Delete unselected cases: unselected cases are deleted from the active file (the one in the system's temporary
memory). The original file is kept on disk, but if after making the selection we save the file with the same
name, we will permanently lose the unselected cases.

As an exercise we can select the cases of the people interviewed who are women. We choose If the condition is
satisfied and click on the If... button.
In the new dialog box we will build the condition 20 . We select the sex variable ( P31 ) and move it to the right.
To select the women we will write, with the keyboard or with the buttons in the dialog box: = 2 . The value 2
corresponds to women. If we do not remember the code, an immediate way to consult it is to
right-click on the variable and click on Variable information :

Built condition:

20 We will also see this dialog box in the Calculate procedure to transform the data. To establish a condition
it is necessary to handle transformation expressions, which we will discuss in the following section.

We will click on Continue and on Accept in the following dialog box to carry out the action, making sure that
the Discard option is activated. If we now look at the database, we will see that some “crossed out” cases
appear in the left margin of the case numbering: they are the cases that have not been selected, that is, the male
individuals.

Note also that a new filter variable has been automatically created, the last one in the data matrix, called
filter_$ that takes the values 0 and 1 with labels Not selected and Selected , respectively, depending on whether
the individual has been selected or not. Also note that a label with the inscription Filter activated appears at the
bottom right of the SPSS window. It reminds us that the data file we are working with has been filtered, that is,
it reminds us that we are not working with all the data but only with those that satisfy a certain characteristic.
The annotation of the syntax commands has also appeared in the results file indicating that the cases have been
filtered.

If we now calculate, for example, the frequency table of any variable, the information obtained will refer only
to the women in our database. It is very important that, once we have carried out the study we wanted to do
with only a part of the individuals, we remember to undo the selection to work again with the complete file. If
we did not do so we would be obtaining erroneous information. To do this we would return to the selection
menu and mark the All cases option.
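The logic of a filter, where unselected cases stay in the file but are left out of the analysis, can be sketched in base R. The data are hypothetical; P31 is coded 1 = man and 2 = woman as in the text, and the name filter_ is an assumption that mimics the SPSS filter variable.

```r
# Hypothetical data; P31 is coded 1 = man, 2 = woman as in the text.
d <- data.frame(P31 = c(1, 2, 2, 1, 2), P32 = c(40, 35, 52, 28, 61))

# Filter variable: 1 = selected, 0 = not selected (all cases stay in d).
d$filter_ <- as.integer(d$P31 == 2)

# Analyses on the selection use the subset; d itself is untouched.
women <- d[d$filter_ == 1, ]
mean(women$P32)   # mean age of the selected cases only
nrow(d)           # still 5: nothing was deleted
```

Because the full data frame d is never modified, returning to all cases simply means using d again.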

If the selection procedure that we have just carried out had been executed with the Delete
unselected cases option, then we would be executing another SPSS command, the one that corresponds to
SELECT IF 22 .

If we wanted to extract a random sample of cases we would specify in its dialog box an approximate % or a
given number of cases:

In the case of defining a range of cases, the dialog box would be as follows:
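As a rough R analogue of these two options, a random sample and a range of cases can be drawn with base functions. This is only a sketch: the data frame and the object names exact, approx and in_range are hypothetical.

```r
set.seed(123)                       # makes the random draw reproducible
d <- data.frame(case = 1:100)       # hypothetical file with 100 cases

exact    <- d[sample(nrow(d), 10), , drop = FALSE]     # exactly 10 cases
approx   <- d[runif(nrow(d)) < 0.10, , drop = FALSE]   # approximately 10% of cases
in_range <- d[20:30, , drop = FALSE]                   # cases 20 to 30
```

The approximate version keeps each case with probability 0.10, so the number of selected cases varies from draw to draw.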

22 When creating syntax programs, you can write the SELECT IF command preceded by TEMPORARY , thus applying a temporary
selection that affects only the next procedure command, then all cases are considered again.

All of these procedures correspond to transformation commands, that is, commands that are not carried out
(they do not trigger a reading of the data) until a command that forces the data to be read is found
(any analysis procedure, for example). When these commands are executed through the menu, their action
is carried out immediately because an additional command, EXECUTE , is attached to the execution, as can be
seen in the results file. EXECUTE forces the reading of the data and performs all the transformations
pending up to that moment 23 .

Segment file

Another common need when processing data in a file is to segment it, that is, divide it into groups of
individuals according to the values of one or more grouping variables to perform the same type of analysis that
will be repeated within each group. In order to perform the segmentation correctly, it will be necessary to
previously sort the file. SPSS offers us two different ways to segment the file:
- Compare the groups : the groups are presented together so they can be compared in a single table or with
individual graphs presented together.
- Organize results by groups : the results of each procedure are displayed separately for each group.

The segmentation command is SPLIT FILE ( Data menu / Split File ). The initial dialog is:

23 See the section on the SPSS command language in the previous chapter where the concept of program states is explained.


In it we can see that the sex variable ( P31 ) has been entered as the segmentation variable and that the
default option Compare groups is marked. If our data file is not sorted by the segmentation variable, we will
mark the option to sort it first, since sorting is a necessary condition for grouping the individuals. We will
execute this transformation of the file and we will see that a label with the inscription Divide by appears in
the lower right part of the SPSS window.

From that moment on, every analysis exercise that we execute will be carried out for each group. For example,
we can request the descriptives of the variables through the menu Analyze / Descriptive statistics / Descriptives
of the variables P901 to P907:

The result is as follows:


Descriptive statistics

P31 Sex of the
person interviewed                                                        N     Minimum  Maximum  Mean   Std. deviation
1 Man     P901 The family                                                 1208  0        10       9.56   1.134
          P902 Friends                                                    1203  0        10       8.16   1.833
          P903 Free time                                                  1193  0        10       7.76   1.978
          P904 Politics                                                   1199  0        10       4.03   3.262
          P905 Work                                                       1193  0        10       8.78   2.096
          P906 Religion                                                   1193  0        10       3.31   3.198
          P907 Associations, clubs and other associative activities       1145  0        10       5.11   2.760
          Valid N (per list)                                              1110
2 Woman   P901 The family                                                 1266  0        10       9.76   0.832
          P902 Friends                                                    1258  0        10       8.18   1.878
          P903 Free time                                                  1244  0        10       7.86   1.902
          P904 Politics                                                   1241  0        10       3.79   3.313
          P905 Work                                                       1243  0        10       8.92   1.935
          P906 Religion                                                   1251  0        10       4.53   3.507
          P907 Associations, clubs and other associative activities       1168  0        10       5.09   2.859
          Valid N (per list)                                              1126


The result is a single table with the analysis carried out for men and women. If we run the procedure again
with the Organize results by groups option, we will obtain the same information but in separate tables.
Sex of the person interviewed = Man

Descriptive statistics (a)
                                                                N     Minimum  Maximum  Mean   Std. deviation
P901 The family                                                 1208  0        10       9.56   1.134
P902 Friends                                                    1203  0        10       8.16   1.833
P903 Free time                                                  1193  0        10       7.76   1.978
P904 Politics                                                   1199  0        10       4.03   3.262
P905 Work                                                       1193  0        10       8.78   2.096
P906 Religion                                                   1193  0        10       3.31   3.198
P907 Associations, clubs and other associative activities       1145  0        10       5.11   2.760
Valid N (per list)                                              1110

a. P31 Sex of the person interviewed = 1 Man

Sex of the person interviewed = Woman

Descriptive statistics (a)
                                                                N     Minimum  Maximum  Mean   Std. deviation
P901 The family                                                 1266  0        10       9.76   0.832
P902 Friends                                                    1258  0        10       8.18   1.878
P903 Free time                                                  1244  0        10       7.86   1.902
P904 Politics                                                   1241  0        10       3.79   3.313
P905 Work                                                       1243  0        10       8.92   1.935
P906 Religion                                                   1251  0        10       4.53   3.507
P907 Associations, clubs and other associative activities       1168  0        10       5.09   2.859
Valid N (per list)                                              1126

a. P31 Sex of the person interviewed = 2 Woman

This option has various applications, but one of them could be to prepare the statistical annex with numerous
tables and graphs that we want to repeat, for example, for each territory of the study separately.

Here again it is important to remember that once we have carried out the desired analysis, it is necessary to
undo the segmentation to work again with the entire file, as a single sample. To do this, we return to the menu
and mark Analyze all cases .
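The idea of repeating the same analysis within each group can be sketched in base R. The scores are hypothetical; P31 and P901 follow the variable names in the text, and the two results only loosely mirror the two SPSS presentation options.

```r
# Hypothetical scores; P31 (sex) and P901 follow the variable names in the text.
d <- data.frame(
  P31  = factor(c(1, 2, 1, 2, 2), labels = c("Man", "Woman")),
  P901 = c(9, 10, 8, 10, 7)
)

# Similar to "Compare groups": one table with a row per group.
by_group <- aggregate(P901 ~ P31, data = d, FUN = mean)
by_group

# Similar to "Organize results by groups": one separate result per group.
separate <- lapply(split(d$P901, d$P31), summary)
separate
```

Here the grouping is an argument of each call, so there is no segmentation to undo afterwards.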

Weigh cases

Data weighting is another of the recurring needs of quantitative data analysis. If the cases are weighted, what
we do is change the weight that each case has. By default, each individual is worth one unit and the count of
any characteristic, for example being a man, is the sum of as many 1s as individuals have that value. But the
value of each individual's weight can be changed, and this means changing an internal SPSS system variable
named $weight . This internal variable always has a value of 1 for each individual until we change it with the
WEIGHT weighting command or through the Data / Weight cases menu.

The need to weight can arise in different situations. We will comment on three of them. A first very common
situation has to do with the need to weight the data of a sample, either because of the sample design itself 24
or because the sample needs to be rebalanced after certain imbalances or biases have been detected in the
information collected. Let's imagine, for example, that the population proportion of men and women in a
territory is 50 and 50 percent, and that in a sample of that population we obtain 48 and 52. Our results will
be biased in favor of the profiles of women, who weigh 2 percentage points more than they should. To correct
this deviation and restore the 50% of the population in the sample, it is necessary to introduce a weighting
that converts the weight of men from 48 to 50 and that of women from 52 to 50.

If our sample is 1000 individuals, which implies that we have 480 men and 520 women, the weighting is
generated by applying the following formula:

$$w_i = \frac{\text{theoretical weight}}{\text{actual weight}}$$

In the case of men ( i=1 ), theoretically they should be 50%, that is, 500 individuals, but the real weight is
480, which means that we must increase the importance of men by multiplying each individual by a value
greater than 1, specifically, 1.042:

$$w_{\text{men}} = \frac{500}{480} = 1.042$$

The same reasoning in the case of women ( i=2 ), who should be 500 but are 520, generates a weight less than 1,
of 0.962:

$$w_{\text{women}} = \frac{500}{520} = 0.962$$

If we multiply each man by 1.042 instead of 1 and each woman by 0.962 instead of 1, in the final count we will
have 500 men and 500 women. To make it effective in SPSS it is necessary to first create the weighting
variable and then weight. We will see in the next section how to generate variables. If we did it by syntax it
would be, for example, like this:

IF (sex=1) weight=1.042.
IF (sex=2) weight=0.962.
WEIGHT BY weight.
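The formula w_i = theoretical weight / actual weight can be checked numerically. The following base R sketch uses the counts of the example (480 men and 520 women in a sample of 1000, against a theoretical 500 and 500); the object names are hypothetical.

```r
# Sample of 1000 cases: 480 men (code 1) and 520 women (code 2), as in the example.
sex <- c(rep(1, 480), rep(2, 520))

observed <- table(sex)                               # actual counts: 480 and 520
target   <- c("1" = 0.5, "2" = 0.5) * length(sex)    # theoretical counts: 500 and 500

# w_i = theoretical weight / actual weight, one value per category.
w <- target / as.numeric(observed[names(target)])
round(w, 3)

# Weighted counts: sum of the individual weights within each category.
weight <- w[as.character(sex)]
tapply(weight, sex, sum)
```

Computing the weights from the counts, instead of typing rounded values, guarantees that the weighted totals reproduce the theoretical distribution.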

The weighting command is very simple, and its dialog box is the following, where it is only a matter of
choosing the weighting variable:

We will do a second weighting exercise with data whose units are aggregated. This is the case of the matrix on
the human development index IDH2014.sav where each unit is a country. When we work with this file, if we
do not weight the cases, all countries have the same weight, regardless of their population, area, etc. Sometimes
we will be interested in working with the file in this way, but in other cases it may be wrong. If we want to
analyze, for example, what the world gross domestic product per capita is, we cannot give the same weight to
Andorra (0.08 million) as to China (1,385.57 million). In this case it would be convenient to give each country
a different weight according to its population, proportional to the number of people living in the country.

24 Weighting is sometimes also accompanied by the need to gross up the sample, that is, to express the
individuals in the sample in population terms, so that each individual is multiplied by the number of people it
represents in the population. This is the case, for example, of the data from the Active Population Survey.
Rebalancing and grossing up are two different weightings that can be applied simultaneously or separately.


We will begin by calculating the average of the GDPpercapita variable ( Gross Domestic Product per
capita ) without weighting the cases. We obtain the following result:

Descriptive statistics
                                 N      Mean
GDPpercapita GDP per capita      180    16496.9136
Valid N (per list)               180

$16,497 is an average where the individuals are countries. We have calculated the average of the wealth of
each country giving equal weight to all countries. Therefore it is not an exact reflection of the world gross
domestic product per capita. To calculate it we must give each country a weight proportional to its population.
We weight through the menu Data / Weight cases / Weight cases by and choose the Population variable,
which gives us the population of each country in millions. The new calculation of the average gives this result:
Descriptive statistics
                                 N      Mean
GDPpercapita GDP per capita      6951   13552.3587
Valid N (per list)               6951

Note that the average has now dropped to $13,552. Before, N was 180 (countries); now it is 6,951 (the world
population in millions). This result approximates the world GDP per capita much better because it takes into
account that the most populated countries are mostly less rich, which is why the world average drops.
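The difference between the two averages can be sketched in base R with weighted.mean(). The three countries and their figures are hypothetical and purely illustrative; they are not taken from the IDH2014 file.

```r
# Three hypothetical countries (illustrative figures, not the IDH2014 data).
gdp_pc     <- c(40000, 10000, 4000)   # GDP per capita
population <- c(1, 100, 1000)         # population in millions

mean(gdp_pc)                             # each country counts the same
weighted.mean(gdp_pc, w = population)    # each country counts by its population
```

With these figures the weighted average is much lower than the simple one, because the most populated country is also the least rich.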

Once an analysis has been carried out with weighted cases, we must remember to undo the weighting if we no
longer need it. Otherwise we would obtain incorrect information. To do this, we return to the menu: Data /
Weight cases / Do not weight cases .

The weighting command can also be used instrumentally to reproduce frequency tables of one or more
variables. For example, if we enter the website of the National Institute of Statistics and consult the data from
the Active Population Survey for the 4th quarter of 2014, we can see, among many other data, that the
distribution of the population according to the level of education achieved is as follows:


Active Population Survey
Population in family homes
Population aged 16 and over by level of training achieved. Units: thousands of people

                                                                    Total, 2014Q4
Total                                                                    38,523.4
Illiterate                                                                  727.2
Incomplete primary studies                                                2,627.3
Primary education                                                         5,812.7
First stage of secondary education and similar                           10,896.9
Second stage of secondary education, with general orientation             5,083.5
Second stage of secondary education, with vocational orientation          2,745.0
Higher education                                                         10,630.8
Source: National Institute of Statistics, EPA 2014

The survey data refer to the entire population and are expressed in thousands of people. In total, the
population aged 16 and over is 38,523,400 people, distributed across the 7 categories of training level. If we
want to work with this data, for example to extract a table of relative frequencies or to create a graph, we
can enter two variables in a blank data window: one with the different levels of education ( training variable)
and another with the number of individuals in each category ( frequency variable), which is the variable we
will use to weight the cases.

The SPSS data window would look like this:

Once this is done, the cases are weighted according to the frequency variable. A label reading Weighting On
will appear at the bottom right of the SPSS window. From that moment on, the number of cases we have, 7,
where each case was worth 1, after weighting, becomes the number of cases indicated in the frequency column,
and in total the 38 and a half million in the original table. We can execute the Frequencies procedure for the
training variable and obtain the EPA table reproduced:


training
                                                                                                     Valid       Accumulated
                                                                            Frequency  Percentage    percentage  percentage
Valid  1. Illiterate                                                              727         1.9         1.9         1.9
       2. Incomplete primary studies                                             2627         6.8         6.8         8.7
       3. Primary education                                                      5813        15.1        15.1        23.8
       4. First stage of secondary education and similar                        10897        28.3        28.3        52.1
       5. Second stage of secondary education, with general orientation          5084        13.2        13.2        65.3
       6. Second stage of secondary education with vocational orientation        2745         7.1         7.1        72.4
       7. Higher education                                                      10631        27.6        27.6       100.0
       Total                                                                    38523       100.0       100.0
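The percentages of this table can be recomputed from the aggregated counts. The sketch below uses base R and the EPA figures for 2014Q4 in thousands of people; the object names are hypothetical.

```r
# Thousands of people by level of training (EPA, 2014Q4).
frequency <- c(727.2, 2627.3, 5812.7, 10896.9, 5083.5, 2745.0, 10630.8)
training  <- 1:7                      # category codes 1 to 7

percent    <- round(100 * frequency / sum(frequency), 1)
cumulative <- round(100 * cumsum(frequency) / sum(frequency), 1)
data.frame(training, frequency, percent, cumulative)
```

The sum of the seven categories is 38,523.4, which matches the total of the published table.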

And a pie chart, for example:

[Pie chart of the training variable, with one sector for each of the 7 levels of training, from 1. Illiterate
to 7. Higher education]

Aggregate

The aggregation of cases has multiple uses in the processing of data matrices, including when various
databases are related. It is especially useful when we have information in different matrices with different
levels of aggregation, as in the case of having information on individuals and households in the Active
Population Survey , or having multiple records of the working life of the same individual for whom we have
sociodemographic information in another database, as in the Social Security Continuous Sample of
Working Lives .

We will see a simple application exercise to show how the procedure works. We will aggregate the individuals
interviewed in the CIS survey according to their Autonomous Community, calculating a summary measure
(the average) of the variables P901 to P907 (Importance of various aspects of social life), P30 (Personal
Happiness Scale) and P32 (Age).

Aggregation is performed with the AGGREGATE command ( Data menu / Aggregate ). In the dialog box we must
first determine the variable or variables that act as segmentation, that is, that define the aggregation
groups. In our case we choose the Autonomous Community; therefore, we will have 19 groups.


Within each group we can calculate different summary measures. To do this, we first choose the variables of
interest and pass them to the Aggregated variables box. SPSS automatically chooses the mean as the summary
measure, but we can change it by selecting one or more variables and then clicking on Function , which opens
the dialog box that allows us to choose the function. In our case we will leave the mean. Each new
calculation generates a variable that can be given a specific name and a label; otherwise SPSS
applies the Statistical_variable_name criterion. An additional option allows adding a variable with the
number of cases in each group, which by default has the name N_BREAK .

Once the calculations are defined, we can choose among three alternatives:

- Add aggregated variables to the active data set . The new calculated group variables are an attribute of each
unit of the original database, so each case with the same segmentation values receives the same values for
the new aggregated variables.
- Create a new data set containing only the aggregated variables . A new data set is created in the current
session with the aggregation variables, where the units are the aggregated groups.
- Write a new data file containing only the aggregated variables . It is the same as the previous case, but the
aggregated data is saved in an external data file that must be specified.

In our exercise we choose the second option and obtain a data matrix with 19 rows, one for each Autonomous
Community, and 10 new variables: the means of P901 to P907 , P30 and P32 , plus N_BREAK .
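
The menu choices described above can also be written as syntax. The following is a minimal sketch rather than the authors' own code: CCAA is a hypothetical name for the Autonomous Community variable, agregado is an arbitrary data set name, the names given to the aggregated variables are illustrative, and the seven importance items are assumed to be named P901, P902, ..., P907.

```spss
* Sketch only, CCAA and agregado are assumed names that may differ in the real file.
GET FILE='CIS3041.sav'.
DATASET DECLARE agregado.
AGGREGATE
  /OUTFILE='agregado'
  /BREAK=CCAA
  /P901_mean=MEAN(P901)
  /P902_mean=MEAN(P902)
  /P903_mean=MEAN(P903)
  /P904_mean=MEAN(P904)
  /P905_mean=MEAN(P905)
  /P906_mean=MEAN(P906)
  /P907_mean=MEAN(P907)
  /P30_mean=MEAN(P30)
  /P32_mean=MEAN(P32)
  /N_BREAK=N.
* Switch to the new data set to inspect the 19 aggregated rows.
DATASET ACTIVATE agregado.
```

Replacing the OUTFILE specification with /OUTFILE=* MODE=ADDVARIABLES would correspond to the first alternative, in which the aggregated variables are added to the active data set.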

In this procedure it is also necessary to previously have the cases of the original matrix ordered according to the
segmentation variables.

Transpose

Transposing a matrix involves converting the cases (rows) into variables, and the variables (columns) into
cases. Doing so creates a new data file and automatically names the variables.

To illustrate this command, FLIP ( Data / Transpose menu), and those that follow, we will work with small
data matrices that make each task easier to see. The data matrix X.sav contains the employment situation of 6
salaried individuals on 2 variables describing their employment conditions: Contract and Salary .


In the menu we pass all the variables to the box on the right and execute:

The result obtained is the following:

   CASE_LBL    var001   var002   var003   var004   var005   var006
1  ID            1.00     2.00     3.00     4.00     5.00     6.00
2  Contract      1.00     2.00     1.00     2.00     1.00     1.00
3  Salary     1200.00  1000.00  3000.00  1000.00  1200.00  1500.00
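
The same transposition can be requested with syntax. This is a minimal sketch that assumes X.sav is in the current working directory and that its three variables are named ID, Contract and Salary, as in the result shown above.

```spss
* Sketch only, assumes X.sav is in the current working directory.
GET FILE='X.sav'.
* Rows become columns, and the original variable names are stored in CASE_LBL.
FLIP VARIABLES=ID Contract Salary.
```

Because FLIP replaces the active data set with the transposed one, it is prudent to work on a copy of the file.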

Restructure

A simple cases-by-variables data matrix is the usual structure for data analysis; however, the initial
structure of a database can be more complex. The matrix X.sav , with 6 individuals and 2 variables on
employment conditions, is an example of a simple structure. When the information of a variable is spread over
more than one column, or the information of a case over more than one row, the organization of the information
becomes more complex and the file must be restructured, turning cases into variables or variables into cases.

For example, suppose we have a matrix with 3 individuals whose employment conditions refer to two moments in
time: initial employment and current employment. The information can be arranged by rows, so that each
individual has two records of employment conditions, the initial and the current. The casestovars.sav data
matrix holds this information:

   ID  Moment   Contract   Salary
1   1  Initial  Temporary    1000
2   1  Current  Permanent    1200
3   2  Initial  Permanent    1500
4   2  Current  Temporary    1000
5   3  Initial  Permanent    2000
6   3  Current  Permanent    3000


In this case we may want to move the information from the rows to the columns, so as to have 3 cases and 4
variables (the contract and the salary at both moments). To do this, we run the restructuring procedure
through the Data / Restructure menu ( CASESTOVARS command) and choose the option Restructure selected cases
into variables :

In the next window we choose the identification variable of the group of cases, in our case ID :

In the original data, a variable appears in a single column; in the new data file, it will appear in several
columns. Index variables are existing variables used to create the new columns: the restructured data will
contain a new variable for each unique value of the index variables. In this case we do not use them. In step
3 of the wizard we choose the default option of sorting the data by the identification variable (in fact it
matches the current order):


In the fourth step we decide how to order the variables in the new matrix: we choose to group them by index,
and we request a variable with the number of cases ( Ncases ):

Finally, the procedure is executed directly or converted into syntax:


The result is the following matrix:

   ID  Ncases  Moment.1  Contract.1  Salary.1  Moment.2  Contract.2  Salary.2
1   1       2  Initial   Temporary       1000  Current   Permanent       1200
2   2       2  Initial   Permanent       1500  Current   Temporary       1000
3   3       2  Initial   Permanent       2000  Current   Permanent       3000
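
The wizard choices described above correspond roughly to the following syntax. It is a minimal sketch that assumes casestovars.sav is in the current working directory; the sort mirrors step 3 of the wizard.

```spss
* Sketch only, assumes casestovars.sav is in the current working directory.
GET FILE='casestovars.sav'.
SORT CASES BY ID.
* One output row per ID, new columns grouped by index, plus a count of the original rows.
CASESTOVARS
  /ID=ID
  /GROUPBY=INDEX
  /COUNT=Ncases.
```

Since no index variable is named, every remaining variable that changes within an ID ( Moment , Contract and Salary ) is spread into numbered columns, which is why the result contains Moment.1 and Moment.2 .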

If we find ourselves in the reverse situation, with information in the columns that we want to move to the
rows, as in this layout of the casestovars.sav data:

   ID  Contract1  Salary1  Contract2  Salary2
1   1  Temporary     1000  Permanent     1200
2   2  Permanent     1500  Temporary     1000
3   3  Permanent     2000  Permanent     3000

The process is similar. In this case we choose the option Restructure selected variables into cases
( VARSTOCASES command). In step 2 we choose to restructure by groups of variables, since we have two: the 2
contract variables and the 2 salary variables. In step 3 we make the following adjustments: for the
identification of the case groups we choose the option Use selected variable and move the variable ID ; for the
variables to transpose, we first rename the first group, trans1 , to Contract and move the variables Contract1
and Contract2 into it; we then do the same with trans2 , which we name Salary and to which we move Salary1
and Salary2 :


In the fourth step we leave the option of creating a single index variable. In the fifth we leave the default
option of creating sequential numbers and change the name of the variable Index1 to Moment :

In the sixth step we leave the default options, and in the last one we click on Finish. The result is a data
matrix with this layout:

ID Moment Contract Salary


1 1 1 Temporary 1000
2 1 2 Permanent 1200
3 2 1 Permanent 1500
4 2 2 Temporary 1000
5 3 1 Permanent 2000
6 3 2 Permanent 3000
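
In syntax, the steps of the wizard can be sketched as follows. The sketch assumes that the wide matrix is already the active data set and that its variables are named ID, Contract1, Contract2, Salary1 and Salary2, as in the text.

```spss
* Sketch only, assumes the wide matrix is the active data set.
* Each MAKE subcommand stacks one group of columns into a single new variable.
VARSTOCASES
  /MAKE Contract FROM Contract1 Contract2
  /MAKE Salary FROM Salary1 Salary2
  /INDEX=Moment(2)
  /KEEP=ID
  /NULL=KEEP.
```

The index variable Moment takes the sequential values 1 and 2, matching the layout shown above.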

2.1.1.2. Data processing between files that are related

We will now see other data matrix manipulation tasks that involve relating two or more files: division and
merging.


Split into files

This procedure works in a similar way to segmentation, but its function is to save the divisions in new
files, which is of special interest when we need to run different procedures for each segmentation group.
The SPSSINC SPLIT DATASET extension command ( Data menu / Split into Files ) performs this task. As an
exercise we will take the casestovars.sav matrix and divide it into the information from the initial moment
and from the current moment. We therefore specify Moment as the segmentation variable and indicate the
folder where the data will be saved:

We complete the procedure by clicking on Options and choose to name the output files according to the
segmentation variable labels.

After clicking Continue and Accept, the two matrices are obtained: Initial.sav and Actual.sav with three cases
each.


Merge files

Merging or joining files gives rise to two alternatives:


- Add variables . The active data file is merged with another that contains the same cases but different
variables.

- Add cases . The active data file is merged with another that contains the same variables but different cases.

We will carry out a small exercise with the Y.sav matrix, which contains 6 cases and 4 variables: Age and Sex
are individual sociodemographic characteristics, while Sector and Size describe the company where the person
works:

   ID  Age  Sex    Sector        Size
1   1   23  Woman  Services        20
2   2   35  Male   Primary          1
3   3   48  Male   Industry       100
4   4   55  Woman  Industry       500
5   5   28  Male   Construction    50
6   6   20  Male   Services         5

For the exercise of joining variables we will consider two separate initial matrices with the sociodemographic
information ( YA.sav ) and the company information ( YB.sav ). For the exercise of joining cases we have two
separate matrices with the first three cases ( Y1-3.sav ) and the last three ( Y4-6.sav ).

[Figure: the Y.sav matrix divided by columns into YA.sav ( ID , Age , Sex ) and YB.sav ( ID , Sector , Size ),
and by rows into Y1-3.sav (cases 1 to 3) and Y4-6.sav (cases 4 to 6).]

In the first case, the merge is carried out with the MATCH FILES command ( Data menu / Merge / Add
variables ). We first open the YA.sav matrix and then add the variables of the YB.sav matrix:


The second file can already be open, in which case we choose it in the first box, or we can browse for it in
the folder where it is saved. We click on Continue and the merge dialog box appears:

When merging, it is very convenient to have a key variable that identifies each unit in each of the files to
be merged, so that the information is matched by checking that it belongs to the same case. In our example
this role is played by the ID variable. When a key variable is used, both files must first be sorted by it.
The type of merge we will perform is Both files provide cases , since there are individual cases in the two
files. The other two options ( The non-active data set is a key table and The active data set is a key table )
imply that there is a key or reference table, that is, a file in which the data of each case can be applied to
several cases in the other data file (for example, a household characteristic assigned as an attribute to all
the individuals in the household).

The ID variable is placed in the Key variable box after ticking Assign cases to key variable . In the New
active data set box, the variables to be joined are identified by the file they come from: those of the active
file ( YA.sav ) with (*) and those of the file being added ( YB.sav ) with (+) . The variables of the second
file that are also in the first remain in the Excluded variables box, which is where the ID variable was. Once
the merge is executed, the result contains the same information as the Y matrix.
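
As syntax, the merge of variables can be sketched as follows. The sketch assumes that both files are in the current working directory and that YB.sav is already sorted by ID (as it is in this small example).

```spss
* Sketch only, assumes YA.sav and YB.sav are in the working directory and YB.sav is sorted by ID.
GET FILE='YA.sav'.
SORT CASES BY ID.
* The asterisk stands for the active data set and BY names the key variable.
MATCH FILES /FILE=* /FILE='YB.sav' /BY ID.
EXECUTE.
```

Using /TABLE instead of the second /FILE would correspond to the key table options mentioned above.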

Keep in mind that all unpaired cases, that is, those present in one matrix but not in the other, whichever it
is, will have missing values after the merge on the variables for which they have no information; these cells
will be empty in the new matrix:

In the second case, the merge is carried out with the ADD FILES command ( Data menu / Merge / Add cases ). We
run it from the matrix Y1-3.sav , to which we add Y4-6.sav , chosen in the same way as when adding variables.
This time we see the list of common variables and the list of unpaired variables; the latter, being in one
file and not in the other, will not be included in the merged file.

Once again executing the procedure we reproduce the Y.sav matrix.
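
The merge of cases can be sketched in syntax as follows, assuming both files are in the current working directory.

```spss
* Sketch only, assumes Y1-3.sav and Y4-6.sav are in the current working directory.
GET FILE='Y1-3.sav'.
* The cases of the second file are appended below those of the active data set.
ADD FILES /FILE=* /FILE='Y4-6.sav'.
EXECUTE.
```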

2.1.2. Data transformation

Having seen different operations on the matrix as a whole, we now focus on transformation tasks that involve
specific variables of the matrix, individually or in relation to others. SPSS has various commands for
transforming existing variables, either to modify them or to create new ones. Building typologies and indices
from several variables is a frequent need in the analysis; recoding the values of variables to group them or
to reduce the measurement scale is another immediate task. All these tasks are carried out through the SPSS
Transform menu. The SPSS commands that we will discuss are those in Table III.2.3.

Table III.2.3 Variable transformation procedures


Transform menu      SPSS commands

Recode              RECODE, AUTORECODE
Visual grouping     RECODE
Compute             COMPUTE
Count values        COUNT
Compute If          COMPUTE, IF, DO IF … END IF

In any variable creation exercise, the behavior of missing values must be kept in mind at two moments: before
and after creating the variables. First, if the source variables contain missing values (system or user),
these will appear as system-missing values in the new variables unless they are specifically treated. Second,
when we create a new variable we must anticipate and control the unwanted generation of missing values caused
by transformations that do not in fact apply to all the cases we initially want to consider. If no
transformation applies to a specific case, the created variable will take a system-missing value for that
case.

As these are transformation commands, remember that they are not actually executed until a procedure command
forces the data file to be read (an analysis procedure, for example), a function that is also fulfilled by the
EXECUTE command.

Finally, keep in mind that every newly generated variable needs its dictionary completed (labels, format,
missing values, measurement level, etc.) through the Variables tab or through the corresponding syntax
commands.

2.1.2.1. Variable recoding

Recoding variables allows you to replace the current values of a variable with new ones. Recoding can simply
mean changing one or more values into others, or combining ranges of values into new categories. The values to
be recoded can be numeric or alphanumeric ( string format), and an alphanumeric coding can be converted into a
numeric one.

On the other hand, recoding can be done either by keeping the original variable and generating a new one, with
a different name, that holds the recoded values, or by replacing the variable being recoded with the new
coding under the same variable name. In SPSS terminology the first case is called recoding into different
variables and the second recoding into the same variables .

The SPSS command that performs the recoding is RECODE . The dialog box to carry out the recoding is
found in the Transform / Recode menu where you must choose to recode the same or different variables.

We will focus on the first case. The second is equivalent, although in general it is advisable not to use it
unless you are sure it is appropriate, since it always implies that the original variable disappears. From the
CIS3041.sav data matrix we will carry out two recoding exercises: one with a qualitative variable and one with
a quantitative variable.

The first step in a recoding is to define the recoding criteria and to inspect the values of the variable by
requesting its frequency table. We first consider the variable OCUMAR11 , the occupational category of the
person interviewed according to the 2011 CNO (National Classification of Occupations; see note 25). Its
frequency table is this:

25 The CNO ( http://www.ine.es/jaxi/menu.do?type=pcaxis&path=%2Ft40%2Fcno11%2F&file=inebase&L=0 ) is the Spanish adaptation of
the International Standard Classification of Occupations (ISCO) of the ILO (
http://www.ilo.org/public/spanish/bureau/stat/isco/ ), which has several levels of disaggregation, up to 5, and is coded to 4 digits.
Here it is presented with a single digit. The variable P40 of the CIS3041.sav matrix is the three-digit CNO 2011.


OCUMAR11 Occupation of the person interviewed (CNO11)

                                                                                                   Valid  Cumulative
                                                                        Frequency  Percentage  percentage  percentage
Valid    1 Directors and managers                                              84         3.4         3.4         3.4
         2 Scientific and intellectual technicians and professionals          309        12.5        12.6        16.1
         3 Professional support technicians                                   325        13.1        13.3        29.4
         4 Accounting, administrative and other office employees              100         4.0         4.1        33.4
         5 Workers in catering, personal, protection and sales                559        22.5        22.9        56.3
           services
         6 Qualified workers in the agricultural, livestock, forestry         132         5.3         5.4        61.7
           and fishing sectors
         7 Artisans and skilled workers in manufacturing and                  359        14.5        14.7        76.4
           construction industries, except operators
         8 Operators of installations and machinery, and assemblers           274        11.0        11.2        87.6
         9 Elementary occupations                                             294        11.9        12.0        99.6
         10 Military occupations                                               10          .4          .4       100.0
         Total                                                               2446        98.6       100.0
Missing  94 Without occupation/lives on income                                  1          .0
         98 NS/Occupation incorrectly specified or insufficient                13          .5
         99 NC                                                                 20          .8
         Total                                                                 34         1.4
Total                                                                        2480       100.0

The objective is to have an occupational variable with a smaller number of categories based on the grouping of
the 10 that the original variable has. We will consider a grouping into 4 occupational categories plus a
category of missing values according to the following criteria:
1. Upper class or high occupational category: codes 1 and 2.
2. Middle class or intermediate occupational category: codes 3 and 4.
3. Skilled workers or lower middle category: codes 5, 6 and 7.
4. Unskilled or low category workers: codes 8 and 9.
5. Missing values: codes 10, 94, 98 and 99.

We enter the procedure menu and select the variable OCUMAR11 to move it to the box on the right. The name
appears followed by a ? , indicating that we must give a name to the new variable. In the Result Variables
section we write the name of the new variable, for example Occupation , and a label, Occupational class in
this case. To make the change effective, we click on Change :

Next we must specify the correspondence between old and new values by clicking on the Old and New Values
button:

In fact, OCUMAR11 is already a recoded variable: it groups to a single digit the three-digit CNO 2011 code
held in the variable P40 of CIS3041.sav .


The recoding criteria we have mentioned are entered as follows. For the first 4 new values we choose the
Range option, specifying in each case the lower and upper value. The first range is 1 to 2 as the
specification on the left side (old value); on the right side (new value) we write 1 in the Value box and
click on the Add button below. This defines that Directors and managers together with Technicians and
professionals , values 1 and 2 , are joined in a single category coded with value 1 . We continue in the same
way with the following three ranges, as can be seen in the image. We will treat the value 10 as a missing
value, along with the missing values that the variable already has (no occupation, NS, NC). These correspond
to codes 94, 98 and 99, but since they are all defined as user-missing values in the original variable we can
refer to them together with the option System or User Missing Values (keyword MISSING in SPSS). We click on
Continue and OK to run the recoding. To see the result we need to request the frequency table, which is the
following:
Occupation Occupational class

                                                   Valid  Cumulative
                        Frequency  Percentage  percentage  percentage
Valid    1.00                 393        15.8        15.8        15.8
         2.00                 425        17.1        17.1        33.0
         3.00                1050        42.3        42.3        75.3
         4.00                 568        22.9        22.9        98.2
         5.00                  44         1.8         1.8       100.0
         Total               2480       100.0       100.0
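
The dialog box choices described above are equivalent to a short piece of syntax. The following sketch uses only the names given in the text ( OCUMAR11 , Occupation ) and assumes that CIS3041.sav is the active data set.

```spss
* Sketch only, assumes CIS3041.sav is the active data set.
* Codes 1-2, 3-4, 5-7 and 8-9 become 1 to 4, and code 10 plus all missing values become 5.
RECODE OCUMAR11 (1 THRU 2=1) (3 THRU 4=2) (5 THRU 7=3) (8 THRU 9=4) (10=5) (MISSING=5)
  INTO Occupation.
VARIABLE LABELS Occupation 'Occupational class'.
FREQUENCIES VARIABLES=Occupation.
```

Because FREQUENCIES is a procedure, it forces the pending RECODE to be carried out, so no separate EXECUTE is needed here.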

You can check that the frequencies of the new variable correspond to the sums of the categories of the
original variable. In the table we see the new values, but they do not have labels. As we noted, after
creating a variable its dictionary must be completed: we label the values, specify that it has no decimals,
define 5 as a user-missing value and set its measurement level to ordinal. We request the table again and the
final result, with these adjustments, is:
Occupation Occupational class

                                                                 Valid  Cumulative
                                      Frequency  Percentage  percentage  percentage
Valid    1 Upper class                      393        15.8        16.1        16.1
         2 Middle class                     425        17.1        17.4        33.6
         3 Skilled workers                 1050        42.3        43.1        76.7
         4 Unskilled workers                568        22.9        23.3       100.0
         Total                             2436        98.2       100.0
Missing  5 Missing: NS, NC, FFAA             44         1.8
Total                                      2480       100.0
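
The dictionary adjustments just described can also be written as syntax. This is a minimal sketch that assumes the variable Occupation has already been created by the recoding; the label texts are those shown in the table above.

```spss
* Sketch only, assumes Occupation already exists in the active data set.
VALUE LABELS Occupation
  1 'Upper class'
  2 'Middle class'
  3 'Skilled workers'
  4 'Unskilled workers'
  5 'Missing: NS, NC, FFAA'.
* Declare 5 as user-missing, remove the decimals and set the measurement level.
MISSING VALUES Occupation (5).
FORMATS Occupation (F1.0).
VARIABLE LEVEL Occupation (ORDINAL).
FREQUENCIES VARIABLES=Occupation.
```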

► Exercise 5. Proposed
The INE, in its report Introduction to the CNO-11 (
http://www.ine.es/daco/daco42/clasificacións/Introduccion_CNO11.V02.pdf ), discusses the concept of occupation and
distinguishes between employment and competencies as the two fundamental dimensions that structure it.
Competencies in turn comprise two dimensions: specialization and level of competencies . The latter has 4
levels (theoretically associated with formal educational levels) that correspond to the 1-digit occupational
categories as follows:

Table 1: Correspondence between the Major Groups of ISCO-08 and the level of competencies

Major groups ISCO-08                                                       Level of competencies

1 - Directors and managers                                                 3, 4
2 - Scientific and intellectual professionals                              4
3 - Mid-level technicians and professionals                                3
4 - Administrative support staff                                           2
5 - Service workers and salespeople in shops and markets                   2
6 - Farmers and skilled agricultural, forestry and fishing workers         2
7 - Officials, workers and artisans of mechanical arts and other trades    2
8 - Facility and machine operators and assemblers                          2
9 - Elementary occupations                                                 1
0 - Military occupations                                                   1, 2, 4

Source: INE

According to this table, leaving aside group 0 (military occupations) and assigning directors and managers
only to level 4, group the major occupational groups (variable OCUMAR11 of the CIS3041.sav matrix) into the 4
levels of competencies. Also complete the dictionary of the new variable and request its frequency table to
check the result.

A second recoding example uses a quantitative variable, age (variable P32 ). It is common to work with age
grouped into intervals of 5 or 10 years, or into broad age groups (young people, adults, older people). In
this way the original quantitative variable has its scale reduced and can be treated, with fewer categories,
as an ordinal qualitative variable. We propose creating a new age variable ( Age10 ) grouped into intervals
according to these criteria:
1. 18 to 24 years.
2. 25 to 34 years.
3. 35 to 44 years.
4. 45 to 54 years.
5. 55 to 64 years.
6. 65 years and over.

Since the original variable has no missing values, it is not necessary to consider them in the new one. The
original frequency distribution table is as follows:


P32 Age of the person interviewed
(Freq. = frequency; % = percentage; Valid % = valid percentage; Cum. % = cumulative percentage)

  Age  Freq.      %  Valid %  Cum. %   |  Age  Freq.      %  Valid %  Cum. %
   18     32    1.3      1.3     1.3   |   57     42    1.7      1.7    69.1
   19     32    1.3      1.3     2.6   |   58     29    1.2      1.2    70.3
   20     18     .7       .7     3.3   |   59     27    1.1      1.1    71.4
   21     28    1.1      1.1     4.4   |   60     48    1.9      1.9    73.3
   22     38    1.5      1.5     6.0   |   61     27    1.1      1.1    74.4
   23     25    1.0      1.0     7.0   |   62     33    1.3      1.3    75.7
   24     27    1.1      1.1     8.1   |   63     39    1.6      1.6    77.3
   25     51    2.1      2.1    10.1   |   64     40    1.6      1.6    78.9
   26     42    1.7      1.7    11.8   |   65     48    1.9      1.9    80.8
   27     40    1.6      1.6    13.4   |   66     37    1.5      1.5    82.3
   28     23     .9       .9    14.4   |   67     39    1.6      1.6    83.9
   29     39    1.6      1.6    15.9   |   68     24    1.0      1.0    84.9
   30     46    1.9      1.9    17.8   |   69     31    1.3      1.3    86.1
   31     48    1.9      1.9    19.7   |   70     36    1.5      1.5    87.6
   32     41    1.7      1.7    21.4   |   71     27    1.1      1.1    88.7
   33     47    1.9      1.9    23.3   |   72     28    1.1      1.1    89.8
   34     53    2.1      2.1    25.4   |   73     18     .7       .7    90.5
   35     51    2.1      2.1    27.5   |   74     21     .8       .8    91.4
   36     37    1.5      1.5    29.0   |   75     19     .8       .8    92.1
   37     48    1.9      1.9    30.9   |   76     20     .8       .8    92.9
   38     47    1.9      1.9    32.8   |   77     18     .7       .7    93.7
   39     46    1.9      1.9    34.6   |   78     25    1.0      1.0    94.7
   40     48    1.9      1.9    36.6   |   79     16     .6       .6    95.3
   41     43    1.7      1.7    38.3   |   80     17     .7       .7    96.0
   42     57    2.3      2.3    40.6   |   81     17     .7       .7    96.7
   43     61    2.5      2.5    43.1   |   82     14     .6       .6    97.3
   44     71    2.9      2.9    45.9   |   83     15     .6       .6    97.9
   45     51    2.1      2.1    48.0   |   84     13     .5       .5    98.4
   46     51    2.1      2.1    50.0   |   85     11     .4       .4    98.8
   47     45    1.8      1.8    51.9   |   86      8     .3       .3    99.2
   48     42    1.7      1.7    53.5   |   87      4     .2       .2    99.3
   49     45    1.8      1.8    55.4   |   88      5     .2       .2    99.5
   50     57    2.3      2.3    57.7   |   89      4     .2       .2    99.7
   51     33    1.3      1.3    59.0   |   90      2     .1       .1    99.8
   52     34    1.4      1.4    60.4   |   91      2     .1       .1    99.8
   53     49    2.0      2.0    62.3   |   92      1     .0       .0    99.9
   54     56    2.3      2.3    64.6   |   94      3     .1       .1   100.0
   55     34    1.4      1.4    66.0   |Total   2480  100.0    100.0
   56     36    1.5      1.5    67.4   |

Following the protocol we saw above, we specify the recoding criteria in the dialog box:

The resulting frequency table, after completing the data dictionary, is as follows:

Age10 Age of the person interviewed in groups of 10

                                                       Valid  Cumulative
                            Frequency  Percentage  percentage  percentage
Valid    1 18-24                  200         8.1         8.1         8.1
         2 25-34                  430        17.3        17.3        25.4
         3 35-44                  509        20.5        20.5        45.9
         4 45-54                  463        18.7        18.7        64.6
         5 55-64                  355        14.3        14.3        78.9
         6 65 and more            523        21.1        21.1       100.0
         Total                   2480       100.0       100.0
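
The grouping shown in this table can be reproduced with syntax along these lines. It is a minimal sketch in which the intervals and labels are read off the table above, and CIS3041.sav is assumed to be the active data set.

```spss
* Sketch only, intervals taken from the Age10 frequency table.
RECODE P32 (18 THRU 24=1) (25 THRU 34=2) (35 THRU 44=3) (45 THRU 54=4) (55 THRU 64=5)
  (65 THRU HIGHEST=6) INTO Age10.
VARIABLE LABELS Age10 'Age of the person interviewed in groups of 10'.
VALUE LABELS Age10 1 '18-24' 2 '25-34' 3 '35-44' 4 '45-54' 5 '55-64' 6 '65 and more'.
FORMATS Age10 (F1.0).
VARIABLE LEVEL Age10 (ORDINAL).
FREQUENCIES VARIABLES=Age10.
```

The keyword HIGHEST leaves the last interval open, so any age above 94 that might appear in another sample would still fall into category 6.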

► Exercise 6. Proposed
Recode the variable P15 of ideological self-positioning into three categories that group the values 1 to 3, 4
to 6 and 7 to 10.

On the other hand, suppose that with the data from the CIS survey we ask: what is the average household
income of those interviewed? To answer this question we would need income as a quantitative variable, but the
survey records it qualitatively, by intervals. An alternative is to calculate the mean from the class mark of
each interval, for which we must recode the variable. The distribution of the income variable ( P45 ) is as
follows:
P45 Household income

                                                                           Valid  Cumulative
                                                Frequency  Percentage  percentage  percentage
Valid    1 They have no income of any kind             10          .4          .6          .6
         2 Less than or equal to €300                  28         1.1         1.6         2.2
         3 From €301 to €600                          185         7.5        10.8        13.1
         4 From €601 to €900                          297        12.0        17.4        30.5
         5 From €901 to €1,200                        347        14.0        20.3        50.8
         6 From €1,201 to €1,800                      386        15.6        22.6        73.4
         7 From €1,801 to €2,400                      215         8.7        12.6        86.0
         8 From €2,401 to €3,000                      120         4.8         7.0        93.1
         9 From €3,001 to €4,500                       76         3.1         4.5        97.5
         10 From €4,501 to €6,000                      31         1.3         1.8        99.4
         11 More than €6,000                           11          .4          .6       100.0
         Total                                       1706        68.8       100.0
Missing  99 N.C.                                      774        31.2
Total                                                2480       100.0

If we recode using the SPSS syntax by calling the new variable P45m, we can use the following instructions
that include, in addition to recoding, completing the dictionary of the variable and calculating the frequencies
together with the mean statistic:

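* Frequency table of the original income variable.
FREQUENCIES VARIABLES=P45.
* Recode each income interval to its class mark, sending missing values to 9999.
RECODE P45 (1=0) (2=150) (3=450) (4=750) (5=1050) (6=1500) (7=2100) (8=2700) (9=3750)
  (10=5250) (11=7500) (MISSING=9999) INTO P45m.
* Complete the dictionary of the new variable.
VARIABLE LABELS P45m 'Household income (class mark)'.
VALUE LABELS P45m 9999 'NC'.
MISSING VALUES P45m (9999).
FORMATS P45m (F4.0).
VARIABLE LEVEL P45m (SCALE).
* Frequency table of the new variable together with its mean.
FREQUENCIES VARIABLES=P45m /STATISTICS=MEAN.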

This is the result:


P45m Household income (class mark)

                                                   Valid  Cumulative
                        Frequency  Percentage  percentage  percentage
Valid    0                     10          .4          .6          .6
         150                   28         1.1         1.6         2.2
         450                  185         7.5        10.8        13.1
         750                  297        12.0        17.4        30.5
         1050                 347        14.0        20.3        50.8
         1500                 386        15.6        22.6        73.4
         2100                 215         8.7        12.6        86.0
         2700                 120         4.8         7.0        93.1
         3750                  76         3.1         4.5        97.5
         5250                  31         1.3         1.8        99.4
         7500                  11          .4          .6       100.0
         Total               1706        68.8       100.0
Missing  9999 NC              774        31.2
Total                        2480       100.0

Statistics
P45m Household income (class mark)
N        Valid       1706
         Missing      774
Mean              1500.18

The average income of the households in the sample is €1,500.
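
The mean reported in the output can be checked by hand as a weighted mean of the class marks. In the worked equation below, n_i (the frequency of interval i) and m_i (its class mark) are notation introduced here for illustration; the figures are those of the P45m frequency table.

```latex
\bar{x} = \frac{\sum_{i=1}^{11} n_i\, m_i}{\sum_{i=1}^{11} n_i}
        = \frac{10\cdot 0 + 28\cdot 150 + 185\cdot 450 + 297\cdot 750 + 347\cdot 1050 + 386\cdot 1500
                + 215\cdot 2100 + 120\cdot 2700 + 76\cdot 3750 + 31\cdot 5250 + 11\cdot 7500}{1706}
        = \frac{2\,559\,300}{1706} \approx 1500.18
```

The result depends on the class marks chosen, in particular on the value 7500 assigned to the open interval of incomes above €6,000.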

► Exercise 7. Proposed

Recode the variable P46 related to personal income with the class mark of the intervals and calculate the
average income.

SPSS offers an interesting assisted and automated procedure for recoding quantitative variables, called
Visual Grouping , in the Transform menu. When we access it we must first choose the variable; we choose P32 ,
age:

After clicking on Continue we reach this dialog box, where we have already specified the different options
that we now discuss:


Initially, the histogram appears without partitions or groupings of the values, and with a proposed label for
the variable to be created. It also shows the minimum (18) and maximum (94) values. We must give a name to the
new variable, for example Age4 . The recoding criteria and labels are detailed at the bottom. We can enter the
cut-off points manually, typing the corresponding values in the table, or generate them automatically with the
alternatives offered by the Create cut-points option. If we choose the latter, the new dialog window offers
three options:
- Intervals of equal width, defined by the number of intervals or by their width.
- Equal percentiles, defined by the number of cut-off points or by the percentage of cases.
- Cut-off points based on the mean and standard deviations.

Any of these alternatives could be valid. In this case we divide the values of the variable into quartiles, that
is, into 4 groups with 25% of the cases each, which means specifying 3 cut-off points (remember that there are 3
quartiles, the 3 values that mark the cuts). We click Apply and, back in the previous dialog box, we click Make
Labels, which creates the labels automatically in correspondence with the values of the division into quartiles.
After executing the recoding procedure and requesting the frequency table, we obtain this result:
Age4 Age of the person interviewed in quartiles

                    Frequency   Percentage   Valid percentage   Cumulative percentage
Valid    1 <= 34          630         25.4               25.4                    25.4
         2 35 - 46        611         24.6               24.6                    50.0
         3 47 - 62        637         25.7               25.7                    75.7
         4 63+            602         24.3               24.3                   100.0
         Total           2480        100.0              100.0
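
The logic of this recoding can be sketched outside SPSS. The example below, in Python, assigns a group from a list of cut-off points, treating each cut-off as the inclusive upper limit of its group, as in the labels of the table. The function name `bin_value` and the list `sample_ages` are hypothetical, and the interpolation rule of `statistics.quantiles` need not coincide with the one SPSS applies, so the second part is only an illustration of how three cut-off points arise.

```python
from bisect import bisect_left
from statistics import quantiles


def bin_value(value, cutpoints):
    """Return the 1-based group of value; each cut-off is an inclusive upper limit."""
    return bisect_left(cutpoints, value) + 1


# Cut-off points taken from the Age4 table (<= 34, 35 - 46, 47 - 62, 63+).
age_cutpoints = [34, 46, 62]
groups = [bin_value(age, age_cutpoints) for age in (18, 34, 35, 46, 47, 62, 63, 94)]
print(groups)  # [1, 1, 2, 2, 3, 3, 4, 4]

# With raw data, quartiles yield 3 cut-off points (made-up ages, not the CIS data).
sample_ages = [18, 22, 25, 31, 34, 38, 41, 46, 50, 55, 62, 67, 73, 80, 94]
sample_cutpoints = quantiles(sample_ages, n=4)
print(len(sample_cutpoints))  # 3
```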

► Exercise 8. Proposed
With the IDH2014.sav data matrix, perform a recoding of the variable GDPpercapita ( Gross Domestic
Product per capita ) following various criteria: grouping into intervals of equal width, into percentiles or
based on deviation units.

In addition to the recoding that is operated with the RECODE command, there is another automatic recoding
command called AUTORECODE that converts numeric and string values into consecutive integer values. This
recoding is interesting since some analysis procedures cannot use variables in string format and others
necessarily require the treatment of consecutive integer values. It is also of interest to export data to other
software that works with qualitative variables with consecutive integer values.

The new variable generated by automatic recoding retains the value labels of the original variable; if a value
has no defined value label, the original value is used as the label of the recoded value. String values are
recoded in alphabetical order, with uppercase letters before lowercase letters. Missing values are assigned the
last consecutive numbers. When the procedure is executed, a table shows the correspondence between the old
values, the new values and the labels.

For example, if we wanted to create consecutive codes for the voting intention variable P23, through the
Transform / Automatic Recode procedure we would simply choose the original variable P23, give a name to
the new one, P23bis for example, and execute:

The effect of the change can be seen by comparing the codes of the two variables; the labels, frequencies and
percentages of both frequency tables are identical:

P23 / P23bis Voting intention in hypothetical general elections

P23   P23bis   Label                     Frequency   Valid percentage   Cumulative percentage
  1        1   PP                              290               11.7                    11.7
  2        2   PSOE                            354               14.3                    26.0
  3        3   IU (ICV in Catalonia)            91                3.7                    29.6
  4        4   UPyD                             53                2.1                    31.8
  5        5   CiU                              49                2.0                    33.8
  6        6   Amaiur                            9                0.4                    34.1
  7        7   PNV                              11                0.4                    34.6
  8        8   ERC                              48                1.9                    36.5
  9        9   BNG                               3                0.1                    36.6
 10       10   CC                                3                0.1                    36.7
 11       11   Compromís-Equo                    8                0.3                    37.1
 12       12   FAC                               2                0.1                    37.1
 13       13   Geroa Bai                         1                0.0                    37.2
 14       14   UPN                               2                0.1                    37.3
 15       15   Podemos                         437               17.6                    54.9
 16       16   Ciudadanos                       37                1.5                    56.4
 17       17   Other parties                    44                1.8                    58.1
 77       18   Null vote                         2                0.1                    58.2
 96       19   Blank                           105                4.2                    62.5
 97       20   I would not vote                389               15.7                    78.1
 98       21   Doesn't know yet                483               19.5                    97.6
 99       22   NC                               59                2.4                   100.0
               Total                          2480              100.0
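
The correspondence between old and new codes can be reproduced with a few lines of code. This is a minimal sketch in Python of the core idea only (distinct values sorted and numbered consecutively); the function name `autorecode` is illustrative, and the sketch does not model value labels or the special placement of missing values described above.

```python
def autorecode(values):
    """Map each distinct value to its rank (1, 2, 3, ...) in ascending order."""
    distinct = sorted(set(values))
    return {old: new for new, old in enumerate(distinct, start=1)}


# Codes of P23 as they appear in the frequency table.
p23_codes = list(range(1, 18)) + [77, 96, 97, 98, 99]
mapping = autorecode(p23_codes)

print(mapping[17], mapping[77], mapping[99])  # 17 18 22
```

Codes 1 to 17 keep their value, while 77, 96, 97, 98 and 99 become 18 to 22, as in the P23bis variable.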

2.1.2.2. Transformation expressions

We will now see the transformation procedures that involve performing a calculation or a conditional
transformation to generate new variables. The use of its commands involves working with the so-called
transformation expressions that are specified in the syntax of the instructions of the transformation commands
using different types of operators and functions. There are three types of expressions: numerical, alphanumeric
(string) and logical.

Numerical expressions are used to create new numerical variables and may contain:
- Arithmetic operators : + , - , * , / , ** . They are used with numerical variables, two cannot appear in a row
and they cannot be placed immediately before or after a logical or relational operator. They are executed after
the functions, and operators at the same level are executed from left to right.
- Numerical constants (numeric values).
- Numeric functions : These are functions that always return a number (or a system missing value). They are
specified through one or more arguments in parentheses. They can include arithmetic operators, constants
and variables. For example, MEAN(V1,V2) calculates the mean of two variables for each individual. Types
of numerical functions:
• Arithmetic functions: ABS , RND , TRUNC , SQRT , EXP , LG10 , LN .
• Statistical functions: MEAN , MEDIAN , SD , VARIANCE , MIN , MAX , CFVAR .
• Random variable functions and distribution functions: CDF , PDF , RV , SIG , IDF , NCDF , NPDF
are prefixes that are combined with a distribution suffix ( NORMAL , LOGISTIC , CHISQ ,
POISSON , F , T , BINOM , etc.), for example CDF.NORMAL .
• Date and time functions: DATE , TIME , CTIME , YRMODA , XDATE , DATEDIFF ,
DATESUM .


Alphanumeric expressions ( string ) are used with string variables, constants (text) between quotes and string
functions: CHAR.INDEX , CHAR.LENGTH , CONCAT , LTRIM , VALUELABEL , etc.

Logical expressions are transformation expressions that evaluate to true (value 1) or false (value 0) or as system
missing values, based on conditions established on the data using variables, constants, functions, relational
operators and logical operators. In general, it is advisable, if not necessary, to use parentheses to construct
expressions.
- Relational operators : EQ , LT , GT , NE , LE , GE or = < > <> <= >=
- Logical operators : AND , OR , NOT or & | ~
- Logical functions : RANGE , ANY .

In expressions, functions and arithmetic operators are evaluated first, then relational and logical operators (in
the order NOT , AND , OR ).
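
The behaviour of the logical functions and the order of evaluation can be illustrated with a small sketch in Python, whose `not`, `and` and `or` operators follow the same order of precedence. The helper names `spss_range` and `spss_any` are hypothetical stand-ins that return 1 or 0, they handle a single interval only, and they do not model missing values.

```python
def spss_range(value, low, high):
    """1 if low <= value <= high, else 0 (single-interval version of RANGE)."""
    return int(low <= value <= high)


def spss_any(value, *candidates):
    """1 if value equals any of the candidates, else 0 (in the spirit of ANY)."""
    return int(value in candidates)


print(spss_range(2, 1, 2), spss_range(3, 1, 2))   # 1 0
print(spss_any(98, 98, 99), spss_any(5, 98, 99))  # 1 0

# AND is evaluated before OR, so parentheses change the result.
a, b, c = True, False, False
print(a or b and c)    # True: read as a or (b and c)
print((a or b) and c)  # False
```

This is why parentheses are advisable whenever a condition combines several operators.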

Other functions available in SPSS are:

- Missing-value functions: VALUE , MISSING , SYSMIS , NMISS , NVALID .
- Previous case functions: LAG .
- String/numeric conversion functions: STRING , NUMBER .

When we work through menus to build transformation expressions, we have a wizard to remind us of the
different operators and functions, as we will see below.
[Screenshot: "Functions and special variables" list of the dialog box, with entries such as $Casenum, $Date,
$Date11, $JDate, $Sysmis, $Time, Abs, Any, Applymodel, Arsin, Artan]

2.1.2.3. Variable calculation

Creating new variables through calculations is a constant need in any quantitative data analysis process. By
modifying or combining the existing original variables we can carry out countless transformations: statistical
transformations to prepare variables for an analysis, the creation of indicators and new quantitative variables,
the use of instrumental variables, etc.

The COMPUTE command ( Transform / Compute Variable menu) is intended for this task. The generic format
of this procedure is:

COMPUTE target variable = expression

Within the expression you can use numeric variables, constants, arithmetic operators, numeric functions,
missing value functions, random number functions and the date function. For alphanumeric variables it is only
allowed to create a variable with a constant alphanumeric value or copy a variable into another identical one.
Depending on the expression, the instruction can occupy only one line or several lines.

We will do some exercises to calculate variables. First of all, we can create an index of sociopolitical activism
based on the answers to question P14 :

P.14 There are various forms of participation in social and political
actions that people can carry out. Please tell me, for each one of
them, whether you: (SHOW CARD D)

1. Have participated during the last twelve months
2. Participated in the more distant past
3. Have never participated

                                                      1   2   3   NC
Attend a demonstration                                1   2   3   9   (86)
Participate in a strike                               1   2   3   9   (87)
Participate in a political discussion forum
or blog on the Internet                               1   2   3   9   (88)
Sign a petition / signature collection,
either in person or on the Internet                   1   2   3   9   (89)

The criterion is as follows: each form of participation scores 2 if the person has participated recently, 1 if
they participated in the past and 0 if they have never participated. We build the index by adding these scores
over the 4 items for each individual. Anyone who has recently participated in all four will have a participation
level of 8, and anyone who has never participated in any of them will have a level of 0. We will call the new
variable P14index .

Given the current values of the variables, the proposed score means that before adding we have to subtract each
value from 3 (3-1 gives 2, 3-2 gives 1 and 3-3 gives 0). To obtain the new variable we go to the Transform /
Compute Variable menu. In the dialog box we enter the name of the new variable ( P14index ) and the following
numerical expression: (3-P1401)+(3-P1402)+(3-P1403)+(3-P1404) . We can write this expression directly in the
numerical expression box or build it with the available aids: the list of variables on the left and the numbers,
symbols and operators of the “calculator” buttons:


If we click OK, the variable is created. Our data matrix will contain one more variable, the last one. Note that
some individuals have missing values in at least one of the four initial variables, so the calculation cannot be
carried out for them and they will be system-missing in the new variable [25]. Its dictionary (type, variable
label, measurement level) needs to be completed, which we can partially do through the Type & Label button in
the Compute Variable dialog box. Once the task is completed, the frequency table of the new variable is as
follows:
P14index Sociopolitical participation index

                    Frequency   Percentage   Valid percentage   Cumulative percentage
Valid    0                805         32.5               33.0                    33.0
         1                324         13.1               13.3                    46.2
         2                417         16.8               17.1                    63.3
         3                299         12.1               12.2                    75.5
         4                264         10.6               10.8                    86.3
         5                127          5.1                5.2                    91.5
         6                127          5.1                5.2                    96.7
         7                 36          1.5                1.5                    98.2
         8                 44          1.8                1.8                   100.0
         Total           2443         98.5              100.0
Missing  System            37          1.5
Total                    2480        100.0

If we calculate the mean, we obtain a value of 2.09, much closer to 0 than to 8, indicating a relatively low level
of sociopolitical activism in Spanish society as a whole.
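
The calculation of the index and of its mean can be sketched as follows. This is an illustrative Python version of the arithmetic, not the SPSS procedure itself: the function name `participation_index` is hypothetical, and `None` stands in for a missing answer (the NC code 9 is assumed to be declared missing).

```python
def participation_index(answers):
    """Sum of (3 - answer) over the four items; None if any answer is missing."""
    if any(a is None for a in answers):
        return None
    return sum(3 - a for a in answers)


print(participation_index([1, 1, 1, 1]))     # 8
print(participation_index([3, 3, 3, 3]))     # 0
print(participation_index([1, 2, 3, None]))  # None

# Mean of the index from its frequency table (index value -> number of cases).
index_freq = {0: 805, 1: 324, 2: 417, 3: 299, 4: 264, 5: 127, 6: 127, 7: 36, 8: 44}
n_valid = sum(index_freq.values())
mean_index = sum(value * freq for value, freq in index_freq.items()) / n_valid
print(n_valid, round(mean_index, 2))  # 2443 2.09
```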

► Exercise 9. Proposed
Build an index from question P11, on the frequency with which newspapers, radio and television are used to
follow political news, giving between 4 and 0 points to the frequencies ranging from 1 ( Every day ) to 5
( Never ) and adding the scores for each individual.

Another important operation is the standardization (typification) of a variable, a transformation that consists
of subtracting the mean from each score or value of a quantitative variable and dividing by the standard
deviation:

$$z_i = \frac{x_i - \bar{x}}{s}$$
We perform this operation with the age variable ( P32 ). We first need to know the values of the mean and the
standard deviation, so we run the Analyze / Descriptive Statistics / Descriptives procedure and obtain:

Descriptive statistics

                                           N      Mean   Std. deviation
P32 Age of the person interviewed       2480     48.32           17.489
Valid N (listwise)                      2480

25 If we wish, we can recode them to a specific value, label it and declare it a user-missing value. This changes
nothing; it is simply a way to keep them controlled and identified.

Once the values of the mean and standard deviation are known, we create the new variable using the Transform
/ Compute Variable menu. We choose a name for the new variable, for example Agetip, and we apply the
formula that gives the standardized scores, (P32 - 48.32) / 17.489.

If we ask for the descriptive statistics of the new variable we can see that, except for rounding in the
decimals, the mean is 0 and the standard deviation is 1 [26].

Descriptive statistics

                          N   Minimum   Maximum      Mean   Std. deviation
Agetip                 2480     -1.73      2.61   -0.0002          1.00000
Valid N (listwise)     2480
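
The minimum and maximum of the standardized variable can be checked from the figures already reported (ages between 18 and 94, mean 48.32, standard deviation 17.489). The sketch below, in Python, is only an illustration: `standardize` is a hypothetical helper and `sample` is a made-up list, not the CIS data.

```python
from statistics import mean, stdev

MEAN_AGE = 48.32  # mean of P32 reported in the descriptive statistics
SD_AGE = 17.489   # standard deviation of P32 reported in the same table


def standardize(x, centre=MEAN_AGE, spread=SD_AGE):
    """Standardized score: distance from the mean in standard deviation units."""
    return (x - centre) / spread


print(round(standardize(18), 2), round(standardize(94), 2))  # -1.73 2.61

# Any variable standardized with its own mean and deviation has mean 0 and deviation 1.
sample = [18, 30, 45, 60, 94]
z = [standardize(x, mean(sample), stdev(sample)) for x in sample]
print(abs(mean(z)) < 1e-9, abs(stdev(z) - 1) < 1e-9)  # True True
```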

We will now construct the indicators of the political situation prepared by the CIS in the Barometer [27]. The
questions of each month's barometer related to the political situation that are used in the construction of the
indicators are P4 and P6:
P.4 And referring now to the general political situation in Spain,
how would you rate it: very good, good, average, bad or very bad?


26 This same calculation can be obtained with SPSS through Analyze / Descriptive Statistics / Descriptives by checking the Save standardized
values as variables option. Applied to age, it creates the variable zP32 .
27 The methodology for constructing the indicators of the CIS Barometer can be consulted on the page:
http://www.cis.es/cis/opencms/ES/11_barometros/metodologia.html .

The Current Political Situation Indicator ( SPA ), based on question P4, is defined as:

$$SPA = \frac{100\,p_1 + 75\,p_2 + 50\,p_3 + 25\,p_4 + 0\,p_5}{p_1 + p_2 + p_3 + p_4 + p_5}$$

where $p_1$, $p_2$, $p_3$, $p_4$ and $p_5$ are, respectively, the response percentages of the very good, good,
average, bad and very bad options.

The Political Expectations Indicator ( IEP ), from question P6, is:

$$IEP = \frac{100\,p_1 + 50\,p_2 + 0\,p_3}{p_1 + p_2 + p_3}$$

where $p_1$, $p_2$ and $p_3$ are, respectively, the response percentages of the better, same and worse options.

Finally, the Political Trust Indicator ( ICP ) is the arithmetic mean of the previous two:

$$ICP = \frac{SPA + IEP}{2}$$

These are synthetic indicators, expressed as a single value for the entire sample, which can then be compared
over time with previous Barometers [28].

28 See http://www.cis.es/cis/export/sites/default/-Archivos/Indicadores/documentos_html/IndiPol.html .


The frequencies of both variables for October 2014 are:


P4 Assessment of the general political situation in Spain

                       Frequency   Percentage   Valid percentage   Cumulative percentage
Valid    1 Very good           2          0.1                0.1                     0.1
         2 Good               49          2.0                2.0                     2.1
         3 Average           357         14.4               14.9                    17.0
         4 Bad               769         31.0               32.0                    49.0
         5 Very bad         1227         49.5               51.0                   100.0
         Total              2404         96.9              100.0
Missing  8 NS                 59          2.4
         9 NC                 17          0.7
         Total                76          3.1
Total                       2480        100.0

P6 Prospective assessment of the political situation in Spain (1 year)

                       Frequency   Percentage   Valid percentage   Cumulative percentage
Valid    1 Better            287         11.6               13.3                    13.3
         2 Same             1194         48.1               55.4                    68.7
         3 Worse             676         27.3               31.3                   100.0
         Total              2157         87.0              100.0
Missing  8 NS                299         12.1
         9 NC                 24          1.0
         Total               323         13.0
Total                       2480        100.0

To obtain the 3 indicators we use SPSS as a “calculator”, taking the valid percentages of each table. By syntax,
the commands are:

COMPUTE SPA=((100*0.1)+(75*2.0)+(50*14.9)+(25*32.0)+(0*51.0))/100.
COMPUTE IEP=((100*13.3)+(50*55.4)+(0*31.3))/100.
COMPUTE ICP=(SPA+IEP)/2.

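Three variables are generated, which are in fact constants, with the values of the indices: 17.05 , 41.00 and
29.03 . Dividing by 100 is equivalent to dividing by the sum of the percentages, since the valid percentages of
each question add up to 100.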
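
The same arithmetic can be written as a general weighted mean, dividing by the sum of the percentages as in the definition of the indicators. This is a minimal Python sketch under that reading; the function name `weighted_indicator` is illustrative, and the percentages are the valid percentages of the two frequency tables.

```python
def weighted_indicator(scores, percentages):
    """Mean of the scores, weighted by the response percentages."""
    total = sum(percentages)
    return sum(s * p for s, p in zip(scores, percentages)) / total


p4_valid_pct = [0.1, 2.0, 14.9, 32.0, 51.0]  # very good ... very bad
p6_valid_pct = [13.3, 55.4, 31.3]            # better, same, worse

spa = weighted_indicator([100, 75, 50, 25, 0], p4_valid_pct)
iep = weighted_indicator([100, 50, 0], p6_valid_pct)
icp = (spa + iep) / 2

print(f"{spa:.2f} {iep:.2f} {icp:.3f}")  # 17.05 41.00 29.025
```

The value 29.025 corresponds to the 29.03 reported for the ICP once it is shown with two decimals.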

2.1.2.4. Value count

A specific calculation procedure consists of counting, for each case, the number of times that a given value or
set of values appears in a list of numerical or alphanumeric variables. It corresponds to the COUNT command
( Transform / Count Values within Cases menu). Suppose we have a list of 15 household consumer goods: we
could create a variable that counts how many of these goods each household owns (value 1), so that the
resulting variable ranges between 0 (owns none of the goods) and 15 (owns them all).

With the data from the CIS matrix we can consider question 13, on participation in associations:

P.13 People sometimes belong to certain groups or associations. For
each of those that I am going to read to you below, please tell me
whether you: (SHOW CARD C)

1. Belong and actively participate
2. Belong, but do not actively participate
3. Used to belong, but no longer do
4. Have never belonged

                                                        1   2   3   4   NC
- A political party                                     1   2   3   4   9   (77)
- A trade union or a business association               1   2   3   4   9   (78)
- A professional association                            1   2   3   4   9   (79)
- A parish or other type of organization/association    1   2   3   4   9   (80)
- A sports group                                        1   2   3   4   9   (81)
- A cultural or leisure group                           1   2   3   4   9   (82)
- A social support or human rights organization         1   2   3   4   9   (83)
- A youth or student association                        1   2   3   4   9   (84)
- Another type of voluntary association                 1   2   3   4   9   (85)


With the variables derived from this question, we aim to create a synthetic variable that counts, for each
individual, how many associations they belong to, that is, how many times they have answered 1 (belongs and
participates) or 2 (belongs but does not participate). As there are 9 items, the resulting variable takes values
between 0 and 9. We open the menu, select the variables P1301 to P1309 and name the new variable P13count,
with the label Number of associations to which you belong :

Next we specify the values to count in Define Values, choosing the range 1 to 2 :

We click on Continue and OK , and request the frequency table:


P13count Number of associations to which you belong

                    Frequency   Percentage   Valid percentage   Cumulative percentage
Valid    0               1558         62.8               62.8                    62.8
         1                455         18.3               18.3                    81.2
         2                223          9.0                9.0                    90.2
         3                145          5.8                5.8                    96.0
         4                 53          2.1                2.1                    98.1
         5                 17          0.7                0.7                    98.8
         6                 19          0.8                0.8                    99.6
         7                  7          0.3                0.3                    99.9
         8                  2          0.1                0.1                   100.0
         9                  1          0.0                0.0                   100.0
         Total           2480        100.0              100.0

Most people do not belong to any of the associations presented (62.8%), and very few belong to 4 or more
(4.0%).
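
The counting rule itself is simple and can be sketched as follows. This is an illustrative Python version, with a hypothetical function `count_values` and a made-up respondent; answers outside the range, including the NC code 9, are simply not counted, which is consistent with all 2,480 cases having a valid count in the table.

```python
def count_values(answers, low, high):
    """Number of answers lying in the closed range [low, high]."""
    return sum(1 for a in answers if a is not None and low <= a <= high)


# Hypothetical respondent: answers to P1301 ... P1309 (9 = NC).
respondent = [4, 2, 4, 1, 4, 3, 4, 9, 4]
print(count_values(respondent, 1, 2))  # 2
```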


► Exercise 10. Proposed


From question P10 on the frequency with which politics is discussed, obtain a frequency index by calculating a
variable with the count of the times 1 ( Often ) and 2 ( Sometimes ) are answered in relation to the three social
groups.

2.1.2.5. Conditional transformations

To finish this tour of variable transformations, we turn to a basic procedure in the analysis of quantitative
information: the creation of variables through conditional transformations. These are situations in which
conditions are set on the characteristics of the units and, depending on whether a logical expression is
fulfilled (true, false or missing), a value is assigned through an expression (a specific value or a calculation
formula). Conditional transformations can be used in several commands, but we will focus mainly on the IF
command and the DO IF … END IF structure.

The IF command has the following general form:

IF [(] logical expression [)] target variable = expression

where the parentheses around the logical expression appear in brackets to indicate that they are optional,
although they become mandatory when the condition is complex. The command closely resembles the
COMPUTE command we saw earlier; indeed, in the menus IF is reached through Transform / Compute
Variable, with the If... button.

Through conditional transformations, typological variables are constructed that simultaneously combine
characteristics of various variables (attribute space) to define various types. This is the case of the construction
of the variable of social class, lifestyle, type of consumer, etc.

To illustrate the use of this procedure with SPSS, we will create a (typological) variable of intergenerational
occupational mobility by relating the occupational level of the father with that achieved by the son/daughter.
The occupational variables are, respectively, OCUPAPAD and OCUMAR11 . As a previous step, we request
the contingency table that crosses both variables ( Analyze / Descriptive Statistics / Crosstabs ) to
visualize the information being worked on, illustrate the procedure and later verify the creation of the
new variable. By convention in social mobility analyses, the social origin of the father is placed in rows and
that of the son/daughter in columns. The table is the following:


Rows: OCUPAPAD Father's occupation. Columns: OCUMAR11 Occupation of the child.

            1     2     3     4     5     6     7     8     9   Total
1          13    19    13     3    12     0     2     1     6      69
2           4    75    19     4    18     0    10     5     2     137
3          10    34    58    13    46     1     8    15    10     195
4           1     7     9     9    14     1     3     1     4      49
5          18    34    36    15    98     6    26    11    28     272
6           7    26    35     9    80    84    73    60    50     424
7          12    44    64    15   121     9   121    48    70     504
8           7    33    48    11    79     7    50    91    29     355
9           2    12     8     5    25     7    24    20    53     156
Total      74   284   290    84   493   115   317   252   252    2161

1 Directors and managers; 2 Scientific and intellectual technicians and professionals; 3
Technicians; support professionals; 4 Accounting, administrative and other office employees; 5
Workers in catering, personal, protection and sales services; 6 Qualified workers in the
agricultural, livestock, forestry and fishing sectors; 7 Craftsmen and skilled workers in
manufacturing and construction industries, except installation operators; 8 Facility and
machinery operators, and assemblers; 9 Elementary occupations

The diagonal (in blue) defines immobility or occupational social reproduction, where the occupational origin of
the father is the same as that of the son/daughter. The values in the lower triangle (in green) correspond to
upward mobility: children have a higher occupational level than their fathers. Finally, the upper triangle (in
red) corresponds to downward mobility: children have a lower occupational level.
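
These three regions can be tallied directly from the cells of the contingency table. The following Python sketch is only a check of the arithmetic; the list `table` reproduces the 9 by 9 cells (rows: father, columns: child), and the variable names are illustrative.

```python
# Rows: father's occupation (1-9); columns: child's occupation (1-9).
table = [
    [13, 19, 13, 3, 12, 0, 2, 1, 6],
    [4, 75, 19, 4, 18, 0, 10, 5, 2],
    [10, 34, 58, 13, 46, 1, 8, 15, 10],
    [1, 7, 9, 9, 14, 1, 3, 1, 4],
    [18, 34, 36, 15, 98, 6, 26, 11, 28],
    [7, 26, 35, 9, 80, 84, 73, 60, 50],
    [12, 44, 64, 15, 121, 9, 121, 48, 70],
    [7, 33, 48, 11, 79, 7, 50, 91, 29],
    [2, 12, 8, 5, 25, 7, 24, 20, 53],
]

size = len(table)
immobile = sum(table[i][i] for i in range(size))                           # diagonal
downward = sum(table[i][j] for i in range(size) for j in range(size) if j > i)  # upper triangle
upward = sum(table[i][j] for i in range(size) for j in range(size) if j < i)    # lower triangle

print(immobile, downward, upward)  # 602 631 928
```

The three counts add up to the 2,161 valid cases of the table.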

To create this typology of occupational mobility we use conditional transformations. In this case we establish 3
conditions [29]:
- If OCUPAPAD < OCUMAR11 , then there is downward mobility (value 1)
- If OCUPAPAD = OCUMAR11 , then there is immobility (value 2)
- If OCUPAPAD > OCUMAR11 , then there is upward mobility (value 3)

All cases that do not meet any of these conditions, that is, cases with a missing value in either of the two
variables, will become system-missing values. To obtain the previous table of 9 by 9 categories, value 10 (the
Armed Forces) has also been declared missing.

To translate this into instructions for SPSS we can go to the Transform / Compute Variable menu. In the dialog
box we call the new target variable Mobility and enter 1 as the numerical expression.

29 As the values range from 1, the highest occupational level, to 9, the lowest, the direction of the comparison is reversed: a higher value
at destination than at origin indicates downward mobility, and a lower value indicates upward mobility.

Next we establish the condition that must be satisfied to assign the value 1 (downward mobility) to an individual
in the new variable, OCUPAPAD < OCUMAR11 :

To execute it, first press Continue and then OK . Alternatively, we can perform this task by syntax: instead of
clicking OK we click Paste , which pastes the following instruction into a syntax window:

IF (OCUPAPAD < OCUMAR11) Mobility=1.
EXECUTE.

As you can see, and as will become clearer with time and experience with SPSS, it is more efficient to write this
instruction directly than to go through the whole menu. Even more so if it has to be repeated several times to
cover different situations, which may be many more than the three we are seeing here. Once the first instruction
is pasted, we copy it two more times and modify the copies with the other two conditions: immobility,
OCUPAPAD = OCUMAR11 , and upward mobility, OCUPAPAD > OCUMAR11 :

IF (OCUPAPAD < OCUMAR11) Mobility=1.
IF (OCUPAPAD = OCUMAR11) Mobility=2.
IF (OCUPAPAD > OCUMAR11) Mobility=3.
EXECUTE.

We select the four lines and execute them by clicking on the run icon ► or with the <CTRL>+<R> keys. The
new variable is created, and we then have to complete its dictionary. Next we request the frequency table and
obtain this result:
Mobility Intergenerational occupational mobility

                       Frequency   Percentage   Valid percentage   Cumulative percentage
Valid    1 Downward          631         25.4               29.2                    29.2
         2 Immobility        602         24.3               27.9                    57.1
         3 Upward            928         37.4               42.9                   100.0
         Total              2161         87.1              100.0
Missing  System              319         12.9
Total                       2480        100.0

As can be seen, absolute upward occupational mobility (43%) stands out as a result of the process of changes
that Spanish society has experienced from the period of industrialization to the current post-industrial phase.
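
The logic of the three conditional instructions can be summarized in a short function. This is a sketch in Python, not SPSS syntax; the function name `mobility` is illustrative, and `None` stands in for a missing occupational code, for which no value is assigned.

```python
def mobility(father, child):
    """1 = downward, 2 = immobility, 3 = upward; None if either code is missing."""
    if father is None or child is None:
        return None
    if father < child:   # child has a higher code, that is, a lower occupational level
        return 1
    if father == child:
        return 2
    return 3             # child has a lower code, that is, a higher occupational level


print(mobility(2, 5), mobility(7, 7), mobility(8, 3), mobility(None, 4))
# 1 2 3 None
```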

► Exercise 11. Proposed


Carry out an analysis of the relationship between educational level (variable STUDIES ) and occupation
(variable OCUMAR11 ) of the people interviewed. Propose the creation of an empirical typology that relates
them based on the frequencies observed in the contingency table.

► Exercise 12. Proposed


Create a typological variable that relates money and happiness, considering the variables Personal Happiness
Scale ( P30 ) and Personal Income ( P46 ). To do this, first recode each of the variables into three
categories: happy, neither happy nor unhappy, and unhappy for happiness; rich, neither rich nor poor, and poor
for income. Answer the question: to what extent does money make you happy?

We can then ask ourselves whether these results change when we also consider mothers, made invisible in the
previous exercise and, in general, in analyses of social mobility (Fachelli and López-Roldán, 2013, 2015).
To do this, we must resolve the issue of how to determine the “occupational origin of fathers and mothers.”
One solution is to apply the dominance criterion: the highest occupational level is taken, that of the father or the
mother. We will consequently create a dominant family occupation variable with the name OCUPAFAM .

This consideration calls for a prior analysis of occupational homogamy, which we can obtain by crossing the
occupations of the father and the mother. With the missing values of both variables, OCUPAPAD and
OCUPAMAD , declared as they are, we would lose many cases, since in the past many mothers were classified
as “inactive”. However, the cases in which the occupation of the father or the mother is not specified can be
recovered if there is information on the occupation of one of the two. To do this we remove the declaration of
missing values and cross all the values of both variables:

Rows: OCUPAPAD Occupation of the father when the person interviewed was 16 (CNO11)
Columns: OCUPAMAD Occupation of the mother when the person interviewed was 16 (CNO11)

OCUPAPAD     1    2    3    4    5    6    7    8    9   10   95   96   97   98   99  Total
1            4    7    5    2    7    0    1    0    0    1    1   43    0    0    1     72
2            0   41   13    4    6    1    2    0    2    0    0   67    2    0    0    138
3            1    9   14    2   13    0    4    6    8    0    0  138    3    0    1    199
4            0    3    3    1    8    0    0    0    1    0    0   33    0    1    0     50
5            0    7    8    3   57    3    3    3   23    0    1  167    4    0    0    279
6            0    5    1    1   13   78    5    7    8    1    0  301    8    1    0    429
7            1    6    9    1   45    3   19   14   50    0    3  352    4    2    3    512
8            0    2    6    1   23    3    7   21   25    0    1  261    5    0    3    358
9            1    2    1    1   13    1    2    2   30    0    0   99    4    0    0    156
10           0    1    0    1    0    0    1    1    1    0    0   14    0    0    0     19
94           0    0    0    0    1    0    0    0    0    0    0    0    0    0    0      1
95           0    1    0    0    0    0    0    0    2    0    1    2    0    0    0      6
96           0    0    3    0    6    2    0    1    6    0    0   27    0    0    0     45
97           0    2    1    3   20    7    7    3   32    0    0   59   14    0    0    148
98           1    1    1    1    4    1    0    1    4    0    0   27    1    1    1     44
99           0    0    0    1    0    0    1    0    2    0    0    7    0    1   12     24
Total        8   87   65   22  216   99   52   59  194    2    7 1597   45    6   21   2480
1 Directors and managers; 2 Scientific and intellectual technicians and professionals; 3 Technicians and support professionals; 4
Accounting, administrative and other office employees; 5 Workers in catering, personal, protection and sales services; 6 Qualified
workers in the agricultural, livestock, forestry and fishing sectors; 7 Craftsmen and skilled workers in manufacturing and
construction industries, except facility and machinery operators; 8 Facility and machinery operators, and assemblers; 9 Elementary
occupations; 94 Not employed, lived on private income; 95 Unemployed; 96 Inactive (neither employed nor unemployed, or
unpaid domestic work, etc.); 97 Not applicable (was not present, had died, etc.); 98 DK/Does not remember/Occupation incorrectly
specified; 99 No answer

Four regions can be identified in the table. Firstly, when there is information on the father's and mother's
occupation, in a similar way to the previous example of mobility, we will define the family occupation as
follows:

- If OCUPAPAD < OCUPAMAD then OCUPAFAM is that of the father.
- If OCUPAPAD = OCUPAMAD then OCUPAFAM is that of the father or the mother.
- If OCUPAPAD > OCUPAMAD then OCUPAFAM is that of the mother.

The rest of the regions in the table define these situations:

- If OCUPAPAD is known and OCUPAMAD is unknown, then OCUPAFAM is OCUPAPAD .
- If OCUPAPAD is unknown and OCUPAMAD is known, then OCUPAFAM is OCUPAMAD .
- If OCUPAPAD and OCUPAMAD are both unknown, then there is no data.
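
As a cross-check of the logic, the rules above can be sketched in R. This is only an illustrative sketch on invented toy data: it assumes, as the SPSS syntax in this section does, that codes 1 to 9 are known occupations (1 being the highest level) and that codes of 10 or more count as unknown. The variable names OCUPAPAD and OCUPAMAD are taken here to be the father's and mother's occupation, and cases without data are coded directly as NA.

```r
# Toy data (invented): father's and mother's occupation codes
d <- data.frame(
  OCUPAPAD = c(2, 7, 5, 96, 3, 98),
  OCUPAMAD = c(4, 3, 5, 6, 96, 97)
)

# Assumption: codes 1-9 are known occupations, codes >= 10 are unknown
known_pad <- d$OCUPAPAD <= 9
known_mad <- d$OCUPAMAD <= 9

d$OCUPAFAM <- ifelse(known_pad & known_mad,
                     pmin(d$OCUPAPAD, d$OCUPAMAD),   # both known: lowest code dominates
              ifelse(known_pad, d$OCUPAPAD,          # only the father is known
              ifelse(known_mad, d$OCUPAMAD,          # only the mother is known
                     NA)))                           # neither is known: no data
d
```

Taking the minimum of the two codes covers the three comparisons at once, because the lowest code corresponds to the highest occupational level.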

How do we translate this into SPSS? We have identified 4 regions, or situations, to deal with. Each can be
handled separately, applying in each case the transformation needed to create the family occupation
variable. The DO IF…END IF structure conditionally executes one or more transformations on subsets of
cases defined by logical expressions. Its scheme is the following:

DO IF [(] logical expression [)]
transformation commands
[ ELSE IF [(] logical expression [)]]
transformation commands
[ ELSE IF [(] logical expression [)]]
...
[ ELSE ]
transformation commands
END IF

DO IF establishes a first condition under which a transformation is carried out. Optionally, successive
conditions can be added with ELSE IF , each with its corresponding transformations. ELSE can be used within
the structure to execute transformations when none of the previous logical expressions is satisfied, which
lets us control the cases not otherwise covered.

Let's apply it to our case. It can only be done by syntax and would be the following:


DO IF (OCUPAPAD <= 9 AND OCUPAMAD <= 9).
IF (OCUPAPAD < OCUPAMAD) OCUPAFAM=OCUPAPAD.
IF (OCUPAPAD = OCUPAMAD) OCUPAFAM=OCUPAPAD.
IF (OCUPAPAD > OCUPAMAD) OCUPAFAM=OCUPAMAD.
ELSE IF (OCUPAPAD <= 9 AND OCUPAMAD >= 10).
COMPUTE OCUPAFAM=OCUPAPAD.
ELSE IF (OCUPAMAD <= 9 AND OCUPAPAD >= 10).
COMPUTE OCUPAFAM=OCUPAMAD.
ELSE.
COMPUTE OCUPAFAM=0.
END IF.

The DO IF line establishes the first condition (the occupations of both father and mother are known), and the
next 3 IF commands decide which occupation is assigned to the new OCUPAFAM variable. If the mother's
occupation is not known, the condition of the first ELSE IF , the occupation of origin is that of the father.
Similarly, in the following ELSE IF , if the father's occupation is not known, the occupation of origin is that
of the mother. Finally, all remaining situations fall under ELSE , that is, neither the father's nor the mother's
occupation is known, and the new variable takes the value 0. We will then declare this value as a user-missing
value and complete the dictionary of the new variable with labels, type and measurement level. The frequency
table will be:


It remains to analyze absolute intergenerational mobility and construct the mobility variable ( Mobility2 ) as
before, now between the occupational origin of the fathers and mothers and the destination of the sons and
daughters. The mobility table is:

Rows: OCUPAFAM Dominant occupation of the fathers and mothers
Columns: OCUMAR11 Occupation of the son or daughter

OCUPAFAM     1    2    3    4    5    6    7    8    9  Total
1           14   21   13    3   13    0    2    1    6     73
2            6   88   30    5   23    1   12    6    4    175
3           13   40   61   15   50    3    8   17   12    219
4            1    6   10   10   16    2    3    3    5     56
5           18   41   40   19  129    8   48   24   48    375
6            7   21   37    7   80   84   75   61   49    421
7           11   38   60   14  106   11  111   45   62    458
8            6   30   43   11   73    7   40   83   28    321
9            2   10   12    5   29    6   30   21   64    179
Total       78  295  306   89  519  122  329  261  278   2277

and the instructions are:

IF (OCUPAFAM < OCUMAR11) Mobility2=1.
IF (OCUPAFAM = OCUMAR11) Mobility2=2.
IF (OCUPAFAM > OCUMAR11) Mobility2=3.

which we complete with the data dictionary and taking out the frequency table:
Mobility2  Intergenerational occupational mobility

                         Frequency  Percentage  Valid percentage  Cumulative percentage
Valid    1 Downward            722        29,1              31,7                   31,7
         2 Immobility          644        26,0              28,3                   60,0
         3 Upward              911        36,7              40,0                  100,0
         Total                2277        91,8             100,0
Missing  System                203         8,2
Total                         2480       100,0

As a result, we observe that upward mobility decreases somewhat, from 43% to 40%, because the dominance
criterion tends to raise the position of origin by taking the higher of the father's and the mother's; and since
the positions of origin are higher, the chances of moving up socially are lower.
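
The three counts can also be checked directly from the mobility table, since OCUPAFAM < OCUMAR11 corresponds to the cells above the diagonal, equality to the diagonal itself, and OCUPAFAM > OCUMAR11 to the cells below it. A minimal R sketch, with the cell counts typed in by hand from the mobility table (the object names tab and mobility are ours, not part of the data file):

```r
# Mobility table from the text: rows = OCUPAFAM (origin), columns = OCUMAR11 (destination)
tab <- matrix(c(
  14, 21, 13,  3,  13,  0,   2,  1,  6,
   6, 88, 30,  5,  23,  1,  12,  6,  4,
  13, 40, 61, 15,  50,  3,   8, 17, 12,
   1,  6, 10, 10,  16,  2,   3,  3,  5,
  18, 41, 40, 19, 129,  8,  48, 24, 48,
   7, 21, 37,  7,  80, 84,  75, 61, 49,
  11, 38, 60, 14, 106, 11, 111, 45, 62,
   6, 30, 43, 11,  73,  7,  40, 83, 28,
   2, 10, 12,  5,  29,  6,  30, 21, 64
), nrow = 9, byrow = TRUE)

mobility <- c(
  Downward   = sum(tab[upper.tri(tab)]),  # origin code < destination code
  Immobility = sum(diag(tab)),            # origin code = destination code
  Upward     = sum(tab[lower.tri(tab)])   # origin code > destination code
)
mobility
round(100 * mobility / sum(tab), 1)       # valid percentages
```

The totals obtained this way are 722, 644 and 911 cases, that is, 31.7%, 28.3% and 40.0% of the 2277 valid cases, in agreement with the frequency table.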


As we have seen throughout this section, transforming variables means modifying existing variables or
creating new ones that expand our data file, as we highlighted at the beginning of this chapter when discussing
data processing. This raises the question of how to save the data. A good practice is to keep a copy of the
original data source and save the expanded matrix under a different name. In our case, all the
variables that we have generated are in the CIS3041+.sav matrix.

It is also worth noting that the data have generally been generated from the menus, working interactively,
which can be a limitation when it comes to replicating the work carried out. To repeat the exercises seen here
we have the manual itself, but in research practice, reviewing or redoing the generation of data and its
analysis requires keeping a record of it. One way to do this is to systematically save the output files, which
contain the syntax and the results of its execution. But re-running the procedures through the menus from those
commands and results can be complicated, long and laborious. The alternative is to save syntax files with all
the tasks performed; run again, they reproduce in a matter of seconds all the hours of work that went into
designing them. This is how we have worked, and we have saved all the transformations seen in this chapter in
the syntax file Transformar.sps , which can be consulted on the website of this chapter.

2.2. Data transformation with R

We will comment on the different procedures presented in the Deducer: Data menus, which include some
procedures intended for processing files, either internally or to combine them with others, and transformation
procedures for the creation of variables.

2.2.1. File processing with R

We will distinguish two types of file management and transformation procedures, those intended for the
processing of data within a file and the processing of data between files that are related.

2.2.1.1. Data processing inside a file

Sort cases

The sort cases command ( Data / Sort menu) reorders the cases in the active file according to the values of
one or more variables, numerical or alphanumeric (for strings the order is alphabetical). Cases can be
sorted in ascending order, the default, or in descending order.

With the data matrix CIS3041.rda we see that the cases are initially ordered according to the questionnaire
number ( CUES variable). As an exercise we can organize the file according to the location of the interview. A
first criterion would be, for example, ordering the file according to the Autonomous Community (variable
CCAA ) in ascending order:


Note the changes in the data file. If we want to be more precise, we can put, in addition to the CCAA variable,
the variable of the province ( PROV ) and the municipality ( MUN ), all in ascending order. We will introduce
them in this order:
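
Outside the menus, the same sort can be written with the base R function order() . The following is a minimal sketch on an invented toy data frame: only the column names CUES , CCAA , PROV and MUN come from the text, and the values are made up for illustration.

```r
# Toy stand-in for CIS3041 (values invented)
d <- data.frame(
  CUES = 1:5,
  CCAA = c(9, 1, 9, 1, 13),
  PROV = c(8, 41, 17, 4, 28),
  MUN  = c(19, 91, 79, 13, 79)
)

# Ascending order by CCAA, then PROV, then MUN
d_sorted <- d[order(d$CCAA, d$PROV, d$MUN), ]

# Descending order on CCAA only
d_desc <- d[order(d$CCAA, decreasing = TRUE), ]

d_sorted
```

The variables are listed inside order() in the same priority in which they would be entered in the dialog box.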

Sorting a small file is instantaneous, but with files with millions of records it can take minutes. In this sense, it
is very useful to have the database sorted according to a criterion if it is used regularly.

We will also see that the organization of a file is a necessary prior step in various data processing procedures.

Select cases

Often, when working with a database, we are interested in obtaining information about the individuals who
satisfy certain conditions. We may want, for example, to study various variables only for individuals with
certain characteristics: women, those who plan to vote, those with a low level of income, etc. With
Deducer we can select the subset of individuals that satisfy a given condition, so that a new object, a new
data frame, is created with the selected data. As an exercise we can select the cases of the people interviewed
who are women. In the Data / Subset menu dialog box we select the sex variable ( P31 ) and move it to the right
by double-clicking. To select the women we type, with the keyboard or with the buttons in the dialog box,
=="Woman" (see note 30):

30 In R, the equality operator is a double equals sign ( == ).


Once the condition is built, we can change the name ( Subset Name ) assigned by default to the object that will
hold the selected data, for example to CIS3041mujer . We click OK to run it, and we will then have a new matrix
in the workspace with the cases that correspond to women, which we can inspect in the data viewer. If we want
to obtain, for example, a frequency table of a variable, in the Frequencies dialog box we can choose at any
time which matrix to work with: the entire sample ( CIS3041 ) or the subsample of women that we have just
created ( CIS3041mujer ).
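
The selection can also be written with the base R function subset() . A minimal sketch on an invented toy data frame: the variable names P31 and P32 and the label "Woman" come from the text, while the label "Man" and all the values are assumptions made for illustration.

```r
# Toy stand-in for CIS3041 (values invented)
d <- data.frame(
  CUES = 1:6,
  P31  = factor(c("Man", "Woman", "Woman", "Man", "Woman", "Man")),
  P32  = c(34, 51, 27, 68, 45, 19)
)

# Keep only the cases that satisfy the condition; note the double equals sign
CIS3041mujer <- subset(d, P31 == "Woman")
# Equivalent with bracket indexing: d[d$P31 == "Woman", ]

nrow(CIS3041mujer)
```

The original data frame is left untouched, so both the full sample and the subsample remain available in the workspace.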

Transpose

Transposing a matrix involves converting the cases (rows) into variables, and the variables (columns) into
cases. Doing so creates a new data file and automatically names the variables and the names of the rows.


To illustrate this procedure and those that follow, we will work with small data matrices that will allow us to
better see each of the tasks. We will consider the data matrix X.rda that contains the employment situation of 6
salaried individuals in relation to 2 variables of their employment conditions: Contract and Salary . It can be
opened directly from the Deducer data editor:
   ID  Contract   Salary
1   1  Permanent    1200
2   2  Temporary    1000
3   3  Permanent    3000
4   4  Temporary    1000
5   5  Permanent    1200
6   6  Permanent    1500

To transpose it we will go to the Data / Transpose menu, it will ask us to choose the data matrix:

Once selected, we will be asked to give a name to the new data matrix that will be created, for example
Xtransposed :

To see the result we return to the data editor and look for the new matrix:
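
In base R the transposition itself is done with t() . The sketch below rebuilds the small matrix X by hand and is only an approximation of what the menu does, under the assumption that the result is stored as a data frame.

```r
# The X matrix from the text, typed in by hand
X <- data.frame(
  ID       = 1:6,
  Contract = c("Permanent", "Temporary", "Permanent", "Temporary", "Permanent", "Permanent"),
  Salary   = c(1200, 1000, 3000, 1000, 1200, 1500)
)

# t() returns a matrix; because text and numbers are mixed, every cell becomes text
Xtransposed <- as.data.frame(t(X))

dim(Xtransposed)        # 3 rows (the former variables) and 6 columns (the former cases)
rownames(Xtransposed)   # the former variable names
```

Because a matrix can hold only one type of data, transposing a file that mixes numeric and text variables converts the numbers to text.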

2.2.1.2. Data processing between files that are related

We will now look at another data matrix manipulation task, one that involves relating two or more files: merging.

Merge files

Merging or joining files gives rise to two alternatives:


- Add variables . The active data file is merged with another that contains the same cases but different
variables.

- Add cases . The active data file is merged with another that contains the same variables but different cases.

We will carry out a small exercise with the Y.rda matrix, which contains 6 cases and 4 variables: Age and Sex
are individual sociodemographic characteristics, and Sector and Size refer to characteristics of the company:

   ID  Age  Sex     Sector        Size
1   1   23  Female  Services        20
2   2   35  Male    Primary          1
3   3   48  Male    Industry       100
4   4   55  Female  Industry       500
5   5   28  Male    Construction    50
6   6   20  Male    Services         5

To illustrate adding variables we will consider two separate initial matrices, one with the sociodemographic
information ( YA.rda ) and one with the company information ( YB.rda ). To illustrate adding cases we have two
separate matrices, one with the first three cases ( Y1.rda ) and one with the last three ( Y2.rda ). We open
them from Deducer.

[Figure: the matrix Y divided by columns into YA ( Age , Sex ) and YB ( Sector , Size ), and by rows into Y1 (cases 1 to 3) and Y2 (cases 4 to 6).]

Merging is done through the Data/Merge menu. The dialog box opens where the workspace matrices that we
have previously loaded appear:


First we will merge YA with YB , a task that adds the variables of YB to those already in YA . We call the new
matrix YAYB . We click on continue and the merge dialog box appears:

We see three boxes: the variables specific to each file and those that are common to both. Among the latter is
the ID variable, which we use as the key variable for matching the cases. When merging, it is always advisable
to have a key variable that identifies each unit in each of the files, so that the information is matched by
checking that it belongs to the same case. In our example this role is played by the variable ID , which is
placed in the Match Cases By box. For a common variable we can choose whether to keep the version from the
first file, [1] , from the second, [2] , or from both, [b] , in which case two versions of the variable are
created. Once executed with Run , the result contains the same information as the matrix Y .
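
For reference, adding variables corresponds to the base R function merge() with a key variable. A minimal sketch, with the two small matrices typed in by hand from the Y table (the category labels are written as we read them from that table):

```r
YA <- data.frame(ID  = 1:6,
                 Age = c(23, 35, 48, 55, 28, 20),
                 Sex = c("Female", "Male", "Male", "Female", "Male", "Male"))
YB <- data.frame(ID     = 1:6,
                 Sector = c("Services", "Primary", "Industry", "Industry",
                            "Construction", "Services"),
                 Size   = c(20, 1, 100, 500, 50, 5))

# Match the cases on the key variable ID and bring the variables together
YAYB <- merge(YA, YB, by = "ID")
names(YAYB)
```

The by argument plays the role of the Match Cases By box: rows are paired only when their ID values coincide.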

It is worth keeping in mind that all unmatched cases, that is, those present in one matrix but not in the other,
whichever it may be, will have missing values in the merged matrix for the variables on which they have no
information; these cells will be empty ( NA ) in the new matrix, unless we choose to discard such cases
( Drop Unmatched Cases ). If two variables represent the same element but are named differently in the two
data matrices, they can be combined by selecting both variables and clicking the down arrow to place
them together in the Common Variables box.
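
The effect of unmatched cases is easy to see in a small base R sketch. The two data frames A and B below are invented for the purpose; in merge() the argument all = TRUE keeps the unmatched cases, and the default drops them.

```r
# Two toy files that share only IDs 2 and 3 (values invented)
A <- data.frame(ID = c(1, 2, 3), Age  = c(23, 35, 48))
B <- data.frame(ID = c(2, 3, 4), Size = c(1, 100, 500))

full    <- merge(A, B, by = "ID", all = TRUE)  # keeps IDs 1 to 4, gaps filled with NA
matched <- merge(A, B, by = "ID")              # keeps only the matched IDs 2 and 3

full
matched
```

When the key has a different name in each file, merge() also accepts the arguments by.x and by.y to name it separately for each one.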


We will now carry out the second case of fusion, that of adding cases. We will choose the matrix Y1 that
contains the first 3 cases and we will add Y2 to it with the last 3. We call the new matrix Y1Y2 :

In this case all the variables are common. Variables that are unpaired, because they are in one file and not the
other, will not be included in the merged file. We must pass all the variables from the Common Variables box
to Match Cases By by clicking on the down arrow:

Again executing the procedure we reproduce the original matrix Y.
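
In base R, adding cases is usually written with rbind() , which stacks two data frames that have the same variables. This is a different function from the one behind the menu, shown here only as a sketch with part of the Y matrix typed in by hand:

```r
Y1 <- data.frame(ID = 1:3, Age = c(23, 35, 48), Sex = c("Female", "Male", "Male"))
Y2 <- data.frame(ID = 4:6, Age = c(55, 28, 20), Sex = c("Female", "Male", "Male"))

# Stack the cases; the column names of both data frames must match
Y1Y2 <- rbind(Y1, Y2)

# Matching on all the common variables and keeping every case gives the same rows
Y1Y2_merge <- merge(Y1, Y2, all = TRUE)

nrow(Y1Y2)
```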

2.2.2. Variable transformation

Having seen operations that treat a matrix as a whole, we now focus on transformation tasks that involve
specific variables of the matrix, individually or in relation to others. Various commands transform
existing variables, either to modify them or to create new ones. Constructing typologies and indices from
several variables is one of the frequent needs of analysis; recoding the values of variables, to group values
or reduce the measurement scale, is another immediate task that analysis entails.

In any variable creation exercise, the behavior of missing values must be kept in mind at two moments: before
and after creating the variables. First, if the source variables contain missing values, these will appear as
missing values in the new variables unless they are specifically treated. Second, when we create a new variable
we must anticipate and control the unwanted generation of missing values that arises when the transformations
do not in fact apply to all the cases we initially want to consider. If no transformation is applied to a given
case, its value in the created variable will be a missing value.
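
Both situations can be illustrated with a few lines of base R. The vector income and the cut-off values below are invented for the example.

```r
income <- c(900, 2500, NA, 1800)

# 1. A missing value in the source variable is carried over to the new variable
group <- ifelse(income < 1000, "Low", "Not low")
group

# 2. A case that no condition covers is also left as missing
band <- rep(NA_character_, length(income))
band[which(income < 1000)] <- "Low"
band[which(income > 2000)] <- "High"
band   # the case with 1800 matches neither condition and stays NA
```

Counting the missing values before and after a transformation, for example with sum(is.na(band)) , is a simple way to detect the second situation.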

Finally, we must keep in mind that any generation of variables often requires completing its dictionary (type of
variable and ordering of categories).

2.2.2.1. Variable recoding

Recoding variables allows you to change the current values of the variables with new ones. Recoding can
strictly mean changing one or more values to others, or combining or grouping ranges of values into new
categories.

On the other hand, recoding can be done by choosing to keep the original variable and generating a new one
with a different name that will have the recoded values, or by choosing to replace the variable that is being
recoded with the new variable with the new coding criteria and with the same variable name.

We will consider the CIS3041 data matrix and carry out two recoding exercises: from a qualitative variable
and from a quantitative one.

The first step in a recoding is to define the recoding criteria and inspect the values of the variable by
requesting its frequency table. We first consider the variable OCUMAR11 , the occupational category of the
person interviewed according to the 2011 CNO (National Classification of Occupations) (see note 31). Its
frequency table appears below. The short variable labels correspond to the following descriptions:

Director : Directors and managers; Technical : Scientific and intellectual technicians and professionals; Support :
Technicians and support professionals; Administrative : Accounting, administrative and other office employees; Services :
Workers in catering, personal, protection and sales services; Agricultural Qualified : Qualified workers in the
agricultural, livestock, forestry and fishing sectors; Skilled industry : Craftsmen and skilled workers in the manufacturing and
construction industries, except facility and machinery operators; Operators : Facility and machinery operators, and assemblers;
Elementary : Elementary occupations; Military : Military occupations; NA : Not employed, lived on private income; Unemployed;
Inactive (neither employed nor unemployed, or unpaid domestic work, etc.); Not applicable (was not present, had died, etc.);
DK/Does not remember/Occupation incorrectly specified; No answer

31 The CNO ( http://www.ine.es/jaxi/menu.do?type=pcaxis&path=%2Ft40%2Fcno11%2F&file=inebase&L=0 ) is the Spanish adaptation of
the International Standard Classification of Occupations (ISCO) of the ILO (
http://www.ilo.org/public/spanish/bureau/stat/isco/ ), which has several levels of disaggregation, up to 5, and is coded to 4 digits.
Here it is presented with a single digit. The variable P40 of the CIS3041.sav matrix is the three-digit CNO 2011. OCUMAR11 is
therefore already a variable that has been recoded (grouped) to a single digit.


Frequencies ( OCUMAR11 )
    value                    # of Cases      %  Cumulative %
 1  Director                         84   3.40          3.40
 2  Technical                       309  12.60         16.10
 3  Support                         325  13.30         29.40
 4  Administrative                  100   4.10         33.40
 5  Services                        559  22.90         56.30
 6  Agricultural Qualified          132   5.40         61.70
 7  Skilled industry                359  14.70         76.40
 8  Operators                       274  11.20         87.60
 9  Elementary                      294  12.00         99.60
10  Military                         10   0.40        100.00

Case Summary ( OCUMAR11 )
     Valid  Missing    Total  % Missing
1  2446.00    34.00  2480.00       1.40

The objective is to have an occupational variable with fewer than the 10 categories of the original
variable. We will consider a grouping into 4 occupational categories plus a category of missing values,
according to the following criteria:
1. Upper class or high occupational category: Director and Technical.
2. Middle class or intermediate occupational category: Support and Administrative.
3. Qualified workers or lower middle category: Services, Agricultural Qualified and Skilled industry.
4. Unskilled or low category workers: Operators and Elementary.
5. Missing values: Military (which will join the 34 existing missing cases).

We enter the Data / Recode Variables procedure menu and choose the variable OCUMAR11 to pass it to the
box on the right of Variables to Recode . It automatically assigns the same name indicating that it will recode
into the same variable. In general, if we do not have the certainty to act in this way, we will prefer to create a
new variable. To do this, we select the line and click on Target to change the destination name of the variable,
we write the name of the new variable, for example OCUPA and click on Accept :

The initial dialog box appears like this:


Next we must specify the recoding criteria in Define Recode :

The recoding criteria that we have discussed are entered as follows. First we click on the pair of
variables that appears in the Variable Information box; for numerical variables a table of percentiles is
shown and for qualitative variables, as in this case, a table of frequencies. With factor variables we cannot
use ranges of values: we have to write each value exactly (we can copy the text shown on the left) and
specify the new value, the new text:
- In the first case we write:
Value = Director into High and click Add
Value = Technical into High and click Add .
This defines that Directors and managers, together with Technicians and professionals , are
joined in a single category of high occupational class, coded as High in the new variable.
- We repeat the same for the other three occupational groups: Middle , Skilled and Unskilled .
- In the last case: Value = Military into NA and click Add .
This last value will be treated as missing, along with the missing values that the variable already
has, identified by the symbol NA in the matrix.

We click OK in this window and again in the next one to execute the recoding.

To see the result we request the frequency table, but first the data dictionary needs tidying: ordering the
labels, removing the Military level, which now appears with zero frequency, and marking the ordinal character
of the variable.
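
The grouping criteria and the dictionary steps can also be sketched in base R with a lookup vector. The toy factor below is invented, its labels follow the OCUMAR11 frequency table, and the new labels High , Middle , Skilled and Unskilled are one possible naming of the four categories; none of this is the code that Deducer itself generates.

```r
# Toy factor with the OCUMAR11 labels (cases invented, one of them missing)
ocumar11 <- factor(c("Director", "Technical", "Support", "Administrative", "Services",
                     "Operators", "Elementary", "Military", NA, "Skilled industry"))

# Lookup vector: old label -> new label; Military is sent to NA
map <- c("Director" = "High", "Technical" = "High",
         "Support" = "Middle", "Administrative" = "Middle",
         "Services" = "Skilled", "Agricultural Qualified" = "Skilled",
         "Skilled industry" = "Skilled",
         "Operators" = "Unskilled", "Elementary" = "Unskilled",
         "Military" = NA)

# New ordered factor with only the four desired levels; the original is kept as it is
OCUPA <- factor(unname(map[as.character(ocumar11)]),
                levels = c("High", "Middle", "Skilled", "Unskilled"),
                ordered = TRUE)

table(OCUPA, useNA = "ifany")
```

Declaring the levels explicitly fixes their order and leaves out any level, such as Military , that should not appear in the new variable.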


The final result is the following:


Frequencies ( OCUPA )
   value       # of Cases      %  Cumulative %
1  High               393  16.10         16.10
2  Middle             325  13.30         29.50
3  Services           559  22.90         52.40
4  Skilled            591  24.30         76.70
5  Unskilled          568  23.30        100.00

Case Summary ( OCUPA )
     Valid  Missing    Total  % Missing
1  2436.00    44.00  2480.00       1.80

► Exercise 13. Proposed


The INE, in its report Introduction to the CNO-11 (
http://www.ine.es/daco/daco42/clasificaciones/Introduccion_CNO11.V02.pdf ), discusses the concept of occupation
and distinguishes between employment and competencies as the two fundamental dimensions that structure it.
Competencies in turn have two dimensions: specialization and level of competencies . The latter has 4 levels
(theoretically associated with formal educational levels) that correspond to the 1-digit occupational
categories as follows:

Table 1: Correspondence between the Major Groups of ISCO-08 and the level of competencies

Major Groups ISCO-08                                                      Level of competencies
1 - Directors and managers                                                3, 4
2 - Scientific and intellectual professionals                             4
3 - Mid-level technicians and professionals                               3
4 - Administrative support staff                                          2
5 - Service workers and salespeople in shops and markets                  2
6 - Farmers and skilled agricultural, forestry and fishing workers        2
7 - Officials, workers and artisans of mechanical arts and other trades   2
8 - Facility and machine operators and assemblers                         2
9 - Elementary occupations                                                1
0 - Military occupations                                                  1, 2, 4

Source: INE

According to this table, leaving aside group 0 (military occupations) and assigning directors and managers
only level 4, group the major occupational groups (variable OCUMAR11 of the CIS3041.sav matrix) into the 4
levels of competencies. Also complete the dictionary of the variable and request the frequency table to check
the result.

A second example of recoding uses a quantitative variable, age (variable P32 ). It is common to work with age
grouped into intervals of 5 or 10 years, or into broad age groups (young people, adults, older people). The
original quantitative variable thus reduces its scale and can be treated, with fewer categories, as an
ordinal qualitative variable. We propose creating a new age variable ( Age10 ) grouped into the following
intervals: 18-24, 25-34, 35-44, 45-54, 55-64 and over 64.

The original variable has no missing values. The original frequency distribution table is as follows:
Frequencies ( P32 )
    value  # of Cases     %  Cumulative %        value  # of Cases     %  Cumulative %
 1     18          32  1.30          1.30    39     56          36  1.50         67.40
 2     19          32  1.30          2.60    40     57          42  1.70         69.10
 3     20          18  0.70          3.30    41     58          29  1.20         70.30
 4     21          28  1.10          4.40    42     59          27  1.10         71.40
 5     22          38  1.50          6.00    43     60          48  1.90         73.30
 6     23          25  1.00          7.00    44     61          27  1.10         74.40
 7     24          27  1.10          8.10    45     62          33  1.30         75.70
 8     25          51  2.10         10.10    46     63          39  1.60         77.30
 9     26          42  1.70         11.80    47     64          40  1.60         78.90
10     27          40  1.60         13.40    48     65          48  1.90         80.80
11     28          23  0.90         14.40    49     66          37  1.50         82.30
12     29          39  1.60         15.90    50     67          39  1.60         83.90
13     30          46  1.90         17.80    51     68          24  1.00         84.90
14     31          48  1.90         19.70    52     69          31  1.20         86.10
15     32          41  1.70         21.40    53     70          36  1.50         87.60
16     33          47  1.90         23.30    54     71          27  1.10         88.70
17     34          53  2.10         25.40    55     72          28  1.10         89.80
18     35          51  2.10         27.50    56     73          18  0.70         90.50
19     36          37  1.50         29.00    57     74          21  0.80         91.40
20     37          48  1.90         30.90    58     75          19  0.80         92.10
21     38          47  1.90         32.80    59     76          20  0.80         92.90
22     39          46  1.90         34.60    60     77          18  0.70         93.70
23     40          48  1.90         36.60    61     78          25  1.00         94.70
24     41          43  1.70         38.30    62     79          16  0.60         95.30
25     42          57  2.30         40.60    63     80          17  0.70         96.00
26     43          61  2.50         43.10    64     81          17  0.70         96.70
27     44          71  2.90         45.90    65     82          14  0.60         97.30
28     45          51  2.10         48.00    66     83          15  0.60         97.90
29     46          51  2.10         50.00    67     84          13  0.50         98.40
30     47          45  1.80         51.90    68     85          11  0.40         98.80
31     48          42  1.70         53.50    69     86           8  0.30         99.20
32     49          45  1.80         55.40    70     87           4  0.20         99.30
33     50          57  2.30         57.70    71     88           5  0.20         99.50
34     51          33  1.30         59.00    72     89           4  0.20         99.70
35     52          34  1.40         60.40    73     90           2  0.10         99.80
36     53          49  2.00         62.30    74     91           2  0.10         99.80
37     54          56  2.30         64.60    75     92           1  0.00         99.90
38     55          34  1.40         66.00    76     94           3  0.10        100.00

Case Summary ( P32 )
     Valid  Missing    Total  % Missing
1  2480.00     0.00  2480.00       0.00

Following the protocol described earlier, we now specify the recoding criteria. Because the variable is numeric,
the criteria can be expressed as ranges of values.
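As a minimal sketch of a recoding by ranges, the same grouping can be written with base R's cut function instead of the menu. The age vector below is a toy example rather than the CIS data, and the break points are an assumption chosen to reproduce the six intervals of the Age10 variable:

```r
# Toy ages (hypothetical values, two per interval), not the CIS3041 data
age <- c(18, 24, 25, 34, 35, 44, 45, 54, 55, 64, 65, 94)

# cut() uses right-closed intervals by default: (17,24], (24,34], ...
Age10 <- cut(age,
             breaks = c(17, 24, 34, 44, 54, 64, Inf),
             labels = c("18-24", "25-34", "35-44", "45-54", "55-64", ">64"),
             ordered_result = TRUE)

age_table <- table(Age10)
print(age_table)
```

With ordered_result = TRUE the result is already an ordered factor, so no later conversion from character is needed in this route.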

Digital Document Dipòsit | UVIB http://ddd.uab.cat/record/129381


III.2 Preparation of data for analysis |101

The frequency table obtained after completing the data dictionary (converting the variable from character to
factor and ordering its values) is as follows:

Frequencies (Age10)

 #  Value  Cases      %  Cum. %
 1  18-24    200   8.10    8.10
 2  25-34    430  17.30   25.40
 3  35-44    509  20.50   45.90
 4  45-54    463  18.70   64.60
 5  55-64    355  14.30   78.90
 6  >64      523  21.10  100.00

Case Summary (Age10)

     Valid  Missing    Total  % Missing
1  2480.00     0.00  2480.00       0.00

► Exercise 14. Proposed


Recode the variable P15 of ideological self-positioning into three categories that group the values 1 to 3, 4 to 6
and 7 to 10.

Suppose that, with the data from the CIS survey, we ask what the average household income of the people
interviewed is. To answer this question we would need income as a quantitative variable, but the survey records
it qualitatively, by intervals. An alternative is to calculate the mean from the class mark (midpoint) of each
interval, for which we must recode the variable. The distribution of the income variable ( P45 ) is as follows:


Frequencies (P45)

 #  Value      Cases      %  Cum. %
 1  Without       10   0.60    0.60
 2  <=300         28   1.60    2.20
 3  301-600      185  10.80   13.10
 4  601-900      297  17.40   30.50
 5  901-1200     347  20.30   50.80
 6  1201-1800    386  22.60   73.40
 7  1801-2400    215  12.60   86.00
 8  2401-3000    120   7.00   93.10
 9  3001-4500     76   4.50   97.50
10  4501-6000     31   1.80   99.40
11  >6000         11   0.60  100.00

Case Summary (P45)

     Valid  Missing    Total  % Missing
1  1706.00   774.00  2480.00      31.20

We recode it into a new variable called P45m , assigning to each interval its class mark.

When the variable P45m is created in this way, however, it is a factor type variable. To convert it to type
double we can create a new blank variable x with this format, copy the information from the column of the
variable P45m , delete the column P45m and rename the variable x as P45m . Next we ask for the frequency table
and the mean as a descriptive. This is the result:

Frequencies (P45m)

 #  Value  Cases      %  Cum. %
 1      0     10   0.60    0.60
 2    150     28   1.60    2.20
 3    450    185  10.80   13.10
 4    750    297  17.40   30.50
 5   1050    347  20.30   50.80
 6   1500    386  22.60   73.40
 7   2100    215  12.60   86.00
 8   2700    120   7.00   93.10
 9   3750     76   4.50   97.50
10   5250     31   1.80   99.40
11   7500     11   0.60  100.00

Case Summary (P45m)

     Valid  Missing    Total  % Missing            Mean
1  1706.00   774.00  2480.00      31.20   P45m  1500.18

The average income of the households in the sample is €1,500.
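The recoding to class marks and the factor-to-double conversion can also be sketched on the command line. The example below rebuilds the valid cases from the published frequency table, so it does not need the CIS3041 file; the object names (labels, midpoints, lookup) are illustrative assumptions, not the authors' script:

```r
# Interval labels, frequencies and class marks taken from the tables in the text
labels    <- c("Without", "<=300", "301-600", "601-900", "901-1200", "1201-1800",
               "1801-2400", "2401-3000", "3001-4500", "4501-6000", ">6000")
n         <- c(10, 28, 185, 297, 347, 386, 215, 120, 76, 31, 11)
midpoints <- c(0, 150, 450, 750, 1050, 1500, 2100, 2700, 3750, 5250, 7500)

# Rebuild the 1706 valid answers as a factor, as P45 is stored
P45 <- factor(rep(labels, times = n), levels = labels)

# as.numeric(P45) would return the level codes 1..11, not the incomes,
# so the factor is mapped through a named lookup vector instead
lookup <- setNames(midpoints, labels)
P45m   <- unname(lookup[as.character(P45)])

mean_income <- mean(P45m)
print(round(mean_income, 2))
```

The lookup makes the result a double directly, which avoids the copy-and-rename step needed in the data viewer.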

► Exercise 15. Proposed


Recode the variable P46 related to personal income with the class mark of the intervals and calculate the
average income.

2.2.2.2. Transformation expressions

We will now look at the transformation procedures that generate new variables by performing a calculation or a
conditional transformation. Their commands rely on so-called transformation expressions, which specify what
is to be computed by combining different types of operators and functions. In these expressions we can use
arithmetic operators: + - * / ^ , constants, functions of all types, relational operators: > >= < <= == != and
logical operators: & | ! .
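A short illustration of the three families of operators may help. The vector x is a toy example, and the object names are only for illustration:

```r
x <- c(2, 5, 9)

arith  <- x^2 + 1            # arithmetic: power and addition, element by element
relat  <- x >= 5             # relational: returns TRUE/FALSE for each element
logic1 <- x > 2 & x != 9     # logical AND of two relational expressions
logic2 <- !(x < 5) | x == 2  # logical NOT and OR

print(arith)
print(relat)
print(logic1)
print(logic2)
```

All operators work on whole vectors, which is what allows a single expression to transform every case of a variable at once.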

2.2.2.3. Variable calculation

Creating new variables through calculations is a constant need in any quantitative data analysis process.
Whether we modify or combine the original variables, a very wide range of transformations is possible:
statistical transformations that prepare variables for an analysis, the creation of indicators and new
quantitative variables, the use of instrumental variables, etc.

Calculations in R are performed from the command line (or through scripts ). We will do some exercises to
calculate variables. First of all, we can consider creating an index of sociopolitical activism based on the
answers to question P14 :


P.14 There are various forms of participation in social and political actions
that people can carry out. Please tell me, for each of them, whether you (SHOW
CARD D):

1. Have participated during the last twelve months
2. Participated in the more distant past
3. Have never participated

                                                     1   2   3   NC
- Attend a demonstration                             1   2   3   9   (86)
- Participate in a strike                            1   2   3   9   (87)
- Participate in a political debate forum or blog
  on the Internet                                    1   2   3   9   (88)
- Sign a petition or collection of signatures,
  either in person or on the Internet                1   2   3   9   (89)

We use the following criteria: each form of participation scores 2 if the person has participated recently, 1 if
they participated in the past and 0 if they have never participated. The index is built by adding these scores
over the 4 questions for each individual. Anyone who has recently participated in all four forms will have a
participation level of 8, and anyone who has never participated in any of them will have a level of 0. We will
call the new variable P14index .

Given the current values of the variables ( P1401 to P1404 ), we need to go from factor type to double type,
recoding the values as in the last recoding discussed in the previous section. We can do this for all 4 variables
simultaneously, and we will call the results P1401x to P1404x .
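The recoding of the four factors into scores, and the sum that follows, can be sketched as below. The data frame toy and the level labels are assumptions made for illustration (the actual labels in CIS3041 may differ); only the scoring rule 2 / 1 / 0 comes from the text:

```r
# Hypothetical level labels, in questionnaire order (codes 1, 2, 3)
lv <- c("Last 12 months", "More distant past", "Never")

# Toy data: four respondents, the last one with a missing answer in P1401
toy <- data.frame(
  P1401 = factor(c(lv[1], lv[3], lv[2], NA),    levels = lv),
  P1402 = factor(c(lv[1], lv[3], lv[1], lv[1]), levels = lv),
  P1403 = factor(c(lv[1], lv[3], lv[3], lv[1]), levels = lv),
  P1404 = factor(c(lv[1], lv[3], lv[2], lv[1]), levels = lv)
)

# Score attached to each level: recent = 2, past = 1, never = 0
score <- c(2, 1, 0)

# Recode the four variables in one loop, creating P1401x ... P1404x as doubles
for (v in c("P1401", "P1402", "P1403", "P1404")) {
  toy[[paste0(v, "x")]] <- score[as.integer(toy[[v]])]
}

# Sum of the four scores; a missing value in any item gives NA in the index
toy$P14index <- toy$P1401x + toy$P1402x + toy$P1403x + toy$P1404x
print(toy$P14index)
```

Adding with + propagates missing values, which is why respondents with an unanswered item end up as missing in the index.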

Once changed to double format, we create the index from the Deducer console command line as follows:

> CIS3041$P14index = CIS3041$P1401x + CIS3041$P1402x +
    CIS3041$P1403x + CIS3041$P1404x


On the left, the instruction contains the name of the new variable ( P14index ), which is associated with the
CIS3041 data matrix. On the right is the numerical expression that adds the 4 variables for each individual.
When we press the <Enter> key the variable is created and added to the data matrix as its last variable. Note
that some individuals have a missing value in at least one of the four initial variables; the calculation cannot
be carried out for them, so they will have a missing value in the new variable. The frequency table of the new
variable is as follows:
Frequencies (P14index)

 #  Value  Cases      %  Cum. %
 1      0    805  33.00   33.00
 2      1    324  13.30   46.20
 3      2    417  17.10   63.30
 4      3    299  12.20   75.50
 5      4    264  10.80   86.30
 6      5    127   5.20   91.50
 7      6    127   5.20   96.70
 8      7     36   1.50   98.20
 9      8     44   1.80  100.00

Case Summary (P14index)

     Valid  Missing    Total  % Missing              Mean
1  2443.00    37.00  2480.00       1.50   P14index   2.09

If we calculate the average, we obtain a value of 2.09, much closer to 0 than 8, indicating a relatively low level
of sociopolitical activism in Spanish society as a whole.

► Exercise 16. Proposed


Build an index from question P11, on the frequency with which newspapers, radio and television are consulted to
follow political news, giving between 4 and 0 points to the frequencies ranging from 1 ( Every day ) to 5
( Never ) and adding the scores for each individual.

Standardization (or typification) of a variable is a transformation that consists of subtracting the mean from
each value of a quantitative variable and dividing the result by the standard deviation:

$$ z_i = \frac{x_i - \bar{x}}{s} $$

where $\bar{x}$ is the mean and $s$ is the standard deviation of the variable.

We perform this operation with the age variable ( P32 ). We first need to know the values of the mean and the
standard deviation, so we run the Analysis / Descriptives procedure and obtain:

        Mean  St. Deviation  Valid N
P32    48.32          17.49     2480

Once the mean and standard deviation are known, we create the new variable, named Agetip , using:

> CIS3041$Agetip = (CIS3041$P32 - 48.32)/17.49

If we ask for the descriptives of the new variable we can check that, apart from the rounding of the decimals,
the mean is 0 and the standard deviation is 1.

              Mean  St. Deviation  Valid N
Agetip   -0.000221           1.00     2480
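The small deviation of the mean from 0 comes from typing the mean and standard deviation rounded to two decimals. As a sketch on a toy vector (not the P32 data), computing both statistics inside the expression avoids that rounding, and base R's scale function gives the same result:

```r
# Toy ages (hypothetical values)
x <- c(18, 25, 44, 61, 94)

# Standardize using the unrounded mean and standard deviation
z <- (x - mean(x)) / sd(x)

# scale() centers and scales in one call; it returns a one-column matrix
z_scale <- as.vector(scale(x))

print(round(z, 3))
print(c(mean = mean(z), sd = sd(z)))
```

On a data frame the same idea would read (CIS3041$P32 - mean(CIS3041$P32)) / sd(CIS3041$P32), adding na.rm = TRUE to both functions if the variable had missing values.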

The same result can be reached through the menu with Data / Transform , after choosing the variable P32 ,
moving it to the right and choosing the transformation Standardize .

The variable P32.tr will appear at the end of the data matrix, coinciding with the one we created previously.
Through this procedure you can apply other pre-established transformations or even define your own:
Center : Rescales the variables so that they have a mean of 0.
Standardize : Rescales the variables so that they have mean 0 and standard deviation 1.
Robust Standardize : Rescales the variables so that they have median 0 and median absolute deviation 1.
Range : Transforms the variable so that it takes values between 0 and 1.
Box-cox : Transforms the variable to try to obtain a normal distribution.
Rank : Replaces the values with their rank.
Log : Returns the natural logarithm (for values greater than 0).
Square root : Returns the square root.
Absolute value : Returns the absolute value.
Quantiles : Divides the variable into groups with the same number of observations.
Equal width : Divides the variable into groups with intervals of the same width.
Custom : Allows you to define custom transformations.
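Several of these transformations have direct base R equivalents. The sketch below uses a toy vector and is an approximation of what the menu does, not Deducer's own implementation; in particular, mad applies its default scaling constant:

```r
x <- c(2, 4, 4, 10)

centered <- x - mean(x)                       # Center: mean 0
standard <- (x - mean(x)) / sd(x)             # Standardize: mean 0, sd 1
robust   <- (x - median(x)) / mad(x)          # Robust Standardize: median 0, MAD 1
ranged   <- (x - min(x)) / (max(x) - min(x))  # Range: values between 0 and 1
ranks    <- rank(x)                           # Rank: tied values share the average rank
logged   <- log(x)                            # Log: natural logarithm

print(ranged)
print(ranks)
```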

We will now construct the indicators on the political situation prepared by the CIS in the
Barometer32 . The questions of each month's barometer related to the political situation that are used in
the construction of the indicators are P4 and P6:
P.4 And referring now to the general political situation in Spain,
how would you rate it: very good, good, average, bad or very bad?

32 The methodology for constructing the indicators of the CIS Barometer can be consulted on the page:
http://www.cis.es/cis/opencms/ES/11_barometros/metodologia.html .

The Current Political Situation Indicator ( SPA ), based on question P4, is defined as:

$$ SPA = \frac{100\,p_1 + 75\,p_2 + 50\,p_3 + 25\,p_4 + 0\,p_5}{p_1 + p_2 + p_3 + p_4 + p_5} $$

where p_1 , p_2 , p_3 , p_4 and p_5 are, respectively, the response percentages of the very good, good, average,
bad and very bad options.

The Political Expectations Indicator ( IEP ), based on question P6, is defined as:

$$ IEP = \frac{100\,p_1 + 50\,p_2 + 0\,p_3}{p_1 + p_2 + p_3} $$

where p_1 , p_2 and p_3 are, respectively, the response percentages of the better, equal and worse options.

Finally, the Political Trust Indicator ( ICP ) is the arithmetic mean of the previous two:

$$ ICP = \frac{SPA + IEP}{2} $$

These are synthetic indicators, expressed as a single value for the entire sample, which can then be compared
over time with previous Barometers.33

33 See http://www.cis.es/cis/export/sites/default/-Archivos/Indicadores/documentos_html/IndiPol.html .

Source: CIS

The frequencies of both variables for October 2014 are:

Frequencies (P4)

 #  Value      Cases      %  Cum. %
 1  Very good      2   0.10    0.10
 2  Good          49   2.00    2.10
 3  Average      357  14.90   17.00
 4  Bad          769  32.00   49.00
 5  Very bad    1227  51.00  100.00

Case Summary (P4)

     Valid  Missing    Total  % Missing
1  2404.00    76.00  2480.00       3.10

Frequencies (P6)

 #  Value   Cases      %  Cum. %
 1  Better    287  13.30   13.30
 2  Equal    1194  55.40   68.70
 3  Worse     676  31.30  100.00

Case Summary (P6)

     Valid  Missing    Total  % Missing
1  2157.00   323.00  2480.00      13.00

To obtain the 3 indicators we will use the console command line as a “calculator” (the denominator is 100
because the valid percentages of each question add up to 100):

> SPA=((100*0.1)+(75*2.0)+(50*14.9)+(25*32.0)+(0*51.0))/100
> SPA
[1] 17.05

> IEP=((100*13.3)+(50*55.4)+(0*31.3))/100
> IEP
[1] 41

> ICP=(SPA+IEP)/2
> ICP
[1] 29.025
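The same arithmetic can be wrapped in a small function so that it can be reused with another month's percentages. The function name weighted_indicator is a hypothetical helper introduced here for illustration; it divides by the sum of the percentages, as in the definitions, rather than by a fixed 100:

```r
# Hypothetical helper: weighted mean of the scores, with the percentages as weights
weighted_indicator <- function(scores, p) {
  stopifnot(length(scores) == length(p))
  sum(scores * p) / sum(p)
}

# Valid percentages for October 2014, taken from the frequency tables in the text
p4 <- c(0.1, 2.0, 14.9, 32.0, 51.0)   # very good, good, average, bad, very bad
p6 <- c(13.3, 55.4, 31.3)             # better, equal, worse

SPA <- weighted_indicator(c(100, 75, 50, 25, 0), p4)
IEP <- weighted_indicator(c(100, 50, 0), p6)
ICP <- (SPA + IEP) / 2

print(round(c(SPA = SPA, IEP = IEP, ICP = ICP), 3))
```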

2.2.2.4. Conditional transformations

To finish this tour of the transformation of variables, we will work with a basic procedure in the analysis of
quantitative information: the creation of variables with conditional transformations. These are situations
where certain conditions are established on the characteristics of the units and, depending on whether they are
fulfilled according to a logical expression (true, false or missing), a value is assigned through an expression
(giving a specific value or executing a calculation formula). Conditional transformations can be used in
various commands, but we will focus primarily on the ifelse command.

The ifelse command has the following general form: ifelse(test, yes, no) . A condition ( test ) is evaluated for
each case: if it is true the transformation in yes is applied, and otherwise the transformation or action in no
is applied.
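A minimal illustration of ifelse on a toy vector follows (the variable names are hypothetical). It also shows what happens when the condition cannot be evaluated because the value is missing:

```r
# Toy ages, including a missing value
age <- c(17, 25, NA, 70)

# test: age >= 18; yes: "adult"; no: "minor"
group <- ifelse(age >= 18, "adult", "minor")

print(group)
```

When the test is NA the result is also NA, so missing cases are neither classified as yes nor as no.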

Through conditional transformations, typological variables are constructed that simultaneously combine
characteristics of various variables (attribute space) to define various types. This is the case of the construction
of the variable of social class, lifestyle, type of consumer, etc.

To illustrate the use of this procedure with R, we will create a (typological) variable of intergenerational
occupational mobility by relating the father's occupational level to that achieved by the son/daughter. The
occupational variables are, respectively, OCUPAPAD and OCUMAR11 . As a previous step, we will request
the contingency table that crosses both variables ( Analysis / Contingency Tables ) to visualize the information
being worked on, illustrate the procedure and later verify the creation of the new variable. By convention, the
social origin of the father is placed in rows and that of the son/daughter in columns. The table is the
following:

Rows: OCUPAPAD (occupation of the father). Columns: OCUMAR11 (occupation of the child).

            1    2    3    4    5    6    7    8    9  Total
    1      13   19   13    3   12    0    2    1    6     69
    2       4   75   19    4   18    0   10    5    2    137
    3      10   34   58   13   46    1    8   15   10    195
    4       1    7    9    9   14    1    3    1    4     49
    5      18   34   36   15   98    6   26   11   28    272
    6       7   26   35    9   80   84   73   60   50    424
    7      12   44   64   15  121    9  121   48   70    504
    8       7   33   48   11   79    7   50   91   29    355
    9       2   12    8    5   25    7   24   20   53    156
Total      74  284  290   84  493  115  317  252  252   2161

1 Directors and managers; 2 Scientific and intellectual technicians and professionals; 3 Technicians and support
professionals; 4 Accounting, administrative and other office employees; 5 Workers in catering, personal,
protection and sales services; 6 Qualified workers in the agricultural, livestock, forestry and fishing sectors;
7 Craftsmen and skilled workers in manufacturing and construction industries, except installation operators;
8 Facility and machinery operators, and assemblers; 9 Elementary occupations

The diagonal (in blue) defines immobility or occupational social reproduction, where the occupational origin of
the father is the same as that of the son/daughter. The values in the lower triangle (in green) correspond to
upward mobility: the children have a higher occupational level than their parents. Finally, the upper triangle
(in red) corresponds to downward mobility: the children have a lower occupational level.

To create this typology of occupational mobility we will use conditional transformations. In this case we
establish 3 conditions34 :
- If OCUPAPAD < OCUMAR11 , there is downward mobility
- If OCUPAPAD = OCUMAR11 , there is immobility
- If OCUPAPAD > OCUMAR11 , there is upward mobility

All cases that do not meet any of these conditions, that is, cases with a missing value in either of the two
variables, will become system missing values. To obtain the previous table of 9 by 9 categories we must treat
the value 10 “Military” as a missing value.

34 As the values range from 1, the highest occupational level, to 9, the lowest, the direction of the comparison
is reversed: a destination value higher than the origin value means downward mobility, and a lower one means
upward mobility.


To obtain the typology of occupational mobility with R we will execute instructions in the command language
by writing a syntax program ( script ). To create the syntax file we open File / New Document and write the
instructions, which we comment on below.35

The frequency tables of the two variables are first requested with the frequencies command,36 which
only works with Deducer open or with its library loaded, since it is not a command from the R base library.
The levels command allows you to see the attributes of a variable and also to change them, as in this case,
where the Military value becomes NA in the two variables. The which command is also used to find the
position that corresponds to the Military attribute in the variable. To execute the instructions of the syntax
file, select them and press < CTRL >+< R >. The frequencies of the variables are:

35 The instructions are in the Transformar.R file.

36 The variables appear associated with the data frame to which they belong, CIS3041 , to indicate in which file
the variable is and where it should be saved if a new one is created. In R there are two commands, attach and
detach , that allow you to manage this aspect: the first avoids constantly writing the name of the matrix by
establishing the default database, and the second cancels the action.

Frequencies (OCUMAR11)

 #  Value                   Cases      %  Cum. %
 1  Director                   84   3.40    3.40
 2  Technical                 309  12.70   16.10
 3  Support                   325  13.30   29.50
 4  Administrative            100   4.10   33.60
 5  Services                  559  22.90   56.50
 6  Agricultural Qualified    132   5.40   61.90
 7  Qualified industry        359  14.70   76.70
 8  Operators                 274  11.20   87.90
 9  Elementary                294  12.10  100.00

Case Summary (OCUMAR11)

     Valid  Missing    Total  % Missing
1  2436.00    44.00  2480.00       1.80

Frequencies (OCUPAPAD)

 #  Value                   Cases      %  Cum. %
 1  Director                   72   3.30    3.30
 2  Technical                 138   6.30    9.60
 3  Support                   199   9.10   18.70
 4  Administrative             50   2.30   20.90
 5  Services                  279  12.70   33.70
 6  Agricultural Qualified    429  19.60   53.20
 7  Qualified industry        512  23.30   76.60
 8  Operators                 358  16.30   92.90
 9  Elementary                156   7.10  100.00

Case Summary (OCUPAPAD)

     Valid  Missing    Total  % Missing
1  2193.00   287.00  2480.00      11.60
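The step that turns the Military level into a missing value can be sketched with base R on a toy factor. The vector occ and its labels are illustrative assumptions; the idea of using which to locate the level and levels to replace it follows the description above:

```r
# Toy occupational factor with a "Military" category (hypothetical data)
occ <- factor(c("Director", "Military", "Services", "Military"))

# Position of the level to be removed
pos <- which(levels(occ) == "Military")

# Assigning NA to that level turns its cases into missing values
# and drops the level from the factor
levels(occ)[pos] <- NA

print(levels(occ))
print(is.na(occ))
```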

Next, the contingency table is requested; this is also a command from the Deducer library. Its execution
generates this result:

OCUMAR11 by OCUPAPAD (counts; rows: OCUMAR11 , columns: OCUPAPAD )

                                                                        Agricultural  Qualified                           Row
OCUMAR11                Director  Technical  Support  Administrative  Services  Qualified   industry  Operators  Elementary  Total
Director                      13          4       10               1        18          7         12          7           2     74
Technical                     19         75       34               7        34         26         44         33          12    284
Support                       13         19       58               9        36         35         64         48           8    290
Administrative                 3          4       13               9        15          9         15         11           5     84
Services                      12         18       46              14        98         80        121         79          25    493
Agricultural Qualified         0          0        1               1         6         84          9          7           7    115
Qualified industry             2         10        8               3        26         73        121         50          24    317
Operators                      1          5       15               1        11         60         48         91          20    252
Elementary                     6          2       10               4        28         50         70         29          53    252
Column Total                  69        137      195              49       272        424        504        355         156   2161

Finally, we proceed to the construction of the new variable, which we will call Mobility . We start by creating
the variable with all its values missing and then modify them according to the conditions mentioned above,
which define the three types of mobility. The first uses the ifelse command to establish the condition that
must be satisfied to assign the Downward value to an individual in the new variable (downward mobility):
OCUPAPAD < OCUMAR11 . If the condition is met, the Downward value is assigned to all cases that meet it;
otherwise the case keeps the value it initially has in the variable, that is, NA . The other two conditions
establish, in the same way, immobility, OCUPAPAD == OCUMAR11 , and upward mobility, OCUPAPAD > OCUMAR11 .
To finish, we change the type of the variable created: it is converted from the character format with which it
is generated to factor , and we change the order of the labels to turn it into an ordered factor variable. The
frequency table obtained is the following:

Frequencies (Mobility)

 #  Value       Cases      %  Cum. %
 1  Downward      631  29.20   29.20
 2  Immobility    602  27.90   57.10
 3  Upward        928  42.90  100.00

Case Summary (Mobility)

     Valid  Missing    Total  % Missing
1  2161.00   319.00  2480.00      12.90

As can be seen, absolute upward occupational mobility stands out (43%) as a result of the process of changes
that Spanish society has experienced from the period of industrialization to the current post-industrial phase.
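The construction of the typology can be checked without the CIS3041 file by rebuilding the cases from the 9 by 9 contingency table of father's and child's occupation. This is a sketch under that assumption; the names counts, toy, father and child are illustrative and do not come from the authors' script:

```r
# Father's occupation in rows, child's occupation in columns (counts from the text)
counts <- matrix(c(
  13, 19, 13,  3,  12,  0,   2,  1,  6,
   4, 75, 19,  4,  18,  0,  10,  5,  2,
  10, 34, 58, 13,  46,  1,   8, 15, 10,
   1,  7,  9,  9,  14,  1,   3,  1,  4,
  18, 34, 36, 15,  98,  6,  26, 11, 28,
   7, 26, 35,  9,  80, 84,  73, 60, 50,
  12, 44, 64, 15, 121,  9, 121, 48, 70,
   7, 33, 48, 11,  79,  7,  50, 91, 29,
   2, 12,  8,  5,  25,  7,  24, 20, 53),
  nrow = 9, byrow = TRUE)

# One row per individual: expand.grid varies 'father' fastest,
# which matches the column-major order of as.vector(counts)
cells <- expand.grid(father = 1:9, child = 1:9)
toy   <- cells[rep(seq_len(nrow(cells)), times = as.vector(counts)), ]

# Start with all values missing, then fill in each type with ifelse
toy$Mobility <- NA_character_
toy$Mobility <- ifelse(toy$father <  toy$child, "Downward",   toy$Mobility)
toy$Mobility <- ifelse(toy$father == toy$child, "Immobility", toy$Mobility)
toy$Mobility <- ifelse(toy$father >  toy$child, "Upward",     toy$Mobility)

# Convert from character to an ordered factor
toy$Mobility <- factor(toy$Mobility,
                       levels = c("Downward", "Immobility", "Upward"),
                       ordered = TRUE)

mobility_table <- table(toy$Mobility)
print(mobility_table)
print(round(100 * prop.table(mobility_table), 1))
```

The counts obtained this way match the frequency table of the typology, which is a useful check that the direction of each comparison is the intended one.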


► Exercise 17. Proposed


Carry out an analysis of the relationship between educational level (variable STUDIES ) and occupation
(variable OCUMAR11 ) of the people interviewed. Propose the creation of an empirical typology that relates
them based on the frequencies observed in the contingency table.

► Exercise 18. Proposed


Create a typological variable that relates money and happiness, considering the variables Personal Happiness
Scale ( P30 ) and Personal Income ( P46 ). To do this, first recode each of the variables into three
categories: happy, neither happy nor unhappy, and unhappy for happiness; rich, neither rich nor poor, and poor
for income. Answer the question: to what extent does money make you happy?

As we have seen throughout this section, carrying out transformations with variables implies modifying or
creating new ones that expand our data file, as we highlighted at the beginning of this chapter when talking
about data processing. This involves managing how to save this data. A good practice is to keep a copy of the
original data source and create the expanded array by saving it under a different name. In our case, all the
variables that we have been generating are found in the matrix CIS3041+.rda .

It is also worth noting that the data have generally been generated from the menu, in an interactive work
dynamic, which may be a limitation when it comes to replicating the work carried out. To repeat the exercises
seen here we have the manual itself, but in research practice, reviewing or redoing the generation of data
and its analysis requires keeping a record of it. One way to do this is to systematically save result files
containing the syntax and the results of its execution. But repeating those commands through the menu to
reproduce the results can be complicated, long and laborious. The alternative is to save syntax files with all
the tasks performed: when executed again they regenerate, in a matter of seconds, all the hours of work that
went into designing them. This is how we have worked, and all the transformations seen in the chapter are saved
in the Transformar.R syntax program, which can be consulted on the website of this chapter.
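As a minimal sketch of this practice, an extended data frame can be saved under a new name with save and recovered with load . The object and file names below are hypothetical, and a temporary folder is used so that the example does not overwrite anything:

```r
# Toy extended data frame (hypothetical), standing in for the expanded matrix
toy <- data.frame(id = 1:3, P14index = c(8, 0, NA))

# Save it under a different name from the original source
rda_path <- file.path(tempdir(), "toy_extended.rda")
save(toy, file = rda_path)

# Remove the object and recover it from the file
rm(toy)
load(rda_path)
print(toy)

# A saved script can be re-run in full with source(), e.g. source("Transformar.R"),
# provided the file is in the working directory
```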

3. Bibliography

Badiella, Ll. et al. (2015). Getting Started with Deducer: a graphical interface for R users . Bellaterra
(Cerdanyola del Vallès): Applied Statistics Service of the Autonomous University of Barcelona. 5th
edition.
http://sct.uab.cat/estadistica/sites/sct.uab.cat.estadistica/files/Manual%20curs%20Deducer.pdf
Bouso, J. (2013). The R statistical package . Madrid: Center for Sociological Research.

Chapman, G. (2012). Deducer Quick Start Guide . Exploring Computer Science. National Science
Foundation.
http://www.exploringcs.org/wp-content/uploads/2010/08/Deducer-Quick-Start-Guide.pdf
Domínguez, M.; Simó, M. (2003). Tècniques d'Investigació Social Quantitatives . Barcelona: Editions
Universitat de Barcelona. Methodology, 13.
Dalgaard, P. (2008). Introductory Statistics with R. New York: Springer.
Díaz de Rada, V. (2002). Data analysis techniques for social researchers. Practical applications with
SPSS for Windows . Madrid: RA-MA.
Díaz de Rada, V. (2009). Analysis of survey data . Barcelona: UOC Publishing.
Fachelli, S.; López-Roldán, P. (2013). Are we more mobile? Including the invisible half. XI Spanish
Congress of Sociology , Madrid July 10-12, 2013.
http://www.fes-web.org/uploads/files/modules/congress/11/papers/1923.pdf .
Fachelli, S.; López-Roldán, P. (2015). Are we more mobile including the invisible half? Analysis of
intergenerational social mobility in Spain in 2011. Spanish Journal of Sociological Research ,
150.
IBM Corporation (2013). IBM SPSS Statistics 22 Command Syntax Reference .
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/22.0/en/client/Manuals/IBM_SPSS_Statistics_Command_Syntax_Reference.pdf .
IBM Corporation (2015a). IBM SPSS Statistics 22 Core System. User's guide .
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/22.0/es/client/Manuals/IBM_SPSS_Statistics_Core_System_User_Guide.pdf .
IBM Corporation (2015b). IBM SPSS Statistics Base 22 .
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/22.0/es/client/Manuals/IBM_SPSS_Statistics_Base.pdf .
IBM Corporation (2015c). A Quick Guide to IBM SPSS Statistics 22 .
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/22.0/es/client/Manuals/IBM_SPSS_Statistics_Brief_Guide.pdf .
Lizasoaín, L.; Joaristi, L. (2003). Data management and analysis with SPSS: version 11. Madrid:
Paraninfo.
López-Roldán, P. (2014). Data analysis with SPSS . In P. López-Roldán, Resources for social research .
Bellaterra (Cerdanyola del Vallès): Dipòsit Digital de Documents, Universitat Autònoma de
Barcelona.
http://ddd.uab.cat/record/89349
Murillo Torrecilla, F. J.; Martínez-Garrido, C. (2012). Analysis of quantitative data with SPSS in socio-
educational research . Madrid: Publications Service of the Autonomous University of Madrid.
Muenchen, R. A. (2011). R for SAS and SPSS Users . New York: Springer. 2nd edition.
Pardo, A.; Ruiz, M. A. (2005). Data analysis with SPSS 13 . Madrid: McGraw-Hill.
Pardo, A.; Ruiz, M. A. (2009). Data Management with SPSS Statistics . Madrid: Synthesis.
R Development Core Team (2011). R: A Language and Environment for Statistical Computing . Vienna, Austria:
The R Foundation for Statistical Computing. ISBN: 3-900051-07-0. http://www.r-project.org/ .
Rial, A.; Varela, J.; Rojas, A. J. (2001). Preliminary data cleaning and analysis in SPSS . Madrid: RA-
MA.
Spector, Ph. (2008). Data Manipulation with R. New York: Springer.
