Assignment 2 - Data Management

CT051-3-M- DATA MANAGEMENT INDIVIDUAL ASSIGNMENT PART 2

Uploaded by

Amirrul Rasyid

DATA MANAGEMENT

CT051-3-M

DR. MURUGANANTHAN VELAYUTHAM

INDIVIDUAL ASSIGNMENT

Student Name TP Number


Amirrul Rasyid Bin Norazman TP079469

Abstract
The author conducted a study of the relationships between personal traits and social
outcomes in speed dating. The author found significant correlations between sincerity and
intelligence, attractiveness and likability, fun and likability, and shared interests and likability.
The study used statistical methods such as Spearman correlation and Chi-Square tests.
Visualizations such as clustered bar charts, boxplots, and scatter plots were employed to
enhance data interpretation. The research highlights the impact of personal traits on social
dynamics and provides insights for future studies in psychology and social network analysis.

Table of Contents
Abstract.....................................................................................................................................2

Introduction..............................................................................................................................4

Related Works..........................................................................................................................5

Speed Dating......................................................................................................................................5

Predicting a match in a relationship................................................................................................6

Method......................................................................................................................................7

Participants........................................................................................................................................7

Materials.............................................................................................................................................7

Dataset Exploration and Preprocessing..........................................................................................7


SAS Code, SAS Results & Explanation.......................................................................................................13

Exploratory Data Analysis (EDA)..................................................................................................23


SAS Code, SAS Results & Explanation.......................................................................................................25

Feature Engineering........................................................................................................................34
SAS Code, SAS Results & Explanation.......................................................................................................34

Hypotheses Testing..........................................................................................................................39
Hypothesis 1 (sinc-intel)...............................................................................................................................40
Hypothesis 2 (Attr-like)................................................................................................................................41
Hypothesis 3 (Shar-fun)................................................................................................................................42
Hypothesis 4 (Fun-like)................................................................................................................................43
Hypothesis 5 (Shar-like)...............................................................................................................................44
SAS Code, SAS Results & Explanation.......................................................................................................45
Insights on Hypotheses.................................................................................................................................47

Discussion................................................................................................................................48

Conclusion...............................................................................................................................50

Future Research.....................................................................................................................51

Works Cited............................................................................................................................52

Appendix A.............................................................................................................................54

Appendix B.............................................................................................................................56

Table of Figures
Figure 1 Total observations and number of missing observations in the dataset....................10
Figure 2 The datapoints in variable ‘prob’...............................................................................11
Figure 3 The datapoints in variable 'met'.................................................................................11
Figure 4 Numbers with fractional parts in the variables..........................................................12
Figure 5 Fit Diagnostics for the variable 'age'..........................................................................12
Figure 6 Outliers in the variable 'age'.......................................................................................13
Figure 7 Outliers in the variable 'income'................................................................................14
Figure 5 Pearson Correlation Coefficient Analysis on all 21 variables...................................40

Table of Tables
Table 1 Variables, their types and Explanation..........................................................................9
Table 2 Breakdown of each plot in PROC REG......................................................................12
Table 3 List of Hypotheses.......................................................................................................41

Introduction
Data management is crucial for the development and performance of predictive
models in machine learning. The effectiveness of these models depends on the quality and
preprocessing of the data they use (Alexandropoulos, Kotsiantis, & Vrahatis, 2019). This
report aims to explore the detailed processes of data preprocessing, exploratory data analysis
(EDA), and feature engineering, specifically focusing on transforming a specific dataset into
valuable inputs that can improve predictive models. The analysis is conducted using SAS
Studio, a robust data manipulation and analysis tool.

Columbia University carried out an experiment from 2002 to 2004, which resulted in
the dataset being used for this assignment. The university conducted a speed-dating
experiment, tracking data from speed-dating sessions attended by young adults engaging with
individuals of the opposite gender. There are 3000 observations and 14 variables (gender, age,
income, the primary goal in participating in the speed dating, the decision if the date was a
match, attractiveness, sincerity, intelligence, fun, ambitiousness, shared interest, overall
rating, probability of interest being reciprocated, and whether the participants had met the date
previously) in the dataset, which makes it useful for exploring algorithms that handle mixed
data types.

The data management process for this assignment is an iterative one, with each stage
building on the previous to ensure a comprehensive approach. The initial step, data
preprocessing, handles missing values and outliers and ensures data consistency. This is
followed by exploratory data analysis (EDA), which produces statistical and inferential
summaries, as well as visualisations, providing valuable insights into data distributions and
relationships. The subsequent step, feature engineering, encompasses variable transformation
and creation, which is crucial for enhancing the quality of the dataset and rendering it more
suitable for predictive modelling. Finally, hypotheses are formulated based on the cleaned
and transformed dataset, and these hypotheses are tested using statistical methods and
visualisations.

In this report, the author aims to showcase various feature engineering techniques,
such as mean imputation, outlier detection and handling, binning, logarithm transformation,
one-hot encoding, and scaling. These techniques will be applied to the speed-dating dataset to
derive valuable insights and improve the effectiveness of predictive models.

Related Works
In this section, the author will delve into the existing literature on the dataset's themes,
which are speed dating and predicting a match in a relationship.

Speed Dating
Finkel, Eastwick and Matthews argue that since its development in the late 1990s, the
speed-dating concept has enabled researchers to gain important and unique insights into the
dynamics of romantic attraction (Finkel, Eastwick, & Matthews, 2007). Their research serves
as a conceptual and methodological guide for researchers looking to conduct their own speed
dating study. It includes detailed procedures and references the Northwestern Speed-Dating
Study as an example.

In the following year, Finkel and Eastwick published another article on the speed
dating concept (Finkel & Eastwick, Speed-dating, 2008). The article outlines the advantages
and possibilities of speed-dating procedures, discusses their significant contributions to our
comprehension of the social mind, and demonstrates how researchers can utilise speed-dating
and its variations (such as speed-networking, speed-interviewing, and speed-friending) to
investigate subjects pertinent to various subfields of psychological science.

Asendorpf, Penke, and Back discovered that both men and women primarily focused
on the physical attractiveness of their dating partners (Asendorpf, Penke, & Back, 2011).
Additionally, women also took men's sociosexuality, openness to experience, shyness,
education, and income into account. This study was conducted with 382 participants aged 18
to 54 over a one-year period in a community sample. This study also found that men tend to
become more selective as they age, while women tend to become less selective as they grow
older. Being selective is associated with being more popular with the opposite sex,
particularly for men. In this study, it was observed that the likelihood of engaging in sexual
interaction with a speed-dating partner stood at 6%. This probability was positively
associated with men's short-term mating interest. Furthermore, the likelihood of establishing
a relationship was found to be 4% and was positively associated with women's long-term
mating interest. However, Asendorpf, Penke, and Back recognized that the findings in the
research are only applicable to speed dating in Germany or Western cultures, and that the
sample appears to have a bias towards a higher level of education.

Predicting a match in a relationship
Ireland, Slatcher, Eastwick and Scissors note that the similarity in the dyad’s use of
function words, also known as language style matching (LSM), predicts positive outcomes
for romantic relationships (Ireland, et al., 2011). During this study, 187 participants attended
speed-dating events at Northwestern University, with each event lasting approximately 4
minutes. To summarise, the participants were over 3 times more likely to match with their
date for every standard deviation increase in LSM. Besides that, Großmann and Krohn-
Grimberghe conducted a study to predict the relationship quality based on the personality
traits of a partner and found out that prediction models based on general personality (e.g.
intelligence, ambitiousness, empathy) traits predicted relationship quality less effectively than
models based on relationship-related personality traits (e.g. attractiveness, commitment,
sexuality) (Großmann, Hottung, & Krohn-Grimberghe, 2019).

Method
Participants
In 2002-2004, Columbia University ran a speed-dating experiment. They tracked data
from 21 speed-dating sessions, during which mostly young adults met people of the opposite
sex.

Materials
In total, 3000 heterosexual participants registered for the experiment on the Columbia
University campus. During registration, they were asked to provide their demographic
information: gender, age, income, and primary goal in participating in the experiment. During
the speed-dating sessions, all participants were instructed to rate their respective date partners
on the following characteristics: attractiveness, sincerity, intelligence, fun, ambitiousness, and
shared interest. The response scale for each rating ranged from 1 (strongly disagree) to 10
(strongly agree). Besides the aforementioned characteristics, all participants were also
instructed to rate their respective dates overall using the same scoring method (1 = strongly
disagree to 10 = strongly agree); this overall rating is represented by the variable ‘like’. The
variable ‘like’ does not necessarily bear any fixed relation to the other characteristics (it is
not, for instance, the sum of all six characteristics divided by 6); it is simply the participant’s
arbitrary overall rating of the date.

Afterwards, the participants were to indicate (in the variable ‘dec’) whether their
respective date partner was a match (1 being Yes and 0 being No), and to rate whether their
respective date would reciprocate their interest on a scale of 0 (strongly disagree) to 10
(strongly agree), represented by the variable ‘prob’. Lastly, the participants were to declare
whether they had met their respective date before the experiment (1 being Yes and 0 being
No).

Dataset Exploration and Preprocessing


To begin with, the dataset was imported into the SAS Studio environment. It contains
a range of attributes, including gender, age, income, and other personal preferences and
perceptions. During the preprocessing stage, a variety of feature engineering techniques and
imputation variations are utilized to enhance the quality of the data.

Table 1 Variables, their types and Explanation

Variables Types Explanation

Gender Nominal Gender (Female = 0, Male = 1)
Age Ratio Age (years)
Income Ratio Median annual household income (in USD) based on zip code, obtained from the Census Bureau website. A missing income means the participant is either from abroad or did not enter a zip code.
Goal Nominal Primary goal in participating in the speed-dating event: Seemed like a fun night out = 1; To meet new people = 2; To get a date = 3; Looking for a serious relationship = 4; To say I did it = 5; Other = 6
Dec Nominal Rater’s decision on whether the date was a match (Yes = 1, No = 0)
Attr Ordinal The participant’s rating of the date on the characteristic Attractiveness; score from 1 (strongly disagree) to 10 (strongly agree)
Sinc Ordinal The participant’s rating of the date on the characteristic Sincerity; score from 1 (strongly disagree) to 10 (strongly agree)
Intel Ordinal The participant’s rating of the date on the characteristic Intelligence; score from 1 (strongly disagree) to 10 (strongly agree)
Fun Ordinal The participant’s rating of the date on the characteristic Fun; score from 1 (strongly disagree) to 10 (strongly agree)
Amb Ordinal The participant’s rating of the date on the characteristic Ambitiousness; score from 1 (strongly disagree) to 10 (strongly agree)
Shar Ordinal The participant’s rating of the date on the characteristic Shared Interest; score from 1 (strongly disagree) to 10 (strongly agree)
Like Ordinal The participant’s overall rating of the date; score from 1 (strongly dislike) to 10 (strongly like)
Prob Ordinal A rating of whether the participant believed the interest would be reciprocated; score from 0 (strongly disagree) to 10 (strongly agree)
Met Nominal Whether the participant had met the date prior to the experiment (Yes = 1, No = 0)

Initially, logical imputation was utilized to address certain attributes with invalid data
points. For instance, values of the variable ‘met’ exceeding 1 were imputed to 1, and the
value ‘NA’ in ‘prob’ was set to 0. This was done to ensure that 'met' only had values of 1 or 0
(where 1 represents 'Yes' and 0 represents 'No') and that ‘prob’ ranged from 0 to 10, with
‘NA’ taken to imply zero probability. The imputation strategy relied on logical
rules and did not make strong assumptions, as the inaccurate-data mechanism was well
understood (Ziegelmeyer, 2009). Subsequently, missing data in nominal variables were
imputed using mode imputation, where the most frequent value replaced the missing values,
considering the mode as the only applicable measure of central tendency for nominal
variables (Chakrabarty, 2021). For missing data in ordinal variables, median imputation was
employed, with missing values replaced by the median value to preserve the integrity of the
ordinal scale. Finally, mean imputation was used for missing data in ratio variables, replacing
missing values with the mean of the observed values (Alam, Ayub, Arora, & Khan, 2023).

Following mean imputation, the imputed data for the ratio variables and ordinal
variables underwent further refinement through numerosity reduction, specifically rounding
off. This is because whole numbers are expected in both ratio variables ‘age’ and ‘income’ as
well as the ordinal variables. Additionally, outlier detection techniques such as linear
regression and residual analysis were utilised to identify data points in the ratio variables that
significantly deviated from the regression line.

In the process of data preprocessing, it was found that the original dataset
contained missing data in nearly all variables, except for 'gender' and 'dec', as visualized in
Figure 1. Moreover, instances of data inaccuracy were noted. For instance, in Figure 3, the
expected data points for the variable 'met' were binary (1 = Yes, 0 = No), but there were data
points outside this range. Likewise, the variable 'prob' had data points with the value 'NA',
which fell outside the expected range of 0 to 10.

Figure 1 Total observations and number of missing observations in the dataset

In these instances, logical imputation was employed to address the inaccurate data.
Specifically, for the 'met' variable, participants had misunderstood the task and reported the
number of meetings with their date partner. Regarding the 'prob' variable, 'NA' represented a
probability of 0. Therefore, it was appropriate to impute data points with values greater than 1
as 1 for the 'met' variable and to impute the 'NA' value as 0 for the 'prob' variable. In fact,
Van Buuren stresses the significance of comprehending the context of data inaccuracies prior
to employing logical imputation (Van Buuren, 2018).

When dealing with missing data, various imputation strategies are employed based on
the types of variables. For nominal variables, mode imputation is used, as these variables
represent categorical data without a specific order; imputing the mode, or the most common
value, is a suitable measure of central tendency for this type of variable. On the other hand,
median imputation is utilized for ordinal variables, which have a meaningful order but
unequal intervals between values; the median, representing the middle value, preserves the
order of the data when used as an imputation strategy. Finally, mean imputation is adopted
for ratio variables. Per Gravetter and Wallnau, the mean is especially valuable for ratio
variables as it considers all data points, offering a comprehensive central value (Gravetter &
Wallnau, 2013).

Figure 2 The datapoints in variable 'met'
Figure 3 The datapoints in variable ‘prob’

After mean imputation, numerosity reduction, i.e. rounding off, is applied to both ratio
variables ('age' and 'income') as well as to the ordinal variables. This process aims to enhance the
efficiency of machine learning algorithms and improve result interpretability. Mean
imputation may produce data with fractional parts, which are unexpected in 'age' and 'income'
variables. Additionally, data points with fractional parts in ordinal variables act as outliers,
potentially distorting data visualisation and leading to misleading correlations, as illustrated
in Figure 4.

Figure 4 Numbers with fractional parts in the variables

Lastly, in the data preprocessing stage, another feature engineering technique, outlier
detection, is applied to the ratio variables to identify data points that deviate from the
regression line. As shown in Figure 5, multiple plots are used to identify the outliers in each
variable. The breakdown of each plot is shown below in Table 2.

Figure 5 Fit Diagnostics for the variable 'age'

Table 2 Breakdown of each plot in PROC REG

Location in Figure 5: Plot; Description

Top-left: Residual vs. Predicted Value; shows residuals against predicted values. Points that are far from the horizontal line at zero are potential outliers.
Top-middle: Studentized Residuals vs. Predicted Value; shows studentized residuals against predicted values. Studentized residuals greater than ±2 or ±3 are considered outliers.
Top-right: Studentized Residuals vs. Leverage; helps in identifying influential data points. Points with high leverage and high studentized residuals (falling outside the horizontal lines) are potential influential outliers.
Middle-left: Residual Quantile Plot; helps to identify outliers by comparing the distribution of residuals to a theoretical normal distribution. Points that deviate significantly from the line are potential outliers.
Middle: Residuals vs. Predicted Value; points far from the zero line are potential outliers.
Middle-right: Cook's Distance; shows Cook's Distance for each observation. Points with Cook's D values significantly higher than the rest indicate influential data points.
Bottom-left: Residual Histogram; the histogram of residuals shows their distribution. Outliers can often be seen as isolated bars away from the main distribution.
Bottom-middle and Bottom-right: Cumulative Residual Plots; these provide additional visualizations to assess the distribution and identify outliers.

Figures 6 and 7 show the outliers identified in the variables ‘age’ and ‘income’,
respectively. There are 14 outliers in the variable ‘age’ and 28 outliers in the variable
‘income’.

Figure 6 Outliers in the variable 'age'

Figure 7 Outliers in the variable 'income'

SAS Code, SAS Results & Explanation


Procedure Importing dataset
SAS Code

SAS Output
Data

Explanation The PROC IMPORT procedure is used to read data from an external file and
import it into a SAS dataset.
The ‘out’ option specifies the name of the SAS dataset to be created from the
imported data. The dataset will be stored in the ‘Work’ library and named
‘dataset’. The ‘Work’ library is a temporary library that is deleted at the end
of the SAS session.
The ‘dbms’ option specifies the type of file being imported; csv indicates that
the file is a comma-separated values (CSV) file.
The ‘replace’ option allows the PROC IMPORT procedure to overwrite the dataset
with the new data being imported should the dataset already exist.
The ‘guessingrows’ option tells SAS to read all the rows in the CSV file to make
the best guess about the data types. This is useful when the data types might
vary throughout the file, ensuring more accurate data type determination.

Procedure Identifying missing observations


SAS Code

SAS Result

Explanation The PROC MEANS procedure is used to compute descriptive statistics for
numeric variables in a dataset.
n: computes the number of non-missing values for each variable.
nmiss: computes the number of missing values for each variable.
The PROC FREQ procedure is used to produce frequency tables for categorical
variables and can also provide frequency distributions for numeric
variables.
The variables listed are gender, age, income, goal, dec, attr, sinc, intel, fun,
amb, shar, like, prob, met.
The ‘/ missing’ option includes missing values in the frequency tables.
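A sketch of the two procedures described above, assuming the imported dataset is named work.dataset:

```sas
/* Count non-missing (n) and missing (nmiss) values per numeric variable */
proc means data=work.dataset n nmiss;
run;

/* Frequency tables for all variables, counting missing values as a level */
proc freq data=work.dataset;
    tables gender age income goal dec attr sinc intel fun amb shar
           like prob met / missing;
run;
```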

Procedure Logical Imputation & ensuring data validity (change met & prob variables
from character to numerical types)
SAS Code

SAS Results

Explanation The code transforms the met variable, combining certain values (2, 5, 7)
into 1.
It handles missing values in the prob variable by changing 'NA' to '0'.
It converts the met and prob variables from character to numeric types.
The original character variables met and prob are dropped, and the new
numeric variables are renamed to the original variable names.
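A sketch of the data step described in the explanation; the output dataset name is an assumption:

```sas
/* Logical imputation and character-to-numeric conversion for met and prob */
data work.dataset2;
    set work.dataset;
    if met in ('2', '5', '7') then met = '1';  /* counts of meetings recoded to 'Yes' */
    if prob = 'NA' then prob = '0';            /* 'NA' taken to mean zero probability */
    met_num  = input(met, best.);              /* convert character to numeric */
    prob_num = input(prob, best.);
    drop met prob;                             /* drop the character originals */
    rename met_num=met prob_num=prob;          /* restore the original names */
run;
```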

Procedure Mean Imputation for ratio data


SAS Code

SAS Results

Explanation The PROC STDIZE procedure will calculate the mean for the age and
income variables in work.dataset2.
It will then replace any missing values in these variables with their
respective means.
Because of the reponly option, it will not perform full standardization
(e.g., scaling to a mean of 0 and standard deviation of 1); it will only
replace missing values.
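A sketch of the PROC STDIZE call; the dataset names are assumptions consistent with the surrounding steps:

```sas
/* Mean imputation for the ratio variables; reponly replaces missing
   values only, without standardizing (scaling) the data */
proc stdize data=work.dataset2 out=work.dataset3 method=mean reponly;
    var age income;
run;
```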

19
Procedure Logical imputation (sinc, intel, fun, amb, shar, like), mode imputation (all
nominal variables), median imputation (all ordinal variables)
SAS Code

SAS Results

Explanation The code reads data from dataset3 and creates a new dataset dataset4.
For the specified variables (goal, met, attr, sinc, intel, fun, amb, shar, like,
prob), the code performs imputation (logical, mode, or median):
If a variable has the value 'NA', it replaces it with a specified default value.
For some variables, if the value is '0', it replaces it with '1'.
Procedure Numerosity reduction on all variables to replace fractional numbers with
whole numbers
SAS Code

SAS Results

Explanation The code reads data from dataset4 and creates a new dataset dataset5.
It applies the round function to several numeric variables (age, income,
goal, attr, sinc, intel, fun, amb, shar, like, prob), rounding their values to the
nearest integer.
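The rounding step can be sketched with an array over the affected variables (assuming they are numeric by this stage):

```sas
/* Round every imputed value to the nearest whole number */
data work.dataset5;
    set work.dataset4;
    array nums{*} age income goal attr sinc intel fun amb shar like prob;
    do i = 1 to dim(nums);
        nums{i} = round(nums{i}, 1);
    end;
    drop i;
run;
```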

Procedure Outlier detection using linear regression and residual for variable ‘age’
SAS Code

SAS
Results

Explanation The code performs a linear regression analysis where age is the dependent
variable and income, goal, and dec are the independent variables.
It outputs diagnostic measures and predictions to a new dataset named
reg_age.
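A sketch of the regression used for outlier detection; the predicted and residual variable names are assumptions:

```sas
/* Regress age on income, goal and dec; the fit-diagnostics panel and
   the output residuals support outlier detection */
proc reg data=work.dataset5 plots=diagnostics;
    model age = income goal dec;
    output out=reg_age predicted=pred_age residual=resid_age;
run;
quit;
```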

Procedure Isolating all outliers for variable ‘age’ in a separate dataset


SAS Code

SAS Results

Explanation The code creates a dataset named outlier_age that contains only the
observations from reg_age where the residual value is greater than 12.
These observations are considered outliers because they indicate a
significant difference between the observed and predicted values.

Procedure Outlier detection using linear regression and residual for variable ‘income’
SAS Code

SAS
Results

Explanation The code performs a linear regression analysis where income is the
dependent variable and age, goal, and dec are the independent variables.
It outputs diagnostic measures and predictions to a new dataset named
reg_income.
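A sketch of the corresponding regression for 'income'; the predicted and residual variable names are assumptions:

```sas
/* Regress income on age, goal and dec for residual-based outlier detection */
proc reg data=work.dataset5 plots=diagnostics;
    model income = age goal dec;
    output out=reg_income predicted=pred_income residual=resid_income;
run;
quit;
```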

Procedure Isolating all outliers for variable ‘income’ in a separate dataset


SAS Code

SAS Results

Explanation The code creates a dataset named outlier_income that contains only the
observations from reg_income where the residual value is greater than
40000. These observations are considered outliers because they indicate a
significant difference between the observed and predicted values.
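A sketch of the filtering step; the residual variable name is an assumption:

```sas
/* Keep only observations whose residual exceeds the 40000 threshold */
data work.outlier_income;
    set work.reg_income;
    if resid_income > 40000;
run;
```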

Exploratory Data Analysis (EDA)


Following data preprocessing, an exploratory data analysis (EDA) was conducted to
understand the dataset's underlying patterns and relationships.

The mode is used to summarise the distribution of nominal variables such as gender,
goal, dec, and met, which are visualised using vertical bar charts (Agresti, 2019). While the
mode can also represent central tendency in ordinal variables (attr, sinc, intel, fun, amb, shar,
like, prob), the median, ranges, and interquartile ranges are much more appropriate
descriptive statistics, since these data can be ranked in positions (Manikandan, 2011). A
box-and-whisker plot (or boxplot), along with a table displaying quartiles and percentiles, is
employed to visually represent the distribution of ordinal variables (Bensken, Pieracci, & Ho,
2021).

The statistics used to describe the data in ratio variables such as 'age' and 'income'
include the mean, variance, standard deviation, and coefficient of variation. Additionally,
Z-score histograms for both 'age' and 'income' are generated to visually represent these
variables' distributions.

The next step involves creating a clustered bar chart to analyse the
relationships between selected pairs of categorical variables, including
both nominal and ordinal variables. Additionally, a schematic boxplot will
be used to examine the relationships between a selected categorical
variable and ratio data. Finally, a scatter plot will be employed to explore
the relationships between the ratio variables.

Each of the visualisation techniques is chosen based on the types of data involved, as
follows:

Clustered bar charts = categorical variable & categorical variable

Boxplots = categorical variable & ratio variable

Scatter plot = ratio variable & ratio variable

Clustered bar charts are used to display the frequency of different


categories within a variable, with bars grouped by another categorical
variable. This way, they allow for easy comparison of the distribution of
one categorical variable across the levels of another categorical variable
and effectively highlight relationships or differences between categories.
Also, clustered bar charts provide a clear and straightforward way to
visualize the interaction between two categorical variables without any
overlap, which can sometimes occur in stacked bar charts.

Schematic boxplots illustrate the distribution of a ratio variable


across various categories of a categorical variable. They provide a concise
summary of the distribution of the ratio variable, showcasing the median,
quartiles, and potential outliers. This visual tool also facilitates
straightforward comparisons of distributions across different categories.
Besides that, boxplots are particularly useful for identifying outliers and
gaining insight into the spread and skewness of the data within each
category. They offer a compact and insightful visual summary, enabling
users to easily discern differences and similarities between groups.

A scatter plot visually represents the relationship between two


continuous variables by plotting data points on a two-dimensional graph.
It assists in identifying trends, clusters, and potential correlations. Each
data point is depicted individually, allowing for a thorough examination of
data distribution. Additionally, scatter plots can be improved with
regression lines to offer a clearer understanding of the strength and
direction of the relationship.

SAS Code, SAS Results & Explanation
Procedure Verifying data completeness after data preprocessing (no missing data)
and Descriptive Statistics for nominal variables (gender, goal, dec, met)
- Central tendency: mode
- Total observation counts, Frequency, Percentage
SAS Code

SAS Results

Explanation PROC MEANS: Provides the mode and the count of total observations
for the variables gender, goal, dec, and met. This is also done to ensure
data completeness.
PROC FREQ: Provides detailed frequency distributions for the variables
gender, goal, dec, and met, including the count and percentage of each
unique value.
These procedures give a comprehensive overview of the distribution and
central tendency of the specified categorical variables in the dataset.
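The statistics listed for the nominal variables (mode, total count, frequency, percentage) can be sketched outside SAS as well. A minimal Python illustration with hypothetical gender codes:

```python
# Mode, total count, frequencies, and percentages for a nominal
# variable, mirroring the statistics the procedures above report.
from collections import Counter

gender = [0, 1, 1, 0, 1, 1, 0, 1]  # hypothetical codes: 0 = female, 1 = male

counts = Counter(gender)
mode = counts.most_common(1)[0][0]  # most frequent category
total = sum(counts.values())        # total observation count
percentages = {k: 100 * v / total for k, v in counts.items()}

print(mode, total, percentages)
```

A frequency table is just these counts and percentages laid out per category, which is what PROC FREQ prints.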

Procedure Verifying data completeness after data preprocessing and Descriptive
Statistics for ordinal variables (attr, sinc, intel, fun, amb, shar, like, prob)
- Central tendency: median, mode
- Variability: Range, Interquartile Range
- Position: Quartile, Percentile
SAS Code

SAS Results

Explanation PROC FREQ provides a detailed distribution of the ordinal variables
(attr, sinc, intel, fun, amb, shar, like, prob), showing how often each
value occurs and its relative proportion in the dataset. This is useful for
understanding the frequency and distribution of categorical data. This
procedure is also done to ensure data completeness.

PROC UNIVARIATE offers a comprehensive statistical summary and visual
representation of the ordinal variables. The descriptive statistics help in
understanding the central tendency (median) and variability (quartiles and
percentiles).

Procedure Verifying data completeness after data preprocessing (no missing data)
and Descriptive Statistics for ratio variables (age, income)
- Central tendency: mean, median, mode
- Variability: Range, Interquartile Range, Variance, Standard
Deviation, Coefficient of Variation
- Position: Distribution Plot and Z Score Graph
SAS Code

SAS Results

Explanation PROC UNIVARIATE offers a comprehensive statistical summary and
visual representation of the ratio variables. The descriptive statistics help
understand the central tendency (mean), variability (coefficient of
variation, standard deviation, variance, quartiles, percentiles), and
distribution shape, while the plots (Distribution plot & Z-Score plot)
provide a visual way to assess the distribution, detect outliers, and check
for normality.
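The variability and position measures named above are all short formulas. A Python sketch on hypothetical ages (not the assignment's data) shows how each is derived:

```python
# Variability and position measures for a ratio variable: range,
# variance, standard deviation, coefficient of variation, z-scores.
import statistics

age = [21, 24, 26, 27, 29, 30, 33, 38]  # hypothetical ages

mean = statistics.mean(age)
sd = statistics.stdev(age)          # sample standard deviation
variance = statistics.variance(age)
value_range = max(age) - min(age)
cv = 100 * sd / mean                # coefficient of variation, in percent
z_scores = [(x - mean) / sd for x in age]  # position of each observation

print(mean, value_range, round(cv, 1))
```

The z-scores are the values plotted in a Z-Score graph; an observation with |z| well above the rest stands out as a potential outlier.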

Procedure Vertical bar chart for categorical variables visualisation (gender, goal,
dec, met, attr, sinc, intel, fun, amb, shar, like, prob)

SAS Code

SAS Results

Explanation The %vbar macro is defined to create a vertical bar chart for a specified
variable in a given dataset using PROC SGPLOT.
The macro is called for both nominal and ordinal variables in dataset5,
generating vertical bar charts for each of these variables.
The vertical bar charts visualize the frequency distribution of all
categorical variables, providing a graphical representation of the
categorical data in dataset5.

Procedure Two-way clustered bar chart to examine relationships between two
categorical variables.
SAS Code

More in Appendix A

SAS Results

More in Appendix B
Explanation Each PROC FREQ step generates a cross-tabulation table and a clustered
frequency plot for a pair of categorical variables. The pairs are gender-goal,
dec-attr, gender-dec, dec-goal, gender-attr, dec-like, dec-sinc, dec-intel,
dec-fun, dec-amb, dec-shar, dec-met, dec-prob, prob-like, and met-like.
The code applies specific formats and labels to improve the readability and
interpretability of the output.
The clustered frequency plots provide a visual representation of the
relationships between the pairs of categorical variables.

Procedure Regression line scatter plot to examine the relationship between two
ratio variables (age and income)
SAS Code

SAS Results

Explanation The code generates a scatter plot with a regression line for age and
income using PROC SGSCATTER. This visual representation helps in
assessing the linear relationship between these two variables, providing
insights into how income tends to change with age in the dataset dataset5.
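The regression line such a plot overlays is an ordinary least-squares fit; its slope and intercept follow from two sums. A Python sketch with made-up age/income pairs (not the assignment's data):

```python
# Ordinary least-squares fit of income on age: the slope and intercept
# of the regression line a scatter plot overlays. Data are made up.
age = [22, 25, 28, 31, 34, 37]
income = [30, 38, 45, 52, 60, 68]  # in thousands

n = len(age)
mean_x = sum(age) / n
mean_y = sum(income) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, income))
sxx = sum((x - mean_x) ** 2 for x in age)
slope = sxy / sxx                    # change in income per year of age
intercept = mean_y - slope * mean_x

print(round(slope, 3), round(intercept, 3))
```

A positive slope corresponds to the upward-sloping line the plot shows when income tends to rise with age.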

Procedure Schematic boxplot to examine the relationship between a categorical
variable and a ratio variable
SAS Code

SAS Results

Explanation PROC SORT sorts the dataset dataset5 by the dec variable, while
PROC BOXPLOT creates a schematic box plot to show the distribution
of income across different levels of decision, with applied formats and
labels for better understanding. This code effectively prepares the data
and generates a detailed visualization of how income varies based on the
decision (dec), aiding in the interpretation of the relationship between
these two variables in dataset5.

Feature Engineering
Following exploratory data analysis (EDA), the dataset undergoes feature engineering
techniques to improve its predictive capability, eliminate noisy features, and further refine the
data. In addition to the feature engineering techniques previously applied during data
preprocessing, including imputation, numerosity reduction, and outlier detection, five more
techniques are integrated into the dataset: numerical binning, log transformation, one-hot
encoding, scaling, and feature creation.

In the process of numerical binning, the variable 'income' is segmented into discrete
intervals, categorizing continuous values into specific bins such as 'very low', 'low', 'medium',
'high', and 'very high'. This technique serves to improve interpretability.
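The binning logic can be sketched outside SAS: compute the four interior quintile boundaries, then map each value to the label of the first bin whose boundary it does not exceed. A Python illustration on hypothetical incomes:

```python
# Quintile binning of a continuous income variable into five labelled
# groups, the same idea as ranking into five groups. Data hypothetical.
import statistics

incomes = [18, 22, 25, 30, 34, 40, 45, 55, 70, 95]
labels = ["very low", "low", "medium", "high", "very high"]

cuts = statistics.quantiles(incomes, n=5)  # 4 interior quintile boundaries

def bin_income(x):
    for i, cut in enumerate(cuts):
        if x <= cut:
            return labels[i]
    return labels[-1]

binned = [bin_income(x) for x in incomes]
print(binned)
```

Each bin ends up with roughly a fifth of the observations, which is what makes quintile bins more robust to skew than equal-width bins.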

The next step involves applying a log transformation to the 'age' variable to stabilize
its variance and normalize its distribution, making it more suitable for modelling.
Additionally, the categorical variable 'goal' is converted using one-hot encoding to create
multiple binary columns for each category. Then, the 'age' and 'income' variables are
standardized to have a mean of 0 and a standard deviation of 1 to ensure consistency across
the dataset. Finally, feature creation is applied to create a composite variable ‘Total’ by using
the following formula:

total = (attr + sinc + intel + fun + amb + shar) / 6

It is interesting to note that while the composite variable 'total' does not necessarily
share the same value as the variable 'like' (the overall score participants give their dating
partner), the two variables are highly correlated.

SAS Code, SAS Results & Explanation


Procedure Numerical binning on the 'income' variable (converting 'income' from a
ratio variable to a categorical variable)

SAS Code

SAS Results

Explanation The PROC RANK procedure divides the income variable into quintiles,
facilitating categorization based on relative income levels. Next, the data
step assigns meaningful labels to each quintile, effectively binning the
income variable into a categorical variable (income labels).

Procedure Logarithm transformation of the 'age' variable.

SAS Code

SAS Results

Explanation The data step creates a new variable log_age which transforms age using
the natural logarithm. This is often done to normalize the distribution,
reduce skewness, or meet the assumptions of certain statistical analyses.
PROC UNIVARIATE provides a histogram to visualize the distribution
of the log_age variable, aiding in understanding the transformed data's
characteristics and assessing whether the transformation achieved the
desired effect.
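The effect of the log transform is easy to demonstrate in a few lines. A Python sketch on hypothetical ages (the natural log, matching the SAS data step described above):

```python
# Natural-log transform of age, as in the data step creating log_age.
# The transform compresses the right tail of a skewed distribution.
import math

age = [18, 20, 22, 25, 28, 35, 45, 60]  # hypothetical ages
log_age = [math.log(x) for x in age]

# Large values are pulled in relative to small ones, reducing skewness.
raw_ratio = max(age) / min(age)
log_ratio = max(log_age) / min(log_age)
print(round(raw_ratio, 2), round(log_ratio, 2))
```

The largest age is over three times the smallest on the raw scale but far less stretched on the log scale, which is the variance-stabilizing effect the transform is used for.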

Procedure Scaling (standardization / z-score normalisation) on variables 'income'
and 'age'
SAS Code

SAS Results

Explanation PROC STANDARD transforms the income and age variables to have a
mean of 0 and a standard deviation of 1. This is useful for making the
variables comparable on the same scale, particularly for statistical
analysis or machine learning algorithms that are sensitive to the scale of
input data.
PROC UNIVARIATE provides detailed descriptive statistics and
visualizations for the standardized variables. Histograms help assess the
distribution of the standardized income and age variables, checking for
normality or other patterns.
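Standardization itself is a one-line transform: subtract the mean and divide by the standard deviation. A Python sketch with hypothetical incomes:

```python
# Z-score standardization, the transform applied here with mean=0 and
# std=1: each value becomes (x - mean) / sd.
import statistics

income = [30, 40, 50, 60, 70]  # hypothetical incomes (thousands)

mean = statistics.mean(income)
sd = statistics.stdev(income)
standardized = [(x - mean) / sd for x in income]

# After standardization the sample mean is 0 and the sample sd is 1.
print(statistics.mean(standardized), statistics.stdev(standardized))
```

Because both 'income' and 'age' end up on the same unitless scale, neither dominates a scale-sensitive model simply by being measured in larger numbers.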

Procedure One-hot encoding for variable ‘goal’
SAS Code

SAS Results

Explanation dataset9 contains all variables from dataset8 and further adds six new
binary variables (goal_1, goal_2, goal_3, goal_4, goal_5, goal_6) that
represent the one-hot encoded values of the goal variable.
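One-hot encoding turns each level of 'goal' into its own 0/1 indicator. A Python sketch with hypothetical goal codes (the six levels mirror goal_1 through goal_6):

```python
# One-hot encoding of the six-level 'goal' variable into binary
# indicator columns goal_1..goal_6. The values below are hypothetical.
goal = [1, 3, 2, 6, 3, 1]
levels = [1, 2, 3, 4, 5, 6]

encoded = [{f"goal_{lvl}": int(g == lvl) for lvl in levels} for g in goal]

print(encoded[1])  # the row with goal=3 has goal_3=1 and zeros elsewhere
```

Exactly one indicator is 1 in each encoded row, so no artificial ordering is imposed on the categories.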

Procedure Feature Creation


SAS Code

SAS Results

Explanation The DATA step creates a new feature called 'total'. The 'total'
variable provides a composite score that represents the average of six
specific attributes (attr, sinc, intel, fun, amb, shar). This composite score
can be useful for summarizing these attributes into a single metric for
further analysis.
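The composite is simply the mean of the six ratings. A Python sketch with one hypothetical ratings row:

```python
# Composite 'total' feature: the mean of the six attribute ratings,
# total = (attr + sinc + intel + fun + amb + shar) / 6.
# The ratings row below is hypothetical.
row = {"attr": 7, "sinc": 8, "intel": 9, "fun": 6, "amb": 5, "shar": 7}

attributes = ("attr", "sinc", "intel", "fun", "amb", "shar")
total = sum(row[k] for k in attributes) / len(attributes)
print(total)
```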

Hypotheses Testing
Through the utilization of the cleaned and transformed dataset, hypotheses are
constructed to investigate the connections between attributes. For instance, during EDA, a
positive correlation between the variables 'dec' and 'attr' becomes evident when visualized in
a two-way cluster bar chart. The observation aligns with the unspoken rule #1 in modern
dating, which is ‘be attractive’ (Mitchell & Wells, 2018). However, further testing is
necessary to validate the observation.

Performing a Pearson correlation analysis across all variables is an effective method
for identifying potential linear relationships within a dataset. This analysis generates
correlation coefficients, which fall within the range of -1 to 1, as well as P-values. A
correlation coefficient close to 1 signifies a strong positive linear relationship, while a value
near -1 indicates a strong negative linear relationship (Schober, Boer, & Schwarte, 2018). A
coefficient around 0 suggests no linear relationship. Additionally, smaller P-values (usually
<0.05) indicate the statistical significance of the correlation coefficients.
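The coefficient itself comes straight from its definition, r = cov(x, y) / (sd_x * sd_y). A small Python illustration on hypothetical paired ratings:

```python
# Pearson's r from its definition: covariance over the product of
# standard deviations. The paired scores below are hypothetical.
import math

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 6, 7]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
r = cov / (sd_x * sd_y)  # falls in [-1, 1]; near 1 = strong positive

print(round(r, 4))
```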

Figure 8 Pearson Correlation Coefficient Analysis on all 21 variables

Figure 8 shows the result of the Pearson Correlation Coefficient Analysis conducted on all 21
variables. Five major findings are as follows:

Table 3 List of Hypotheses

Hypothesis #1 (sinc-intel): The higher the sincerity scores, the higher the intelligence scores.
Hypothesis #2 (attr-like): The higher the attractiveness scores, the higher the like scores.
Hypothesis #3 (shar-fun): Dating partners who share more interests with the participants are perceived as more fun.
Hypothesis #4 (fun-like): Individuals perceived as more fun are likely to receive higher like scores.
Hypothesis #5 (shar-like): Dating partners who share more interests with the participant tend to receive higher like scores.

It is worth noting that the correlations between the variable 'total' and attr, sinc, intel,
fun, amb, and shar are significant. Since total is an aggregate measure of those variables, it
naturally exhibits strong correlations with its component variables and with any other variables
correlated with those components. This does not invalidate the correlation but rather explains
its origin and why it is strong. For this assignment, variables other than 'total' will be
prioritised.

The Chi-Square test is used to test the hypotheses. The Chi-Square test is designed to test
for independence between categorical variables (which all of the involved variables are), making
it a robust choice for assessing associations in contingency tables without making assumptions
about the underlying data distribution.
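The statistic sums, over every cell of the contingency table, the squared gap between observed and expected counts relative to the expected count. A Python sketch with an invented 2x2 table (the assignment's own tables are 10x10, hence DF = (10-1)*(10-1) = 81):

```python
# Chi-square test of independence on a contingency table: sum over
# cells of (observed - expected)^2 / expected, where expected =
# row_total * col_total / grand_total. The 2x2 counts are invented.
import math

observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
cramers_v = math.sqrt(chi_square / (grand * min(len(observed) - 1,
                                                len(observed[0]) - 1)))
print(round(chi_square, 4), df, round(cramers_v, 4))
```

Cramér's V rescales the statistic to a 0-1 range, which is why it is the effect-size measure reported alongside each test below.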

To further validate the hypothesised correlations between the pairs of variables,
Spearman's Rank Correlation is also conducted on all pairs. This non-parametric test measures
the strength and direction of the monotonic relationship between ordinal variables.
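Spearman's coefficient is the Pearson correlation of the ranks; with no ties it reduces to the shortcut 1 - 6*sum(d^2)/(n*(n^2-1)), where d is the rank difference per pair. A Python sketch on hypothetical scores:

```python
# Spearman's rank correlation: rank each variable, then correlate the
# ranks. With no ties, the shortcut 1 - 6*sum(d^2)/(n*(n^2-1)) applies.
# The scores below are hypothetical.
x = [3, 1, 4, 2, 5]
y = [2, 1, 5, 3, 4]

def ranks(values):
    # rank 1 = smallest value; this sketch assumes no ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(x), ranks(y)
n = len(x)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d_squared / (n * (n * n - 1))

print(rho)
```

Because only ranks enter the formula, the test captures any monotonic relationship, not just a linear one.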

Hypothesis 1 (sinc-intel)
Chi-Square Test of Independence: Value 3945.591, DF 81, P-value <0.0001 - significant relationship, thus rejecting the null hypothesis of independence.
Likelihood Ratio Chi-Square: Value 2140.587 - significant association between the two variables.
Mantel-Haenszel Chi-Square: Value 1270.293, DF 1 - strong relationship between the two variables.
Phi Coefficient: 1.1469 - a value greater than 1 (possible for tables larger than 2x2) suggests a strong association.
Contingency Coefficient: 0.7537 - a value close to 1 indicates a strong relationship between the two variables.
Cramer's V: 0.3823 - moderately strong relationship between the two variables.

Spearman's Correlation Coefficient: 0.63373 - there is a strong positive monotonic relationship between sinc (sincerity) and intel (intelligence). As sinc increases, intel also tends to increase, and vice versa.

Hypothesis 2 (attr-like)
Chi-Square Test of Independence: Value 3488.121, DF 81, P-value <0.0001 - significant relationship, thus rejecting the null hypothesis of independence.
Likelihood Ratio Chi-Square: Value 2160.0884 - significant association between the two variables.
Mantel-Haenszel Chi-Square: Value 1338.6646, DF 1 - strong relationship between the two variables.
Phi Coefficient: 1.0783 - a value greater than 1 (possible for tables larger than 2x2) suggests a strong association.
Contingency Coefficient: 0.7332 - a value close to 1 indicates a strong relationship between the two variables.
Cramer's V: 0.3594 - moderately strong relationship between the two variables.

Spearman's Correlation Coefficient: 0.65854 - there is a strong positive monotonic relationship between attr (attractiveness) and like. Higher attractiveness scores are associated with higher like scores.

Hypothesis 3 (shar-fun)
Chi-Square Test of Independence: Value 2440.7562, DF 81, P-value <0.0001 - significant relationship, thus rejecting the null hypothesis of independence.
Likelihood Ratio Chi-Square: Value 1547.146 - significant association between the two variables.
Mantel-Haenszel Chi-Square: Value 1032.3330, DF 1 - strong relationship between the two variables.
Phi Coefficient: 0.9020 - a value close to 1 suggests a strong association.
Contingency Coefficient: 0.6698 - a value close to 1 indicates a strong relationship between the two variables.
Cramer's V: 0.3007 - moderately strong relationship between the two variables.

Spearman's Correlation Coefficient: 0.55196 - there is a strong positive monotonic relationship between shar (shared interests) and fun. Higher shared-interest scores are associated with higher fun scores.

Hypothesis 4 (fun-like)
Chi-Square Test of Independence: Value 3335.3030, DF 81, P-value <0.0001 - significant relationship, thus rejecting the null hypothesis of independence.
Likelihood Ratio Chi-Square: Value 2213.8685 - significant association between the two variables.
Mantel-Haenszel Chi-Square: Value 1434.8524, DF 1 - strong relationship between the two variables.
Phi Coefficient: 1.0544 - a value greater than 1 (possible for tables larger than 2x2) suggests a strong association.
Contingency Coefficient: 0.7256 - a value close to 1 indicates a strong relationship between the two variables.
Cramer's V: 0.3515 - moderately strong relationship between the two variables.

Spearman's Correlation Coefficient: 0.67185 - there is a strong positive monotonic relationship between fun and like. Higher fun scores are associated with higher like scores.

Hypothesis 5 (shar-like)
Chi-Square Test of Independence: Value 2961.6365, DF 81, P-value <0.0001 - significant relationship, thus rejecting the null hypothesis of independence.
Likelihood Ratio Chi-Square: Value 1793.5797 - significant association between the two variables.
Mantel-Haenszel Chi-Square: Value 1181.3347, DF 1 - strong relationship between the two variables.
Phi Coefficient: 0.9936 - a value close to 1 suggests a strong association.
Contingency Coefficient: 0.7048 - a value close to 1 indicates a strong relationship between the two variables.
Cramer's V: 0.3312 - moderately strong relationship between the two variables.

Spearman's Correlation Coefficient: 0.60470 - there is a strong positive monotonic relationship between shar (shared interests) and like. Higher shared-interest scores are associated with higher like scores.

SAS Code, SAS Results & Explanation


Procedure Pearson Correlation
SAS Code

SAS Results

Explanation PROC CORR computes Pearson correlation coefficients, which measure
the linear relationship between pairs of variables; in this case, all 21
variables against all 21 variables.

Procedure Chi-Square Test of Independence

SAS Code

SAS Results

Explanation The macro %chisq streamlines the process of performing Chi-Square
Tests of independence on multiple pairs of variables. It allows for easy
and efficient testing of relationships between categorical variables within
a dataset.

Procedure Spearman Correlation Test

SAS Code

SAS Results

Explanation The macro %spearman streamlines the process of performing Spearman
Correlation Tests on multiple pairs of variables. It allows for easy and
efficient testing of monotonic relationships between variables within a
dataset.

Insights on Hypotheses
The results of both the Chi-Square tests and Spearman Correlation tests yield strong
statistical evidence supporting the significance of the hypothesised relationships. The
Spearman coefficients indicate a range of correlation strengths from strong to very strong,
suggesting a close and monotonic relationship between the variables in each hypothesis.
Consequently, the application of these statistical tests has confirmed the hypotheses,
validating that higher values in one variable (e.g., sincerity, attractiveness, shared interests,
fun) are indeed associated with higher values in another related variable (e.g., intelligence,
like scores). The use of these tests has furnished thorough and dependable evidence
bolstering the hypothesized relationships in the dataset.

Discussion
The exploratory data analysis (EDA) carried out in this study unveiled valuable
insights into the connections between different personal attributes and their influence on
likeability and other social outcomes. By employing a range of statistical measures and
visualisation techniques, the author was able to identify patterns and validate the hypotheses
with strong statistical evidence.

1. Correlation and Association:

In the analysis, the author found robust positive correlations among the key variables.
Specifically, sincerity (sinc) and intelligence (intel) were strongly correlated with a Spearman
correlation coefficient of 0.63373, indicating a significant monotonic relationship. Similarly,
attractiveness (attr) exhibited a strong positive correlation with like scores (like), yielding a
Spearman coefficient of 0.65854. These associations were further substantiated through Chi-
Square tests, affirming the substantial relationships between these variables.

2. Importance of Fun and Shared Interests:

It is crucial to note that fun (fun) and shared interests (shar) are important variables in
understanding social dynamics. The data shows a strong correlation between fun and like
scores, as evidenced by the significant Spearman coefficient of 0.67185. This suggests that
individuals perceived as more fun are more likely to be liked. Additionally, shared interests
also play a significant role, exhibiting a strong positive correlation with like scores
(Spearman coefficient = 0.60470). These findings are consistent with existing literature
highlighting the importance of mutual interests and enjoyment in the formation and
maintenance of social relationships.

3. Visualizations and Interpretations:

The utilization of clustered bar charts for categorical variables, schematic boxplots for
both categorical and ratio variables, and scatter plots for ratio variables offered clear and
intuitive visual representations of the data. Clustered bar charts effectively showcased the
distribution and interactions between categorical variables, while boxplots succinctly
summarized the distribution of ratio variables across different categories. Scatter plots were
crucial in visualizing relationships between continuous variables, unveiling trends and
potential correlations.

4. Feature Engineering:

The process of feature engineering involved binning income into numerical
categories, transforming age using logarithms, and encoding categorical variables using
one-hot encoding. These techniques greatly improved the suitability of the data for analysis by
normalising distributions, reducing skewness, and ensuring that categorical data could be
effectively used in statistical models.

Conclusion
The research conducted an in-depth exploratory data analysis to examine the connections
between personal attributes and social outcomes in a speed-dating context. Robust statistical
methods, such as Spearman correlation and Chi-Square tests, supported the hypothesized
associations. Key findings showed strong positive correlations between sincerity and
intelligence, attractiveness and like scores, and fun and like scores. Shared interests were also
found to enhance likeability, emphasizing the importance of mutual enjoyment and common
interests in forming social connections. The use of visualization techniques like clustered bar
charts, schematic boxplots, and scatter plots helped communicate data insights effectively.
Feature engineering processes, including numerical binning, log transformation, and one-hot
encoding, improved the data's suitability for analysis. Overall, the study highlights the
complex interplay of personal attributes in social interactions and their significant impact on
social outcomes, providing implications for understanding social dynamics and laying the
groundwork for future research.

Future Research
The next step in research could involve delving into advanced modelling techniques,
such as machine learning algorithms, to forecast social outcomes based on the identified key
attributes. Moreover, conducting longitudinal studies could offer deeper insights into the
evolution of these relationships over time, contributing to a more nuanced understanding of
social interactions and their long-term effects. In summary, this study establishes a strong
basis for comprehending the intricate interplay of personal attributes in social dynamics, with
significant implications for fields ranging from psychology to social network analysis.

Works Cited
Alexandropoulos, S.-A. N., Kotsiantis, S. B., & Vrahatis, M. N. (2019). Data preprocessing in
predictive data mining. The Knowledge Engineering Review.

Asendorpf, J. B., Penke, L., & Back., M. D. (2011). From dating to mating and relating:
Predictors of initial and long–term outcomes of speed–dating in a community sample.
European Journal of Personality.

Finkel, E. J., Eastwick, P. W., & Matthews, J. (2007). Speed-dating as an invaluable tool for
studying romantic attraction: A methodological primer. Personal Relationships, 149-
166.

Finkel, E. J., & Eastwick, P. W. (2008). Speed-dating. Current Directions in Psychological
Science, 193-197.

Ireland, M. E., Slatcher, R. B., Eastwick, P. W., Scissors, L. E., Finkel, E. J., & Pennebaker, J.
W. (2011). Language style matching predicts relationship initiation and stability.
Psychological science, 39-44.

Großmann, I., Hottung, A., & Krohn-Grimberghe, A. (2019). Machine learning meets partner
matching: Predicting the future relationship quality based on personality traits. PLoS
One.

Ziegelmeyer, M. (2009). Documentation of the logical imputation using the panel structure of
the 2003-2008 German SAVE Survey. SONDERFORSCHUNGSBEREICH.

Chakrabarty, D. (2021). Model describing central tendency of data. International Journal of
Advanced Research in Science, Engineering and Technology.

Alam, S., Ayub, M. S., Arora, S., & Khan, M. A. (2023). An investigation of the imputation
techniques for missing values in ordinal data enhancing clustering and classification
analysis validity. Decision Analytics Journal.

Agresti, A. (2019). An introduction to categorical data analysis. John Wiley & Sons.

Manikandan, S. (2011). Measures of central tendency: The mean. Journal of Pharmacology
& Pharmacotherapeutics.

Bensken, W. P., Pieracci, F. M., & Ho, V. P. (2021). Basic introduction to statistics in
medicine, part 1: Describing data. Surgical Infections, 590-596.

Van Buuren, S. (2018). Flexible imputation of missing data. CRC press.

Gravetter, F., & Wallnau, L. (2013). Statistics for the behavioral sciences. Belmont: Cengage
Learning.

Mitchell, M., & Wells, M. (2018). Race, romantic attraction, and dating. Ethical Theory and
Moral Practice.

Schober, P., Boer, C., & Schwarte, L. A. (2018). Correlation coefficients: appropriate use and
interpretation. Anesthesia & Analgesia, 126.

Appendix A
SAS Code for Two Way Cluster Bar Chart for selected categorical variable pairs

Appendix B

