Assignment 2 - Data Management
CT051-3-M
INDIVIDUAL ASSIGNMENT
Abstract
The author conducted a study on the connections between personal traits and social
results in speed-dating. The author found significant correlations between sincerity and
intelligence, attractiveness and likability, fun and likability, and shared interests and likability.
The study used statistical methods like Spearman correlation and Chi-Square tests.
Visualizations such as clustered bar charts, boxplots, and scatter plots were employed to
enhance data interpretation. The research highlights the impact of personal traits on social
dynamics and provides insights for future studies in psychology and social network analysis.
Table of Contents
Abstract
Introduction
Related Works
  Speed Dating
Method
  Participants
  Materials
  Feature Engineering
    SAS Code, SAS Results & Explanation
  Hypotheses Testing
    Hypothesis 1 (sinc-intel)
    Hypothesis 2 (Attr-like)
    Hypothesis 3 (Shar-fun)
    Hypothesis 4 (Fun-like)
    Hypothesis 5 (Shar-like)
    SAS Code, SAS Results & Explanation
    Insights on Hypotheses
Discussion
Conclusion
Future Research
Works Cited
Appendix A
Appendix B
Table of Figures
Figure 1 Total observation and number of missing observation in the dataset
Figure 2 The datapoints in variable ‘prob’
Figure 3 The datapoints in variable 'met'
Figure 4 Numbers with fractional parts in the variables
Figure 5 Fit Diagnostics for the variable 'age'
Figure 6 Outliers in the variable 'age'
Figure 7 Outliers in the variable 'income'
Figure 5 Pearson Correlation Coefficient Analysis on all 21 variables
Table of Tables
Table 1 Variables, their types and Explanation
Table 2 Breakdown of each plot in PROC REG
Table 3 List of Hypotheses
Introduction
Data management is crucial for the development and performance of predictive
models in machine learning. The effectiveness of these models depends on the quality and
preprocessing of the data they use (Alexandropoulos, Kotsiantis, & Vrahatis, 2019). This
report aims to explore the detailed processes of data preprocessing, exploratory data analysis
(EDA), and feature engineering, specifically focusing on transforming a specific dataset into
valuable inputs that can improve predictive models. The analysis is conducted using SAS
Studio, a robust data manipulation and analysis tool.
Columbia University carried out an experiment from 2002 to 2004, which resulted in
the dataset being used for this assignment. The university conducted a speed-dating
experiment, tracking data from speed-dating sessions attended by young adults engaging with
individuals of the opposite gender. There are 3000 observations and 14 variables (gender, age,
income, the primary goal in participating in the speed dating, the decision if the date was a
match, attractiveness, sincerity, intelligence, fun, ambitiousness, shared interest, overall
rating, probability of interest being reciprocated, and if the participants have met the date
previously) in the dataset, which makes it useful for exploring algorithms that handle mixed
data types.
The data management process for this assignment is an iterative one, with each stage
building on the previous to ensure a comprehensive approach. The initial step, data
preprocessing, handles missing values and outliers and ensures data consistency. This is
followed by exploratory data analysis (EDA), which produces statistical and inferential
summaries, as well as visualisations, providing valuable insights into data distributions and
relationships. The subsequent step, feature engineering, encompasses variable transformation
and creation, which is crucial for enhancing the quality of the dataset and rendering it more
suitable for predictive modelling. Finally, hypotheses are formulated based on the cleaned
and transformed dataset, and these hypotheses are tested using statistical methods and
visualisations.
In this report, the author aims to showcase various feature engineering techniques,
like mean imputation, outlier detection and handling, binning, logarithm transformation, one-
hot encoding, and scaling. These techniques will be applied to the speed dating dataset to
derive valuable insights and improve the effectiveness of predictive models.
Related Works
In this section, the author will delve into the existing literature on the dataset's themes,
which are speed dating and predicting a match in a relationship.
Speed Dating
Finkel, Eastwick and Matthews argue that since its development in the late 1990s, the
speed-dating concept has enabled researchers to gain important and unique insights into the
dynamics of romantic attraction (Finkel, Eastwick, & Matthews, 2007). Their research serves
as a conceptual and methodological guide for researchers looking to conduct their own speed
dating study. It includes detailed procedures and references the Northwestern Speed-Dating
Study as an example.
In the following year, Finkel and Eastwick published another article on the speed
dating concept (Finkel & Eastwick, Speed-dating, 2008). The article outlines the advantages
and possibilities of speed-dating procedures, discusses their significant contributions to our
comprehension of the social mind, and demonstrates how researchers can utilise speed-dating
and its variations (such as speed-networking, speed-interviewing, and speed-friending) to
investigate subjects pertinent to various subfields of psychological science.
Asendorpf, Penke, and Back discovered that both men and women primarily focused
on the physical attractiveness of their dating partners (Asendorpf, Penke, & Back, 2011).
Additionally, women also took men's sociosexuality, openness to experience, shyness,
education, and income into account. This study was conducted with 382 participants aged 18
to 54 over a one-year period in a community sample. This study also found that men tend to
become more selective as they age, while women tend to become less selective as they grow
older. Being selective is associated with being more popular with the opposite sex,
particularly for men. In this study, it was observed that the likelihood of engaging in sexual
interaction with a speed-dating partner stood at 6%. This probability was positively
associated with men's short-term mating interest. Furthermore, the likelihood of establishing
a relationship was found to be 4% and was positively associated with women's long-term
mating interest. However, Asendorpf, Penke, and Back recognized that the findings in the
research are only applicable to speed dating in Germany or Western cultures, and that the
sample appears to have a bias towards a higher level of education.
Predicting a match in a relationship
Ireland, Slatcher, Eastwick and Scissors note that the similarity in the dyad’s use of
function words, also known as language style matching (LSM), predicts positive outcomes
for romantic relationships (Ireland, et al., 2011). During this study, 187 participants attended
speed-dating events at Northwestern University, with each event lasting approximately 4
minutes. To summarise, the participants were over 3 times more likely to match with their
date for every standard deviation increase in LSM. In addition, Großmann, Hottung, and
Krohn-Grimberghe conducted a study to predict relationship quality from the personality
traits of a partner and found that prediction models based on general personality traits (e.g.
intelligence, ambitiousness, empathy) predicted relationship quality less effectively than
models based on relationship-related personality traits (e.g. attractiveness, commitment,
sexuality) (Großmann, Hottung, & Krohn-Grimberghe, 2019).
Method
Participants
In 2002-2004, Columbia University ran a speed-dating experiment. They tracked data
from 21 speed-dating sessions, during which mostly young adults met people of the opposite
sex.
Materials
In total, 3000 heterosexual participants registered for the experiment on the Columbia
University campus. During registration, they provided their demographic information:
gender, age, income, and primary goal in participating in the experiment. During
the speed-dating sessions, all participants were instructed to rate their respective date partners
on the following characteristics: attractiveness, sincerity, intelligence, fun,
ambitiousness, and shared interest. The response scale for each rating ranged from 1
(strongly disagree) to 10 (strongly agree). Besides the aforementioned characteristics, all
participants were also instructed to rate their respective dates overall using the same scale (1 –
strongly disagree to 10 – strongly agree); this rating is represented by the variable
‘like’. The variable ‘like’ is not derived from the other characteristics (it is not, for
example, the sum of the six characteristic ratings divided by 6); it is simply the participants'
overall rating of their respective dates.
perceptions. During the preprocessing stage, a variety of feature engineering techniques and
imputation variations are utilized to enhance the quality of the data.
Initially, logical imputation was utilized to address certain attributes with invalid data
points. For instance, values of the variable ‘met’ exceeding 1 were imputed to 1, and the
value ‘NA’ in ‘prob’ was set to 0. This was done to ensure that 'met' only had values of 1 or 0
(where 1 represents 'Yes' and 0 represents 'No'), while ‘prob’ ranges from 0 to 10, with ‘NA’
taken to imply zero probability. The imputation strategy relied on logical
rules and did not make strong assumptions, as the mechanism behind the inaccurate data was well
understood (Ziegelmeyer, 2009). Subsequently, missing data in nominal variables were
imputed using mode imputation, where the most frequent value replaced the missing values,
considering the mode as the only applicable measure of central tendency for nominal
variables (Chakrabarty, 2021). For missing data in ordinal variables, median imputation was
employed, with missing values replaced by the median value to preserve the integrity of the
ordinal scale. Finally, mean imputation was used for missing data in ratio variables, replacing
missing values with the mean of the observed values (Alam, Ayub, Arora, & Khan, 2023).
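The choice of imputation by measurement level described above can be illustrated with a minimal Python sketch. It is not the author's SAS code; the function name `impute`, the level labels, and the toy columns are all hypothetical.

```python
from statistics import mean, median, mode


def impute(values, level):
    """Fill None entries with the central-tendency measure suited to `level`."""
    observed = [v for v in values if v is not None]
    if level == "nominal":
        fill = mode(observed)      # most frequent category
    elif level == "ordinal":
        fill = median(observed)    # middle value of the ranked scores
    elif level == "ratio":
        fill = mean(observed)      # arithmetic mean of the observed values
    else:
        raise ValueError(f"unknown measurement level: {level}")
    return [fill if v is None else v for v in values]


# Hypothetical toy columns, not taken from the dataset.
goal = impute([1, 2, 2, None, 3], "nominal")
attr = impute([4, 7, None, 9, 6], "ordinal")
age = impute([22, None, 30, 26], "ratio")
```

In this toy example the median fill for `attr` is 6.5, which shows how imputation can introduce fractional values into a whole-number scale.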
Following mean imputation, the imputed data for the ratio variables and ordinal
variables underwent further refinement through numerosity reduction, specifically rounding
off. This is because whole numbers are expected in both ratio variables ‘age’ and ‘income’ as
well as the ordinal variables. Additionally, outlier detection techniques such as linear
regression and residual analysis were utilised to identify data points in the ratio variables that
significantly deviated from the regression line.
In the process of data preprocessing, it was found that the original dataset
contained missing data in nearly all variables, except for 'gender' and 'dec' as visualized in
Figure 1. Moreover, instances of data inaccuracies were noted. For instance, in Figure 3, for
the variable 'met', the expected data points were binary (1=Yes, 0=No), but there were data
points outside this range. Likewise, the variable
'prob' had data points with the value 'NA', which fell
outside the expected range of 0 to 10.
Figure 1 Total observation and number of
missing observation in the dataset
appropriate to impute data points with values greater than 1 as 1 for the 'met' variable and to
impute the 'NA' value as 0 for the 'prob' variable. In fact, Van Buuren stresses the significance
of
When dealing with missing data, various imputation strategies are employed based on
the types of variables. For nominal variables, mode imputation is used, as these variables
represent categorical data without a specific order. Imputing the mode, or the most common
value, is a suitable measure of central tendency for this type of variable. On the other hand,
median imputation is utilized for ordinal variables, which have a meaningful order but
unequal intervals between values. The median, representing the middle value, preserves the
order of the data when used as an imputation strategy for ordinal variables. Finally, mean
imputation is adopted for ratio variables. Per Gravetter and Wallnau, the mean is especially
valuable for ratio variables as it considers all data points, offering a comprehensive central
value (Gravetter & Wallnau, 2013).
Figure 2 The datapoints in variable 'met'
Figure 3 The datapoints in variable ‘prob’
After mean imputation, numerosity reduction i.e. rounding off is applied to both ratio
variables 'age' and 'income' as well as ordinal variables. This process aims to enhance the
efficiency of machine learning algorithms and improve result interpretability. Mean
imputation may produce data with fractional parts, which are unexpected in 'age' and 'income'
variables. Additionally, data points with fractional parts in ordinal variables act as outliers,
potentially distorting data visualisation and leading to misleading correlations, as illustrated
in Figure 4.
Figure 4 Numbers with fractional parts in the variables
outliers.
Middle-Left | Residual Quantile Plot | This quantile plot helps to identify outliers by comparing the distribution of residuals to a theoretical normal distribution. Points that deviate significantly from the line are potential outliers.
Middle | Residuals vs. Predicted Value | Points far from the zero line are potential outliers.
Middle-Right | Cook's Distance | This plot shows Cook's Distance for each observation. Points with Cook's D values significantly higher than the rest indicate influential data points.
Bottom-Left | Residual Histogram | The histogram of residuals shows the distribution of residuals. Outliers can often be seen as isolated bars away from the main distribution.
Bottom-Middle and Bottom-Right | Cumulative Residual Plots | These plots provide additional visualizations to assess the distribution and identify outliers.
Figures 6 and 7 show the outliers identified in the variables ‘age’ and ‘income’,
respectively. There are 14 outliers in the variable ‘age’ and 28 outliers in the variable
‘income’.
Figure 7 Outliers in the variable 'income'
SAS Output
Data
Explanation The PROC IMPORT procedure is used to read data from an external file and import
it into a SAS dataset.
The ‘out’ option specifies the name of the SAS dataset to be created from the
imported data. The dataset will be stored in the ‘Work’ library and named
‘dataset’. The ‘Work’ library is a temporary library that is deleted at the end
of the SAS session.
The ‘dbms’ option specifies the type of file being imported; csv indicates that
the file is a comma-separated values (CSV) file.
The ‘replace’ option allows the PROC IMPORT procedure to overwrite the dataset
with the new data being imported should the dataset already exist.
The ‘guessingrows’ option tells SAS to read all the rows in the CSV file to make
the best guess about the data types. This is useful when the data types might
vary throughout the file, ensuring more accurate data type determination.
SAS Result
The PROC FREQ procedure is used to produce frequency tables for categorical
variables and can also provide frequency distributions for numeric
variables.
The variables listed are gender, age, income, goal, dec, attr, sinc, intel, fun,
amb, shar, like, prob, met.
The / missing option includes missing values in the frequency tables.
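To illustrate what a frequency table that counts missing values looks like, here is a minimal Python sketch. It is an analogy to the behaviour described above, not the author's SAS code; `frequency_table` and the toy `met` column are hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Return {level: (count, percent)}, counting None as its own 'missing' level."""
    counts = Counter("missing" if v is None else v for v in values)
    total = len(values)
    return {level: (n, round(100 * n / total, 2)) for level, n in counts.items()}


# Hypothetical raw 'met' column with two missing entries and one invalid code (2).
met = [0, 1, 0, None, 0, 2, None, 0]
table = frequency_table(met)
```

Because missing entries stay in the denominator, the percentages describe the whole column rather than only the observed part, which is what makes the gaps visible before imputation.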
Procedure Logical Imputation & ensuring data validity (change met & prob variables
from character to numerical types)
SAS Code
SAS Results
Explanation The code transforms the met variable, combining certain values (2, 5, 7)
into 1.
It handles missing values in the prob variable by changing 'NA' to '0'.
It converts the met and prob variables from character to numeric types.
The original character variables met and prob are dropped, and the new
numeric variables are renamed to the original variable names.
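The logical rules described above can be sketched in Python as follows. This is an illustration only, not the author's SAS code; the helper names `clean_met` and `clean_prob` and the sample strings are hypothetical, and truly missing 'met' values are assumed to be handled in a separate step.

```python
def clean_met(raw):
    """Convert a character 'met' value to 0/1; any value above 1 is treated as 1."""
    value = int(raw)
    return 1 if value > 1 else value


def clean_prob(raw):
    """Convert a character 'prob' value to a number, treating 'NA' as 0."""
    return 0.0 if raw.strip() == "NA" else float(raw)


# Hypothetical raw character values.
met_clean = [clean_met(v) for v in ["0", "1", "2", "5", "7"]]
prob_clean = [clean_prob(v) for v in ["6", "NA", "10", "3.5"]]
```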
SAS Results
Explanation The PROC STDIZE procedure will calculate the mean for the age and
income variables in work.dataset2.
It will then replace any missing values in these variables with their
respective means.
Because of the reponly option, it will not perform full standardization
(e.g., scaling to a mean of 0 and standard deviation of 1); it will only
replace missing values.
Procedure Logical imputation (sinc, intel, fun, amb, shar, like), mode imputation (all
nominal variables), median imputation (all ordinal variables)
SAS Code
SAS Results
Explanation The code reads data from dataset3 and creates a new dataset dataset4.
For the specified variables (goal, met, attr, sinc, intel, fun, amb, shar, like,
prob), the code performs imputation (either logical, mode or median):
If a variable has the value 'NA', it replaces it with a specified default value.
For some variables, if the value is '0', it replaces it with '1'.
Procedure Numerosity reduction on all variables to replace fractional numbers with
whole numbers
SAS Code
SAS Results
Explanation The code reads data from dataset4 and creates a new dataset dataset5.
It applies the round function to several numeric variables (age, income,
goal, attr, sinc, intel, fun, amb, shar, like, prob), rounding their values to the
nearest integer.
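The rounding step can be sketched in Python, with one caveat: SAS's ROUND function is understood to round exact halves away from zero, whereas Python's built-in `round` rounds halves to the nearest even integer. The sketch below therefore uses the `decimal` module to mimic the half-away-from-zero rule; `round_half_up` is a hypothetical helper, not part of the author's code.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(x):
    """Round to the nearest integer, sending exact halves away from zero."""
    return int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# A median-imputed rating of 6.5 becomes 7 under this rule,
# while Python's built-in round() would give 6.
imputed_rating = round_half_up(6.5)
builtin_rating = round(6.5)
imputed_age = round_half_up(26.4)
```

The difference only matters for values that end in exactly .5, but median imputation on an integer scale produces precisely such values.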
Procedure Outlier detection using linear regression and residual for variable ‘age’
SAS Code
SAS Results
Explanation The code performs a linear regression analysis where age is the dependent
variable, and income, goal, and dec are the independent variables.
It outputs diagnostic measures and predictions to a new dataset named
reg_age.
SAS Results
Explanation The code creates a dataset named outlier_age that contains only the
observations from reg_age where the residual value is greater than 12.
These observations are considered outliers because they indicate a
significant difference between the observed and predicted values.
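The residual-based filter can be illustrated with a simplified Python sketch. It assumes a single predictor (income) rather than the three predictors described above, uses made-up data, and is not the author's SAS code; `fit_line` and `residual_outliers` are hypothetical names.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_outliers(x, y, threshold):
    """Return indices whose residual (observed minus predicted) exceeds threshold."""
    slope, intercept = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return [i for i, r in enumerate(residuals) if r > threshold]


# Hypothetical data: income in thousands, age in years.
income_k = [20, 30, 40, 50, 60, 70]
age = [22, 24, 26, 28, 55, 32]
flagged = residual_outliers(income_k, age, threshold=12)
```

As in the description above, only large positive residuals are flagged; comparing the absolute residual with the threshold would also catch observations far below the regression line.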
Procedure Outlier detection using linear regression and residual for variable ‘income’
SAS Code
SAS Results
Explanation The code performs a linear regression analysis where income is the
dependent variable, and age, goal, and dec are the independent variables.
It outputs diagnostic measures and predictions to a new dataset named
reg_income.
SAS Results
Explanation The code creates a dataset named outlier_income that contains only the
observations from reg_income where the residual value is greater than
40000. These observations are considered outliers because they
indicate a significant difference between the observed and predicted
values.
The mode summarises the distribution of nominal variables such as gender, goal, dec,
and met, which is visualised using vertical bar charts (Agresti, 2019). While the mode can also be
used to represent central tendency in ordinal variables (attr, sinc, intel, fun, amb,
shar, like, prob), the median, range and interquartile range are more
appropriate descriptive statistics since these data can be
ranked (Manikandan, 2011). A box-and-whisker plot (or boxplot),
along with a table displaying quartiles and percentiles, is employed to
visually represent the distribution of ordinal variables (Bensken, Pieracci,
& Ho, 2021).
The next step involves creating a clustered bar chart to analyse the
relationships between selected pairs of categorical variables, including
both nominal and ordinal variables. Additionally, a schematic boxplot will
be used to examine the relationships between a selected categorical
variable and ratio data. Finally, a scatter plot will be employed to explore
the relationships between the ratio variables.
Each of the visualisation techniques is chosen based on the type of data, as
follows:
SAS Code, SAS Results & Explanation
Procedure Verifying data completeness after data preprocessing (no missing data)
and Descriptive Statistics for nominal variables (gender, goal, dec, met)
- Central tendency: mode
- Total observation counts, Frequency, Percentage
SAS Code
SAS Results
Explanation PROC MEANS: Provides the mode and the count of total observations
for the variables gender, goal, dec, and met. This is also done to ensure
data completeness.
PROC FREQ: Provides detailed frequency distributions for the variables
gender, goal, dec, and met, including the count and percentage of each
unique value.
These procedures give a comprehensive overview of the distribution and
central tendency of the specified categorical variables in the dataset.
SAS Results
Procedure Verifying data completeness after data preprocessing (no missing data)
and Descriptive Statistics for ratio variables (age, income)
- Central tendency: mean, median, mode
- Variability: Range, Interquartile Range, Variance, Standard
Deviation, Coefficient of Variation
- Position: Distribution Plot and Z Score Graph
SAS Code
SAS Results
Procedure Vertical bar chart for categorical variables visualisation (gender, goal,
dec, met, attr, sinc, intel, fun, amb, shar, like, prob)
SAS Code
SAS Results
Explanation The %vbar macro is defined to create a vertical bar chart for a specified
variable in a given dataset using PROC SGPLOT.
The macro is called for both nominal and ordinal variables in dataset5,
generating vertical bar charts for each of these variables.
The vertical bar charts visualize the frequency distribution of each
categorical variable, providing a graphical representation of the
categorical data in dataset5.
Procedure Two way cluster bar chart to examine relationships between two
categorical variables.
SAS Code
More in Appendix A
SAS Results
More in Appendix B
Explanation Each PROC FREQ step generates a cross-tabulation table and a clustered
frequency plot for a pair of categorical variables. The pairs are gender-goal,
dec-attr, gender-dec, dec-goal, gender-attr, dec-like, dec-sinc, dec-intel,
dec-fun, dec-amb, dec-shar, dec-met, dec-prob, prob-like, and met-like.
The code applies specific formats and labels to improve the readability and
interpretability of the output.
The clustered frequency plots provide a visual representation of the
relationships between the pairs of categorical variables.
SAS Results
Explanation The code generates a scatter plot with a regression line for age and
income using PROC SGSCATTER. This visual representation helps in
assessing the linear relationship between these two variables, providing
insights into how income tends to change with age in the dataset dataset5.
SAS Results
Explanation PROC SORT sorts the dataset dataset5 by the dec variable while
PROC BOXPLOT creates a schematic box plot to show the distribution
of income across different levels of decision, with applied formats and
labels for better understanding. This code effectively prepares the data
and generates a detailed visualization of how income varies based on the
decision (dec), aiding in the interpretation of the relationship between
these two variables in dataset5.
Feature Engineering
Following exploratory data analysis (EDA), feature engineering techniques are applied to the
dataset to improve its predictive capability, eliminate noisy features, and further refine the
data. In addition to the feature engineering techniques previously applied during data
preprocessing, including imputation, numerosity reduction, and outlier detection, five more
techniques are applied to the dataset: numerical binning, log transformation, one-hot
encoding, scaling and feature creation.
In the process of numerical binning, the variable 'income' is segmented into discrete
intervals, categorizing continuous values into specific bins such as 'very low', 'low', 'medium',
'high', and 'very high'. This technique serves to improve interpretability.
The next step involves applying a log transformation to the 'age' variable to stabilize
its variance and normalize its distribution, making it more suitable for modelling.
Additionally, the categorical variable 'goal' is converted using one-hot encoding to create
multiple binary columns for each category. Then, the 'age' and 'income' variables are
standardized to have a mean of 0 and a standard deviation of 1 to ensure consistency across
the dataset. Finally, feature creation is applied to create a composite variable ‘Total’ by using
the following formula:
Total = (attr + sinc + intel + fun + amb + shar) / 6
It is interesting to note that while the composite variable ‘total’ does not necessarily
share the same value as the variable ‘like’ (the score with which participants rate their dating
partner overall), the two variables are strongly correlated.
SAS Code
SAS Results
Explanation The PROC RANK procedure divides the income variable into quintiles,
facilitating categorization based on relative income levels. Next, the data
step assigns meaningful labels to each quintile, effectively binning the
income variables into categorical variables (income labels).
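The quintile binning can be sketched in Python as follows. The sketch assumes PROC RANK was run with five groups and its default tie handling (mean ranks), in which case the group index is understood to be floor(rank * 5 / (n + 1)); the actual call is not shown in this report. The income values are made up and `mean_ranks` and `quintile_labels` are hypothetical names.

```python
from math import floor

# Labels for group indices 0..4, in ascending order of income.
LABELS = ["very low", "low", "medium", "high", "very high"]


def mean_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1
        count = ordered.count(v)
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def quintile_labels(values):
    """Assign each value a quintile label using floor(rank * 5 / (n + 1))."""
    n = len(values)
    return [LABELS[floor(r * 5 / (n + 1))] for r in mean_ranks(values)]


# Hypothetical incomes.
incomes = [12000, 25000, 31000, 40000, 52000, 58000, 64000, 71000, 90000, 150000]
income_label = quintile_labels(incomes)
```

Because the bins are based on ranks, each label covers roughly the same number of participants regardless of how skewed the income values are.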
SAS Code
SAS Results
Explanation The data step creates a new variable log_age which transforms age using
the natural logarithm. This is often done to normalize the distribution,
reduce skewness, or meet the assumptions of certain statistical analyses.
PROC UNIVARIATE provides a histogram to visualize the distribution
of the log_age variable, aiding in understanding the transformed data's
characteristics and assessing whether the transformation achieved the
desired effect.
SAS Results
Explanation PROC STANDARD transforms the income and age variables to have a
mean of 0 and a standard deviation of 1. This is useful for making the
variables comparable on the same scale, particularly for statistical
analysis or machine learning algorithms that are sensitive to the scale of
input data.
PROC UNIVARIATE provides detailed descriptive statistics and
visualizations for the standardized variables. Histograms help assess the
distribution of the standardized income and age variables, checking for
normality or other patterns.
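The standardization step corresponds to the usual z-score. A minimal Python sketch follows, assuming the sample standard deviation (divisor n - 1), which is understood to be the default in PROC STANDARD; the ages are made up and `standardize` is a hypothetical name.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical ages.
ages = [22, 24, 26, 28, 30]
z_age = standardize(ages)
```

After this transformation, age and income are expressed in the same unit (standard deviations from their own mean), so neither dominates a scale-sensitive algorithm merely because of its magnitude.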
Procedure One-hot encoding for variable ‘goal’
SAS Code
SAS Results
Explanation dataset9 contains all variables from dataset8 and further adds six new
binary variables (goal_1, goal_2, goal_3, goal_4, goal_5, goal_6) that
represent the one-hot encoded values of the goal variable.
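One-hot encoding of 'goal' can be sketched in Python as below. The sketch assumes the six categories are coded 1 to 6, as the variable names goal_1 to goal_6 suggest; `one_hot_goal` is a hypothetical helper and not the author's SAS code.

```python
def one_hot_goal(goal, levels=range(1, 7)):
    """Return binary indicators goal_1..goal_6, exactly one of which is 1."""
    return {f"goal_{k}": int(goal == k) for k in levels}


# A participant whose (hypothetical) goal code is 3.
encoded = one_hot_goal(3)
```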
SAS Results
Explanation The DATA step creates a new feature called ‘total’. The ‘total’
variable provides a composite score that represents the average of six
specific attributes (attr, sinc, intel, fun, amb, shar). This composite score
can be useful for summarizing these attributes into a single metric for
further analysis.
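The composite score described above is a simple average of the six ratings. A minimal Python sketch with a made-up row (`total_score` is a hypothetical name):

```python
ATTRIBUTES = ("attr", "sinc", "intel", "fun", "amb", "shar")


def total_score(row):
    """Average of the six attribute ratings for one observation."""
    return sum(row[name] for name in ATTRIBUTES) / len(ATTRIBUTES)


# Hypothetical observation; 'like' is rated separately and is not part of the average.
row = {"attr": 7, "sinc": 8, "intel": 9, "fun": 6, "amb": 5, "shar": 7, "like": 8}
total = total_score(row)
```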
Hypotheses Testing
Through the utilization of the cleaned and transformed dataset, hypotheses are
constructed to investigate the connections between attributes. For instance, during EDA, a
positive correlation between the variables 'dec' and 'attr' becomes evident when visualized in
a two-way cluster bar chart. The observation aligns with the unspoken rule #1 in modern
dating, which is ‘be attractive’ (Mitchell & Wells, 2018). However, further testing is
necessary to validate the observation.
Performing a Pearson correlation analysis across all variables is an effective method
for identifying potential linear relationships within a dataset. This analysis generates
correlation coefficients, which fall within the range of -1 to 1, as well as P-values. A
correlation coefficient close to 1 signifies a strong positive linear relationship, while a value
near -1 indicates a strong negative linear relationship (Schober, Boer, & Schwarte, 2018). A
coefficient around 0 suggests no linear relationship. Additionally, smaller P-values (usually
<0.05) indicate the statistical significance of the correlation coefficients.
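For reference, the Pearson coefficient can be computed directly from its definition. The Python sketch below uses made-up ratings and does not compute the P-value, which SAS reports alongside each coefficient; `pearson_r` is a hypothetical name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical 'fun' and 'like' ratings for five dates.
fun = [3, 5, 6, 8, 9]
like = [4, 5, 7, 7, 9]
r_fun_like = pearson_r(fun, like)   # close to +1: strong positive linear relationship
```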
Figure 5 shows the result of the Pearson Correlation Coefficient Analysis conducted on all 21
variables. The five major findings are as follows:
Hypothesis #1 | Sinc - intel | The higher the sincerity scores, the higher the intelligence scores.
Hypothesis #2 | Attr - like | The higher the attractiveness scores, the higher the like scores
Hypothesis #3 | Shar - fun | Dating partners that share more interests with the participants are perceived as more fun
Hypothesis #4 | Fun - like | Individuals perceived as more fun are likely to receive higher like scores
Hypothesis #5 | Shar - like | Dating partners who share more interests with the participant tend to receive higher like scores
It is worth noting that the correlations between the variable ‘total’ and attr, sinc, intel,
fun, amb, and shar are significant. Since total is an aggregate measure of those variables, it will
naturally exhibit strong correlations with its component variables and any other variables
correlated with those components. This does not invalidate the correlations but rather explains
their origin and strength. For this assignment, variables other than ‘total’
are prioritised.
The Chi-Square test is used to test the hypotheses. It is designed to test for independence
between categorical variables (which all of the involved variables are), making it a robust
choice for assessing associations in contingency tables without assuming a particular
underlying data distribution.
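As a rough illustration of the test described above, the following Python sketch computes the Pearson Chi-Square statistic and its degrees of freedom from a contingency table. It is not the SAS procedure used in the assignment; the function name `chi_square_independence` and the 2x2 counts are hypothetical.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows of observed counts; every row and column
    is assumed to have a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical 2x2 table of counts (not taken from the dataset).
observed = [[30, 10],
            [10, 30]]
chi2, dof = chi_square_independence(observed)
print(chi2, dof)
```

The same degrees-of-freedom formula gives (10 - 1) * (10 - 1) = 81 for a 10 x 10 table.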
Hypothesis 1 (sinc-intel)
Statistic                            Value       Interpretation
Chi-Square Test of Independence      3945.591    Significant relationship, thus rejecting the null
  Degrees of Freedom (DF)            81          hypothesis of independence
  P-Value                            <0.0001
Likelihood Ratio Chi-Square          2140.587    Significant association between the two variables
Mantel-Haenszel Chi-Square           1270.293    Strong relationship between the two variables
  Degrees of Freedom (DF)            1
Phi Coefficient                      1.1469      Greater than 1, which can occur for tables larger
                                                 than 2x2; suggests a strong association
Contingency Coefficient              0.7537      Value close to 1 indicates a strong relationship
                                                 between the two variables
Cramer’s V                           0.3823      Moderately strong relationship between the two
                                                 variables
Hypothesis 2 (Attr-like)
Statistic                            Value       Interpretation
Chi-Square Test of Independence      3488.121    Significant relationship, thus rejecting the null
  Degrees of Freedom (DF)            81          hypothesis of independence
  P-Value                            <0.0001
Likelihood Ratio Chi-Square          2160.0884   Significant association between the two variables
Mantel-Haenszel Chi-Square           1338.6646   Strong relationship between the two variables
  Degrees of Freedom (DF)            1
Phi Coefficient                      1.0783      Greater than 1, which can occur for tables larger
                                                 than 2x2; suggests a strong association
Contingency Coefficient              0.7332      Value close to 1 indicates a strong relationship
                                                 between the two variables
Cramer’s V                           0.3594      Moderately strong relationship between the two
                                                 variables
Hypothesis 3 (Shar-fun)
Statistic                            Value       Interpretation
Chi-Square Test of Independence      2440.7562   Significant relationship, thus rejecting the null
  Degrees of Freedom (DF)            81          hypothesis of independence
  P-Value                            <0.0001
Likelihood Ratio Chi-Square          1547.146    Significant association between the two variables
Mantel-Haenszel Chi-Square           1032.3330   Strong relationship between the two variables
  Degrees of Freedom (DF)            1
Phi Coefficient                      0.9020      A value close to 1 suggests a strong association
Contingency Coefficient              0.6698      Value close to 1 indicates a strong relationship
                                                 between the two variables
Cramer’s V                           0.3007      Moderately strong relationship between the two
                                                 variables
Hypothesis 4 (Fun-like)
Statistic                            Value       Interpretation
Chi-Square Test of Independence      3335.3030   Significant relationship, thus rejecting the null
  Degrees of Freedom (DF)            81          hypothesis of independence
  P-Value                            <0.0001
Likelihood Ratio Chi-Square          2213.8685   Significant association between the two variables
Mantel-Haenszel Chi-Square           1434.8524   Strong relationship between the two variables
  Degrees of Freedom (DF)            1
Phi Coefficient                      1.0544      Greater than 1, which can occur for tables larger
                                                 than 2x2; suggests a strong association
Contingency Coefficient              0.7256      Value close to 1 indicates a strong relationship
                                                 between the two variables
Cramer’s V                           0.3515      Moderately strong relationship between the two
                                                 variables
Hypothesis 5 (Shar-like)
Statistic                            Value       Interpretation
Chi-Square Test of Independence      2961.6365   Significant relationship, thus rejecting the null
  Degrees of Freedom (DF)            81          hypothesis of independence
  P-Value                            <0.0001
Likelihood Ratio Chi-Square          1793.5797   Significant association between the two variables
Mantel-Haenszel Chi-Square           1181.3347   Strong relationship between the two variables
  Degrees of Freedom (DF)            1
Phi Coefficient                      0.9936      A value close to 1 suggests a strong association
Contingency Coefficient              0.7048      Value close to 1 indicates a strong relationship
                                                 between the two variables
Cramer’s V                           0.3312      Moderately strong relationship between the two
                                                 variables
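The three association measures reported for each hypothesis are linked through the Chi-Square statistic. With the standard definitions phi = sqrt(chi2 / n), C = sqrt(chi2 / (chi2 + n)) and V = sqrt(chi2 / (n * min(r - 1, c - 1))), both C and V can be derived from phi alone. The Python sketch below (not the author's SAS code) checks the reported values against these relations. It assumes that DF = 81 arises from a 10 x 10 table, which is not stated explicitly in the text; the helper `from_phi` is hypothetical.

```python
import math

# Reported (phi, contingency coefficient, Cramer's V) for each hypothesis.
reported = {
    "sinc-intel": (1.1469, 0.7537, 0.3823),
    "attr-like":  (1.0783, 0.7332, 0.3594),
    "shar-fun":   (0.9020, 0.6698, 0.3007),
    "fun-like":   (1.0544, 0.7256, 0.3515),
    "shar-like":  (0.9936, 0.7048, 0.3312),
}

# ASSUMPTION: DF = 81 comes from a 10 x 10 table, so min(r - 1, c - 1) = 9.
K = 9

def from_phi(phi, k=K):
    """Derive the contingency coefficient and Cramer's V from phi."""
    contingency = phi / math.sqrt(1 + phi ** 2)  # equals sqrt(chi2 / (chi2 + n))
    cramers_v = phi / math.sqrt(k)               # equals sqrt(chi2 / (n * k))
    return contingency, cramers_v

for name, (phi, c_reported, v_reported) in reported.items():
    c, v = from_phi(phi)
    print(f"{name}: C={c:.4f} (reported {c_reported}), V={v:.4f} (reported {v_reported})")
```

Under this assumption phi can range up to sqrt(9) = 3, which is why values above 1 appear in the tables, whereas Cramer's V is rescaled to lie between 0 and 1.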
SAS Results
SAS Code
SAS Results
SAS Code
SAS Results
Insights on Hypotheses
The results of both the Chi-Square tests and the Spearman correlation tests provide strong
statistical evidence for the hypothesised relationships. The Spearman coefficients indicate a
range of correlation strengths from strong to very strong, suggesting a close and monotonic
relationship between the variables in each hypothesis. Consequently, these statistical tests
support the hypotheses: higher values in one variable (e.g., sincerity, attractiveness, shared
interests, fun) are associated with higher values in the related variable (e.g., intelligence,
like scores).
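Spearman's coefficient measures monotonic association by applying the Pearson formula to ranks rather than raw scores. The following Python sketch illustrates the idea; it is not the SAS procedure used in the assignment, and the function names and sample ratings are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ratings with ties (not taken from the dataset).
fun = [2, 4, 4, 6, 7, 9]
like = [1, 3, 5, 5, 8, 10]
print(round(spearman_rho(fun, like), 4))
```

Because only the ordering matters, any strictly increasing relationship (linear or not) yields a coefficient of 1.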
Discussion
The exploratory data analysis (EDA) carried out in this study unveiled valuable
insights into the connections between different personal attributes and their influence on
likeability and other social outcomes. By employing a range of statistical measures and
visualisation techniques, the author was able to identify patterns and validate the hypotheses
with strong statistical evidence.
In the analysis, the author found robust positive correlations among the key variables.
Specifically, sincerity (sinc) and intelligence (intel) were strongly correlated with a Spearman
correlation coefficient of 0.63373, indicating a significant monotonic relationship. Similarly,
attractiveness (attr) exhibited a strong positive correlation with like scores (like), yielding a
Spearman coefficient of 0.65854. These associations were further substantiated through Chi-
Square tests, affirming the substantial relationships between these variables.
It is crucial to note that fun (fun) and shared interests (shar) are important variables in
understanding social dynamics. The data shows a strong correlation between fun and like
scores, as evidenced by the significant Spearman coefficient of 0.67185. This suggests that
individuals perceived as more fun are more likely to be liked. Additionally, shared interests
also play a significant role, exhibiting a strong positive correlation with like scores
(Spearman coefficient = 0.60470). These findings are consistent with existing literature
highlighting the importance of mutual interests and enjoyment in the formation and
maintenance of social relationships.
The utilization of clustered bar charts for categorical variables, schematic boxplots for
both categorical and ratio variables, and scatter plots for ratio variables offered clear and
intuitive visual representations of the data. Clustered bar charts effectively showcased the
distribution and interactions between categorical variables, while boxplots succinctly
summarized the distribution of ratio variables across different categories. Scatter plots were
crucial in visualizing relationships between continuous variables, unveiling trends and
potential correlations.
4. Feature Engineering:
The feature engineering steps included numerical binning, log transformation, and one-hot
encoding. These techniques greatly improved the suitability of the data for analysis by
normalising distributions, reducing skewness, and ensuring that categorical data could be
effectively used in statistical models.
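The three feature engineering steps named in the Conclusion (numerical binning, log transformation, and one-hot encoding) can be sketched in Python as follows. This is only an illustration, not the author's SAS code; the function names, bin edges, and sample values are all hypothetical.

```python
import math

def bin_value(value, edges, labels):
    """Numerical binning: return the label of the first bin whose upper edge is >= value."""
    for upper, label in zip(edges, labels):
        if value <= upper:
            return label
    return labels[-1]  # values above the last edge fall into the final bin

def log_transform(values):
    """Log transformation using log(1 + x), which also handles zeros."""
    return [math.log1p(v) for v in values]

def one_hot(categories):
    """One-hot encoding: one 0/1 indicator per distinct category, in sorted order."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

# Hypothetical values (not taken from the dataset).
ages = [21, 27, 34, 41]
incomes = [0, 1000, 50000]
fields = ["law", "science", "law"]

print([bin_value(a, [25, 35], ["<=25", "26-35", "36+"]) for a in ages])
print(log_transform(incomes))
print(one_hot(fields))
```

The log transform compresses large values, which is how it reduces right skew, while one-hot encoding turns each category into its own indicator column.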
Conclusion
The research conducted an in-depth exploratory data analysis to examine the connections
between personal attributes and social outcomes in a speed-dating context. Robust statistical
methods, such as Spearman correlation and Chi-Square tests, supported the hypothesized
associations. Key findings showed strong positive correlations between sincerity and
intelligence, attractiveness and like scores, and fun and like scores. Shared interests were also
found to enhance likeability, emphasizing the importance of mutual enjoyment and common
interests in forming social connections. The use of visualization techniques like clustered bar
charts, schematic boxplots, and scatter plots helped communicate data insights effectively.
Feature engineering processes, including numerical binning, log transformation, and one-hot
encoding, improved the data's suitability for analysis. Overall, the study highlights the
complex interplay of personal attributes in social interactions and their significant impact on
social outcomes, providing implications for understanding social dynamics and laying the
groundwork for future research.
Future Research
The next step in research could involve delving into advanced modelling techniques,
such as machine learning algorithms, to forecast social outcomes based on the identified key
attributes. Moreover, conducting longitudinal studies could offer deeper insights into the
evolution of these relationships over time, contributing to a more nuanced understanding of
social interactions and their long-term effects. In summary, this study establishes a strong
basis for comprehending the intricate interplay of personal attributes in social dynamics, with
significant implications for fields ranging from psychology to social network analysis.
Works Cited
Alexandropoulos, S.-A. N., Kotsiantis, S. B., & Vrahatis, M. N. (2019). Data preprocessing in
predictive data mining. The Knowledge Engineering Review.
Asendorpf, J. B., Penke, L., & Back, M. D. (2011). From dating to mating and relating:
Predictors of initial and long–term outcomes of speed–dating in a community sample.
European Journal of Personality.
Finkel, E. J., Eastwick, P. W., & Matthews, J. (2007). Speed‐dating as an invaluable tool for
studying romantic attraction: A methodological primer. Personal Relationships, 149-166.
Ireland, M. E., Slatcher, R. B., Eastwick, P. W., Scissors, L. E., Finkel, E. J., & Pennebaker, J.
W. (2011). Language style matching predicts relationship initiation and stability.
Psychological Science, 39-44.
Großmann, I., Hottung, A., & Krohn-Grimberghe, A. (2019). Machine learning meets partner
matching: Predicting the future relationship quality based on personality traits. PLoS
One.
Ziegelmeyer, M. (2009). Documentation of the logical imputation using the panel structure of
the 2003-2008 German SAVE Survey. SONDERFORSCHUNGSBEREICH.
Alam, S., Ayub, M. S., Arora, S., & Khan, M. A. (2023). An investigation of the imputation
techniques for missing values in ordinal data enhancing clustering and classification
analysis validity. Decision Analytics Journal.
Agresti, A. (2019). An introduction to categorical data analysis. John Wiley & Sons.
Bensken, W. P., Pieracci, F. M., & Ho, V. P. (2021). Basic introduction to statistics in
medicine, part 1: Describing data. Surgical Infections, 590-596.
Gravetter, F., & Wallnau, L. (2013). Statistics for the behavioral sciences. Belmont: Cengage
Learning.
Mitchell, M., & Wells, M. (2018). Race, romantic attraction, and dating. Ethical Theory and
Moral Practice.
Schober, P., Boer, C., & Schwarte, L. A. (2018). Correlation coefficients: appropriate use and
interpretation. Anesthesia & Analgesia, 126.
Appendix A
SAS Code for Two Way Cluster Bar Chart for selected categorical variable pairs
Appendix B