SATISFACTION PREDICTION USING MACHINE LEARNING ALGORITHMS
TABLE OF CONTENTS
01 INTRODUCTION
02 DATA OVERVIEW
03 DATA EXPLORATION
04 DATA QUALITY ASSESSMENT
05 DATA PREPROCESSING
06 MODEL PREPARATION
07 MODEL TRAINING
08 MODEL EVALUATION AND COMPARISON
09 CONCLUSION
01
INTRODUCTION
1.1. Purpose of Analysis
The purpose of this analysis is to explore and predict customer satisfaction in the airline industry. Customer satisfaction is a critical determinant of an airline's success and competitiveness. The analysis aims to identify:
● The primary drivers of customer satisfaction.
● Patterns or trends in passenger feedback.
● Opportunities for airlines to improve their services and address customer pain points.
1.2. Importance of Passenger Satisfaction Prediction
In a highly competitive market, accurately predicting passenger satisfaction can lead to:
● Increased revenue
● Improved operational efficiency
● Brand loyalty and reputation
02
Data Exploration
2.1. Loading the Data
The dataset is loaded using Python. Its dimensions (97,410 rows and 24 columns) give a sample large enough for the analysis.
Running Code:
● The pandas library is used to process and analyze tabular data
(DataFrame).
● pd is an alias for calling functions in the pandas library.
● pd.read_csv(): The function reads a CSV file and converts the content into
a DataFrame (table format).
● The head() function displays the first rows of the train_data DataFrame (the default is 5 rows, but this can be changed by passing the desired number of rows).
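A minimal sketch of the loading step described above. The real file path is not given in the slides, so a small in-memory CSV (a hypothetical stand-in with only a few of the dataset's columns) is read instead; with the real data, the file path would be passed to pd.read_csv() directly.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; values are made up for illustration.
csv_text = """id,Gender,Customer Type,Age,Flight Distance,satisfaction
1,Male,Loyal Customer,22,127,1
2,Female,disloyal Customer,35,800,0
3,Female,Loyal Customer,41,1500,1
4,Male,Loyal Customer,49,3945,1
5,Female,disloyal Customer,27,560,0
6,Male,Loyal Customer,33,2100,0
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
train_data = pd.read_csv(io.StringIO(csv_text))

print(train_data.shape)    # (number of rows, number of columns)
print(train_data.head())   # first 5 rows by default
print(train_data.head(3))  # first 3 rows
```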
Output:
● 97,410 rows: This is the number of records or observations
in the dataset (e.g., number of customers or transactions).
● 24 columns: This is the number of features or variables in
the dataset (e.g., customer information such as gender, age,
ticket type, satisfaction level).
2.2. Initial Inspection
Running Code:
Display the first 5 rows of the DataFrame to get an overview of
the data structure.
Output:
● Gender: The dataset includes 2 genders: male and female.
● Customer Type: The dataset includes 2 types of customer: disloyal and loyal.
● Age: In the displayed rows, passenger ages range from 22 (youngest) to 49 (oldest).
● Flight distance: In the displayed rows, values range from 127 (smallest) to 3,945 (largest).
● Other service features, such as Inflight wifi service, On-board service, and Leg room service, are rated on a scale of 1 to 5.
Running Code:
The purpose of this code is to display detailed information about the data. train_data.info() and test_data.info() provide an overview of the two DataFrames (train_data and test_data), including:
● Number of rows and columns.
● Names of columns.
● Null values in each column.
● Data type of each column (e.g., int64, float64, object).
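The inspection step above can be sketched as follows. The small DataFrame is a hypothetical stand-in for train_data; info() normally prints to the console, and it is captured in a buffer here only so the summary can be reused as a string.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data (made-up values).
train_data = pd.DataFrame({
    "Age": [22, 35, 49],
    "Gender": ["Male", "Female", "Female"],
    "Arrival Delay in Minutes": [0.0, np.nan, 12.0],
})

# info() reports row count, column names, non-null counts and dtypes.
buffer = io.StringIO()
train_data.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```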
Output:
● Numerical form: ID, Age, Satisfaction.
● Categorical form: Gender, Customer type, Type of travel, and Class.
● Also numerical (service ratings and delay measurements): Cleanliness, Baggage handling, Departure delay in minutes, and Arrival delay in minutes.
Output:
• The final dataset now comprises 32,470 entries and 22 columns.
• There are 3 data types: int64, a 64-bit integer type used to store whole-number values; float64, a 64-bit floating-point type used to store values with decimal parts; and object, which holds strings or other non-numeric data.
03
Data Quality
Assessment
3.1. Outliers
In this code, outlier detection is performed on the numerical columns of the train_data dataset. The output shows the number of outliers in each column:
• Age: 0 outliers
• Flight distance: 2,150 outliers
• Departure delay in minutes: 13,711 outliers
• Arrival delay in minutes: 13,215 outliers
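The slides do not state which outlier rule was used. The sketch below assumes the common 1.5 × IQR rule (the same rule box-plot whiskers are usually based on); the helper count_iqr_outliers and the sample values are hypothetical.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed rule)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())


# Hypothetical stand-in for two numerical columns of train_data.
train_data = pd.DataFrame({
    "Age": [22, 25, 31, 35, 40, 44, 49, 38],
    "Departure Delay in Minutes": [0, 0, 0, 2, 5, 0, 3, 240],
})

outlier_counts = {col: count_iqr_outliers(train_data[col]) for col in train_data.columns}
print(outlier_counts)
```

In this toy data only the 240-minute delay falls outside the IQR fences, which mirrors the pattern in the slides: no outliers in Age and many in the delay columns.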
The graph generated by this code is a
box plot, which displays important
information about the distribution of
numerical data in each column of
train_data.
Outliers can degrade a model's predictions. Linear models are especially sensitive, because the fitted regression line is pulled toward extreme values.
Therefore, dealing with outliers is an
important step to improve the
efficiency and accuracy of the model.
The `Flight Distance` column has a wide distribution, with most values in the range 500 to 1,500, but there are many outliers above the upper limit.
The `Departure Delay in Minutes` and `Arrival Delay in Minutes` columns have distributions concentrated close to 0.
3.2. Correlation Analysis
• The output of this code is a correlation-matrix heatmap: each cell in the matrix represents the relationship between two data columns.
• Values range from -1 (perfect negative correlation) to 1 (perfect positive correlation); 0 indicates no linear relationship.
• Colors make it easy to identify strong (red) or weak (blue) relationships.
• Inflight wifi service and Inflight
entertainment are highly correlated
(0.71)
• Ease of Online booking and Online
boarding are moderately correlated
(0.44)
• Arrival Delay in Minutes and Departure
Delay in Minutes are highly correlated
(0.94)
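A minimal sketch of how such a correlation matrix can be computed with pandas. The values below are made up, so the coefficients will not match the 0.71, 0.44 and 0.94 reported above; the plotting step is omitted, although the resulting matrix could be passed to a heatmap function from a plotting library.

```python
import pandas as pd

# Hypothetical stand-in for a few numerical columns of train_data.
train_data = pd.DataFrame({
    "Departure Delay in Minutes": [0, 5, 12, 30, 60, 0, 90, 15],
    "Arrival Delay in Minutes": [0, 3, 15, 28, 65, 2, 85, 10],
    "Age": [22, 49, 35, 41, 27, 58, 33, 45],
})

# Pairwise Pearson correlation between all numerical columns.
corr_matrix = train_data.corr()
delay_corr = corr_matrix.loc["Departure Delay in Minutes", "Arrival Delay in Minutes"]

print(corr_matrix.round(2))
print(f"Departure vs. arrival delay: {delay_corr:.2f}")
```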
04
Data
Preprocessing
4.1. Handling Missing Values
• The above code is used to visualize the
number of missing values in each column of
the training dataset.
• The result is a Series, where index is the
column name and values is the number of
missing values.
• This bar chart displays the number of missing
values in each column of the training dataset.
• Only the `Arrival Delay in Minutes` column
contains about 300 missing values
This code handles missing data and resets the index. Specifically, missing values (`NaN`) in the `Arrival Delay in Minutes` column of both `train_data` and `test_data` are replaced with the mean value of the corresponding column, so that no rows have to be dropped.
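The missing-value handling described in this section can be sketched as below. The two small DataFrames are hypothetical stand-ins; each dataset is filled with its own column mean, as the slide states (a common alternative is to reuse the training mean for the test set).

```python
import numpy as np
import pandas as pd

col = "Arrival Delay in Minutes"

# Hypothetical stand-ins for train_data and test_data (made-up values).
train_data = pd.DataFrame({col: [0.0, 10.0, np.nan, 20.0]}, index=[3, 7, 8, 9])
test_data = pd.DataFrame({col: [5.0, np.nan, 15.0]})

# Number of missing values per column (a Series indexed by column name).
print(train_data.isnull().sum())

# Replace NaN with the mean of the corresponding column.
train_data[col] = train_data[col].fillna(train_data[col].mean())
test_data[col] = test_data[col].fillna(test_data[col].mean())

# Reset the index to a clean 0..n-1 range.
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)
```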
4.2. Feature Transformation
• This code performs one-hot encoding for the columns 'Customer Type', 'Type of Travel', 'Class', and 'Gender' in both the training and test sets.
• Create a target variable y from the column 'satisfaction' in train_data.
• Create a feature set X by copying train_data and removing the columns 'satisfaction' and 'id'.
• Check the size of the feature sets (X and X_test) after processing.
• After running the code, the categorical columns in both train_data and test_data are converted to binary columns (0 or 1).
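A sketch of these steps using pd.get_dummies, which is one common way to one-hot encode with pandas; the slides do not show which function was actually used. The sample rows are hypothetical, and only the training set is shown (the same call would be applied to test_data).

```python
import pandas as pd

categorical_cols = ["Customer Type", "Type of Travel", "Class", "Gender"]

# Hypothetical stand-in for train_data (made-up values).
train_data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Customer Type": ["Loyal Customer", "disloyal Customer", "Loyal Customer", "Loyal Customer"],
    "Type of Travel": ["Business travel", "Personal Travel", "Business travel", "Personal Travel"],
    "Class": ["Business", "Eco", "Eco Plus", "Eco"],
    "Age": [22, 35, 41, 49],
    "satisfaction": [1, 0, 1, 0],
})

# One-hot encode: each category becomes its own 0/1 column.
train_data = pd.get_dummies(train_data, columns=categorical_cols, dtype=int)

# Target variable and feature set.
y = train_data["satisfaction"]
X = train_data.drop(columns=["satisfaction", "id"])

print(X.shape)
print(X.columns.tolist())
```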
4.3. Log Transformation
• The code performs a logarithmic transformation on the `Flight Distance`, `Departure Delay in Minutes` and `Arrival Delay in Minutes` columns.
• The log transform helps by reducing the skewness of the data distribution, reducing the influence of outliers, and making the distributions easier to visualize.
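A minimal sketch of a log transformation on skewed columns. It assumes np.log1p, i.e. log(1 + x), because the delay columns contain zeros and log(0) is undefined; the slides do not say which variant was used. The values are made up.

```python
import numpy as np
import pandas as pd

log_cols = ["Flight Distance", "Departure Delay in Minutes"]

# Hypothetical stand-in for train_data (made-up values).
train_data = pd.DataFrame({
    "Flight Distance": [127, 500, 1500, 3945],
    "Departure Delay in Minutes": [0, 5, 60, 600],
})

# log1p compresses large values far more than small ones,
# which shortens the long right tail of the distribution.
log_data = np.log1p(train_data[log_cols])
print(log_data.round(2))

# expm1 is the inverse transform and recovers the original scale.
restored = np.expm1(log_data)
```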
• The "Flight Distance" value ranges from about 4 to 8 on a logarithmic scale.
• On the original scale, this corresponds to approximately 54 km to 2,980 km (e^4 ≈ 54.6, e^8 ≈ 2,981).
• The KDE curve (solid blue line) superimposed on the histogram shows that the transformed data is approximately normally distributed.
05
Model
Preparation
5.1. Defining Features and Target Variable
• Features (X_train): Features for training
• Target Variable (y_train): Labels for training.
• Test Data: Features for testing
5.2. Train-Test Split
• X_train and y_train are taken from train_data.
• Without a hold-out split, the evaluation of the model's performance would be inaccurate and overfitting could go undetected.
• Training set (X_train, y_train): Used to train the model.
• Validation set (X_val, y_val): Used to evaluate the model performance on unseen data.
• X_train: Contains 80% of the data (features only) used for training.
• X_val: Contains 20% of the data (features only) reserved for evaluation.
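An 80/20 split like the one described here can be sketched with scikit-learn's train_test_split. The data is a hypothetical stand-in, and random_state=42 is an assumption carried over from the value the slides use for the model.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the feature set X and target y (made-up values).
X = pd.DataFrame({
    "Age": list(range(20, 30)),
    "Flight Distance": list(range(100, 1100, 100)),
})
y = pd.Series([0, 1] * 5, name="satisfaction")

# test_size=0.2 keeps 80% of the rows for training and holds out 20%.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_val.shape)
```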
5.3. Preprocessing Pipelines for Numeric and Categorical Features
Numeric Features:
• Numerical data columns (int64, float64).
• Easy to process with mathematical operations.
Categorical Features:
• Categorical data columns (object type).
• Need to be encoded before being fed into the machine learning model.
5.3.1. Numeric Transformer
SimpleImputer(strategy='mean'):
• Fill in missing values with the mean of each column.
StandardScaler():
• Standardize each column to mean 0 and standard deviation 1. This rescales the data but does not change the shape of its distribution.
5.3.2. Categorical Transformer
SimpleImputer(strategy='constant', fill_value='missing'):
• Fill missing values with a fixed value ('missing').
OneHotEncoder(handle_unknown='ignore'):
• Encode categorical values into one-hot form (binary variables). Categories not seen during training are encoded as all zeros instead of raising an error.
5.3.3. Aggregating the Transformers in ColumnTransformer
• numeric_transformer applies to
numeric_features.
• categorical_transformer applies to
categorical_features.
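The two transformers and their aggregation can be sketched as follows. The transformer names and parameters follow the slides; the sample data and the explicit feature lists are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the feature set (made-up values).
X = pd.DataFrame({
    "Age": [22.0, 35.0, np.nan, 49.0],
    "Flight Distance": [127.0, 800.0, 1500.0, 3945.0],
    "Class": ["Eco", "Business", "Eco", "Eco"],
})
numeric_features = ["Age", "Flight Distance"]
categorical_features = ["Class"]

# Numeric columns: mean imputation, then standardization.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical columns: constant imputation, then one-hot encoding.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own transformer.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

X_processed = preprocessor.fit_transform(X)
print(X_processed.shape)  # 2 scaled numeric columns + 2 one-hot columns
```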
5.3.4. Combining Preprocessing and Modeling
Preprocessor:
• Perform preprocessing of all data before feeding into the model.
RandomForestClassifier:
• Classification model using the random forest algorithm.
• Number of trees (n_estimators) = 100.
• random_state=42 to ensure reproducible results.
5.3.5. Model Training
Procedure:
• Data preprocessing (imputation, scaling,
encoding).
• Training the RandomForestClassifier model.
Purpose:
• Ensure consistent transformations, reducing the
risk of data leakage and simplifying predictions
for X_test.
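A sketch of the full procedure, assuming the pipeline layout described in 5.3.1 to 5.3.4. The data is made up; the unseen category "First" in X_test is included only to show that the fitted pipeline applies the same transformations at prediction time without failing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data (made-up values).
X_train = pd.DataFrame({
    "Age": [22.0, 35.0, np.nan, 49.0, 27.0, 41.0, 33.0, 58.0],
    "Flight Distance": [127.0, 800.0, 1500.0, 3945.0, 560.0, 2100.0, 950.0, 300.0],
    "Class": ["Eco", "Business", "Eco", "Business", "Eco", "Business", "Eco Plus", "Eco"],
})
y_train = pd.Series([0, 1, 0, 1, 0, 1, 0, 0])

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["Age", "Flight Distance"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Class"]),
])

# Preprocessing and model in one object: fit() learns the imputation means,
# scaling parameters and categories from the training data only.
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)

# predict() reuses the fitted transformations on new data.
X_test = pd.DataFrame({
    "Age": [30.0, 45.0],
    "Flight Distance": [400.0, 3000.0],
    "Class": ["Eco", "First"],  # "First" was never seen during training
})
predictions = model.predict(X_test)
print(predictions)
```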
The Output
Numerical data processing (num):
• SimpleImputer replaces missing values with the
mean.
• StandardScaler normalizes the data, bringing
the values to the same scale to improve the
efficiency of the machine learning algorithm.
Categorical data processing (cat):
• SimpleImputer replaces missing values with a
default value ('missing').
• OneHotEncoder converts categorical values
into one-hot encoding, creating representative
binary features.
Combining and training:
• ColumnTransformer combines both types of
data (numeric and categorical) after being
processed and feeds them into the
RandomForestClassifier model
06
MODEL
TRAINING
6.1. Random Forest Classifier
Import:
• Random Forest Classifier:
Uses multiple decision trees
to improve predictive
performance and control
overfitting.
Creating the Random Forest Model:
o Pipeline
o Preprocessor
o Classifier
• n_estimators=100: the number of
decision trees in the forest
• random_state=42: Ensures the
reproducibility of the random selection
of features and samples for each tree.
Evaluating Model Performance:
• The output "Random Forest Accuracy: 1.00" indicates that the Random Forest model achieved 100% accuracy on the data it was evaluated on. Because a random forest can memorize its training data, a perfect score should be confirmed on held-out data before it is treated as evidence of generalization.
• The model appears to prioritize aspects of the travel experience (flight distance, customer age, and service quality) over categorical variables such as gender, customer loyalty, or travel class.
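Feature priorities like those described above are typically read from the feature_importances_ attribute of a fitted random forest. The sketch below uses synthetic data in which the label depends only on the wifi rating, so the ranking it prints is an illustration and not the result from the slides.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the processed feature set.
X = pd.DataFrame({
    "Flight Distance": rng.integers(100, 4000, n),
    "Inflight wifi service": rng.integers(1, 6, n),
    "Gender_Male": rng.integers(0, 2, n),
})
# Synthetic label: satisfied when the wifi rating is 4 or 5.
y = (X["Inflight wifi service"] >= 4).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; larger values mean the feature was used more for splits.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```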
6.2. Logistic Regression
Import:
• Logistic Regression: This is a popular algorithm used for binary classification.
Model Evaluation:
• 73% of the model's predictions on the data are correct.
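A minimal sketch of training and scoring a logistic regression model. The data is synthetic, so the accuracy it prints is unrelated to the 73% reported above; max_iter=1000 and the 80/20 split are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary classification data: the label depends mostly on feature 0.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Accuracy = share of correct predictions on the held-out rows.
accuracy = accuracy_score(y_val, log_reg.predict(X_val))
print(f"Logistic Regression Accuracy: {accuracy:.2f}")

# One coefficient per feature; the sign shows the direction of the effect.
print(log_reg.coef_)
```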
6.3. Neural Network (MLP) Classifier
Importing the Multi-layer Perceptron classifier
• This line imports the MLPClassifier, which is a type of neural network model.
MLP Classifier
• hidden_layer_sizes=(100,): one
hidden layer with 100 neurons.
• max_iter=300: The maximum
number of iterations for training.
The model will stop if it converges
before reaching this number.
Making Prediction
• The output "Neural Network
Accuracy: 0.78" indicates that the
Multi-layer Perceptron (MLP)
classifier achieved an accuracy of
78% on the dataset.
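The MLP configuration above can be sketched as follows. hidden_layer_sizes and max_iter follow the slides; random_state=42 and the synthetic data are assumptions, so the printed accuracy is unrelated to the 78% reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic data whose label depends on an interaction between two features.
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One hidden layer with 100 neurons, at most 300 training iterations.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)

accuracy = accuracy_score(y_val, mlp.predict(X_val))
print(f"Neural Network Accuracy: {accuracy:.2f}")
print(f"Iterations used: {mlp.n_iter_}")
```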
07
MODEL
EVALUATION &
COMPARISON
7. Model Evaluation & Comparison
• Logistic Regression offers simplicity with a 73% training accuracy. As a linear model it is less prone to overfitting, though generalization still has to be checked on held-out data.
• Its interpretability allows airlines to clearly see how features like flight duration or service quality relate to satisfaction.
• For example, a negative coefficient for flight duration would indicate that longer flights are associated with lower satisfaction.
Random Forest achieved a perfect 100% accuracy. It can handle complex, high-dimensional data and offers interpretability through feature importance scores. The model captures interactions between features through the structure of its many trees. For instance, it could learn that In-Flight Entertainment is a crucial factor for long flights (> 4 hours), but less so for short flights.
The Neural Network (MLP) achieved 78% accuracy. It is particularly effective when satisfaction is influenced by a mix of factors in complex ways, like the interaction between flight delays and customer service quality.
08
CONCLUSION
This study aimed to uncover the key factors influencing airline
passenger satisfaction and identify areas where airlines can enhance
their services to improve customer loyalty. We analyzed a dataset that
included various factors such as inflight entertainment, Wi-Fi service,
seat comfort, departure and arrival delays, and customer
segmentation. Airlines should focus on improving inflight services,
minimizing delays, and customizing their offerings to meet diverse
customer needs.
THANKS
for
listening!