Predicting Employee Salaries Using Demographic and Professional Features:
A Comparative Analysis of Machine Learning Models
Introduction
Salary prediction has become an increasingly important area of research, particularly with the growing use of
machine learning techniques to analyze workforce data. Predicting salaries based on demographic and
professional factors, such as age, gender, education level, job title, and years of experience, has significant
implications for various stakeholders. Accurate salary prediction models can assist employers in making data-
driven compensation decisions, help employees assess potential earnings, and provide valuable insights for
researchers and policymakers seeking to understand labor market trends and disparities.
Machine learning offers advanced tools to extract patterns from large datasets, making it an effective
approach for estimating salaries. However, these techniques also present challenges, particularly regarding
fairness and bias. When demographic data such as gender and age are incorporated, predictive models risk
perpetuating or even amplifying existing disparities. It is therefore crucial that such models are not only
accurate but also fair and transparent, so that outcomes are equitable across demographic groups.
This research aims to develop a salary prediction model that incorporates demographic and professional
factors while focusing on fairness and interpretability. The objective is to create a machine learning model that
can accurately predict salaries while minimizing bias and providing transparency in its decision-making
process. This study seeks to bridge the gap in existing salary prediction models by integrating fairness
mechanisms with predictive accuracy, addressing the pressing need for responsible and equitable models in
the field.
Literature Review
Salary prediction using machine learning techniques has been a topic of significant interest, particularly in the
context of improving compensation transparency and equity. A number of studies have utilized demographic
and professional features, such as age, gender, education, and years of experience, to develop predictive
models. This section will summarize key research on salary prediction models, the challenges associated with
fairness and bias, and the methodologies used to improve model performance and interpretability.
Machine Learning Approaches for Salary Prediction
Machine learning techniques, particularly regression models, decision trees, and ensemble methods, have
been widely used for salary prediction. These models leverage various features such as education level, job
title, and experience to predict salary outcomes. One common approach is the use of linear regression, which
models the relationship between input features (such as years of experience or education level) and the salary
(Das, Barik et al. 2020). However, linear models may fail to capture complex, non-linear relationships present
in the data, leading to suboptimal predictions.
To address this, researchers have increasingly turned to more complex algorithms such as random forests
(Gao, Wen et al. 2019) and support vector machines (SVMs) (Quan and Raheem 2022). These methods have
shown promising results in capturing non-linear relationships between features and salary, improving
predictive accuracy. Additionally, ensemble methods, such as gradient boosting machines (GBM), have been
found to perform particularly well in salary prediction tasks by combining multiple weak learners to create a
strong predictive model (Chung, Yun et al. 2023, Chen, Peng et al. 2024). These more flexible
techniques tend to outperform linear baselines, particularly when handling large, diverse datasets with
multiple features.
Bias and Fairness in Salary Prediction Models
While machine learning models offer powerful tools for salary prediction, they also bring attention to issues of
bias and fairness. Several studies have highlighted that salary prediction models may inadvertently perpetuate
biases present in the training data. For example, gender and age biases in salary datasets can result in
discriminatory predictions, disadvantaging certain demographic groups. Gender bias, in particular, has been
widely studied, with research showing that models trained on historical salary data often reflect existing wage
gaps between men and women (Blau and Kahn 2017, Blau and Kahn 2020). Such biases can be harmful and
lead to unjust compensation practices, which is why addressing bias is crucial for ensuring fairness in salary
prediction models.
One approach to mitigating bias in machine learning models is fairness-aware learning. This involves
incorporating fairness constraints into the model’s training process to ensure that predictions do not
disproportionately favor certain demographic groups. Several fairness metrics have been proposed, such as
demographic parity and equalized odds, which assess whether the model treats different groups equally in
terms of prediction outcomes (Hardt, Price et al. 2016). However, applying fairness constraints often involves
trade-offs with model accuracy, which can complicate the model development process.
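To make the idea of a group fairness metric concrete, the following is a minimal sketch of a demographic parity check adapted to a regression setting. Demographic parity is usually defined for classifiers; comparing mean predicted salary across groups is one common analogue and is an assumption here, not a method taken from the cited work. The function name `demographic_parity_gap` and the example values are hypothetical. Equalized odds additionally conditions on the true outcome and is not shown.

```python
import numpy as np

def demographic_parity_gap(predictions, groups):
    """Return the largest difference in mean prediction between groups,
    together with the per-group means."""
    predictions = np.asarray(predictions, dtype=float)
    groups = np.asarray(groups)
    group_means = {
        g: float(predictions[groups == g].mean())
        for g in np.unique(groups).tolist()
    }
    gap = max(group_means.values()) - min(group_means.values())
    return gap, group_means

# Hypothetical predicted salaries for six employees (illustrative values only).
preds = [50000, 62000, 58000, 51000, 60000, 55000]
gender = ["F", "M", "M", "F", "M", "F"]

gap, means = demographic_parity_gap(preds, gender)
print(means)  # mean predicted salary per group
print(gap)    # difference between the highest and lowest group mean
```

A gap of zero would mean both groups receive the same average prediction. A non-zero gap is not by itself proof of unfairness, since groups may differ in experience or role, which is why it is usually read alongside error-based metrics.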
Interpretability and Transparency in Salary Prediction
Another critical challenge in salary prediction using machine learning is ensuring interpretability. While
complex models like random forests and gradient boosting often produce more accurate results, they are also
more difficult to interpret. This lack of interpretability can undermine trust in the model’s predictions,
especially in sensitive applications like salary forecasting, where transparency is vital.
Several techniques have been proposed to improve the interpretability of machine learning models, such as
LIME (Local Interpretable Model-Agnostic Explanations) and SHAP (SHapley Additive exPlanations)
(Bramhall, Horn et al. 2020). These methods help to explain how individual features contribute to a model's
predictions, enabling users to understand the reasons behind salary estimates. Interpretability is particularly
important in applications like salary prediction, as it ensures that the model’s decisions can be understood and
justified, reducing the risk of unintended consequences, such as reinforcing stereotypes or biases.
Challenges in Salary Prediction
Despite the progress made in the development of machine learning models for salary prediction, several
challenges remain. One challenge is the availability and quality of data. Salary data often comes with issues
such as missing values, inconsistencies, or underrepresentation of certain demographic groups. Incomplete or
biased datasets can undermine the effectiveness and fairness of machine learning models (Bramhall, Horn et
al. 2020).
Another challenge is the generalizability of salary prediction models. Models trained on data from one
industry or region may not perform well when applied to other contexts. This is especially problematic when
attempting to create a universal salary prediction model that works across various sectors and geographic
locations. Researchers have suggested the use of transfer learning and domain adaptation techniques
(Patricia and Caputo 2014) to address this issue, allowing models to leverage knowledge learned in one
domain and apply it to another.
Opportunities for Future Research
There are several opportunities for advancing research in salary prediction. First, more attention needs to be
paid to developing models that balance accuracy and fairness. While achieving high predictive accuracy is
important, it should not come at the expense of fairness. Future research could explore new algorithms that
incorporate fairness-aware learning while maintaining strong performance.
Second, improving the interpretability of complex machine learning models in salary prediction is crucial for
gaining trust and ensuring fairness. By making models more transparent, employers and policymakers will be
better equipped to understand the reasons behind salary predictions, which can help mitigate biases and
increase accountability.
Lastly, the use of alternative data sources—such as social media profiles, company reviews, and other online
data—could be explored to improve salary prediction models. These data sources may provide additional
insights into candidates’ skills, job performance, and market trends, enriching the features used to predict
salaries.
Methodology
The methodology chapter outlines the systematic approach that will be used to carry out the research project
on salary prediction based on demographic and professional attributes such as age, gender, education level,
years of experience, and job title. The research will follow a quantitative methodology, utilizing various
machine learning techniques to predict salaries based on these features. This chapter explains the research
design, data collection process, data preprocessing steps, model selection, and evaluation criteria used to
assess the effectiveness of the models.
Research Design
This research employs a quantitative research design with an emphasis on predictive modeling. The objective
is to develop a machine learning model that predicts salaries of individuals based on demographic and
professional characteristics. The research methodology can be broken down into the following stages:
1. Data Collection and Preprocessing
2. Model Development
3. Model Evaluation
4. Interpretation of Results
The primary aim is to build a robust predictive model capable of estimating salaries for a range of job roles
across various educational backgrounds and experience levels.
Data Collection
The dataset used in this research is sourced from Kaggle and contains 6704 entries, which include the
following variables:
Age: The age of the employee.
Gender: The gender of the employee.
Education Level: The educational qualification of the employee.
Job Title: The role or position held by the employee.
Years of Experience: The number of years the employee has worked in the field.
Salary: The monthly salary of the employee.
The data is publicly available and obtained from multiple sources such as surveys, job posting sites, and other
publicly available datasets. These data points are considered relevant for understanding how different factors
influence salary levels in various professional contexts. For the purposes of this dissertation, the dataset will
be used in its entirety, although this alone does not guarantee that it is representative of the broader
population of employees in the relevant job roles.
Data Preprocessing
Data preprocessing is a critical step in ensuring that the dataset is clean, consistent, and ready for analysis.
The preprocessing steps include:
Handling Missing Data: Missing data, if present, will be identified and handled. In cases where the
missing data is minimal, imputation techniques such as mean or median substitution will be used. If a
large portion of the data for a specific feature is missing, that feature may be excluded from the
dataset.
Encoding Categorical Data: Several features, such as Gender, Education Level, and Job Title, are
categorical. These will be converted into numerical representations using techniques like One-Hot
Encoding for multi-class categorical features like Job Title and Education Level. Label Encoding will be
applied to Gender as it is binary (Male/Female).
Feature Scaling: Numerical features such as Age, Years of Experience, and Salary will be rescaled
using either Min-Max Scaling or Standardization. Putting these features on a comparable scale ensures
that scale-sensitive models are not dominated by the features with the largest numeric ranges.
Feature Engineering: New features will be created where applicable. For instance, interaction terms
between Years of Experience and Job Title will be generated to capture non-linear relationships that
may exist between experience level and salary. Additionally, polynomial features may be considered
to capture any complex trends in the data.
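The preprocessing steps above can be sketched as a single scikit-learn pipeline. This is an illustrative sketch under stated assumptions, not the code used in the study: the four-row data frame is made up, the column names simply follow the dataset description, median imputation and standardization are chosen from the options listed above, and `OneHotEncoder(drop="if_binary")` is used so that the binary Gender column collapses to one 0/1 column (equivalent in effect to label encoding).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample; values are illustrative only.
df = pd.DataFrame({
    "Age": [25, 32, np.nan, 45],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "Master's", "PhD", "Master's"],
    "Years of Experience": [2.0, 7.0, 10.0, np.nan],
})

numeric_cols = ["Age", "Years of Experience"]
categorical_cols = ["Gender", "Education Level"]

# Numeric features: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    # Binary categories become one column, multi-class ones get one column each.
    ("cat", OneHotEncoder(drop="if_binary"), categorical_cols),
])

X = preprocess.fit_transform(df)
X = X.toarray() if hasattr(X, "toarray") else X  # densify if sparse
print(X.shape)
```

Fitting the transformer on the training split only, and reusing it on the validation split, avoids leaking validation statistics into the imputation and scaling steps.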
Model Selection
To predict salaries, multiple machine learning algorithms will be evaluated, each offering unique advantages
depending on the complexity of the data and the relationships within it. The following models will be tested:
Linear Regression: A simple yet effective model that establishes a relationship between the
dependent variable (Salary) and the independent variables (Age, Gender, Education Level, Job Title,
Years of Experience). Linear regression will be used as a baseline model to evaluate more complex
models.
Decision Trees: A decision tree is a non-linear model that works by splitting the data into subsets
based on the most informative features. It is interpretable, since the tree can be visualized to show
how decisions are made. Decision trees are expected to capture non-linear relationships better than
linear regression.
Random Forest: An ensemble learning method that creates multiple decision trees and averages their
predictions. Random forests are less prone to overfitting than a single decision tree, and they tend to
provide better generalization, especially on large datasets.
Gradient Boosting Machines (GBM): Advanced ensemble methods such as XGBoost and LightGBM
will be used to improve model performance. These models build trees sequentially, where each tree
corrects the errors made by the previous one. They are highly effective in handling complex datasets.
Support Vector Machines (SVM): If the relationship between the features and salary is highly non-
linear, SVMs will be explored. SVM can efficiently handle complex decision boundaries by
transforming the feature space into higher dimensions using kernel tricks.
Each model will be trained and evaluated on the dataset, and their performance will be compared based on a
variety of evaluation metrics.
Model Training and Hyperparameter Tuning
Once the models are selected, they will be trained using the training set (80% of the total dataset), and their
performance will be tested on the validation set (20%). Hyperparameter tuning will be performed to optimize
model performance. The tuning process will involve adjusting the model’s hyperparameters, such as the
number of trees in a Random Forest or the learning rate in Gradient Boosting.
Hyperparameter tuning will be done using Grid Search and Random Search, techniques that exhaustively or
randomly explore a range of hyperparameters and identify the best configuration.
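As an illustration of the split and tuning procedure described above, the sketch below runs a small grid search over a Random Forest on synthetic data. The synthetic features, the salary formula, and the parameter grid are assumptions made for the example; they are not the study's data or its actual search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in features: column 0 = age, column 1 = years of experience.
X = rng.uniform([22, 0], [60, 30], size=(200, 2))
y = 30000 + 200 * X[:, 0] + 2500 * X[:, 1] + rng.normal(0, 3000, size=200)

# 80% for training, 20% held out for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Deliberately small, hypothetical grid to keep the example fast.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

print(search.best_params_)
# R-squared of the refitted best model on the held-out 20%.
val_r2 = search.best_estimator_.score(X_val, y_val)
print(val_r2)
```

Grid search tries every combination in `param_grid`; scikit-learn's `RandomizedSearchCV` takes the same estimator and samples a fixed number of combinations instead, which scales better to large search spaces.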
Model Evaluation
To assess the performance of the predictive models, several evaluation metrics will be used:
Mean Absolute Error (MAE): This metric measures the average magnitude of errors between
predicted and actual salary values. MAE gives a clear understanding of the magnitude of error in
predictions.
Root Mean Squared Error (RMSE): RMSE penalizes larger errors more heavily than MAE and is useful
when trying to minimize large discrepancies in salary prediction.
R-Squared (R²): This metric represents the proportion of variance in the dependent variable (Salary)
that can be explained by the independent variables. A higher R² value indicates better model
performance.
Cross-Validation: To ensure that the models are not overfitting to the training data, k-fold cross-
validation will be used. This technique splits the dataset into k subsets, training the model on k-1
subsets and testing it on the remaining one. This process is repeated k times to validate the model’s
generalization performance.
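The three error metrics listed above follow directly from their definitions. The sketch below computes them with NumPy on four made-up salary values; the function name `regression_metrics` and the numbers are illustrative assumptions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}

# Hypothetical actual and predicted salaries (illustrative values only).
actual = [50000, 60000, 70000, 80000]
predicted = [52000, 58000, 71000, 85000]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

In this example the single 5000 error pulls RMSE above MAE, which shows how RMSE weights large discrepancies more heavily.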
Bias and Fairness Considerations
As part of the evaluation, attention will be given to any potential biases in salary predictions related to gender,
age, or education level. Disparities in salary predictions among different demographic groups will be closely
examined to ensure fairness. Techniques like fairness constraints or re-sampling methods may be used if
significant bias is detected.
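One simple way to examine such disparities is to compare the average signed prediction error per demographic group. The sketch below is a hedged illustration: the helper name `error_by_group` and the values are hypothetical, and a real audit would use held-out predictions and larger groups.

```python
import numpy as np

def error_by_group(y_true, y_pred, groups):
    """Mean signed error (prediction minus actual) for each group.
    Negative values mean the group is under-predicted on average."""
    residual = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    groups = np.asarray(groups)
    return {
        g: float(residual[groups == g].mean())
        for g in np.unique(groups).tolist()
    }

# Hypothetical actual and predicted salaries (illustrative values only).
actual = [50000, 60000, 55000, 65000]
predicted = [48000, 63000, 54000, 66000]
gender = ["F", "M", "F", "M"]

group_error = error_by_group(actual, predicted, gender)
print(group_error)
```

If one group is systematically under-predicted while another is over-predicted, as in this toy example, that pattern is a signal to consider the fairness constraints or re-sampling methods mentioned above.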
Limitations of the Study
There are several limitations in the proposed research:
Size of the Dataset: Although 6704 records are substantial, a larger and more diverse dataset could
provide more accurate insights and generalizable results.
Absence of Additional Factors: Key factors such as geographic location, industry type, and company
size are missing from the dataset. These factors can significantly impact salary levels and will be a
limitation in this study.
Data Quality: Some of the data may be noisy or incomplete, which could affect model performance.
Interpretation of the Code and Results:
Data Preprocessing:
1. The dataset was first cleaned by removing rows with missing values.
2. Categorical variables such as Gender were encoded using Label Encoding, and Education Level was
one-hot encoded into separate binary columns for each education category (e.g., Master's, PhD).
3. These preprocessing steps ensured that the dataset is ready for model training by converting non-
numeric data into a form that machine learning algorithms can process.
Model Training and Evaluation:
1. Linear Regression: A basic machine learning model was applied to predict employee salary based on
features like Age, Years of Experience, Gender, and Education Level.
a. The MSE and R-squared values were calculated, where R-squared indicated how well the
model fit the data. A higher R-squared implies better model performance.
2. Random Forest Regressor: This more complex model, using an ensemble of decision trees, was used
to predict salaries.
a. It also provided feature importance scores, showing which factors contributed most to salary
prediction.
3. Support Vector Regressor (SVR): A third model was used to evaluate salary predictions. SVR uses
kernel functions to implicitly map the features into a higher-dimensional space, which makes it
particularly effective when the relationship between input features and output is nonlinear.
Model Evaluation Metrics:
1. MSE (Mean Squared Error): This value measures how well each model predicts the salary. A lower
MSE means better model performance.
2. R-squared: This metric tells us how much of the variance in salary can be explained by the features
used in the model. A higher R-squared indicates that the model has a better fit.
Model Comparison:
1. The results of R-squared and MSE from the three models were compared in a table and visualized.
This comparison helps in selecting the best-performing model for salary prediction. Random Forest
generally performed better in terms of R-squared, meaning it explained more variance in salary
prediction.
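A comparison of this kind can be sketched as a loop over the three models that records MSE and R-squared on a held-out split. The data below are synthetic and the SVR settings (`C=1000.0`, RBF kernel) are assumptions chosen for the example, so the resulting numbers say nothing about the study's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(5)
# Synthetic stand-in: column 0 = age, column 1 = years of experience.
X = rng.uniform([22, 0], [60, 30], size=(300, 2))
# Non-linear synthetic salary: the effect of experience flattens over time.
y = 40000 + 60000 * np.sqrt(X[:, 1] / 30) + rng.normal(0, 3000, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # SVR is scale-sensitive, so it is wrapped with a scaler.
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1000.0)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MSE": float(mean_squared_error(y_test, pred)),
        "R2": float(r2_score(y_test, pred)),
    }

for name, scores in results.items():
    print(name, scores)
```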
Feature Importance (Random Forest):
1. Random Forest's feature importance results showed which variables most contributed to the salary
predictions. Education Level, Years of Experience, and Age were the most important factors. These
insights are useful for understanding which features influence salary more.
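For illustration, impurity-based importances can be read from a fitted scikit-learn Random Forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which salary depends only on experience, so the ranking is known in advance; the feature names and coefficients are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
experience = rng.uniform(0, 30, n)
age = rng.uniform(22, 60, n)
noise_feature = rng.normal(size=n)  # unrelated to salary by construction

X = np.column_stack([experience, age, noise_feature])
# Synthetic salary driven by experience only.
y = 30000 + 3000 * experience + rng.normal(0, 2000, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

names = ["Years of Experience", "Age", "Noise"]
ranked = sorted(
    zip(names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These scores sum to one and describe how much each feature reduced impurity during training. When a categorical variable is one-hot encoded, its importance is spread over several columns, which should be kept in mind when comparing it with numeric features.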
Cross-Validation:
1. Cross-validation was performed to validate the performance of the models. By splitting the data into
multiple folds, training the model on all but one fold, and validating it on the held-out fold in turn,
we obtain a more reliable measure of how well the model generalizes to unseen data. The results from
cross-validation gave us an average estimate of model performance, measured by MSE.
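A minimal sketch of k-fold cross-validation with scikit-learn is given below, using synthetic data and a linear model as stand-ins. Note that scikit-learn reports this score as negative MSE so that larger values are always better; the sign is flipped to recover the usual MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
# Synthetic stand-in: one feature (years of experience) and a noisy salary.
X = rng.uniform(0, 30, size=(100, 1))
y = 30000 + 2500 * X[:, 0] + rng.normal(0, 3000, size=100)

# Five folds: each fold is held out once while the other four train the model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
)

mse_per_fold = -neg_mse
print(mse_per_fold)
print(mse_per_fold.mean())  # average estimate of out-of-sample MSE
```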
Residuals Analysis:
1. Residuals vs Fitted Values and Histogram of Residuals plots were used to check for homoscedasticity
(constant variance) and normality of the residuals. This step helps check whether the assumptions of
regression models are met. The Linear Regression residuals were roughly normally distributed,
which supports the normality assumption.
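Alongside the plots, a rough numeric check of the same assumptions is possible. The sketch below fits a one-feature linear model to synthetic data and compares the residual spread for low and high fitted values; the split into two halves and the variable names are assumptions made for this illustration, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic stand-in: years of experience and a salary with constant noise.
x = rng.uniform(0, 30, 200)
y = 30000 + 2500 * x + rng.normal(0, 3000, 200)

# Ordinary least squares fit of a straight line (highest power first).
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Compare residual spread in the lower and upper halves of the fitted values.
order = np.argsort(fitted)
low_spread = residuals[order[:100]].std()
high_spread = residuals[order[100:]].std()
spread_ratio = high_spread / low_spread

print(residuals.mean())  # close to zero for a least-squares fit with intercept
print(spread_ratio)      # near 1 suggests roughly constant variance
```

A ratio far from 1 would suggest that the error variance changes with the predicted salary, which is the pattern a funnel-shaped residual plot reveals visually.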
Learning Curve:
1. The learning curve, which plots training error and validation error as the training set size increases,
was plotted to assess how the model improves as more data is provided. A large, persistent gap between
training and validation error indicates overfitting, while high error on both curves indicates underfitting.
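The numbers behind such a curve can be obtained with scikit-learn's `learning_curve` function. The sketch below uses synthetic data and a Random Forest as a stand-in model; both are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(2)
# Synthetic stand-in data: two features, salary driven by the first one.
X = rng.uniform(0, 30, size=(200, 2))
y = 30000 + 2500 * X[:, 0] + rng.normal(0, 3000, size=200)

sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X,
    y,
    train_sizes=[0.2, 0.5, 1.0],  # fractions of the available training folds
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negative MSE; flip the sign and average over the folds.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mse, val_mse):
    print(f"n={n}: train MSE={tr:.0f}, validation MSE={va:.0f}")
```

Plotting `train_mse` and `val_mse` against `sizes` gives the learning curve; the distance between the two lines at the largest training size is the gap discussed above.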
Feature Scaling:
1. The StandardScaler was used to standardize the features (mean = 0, variance = 1), which often
improves model performance, especially for models like SVR that are sensitive to the scale of the
features.
Hyperparameter Tuning (GridSearchCV):
1. Hyperparameter tuning was done using GridSearchCV for the Random Forest model. It optimized
parameters like the number of estimators, max depth of trees, and minimum samples required to split
a node. The optimized model showed improved performance in terms of MSE and R-squared.
Clustering (KMeans):
1. KMeans clustering was applied to group the data into clusters based on features like Age, Years of
Experience, and Education Level. By visualizing salary distribution across clusters, the project revealed
that certain clusters tend to have higher or lower salaries, which might reflect different job categories
or career stages.
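The clustering step can be illustrated with a small sketch on synthetic data containing two artificial career stages. The group sizes, means, and salary levels are invented for the example, and two clusters are assumed only because the synthetic data were built that way.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Two synthetic career stages: columns are age and years of experience.
early = np.column_stack([rng.normal(26, 2, 50), rng.normal(3, 1, 50)])
senior = np.column_stack([rng.normal(50, 3, 50), rng.normal(25, 3, 50)])
X = np.vstack([early, senior])
salary = np.concatenate([
    rng.normal(50000, 5000, 50),
    rng.normal(120000, 10000, 50),
])

# Scale first so that age and experience contribute comparably to distances.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Salary is not used for clustering; it is only summarized per cluster.
mean_salary = {
    int(c): float(salary[labels == c].mean()) for c in np.unique(labels)
}
print(mean_salary)
```

Because salary is left out of the clustering features, differences in mean salary between clusters reflect how the demographic and professional features group together rather than a circular use of the target.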
Conclusion:
This project demonstrates the application of machine learning techniques for predicting employee
salary based on demographic and professional data, including features such as Age, Years of
Experience, Gender, and Education Level. By comparing three different models (Linear Regression,
Random Forest, and Support Vector Regressor), it was found that the Random Forest Regressor
performed the best, yielding the highest R-squared value, which means it explained the most variance
in salary prediction.
Key insights derived from the analysis include:
Education Level and Years of Experience were found to be the most influential features in
determining salary.
Random Forest provided not only the best predictive accuracy but also useful feature importance
metrics, allowing us to understand what drives salary differences.
Hyperparameter tuning improved model performance, and cross-validation helped confirm that the
models generalize well and aren't overfitting.
The KMeans clustering step revealed that different clusters have distinct salary distributions, which
could be indicative of different career stages or job types within the dataset.
This study provides valuable insights into salary prediction and can be further expanded by
considering other features such as job title, company size, or geographic location to improve
prediction accuracy. The results can be used by HR departments and recruitment agencies to better
understand salary trends and make data-driven decisions in employee compensation.
Interpretation and Insights
The code performs a comprehensive salary prediction analysis using various machine learning
models, providing insights into their comparative performance and underlying patterns in the data.
The dataset includes features such as age, years of experience, gender, and education level. Gender
was label-encoded, while education levels were one-hot encoded, and the features were standardized
to ensure compatibility across models. The models implemented include Linear Regression, Decision
Tree, Random Forest, XGBoost, LightGBM, and Support Vector Regressor (SVR). Each of these
models has distinct strengths, with Linear Regression assuming a linear relationship between features
and the target, Decision Tree capturing non-linear patterns, and ensemble methods like Random
Forest and XGBoost excelling at reducing overfitting and improving accuracy. LightGBM provides
an efficient gradient-boosting approach, while SVR is particularly adept at handling complex patterns
through kernel-based methods.
The models were evaluated using Mean Squared Error (MSE) and R-squared (R²). MSE measures the
prediction error, with lower values indicating better performance, while R² indicates the proportion of
variance in the target variable explained by the model. A comparison of these metrics across models
showed that ensemble methods like XGBoost and Random Forest generally outperform simpler
models due to their ability to handle complex feature interactions. Visualizing R² scores with bar plots
further highlighted the superior performance of these ensemble models.
Random Forest's feature importance analysis revealed the most influential predictors of salary, such as
years of experience and education level, offering valuable interpretability for the model's decision-
making process. Cross-validation was used to ensure the robustness of the models by evaluating their
performance on multiple data splits. Cross-validation was scored with negative mean squared error
(MSE), where values closer to zero are better, and the scores
highlighted the consistency of ensemble models compared to simpler alternatives. Furthermore, a
learning curve for XGBoost illustrated the relationship between training size and model error,
revealing insights into potential overfitting or underfitting. Smaller gaps between training and
validation errors indicated a well-generalized model.
Conclusion
In conclusion, ensemble models like XGBoost and Random Forest emerged as the best-performing
models, demonstrating their ability to capture complex patterns and provide consistent results.
Features such as years of experience and education level were identified as critical determinants of
salary, emphasizing the need for interpretable models. Although models like XGBoost and LightGBM
offered high accuracy, they required significant computational resources and careful tuning, whereas
simpler models like Linear Regression and Decision Tree were more straightforward but less effective
for this task. Future work should address bias and fairness considerations to ensure demographic
features, such as gender, do not introduce discrimination in salary predictions. Additionally,
incorporating more features, such as job industry or location, could enhance prediction accuracy. This
analysis highlights the importance of balancing performance, interpretability, and fairness to develop
reliable salary prediction models.
References
Blau, F. D. and L. M. Kahn (2017). "The gender wage gap: Extent, trends, and explanations." Journal of
Economic Literature 55(3): 789-865.
Blau, F. D. and L. M. Kahn (2020). The gender pay gap: Have women gone as far as they can? Inequality in the
United States, Routledge: 345-362.
Bramhall, S., et al. (2020). "Qlime-a quadratic local interpretable model-agnostic explanation approach." SMU
Data Science Review 3(1): 4.
Chen, Y., et al. (2024). A Model for Predicting Salaries in Big Data Roles: An Integration of Random Forest and
Adaboost-KNN Models. Proceedings of the 2024 4th International Conference on Artificial Intelligence, Big
Data and Algorithms.
Chung, D., et al. (2023). "Predictive model of employee attrition based on stacking ensemble learning." Expert
Systems with Applications 215: 119364.
Das, S., et al. (2020). "Salary prediction using regression techniques." Proceedings of Industry Interactive
Innovations in Science, Engineering & Technology (I3SET2K19).
Gao, X., et al. (2019). "An improved random forest algorithm for predicting employee turnover." Mathematical
Problems in Engineering 2019(1): 4140707.
Hardt, M., et al. (2016). "Equality of opportunity in supervised learning." Advances in Neural Information
Processing Systems 29.
Patricia, N. and B. Caputo (2014). Learning to learn, from transfer learning to domain adaptation: A unifying
perspective. Proceedings of the IEEE conference on computer vision and pattern recognition.
Quan, T. Z. and M. Raheem (2022). "Salary prediction in data science field using specialized skills and job
benefits–a literature." Journal of Applied Technology and Innovation 6(3): 70-74.