Read the Data
import pandas as pd
df = pd.read_csv('/content/Employee.csv')
df.head()
{"summary":"{\n \"name\": \"df\",\n \"rows\": 4653,\n \"fields\":
[\n {\n \"column\": \"Education\",\n \"properties\": {\n
\"dtype\": \"category\",\n \"num_unique_values\": 3,\n
\"samples\": [\n \"Bachelors\",\n \"Masters\",\n
\"PHD\"\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"JoiningYear\",\n \"properties\": {\n \"dtype\":
\"number\",\n \"std\": 1,\n \"min\": 2012,\n
\"max\": 2018,\n \"num_unique_values\": 7,\n
\"samples\": [\n 2017,\n 2013,\n 2012\n
],\n \"semantic_type\": \"\",\n \"description\": \"\"\n
}\n },\n {\n \"column\": \"City\",\n \"properties\":
{\n \"dtype\": \"category\",\n \"num_unique_values\":
3,\n \"samples\": [\n \"Bangalore\",\n
\"Pune\",\n \"New Delhi\"\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n },\n {\n \"column\": \"PaymentTier\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
0,\n \"min\": 1,\n \"max\": 3,\n
\"num_unique_values\": 3,\n \"samples\": [\n 3,\n
1,\n 2\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n
\"std\": 4,\n \"min\": 22,\n \"max\": 41,\n
\"num_unique_values\": 20,\n \"samples\": [\n 34,\n
35,\n 26\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"Gender\",\n \"properties\": {\n \"dtype\":
\"category\",\n \"num_unique_values\": 2,\n \"samples\":
[\n \"Female\",\n \"Male\"\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n },\n {\n \"column\": \"EverBenched\",\n
\"properties\": {\n \"dtype\": \"category\",\n
\"num_unique_values\": 2,\n \"samples\": [\n \"Yes\",\
n \"No\"\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"ExperienceInCurrentDomain\",\n \"properties\": {\n
\"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n
\"max\": 7,\n \"num_unique_values\": 8,\n \"samples\":
[\n 3,\n 4\n ],\n \"semantic_type\":
\"\",\n \"description\": \"\"\n }\n },\n {\n
\"column\": \"LeaveOrNot\",\n \"properties\": {\n
\"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n
\"max\": 1,\n \"num_unique_values\": 2,\n \"samples\":
[\n 1,\n 0\n ],\n \"semantic_type\":
\"\",\n \"description\": \"\"\n }\n }\n ]\
n}","type":"dataframe","variable_name":"df"}
df.describe()
{"summary":"{\n \"name\": \"df\",\n \"rows\": 8,\n \"fields\": [\n
{\n \"column\": \"JoiningYear\",\n \"properties\": {\n
\"dtype\": \"number\",\n \"std\": 1251.4547220922507,\n
\"min\": 1.8633768286863546,\n \"max\": 4653.0,\n
\"num_unique_values\": 8,\n \"samples\": [\n
2015.0629701267999,\n 2015.0,\n 4653.0\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n },\n {\n \"column\": \"PaymentTier\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
1644.2629844314426,\n \"min\": 0.5614354643364909,\n
\"max\": 4653.0,\n \"num_unique_values\": 5,\n
\"samples\": [\n 2.6982591876208897,\n 3.0,\n
0.5614354643364909\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"Age\",\n \"properties\": {\n \"dtype\": \"number\",\n
\"std\": 1635.8622898051515,\n \"min\": 4.826087009126065,\n
\"max\": 4653.0,\n \"num_unique_values\": 8,\n
\"samples\": [\n 29.393294648613796,\n 28.0,\n
4653.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"ExperienceInCurrentDomain\",\n \"properties\": {\n
\"dtype\": \"number\",\n \"std\": 1644.051605746371,\n
\"min\": 0.0,\n \"max\": 4653.0,\n
\"num_unique_values\": 8,\n \"samples\": [\n
2.905652267354395,\n 3.0,\n 4653.0\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n },\n {\n \"column\": \"LeaveOrNot\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
1644.9416023787396,\n \"min\": 0.0,\n \"max\": 4653.0,\n
\"num_unique_values\": 5,\n \"samples\": [\n
0.3438641736514077,\n 1.0,\n 0.47504747514881046\n
],\n \"semantic_type\": \"\",\n \"description\": \"\"\n
}\n }\n ]\n}","type":"dataframe"}
df.shape
(4653, 9)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64
 4   Age                        4653 non-null   int64
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64
 8   LeaveOrNot                 4653 non-null   int64
dtypes: int64(5), object(4)
memory usage: 327.3+ KB
df.isnull().sum()
Education 0
JoiningYear 0
City 0
PaymentTier 0
Age 0
Gender 0
EverBenched 0
ExperienceInCurrentDomain 0
LeaveOrNot 0
dtype: int64
df.duplicated().sum()
1889
The 1889 duplicated rows are left in place, so all 4653 rows flow into the train/test split below.
import matplotlib.pyplot as plt

# Bar chart of value counts for each categorical column
for col in df.select_dtypes(include=['object']).columns:
    counts = df[col].value_counts()
    plt.bar(counts.index, counts.values)
    plt.xlabel(col)
    plt.ylabel('Value Counts')
    plt.xticks(rotation=30)
    plt.show()
import seaborn as sns

# Box plot for each numeric column
for col in df.select_dtypes(include=['int64', 'float64']).columns:
    sns.boxplot(x=df[col])
    plt.show()
# Handling outliers: use box plots to inspect the numerical features
numerical_features = ['JoiningYear', 'PaymentTier', 'Age',
                      'ExperienceInCurrentDomain']
for feature in numerical_features:
    plt.figure(figsize=(10, 4))
    sns.boxplot(x=df[feature])
    plt.title(f'Boxplot of {feature}')
    plt.show()
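The box plots only show outliers visually. A minimal sketch of counting them with the conventional 1.5 x IQR (Tukey) fences; the threshold is an assumption, not something the original notebook fixes:

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per numeric feature.
# The 1.5 multiplier is the usual Tukey fence, assumed here.
for feature in numerical_features:
    q1, q3 = df[feature].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((df[feature] < lower) | (df[feature] > upper)).sum()
    print(f'{feature}: {n_out} potential outliers')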
import seaborn as sns
import matplotlib.pyplot as plt
# Visualizing the distribution of the target variable 'LeaveOrNot'
sns.countplot(x='LeaveOrNot', data=df)
plt.title('Distribution of LeaveOrNot')
plt.show()
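To attach a number to the imbalance the countplot shows (roughly 66% stay vs. 34% leave, consistent with the LeaveOrNot mean of 0.34 above):

df['LeaveOrNot'].value_counts(normalize=True)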
Data Cleaning and Preprocessing
The missing-value check above already came back clean (0 nulls in every column), so no imputation is needed; the remaining preprocessing step is encoding the categorical columns.
Encode Categorical Variables
from sklearn.preprocessing import LabelEncoder

# Convert each object column to integer codes
obj_cols = df.select_dtypes(object).columns
for col in obj_cols:
    encoder = LabelEncoder()
    df[col] = encoder.fit_transform(df[col])
df.dtypes
Education int64
JoiningYear int64
City int64
PaymentTier int64
Age int64
Gender int64
EverBenched int64
ExperienceInCurrentDomain int64
LeaveOrNot int64
dtype: object
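LabelEncoder imposes an arbitrary integer order on nominal columns such as City; the tree ensembles below tolerate that, but the distance- and margin-based models (logistic regression, KNN, SVC) can be misled by it. A sketch of the one-hot alternative, applied to the raw frame rather than what this notebook uses downstream:

# Alternative: one-hot encode the nominal columns on a fresh copy of the data.
raw = pd.read_csv('/content/Employee.csv')
df_onehot = pd.get_dummies(
    raw, columns=['Education', 'City', 'Gender', 'EverBenched'], drop_first=True)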
Split the Data
x = df.drop(columns=['LeaveOrNot']) # Features
y = df['LeaveOrNot'] # Target variable
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.8)
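The split above is unseeded, so every rerun draws a different 20% test set and the scores below will wobble. A reproducible, class-stratified variant (the random_state value is arbitrary):

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, train_size=0.8, stratify=y, random_state=42)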
Model Selection
# Fit a model and report train/test accuracy
def eval_model(model, xtrain, ytrain, xtest, ytest):
    model.fit(xtrain, ytrain)
    train_score = model.score(xtrain, ytrain)
    test_score = model.score(xtest, ytest)
    print(f'Train Score: {train_score}')
    print(f'Test Score: {test_score}')
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
eval_model(model, xtrain, ytrain, xtest, ytest)
Train Score: 0.7063406770553465
Test Score: 0.7175080558539205
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
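The warning is worth heeding: JoiningYear sits near 2015 while every other feature is a single digit, so lbfgs hits its iteration cap. A sketch of the standard fix, standardizing inside a pipeline (the max_iter bump is an extra assumption):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features before the solver; max_iter raised as a safety margin.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
eval_model(model, xtrain, ytrain, xtest, ytest)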
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
eval_model(model, xtrain, ytrain, xtest, ytest)
Train Score: 0.9306824288017195
Test Score: 0.8055853920515574
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
eval_model(model, xtrain, ytrain, xtest, ytest)
Train Score: 0.8382590005373455
Test Score: 0.7937701396348013
from sklearn.svm import SVC
model = SVC()
eval_model(model, xtrain, ytrain, xtest, ytest)
Train Score: 0.6534121440085975
Test Score: 0.6670247046186896
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
eval_model(model, xtrain, ytrain, xtest, ytest)
Train Score: 0.6775926921010209
Test Score: 0.6595059076262084
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
eval_model(model, xtrain, ytrain, xtest, ytest)
Train Score: 0.9306824288017195
Test Score: 0.8281417830290011
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
eval_model(model, xtrain, ytrain, xtest, ytest)
Train Score: 0.8559914024717894
Test Score: 0.8453276047261009
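All of these scores come from a single 80/20 draw. A cross-validated comparison is a more stable basis for model selection; a sketch with an assumed 5 folds:

from sklearn.model_selection import cross_val_score

# Mean accuracy +/- spread across folds for each candidate model
for clf in [LogisticRegression(max_iter=1000), DecisionTreeClassifier(),
            RandomForestClassifier(), GradientBoostingClassifier()]:
    scores = cross_val_score(clf, x, y, cv=5)
    print(f'{clf.__class__.__name__}: {scores.mean():.3f} +/- {scores.std():.3f}')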
Model Training
model = RandomForestClassifier()
model.fit(xtrain, ytrain)
RandomForestClassifier()
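Before evaluating, it can be informative to see which features the fitted forest actually relies on; a minimal sketch via feature_importances_:

# Rank features by the forest's impurity-based importances
importances = pd.Series(model.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))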
Model Evaluation
trainpred = model.predict(xtrain)
testpred = model.predict(xtest)
from sklearn.metrics import classification_report
print(classification_report(ytrain, trainpred))
              precision    recall  f1-score   support

           0       0.92      0.98      0.95      2432
           1       0.95      0.84      0.89      1290

    accuracy                           0.93      3722
   macro avg       0.94      0.91      0.92      3722
weighted avg       0.93      0.93      0.93      3722
print(classification_report(ytest, testpred))
              precision    recall  f1-score   support

           0       0.86      0.90      0.88       621
           1       0.79      0.70      0.74       310

    accuracy                           0.84       931
   macro avg       0.82      0.80      0.81       931
weighted avg       0.83      0.84      0.83       931
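A confusion matrix makes the test-set trade-off (strong recall on class 0, weaker recall on class 1) easier to see at a glance. A minimal sketch:

from sklearn.metrics import ConfusionMatrixDisplay

# Raw counts of correct and incorrect predictions per class
ConfusionMatrixDisplay.from_predictions(ytest, testpred)
plt.title('Random Forest - Test Set')
plt.show()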