
Assignment – 1

Perform the following tasks on the shared dataset:


a) Basic Exploratory Data Analysis

b) Visualization

c) Preprocessing

d) Apply Model

e) Evaluate

f) Tune hyperparameters

g) Save the model parameters

• Name: Nishith Dubey


• Enrollment Number: 0801CS231087
import numpy as np
import pandas as pd

df = pd.read_csv('/content/01_Student Final Grade Prediction-Multi_lin_reg - 01_Student Final Grade Prediction-Multi_lin_reg.csv')

##EDA

df.head()

{"type":"dataframe","variable_name":"df"}

df.tail()

{"type":"dataframe"}

df.shape

(395, 33)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null object
1 gender 394 non-null object
2 age 392 non-null float64
3 address 392 non-null object
4 famsize 394 non-null object
5 Parrent_status 395 non-null object
6 Mother_edu 394 non-null float64
7 Father_edu 394 non-null float64
8 Mother_job 394 non-null object
9 Father_job 392 non-null object
10 reason_to_chose_school 392 non-null object
11 guardian 393 non-null object
12 traveltime 393 non-null float64
13 weekly_studytime 394 non-null float64
14 failures 393 non-null float64
15 extra_edu_supp 394 non-null object
16 family_edu_supp 395 non-null object
17 extra_paid_class 394 non-null object
18 extra_curr_activities 393 non-null object
19 nursery 394 non-null object
20 Interested_in_higher_edu 394 non-null object
21 internet_access 394 non-null object
22 romantic_relationship 394 non-null object
23 Family_quality_reln 394 non-null float64
24 freetime_after_school 395 non-null int64
25 goout_with_friends 395 non-null int64
26 workday_alcohol_consum 395 non-null int64
27 weekend_alcohol_consum 395 non-null int64
28 health_status 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
31 G2 395 non-null int64
32 G3 395 non-null int64
dtypes: float64(7), int64(9), object(17)
memory usage: 102.0+ KB

df.describe()

{"type":"dataframe"}

df.columns

Index(['school', 'gender', 'age', 'address', 'famsize', 'Parrent_status',
       'Mother_edu', 'Father_edu', 'Mother_job', 'Father_job',
       'reason_to_chose_school', 'guardian', 'traveltime', 'weekly_studytime',
       'failures', 'extra_edu_supp', 'family_edu_supp', 'extra_paid_class',
       'extra_curr_activities', 'nursery', 'Interested_in_higher_edu',
       'internet_access', 'romantic_relationship', 'Family_quality_reln',
       'freetime_after_school', 'goout_with_friends', 'workday_alcohol_consum',
       'weekend_alcohol_consum', 'health_status', 'absences', 'G1', 'G2',
       'G3'],
      dtype='object')

df.describe(include='all')

{"type":"dataframe"}

Visualization
import matplotlib.pyplot as plt
import seaborn as sns

plt.hist(df['G1'])

(array([ 2., 31., 37., 72., 51., 74., 63., 24., 30., 11.]),
array([ 3. , 4.6, 6.2, 7.8, 9.4, 11. , 12.6, 14.2, 15.8, 17.4,
19. ]),
<BarContainer object of 10 artists>)

plt.hist(df['G2'])
(array([13., 0., 16., 35., 82., 81., 78., 57., 18., 15.]),
array([ 0. , 1.9, 3.8, 5.7, 7.6, 9.5, 11.4, 13.3, 15.2, 17.1,
19. ]),
<BarContainer object of 10 artists>)

plt.hist(df['G3'])

(array([ 38., 0., 8., 24., 60., 103., 62., 60., 22., 18.]),
array([ 0., 2., 4., 6., 8., 10., 12., 14., 16., 18., 20.]),
<BarContainer object of 10 artists>)
plt.hist(df['absences'])

(array([287., 72., 25., 5., 1., 2., 0., 2., 0., 1.]),
array([ 0. , 7.5, 15. , 22.5, 30. , 37.5, 45. , 52.5, 60. , 67.5,
75. ]),
<BarContainer object of 10 artists>)
plt.hist(df['failures'])

(array([310., 0., 0., 50., 0., 0., 17., 0., 0., 16.]),
array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3. ]),
<BarContainer object of 10 artists>)
sns.scatterplot(x=df['G1'], y=df['G2'])

<Axes: xlabel='G1', ylabel='G2'>


sns.scatterplot(x=df['G1'], y=df['G3'])

<Axes: xlabel='G1', ylabel='G3'>


sns.scatterplot(x=df['G2'], y=df['G3'])

<Axes: xlabel='G2', ylabel='G3'>


sns.histplot(data=df, x='age', hue='gender')

<Axes: xlabel='age', ylabel='Count'>


sns.histplot(data=df, x='age', hue='address')

<Axes: xlabel='age', ylabel='Count'>


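# Take the labels from the same value_counts() result as the wedge sizes.
# unique() lists values in order of appearance, which need not match the count order.
school_counts = df['school'].value_counts()
plt.pie(school_counts, labels=school_counts.index, autopct='%1.1f%%')
plt.title('School Distribution')
plt.show()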
plt.pie(df.gender.value_counts().values,
labels = df.gender.value_counts().index, shadow =True,
autopct = "%1.2f%%")
plt.legend()

<matplotlib.legend.Legend at 0x79fe811263f0>
Preprocessing
sns.boxplot(x=df['G1'])

<Axes: xlabel='G1'>
sns.boxplot(x=df['G2'])

<Axes: xlabel='G2'>
numerical_cols = df.select_dtypes(include='number').columns
features_with_outliers=[]

for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    if not outliers.empty:
        features_with_outliers.append(col)

print("Features with outliers:")
print(features_with_outliers)

for cols in features_with_outliers:
    plt.figure(figsize=(6,4))
    sns.boxplot(x=df[cols], data=df, color='red')
    plt.legend()
    plt.title(f'Box plot of {cols}')
    plt.show()

Features with outliers:


['school', 'address', 'Parrent_status', 'Mother_job', 'Father_job',
'extra_edu_supp', 'nursery', 'Interested_in_higher_edu',
'internet_access']

/tmp/ipython-input-3960268259.py:25: UserWarning: No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
plt.legend()

for col in features_with_outliers:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Cap values outside the IQR fences instead of dropping rows
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)

for cols in features_with_outliers:
    plt.figure(figsize=(6,4))
    sns.boxplot(x=df[cols], data=df)
    plt.legend()
    plt.title(f'Box plot of {cols}')
    plt.show()

/tmp/ipython-input-3418483522.py:4: UserWarning: No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
plt.legend()
corr_matrix= df.corr(numeric_only=True)
print(corr_matrix)

age Mother_edu Father_edu traveltime \
age 1.000000 -0.159973 -0.164266 0.077411
Mother_edu -0.159973 1.000000 0.625897 -0.167021
Father_edu -0.164266 0.625897 1.000000 -0.157558
traveltime 0.077411 -0.167021 -0.157558 1.000000
weekly_studytime 0.014124 0.068783 0.009156 -0.115008
failures NaN NaN NaN NaN
Family_quality_reln 0.055209 0.022801 0.013962 -0.017217
freetime_after_school 0.014538 0.025119 -0.018703 -0.026569
goout_with_friends 0.119841 0.061921 0.041183 0.021811
workday_alcohol_consum 0.124073 0.016929 0.001879 0.116286
weekend_alcohol_consum 0.110046 -0.051180 -0.016568 0.120241
health_status -0.069379 -0.043770 0.013469 0.002497
absences 0.186068 0.116780 0.018008 -0.023451
G1 -0.060287 0.206500 0.192346 -0.086274
G2 -0.153019 0.228634 0.179830 -0.138590
G3 -0.156116 0.217775 0.154668 -0.114356

weekly_studytime failures Family_quality_reln \
age 0.014124 NaN 0.055209
Mother_edu 0.068783 NaN 0.022801
Father_edu 0.009156 NaN 0.013962
traveltime -0.115008 NaN -0.017217
weekly_studytime 1.000000 NaN 0.063992
failures NaN NaN NaN
Family_quality_reln 0.063992 NaN 1.000000
freetime_after_school -0.141654 NaN 0.136637
goout_with_friends -0.066011 NaN 0.058834
workday_alcohol_consum -0.219130 NaN -0.079564
weekend_alcohol_consum -0.260562 NaN -0.122639
health_status -0.071126 NaN 0.077752
absences -0.080753 NaN -0.080903
G1 0.163235 NaN 0.027758
G2 0.134537 NaN 0.007214
G3 0.099217 NaN 0.058057

freetime_after_school goout_with_friends \
age 0.014538 0.119841
Mother_edu 0.025119 0.061921
Father_edu -0.018703 0.041183
traveltime -0.026569 0.021811
weekly_studytime -0.141654 -0.066011
failures NaN NaN
Family_quality_reln 0.136637 0.058834
freetime_after_school 1.000000 0.281769
goout_with_friends 0.281769 1.000000
workday_alcohol_consum 0.205032 0.266818
weekend_alcohol_consum 0.146665 0.420386
health_status 0.075318 -0.009577
absences 0.007181 0.105672
G1 0.007524 -0.149104
G2 -0.011653 -0.157180
G3 0.008719 -0.132791

workday_alcohol_consum weekend_alcohol_consum \
age 0.124073 0.110046
Mother_edu 0.016929 -0.051180
Father_edu 0.001879 -0.016568
traveltime 0.116286 0.120241
weekly_studytime -0.219130 -0.260562
failures NaN NaN
Family_quality_reln -0.079564 -0.122639
freetime_after_school 0.205032 0.146665
goout_with_friends 0.266818 0.420386
workday_alcohol_consum 1.000000 0.658956
weekend_alcohol_consum 0.658956 1.000000
health_status 0.080359 0.092476
absences 0.146541 0.193614
G1 -0.101402 -0.126179
G2 -0.087085 -0.102462
G3 -0.066432 -0.051939

health_status absences G1 G2 G3
age -0.069379 0.186068 -0.060287 -0.153019 -0.156116
Mother_edu -0.043770 0.116780 0.206500 0.228634 0.217775
Father_edu 0.013469 0.018008 0.192346 0.179830 0.154668
traveltime 0.002497 -0.023451 -0.086274 -0.138590 -0.114356
weekly_studytime -0.071126 -0.080753 0.163235 0.134537 0.099217
failures NaN NaN NaN NaN NaN
Family_quality_reln 0.077752 -0.080903 0.027758 0.007214 0.058057
freetime_after_school 0.075318 0.007181 0.007524 -0.011653 0.008719
goout_with_friends -0.009577 0.105672 -0.149104 -0.157180 -0.132791
workday_alcohol_consum 0.080359 0.146541 -0.101402 -0.087085 -0.066432
weekend_alcohol_consum 0.092476 0.193614 -0.126179 -0.102462 -0.051939
health_status 1.000000 -0.052585 -0.073172 -0.089461 -0.061335
absences -0.052585 1.000000 -0.020177 -0.050567 0.068030
G1 -0.073172 -0.020177 1.000000 0.884067 0.801468
G2 -0.089461 -0.050567 0.884067 1.000000 0.905780
G3 -0.061335 0.068030 0.801468 0.905780 1.000000

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

<Axes: >
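The `failures` row and column in the correlation matrix above are all NaN. One plausible cause, assuming `failures` was among the columns that were clipped, is that most students have zero failures: when Q1 and Q3 are both 0 the IQR is 0, both fences collapse to 0, and clipping turns the column into a constant. Pearson correlation is undefined for a constant column. The sketch below reproduces the effect on hypothetical toy data; `clip_iqr` and the `toy` frame are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR], mirroring the clipping loop."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical toy data: 8 of 10 students have 0 failures, so Q1 == Q3 == 0.
toy = pd.DataFrame({
    "failures": [0, 0, 0, 0, 0, 0, 0, 0, 1, 3],
    "grade": [12, 14, 11, 15, 13, 10, 16, 12, 8, 4],
})

corr_before = toy["failures"].corr(toy["grade"])
toy["failures_clipped"] = clip_iqr(toy["failures"])

# A constant column has zero variance, so the correlation comes out as NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    corr_after = toy["failures_clipped"].corr(toy["grade"])

print("unique values after clipping:", toy["failures_clipped"].unique())
print("correlation before clipping:", round(corr_before, 3))
print("correlation after clipping:", corr_after)
```

If this is what happened, one option is to skip IQR clipping for columns whose IQR is 0, since the rule then treats every non-modal value as an outlier.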
Missing Values
df.isna().sum()

school 0
gender 1
age 3
address 3
famsize 1
Parrent_status 0
Mother_edu 1
Father_edu 1
Mother_job 1
Father_job 3
reason_to_chose_school 3
guardian 2
traveltime 2
weekly_studytime 1
failures 2
extra_edu_supp 1
family_edu_supp 0
extra_paid_class 1
extra_curr_activities 2
nursery 1
Interested_in_higher_edu 1
internet_access 1
romantic_relationship 1
Family_quality_reln 1
freetime_after_school 0
goout_with_friends 0
workday_alcohol_consum 0
weekend_alcohol_consum 0
health_status 0
absences 0
G1 0
G2 0
G3 0
dtype: int64

# Categorical columns: fill gaps with the most frequent value
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numeric columns: fill gaps with the median
numeric_cols = df.select_dtypes(include=['int64','float64']).columns
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

df.isna().sum()

school 0
gender 0
age 0
address 0
famsize 0
Parrent_status 0
Mother_edu 0
Father_edu 0
Mother_job 0
Father_job 0
reason_to_chose_school 0
guardian 0
traveltime 0
weekly_studytime 0
failures 0
extra_edu_supp 0
family_edu_supp 0
extra_paid_class 0
extra_curr_activities 0
nursery 0
Interested_in_higher_edu 0
internet_access 0
romantic_relationship 0
Family_quality_reln 0
freetime_after_school 0
goout_with_friends 0
workday_alcohol_consum 0
weekend_alcohol_consum 0
health_status 0
absences 0
G1 0
G2 0
G3 0
dtype: int64
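The imputation above computes the mode and median on the full dataset before the train/test split. A common alternative is to learn those statistics from the training rows only and reuse them on the test rows. The following is a minimal sketch of that idea with scikit-learn's `SimpleImputer` and `ColumnTransformer`; the toy frame and its values are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with gaps in one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [15.0, 16.0, np.nan, 17.0, 18.0, 16.0, np.nan, 15.0],
    "guardian": ["mother", np.nan, "father", "mother",
                 "mother", np.nan, "other", "mother"],
    "G3": [10, 12, 9, 14, 8, 11, 13, 10],
})
X_toy, y_toy = toy.drop(columns="G3"), toy["G3"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# Median for numeric columns, most frequent value for categorical ones.
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["guardian"]),
])

X_tr_filled = imputer.fit_transform(X_tr)  # statistics learned on train only
X_te_filled = imputer.transform(X_te)      # same statistics reused on test

print(X_tr_filled)
print(X_te_filled)
```

With only a handful of missing values per column, as in this dataset, the practical difference is likely small, but the pattern keeps test data out of the preprocessing statistics.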

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null object
1 gender 395 non-null object
2 age 395 non-null float64
3 address 395 non-null object
4 famsize 395 non-null object
5 Parrent_status 395 non-null object
6 Mother_edu 395 non-null float64
7 Father_edu 395 non-null float64
8 Mother_job 395 non-null object
9 Father_job 395 non-null object
10 reason_to_chose_school 395 non-null object
11 guardian 395 non-null object
12 traveltime 395 non-null float64
13 weekly_studytime 395 non-null float64
14 failures 395 non-null float64
15 extra_edu_supp 395 non-null object
16 family_edu_supp 395 non-null object
17 extra_paid_class 395 non-null object
18 extra_curr_activities 395 non-null object
19 nursery 395 non-null object
20 Interested_in_higher_edu 395 non-null object
21 internet_access 395 non-null object
22 romantic_relationship 395 non-null object
23 Family_quality_reln 395 non-null float64
24 freetime_after_school 395 non-null float64
25 goout_with_friends 395 non-null int64
26 workday_alcohol_consum 395 non-null float64
27 weekend_alcohol_consum 395 non-null int64
28 health_status 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
31 G2 395 non-null int64
32 G3 395 non-null int64
dtypes: float64(9), int64(7), object(17)
memory usage: 102.0+ KB

df.school.value_counts()

school
GP 349
MS 46
Name: count, dtype: int64

for col in df.select_dtypes(include=['object']).columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts())

Column: school
school
GP 349
MS 46
Name: count, dtype: int64

Column: gender
gender
F 209
M 186
Name: count, dtype: int64

Column: address
address
U 307
R 88
Name: count, dtype: int64

Column: famsize
famsize
GT3 282
LE3 113
Name: count, dtype: int64

Column: Parrent_status
Parrent_status
T 354
A 41
Name: count, dtype: int64
Column: Mother_job
Mother_job
other 142
services 102
at_home 59
teacher 58
health 34
Name: count, dtype: int64

Column: Father_job
Father_job
other 218
services 110
teacher 29
at_home 20
health 18
Name: count, dtype: int64

Column: reason_to_chose_school
reason_to_chose_school
course 148
home 108
reputation 104
other 35
Name: count, dtype: int64

Column: guardian
guardian
mother 274
father 89
other 32
Name: count, dtype: int64

Column: extra_edu_supp
extra_edu_supp
no 345
yes 50
Name: count, dtype: int64

Column: family_edu_supp
family_edu_supp
yes 242
no 153
Name: count, dtype: int64

Column: extra_paid_class
extra_paid_class
no 214
yes 181
Name: count, dtype: int64

Column: extra_curr_activities
extra_curr_activities
yes 203
no 192
Name: count, dtype: int64

Column: nursery
nursery
yes 314
no 81
Name: count, dtype: int64

Column: Interested_in_higher_edu
Interested_in_higher_edu
yes 375
no 20
Name: count, dtype: int64

Column: internet_access
internet_access
yes 329
no 66
Name: count, dtype: int64

Column: romantic_relationship
romantic_relationship
no 264
yes 131
Name: count, dtype: int64

# Binary columns: 1 for the named category, 0 otherwise
df['gender'] = (df['gender'] == 'M').astype(int)
df['address'] = (df['address'] == 'R').astype(int)
df['famsize'] = (df['famsize'] == 'GT3').astype(int)
df['Parrent_status'] = (df['Parrent_status'] == 'T').astype(int)

# Yes/no columns: 1 for 'yes', 0 for 'no'
yes_no_cols = ['extra_edu_supp', 'family_edu_supp', 'extra_paid_class',
               'extra_curr_activities', 'nursery', 'Interested_in_higher_edu',
               'internet_access', 'romantic_relationship']
for col in yes_no_cols:
    df[col] = (df[col] == 'yes').astype(int)

# Multi-category columns: map each category to an integer code
df['school'] = df['school'].map({'GP': 0, 'MS': 1})
df['Mother_job'] = df['Mother_job'].map(
    {'at_home': 0, 'health': 1, 'other': 2, 'services': 3, 'teacher': 4})
df['Father_job'] = df['Father_job'].map(
    {'at_home': 0, 'health': 1, 'other': 2, 'services': 3, 'teacher': 4})
df['reason_to_chose_school'] = df['reason_to_chose_school'].map(
    {'home': 0, 'reputation': 1, 'course': 2, 'other': 3})
df['guardian'] = df['guardian'].map({'mother': 0, 'father': 1, 'other': 2})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null int64
1 gender 395 non-null int64
2 age 395 non-null float64
3 address 395 non-null int64
4 famsize 395 non-null int64
5 Parrent_status 395 non-null int64
6 Mother_edu 395 non-null float64
7 Father_edu 395 non-null float64
8 Mother_job 395 non-null int64
9 Father_job 395 non-null int64
10 reason_to_chose_school 395 non-null int64
11 guardian 395 non-null int64
12 traveltime 395 non-null float64
13 weekly_studytime 395 non-null float64
14 failures 395 non-null float64
15 extra_edu_supp 395 non-null int64
16 family_edu_supp 395 non-null int64
17 extra_paid_class 395 non-null int64
18 extra_curr_activities 395 non-null int64
19 nursery 395 non-null int64
20 Interested_in_higher_edu 395 non-null int64
21 internet_access 395 non-null int64
22 romantic_relationship 395 non-null int64
23 Family_quality_reln 395 non-null float64
24 freetime_after_school 395 non-null float64
25 goout_with_friends 395 non-null int64
26 workday_alcohol_consum 395 non-null float64
27 weekend_alcohol_consum 395 non-null int64
28 health_status 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
31 G2 395 non-null int64
32 G3 395 non-null int64
dtypes: float64(9), int64(24)
memory usage: 102.0 KB
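Mapping `Mother_job`, `Father_job`, `reason_to_chose_school` and `guardian` to integer codes makes a linear model treat them as ordered quantities (for example, `teacher` = 4 counts as "four times" `health` = 1). One alternative, sketched here on hypothetical toy rows rather than the real data, is one-hot encoding with `pd.get_dummies`, which gives each category its own 0/1 column.

```python
import pandas as pd

# Hypothetical toy rows using category names from the dataset.
toy = pd.DataFrame({
    "Mother_job": ["at_home", "health", "other", "services", "teacher", "other"],
    "guardian": ["mother", "father", "other", "mother", "father", "mother"],
    "G1": [10, 12, 9, 14, 8, 11],
})

# drop_first=True drops one category per column to avoid redundant columns
# when the model also fits an intercept.
encoded = pd.get_dummies(
    toy, columns=["Mother_job", "guardian"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded.head())
```

Whether this improves the model here is an open question; it trades a few extra columns for not imposing an arbitrary order on the categories.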

## Separate the target variable


y = df['G3']
X = df.drop('G3', axis =1)

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

from sklearn.linear_model import LinearRegression


lin = LinearRegression()
lin = lin.fit(X_train,y_train)
y_pred = lin.predict(X_test)

y_pred

array([ 6.53578216, 11.11320513,  3.61318931,  8.65230501,  9.95418988,
       11.78545531, 18.83808101,  7.75778205,  6.64733596, 12.49644447,
       14.98638193,  4.97546822, 13.67571849, 11.68115081, 14.53775178,
        7.88646458,  4.75536475, 10.74090073, 13.85881117,  7.47913783,
       13.76329359, 16.67784243, 13.54226895,  5.51112857,  8.42623937,
       21.29809965,  9.98419751,  9.0424006 , 16.89730598, 11.11564326,
        9.23491816,  6.48639573, 14.86393416, 13.30759839,  4.67925488,
        4.45676117, -0.52096394, 15.49608868, 11.66803542,  8.91600367,
        4.64897411, 10.3827293 , 14.22767684,  7.90541528, 16.14847446,
        8.55657074, 12.44456524, 14.59436283, 11.47753162, 15.38628234,
       14.12254439, 14.59157218, 10.28654417,  7.71348917,  2.86832812,
       13.26623373,  9.6241547 ,  5.73288258, 15.49127104, 16.05455577,
       13.67138629,  7.85598092,  8.16820299,  3.18568083,  3.66881939,
       16.87922268,  8.24014765,  8.47858617,  9.30042735, 16.84094208,
        8.31127531,  8.31178302, 13.92164121, 20.95349954, 10.30296541,
        5.8690325 ,  8.06471596, 12.5941896 ,  5.3494861 ])

lin.intercept_

np.float64(-3.079114806474344)
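The intercept alone says little about the fitted model. Pairing `coef_` with the column names shows which features carry the most weight. The sketch below uses hypothetical synthetic data with known coefficients (0.2 for `G1`, 0.8 for `G2`, intercept -1.0) so the result can be checked; these numbers are assumptions for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: the target is an exact linear function of G1 and G2.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "G1": rng.integers(0, 21, size=100),
    "G2": rng.integers(0, 21, size=100),
})
y_demo = 0.2 * X_demo["G1"] + 0.8 * X_demo["G2"] - 1.0

model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with its feature name and sort by absolute size.
coefs = pd.Series(model.coef_, index=X_demo.columns)
coefs = coefs.sort_values(key=abs, ascending=False)

print(coefs)
print("intercept:", model.intercept_)
```

On the real data the same two lines, `pd.Series(lin.coef_, index=X_train.columns)` followed by a sort, should give a per-feature view; note that coefficient sizes are only comparable across features that share a scale.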

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 316 entries, 181 to 102
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 316 non-null int64
1 gender 316 non-null int64
2 age 316 non-null float64
3 address 316 non-null int64
4 famsize 316 non-null int64
5 Parrent_status 316 non-null int64
6 Mother_edu 316 non-null float64
7 Father_edu 316 non-null float64
8 Mother_job 316 non-null int64
9 Father_job 316 non-null int64
10 reason_to_chose_school 316 non-null int64
11 guardian 316 non-null int64
12 traveltime 316 non-null float64
13 weekly_studytime 316 non-null float64
14 failures 316 non-null float64
15 extra_edu_supp 316 non-null int64
16 family_edu_supp 316 non-null int64
17 extra_paid_class 316 non-null int64
18 extra_curr_activities 316 non-null int64
19 nursery 316 non-null int64
20 Interested_in_higher_edu 316 non-null int64
21 internet_access 316 non-null int64
22 romantic_relationship 316 non-null int64
23 Family_quality_reln 316 non-null float64
24 freetime_after_school 316 non-null float64
25 goout_with_friends 316 non-null int64
26 workday_alcohol_consum 316 non-null float64
27 weekend_alcohol_consum 316 non-null int64
28 health_status 316 non-null int64
29 absences 316 non-null int64
30 G1 316 non-null int64
31 G2 316 non-null int64
dtypes: float64(9), int64(23)
memory usage: 81.5 KB

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Absolute Error (MAE):", mae)


print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R² Score:", r2)

Mean Absolute Error (MAE): 1.6264715397250666


Mean Squared Error (MSE): 5.249308259664779
Root Mean Squared Error (RMSE): 2.29113689238875
R² Score: 0.7439992119481771
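To make the four numbers above concrete, here is a minimal NumPy sketch of how each metric is defined. The helper name `regression_metrics` and the four-point example are hypothetical; the formulas are the standard ones (MAE = mean |error|, MSE = mean error², RMSE = √MSE, R² = 1 − SS_res / SS_tot).

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)

    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical example: errors are 1, 0, -1, -1.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 10])
print(metrics)
```

RMSE is in the same units as the grade, so the reported value of about 2.29 means typical errors of roughly two grade points on the 0 to 20 scale.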

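# For regressors, .score() returns R² (the coefficient of determination),
# not classification accuracy, so it matches r2_score above.
r2_from_score = lin.score(X_test, y_test)
print("R² Score (lin.score):", r2_from_score)

R² Score (lin.score): 0.7439992119481771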

from sklearn.linear_model import Ridge


from sklearn.model_selection import GridSearchCV

ridge = Ridge()

params = {
'alpha': [0.01, 0.1, 1, 10, 100]
}

grid = GridSearchCV(ridge, param_grid=params, cv=5, scoring='r2')


grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)


print("Best Score:", grid.best_score_)

Best Parameters: {'alpha': 100}


Best Score: 0.8494824902919669
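`grid.best_score_` is the mean cross-validated R² over the training folds, so it is not directly comparable to an R² measured on the held-out test set. Ridge's penalty also depends on feature scale, so a common variant standardizes the features inside a `Pipeline` before the search. The sketch below shows that pattern on synthetic data from `make_regression`; the data, the step names `scale` and `ridge`, and the resulting scores are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling happens inside each CV fold, so validation folds never
# influence the scaler's statistics.
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
alphas = [0.01, 0.1, 1, 10, 100]
search = GridSearchCV(
    pipe, param_grid={"ridge__alpha": alphas}, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

cv_r2 = search.best_score_          # mean R² across the 5 training folds
test_r2 = search.score(X_te, y_te)  # R² of the refit best model on held-out data

print("best alpha:", search.best_params_["ridge__alpha"])
print("CV R²:", cv_r2)
print("test R²:", test_r2)
```

In the notebook the selected `alpha` of 100 sits at the upper edge of the grid, which suggests it may be worth trying larger values as well.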

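ridge_pred = grid.predict(X_test)

# r2_score expects (y_true, y_pred). The value printed below came from the
# original call with the arguments swapped, r2_score(ridge_pred, y_test),
# so rerun this cell to get the test R² of the tuned model.
score = r2_score(y_test, ridge_pred)

score

0.7489279687734536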
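`r2_score` takes the true values first and the predictions second, and the result changes if the two are swapped, because the total sum of squares is computed around the mean of whichever array is passed first. A small hypothetical example makes the asymmetry visible.

```python
from sklearn.metrics import r2_score

# Hypothetical true values and predictions.
y_true = [3, 5, 7, 9]
y_hat = [2, 5, 8, 10]

r2_correct = r2_score(y_true, y_hat)  # baseline: variance of the true values
r2_swapped = r2_score(y_hat, y_true)  # baseline: variance of the predictions

print("r2_score(y_true, y_pred):", r2_correct)
print("r2_score(y_pred, y_true):", r2_swapped)
```

Here the residuals are identical in both calls; only the denominator differs (20 versus 36.75), which is enough to shift the score.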
import pickle

model_pkl_file = "ML-Assignment-1.pkl"
with open(model_pkl_file, 'wb') as file:
    pickle.dump(grid, file)
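A saved model is only useful if it can be loaded back and still gives the same predictions. The following round-trip check is a minimal sketch on synthetic data with a plain `Ridge` model; the file name `demo-model.pkl` and the data are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data and model.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 3.0
model = Ridge(alpha=1.0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pkl")

    # Save the fitted estimator.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("round trip OK:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and they are best loaded with the same scikit-learn version that wrote them.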
