
## MACHINE LEARNING

i) Supervised Learning
ii) Unsupervised Learning
iii) Reinforcement Learning

i)Supervised Learning
a) Regression Learning
b) Classification Learning

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dataset = pd.read_csv(r"D:\data set\video\loan.csv")


dataset.head(3)
dataset.shape
dataset.shape[0]
dataset.isnull()
dataset.isnull().sum()
dataset.isnull().sum().sum()
dataset.notnull().sum()
(dataset.isnull().sum().sum()/(dataset.shape[0]*dataset.shape[1]))*100 # overall missing data as a percentage
(dataset.isnull().sum()/dataset.shape[0])*100 # missing data per column as a percentage
sns.heatmap(dataset.isnull())
plt.show()
dataset.drop(columns=["Credit_History"]) # drop a column by name
dataset.shape
dataset.dropna() # rows containing missing values are dropped
dataset.dropna(inplace=True) # with inplace=True the rows are deleted in the original dataset
dataset.isnull().sum()
sns.heatmap(dataset.isnull())
plt.show()
dataset.shape
# Handling Missing Values
dataset.fillna(10) # wrong way to fill the data
dataset.fillna(10).head(10)
dataset.info()
dataset.fillna(method="bfill") # backward filling
dataset.fillna(method="ffill",axis=1) # axis=0 fills column-wise, axis=1 fills row-wise
dataset["Gender"].mode()[0]
dataset["Gender"].fillna(dataset["Gender"].mode()[0])
dataset["Gender"].fillna(dataset["Gender"].mode()[0],inplace=True) # inplace=True writes the fill back into the original dataset (particular column)
dataset.select_dtypes(include="object")
dataset.select_dtypes(include="object").isnull().sum()
dataset.select_dtypes(include="object").columns

for i in dataset.select_dtypes(include="object").columns:
    print(i)

for i in dataset.select_dtypes(include="object").columns:
    dataset[i].fillna(dataset[i].mode()[0],inplace=True) # fill every object-type column with its mode; numeric columns are left untouched
# To fill the numerical values

dataset.isnull().sum()
dataset.info()
dataset.select_dtypes(include="float64").columns
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy="mean")
float_cols = dataset.select_dtypes(include="float64").columns
si.fit_transform(dataset[float_cols])
arr = si.fit_transform(dataset[float_cols])
pd.DataFrame(arr,columns=float_cols)

new_dataset = pd.DataFrame(arr,columns=float_cols)

new_dataset.isnull().sum()
new_dataset
dataset["LoanAmount"].mean()

# models generally need numerical input, which is why we convert categorical values to numerical values
# ONE HOT Encoding

import pandas as pd
dataset.head()
dataset.isnull().sum()
dataset["Gender"].fillna(dataset["Gender"].mode()[0],inplace=True)
dataset["Married"].fillna(dataset["Married"].mode()[0],inplace=True)
#1st method(ONE Hot Encoding) - get_dummies
en_data = dataset[["Gender","Married"]]
pd.get_dummies(en_data)
pd.get_dummies(en_data).info()

#2nd Method
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(en_data).toarray()
ar = ohe.fit_transform(en_data).toarray()
pd.DataFrame(ar,columns=["Gender_Female","Gender_Male","Married_No","Married_Yes"])
ohe1 = OneHotEncoder(drop="first")
ar1 = ohe1.fit_transform(en_data).toarray()
ar1
pd.DataFrame(ar1,columns=["Gender_Male","Married_Yes"])

#Label Encoding
import pandas as pd
df = pd.DataFrame({"name":["wscube","cow","cat","dog","black"]})
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(df["name"])
df["en_name"] = le.fit_transform(df["name"])
df
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset["Property_Area"].unique()
la = LabelEncoder()
la.fit(dataset["Property_Area"])
la.transform(dataset["Property_Area"])
dataset["Property_Area"] = la.transform(dataset["Property_Area"])
dataset["Property_Area"].unique()

#Ordinal Encoding
import pandas as pd
df = pd.DataFrame({"Size":["s","m","l","xl","s","m","l","s","s","l","xl","m"]})
df.head(3)
ord_data = [["s","m","l","xl"]] # nested list (double brackets) because the encoder expects 2-D categories
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=ord_data)
oe.fit(df[["Size"]])
oe.transform(df[["Size"]])
df["Size_encoding"] = oe.transform(df[["Size"]])
df

# 2nd method (Map function)


ord_data1 = {"s":0,"m":1,"l":2,"xl":3}
df["Size"].map(ord_data1)
df["Size_encoding_map"] = df["Size"].map(ord_data1)
df

--------
# for example(learning purpose)
ord_data1 = {"s":5,"m":6,"l":7,"xl":8}
df["Size_encoding_map"] = df["Size"].map(ord_data1)
df
----------

#practical on big data


dataset = pd.read_csv("loan.csv")
dataset.head()
dataset["Property_Area"].unique() # for find name
dataset["Property_Area"].fillna(dataset["Property_Area"].mode()[0],inplace=True)
en_data_ord = [['Rural','Semiurban','Urban']]
from sklearn.preprocessing import OrdinalEncoder
oen = OrdinalEncoder(categories=en_data_ord)
dataset["Property_Area"] = oen.fit_transform(dataset[["Property_Area"]]) # fit_transform applies directly to the dataset
dataset.head(10)

#OUTLIER
# how to detect outlier
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.info()
dataset.describe()
# box plot
plt.figure(figsize=(15,5))
sns.boxplot(x = "CoapplicantIncome", data=dataset)
plt.show()
sns.boxplot(x = "ApplicantIncome", data=dataset)
plt.show()

#another method to find outlier


sns.distplot(dataset["ApplicantIncome"])
plt.show()

# outlier remover method


# using IQR (interquartile range)
# IQR = Q3-Q1
# min = Q1-(1.5*IQR)
# max = Q3 +(1.5 * IQR)

dataset.shape
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
q1
q3
IQR = q3-q1
min_range = q1 - (1.5*IQR)
max_range = q3 + (1.5*IQR)
min_range,max_range

dataset
dataset[dataset["CoapplicantIncome"]<=max_range ]
new_dataset = dataset[dataset["CoapplicantIncome"]<=max_range]
new_dataset.shape
sns.boxplot(x = "CoapplicantIncome", data=new_dataset)
plt.show()

# outlier removal using Z score


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset.describe()
sns.boxplot(x= "CoapplicantIncome",data= dataset)
sns.distplot(dataset["CoapplicantIncome"]) # distribution plot
min_range = dataset["CoapplicantIncome"].mean() - 3*dataset["CoapplicantIncome"].std() # std = standard deviation
max_range = dataset["CoapplicantIncome"].mean() + 3*dataset["CoapplicantIncome"].std()
min_range,max_range
new_dataset = dataset[dataset["CoapplicantIncome"]<= max_range]
sns.boxplot(x= "CoapplicantIncome",data= new_dataset)

# Z score
z_score = (dataset["CoapplicantIncome"] - dataset["CoapplicantIncome"].mean())/(dataset["CoapplicantIncome"].std())
z_score
z_score>3
dataset["z_score"] = z_score # putting the z-scores into the original dataset
dataset
# removing outlier
dataset[dataset["z_score"]<3]
new_dataset.shape
dataset[dataset["z_score"]<3].shape

# Feature Scaling(standardization)
standardization - It is a very effective technique which rescales a feature so that it has a distribution with mean 0 and variance 1.
x_new = (x_i - mean(x)) / std(x)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset["ApplicantIncome"].fillna(dataset["ApplicantIncome"].mean(),inplace=True)
sns.distplot(dataset["ApplicantIncome"])
plt.show()
dataset.describe()
# scaling through scikit-learn

from sklearn.preprocessing import StandardScaler


ss = StandardScaler()
ss.fit(dataset[["ApplicantIncome"]])
ss.transform(dataset[["ApplicantIncome"]])
dataset["ApplicantIncome_ss"] = pd.DataFrame(ss.transform(dataset[["ApplicantIncome"]]),columns=["x"])
dataset.head(3)
dataset.describe()

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.title("Before")
sns.distplot(dataset["ApplicantIncome"])

plt.subplot(1,2,2)
plt.title("After")
sns.distplot(dataset["ApplicantIncome_ss"])

plt.show()

# Feature Scaling (normalization)
# min-max scaler (normalization technique)
normalization - It is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

x_new = (x_i - min(x)) / (max(x) - min(x))

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset.describe()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
from sklearn.preprocessing import MinMaxScaler
ms = MinMaxScaler()
ms.fit(dataset[["CoapplicantIncome"]])
ms.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_min"] = ms.transform(dataset[["CoapplicantIncome"]])
dataset.head(3)

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.title("Before")
sns.distplot(dataset["CoapplicantIncome"])
plt.subplot(1,2,2)
plt.title("After")
sns.distplot(dataset["CoapplicantIncome_min"])

plt.show()

# Handling Duplicate Data


import pandas as pd
data = {"name":["a","b","c","d","a","c"],"eng":[8,7,5,8,8,5],"hindi":[2,3,4,5,2,6]}
df = pd.DataFrame(data)
df
df.duplicated()
# df["duplicate"] = df.duplicated()
df
df.drop_duplicates()
df.drop_duplicates(inplace=True)

# duplicate data on big data


dataset = pd.read_csv("loan.csv")
dataset.duplicated()
dataset.shape
dataset.drop_duplicates(inplace=True)
dataset.shape

# Replace And Data Type change


import pandas as pd
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.info()
dataset.isnull().sum()
dataset["Dependents"].value_counts()
dataset["Dependents"].fillna(dataset["Dependents"].mode()[0],inplace=True)
dataset.isnull().sum()
dataset["Dependents"].replace("3+","3",inplace=True)
dataset["Dependents"].astype("int64") #converting type of object
dataset["Dependents"] = dataset["Dependents"].astype("int64")
dataset["Dependents"].value_counts()
dataset.info()

# Function Transformer
# without outlier
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
#IQR
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
iqr = q3 -q1
min_r = q1 - (1.5*iqr)
max_r = q3 + (1.5*iqr)
min_r,max_r

dataset[dataset["CoapplicantIncome"]<=max_r]
dataset = dataset[dataset["CoapplicantIncome"]<=max_r]
sns.distplot(dataset["CoapplicantIncome"])
plt.show()

from sklearn.preprocessing import FunctionTransformer # function transforming here


ft = FunctionTransformer(func = np.log1p)
ft.fit(dataset[["CoapplicantIncome"]])
ft.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf"] = ft.transform(dataset[["CoapplicantIncome"]])

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf"])
plt.title("After")
plt.show()

# Function Transformer
# with outlier
dataset.head(3)
dataset.isnull().sum()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
## IQR
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
iqr = q3 -q1
min_r = q1 - (1.5*iqr)
max_r = q3 + (1.5*iqr)
min_r,max_r

#dataset[dataset["CoapplicantIncome"]<=max_r]
#dataset = dataset[dataset["CoapplicantIncome"]<=max_r]

sns.distplot(dataset["CoapplicantIncome"])
plt.show()

from sklearn.preprocessing import FunctionTransformer # function transforming here


ft = FunctionTransformer(func = np.log1p)
ft.fit(dataset[["CoapplicantIncome"]])
ft.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf"] = ft.transform(dataset[["CoapplicantIncome"]])
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf"])
plt.title("After")
plt.show()

## another method
ft1 = FunctionTransformer(func= lambda x : x**2)
ft1.fit(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf1"] = ft1.transform(dataset[["CoapplicantIncome"]])
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf1"])
plt.title("After")
plt.show()

## Feature Selection Techniques


## Backward Elimination (using mlxtend) and Forward Elimination (using mlxtend)

Feature_Selection :- A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important features for the model is known as feature selection.

#Forward Elimination (using mlxtend) :-


import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector
dataset = pd.read_csv("diabetes.csv")
dataset.head(3)
x = dataset.iloc[:,:-1]
y = dataset["Outcome"]
x.shape

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
fs = SequentialFeatureSelector(lr,k_features=5,forward=True)
fs.fit(x,y)
fs.k_feature_idx_ # indices of the selected features
fs.k_feature_names_
fs.k_score_

# for backword
fs = SequentialFeatureSelector(lr,k_features=5,forward=False)
fs.fit(x,y)
fs.k_feature_idx_ # indices of the selected features
fs.k_feature_names_
fs.k_score_
## Train Test Split in Data set ##
import pandas as pd
dataset = pd.read_csv("Boston.csv")
dataset.head(3)
dataset.shape
input_data = dataset.iloc[:,:-1]
output_data = dataset["House_Price"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(input_data,output_data,test_size=0.25)
x_test
x_train
y_test
y_train
x_train.shape , y_train.shape
x_test.shape , y_test.shape

### Regression Analysis


## LINEAR REGRESSION ALGORITHM (simple linear)
# simple linear Regression - simple linear regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable.
y = mx+c
where y = dependent variable
x = independent variable
m = slope/gradient/coefficient
c = intercept

m = +ve  ->  theta < 90


m = -ve  ->  theta > 90
m = 0    ->  theta = 0

#practical
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("placement.csv")
#dataset = pd.read_csv(r"D:\data set\video\placement.csv")
dataset.head(3)
plt.figure(figsize=(5,3))
sns.scatterplot(x="cgpa",y="package",data=dataset)
plt.show()
dataset.isnull().sum()
dataset.ndim # if the input is 1-dimensional, reshape it to 2 dimensions
x = dataset[["cgpa"]]
y = dataset["package"]

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
# y = m*x+c
lr.coef_ #array([0.57425647])
lr.intercept_ # -1.0270069374542108
# y = 0.57425647*x-1.0270069374542108
lr.score(x_test,y_test)*100
lr.predict([[6.89]])
# 0.57425647*6.89-1.0270069374542108 # 2.92962016
y_prd = lr.predict(x)
plt.figure(figsize=(5,4))
sns.scatterplot(x="cgpa",y="package",data=dataset)
plt.plot(dataset["cgpa"],y_prd,c="red" )
plt.legend(["org data","predict line"])
plt.savefig("predict.jpg")
plt.show()

## Multiple linear Regression

Multiple linear Regression is an extension of simple linear regression as it takes more than one predictor variable to predict the response variable.

y = m1x1 + m2x2 + m3x3 + ... + c

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("regression_dataset.csv")
dataset.head(3)
dataset.shape
dataset.isnull().sum()
sns.pairplot(data=dataset)
plt.show()
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()

x = dataset.iloc[:,:-1]
x
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("multiple_regression_dataset.csv")
dataset.head()
dataset.isnull().sum()
sns.pairplot(data=dataset)
plt.show()
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["Salary"]
x.ndim
dataset.shape

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)

from sklearn.linear_model import LinearRegression


lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 63.65989707495038
# y = m1x1+m2x2+c
lr.coef_ #array([1676.38278101, -136.23367567])
lr.intercept_ #34875.404040507696
# y_prd = 1676.38278101*Age -136.23367567*Experience + 34875.404040507696
x.columns
lr.predict(x_test)

## POLYNOMIAL REGRESSION
Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial.

y = b0 + b1*x + b2*x^2 + ... + bn*x^n

import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv("ploynomial.csv")
dataset.head(3)
dataset.corr()
plt.scatter(dataset["Level"],dataset["Salary"])
plt.xlabel("Level")
plt.ylabel("Salary")
plt.show()

x = dataset[["Level"]]
y = dataset[["Salary"]]
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
pf.fit(x)
pf.transform(x)
x = pf.transform(x)

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

from sklearn.linear_model import LinearRegression


lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 99.999999
# y = m1x1+m2x2^2+c
# y = 1000.10647295*x1 + 500.00158031*x2^2 - 13.512174614
lr.coef_ #array([0. , 1000.10647295, 500.00158031])
lr.intercept_ # -13.512174614

prd =lr.predict(x)
plt.scatter(dataset["Level"],dataset["Salary"])
plt.plot(dataset["Level"],prd,c='red')
plt.xlabel("Level")
plt.ylabel("Salary")
plt.legend(["org","prd"])
plt.show()

test = pf.transform([[45]])
test # array([[1.000e+00, 4.500e+01, 2.025e+03]])
lr.predict(test) #([1057494.47922994])

# Cost Function
1 - A cost function is an important parameter that determines how well a machine learning model performs for a given dataset.
2 - The cost function is a measure of how wrong the model is in estimating the relationship between X (input) and Y (output).

=> types of cost function


a) Regression Cost Function
b) Classification cost Function

a)Regression Cost Function


Regression models are used to make a prediction for the continuous variable.
1- MSE(Mean Square Error)
2- RMSE(Root Mean Square Error)
3- MAE- (Mean Absolute Error)
4- R^2 Accuracy

b) Classification cost Function


1- Binary Classification Cost Function:
Classification models are used to make predictions of categorical variables, such as predictions for 0 or 1, Cat or Dog, etc.

2- Multi-class Classification Cost Function:

A multi-class Classification cost function is used in Classification problems for which instances are allocated to one of more than two classes.

-> Binary Cross Entropy Cost Function or Log Loss Function
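
The notes name the log loss (binary cross entropy) function but give no formula; below is a minimal sketch with made-up labels and predicted probabilities, cross-checked against scikit-learn's log_loss.

import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    # log loss = -(1/n) * sum( y*log(p) + (1-y)*log(1-p) )
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])          # hypothetical labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4])  # hypothetical predicted probabilities

print(binary_cross_entropy(y_true, y_prob))
print(log_loss(y_true, y_prob))  # scikit-learn gives the same value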

=> Regression Cost Function

1. Mean Squared Error:


. Mean Squared Error (MSE) is the mean squared difference between the actual and predicted values. MSE penalizes high errors caused by outliers by squaring the errors.
. Mean Squared Error is also known as L2 Loss.

Mean Squared Error = (1/n) * Σ (y_i - ŷ_i)^2

2. Mean Absolute Error:


. Mean Absolute Error (MAE) is the mean absolute difference between the actual values and the predicted values.
. MAE is more robust to outliers. The insensitivity to outliers is because it does not penalize high errors caused by outliers.

Mean Absolute Error = (1/n) * Σ |y_i - ŷ_i|

3. Root Mean Squared Error:


. Root Mean Squared Error (RMSE) is the square root of the mean squared difference between actual and predicted values.
. RMSE can be used in situations where we want to penalize high errors, but not as much as MSE does.

Root Mean Squared Error = sqrt( (1/n) * Σ (y_i - ŷ_i)^2 )
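
A short sketch of the three regression cost functions above, computed by hand with numpy and checked against scikit-learn (the y values here are made up for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predicted values

mse = np.mean((y_true - y_pred) ** 2)    # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error
rmse = np.sqrt(mse)                      # Root Mean Squared Error

print(mse, mean_squared_error(y_true, y_pred))   # same value
print(mae, mean_absolute_error(y_true, y_pred))  # same value
print(rmse)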

=> How To Find a Best Fit Line:-

=> L1(Lasso Regularization), L2(Ridge Regularization)


=> Regularization Techniques
. This is a form of regression that constrains/regularizes or shrinks the coefficient estimates towards zero.
. This technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.

-> Regularization can achieve this motive with 2 techniques:


- Ridge Regularization /L2
- Lasso Regularization /L1

-> Lasso Regularization /L1:(feature selection work)


. This is a regularization technique used in feature selection using a shrinkage method, also referred to as the penalized regression method.
. In Lasso Regression the magnitude of coefficients can shrink to exactly zero.

cost function = Loss + lambda * Σ |w|

. Loss = sum of squared residual


. lambda = penalty
. w = slope of the curve

-> Ridge Regularization /L2:(overfitting reducing technique)


. Ridge Regression, also known as L2 regularization, is an extension to linear regression that introduces a regularization term to reduce model complexity and help prevent overfitting.
. In Ridge Regression the magnitude of coefficients shrinks close to zero, but not exactly zero.

cost function = Loss + lambda * Σ w^2

. Loss = sum of squared residual


. lambda = penalty
. w = slope of the curve

=> L1(Lasso Regularization), L2(Ridge Regularization) Practical


-> Regularization Techniques (Practical)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

dataset = pd.read_csv(r"houseprice.csv")
dataset.head(3)
plt.figure(figsize=(10,10))
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["price"]

sc = StandardScaler()
sc.fit(x)
sc.transform(x)
x = pd.DataFrame(sc.transform(x),columns=x.columns)

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression , Lasso , Ridge
from sklearn.metrics import mean_absolute_error,mean_squared_error
import numpy as np

# LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 3.2286184...
lr.coef_
print(mean_squared_error(y_test,lr.predict(x_test))) #986919392751.0544
print(mean_absolute_error(y_test,lr.predict(x_test))) #210903.52141518658
print(np.sqrt(mean_squared_error(y_test,lr.predict(x_test)))) #993438.167552996
plt.figure(figsize=(15,5))
plt.bar(x.columns,lr.coef_)
plt.title("LinearRegression")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()

# Lasso(L1)

la =Lasso(alpha=0.5)
la.fit(x_train,y_train)
la.score(x_test,y_test)*100 # 3.228361
la.coef_
print(mean_squared_error(y_test,la.predict(x_test))) #986921772009.158
print(mean_absolute_error(y_test,la.predict(x_test))) #210908.17447564355
print(np.sqrt(mean_squared_error(y_test,la.predict(x_test)))) #993439.3650390335
plt.figure(figsize=(15,5))
plt.bar(x.columns,la.coef_)
plt.title("Lasso")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()

# Ridge(L2)

ri = Ridge(alpha = 10)
ri.fit(x_train,y_train)
ri.score(x_test,y_test)*100 #3.2401994171
ri.coef_
print(mean_squared_error(y_test,ri.predict(x_test))) #986801284919.7765
print(mean_absolute_error(y_test,ri.predict(x_test))) #210815.94787357954
print(np.sqrt(mean_squared_error(y_test,ri.predict(x_test)))) #993378.7217973699
plt.figure(figsize=(15,5))
plt.bar(x.columns,ri.coef_)
plt.title("Ridge")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()

df = pd.DataFrame({"col_name":x.columns,"LinearRegression":lr.coef_,"Lasso":la.coef_,"Ridge":ri.coef_})

## Classification
# Classification Algorithm
. The Classification algorithm is used to identify the category of a new observation on the basis of training data.
. In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups.
. Such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.

-> There are two types of Classifications:


a) Binary Classifier: If the Classification problem has only two possible outcomes, then it is called a Binary Classifier.
ex - SPAM or NOT SPAM, CAT or DOG, etc.

b) Multi-class Classifier: If a Classification problem has more than two outcomes, then it is called a Multi-class Classifier.
ex - Classification of types of crops, Classification of types of music.

=> Types Of ML CLASSIFICATION ALGORITHM

Linear Models:
. Logistic Regression
. Support Vector Machines

Non-linear Models:
. K-Nearest Neighbours
. SVM
. Naive Bayes
. Decision Tree Classification
. Random Forest Classification

=> Logistic Regression (Binary Classification)(Practical):


. Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique.
. It is used for predicting the categorical dependent variable using a given set of independent variables.
. Therefore, the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 and 1, it gives probabilistic values which lie between 0 and 1 (see the sigmoid sketch below).
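
The probability mentioned above comes from the sigmoid function: logistic regression passes the linear score m*x + c through the sigmoid, and scikit-learn exposes the same probabilities via predict_proba. A minimal sketch (the fitted model lr and the value 40 refer to the practical below and are only illustrative):

import numpy as np

def sigmoid(z):
    # squashes any real-valued score into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# once a model lr has been fitted (as in the practical below), these two agree:
# z = lr.coef_[0][0] * 40 + lr.intercept_[0]   # linear score m*x + c for Age = 40
# sigmoid(z)                                   # manual probability of class 1
# lr.predict_proba([[40]])[0][1]               # same probability from scikit-learn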

-> Types of Logistic Regression


On the basis of the categories, Logistic Regression can be classified into three types:
i) Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
ii) Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
iii) Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv(r"Social_Network_Ads.csv")
dataset.drop(columns=["EstimatedSalary"],inplace=True)
dataset.head(5)

plt.figure(figsize=(4,3))
sns.scatterplot(x="Age",y="Purchased",data=dataset)
plt.show()
x = dataset[["Age"]]
y = dataset["Purchased"]

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 91.25
lr.predict([[40]])

plt.figure(figsize=(4,3))
sns.scatterplot(x="Age",y="Purchased",data=dataset)
sns.lineplot(x = "Age",y = lr.predict(x),data=dataset,color = 'red')
plt.show()

=> Logistic Regression (Binary Classification)(Multiple input)(Practical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dataset = pd.read_csv(r"placement.csv")
dataset.head(3)

plt.figure(figsize=(5,4))
sns.scatterplot(x="cgpa",y="score",data=dataset,hue="placed")
plt.legend(loc=1)
plt.show()

x = dataset.iloc[:,:-1]
x.ndim
print(x)
y = dataset["placed"]

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[8.14,6.52]]) # array([1], dtype=int64)
lr.coef_
lr.intercept_

from mlxtend.plotting import plot_decision_regions


plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=lr)
plt.show()

=> Logistic Regression(Binary Classification)(Polynomial input)(practical):


-> Logistic Regression(Polynomial Features):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv(r"Polynomial_classification.csv")
dataset.head(5)
plt.figure(figsize=(5,4))
sns.scatterplot(x="data1",y="data2",data=dataset,hue="output")
plt.show()

x = dataset.iloc[:,:-1]
y = dataset["output"]

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #90.0
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=lr)
plt.show()

-> with Polynomial Features

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv(r"Polynomial_classification.csv")
dataset.head(5)
plt.figure(figsize=(5,4))
sns.scatterplot(x="data1",y="data2",data=dataset,hue="output")
plt.show()

x = dataset.iloc[:,:-1]
y = dataset["output"]

from sklearn.preprocessing import PolynomialFeatures


pf = PolynomialFeatures(degree=3)
pf.fit(x)
pf.transform(x)
x = pd.DataFrame(pf.transform(x))
x.shape

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #95.0

=> Logistic Regression(Multiclass Classification)(practical):


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"iris.csv")
dataset.head(3)
dataset["species"].unique() # array(['setosa','versicolor','virginica'], dtype=object)
sns.pairplot(data=dataset,hue="species")
plt.show()

x = dataset.iloc[:,:-1]
y = dataset["species"]

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

## OVR
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(multi_class="ovr")
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #96.66666

# multinomial
lr1 = LogisticRegression(multi_class="multinomial")
lr1.fit(x_train,y_train)
lr1.score(x_test,y_test)*100

lr2 = LogisticRegression()
lr2.fit(x_train,y_train)
lr2.score(x_test,y_test)*100 #100.0

=> CONFUSION MATRIX:


. A Confusion matrix is a simple and useful tool for understanding the performance of a Classification model, like one used in machine learning or statistics.
. It helps you evaluate how well your model is doing in categorizing things correctly.
. It is also known as the error matrix.
. The matrix presents the prediction results in a summarized form, giving the total number of correct predictions and incorrect predictions.

. Accuracy = (TP+TN)/n
. Error = (FN+FP)/n
. False Negative: The model has predicted No, but the actual value was Yes. It is also called a Type-II error.
. False Positive: The model has predicted Yes, but the actual value was No. It is also called a Type-I error.

=> CONFUSION MATRIX(Sensitivity, Precision, Recall,F1-score)


-> Precision: TP/(TP+FP)
It helps us to measure the model's ability to classify positive samples correctly.
-> Recall (Sensitivity): TP/(TP+FN)
It measures how many of the actual positive samples the model identifies.
-> F1-Score:
It is the harmonic mean of precision and recall. It takes both false positives and false negatives into account; therefore, it performs well on an imbalanced dataset.

F1 Score = 2*Precision*Recall/(Precision+Recall)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv(r"placement.csv")
dataset.head(5)
x = dataset.iloc[:,:-1]
x
y = dataset["placed"]

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #100.0

from sklearn.metrics import confusion_matrix,precision_score,recall_score,f1_score


cf = confusion_matrix(y_test,lr.predict(x_test))
cf # array([[10,  0],
   #        [ 0, 10]], dtype=int64)
sns.heatmap(cf,annot=True)
plt.show()

precision_score(y_test,lr.predict(x_test))*100 #100.0
recall_score(y_test,lr.predict(x_test))*100 #100.0
f1_score(y_test,lr.predict(x_test))*100 #100

=> IMBALANCED DATASET


-> Techniques to Handle IMBALANCED Data
i) Random Under Sampling:
we reduce the majority class so that it has the same number of samples as the minority class.
ii) Random Over Sampling:
we increase the size of the minority (inactive) class to match the size of the majority (active) class.

import pandas as pd
dataset = pd.read_csv("Social_Network_Ads.csv")
dataset.head(5)
dataset["Purchased"].value_counts()

x = dataset.iloc[:,:-1]
y = dataset["Purchased"]

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.2)

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[45,26000]]) #array([0], dtype=int64)

=> imblearn
import pandas as pd
dataset = pd.read_csv("Social_Network_Ads.csv")
dataset.head(5)
x = dataset.iloc[:,:-1]
y = dataset["Purchased"]
dataset["Purchased"].value_counts()

from imblearn.under_sampling import RandomUnderSampler


ru = RandomUnderSampler()
ru.fit_resample(x,y)
ru_x, ru_y = ru.fit_resample(x,y)
ru_y.value_counts()
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(ru_x,ru_y,random_state=42,test_size=0.2) # split the undersampled data

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 58.620689655172406
lr.predict([[45,26000]]) #array([1], dtype=int64)

from imblearn.over_sampling import RandomOverSampler


ro = RandomOverSampler()
ro_x, ro_y = ro.fit_resample(x,y)
ro_y.value_counts()
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(ro_x,ro_y,random_state=42,test_size=0.2) # split the oversampled data

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #45.63106796116505
lr.predict([[45,26000]]) #array([1], dtype=int64)

## NAIVE BAYES

- Naive Bayes is a Classification algorithm based on Bayes theorem.


- which is a probability theory that describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

-> Naive: It is called Naive because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features.
-> Bayes: It is called bayes because it depends on the principle of Bayes' Theorem.

-> Bayes' Theorem: Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.

P(A|B) = (P(B|A) * P(A)) / P(B)

where:
- P(A|B) is the Posterior probability: probability of hypothesis A given the observed event B.
- P(B|A) is the Likelihood: probability of the evidence given that the hypothesis is true.
- P(A) is the Prior probability: probability of the hypothesis before observing the evidence.
- P(B) is the Marginal probability: probability of the evidence.
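
A tiny worked example of the theorem with made-up numbers (a hypothetical spam filter; all values are assumptions for illustration):

# hypothetical probabilities, just to exercise the formula
p_spam = 0.2                 # P(A): prior probability that an email is spam
p_word_given_spam = 0.5      # P(B|A): likelihood of seeing a word in spam
p_word_given_not_spam = 0.1  # likelihood of seeing the word in non-spam

# P(B): marginal probability of the word
p_word = p_word_given_spam * p_spam + p_word_given_not_spam * (1 - p_spam)  # 0.18

# P(A|B): posterior probability of spam given the word
p_spam_given_word = (p_word_given_spam * p_spam) / p_word
print(p_spam_given_word)  # ≈ 0.556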

=> Types of Naive Bayes Model:


There are three types of Naive Bayes Model,
which are given below
. Gaussian
. Multinomial
. Bernoulli

i) Gaussian Naive Bayes:


- Assumes that continuous features follow a Gaussian (normal) distribution.
- Suitable for features that are continuous and have a normal distribution.

ii) Bernoulli Naive Bayes:


- Assumes that features are binary (Boolean) variable.
- Suitable for data that can be represented as binary features, such as document
Classification problems where each term is either present or absent.

iii) Multinomial Naive Bayes:


- Assumes that features follow a multinomial distribution.
- Typically used for discrete data, such as text data, where each feature represents the frequency of a term.

-> Practical:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.plotting import plot_decision_regions

dataset = pd.read_csv(r"placement.csv")
dataset.head(5)

sns.kdeplot(data=dataset["cgpa"])
plt.show()
sns.kdeplot(data=dataset["score"])
plt.show()

plt.figure(figsize=(4,3))
sns.scatterplot(x="cgpa",y="score",data=dataset,hue="placed")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["placed"]

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.2)

from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB


gnb = GaussianNB()
gnb.fit(x_train,y_train)
gnb.score(x_test,y_test)*100 #100.0
gnb.score(x_train,y_train)*100 #97.5

gnb.predict([[6.17,5.17]]) # array([0],dtype=int64) # 6.17 5.17 0


plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=gnb)
plt.show()

mnb = MultinomialNB()
mnb.fit(x_train,y_train)
mnb.score(x_test,y_test)*100 #75.0
mnb.score(x_train,y_train)*100 #73.75

plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=mnb)
plt.show()

bnb = BernoulliNB()
bnb.fit(x_train,y_train)
bnb.score(x_test,y_test)*100 #50.0
bnb.score(x_train,y_train)*100 #50.0

plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=bnb)
plt.show()

#==> DECISION TREE (REGRESSION):


. Decision Tree is a Supervised learning technique that can be used for both Classification and Regression problems, but mostly it is preferred for solving Classification problems.

. In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.

-> important Terminology related to Decision Trees:


- Root Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
- Splitting: It is the process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
- Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
- Pruning: When we remove sub-nodes of a decision node, this process is called pruning. It is the opposite of splitting.
- Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
- Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

-> Attribute Selection Measures (ASM)


- With these measurements, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

. Information Gain
. Entropy / Gini Index

=> Entropy : Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data.

Entropy(S) = -P(yes) * log2(P(yes)) - P(no) * log2(P(no))

where:
S = total set of samples
P(yes) = probability of yes
P(no) = probability of no

=> Information Gain : Information gain is the measurement of the change in entropy after segmenting a dataset based on an attribute. It calculates how much information a feature provides us about a class.

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
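
A small sketch of the two formulas above on a hypothetical yes/no split (the counts are made up for illustration):

import numpy as np

def entropy(p_yes):
    # Entropy(S) = -P(yes)*log2(P(yes)) - P(no)*log2(P(no))
    p_no = 1 - p_yes
    return sum(-p * np.log2(p) for p in (p_yes, p_no) if p > 0)

# hypothetical parent node: 9 yes / 5 no
parent = entropy(9 / 14)

# hypothetical split into two children: (6 yes, 2 no) and (3 yes, 3 no)
left, right = entropy(6 / 8), entropy(3 / 6)
weighted = (8 / 14) * left + (6 / 14) * right

info_gain = parent - weighted  # Information Gain = Entropy(S) - weighted child entropy
print(info_gain)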

## DECISION TREE (Classification)(Practical):
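
The notes stop at this heading, so the following is only a sketch of what the practical would likely look like, reusing the placement.csv columns (cgpa, score, placed) from the earlier examples:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

dataset = pd.read_csv(r"placement.csv")
x = dataset.iloc[:, :-1]
y = dataset["placed"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # entropy = the ASM discussed above
dt.fit(x_train, y_train)
dt.score(x_test, y_test) * 100

# visualize the fitted tree: each node shows the split, entropy, and class counts
plt.figure(figsize=(10, 6))
plot_tree(dt, feature_names=list(x.columns), class_names=["not placed", "placed"], filled=True)
plt.show()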
