
ML LAB R22

List of Experiments

1. Write a Python program to compute Central Tendency Measures: Mean, Median, Mode; Measures of Dispersion: Variance, Standard Deviation
2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
3. Study of Python Libraries for ML application such as Pandas and Matplotlib
4. Write a Python program to implement Simple Linear Regression
5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn
6. Implementation of Decision tree using sklearn and its parameter tuning
7. Implementation of KNN using sklearn
8. Implementation of Logistic Regression using sklearn
9. Implementation of K-Means Clustering
10. Performance analysis of Classification Algorithms on a specific dataset (Mini Project)
WEEK-1

Write a Python program to compute Central Tendency Measures: Mean, Median, Mode; Measures of Dispersion: Variance, Standard Deviation

Measures of Central Tendency

The following definitions of mean, median, and mode state in words how each is calculated.

Mean: The mean of a data set is the average value. Is that circular? Let me state this in the form of an algorithm. To find the mean of a data set, first sum up all of the values in the data set, then divide by the number of data points, or examples, in the data set. When we find or know the mean of an entire population, we call that mean a parameter of the population and it is assigned the symbol μ. When the mean is of a sample, it is a statistic of the sample and is assigned the symbol x̄.

Median: The median of a data set is simply the data point that occurs exactly in the middle of the data set.
To find the median first the data needs to be sorted and listed in ascending or descending order. Then if
there is an odd number of data points in the set the median is in the exact middle of this list. If there is an
even number of data points in the list then the median is the average of the two middle values in the list.

Mode: The mode is the value in a data set that is repeated the most often. Below the mean, median, and
mode are calculated for a dummy data set.

Measures of Variability

The following definitions of variance and standard deviation state in words how each is calculated.

Variance: The variance of a data set is found by first finding the deviation of each element in the data set from the mean. These deviations are squared and then added together. (Why are they squared?) Finally, the sum of squared deviations is normalized by the number of elements in the population, N, for the population variance, or by the number of elements in the sample minus one, N − 1, for the sample variance. (Why is a sample variance normalized by N − 1?)

Standard Deviation: Once the variance is in hand, the standard deviation is easy to find. It is simply the square root of the variance. The symbol for population standard deviation is σ, while the symbol for sample standard deviation is s or sd.
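To see the N versus N − 1 distinction in code, here is a minimal sketch using NumPy's ddof ("delta degrees of freedom") argument; the data values are arbitrary:

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # arbitrary sample data

# Population variance/std: normalize by N (ddof=0, the NumPy default)
print(np.var(data, ddof=0), np.std(data, ddof=0))   # 4.0 2.0

# Sample variance/std: normalize by N-1 (ddof=1)
print(np.var(data, ddof=1), np.std(data, ddof=1))   # 4.571... 2.138...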

Program :

import numpy as np

x = np.array([56, 31, 56, 8, 32])
print("mean value of array x", np.mean(x))
print("mean value of array x", x.mean())  # same result via the ndarray method
print("x median", np.median(x))

scores = np.array([89, 73, 84, 91, 87, 77, 94, 67])
print("score median:", np.median(scores))

print("argument position for minimum value in unsorted array", np.argmin(x))
print("argument position for maximum value in unsorted array", np.argmax(x))
print("Average value", np.average(x))
print("sum", np.sum(x))
print("Maximum value", np.max(x))
print("minimum", np.min(x))

y = np.array([56, 31, 0, 56, 0, 8, 88, 0, 32])
print("Places where non-zero values are there", np.nonzero(y))

import matplotlib.pyplot as plt

y = np.array([650, 450, 275, 350, 387, 575, 555, 649, 361])
print("mean is :", np.mean(y))
print("median is :", np.median(y))
data_range = np.max(y) - np.min(y)  # renamed so the built-in range() is not shadowed
print("range is :", data_range)
plt.bar(y, y, color='b')                      # the data values
plt.bar(np.mean(y), y, color='g', width=5)    # the mean
plt.bar(np.median(y), y, color='r', width=5)  # the median
plt.bar(data_range, y, color='m', width=4)    # the range
plt.show()

a = np.array([[10, 7, 4], [3, 2, 1], [22, 55, 44], [15, 16, 11], [30, 20, 40]])
print(a)
print("variance of array across columns:", np.var(a, axis=0))
print("standard deviation across columns:", np.std(a, axis=0))
print("variance of array across rows :", np.var(a, axis=1))
print("standard deviation across rows :", np.std(a, axis=1))
b = np.array([2, 3, 4, 5, 6, 7, 8, 9])
print("variance:", np.var(b))
print("std :", np.std(b))
WEEK-2

2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy

Statistics with Python

Statistics, in general, is the method of collection of data, tabulation, and interpretation of numerical data.
It is an area of applied mathematics concerned with data collection analysis, interpretation, and
presentation. With statistics, we can see how data can be used to solve complex problems.

Measure of Central Tendency

The measure of central tendency is a single value that attempts to describe the whole set of data. The statistics module provides the following measures of central tendency (median_low and median_high are illustrated in the sketch right after this list):
 Mean
 Median
 Median Low
 Median High
 Mode
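Since median_low and median_high do not get their own sections below, here is a minimal sketch of both (the data values are arbitrary). For an even-length data set, median_low() returns the smaller of the two middle values, median_high() returns the larger, and median() returns their average.

from statistics import median, median_low, median_high

data = [1, 3, 5, 7]       # arbitrary even-length data set
print(median(data))       # 4.0 (average of the two middle values 3 and 5)
print(median_low(data))   # 3
print(median_high(data))  # 5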

Mean

It is the sum of the observations divided by the total number of observations. It is also defined as the average, which is the sum divided by the count.

Mean (x̄) = Σx / n

The mean() function returns the mean or average of the data passed in its arguments. If the passed argument
is empty, StatisticsError is raised.

import statistics
li = [1, 2, 3, 3, 2, 2, 2, 1]
print ("The average of list values is :")
print (statistics.mean(li))

Output:
The average of list values is : 2

Median

It is the middle value of the data set. It splits the data into two halves. If the number of elements in the data
set is odd then the center element is the median and if it is even then the median would be the average of
two central elements.

from statistics import median


data1 = (2, 3, 4, 5, 7, 9, 11)
print("Median of data-set 1 is % s" % (median(data1)))

output:
Median of data-set 1 is 5
Mode

It is the value that has the highest frequency in the given data set.

from statistics import mode

data1 = (2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7)
print("Mode of data set 1 is % s" % (mode(data1)))

Output:
Mode of data set 1 is 5

Measure of Variability

The measure of variability is known as the spread of data or how well our data is distributed. The most
common variability measures are:
 Range
 Variance
 Standard deviation

Range

The difference between the largest and smallest data point in our data set is known as the range. The range
is directly proportional to the spread of data which means the bigger the range, the more the spread of data
and vice versa.
Range = Largest data value – smallest data value

We can calculate the maximum and minimum values using the max() and min() methods respectively.
arr = [1, 2, 3, 4, 5]

# Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)

# Difference of Max and Min
Range = Maximum - Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
    Maximum, Minimum, Range))
Output:
Maximum = 5, Minimum = 1 and Range = 4

Variance

It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring those differences, adding them all together, and then dividing by the number of data points in the data set.

σ² = Σ(x − μ)² / N
where N = number of terms
μ = mean

Note that statistics.variance() computes the sample variance, which divides by N − 1 (use statistics.pvariance() for the population variance).

from statistics import variance
sample1 = (1, 2, 5, 4, 8, 9, 12)
print("Variance of Sample1 is % s " % (variance(sample1)))

Output:
Variance of Sample1 is 15.80952380952381

Standard Deviation

It is defined as the square root of the variance. It is calculated by finding the mean, subtracting each number from the mean, and squaring the result; then adding all the values, dividing by the number of terms, and taking the square root.

σ = √( Σ(x − μ)² / N )
where N = number of terms
μ = mean

As with variance(), statistics.stdev() computes the sample standard deviation (N − 1 in the denominator); statistics.pstdev() gives the population version.

from statistics import stdev


sample1 = (1, 2, 5, 4, 8, 9, 12)
print("The Standard Deviation of Sample1 is % s"
% (stdev(sample1)))

Output:
The Standard Deviation of Sample1 is 3.9761191895520196
math — Mathematical functions
This module provides access to the mathematical functions defined by the C standard.
 math.ceil(x)
Return the ceiling of x, the smallest integer greater than or equal to x.

import math
math.ceil(22.3)

output:
23

 math.fabs(x)
Return the absolute value of x

import math
math.fabs(-22.3)

output:
22.3
 math.factorial(n)
Return n factorial as an integer. Raises ValueError if n is not integral or is negative.

n=5
math.factorial(n)

output:
120
 math.floor(x)
Return the floor of x, the largest integer less than or equal to x

 math.floor(22.3)
22
 math.fmod(x, y)
Return fmod(x, y), as defined by the platform C library

 math.fmod(3, 2)
1.0

 math.isnan(x)
Return True if x is a NaN (not a number), and False otherwise.
 math.trunc(x)
Return x with the fractional part removed, keeping the integer part.

 math.trunc(3.5)
3

 math.cbrt(x)
Return the cube root of x.

 math.exp(x)
Return e raised to the power x, where e = 2.718281… is the base of natural logarithms. This is
usually more accurate than math.e ** x or pow(math.e, x).
 math.exp2(x)
Return 2 raised to the power x.
 math.pow(x, y)
Return x raised to the power y.

 math.sqrt(x)
Return the square root of x.
 math.degrees(x)
Convert angle x from radians to degrees.
 math.radians(x)
Convert angle x from degrees to radians.
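A short runnable sketch exercising several of the functions above (the input values are arbitrary):

import math

print(math.floor(22.3))       # 22
print(math.trunc(3.5))        # 3
print(math.fmod(3, 2))        # 1.0
print(math.sqrt(16))          # 4.0
print(math.pow(2, 10))        # 1024.0
print(math.degrees(math.pi))  # 180.0
print(math.radians(180))      # 3.141592653589793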
NumPy
NumPy is the fundamental package for scientific computing in Python. It is a Python library
that provides a multidimensional array object, various derived objects for fast operations
on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O,
discrete Fourier transforms, basic linear algebra, basic statistical operations, random
simulation and much more. At the core of the NumPy package is the ndarray object. This
encapsulates n-dimensional arrays of homogeneous data types.

Advantages:
● NumPy arrays have a fixed size at creation, unlike Python lists (which can grow
dynamically). Changing the size of an ndarray will create a new array and delete the
original.
● The elements in a NumPy array are all required to be of the same data type, and thus
will be the same size in memory.
● NumPy arrays facilitate advanced mathematical and other types of operations on large
numbers of data. Typically, such operations are executed more efficiently and with
less code than is possible using Python’s built-in sequences.

Numpy basics:
NumPy's main object is the homogeneous multidimensional array. It is a table of
elements (usually numbers), all of the same type. In NumPy, dimensions are called axes.
The array function transforms sequences of sequences into two-dimensional arrays,
sequences of sequences of sequences into three-dimensional arrays, and so on.
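For example (a minimal sketch), a flat list becomes a one-dimensional array and a list of lists becomes a two-dimensional array:

import numpy as np

a = np.array([1, 2, 3])               # 1-D array from a sequence
b = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D array from a sequence of sequences
print(a.ndim, b.ndim)                 # 1 2
print(b.shape)                        # (2, 3)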

The type of the array can also be explicitly specified at creation time:
>>> c = np.array([[1, 2], [3, 4]], dtype=float)
>>> print(c)

[[1. 2.]
[3. 4.]]

>>> c.dtype
dtype('float64')

The function zeros creates an array full of zeros, the function ones creates an array full of
ones, and the function empty creates an array whose initial content is random and depends
on the state of the memory. By default, the dtype of the created array is float64, but it can
be specified via the keyword argument dtype.

To create sequences of numbers, NumPy provides the arange function which is analogous
to the Python built-in range, but returns an array.
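A minimal sketch of these creation functions:

import numpy as np

print(np.zeros((2, 3)))                 # 2x3 array of 0.
print(np.ones((2, 3), dtype=np.int16))  # 2x3 array of 1, with an explicit dtype
print(np.empty((2, 2)))                 # contents are uninitialized and vary
print(np.arange(10, 30, 5))             # [10 15 20 25]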
Printing arrays

When you print an array, NumPy displays it in a similar way to nested lists, but with the
following layout:

● the last axis is printed from left to right,
● the second-to-last is printed from top to bottom,
● the rest are also printed from top to bottom, with each slice separated from the next
by an empty line.

One-dimensional arrays are then printed as rows, bidimensionals as matrices and
tridimensionals as lists of matrices.

If an array is too large to be printed, NumPy automatically skips the central part of the array
and only prints the corners:
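For instance (a minimal sketch):

import numpy as np

print(np.arange(10000).reshape(100, 100))
# [[   0    1    2 ...   97   98   99]
#  [ 100  101  102 ...  197  198  199]
#  ...
#  [9900 9901 9902 ... 9997 9998 9999]]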

Basic operations

Arithmetic operators on arrays apply elementwise. A new array is created and filled with the
result.
Unlike in many matrix languages, the product operator * operates elementwise in NumPy
arrays. The matrix product can be performed using the @ operator (in python >=3.5) or
the dot function or method:

Many unary operations, such as computing the sum of all the elements in the array, are
implemented as methods of the ndarray class.

By default, these operations apply to the array as though it were a list of numbers, regardless
of its shape. However, by specifying the axis parameter you can apply an operation along
the specified axis of an array:
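A short sketch of these points, using two small arrays:

import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

print(A * B)      # elementwise product: [[2 0] [0 4]]
print(A @ B)      # matrix product:      [[5 4] [3 4]]
print(A.dot(B))   # the same matrix product via the dot method

b = np.arange(12).reshape(3, 4)
print(b.sum())        # sum of all elements: 66
print(b.sum(axis=0))  # sum of each column: [12 15 18 21]
print(b.min(axis=1))  # min of each row: [0 4 8]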
Indexing, slicing and iterating

One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other
Python sequences.
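For example (a minimal sketch):

import numpy as np

a = np.arange(10) ** 2  # [ 0  1  4  9 16 25 36 49 64 81]
print(a[2])             # 4
print(a[2:5])           # [ 4  9 16]
print(a[::-1])          # the array reversed
for value in a[:3]:
    print(value)        # iterate like any Python sequence: 0 1 4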
WEEK-3

3. Study of Python Libraries for ML application such as Pandas and Matplotlib

PANDAS

pandas is a data manipulation package in Python for tabular data. That is, data in the form of rows and
columns, also known as DataFrames.
pandas’ functionality ranges from data transformations, like sorting rows and taking subsets, to calculating
summary statistics such as the mean, reshaping DataFrames, and joining DataFrames together. pandas
works well with other popular Python data science packages.
pandas is used throughout the data analysis workflow. With pandas, you can:

 Import datasets from databases, spreadsheets, comma-separated values (CSV) files, and more.
 Clean datasets, for example, by dealing with missing values.
 Tidy datasets by reshaping their structure into a suitable format for analysis.
 Aggregate data by calculating summary statistics such as the mean of columns, correlation between
them, and more.
 Visualize datasets and uncover insights.
Install pandas
Installing pandas is straightforward; just use the pip install command in your terminal.

pip install pandas

Importing data in pandas

To begin working with pandas, import the pandas Python package as shown below. When importing
pandas, the most common alias for pandas is pd.

import pandas as pd

Importing CSV files


Use read_csv() with the path to the CSV file to read a comma-separated values file

df = pd.read_csv("D:\ml 2024-25\diabetes.csv")

Importing text files

Reading text files is similar to reading CSV files. The only nuance is that you need to specify a separator with
the sep argument, as shown below. The separator argument refers to the symbol used to separate the values
within each row of a DataFrame. Comma (sep = ","), whitespace (sep = "\s"), tab (sep = "\t"), and colon (sep = ":") are the
commonly used separators. Here \s represents a single whitespace character.

df = pd.read_csv("diabetes.txt", sep="\s")

Importing Excel files (single sheet)

Reading Excel files (both XLS and XLSX) is as easy as the read_excel() function, using the file path as an
input.

df = pd.read_excel('diabetes.xlsx')
Importing JSON file

Similar to the read_csv() function, you can use read_json() for JSON file types with the JSON file name as
the argument (for more detail read this tutorial on importing JSON and HTML data into pandas). The
below code reads a JSON file from disk and creates a DataFrame object df.

df = pd.read_json("diabetes.json")

Outputting data in pandas

Just as pandas can import data from various file types, it also allows you to export data into various formats.
This happens especially when data is transformed using pandas and needs to be saved locally on your
machine. Below is how to output pandas DataFrames into various formats.

Outputting a DataFrame into a CSV file

A pandas DataFrame (here we are using df) is saved as a CSV file using the .to_csv() method. The
arguments include the filename with path and index – where index = True implies writing the DataFrame’s
index.

df.to_csv("diabetes_out.csv", index=False)

Outputting a DataFrame into a JSON file

Export DataFrame object into a JSON file by calling the .to_json() method.

df.to_json("diabetes_out.json")

Note: A JSON file stores a tabular object like a DataFrame as a key-value pair. Thus you would observe
repeating column headers in a JSON file.

Outputting a DataFrame into a text file

As with writing DataFrames to CSV files, you can call .to_csv(). The only differences are that the output
file format is in .txt, and you need to specify a separator using the sep argument.

df.to_csv('diabetes_out.txt', header=df.columns, index=None, sep=' ')

Outputting a DataFrame into an Excel file

Call .to_excel() from the DataFrame object to save it as a “.xls” or “.xlsx” file.

df.to_excel("diabetes_out.xlsx", index=False)

Viewing and understanding DataFrames using pandas


After reading tabular data as a DataFrame, you would need to have a glimpse of the data. You can either
view a small sample of the dataset or a summary of the data in the form of summary statistics.
How to view data using .head() and .tail()

You can view the first few or last few rows of a DataFrame using the .head() or .tail() methods, respectively.
You can specify the number of rows through the n argument (the default value is 5).

df.head()

df.tail(n = 10)

Understanding data using .describe()

The .describe() method prints the summary statistics of all numeric columns, such as count, mean, standard
deviation, range, and quartiles of numeric columns.

df.describe()

You can transpose the summary statistics with the .T attribute.

df.describe().T

Understanding data using .info()

The .info() method is a quick way to look at the data types, missing values, and data size of a DataFrame.
Here, we're setting the show_counts argument to True, which gives a count of the non-missing values
in each column. We're also setting memory_usage to True, which shows the total memory usage of the
DataFrame elements. When verbose is set to True, it prints the full summary from .info().

df.info(show_counts=True, memory_usage=True, verbose=True)

Understanding your data using .shape

The number of rows and columns of a DataFrame can be identified using the .shape attribute of the
DataFrame. It returns a tuple (rows, columns) and can be indexed to get only the row count or only the
column count as output.

df.shape # Get the number of rows and columns


df.shape[0] # Get the number of rows only
df.shape[1] # Get the number of columns only

(768,9)
768
9
Get all columns and column names

Calling the .columns attribute of a DataFrame object returns the column names in the form of
an Index object. As a reminder, a pandas index is the address/label of the row or column.

df.columns

It can be converted to a list using a list() function.

list(df.columns)

Checking for missing values in pandas with .isnull()

df2 = df.copy()
df2.loc[2:5,'Pregnancies'] = None
df2.head(7)

We can check whether each element in a DataFrame is missing using the .isnull() method.

df2.isnull().head(7)

isnull() with .sum() to count the number of nulls in each column.

df2.isnull().sum()

Pregnancies 4
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

Slicing and Extracting Data in pandas

The pandas package offers several ways to subset, filter, and isolate data in your DataFrames. Here, we'll
see the most common ways.
Isolating one column using [ ]

You can isolate a single column using square brackets [ ] with a column name inside. The output is a
pandas Series object. A pandas Series is a one-dimensional array containing data of any type, including
integer, float, string, boolean, Python objects, etc. A DataFrame is composed of many Series that act as
columns.

df['Outcome']

Isolating two or more columns using [[ ]]

You can also provide a list of column names inside the square brackets to fetch more than one column.
Here, square brackets are used in two different ways. We use the outer square brackets to indicate a subset
of a DataFrame, and the inner square brackets to create a list.

df[['Pregnancies', 'Outcome']]

Isolating one row using [ ]

A single row can be fetched by passing in a boolean series with one True value. In the example below, the
second row with index = 1 is returned. Here, .index returns the row labels of the DataFrame, and the
comparison turns that into a Boolean one-dimensional array.

df[df.index==1]

Isolating two or more rows using [ ]

Similarly, two or more rows can be returned using the .isin() method instead of a == operator.

df[df.index.isin(range(2,10))]

Using .loc[] and .iloc[] to fetch rows

You can fetch specific rows by labels or conditions using .loc[] and .iloc[] ("location" and "integer
location"). .loc[] uses a label to point to a row, column or cell, whereas .iloc[] uses the numeric position.
To understand the difference between the two, let’s modify the index of df2 created earlier.

df2.index = range(1,769)

The below example returns a pandas Series instead of a DataFrame. In .loc[1], the 1 is a row label, so the
first row (labeled 1) is returned; in .iloc[1], the 1 is the integer position, so the second row (labeled 2) is returned.

df2.loc[1]
Pregnancies 6.000
Glucose 148.000
BloodPressure 72.000
SkinThickness 35.000
Insulin 0.000
BMI 33.600
DiabetesPedigreeFunction 0.627
Age 50.000
Outcome 1.000
Name: 1, dtype: float64
df2.iloc[1]

Pregnancies 1.000
Glucose 85.000
BloodPressure 66.000
SkinThickness 29.000
Insulin 0.000
BMI 26.600
DiabetesPedigreeFunction 0.351
Age 31.000
Outcome 0.000
Name: 2, dtype: float64
You can also fetch multiple rows by providing a range in square brackets.

df2.loc[100:110]

df2.iloc[100:110]
You can also select specific columns along with rows. This is where .iloc[] is different from .loc[] – it
requires column location and not column labels.

df2.loc[100:110, ['Pregnancies', 'Glucose', 'BloodPressure']]

Conditional slicing (that fits certain conditions)

pandas lets you filter data by conditions over row/column values. For example, the below code selects the
row where Blood Pressure is exactly 122. Here, we are isolating rows using the brackets [ ] as seen in
previous sections. However, instead of inputting row indices or column names, we are inputting a condition
where the column BloodPressure is equal to 122. We denote this condition using df.BloodPressure == 122.
df[df.BloodPressure == 122]

The below example fetches all rows where Outcome is 1. Here df.Outcome selects that column, df.Outcome
== 1 returns a Series of Boolean values determining which Outcomes are equal to 1, then [ ] takes a subset
of df where that Boolean Series is True.

df[df.Outcome == 1]

You can use a > operator to draw comparisons. The below code fetches Pregnancies, Glucose,
and BloodPressure for all records with BloodPressure greater than 100.

df.loc[df['BloodPressure'] > 100, ['Pregnancies', 'Glucose', 'BloodPressure']]

Cleaning data using pandas

Data cleaning is one of the most common tasks in data science. pandas lets you preprocess data for any use,
including but not limited to training machine learning and deep learning models. Let’s use the
DataFrame df2 from earlier, having four missing values, to illustrate a few data cleaning use cases. As a
reminder, here's how you can see how many missing values are in a DataFrame.

df2.isnull().sum()

Pregnancies 4
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

Dealing with missing data technique #1: Dropping missing values


One way to deal with missing data is to drop it. This is particularly useful in cases where you have plenty
of data and losing a small portion won’t impact the downstream analysis. You can use a .dropna() method
as shown below. Here, we are saving the results from .dropna() into a DataFrame df3.
df3 = df2.copy()
df3 = df3.dropna()
df3.shape

(764, 9) # this is 4 rows less than df2

The axis argument lets you specify whether you are dropping rows, or columns, with missing values. The
default axis removes the rows containing NaNs. Use axis = 1 to remove the columns with one or more NaN
values. Also, notice how we are using the argument inplace=True which lets you skip saving the output
of .dropna() into a new DataFrame.

df3 = df2.copy()
df3.dropna(inplace=True, axis=1)
df3.head()

You can also drop rows only when all of their values are missing by setting the how argument to 'all' (the default, how='any', drops a row that contains at least one NaN).

df3 = df2.copy()
df3.dropna(inplace=True, how='all')

Dealing with missing data technique #2: Replacing missing values

Instead of dropping, replacing missing values with a summary statistic or a specific value (depending on
the use case) may be the best way to go. For example, if there is one missing row from a temperature column
denoting temperatures throughout the days of the week, replacing that missing value with the average
temperature of that week may be more effective than dropping values completely. You can replace the
missing data with the row, or column mean using the code below.

df3 = df2.copy()
# Get the mean of Pregnancies
mean_value = df3['Pregnancies'].mean()
# Fill missing values using .fillna()
df3 = df3.fillna(mean_value)

Dealing with Duplicate Data


Let's add some duplicates to the original data to learn how to eliminate duplicates in a DataFrame. Here,
we are using the .concat() method to concatenate the rows of the df2 DataFrame to the df2 DataFrame,
adding perfect duplicates of every row in df2.

df3 = pd.concat([df2, df2])


df3.shape
(1536, 9)

You can remove all duplicate rows (default) from the DataFrame using .drop_duplicates() method.

df3 = df3.drop_duplicates()
df3.shape
(768, 9)
Renaming columns

A common data cleaning task is renaming columns. With the .rename() method, you can use columns as an
argument to rename specific columns. The below code shows the dictionary for mapping old and new
column names.

df3.rename(columns = {'DiabetesPedigreeFunction':'DPF'}, inplace = True)


df3.head()

You can also rename every column at once by assigning a list of new names to the .columns attribute; the list length must match the number of columns.

df3.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DPF', 'Age', 'Outcome']
df3.head()

Data analysis in pandas

The main value proposition of pandas lies in its quick data analysis functionality. In this section, we'll focus
on a set of analysis techniques you can use in pandas.
Summary operators (mean, mode, median)
As you saw earlier, you can get the mean of each column value using the .mean() method.

df.mean()

A mode can be computed similarly using the .mode() method.


df.mode()

Similarly, the median of each column is computed with the .median() method
df.median()

Create new columns based on existing columns

pandas provides fast and efficient computation by combining two or more columns like scalar variables.
The below code divides each value in the column Glucose with the corresponding value in
the Insulin column to compute a new column named Glucose_Insulin_Ratio.
df2['Glucose_Insulin_Ratio'] = df2['Glucose']/df2['Insulin']
df2.head()

Counting using .value_counts()

Oftentimes you'll work with categorical values, and you'll want to count the number of observations each
category has in a column. Category values can be counted using the .value_counts() method. Here, for
example, we are counting the number of observations where Outcome is diabetic (1) and the number of
observations where the Outcome is non-diabetic (0).
df['Outcome'].value_counts()

Adding the normalize argument returns proportions instead of absolute counts.


df['Outcome'].value_counts(normalize=True)
Turn off automatic sorting of results using sort argument (True by default). The default sorting is based on
the counts in descending order.
df['Outcome'].value_counts(sort=False)

You can also apply .value_counts() to a DataFrame object and specific columns within it instead of just a
column. Here, for example, we are applying value_counts() on df with the subset argument, which takes in
a list of columns.
df.value_counts(subset=['Pregnancies', 'Outcome'])

Aggregating data with .groupby() in pandas


pandas lets you aggregate values by grouping them by specific column values. You can do that by
combining the .groupby() method with a summary method of your choice. The below code displays the
mean of each of the numeric columns grouped by Outcome.
df.groupby('Outcome').mean()

.groupby() enables grouping by more than one column by passing a list of column names, as shown below.
df.groupby(['Pregnancies', 'Outcome']).mean()

Any summary method can be used alongside .groupby(), including .min(), .max(), .mean(), .median(), .sum(), .mode(), and more.
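You can also apply a different summary to each column in one pass with .agg(); a minimal sketch using columns from the diabetes DataFrame above:

# Mean glucose and median age within each Outcome group
df.groupby('Outcome').agg({'Glucose': 'mean', 'Age': 'median'})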
Matplotlib
Matplotlib is an open-source drawing library that supports various drawing types. You can generate plots,
histograms, bar charts, and other types of charts with just a few lines of code.
Getting Started With Pyplot
Pyplot is a Matplotlib module that provides simple functions for adding plot elements, such as lines, images,
text, etc. to the axes in the current figure.
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.show()

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot([1, 2, 3, 4], [5, 6, 7, 8])
plt.plot([1, 2, 3, 4], [5, 6, 7, 8], linewidth=2, color='r', ls='dashed')

Bar Graphs:
import matplotlib.pyplot as plt
%matplotlib inline
x=[1,2,3,4,5]
y=[5,10,15,20,25]
plt.bar(x,y)

x=[1,2,3,4,5]
y=[5,10,15,20,25]
plt.bar(x,y)
plt.xlabel('sales')
plt.ylabel('year')
plt.title("SALES")

x1 = [6, 7, 8, 9, 10]    # second data set; values assumed so the example runs
y1 = [7, 12, 17, 22, 27]
plt.bar(x, y, color='r', label="first")
plt.bar(x1, y1, color='b', label="second")
plt.xlabel('sales')
plt.ylabel('year')
plt.title("SALES")
plt.legend()

plt.figure(figsize=(10, 5))  # sets the figure size (width, height) in inches
Histogram:
import numpy as np
male=np.arange(10,20,2)
female=np.arange(12,22,2)
plt.hist([male,female],rwidth=0.4,bins=4)
plt.hist([male,female],rwidth=0.4,bins=4,orientation="horizontal")

scatter plot:

x=[1,2,3,4,5,6,7]
y=[12,14,16,18,19,21,23]
plt.scatter(x,y)

x=[1,2,3,4,5,6,7]
y=[12,14,16,18,19,21,23]
plt.scatter(x,y,marker='+',color='r')

import matplotlib.pyplot as plt

x1 = [1, 2, 3, 4, 5, 6, 7]
y1 = [12, 14, 16, 18, 19, 21, 23]
x2 = [1, 2, 3, 4, 5, 6, 7]
y2 = [13, 15, 17, 18, 22, 24, 25]

plt.scatter(x1, y1, color='r', marker='+', label='Dataset 1')

# Scatter plot for the second dataset


plt.scatter(x2, y2, color='b', marker='+', label='Dataset 2')
# Adding labels and legend
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()

# Show the plot


plt.show()

EXERCISE
Use the sales data CSV file referenced in the solutions below (sales_data.csv). Read this file using Pandas or NumPy or using the in-built
matplotlib functions.
Exercise 1: Read Total profit of all months and show it using a line plot

Total profit data provided for each month. Generated line plot must include the following properties:

X label name = Month Number
Y label name = Total profit

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
profitList = df['total_profit'].tolist()
monthList = df['month_number'].tolist()
plt.plot(monthList, profitList, label='Month-wise Profit data of last year')
plt.xlabel('Month number')
plt.ylabel('Profit in dollar')
plt.xticks(monthList)
plt.title('Company profit per month')
plt.yticks([100000, 200000, 300000, 400000, 500000])
plt.show()
Exercise 2: Get total profit of all months and show line plot with the following Style properties
Generated line plot must include following Style properties: –

Line style should be dotted and line color should be red
Show legend at the lower right location.
X label name = Month Number
Y label name = Sold units number
Add a circle marker.
Line marker color as red
Line width should be 3

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
profitList = df['total_profit'].tolist()
monthList = df['month_number'].tolist()

plt.plot(monthList, profitList, label='Profit data of last year',
         color='r', marker='o', markerfacecolor='k',
         linestyle='--', linewidth=3)

plt.xlabel('Month Number')
plt.ylabel('Profit in dollar')
plt.legend(loc='lower right')
plt.title('Company Sales data of last year')
plt.xticks(monthList)
plt.yticks([100000, 200000, 300000, 400000, 500000])
plt.show()
Exercise 3: Read all product sales data and show it using a multiline plot

Display the number of units sold per month for each product using multiline plots. (i.e., Separate Plotline
for each product ).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
faceCremSalesData = df['facecream'].tolist()
faceWashSalesData = df['facewash'].tolist()
toothPasteSalesData = df['toothpaste'].tolist()
bathingsoapSalesData = df['bathingsoap'].tolist()
shampooSalesData = df['shampoo'].tolist()
moisturizerSalesData = df['moisturizer'].tolist()

plt.plot(monthList, faceCremSalesData, label='Face cream Sales Data', marker='o', linewidth=3)
plt.plot(monthList, faceWashSalesData, label='Face Wash Sales Data', marker='o', linewidth=3)
plt.plot(monthList, toothPasteSalesData, label='ToothPaste Sales Data', marker='o', linewidth=3)
plt.plot(monthList, bathingsoapSalesData, label='Bathing soap Sales Data', marker='o', linewidth=3)
plt.plot(monthList, shampooSalesData, label='Shampoo Sales Data', marker='o', linewidth=3)
plt.plot(monthList, moisturizerSalesData, label='Moisturizer Sales Data', marker='o', linewidth=3)

plt.xlabel('Month Number')
plt.ylabel('Sales units in number')
plt.legend(loc='upper left')
plt.xticks(monthList)
plt.yticks([1000, 2000, 4000, 6000, 8000, 10000, 12000, 15000, 18000])
plt.title('Sales data')
plt.show()
Exercise 4: Read toothpaste sales data of each month and show it using a scatter plot
Also, add a grid in the plot. The gridline style should be "--".

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
toothPasteSalesData = df['toothpaste'].tolist()
plt.scatter(monthList, toothPasteSalesData, label='Tooth paste Sales data')
plt.xlabel('Month Number')
plt.ylabel('Number of units Sold')
plt.legend(loc='upper left')
plt.title(' Tooth paste Sales data')
plt.xticks(monthList)
plt.grid(True, linewidth= 1, linestyle="--")
plt.show()
Exercise 5: Read face cream and facewash product sales data and show it using the bar chart
The bar chart should display the number of units sold per month for each product. Add a separate
bar for each product in the same chart.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
faceCremSalesData = df['facecream'].tolist()
faceWashSalesData = df['facewash'].tolist()

# With align='edge', a negative width draws the bar to the left of the x position
plt.bar([a - 0.25 for a in monthList], faceCremSalesData, width=0.25, label='Face Cream sales data', align='edge')
plt.bar([a + 0.25 for a in monthList], faceWashSalesData, width=-0.25, label='Face Wash sales data', align='edge')
plt.xlabel('Month Number')
plt.ylabel('Sales units in number')
plt.legend(loc='upper left')
plt.title(' Sales data')

plt.xticks(monthList)
plt.grid(True, linewidth= 1, linestyle="--")
plt.title('Facewash and facecream sales data')
plt.show()
Exercise 6: Read sales data of bathing soap of all months and show it using a bar chart. Save this plot
to your hard disk

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
bathingsoapSalesData = df['bathingsoap'].tolist()
plt.bar(monthList, bathingsoapSalesData)
plt.xlabel('Month Number')
plt.ylabel('Sales units in number')
plt.title(' Sales data')
plt.xticks(monthList)
plt.grid(True, linewidth= 1, linestyle="--")
plt.title('bathingsoap sales data')
plt.savefig("D:\\Python\\Articles\\matplotlib\\sales_data_of_bathingsoap.png", dpi=150)  # double backslashes so they are not parsed as escape sequences
plt.show()
Exercise 7: Read the total profit of each month and show it using the histogram to see the most
common profit ranges

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
profitList = df['total_profit'].tolist()
labels = ['low', 'average', 'Good', 'Best']
profit_range = [150000, 175000, 200000, 225000, 250000, 300000, 350000]
plt.hist(profitList, profit_range, label='Profit data')
plt.xlabel('profit range in dollar')
plt.ylabel('Actual Profit in dollar')
plt.legend(loc='upper left')
plt.xticks(profit_range)
plt.title('Profit data')
plt.show()
Exercise 8: Calculate total sale data for last year for each product and show it using a Pie chart
Note: In Pie chart display Number of units sold per year for each product in percentage.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()

labels = ['FaceCream', 'FaceWash', 'ToothPaste', 'Bathing soap', 'Shampoo', 'Moisturizer']
salesData = [df['facecream'].sum(), df['facewash'].sum(), df['toothpaste'].sum(),
             df['bathingsoap'].sum(), df['shampoo'].sum(), df['moisturizer'].sum()]
plt.axis("equal")
plt.pie(salesData, labels=labels, autopct='%1.1f%%')
plt.legend(loc='lower right')
plt.title('Sales data')
plt.show()
Exercise 9: Read Bathing soap facewash of all months and display it using the Subplot

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
bathingsoap = df['bathingsoap'].tolist()
faceWashSalesData = df['facewash'].tolist()

f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(monthList, bathingsoap, label='Bathingsoap Sales Data', color='k', marker='o', linewidth=3)
axarr[0].set_title('Sales data of a Bathingsoap')
axarr[1].plot(monthList, faceWashSalesData, label='Face Wash Sales Data', color='r', marker='o', linewidth=3)
axarr[1].set_title('Sales data of a facewash')

plt.xticks(monthList)
plt.xlabel('Month Number')
plt.ylabel('Sales units in number')
plt.show()
Exercise Question 10: Read all product sales data and show it using the stack plot

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()

faceCremSalesData = df['facecream'].tolist()
faceWashSalesData = df['facewash'].tolist()
toothPasteSalesData = df['toothpaste'].tolist()
bathingsoapSalesData = df['bathingsoap'].tolist()
shampooSalesData = df['shampoo'].tolist()
moisturizerSalesData = df['moisturizer'].tolist()

# The empty line plots below exist only to create legend entries for the stacked areas
plt.plot([], [], color='m', label='face Cream', linewidth=5)
plt.plot([], [], color='c', label='Face wash', linewidth=5)
plt.plot([], [], color='r', label='Tooth paste', linewidth=5)
plt.plot([], [], color='k', label='Bathing soap', linewidth=5)
plt.plot([], [], color='g', label='Shampoo', linewidth=5)
plt.plot([], [], color='y', label='Moisturizer', linewidth=5)

plt.stackplot(monthList, faceCremSalesData, faceWashSalesData, toothPasteSalesData,
              bathingsoapSalesData, shampooSalesData, moisturizerSalesData,
              colors=['m', 'c', 'r', 'k', 'g', 'y'])

plt.xlabel('Month Number')
plt.ylabel('Sales units in Number')
plt.title('All product sales data using stack plot')
plt.legend(loc='upper left')
plt.show()
WEEK-4

4. Write a Python program to implement Simple Linear Regression

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

df=pd.read_csv( "/house_prices.csv")
df

area price

0 2600 550000
1 3000 565000
2 3200 610000
3 3600 680000
4 4000 725000
%matplotlib inline
plt.xlabel('area(sqr ft)')
plt.ylabel('prices')
plt.scatter(df.area,df.price,color='red',marker='+')
new_df=df.drop('price',axis='columns')
new_df

area

0 2600

1 3000

2 3200

3 3600

4 4000
df

area price

0 2600 550000

1 3000 565000

2 3200 610000

3 3600 680000

4 4000 725000
price=df.price
price

0 550000
1 565000
2 610000
3 680000
4 725000
Name: price, dtype: int64

reg=linear_model.LinearRegression()
reg.fit(new_df,price)

reg.predict([[3300]])

array([628715.75342466])

reg.coef_
array([135.78767123])

reg.intercept_
180616.43835616432
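The prediction can be verified by hand from the fitted parameters, since the model is simply price = coef × area + intercept:

# 135.78767123 * 3300 + 180616.43835616432 = 628715.75342466,
# which matches reg.predict([[3300]]) above
print(reg.coef_[0] * 3300 + reg.intercept_)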
WEEK-6

6. Implementation of Decision tree using sklearn and its parameter tuning

Importing Required Libraries

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

Loading Data

We load the Pima Indians Diabetes dataset using pandas' read_csv function; the dataset can be downloaded from Kaggle. Because the CSV has no header row, the column names are supplied explicitly.

# load dataset
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)

pima.head()

#split dataset in features and target variable


feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

# Check for columns with object (string) dtype in X


object_cols = X.select_dtypes(include=['object']).columns

# If there are object columns, convert them to numerical using one-hot encoding or other methods
if len(object_cols) > 0:
# Example using one-hot encoding with pandas get_dummies
X = pd.get_dummies(X, columns=object_cols)

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training, 30% test
# Create Decision Tree classifer object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifer


clf = clf.fit(X_train,y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.6363636363636364
Next, we perform hyperparameter tuning using GridSearchCV to improve performance.
from sklearn.model_selection import GridSearchCV

# Create Decision Tree classifer object


clf = DecisionTreeClassifier()

param_grid = {
'max_depth': [None, 5, 10, 15, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'criterion': ['gini', 'entropy'],
'max_features': [None, 'sqrt', 'log2']
}

# Create GridSearchCV object with the classifier and parameter grid
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid)

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found


print("Best parameters found:", grid_search.best_params_)

Best parameters found: {'criterion': 'entropy', 'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 4,
'min_samples_split': 10}

# Initialize GridSearchCV with 5-fold cross-validation
dt = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')
# Fit the model to the training data. This step is crucial for finding the best estimator.
grid_search.fit(X_train, y_train)

# Retrieve the best model from GridSearchCV


best_dt = grid_search.best_estimator_

### Step 3: Train and Evaluate the Tuned Model ###


# Make predictions with the tuned model
from sklearn.metrics import accuracy_score
y_pred_tuned = best_dt.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("Accuracy with tuned parameters:", accuracy_tuned)

Accuracy with tuned parameters: 0.670995670995671

Explanation of the Code

1. Loading Data:
o The data is loaded from a CSV file, and the features (X) and target (y) are separated.
2. Train with Default Parameters:
o A basic DecisionTreeClassifier model is created with default settings.
o The model is trained on X_train and y_train, and accuracy is evaluated on the X_test set to
get a baseline score (accuracy_default).
3. Hyperparameter Tuning:
o A param_grid dictionary is defined to explore different values for key parameters:
 max_depth: Controls the depth of the tree.
 min_samples_split: Minimum number of samples required to split a node.
 min_samples_leaf: Minimum number of samples required at a leaf node.
 criterion: Splitting criterion, either "gini" or "entropy."
 max_features: Maximum number of features considered for each split.
o GridSearchCV performs cross-validation to find the best parameters from this grid.
o The best parameters and model are printed and stored as best_dt.
4. Train and Evaluate the Tuned Model:
o The best_dt model (with optimal hyperparameters) is evaluated on X_test.
o The resulting accuracy (accuracy_tuned) should ideally be an improvement over the initial
default accuracy.
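As a follow-up, the tuned tree can be visualized with sklearn's plot_tree; a minimal sketch, assuming best_dt and feature_cols from the steps above are in scope:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(best_dt, feature_names=feature_cols, class_names=['0', '1'],
          filled=True, max_depth=2)  # draw only the top levels for readability
plt.show()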
WEEK-7

7. Implementation of KNN using sklearn

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()

type(iris)

iris.data

array([[5.1, 3.5, 1.4, 0.2],


[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])

print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

print(iris.target_names)

['setosa' 'versicolor' 'virginica']


plt.scatter(iris.data[:,1],iris.data[:,2],c=iris.target, cmap=plt.cm.Paired)
plt.xlabel(iris.feature_names[1])
plt.ylabel(iris.feature_names[2])
plt.show()

plt.scatter(iris.data[:,0],iris.data[:,3],c=iris.target, cmap=plt.cm.Paired)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[3])
plt.show()
p = iris.data

q = iris.target

print(p.shape)
print(q.shape)

(150, 4)
(150,)

# Instead of importing from sklearn.cross_validation, import from sklearn.model_selection


from sklearn.model_selection import train_test_split
p_train, p_test, q_train, q_test = train_test_split(p, q, test_size=0.2, random_state=4)

print(p_train.shape)
print(p_test.shape)

(120, 4)
(30, 4)

print(q_train.shape)
print(q_test.shape)

(120,)
(30,)

from sklearn.neighbors import KNeighborsClassifier

from sklearn import metrics

k_range = range(1, 25)
scores = {}
scores_list = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(p_train, q_train)
    q_pred = knn.predict(p_test)
    scores[k] = metrics.accuracy_score(q_test, q_pred)
    scores_list.append(metrics.accuracy_score(q_test, q_pred))

scores

{1: 0.9333333333333333,
2: 0.9333333333333333,
3: 0.9666666666666667,
4: 0.9666666666666667,
5: 0.9666666666666667,
6: 0.9666666666666667,
7: 0.9666666666666667,
8: 0.9666666666666667,
9: 0.9666666666666667,
10: 0.9666666666666667,
11: 0.9666666666666667,
12: 0.9666666666666667,
13: 0.9666666666666667,
14: 0.9666666666666667,
15: 0.9666666666666667,
16: 0.9666666666666667,
17: 0.9666666666666667,
18: 0.9666666666666667,
19: 0.9666666666666667,
20: 0.9333333333333333,
21: 0.9666666666666667,
22: 0.9333333333333333,
23: 0.9666666666666667,
24: 0.9666666666666667}

#plot the relationship between K and the testing accuracy


plt.plot(k_range,scores_list)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')

Text(0, 0.5, 'Testing Accuracy')


knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(p,q)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',


metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')

#0 = setosa, 1=versicolor, 2=virginica


classes = {0:'setosa',1:'versicolor',2:'virginica'}

x_new = [[5.1, 3.5, 1.4, 0.2],
         [5.9, 3. , 5.1, 1.8]]
y_predict = knn.predict(x_new)
y_predict = knn.predict(x_new)

print(classes[y_predict[0]])
print(classes[y_predict[1]])

setosa
virginica
WEEK-8

8. Implementation of Logistic Regression using sklearn

Diabetes Prediction

Importing the Dependencies:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
Loading the dataset:
dataframe = pd.read_csv("/content/diabetes.csv")
Data Analysis:
dataframe.head()
Output:

dataframe.columns
Output:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
Datatype of each feature:
dataframe.dtypes

Finding null or missing values:


dataframe.isnull().sum()
Finding correlation:
dataframe.corr()

Replacing the zeros with mean and median:


dataframe['Insulin'] = dataframe['Insulin'].replace(0, dataframe['Insulin'].median())
dataframe['Pregnancies'] = dataframe['Pregnancies'].replace(0, dataframe['Pregnancies'].median())
dataframe['Glucose'] = dataframe['Glucose'].replace(0, dataframe['Glucose'].mean())
dataframe['BloodPressure'] = dataframe['BloodPressure'].replace(0, dataframe['BloodPressure'].mean())
dataframe['SkinThickness'] = dataframe['SkinThickness'].replace(0, dataframe['SkinThickness'].median())
dataframe['BMI'] = dataframe['BMI'].replace(0, dataframe['BMI'].mean())
dataframe['DiabetesPedigreeFunction'] = dataframe['DiabetesPedigreeFunction'].replace(0, dataframe['DiabetesPedigreeFunction'].median())
dataframe['Age'] = dataframe['Age'].replace(0, dataframe['Age'].median())
dataframe.head(20)
Output:
x = dataframe.drop(columns='Outcome')
y = dataframe['Outcome']
To remove outliers using the IQR rule (the per-column masks are combined so that a row is kept only if it lies within bounds for every column):
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
        'DiabetesPedigreeFunction', 'Age']
mask = pd.Series(True, index=x.index)  # start by keeping every row
for col in cols:
    Q1 = x[col].quantile(0.25)
    Q3 = x[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    mask &= (x[col] >= lower_bound) & (x[col] <= upper_bound)
x_outlier_detection = x[mask]
y_outlier_detection = y[mask]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x_outlier_detection)
fig,ax = plt.subplots(figsize=(15,15))
sns.boxplot(data = X_scaled,ax=ax)
Output:
To resample the data (a train/test split is an assumed intermediate step here, since SMOTE should be applied to the training set only):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_outlier_detection, test_size=0.2, random_state=42)

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("\nResampled class distribution:")


print(pd.Series(y_train_resampled).value_counts())
Output:
Resampled class distribution:
Outcome
1 328
0 328
Name: count, dtype: int64
Training the LogisticRegression Algorithm:
from sklearn.linear_model import LogisticRegression

classification = LogisticRegression()

classification.fit(X_train_resampled, y_train_resampled)
y_predictions = classification.predict(X_test)
print(y_predictions)
Output:
[0 0 0 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0
0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0
1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0
0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 1
0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0]
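The predictions can then be scored against the held-out labels; a minimal sketch, assuming y_test and y_predictions from the steps above:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Accuracy:", accuracy_score(y_test, y_predictions))
print("Confusion matrix:\n", confusion_matrix(y_test, y_predictions))
print(classification_report(y_test, y_predictions))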
WEEK-9

9. Implementation of K-Means Clustering

Customer Behaviour Analysis based on K-Means Clustering Algorithm

Importing the Dependencies:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

Loading the dataset:

customer_data = pd.read_csv('Mall_Customers.csv')

Data Analysis:

customer_data.head()

Output:

customer_data.shape

Output:
(250, 5)

customer_data.info()
Output:
customer_data.isnull().sum()
Output:
0
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0

dtype: int64
Choosing the Annual income column & Spending score Column
X = customer_data.iloc[:,[3,4]].values
print(X)

Choosing the number of clusters:


WCSS -> Within Clusters Sum of Squares
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

Plotting an elbow graph:


sns.set()
plt.plot(range(1,11),wcss)
plt.title('The Elbow Point Graph')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
Output:
Optimum number of clusters = 5

Training the K-Means clustering model:


kmeans = KMeans(n_clusters = 5,init='k-means++',random_state=0)
Y = kmeans.fit_predict(X)
print(Y)
Output:
[3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0
2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Visualizing all the clusters:

plt.figure(figsize=(8,8))
plt.scatter(X[Y==0,0],X[Y==0,1],s=50,c='green',label='Cluster 1')
plt.scatter(X[Y==1,0],X[Y==1,1],s=50,c='red',label='Cluster 2')
plt.scatter(X[Y==2,0],X[Y==2,1],s=50,c='blue',label='Cluster 3')
plt.scatter(X[Y==3,0],X[Y==3,1],s=50,c='violet',label='Cluster 4')
plt.scatter(X[Y==4,0],X[Y==4,1],s=50,c='yellow',label='Cluster 5')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],s=100, c='cyan',label='Centroids')
plt.title('Customer Groups')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()
Output: a scatter plot of the five customer clusters with their centroids.
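As an optional check on the choice of k, the silhouette score summarizes how well separated the clusters are (values closer to 1 are better); a minimal sketch, assuming X and Y from the steps above:

from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all samples for the 5-cluster solution
print("Silhouette score:", silhouette_score(X, Y))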
