ML Lab Final R22
List of Experiments
The following definitions of mean, median, and mode state in words how each is calculated.
Mean: The mean of a data set is the average value. Is that circular? Let us state it in the form of an algorithm. To find the mean of a data set, first sum up all of the values in the data set, then divide by the number of data points, or examples, in the data set. When we find or know the mean of an entire population, we call that mean a parameter of the population and it is assigned the symbol μ. When the mean is of a sample, it is a statistic of the sample and is assigned the symbol x̄.
Median: The median of a data set is simply the data point that occurs exactly in the middle of the data set.
To find the median first the data needs to be sorted and listed in ascending or descending order. Then if
there is an odd number of data points in the set the median is in the exact middle of this list. If there is an
even number of data points in the list then the median is the average of the two middle values in the list.
Mode: The mode is the value in a data set that is repeated the most often. Below the mean, median, and
mode are calculated for a dummy data set.
Measures of Variability
The following definitions of variance and standard deviation state in words how each is calculated.
Variance: The variance of a data set is found by first finding the deviation of each element in the data set from the mean. These deviations are squared and then added together. (Why are they squared?) Finally, the sum of squared deviations is normalized by the number of elements in the population, N, for a population variance, or by the number of elements in the sample minus one, N-1, for the sample variance. (Why is a sample variance normalized by N-1?)
Standard Deviation: Once the variance is in hand, the standard deviation is easy to find. It is simply the square root of the variance. The symbol for population standard deviation is σ, while the symbol for sample standard deviation is s or sd.
Program:
import numpy as np

x = np.array([56, 31, 56, 8, 32])
print("mean value of array x:", np.mean(x))   # module-level function
print("mean value of array x:", x.mean())     # ndarray method, same result
print("median of x:", np.median(x))

scores = np.array([89, 73, 84, 91, 87, 77, 94, 67])
print("score median:", np.median(scores))

y = np.array([56, 31, 0, 56, 0, 8, 88, 0, 32])
print("indices of non-zero values:", np.nonzero(y))
import numpy as np
import matplotlib.pyplot as plt

y = np.array([650, 450, 275, 350, 387, 575, 555, 649, 361])
print("mean is:", np.mean(y))
print("median is:", np.median(y))
data_range = np.max(y) - np.min(y)   # renamed to avoid shadowing the built-in range
print("range is:", data_range)

plt.bar(y, y, color='b')                               # data values in blue
plt.bar(np.mean(y), np.max(y), color='g', width=5)     # mean marked in green
plt.bar(np.median(y), np.max(y), color='r', width=5)   # median marked in red
plt.bar(data_range, np.max(y), color='m', width=4)     # range marked in magenta
plt.show()
Statistics, in general, is the method of collecting, tabulating, and interpreting numerical data. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. With statistics, we can see how data can be used to solve complex problems.
The measure of central tendency is a single value that attempts to describe the whole set of data. The main measures of central tendency are:
Mean
Median
Median Low
Median High
Mode
Mean
It is the sum of observations divided by the total number of observations. It is also defined as average which
is the sum divided by count.
x̄ = Σx / n
The mean() function returns the mean or average of the data passed in its arguments. If the passed argument
is empty, StatisticsError is raised.
import statistics
li = [1, 2, 3, 3, 2, 2, 2, 1]
print("The average of list values is:")
print(statistics.mean(li))
Output:
The average of list values is : 2
Median
It is the middle value of the data set. It splits the data into two halves. If the number of elements in the data
set is odd then the center element is the median and if it is even then the median would be the average of
two central elements.
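A minimal sketch (the data set below is assumed, chosen so that its median matches the output shown):
import statistics
data1 = [2, 3, 5, 7, 9]   # hypothetical data-set 1; the middle value is 5
print("Median of data-set 1 is", statistics.median(data1))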
output:
Median of data-set 1 is 5
Mode
It is the value that has the highest frequency in the given data set.
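For example, reusing the list from the mean section:
import statistics
li = [1, 2, 3, 3, 2, 2, 2, 1]
print("The mode of list values is:", statistics.mode(li))   # 2 occurs four times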
Measure of Variability
The measure of variability is known as the spread of data or how well our data is distributed. The most
common variability measures are:
Range
Variance
Standard deviation
Range
The difference between the largest and smallest data point in our data set is known as the range. A bigger range indicates more spread in the data, and vice versa.
Range = Largest data value – smallest data value
We can calculate the maximum and minimum values using the max() and min() methods respectively.
arr = [1, 2, 3, 4, 5]
# Finding max
Maximum = max(arr)
# Finding min
Minimum = min(arr)
Range = Maximum - Minimum   # 4
Variance
It is defined as an average squared deviation from the mean. It is calculated by finding the difference
between every data point and the average which is also known as the mean, squaring them, adding all of
them, and then dividing by the number of data points present in our data set.
σ² = Σ(x − μ)² / N
where N = number of terms
μ = mean
from statistics import variance
sample1 = (1, 2, 5, 4, 8, 9, 12)
# statistics.variance() computes the sample variance (divides by N-1)
print("Variance of Sample1 is %s" % (variance(sample1)))
Output:
Variance of Sample1 is 15.80952380952381
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting each data point from the mean and squaring the result, adding all the squared deviations, dividing by the number of terms, and finally taking the square root.
σ = √( Σ(x − μ)² / N )
where N = number of terms
μ = mean
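A minimal sketch, reusing sample1 from the variance example (statistics.stdev() returns the sample standard deviation):
from statistics import stdev
sample1 = (1, 2, 5, 4, 8, 9, 12)
print("The Standard Deviation of Sample1 is %s" % (stdev(sample1)))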
Output:
The Standard Deviation of Sample1 is 3.9761191895520196
math — Mathematical functions
This module provides access to the mathematical functions
math.ceil(x)
Return the ceiling of x, the smallest integer greater than or equal to x.
import math
math.ceil(22.3)
output:
23
math.fabs(x)
Return the absolute value of x
import math
math.fabs(-22.3)
output:
22.3
math.factorial(n)
Return n factorial as an integer. Raises ValueError if n is not integral or is negative.
n=5
math.factorial(n)
output:
120
math.floor(x)
Return the floor of x, the largest integer less than or equal to x
math.floor(22.3)
22
math.fmod(x, y)
Return fmod(x, y), as defined by the platform C library
math.fmod(3, 2)
1.0
math.isnan(x)
Return True if x is a NaN (not a number), and False otherwise.
math.isnan(float('nan'))
output:
True
math.trunc(x)
Return x with the fractional part removed, that is, the integer part of x.
math.trunc(3.5)
output:
3
math.cbrt(x)
Return the cube root of x. (Available in Python 3.11 and later.)
math.exp(x)
Return e raised to the power x, where e = 2.718281… is the base of natural logarithms. This is
usually more accurate than math.e ** x or pow(math.e, x).
math.exp2(x)
Return 2 raised to the power x. (Available in Python 3.11 and later.)
math.pow(x, y)
Return x raised to the power y.
math.sqrt(x)
Return the square root of x.
math.degrees(x)
Convert angle x from radians to degrees.
math.radians(x)
Convert angle x from degrees to radians.
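A few of these in action:
import math
print(math.sqrt(16))           # 4.0
print(math.pow(2, 10))         # 1024.0
print(math.exp(1))             # 2.718281828459045
print(math.degrees(math.pi))   # 180.0
print(math.radians(180))       # 3.141592653589793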
NumPy
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects, and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more. At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types.
Advantages:
● NumPy arrays have a fixed size at creation, unlike Python lists (which can grow
dynamically). Changing the size of an ndarray will create a new array and delete the
original.
● The elements in a NumPy array are all required to be of the same data type, and thus
will be the same size in memory.
● NumPy arrays facilitate advanced mathematical and other types of operations on large amounts of data. Typically, such operations are executed more efficiently and with less code than is possible using Python's built-in sequences.
Numpy basics:
NumPy’s main object is the homogeneous multidimensional array. It is a table of
elements (usually numbers), all of the same type. In NumPy dimensions are called axes.
The array function transforms sequences of sequences into two-dimensional arrays, sequences of sequences of sequences into three-dimensional arrays, and so on.
The type of the array can also be explicitly specified at creation time:
>>> c = np.array([[1, 2], [3, 4]], dtype=float)
>>> print(c)
[[1. 2.]
[3. 4.]]
>>> c.dtype
dtype('float64')
The function zeros creates an array full of zeros, the function ones creates an array full of
ones, and the function empty creates an array whose initial content is random and depends
on the state of the memory. By default, the dtype of the created array is float64, but it can
be specified via the key word argument dtype.
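For example:
import numpy as np
print(np.zeros((2, 3)))                  # 2x3 array of zeros, dtype float64 by default
print(np.ones((2, 3), dtype=np.int16))   # dtype specified explicitly
print(np.empty((2, 3)))                  # uninitialized; contents are arbitrary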
To create sequences of numbers, NumPy provides the arange function which is analogous
to the Python built-in range, but returns an array.
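For example:
import numpy as np
print(np.arange(10, 30, 5))   # [10 15 20 25]
print(np.arange(0, 2, 0.3))   # accepts float steps, unlike range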
Printing arrays
When you print an array, NumPy displays it in a similar way to nested lists, but with the following layout: the last axis is printed from left to right, the second-to-last from top to bottom, and the rest also from top to bottom, with each slice separated from the next by an empty line.
If an array is too large to be printed, NumPy automatically skips the central part of the array
and only prints the corners:
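For example:
import numpy as np
print(np.arange(6))                  # a 1-d array, printed like a list
print(np.arange(12).reshape(4, 3))   # a 2-d array, printed as a matrix
print(np.arange(10000))              # too large: only the corners are printed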
Basic operations
Arithmetic operators on arrays apply elementwise. A new array is created and filled with the
result.
Unlike in many matrix languages, the product operator * operates elementwise in NumPy
arrays. The matrix product can be performed using the @ operator (in python >=3.5) or
the dot function or method:
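For example:
import numpy as np
A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])
print(A * B)      # elementwise product
print(A @ B)      # matrix product
print(A.dot(B))   # matrix product, alternative syntax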
Many unary operations, such as computing the sum of all the elements in the array, are
implemented as methods of the ndarray class.
By default, these operations apply to the array as though it were a list of numbers, regardless
of its shape. However, by specifying the axis parameter you can apply an operation along
the specified axis of an array:
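For example:
import numpy as np
b = np.arange(12).reshape(3, 4)
print(b.sum())         # sum of all elements: 66
print(b.sum(axis=0))   # sum of each column: [12 15 18 21]
print(b.min(axis=1))   # minimum of each row: [0 4 8]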
Indexing, slicing and iterating
One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other
Python sequences.
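For example:
import numpy as np
a = np.arange(10) ** 2   # squares of 0..9
print(a[2])       # single element: 4
print(a[2:5])     # slice: [ 4  9 16]
a[0:6:2] = 1000   # set every second element up to index 6
for i in a:       # iterate like a Python list
    print(i)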
WEEK-3
pandas is a data manipulation package in Python for tabular data. That is, data in the form of rows and
columns, also known as DataFrames.
pandas' functionality ranges from data transformations, like sorting rows and taking subsets, to calculating summary statistics such as the mean, reshaping DataFrames, and joining DataFrames together. pandas works well with other popular Python data science packages.
pandas is used throughout the data analysis workflow. With pandas, you can:
Import datasets from databases, spreadsheets, comma-separated values (CSV) files, and more.
Clean datasets, for example, by dealing with missing values.
Tidy datasets by reshaping their structure into a suitable format for analysis.
Aggregate data by calculating summary statistics such as the mean of columns, correlation between
them, and more.
Visualize datasets and uncover insights.
Install pandas
Installing pandas is straightforward; just run pip install pandas in your terminal.
To begin working with pandas, import the pandas Python package as shown below. When importing
pandas, the most common alias for pandas is pd.
import pandas as pd
df = pd.read_csv(r"D:\ml 2024-25\diabetes.csv")   # raw string so backslashes are not treated as escapes
Reading text files is similar to CSV files. The only nuance is that you need to specify a separator with the sep argument, as shown below. The separator argument refers to the symbol used to separate values within each row of a DataFrame. Comma (sep = ","), whitespace (sep = "\s"), tab (sep = "\t"), and colon (sep = ":") are the commonly used separators. Here \s represents a single whitespace character.
df = pd.read_csv("diabetes.txt", sep="\s")
Reading excel files (both XLS and XLSX) is as easy as the read_excel() function, using the file path as an
input.
df = pd.read_excel('diabetes.xlsx')
Importing JSON file
Similar to the read_csv() function, you can use read_json() for JSON file types with the JSON file name as
the argument (for more detail read this tutorial on importing JSON and HTML data into pandas). The
below code reads a JSON file from disk and creates a DataFrame object df.
df = pd.read_json("diabetes.json")
Just as pandas can import data from various file types, it also allows you to export data into various formats.
This happens especially when data is transformed using pandas and needs to be saved locally on your
machine. Below is how to output pandas DataFrames into various formats.
A pandas DataFrame (here we are using df) is saved as a CSV file using the .to_csv() method. The
arguments include the filename with path and index – where index = True implies writing the DataFrame’s
index.
df.to_csv("diabetes_out.csv", index=False)
Export DataFrame object into a JSON file by calling the .to_json() method.
df.to_json("diabetes_out.json")
Note: A JSON file stores a tabular object like a DataFrame as a key-value pair. Thus you would observe
repeating column headers in a JSON file.
As with writing DataFrames to CSV files, you can call .to_csv(). The only differences are that the output
file format is in .txt, and you need to specify a separator using the sep argument.
Call .to_excel() from the DataFrame object to save it as a “.xls” or “.xlsx” file.
df.to_excel("diabetes_out.xlsx", index=False)
You can view the first few or last few rows of a DataFrame using the .head() or .tail() methods, respectively.
You can specify the number of rows through the n argument (the default value is 5).
df.head()
df.tail(n = 10)
The .describe() method prints the summary statistics of all numeric columns, such as count, mean, standard
deviation, range, and quartiles of numeric columns.
df.describe()
df.describe().T
The .info() method is a quick way to look at the data types, missing values, and data size of a DataFrame. Here, we're setting the show_counts argument to True, which gives a count of the non-missing values in each column. We're also setting memory_usage to True, which shows the total memory usage of the DataFrame elements. When verbose is set to True, it prints the full summary from .info().
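The call described above:
df.info(show_counts=True, memory_usage=True, verbose=True)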
The number of rows and columns of a DataFrame can be identified using the .shape attribute of the DataFrame. It returns a tuple (rows, columns) and can be indexed to get only the row count or only the column count.
df.shape      # (768, 9)
df.shape[0]   # 768
df.shape[1]   # 9
Get all columns and column names
Calling the .columns attribute of a DataFrame object returns the column names in the form of
an Index object. As a reminder, a pandas index is the address/label of the row or column.
df.columns
list(df.columns)
df2 = df.copy()
df2.loc[2:5,'Pregnancies'] = None
df2.head(7)
We can check whether each element in a DataFrame is missing using the .isnull() method.
df2.isnull().head(7)
df2.isnull().sum()
Pregnancies 4
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
The pandas package offers several ways to subset, filter, and isolate data in your DataFrames. Here, we'll
see the most common ways.
Isolating one column using [ ]
You can isolate a single column using a square bracket [ ] with a column name in it. The output is a
pandas Series object. A pandas Series is a one-dimensional array containing data of any type, including
integer, float, string, boolean, python objects, etc. A DataFrame is comprised of many series that act as
columns.
df['Outcome']
You can also provide a list of column names inside the square brackets to fetch more than one column.
Here, square brackets are used in two different ways. We use the outer square brackets to indicate a subset
of a DataFrame, and the inner square brackets to create a list.
df[['Pregnancies', 'Outcome']]
A single row can be fetched by passing in a boolean series with one True value. In the example below, the
second row with index = 1 is returned. Here, .index returns the row labels of the DataFrame, and the
comparison turns that into a Boolean one-dimensional array.
df[df.index==1]
Similarly, two or more rows can be returned using the .isin() method instead of a == operator.
df[df.index.isin(range(2,10))]
You can fetch specific rows by labels or conditions using .loc[] and .iloc[] ("location" and "integer
location"). .loc[] uses a label to point to a row, column or cell, whereas .iloc[] uses the numeric position.
To understand the difference between the two, let’s modify the index of df2 created earlier.
df2.index = range(1,769)
The below example returns a pandas Series instead of a DataFrame. The 1 in .loc[] is a row label, whereas the 1 in .iloc[] is the row position, so .iloc[1] returns the second row (positions start at 0).
df2.loc[1]
Pregnancies 6.000
Glucose 148.000
BloodPressure 72.000
SkinThickness 35.000
Insulin 0.000
BMI 33.600
DiabetesPedigreeFunction 0.627
Age 50.000
Outcome 1.000
Name: 1, dtype: float64
df2.iloc[1]
Pregnancies 1.000
Glucose 85.000
BloodPressure 66.000
SkinThickness 29.000
Insulin 0.000
BMI 26.600
DiabetesPedigreeFunction 0.351
Age 31.000
Outcome 0.000
Name: 2, dtype: float64
You can also fetch multiple rows by providing a range in square brackets. Note that .loc[100:110] is label-based and includes both endpoints, whereas .iloc[100:110] is position-based and excludes position 110.
df2.loc[100:110]
df2.iloc[100:110]
You can also select specific columns along with rows. This is where .iloc[] is different from .loc[] – it
requires column location and not column labels.
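A short sketch (the column choices here are illustrative):
df2.iloc[100:105, :3]   # rows by position, first three columns by position
df2.loc[100:105, ['Pregnancies', 'Glucose', 'BloodPressure']]   # rows and columns by label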
pandas lets you filter data by conditions over row/column values. For example, the below code selects the
row where Blood Pressure is exactly 122. Here, we are isolating rows using the brackets [ ] as seen in
previous sections. However, instead of inputting row indices or column names, we are inputting a condition
where the column BloodPressure is equal to 122. We denote this condition using df.BloodPressure == 122.
df[df.BloodPressure == 122]
The below example fetches all rows where Outcome is 1. Here df.Outcome selects that column, df.Outcome == 1 returns a Series of Boolean values determining which Outcomes are equal to 1, and [] takes the subset of df where that Boolean Series is True.
df[df.Outcome == 1]
You can use a > operator to draw comparisons. The below code fetches Pregnancies, Glucose,
and BloodPressure for all records with BloodPressure greater than 100.
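A minimal sketch of that query:
df.loc[df['BloodPressure'] > 100, ['Pregnancies', 'Glucose', 'BloodPressure']]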
Data cleaning is one of the most common tasks in data science. pandas lets you preprocess data for any use,
including but not limited to training machine learning and deep learning models. Let’s use the
DataFrame df2 from earlier, having four missing values, to illustrate a few data cleaning use cases. As a
reminder, here's how you can see how many missing values are in a DataFrame.
df2.isnull().sum()
Pregnancies 4
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
The axis argument lets you specify whether you are dropping rows, or columns, with missing values. The
default axis removes the rows containing NaNs. Use axis = 1 to remove the columns with one or more NaN
values. Also, notice how we are using the argument inplace=True which lets you skip saving the output
of .dropna() into a new DataFrame.
df3 = df2.copy()
df3.dropna(inplace=True, axis=1)
df3.head()
You can also set the how argument to 'all' to drop a row (or column) only when all of its values are missing.
df3 = df2.copy()
df3.dropna(inplace=True, how='all')
Instead of dropping, replacing missing values with a summary statistic or a specific value (depending on the use case) may be the best way to go. For example, if one value is missing from a temperature column covering the days of a week, replacing it with the average temperature of that week may be more effective than dropping it completely. You can replace the missing data with the row or column mean using the code below.
df3 = df2.copy()
# Get the mean of Pregnancies
mean_value = df3['Pregnancies'].mean()
# Fill missing values using .fillna()
df3 = df3.fillna(mean_value)
You can remove all duplicate rows (default) from the DataFrame using .drop_duplicates() method.
df3 = df3.drop_duplicates()
df3.shape
(768, 9)
Renaming columns
A common data cleaning task is renaming columns. With the .rename() method, you can use the columns argument with a dictionary mapping old column names to new ones, as shown below.
df3 = df3.rename(columns={'DiabetesPedigreeFunction': 'DPF'})
df3.head()
The main value proposition of pandas lies in its quick data analysis functionality. In this section, we'll focus
on a set of analysis techniques you can use in pandas.
Summary operators (mean, mode, median)
As you saw earlier, you can get the mean of each column value using the .mean() method.
df.mean()
Similarly, the median of each column is computed with the .median() method
df.median()
pandas provides fast and efficient computation by combining two or more columns like scalar variables.
The below code divides each value in the column Glucose with the corresponding value in
the Insulin column to compute a new column named Glucose_Insulin_Ratio.
df2['Glucose_Insulin_Ratio'] = df2['Glucose']/df2['Insulin']
df2.head()
Oftentimes you'll work with categorical values, and you'll want to count the number of observations each category has in a column. Category values can be counted using the .value_counts() method. Here, for example, we are counting the number of observations where Outcome is diabetic (1) and the number of observations where the Outcome is non-diabetic (0).
df['Outcome'].value_counts()
You can also apply .value_counts() to a DataFrame object and specific columns within it instead of just a
column. Here, for example, we are applying value_counts() on df with the subset argument, which takes in
a list of columns.
df.value_counts(subset=['Pregnancies', 'Outcome'])
.groupby() enables grouping by more than one column by passing a list of column names, as shown below.
df.groupby(['Pregnancies', 'Outcome']).mean()
Bar Graphs:
import matplotlib.pyplot as plt
%matplotlib inline

x = [1, 2, 3, 4, 5]
y = [5, 10, 15, 20, 25]
plt.bar(x, y)
plt.xlabel('sales')
plt.ylabel('year')
plt.title("SALES")
plt.show()

# Two bar series with a legend; x1/y1 are assumed sample values,
# offset slightly so the bars do not overlap
x1 = [1.4, 2.4, 3.4, 4.4, 5.4]
y1 = [4, 8, 12, 16, 20]
plt.figure(figsize=(10, 5))
plt.bar(x, y, color='r', label="first")
plt.bar(x1, y1, color='b', label="second")
plt.xlabel('sales')
plt.ylabel('year')
plt.title("SALES")
plt.legend()
plt.show()
Histogram:
import numpy as np
import matplotlib.pyplot as plt

male = np.arange(10, 20, 2)     # [10 12 14 16 18]
female = np.arange(12, 22, 2)   # [12 14 16 18 20]
plt.hist([male, female], rwidth=0.4, bins=4)
plt.show()
plt.hist([male, female], rwidth=0.4, bins=4, orientation="horizontal")
plt.show()
scatter plot:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7]
y = [12, 14, 16, 18, 19, 21, 23]
plt.scatter(x, y)
plt.show()

plt.scatter(x, y, marker='+', color='r')
plt.show()

# Two data sets on the same axes
x1 = [1, 2, 3, 4, 5, 6, 7]
y1 = [12, 14, 16, 18, 19, 21, 23]
x2 = [1, 2, 3, 4, 5, 6, 7]
y2 = [13, 15, 17, 18, 22, 24, 25]
plt.scatter(x1, y1, color='g')
plt.scatter(x2, y2, color='b')
plt.show()
EXERCISE
Use the following CSV file for this exercise. Read this file using pandas, NumPy, or matplotlib's built-in functions.
Exercise 1: Read Total profit of all months and show it using a line plot
Total profit data is provided for each month. The generated line plot must include the following properties:
X label name = Month Number
Y label name = Total profit
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
profitList = df['total_profit'].tolist()
monthList = df['month_number'].tolist()
plt.plot(monthList, profitList, label='Month-wise Profit data of last year')
plt.xlabel('Month number')
plt.ylabel('Profit in dollar')
plt.xticks(monthList)
plt.title('Company profit per month')
plt.yticks([100000, 200000, 300000, 400000, 500000])
plt.show()
Exercise 2: Get total profit of all months and show line plot with the following Style properties
The generated line plot must include the style properties shown in the code below (the specific style values are assumptions: a red dashed line, circle markers, line width 3).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
profitList = df['total_profit'].tolist()
monthList = df['month_number'].tolist()
# style properties (assumed): red dashed line, circle markers, line width 3
plt.plot(monthList, profitList, label='Profit data of last year',
         color='r', marker='o', linestyle='--', linewidth=3)
plt.xlabel('Month Number')
plt.ylabel('Profit in dollar')
plt.legend(loc='lower right')
plt.title('Company Sales data of last year')
plt.xticks(monthList)
plt.yticks([100000, 200000, 300000, 400000, 500000])
plt.show()
Exercise 3: Read all product sales data and show it using a multiline plot
Display the number of units sold per month for each product using multiline plots. (i.e., Separate Plotline
for each product ).
The graph should look like this.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
faceCremSalesData = df['facecream'].tolist()
faceWashSalesData = df['facewash'].tolist()
toothPasteSalesData = df['toothpaste'].tolist()
bathingsoapSalesData = df['bathingsoap'].tolist()
shampooSalesData = df['shampoo'].tolist()
moisturizerSalesData = df['moisturizer'].tolist()
# one plot line per product
plt.plot(monthList, faceCremSalesData, label='Face cream Sales Data', marker='o', linewidth=3)
plt.plot(monthList, faceWashSalesData, label='Face Wash Sales Data', marker='o', linewidth=3)
plt.plot(monthList, toothPasteSalesData, label='ToothPaste Sales Data', marker='o', linewidth=3)
plt.plot(monthList, bathingsoapSalesData, label='Bathing soap Sales Data', marker='o', linewidth=3)
plt.plot(monthList, shampooSalesData, label='Shampoo Sales Data', marker='o', linewidth=3)
plt.plot(monthList, moisturizerSalesData, label='Moisturizer Sales Data', marker='o', linewidth=3)
plt.xlabel('Month Number')
plt.ylabel('Sales units in number')
plt.legend(loc='upper left')
plt.xticks(monthList)
plt.yticks([1000, 2000, 4000, 6000, 8000, 10000, 12000, 15000, 18000])
plt.title('Sales data')
plt.show()
Exercise 4: Read toothpaste sales data of each month and show it using a scatter plot
Also, add a grid to the plot. The gridline style should be "--".
The scatter plot should look like this.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
toothPasteSalesData = df['toothpaste'].tolist()
plt.scatter(monthList, toothPasteSalesData, label='Tooth paste Sales data')
plt.xlabel('Month Number')
plt.ylabel('Number of units Sold')
plt.legend(loc='upper left')
plt.title('Tooth paste Sales data')
plt.xticks(monthList)
plt.grid(True, linewidth=1, linestyle="--")
plt.show()
Exercise 5: Read face cream and facewash product sales data and show it using the bar chart
The bar chart should display the number of units sold per month for each product. Add a separate
bar for each product in the same chart.
The bar chart should look like this.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
faceCremSalesData = df['facecream'].tolist()
faceWashSalesData = df['facewash'].tolist()
# shift each product's bars by 0.25 so the two series sit side by side
plt.bar([a - 0.25 for a in monthList], faceCremSalesData, width=0.25,
        label='Face Cream sales data', align='edge')
plt.bar([a + 0.25 for a in monthList], faceWashSalesData, width=-0.25,
        label='Face Wash sales data', align='edge')
plt.xlabel('Month Number')
plt.ylabel('Sales units in number')
plt.legend(loc='upper left')
plt.xticks(monthList)
plt.grid(True, linewidth=1, linestyle="--")
plt.title('Facewash and facecream sales data')
plt.show()
Exercise 6: Read sales data of bathing soap of all months and show it using a bar chart. Save this plot
to your hard disk
The bar chart should look like this.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
bathingsoapSalesData = df['bathingsoap'].tolist()
plt.bar(monthList, bathingsoapSalesData)
plt.xlabel('Month Number')
plt.ylabel('Sales units in number')
plt.xticks(monthList)
plt.grid(True, linewidth=1, linestyle="--")
plt.title('bathingsoap sales data')
# save the figure to disk before showing it
plt.savefig('D:\\Python\\Articles\\matplotlib\\sales_data_of_bathingsoap.png', dpi=150)
plt.show()
Exercise 7: Read the total profit of each month and show it using the histogram to see the most
common profit ranges
The histogram should look like this.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
profitList = df['total_profit'].tolist()
labels = ['low', 'average', 'Good', 'Best']   # descriptive range names (not passed to plt.hist)
profit_range = [150000, 175000, 200000, 225000, 250000, 300000, 350000]
plt.hist(profitList, profit_range, label='Profit data')
plt.xlabel('profit range in dollar')
plt.ylabel('Actual Profit in dollar')
plt.legend(loc='upper left')
plt.xticks(profit_range)
plt.title('Profit data')
plt.show()
Exercise 8: Calculate total sale data for last year for each product and show it using a Pie chart
Note: In the pie chart, display the number of units sold per year for each product as a percentage.
The Pie chart should look like this.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
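The code above only loads the data; a sketch of the pie chart itself follows (the label names and legend position are assumptions):
# total units sold per product over the year (column sums)
labels = ['FaceCream', 'FaceWash', 'ToothPaste', 'Bathing soap', 'Shampoo', 'Moisturizer']
salesData = [df['facecream'].sum(), df['facewash'].sum(), df['toothpaste'].sum(),
             df['bathingsoap'].sum(), df['shampoo'].sum(), df['moisturizer'].sum()]
plt.axis("equal")   # keep the pie circular
plt.pie(salesData, labels=labels, autopct='%1.1f%%')   # show each product's share in percent
plt.legend(loc='lower right')
plt.title('Sales data')
plt.show()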
Exercise 9: Read bathing soap and facewash sales data of all months and show them using subplots
A sketch using two stacked subplots sharing the x-axis (the layout is an assumption):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
bathingsoap = df['bathingsoap'].tolist()
faceWashSalesData = df['facewash'].tolist()
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(monthList, bathingsoap, color='k', marker='o')
axarr[0].set_title('Sales data of bathing soap')
axarr[1].plot(monthList, faceWashSalesData, color='r', marker='o')
axarr[1].set_title('Sales data of face wash')
plt.xticks(monthList)
plt.xlabel('Month Number')
plt.ylabel('Sales units in number')
plt.show()
Exercise 10: Read all product sales data and show it using the stack plot
The Stack plot should look like this.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("D:\\Python\\Articles\\matplotlib\\sales_data.csv")
monthList = df['month_number'].tolist()
faceCremSalesData = df['facecream'].tolist()
faceWashSalesData = df['facewash'].tolist()
toothPasteSalesData = df['toothpaste'].tolist()
bathingsoapSalesData = df['bathingsoap'].tolist()
shampooSalesData = df['shampoo'].tolist()
moisturizerSalesData = df['moisturizer'].tolist()
plt.stackplot(monthList, faceCremSalesData, faceWashSalesData, toothPasteSalesData,
              bathingsoapSalesData, shampooSalesData, moisturizerSalesData,
              labels=['Face Cream', 'Face wash', 'ToothPaste', 'Bathing soap', 'Shampoo', 'Moisturizer'])
plt.xlabel('Month Number')
plt.ylabel('Sales units in Number')
plt.title('All product sales data using stack plot')
plt.legend(loc='upper left')
plt.show()
WEEK-4
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df=pd.read_csv( "/house_prices.csv")
df
area price
0 2600 550000
1 3000 565000
2 3200 610000
3 3600 680000
4 4000 725000
%matplotlib inline
plt.xlabel('area(sqr ft)')
plt.ylabel('prices')
plt.scatter(df.area,df.price,color='red',marker='+')
new_df=df.drop('price',axis='columns')
new_df
area
0 2600
1 3000
2 3200
3 3600
4 4000
df
area price
0 2600 550000
1 3000 565000
2 3200 610000
3 3600 680000
4 4000 725000
price=df.price
price
0 550000
1 565000
2 610000
3 680000
4 725000
Name: price, dtype: int64
reg=linear_model.LinearRegression()
reg.fit(new_df,price)
reg.predict([[3300]])
array([628715.75342466])
reg.coef_
array([135.78767123])
reg.intercept_
180616.43835616432
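As a sanity check, the prediction equals coefficient * area + intercept:
# y = m*x + b with the fitted slope and intercept
135.78767123 * 3300 + 180616.43835616432   # ≈ 628715.75342466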
WEEK-6
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
Loading Data
Load the required Pima Indians Diabetes dataset using pandas' read_csv function. You can download the dataset from Kaggle.
# load dataset
# assumed column names, matching the diabetes data used in earlier weeks
col_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
             'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)
pima.head()
# Separate features (X) and target (y)
X = pima.drop('Outcome', axis=1)
y = pima['Outcome']
# If there are object columns, convert them to numerical using one-hot encoding or other methods
object_cols = X.select_dtypes(include='object').columns.tolist()
if len(object_cols) > 0:
    # Example using one-hot encoding with pandas get_dummies
    X = pd.get_dummies(X, columns=object_cols)
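A minimal training sketch (the 70/30 split and random_state are assumptions) that produces the baseline accuracy printed below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier()        # default parameters
clf = clf.fit(X_train, y_train)       # train on the training set
y_pred = clf.predict(X_test)          # predict on the test set
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))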
Accuracy: 0.6363636363636364
Next, perform hyperparameter tuning using GridSearchCV to improve performance.
from sklearn.model_selection import GridSearchCV
param_grid = {
'max_depth': [None, 5, 10, 15, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'criterion': ['gini', 'entropy'],
'max_features': [None, 'sqrt', 'log2']
}
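Running the search (cv=5 and the scoring choice are assumptions):
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                           cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Best parameters found:", grid_search.best_params_)
best_dt = grid_search.best_estimator_   # the tuned model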
Best parameters found: {'criterion': 'entropy', 'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 4,
'min_samples_split': 10}
1. Loading Data:
o The data is loaded from a CSV file, and the features (X) and target (y) are separated.
2. Train with Default Parameters:
o A basic DecisionTreeClassifier model is created with default settings.
o The model is trained on X_train and y_train, and accuracy is evaluated on the X_test set to
get a baseline score (accuracy_default).
3. Hyperparameter Tuning:
o A param_grid dictionary is defined to explore different values for key parameters:
max_depth: Controls the depth of the tree.
min_samples_split: Minimum number of samples required to split a node.
min_samples_leaf: Minimum number of samples required at a leaf node.
criterion: Splitting criterion, either "gini" or "entropy."
max_features: Maximum number of features considered for each split.
o GridSearchCV performs cross-validation to find the best parameters from this grid.
o The best parameters and model are printed and stored as best_dt.
4. Train and Evaluate the Tuned Model:
o The best_dt model (with optimal hyperparameters) is evaluated on X_test.
o The resulting accuracy (accuracy_tuned) should ideally be an improvement over the initial
default accuracy.
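A sketch of that final evaluation:
y_pred_tuned = best_dt.predict(X_test)
accuracy_tuned = metrics.accuracy_score(y_test, y_pred_tuned)
print("Tuned accuracy:", accuracy_tuned)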
WEEK-7
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
type(iris)
iris.data
print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
print(iris.target_names)
['setosa' 'versicolor' 'virginica']
plt.scatter(iris.data[:,0],iris.data[:,3],c=iris.target, cmap=plt.cm.Paired)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[3])
plt.show()
p = iris.data
q = iris.target
print(p.shape)
print(q.shape)
(150, 4)
(150,)
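The training and test sets below come from a split like this (test_size=0.2 matches the 120/30 shapes printed next; random_state is an assumption):
from sklearn.model_selection import train_test_split
p_train, p_test, q_train, q_test = train_test_split(p, q, test_size=0.2, random_state=4)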
print(p_train.shape)
print(p_test.shape)
(120, 4)
(30, 4)
print(q_train.shape)
print(q_test.shape)
(120,)
(30,)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

k_range = range(1, 25)
scores = {}
scores_list = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(p_train, q_train)
    q_pred = knn.predict(p_test)
    scores[k] = metrics.accuracy_score(q_test, q_pred)
    scores_list.append(scores[k])
scores
{1: 0.9333333333333333,
2: 0.9333333333333333,
3: 0.9666666666666667,
4: 0.9666666666666667,
5: 0.9666666666666667,
6: 0.9666666666666667,
7: 0.9666666666666667,
8: 0.9666666666666667,
9: 0.9666666666666667,
10: 0.9666666666666667,
11: 0.9666666666666667,
12: 0.9666666666666667,
13: 0.9666666666666667,
14: 0.9666666666666667,
15: 0.9666666666666667,
16: 0.9666666666666667,
17: 0.9666666666666667,
18: 0.9666666666666667,
19: 0.9666666666666667,
20: 0.9333333333333333,
21: 0.9666666666666667,
22: 0.9333333333333333,
23: 0.9666666666666667,
24: 0.9666666666666667}
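The predictions below assume a classifier trained on the full data and two hypothetical new flowers (measurements typical of setosa and virginica):
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(p, q)   # train on all 150 samples
classes = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
x_new = [[5.0, 3.5, 1.4, 0.2], [6.5, 3.0, 5.5, 2.0]]   # hypothetical new samples
y_predict = knn.predict(x_new)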
print(classes[y_predict[0]])
print(classes[y_predict[1]])
setosa
virginica
WEEK-8
Diabetes Prediction
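Assuming the same diabetes.csv used in earlier weeks:
import pandas as pd
dataframe = pd.read_csv("diabetes.csv")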
dataframe.columns
Output:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
Datatype of each feature:
dataframe.dtypes
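The classifier below expects a train/test split whose training set has been resampled to balance the classes; a minimal sketch (SMOTE from the imbalanced-learn package is an assumed choice):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE   # assumed resampling method

X = dataframe.drop('Outcome', axis=1)   # features
y = dataframe['Outcome']                # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_train_resampled, y_train_resampled = SMOTE(random_state=1).fit_resample(X_train, y_train)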
classification = LogisticRegression()
classification.fit(X_train_resampled, y_train_resampled)
y_predictions = classification.predict(X_test)
print(y_predictions)
Output:
[0 0 0 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0
0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0
1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0
0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 1
0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0]
WEEK-9
import pandas as pd
import matplotlib.pyplot as plt
customer_data = pd.read_csv('Mall_Customers.csv')
Data Analysis:
customer_data.head()
Output:
customer_data.shape
Output:
(250, 5)
customer_data.info()
Output:
customer_data.isnull().sum()
Output:
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
Choosing the Annual income column & Spending score Column
X = customer_data.iloc[:,[3,4]].values
print(X)
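The cluster labels Y and the kmeans model used in the plot below come from a step like this (n_clusters=5 is an assumption, matching the five clusters plotted):
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=0)
Y = kmeans.fit_predict(X)   # cluster label (0-4) for each customer
print(Y)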
plt.figure(figsize=(8,8))
plt.scatter(X[Y==0,0],X[Y==0,1],s=50,c='green',label='Cluster 1')
plt.scatter(X[Y==1,0],X[Y==1,1],s=50,c='red',label='Cluster 2')
plt.scatter(X[Y==2,0],X[Y==2,1],s=50,c='blue',label='Cluster 3')
plt.scatter(X[Y==3,0],X[Y==3,1],s=50,c='violet',label='Cluster 4')
plt.scatter(X[Y==4,0],X[Y==4,1],s=50,c='yellow',label='Cluster 5')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],s=100, c='cyan',label='Centroids')
plt.title('Customer Groups')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()
Output: