
230280116071 Kashis Makwana

A
Laboratory Manual for

Data Science
(3151608)

B.E. Semester 5
(Information Technology Department)

L.D. College of Engineering,
Ahmedabad

Directorate of Technical Education,
Gandhinagar, Gujarat

Certificate

This is to certify that Mr./Ms. Makwana Kashis Arvindbhai


Enrollment No. 230280116071 of B.E. Semester 5, Department of Information Technology of this Institute (GTU Code: ) has
satisfactorily completed the Practical / Tutorial work for the subject Data
Science (3151608) for the academic year 2025-26.

Place:L.D. College Of Engineering


Date:

Name and Sign of Faculty member

Head of the Department


Preface

The main motive of any laboratory/practical/field work is to enhance the required skills as well as to create the ability among students to solve real-time problems by developing relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for engineering degree programmes in which sufficient weightage is given to practical work. This reflects the importance of enhancing skills among students and of utilising every second of the time allotted for practicals, so that students, instructors and faculty members achieve the relevant outcomes by performing experiments rather than by merely studying them. For effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical is carefully designed to serve as a tool to develop and enhance in every student the relevant competencies required by industry. These psychomotor skills are very difficult to develop through the traditional chalk-and-board content delivery method in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than on the old practice of conducting practicals merely to prove a concept or theory.

Data Science is a rapidly growing field that combines statistical and computational techniques to
extract knowledge and insights from data. The goal of this lab manual is to provide students with
hands-on experience in using data science tools and techniques to analyze and interpret real-world
data.

This manual is designed to accompany a course in Data Science and assumes a basic knowledge
of programming concepts and statistical analysis. The labs are structured to guide students
through the process of collecting, cleaning, analyzing, and visualizing data, using popular
programming languages and software tools such as Python, R, SQL, and Tableau.

Each lab in this manual consists of a set of instructions that guide students through a specific data
analysis project. The labs are organized in a progressive sequence, with each lab building on the
skills and concepts covered in the previous lab. The exercises within each lab are designed to be
completed in a single class session, with additional time required for preparation and follow-up
analysis.

Throughout the manual, we emphasize the importance of critical thinking and data ethics,
providing guidance on how to analyze data responsibly and communicate findings effectively. By
the end of this manual, students will have gained a solid foundation in data science and be well-
equipped to apply these skills to real-world problems.


Practical – Course Outcome matrix

Course Outcomes (COs):


After successful completion of this course, the students should be able to:
CO-1 Describe the various areas where data science is applied.
CO-2 Identify the data types and the relation between data and visualization techniques for data.
CO-3 Explain probability, distributions, sampling and estimation.
CO-4 Solve regression and classification problems.

Sr. No.  Objective(s) of Experiment (mapped Course Outcomes)

1. Exploration and Visualization Using Mathematical and Statistical Tools (CO-1, CO-2)
2. Study of Measures of Central Tendency, Correlation, Percentile, Decile, Quartile, Measure of Variation, and Measure of Shape (Skewness and Kurtosis) with Excel Functions (CO-1, CO-2)
3. Study of Basics of Python data types, NumPy, Matplotlib, Pandas (CO-1, CO-2)
4. Implementation of Various Probability Distributions with NumPy Random Library Functions (CO-3)
5. Implementation of Estimation of Parameters for the Best-Fit Probability Distribution using the Fitter Class in Python (CO-3)
6. Implementation of Linear Regression with Scikit-learn library in Python (CO-4)
7. Implementation of Logistic Regression with Scikit-learn library in Python (CO-4)
8. Implementation of Decision Tree for Student Classification (CO-4)

Guidelines for Faculty members

1. Course Coordinator / Faculty should provide the guideline with demonstration of practical
to the students with all features.
2. Course Coordinator / Faculty shall explain the basic concepts/theory related to the experiment to the students before the start of each practical.
3. Involve all the students in performance of each experiment.
4. Course Coordinator / Faculty is expected to share the skills and competencies to be
developed in the students and ensure that the respective skills and competencies are
developed in the students after the completion of the experimentation.
5. Course Coordinator / Faculty should give opportunity to students for hands-on experience
after the demonstration.
6. Course Coordinator / Faculty may provide additional knowledge and skills to the students
even though not covered in the manual but are expected from the students by concerned
industry.


7. Give practical assignment and assess the performance of students based on task assigned to
check whether it is as per the instructions or not.
8. Course Coordinator / Faculty is expected to refer complete curriculum of the course and
follow the guidelines for implementation.

Instructions for Students

1. Students are expected to carefully listen to all the theory classes delivered by the faculty
members and understand the COs, content of the course, teaching and examination scheme,
skill set to be developed etc.
2. Students will have to perform experiments as per practical list given.
3. Students have to show output of each program in their practical file.
4. Students are instructed to submit practical list as per given sample list shown on next page.
5. Student should develop a habit of submitting the experimentation work as per the schedule
and s/he should be well prepared for the same.

Common Safety Instructions

Students are expected to


1) Switch on the PC carefully (do not use it with wet hands)
2) Shutdown the PC properly at the end of your Lab
3) Carefully handle the peripherals (Mouse, Keyboard, Network cable etc).
4) Use Laptop in lab after getting permission from Course Coordinator / Faculty

Rubrics for Practical Assessment (For Total marks 10)


Criteria No: C1
Parameters: Understanding of Problem (2 Marks)
Strong: Excellent understanding of problem and relevance with the theory clearly understood. (2)
Average: Moderate level understanding of problem and relevance with the theory clearly understood. (1)
Poor: Problem not understood and can't establish the relation with the theory. (0)

Criteria No: C2
Parameters: Analysis of the Problem (2 Marks)
Strong: Good ability to identify strategies for solving problems (brainstorming, exploration of various solutions, trial and error). (2)
Average: Moderate ability to identify strategies for solving problems (by guidance of faculty, exploration of limited solutions, trial and error). (1)
Poor: Poor ability to identify strategies for solving problems (require special attention from faculty, trial and error). (0)

Criteria No: C3
Parameters: Capability of writing program (5 Marks)
Strong: Efficient implementation with proper naming convention and understanding. (5)
Average: Moderate level of implementation. Poor naming convention. (4-3)
Poor: Partial implementation with poor understanding. (2-0)

Criteria No: C4
Parameters: Documentation (1 Mark)
Strong: Unique documentation (not copied from other sources) of given problem with proper formatting and language. (1)
Average: Ordinary documentation of given problem with proper formatting and language. (0.5)
Poor: Weak documentation of given problem without proper formatting and language. (0)


Index
(Progressive Assessment Sheet)

Columns: Sr. No. | Objective(s) of Experiment | Page No. | Date of performance | Date of submission | Assessment Marks | Sign. of Faculty with date

1. Exploration and Visualization Using Mathematical and Statistical Tools (10 Marks)
2. Study of Measures of Central Tendency, Correlation, Percentile, Decile, Quartile, Measure of Variation, and Measure of Shape (Skewness and Kurtosis) with Excel Functions (10 Marks)
3. Study of Basics of Python data types, NumPy, Matplotlib, Pandas (10 Marks)
4. Implementation of Various Probability Distributions with NumPy Random Library Functions (10 Marks)
5. Implementation of Estimation of Parameters for the Best-Fit Probability Distribution using the Fitter Class in Python (10 Marks)
6. Implementation of Linear Regression with Scikit-learn library in Python (20 Marks)
7. Implementation of Logistic Regression with Scikit-learn library in Python (20 Marks)
8. Implementation of Decision Tree for Student Classification (10 Marks)

Total: 100 Marks


Experiment No: 1

Date:

AIM: Data Exploration and Visualization Using Mathematical and Statistical Tools

Introduction:

Data exploration and visualization are important steps in the data analysis process. In this lab,
students will learn how to explore and visualize data using mathematical and statistical tools such
as histograms, box plots, scatter plots, and correlation matrices. Students will also learn how to use
Excel/R to perform these analyses.

Relevant CO: CO1, CO2

Objectives:

1. To understand the importance of data exploration and visualization in data analysis.


2. To learn how to use Excel and/or R to create histograms, box plots, scatter plots, and correlation
matrices.
3. To interpret the results of these analyses and draw conclusions from them.
4. To present the results of these analyses in a clear and effective manner.

Materials:

- A computer with Microsoft Excel and R installed
- Sample dataset provided by subject faculty or shown below.

Procedure:

1. Open Microsoft Excel and create a new workbook.


2. Input the sample dataset provided by your faculty into a worksheet in the workbook.
3. Use Excel to create a histogram for each column in the dataset.
4. Use Excel to create a box plot for each column in the dataset.
5. Use R to create a scatter plot for two variables in the dataset.
6. Use R to create a correlation matrix for all variables in the dataset.
7. Interpret the results of these analyses and draw conclusions from them.
8. Present the results of these analyses in a clear and effective manner using charts and graphs.
9. Submit the completed workbook and presentation to the faculty for grading.

Age  Gender  Income  Education Level  Marital Status  Employment Status  Industry
32 Female 45000 Bachelor's Single Employed Technology
45 Male 65000 Master's Married Employed Finance
28 Female 35000 High School Single Unemployed None
52 Male 80000 Doctorate Married Employed Education
36 Female 55000 Bachelor's Divorced Employed Healthcare


40 Male 70000 Bachelor's Married Self-Employed Consulting


29 Female 40000 Associate's Single Employed Retail
55 Male 90000 Master's Married Employed Engineering
33 Female 47000 Bachelor's Single Employed Government
47 Male 75000 Bachelor's Married Self-Employed Entertainment
41 Female 60000 Master's Single Employed Nonprofit
38 Male 52000 High School Divorced Employed Construction
31 Female 48000 Bachelor's Married Employed Technology
49 Male 85000 Doctorate Married Employed Finance
27 Female 30000 High School Single Unemployed None
54 Male 92000 Master's Married Employed Education
39 Female 58000 Bachelor's Married Self-Employed Consulting
30 Male 42000 Associate's Single Employed Retail
56 Female 96000 Doctorate Married Employed Healthcare
35 Male 55000 Bachelor's Single Employed Government
48 Female 73000 Bachelor's Married Self-Employed Entertainment
42 Male 65000 Master's Divorced Employed Nonprofit
37 Female 50000 High School Married Employed Construction
34 Male 49000 Bachelor's Single Unemployed None
51 Female 82000 Master's Married Employed Engineering
Example:

This dataset includes information on age, gender, income, education level, marital status, employment status and industry for a sample of 25 individuals. This data could be used to explore and visualize various relationships and patterns, such as the relationship between age and income, or the distribution of income by education level. A few more relationships and patterns that could be explored and visualized using this sample dataset are listed below:

1. Relationship between age and income: Create a scatter plot to see if there is a relationship
between age and income. Also calculate the correlation coefficient to determine the strength and
direction of the relationship.

Fig 1.1 Relation between Age and Income

By looking at the graph we can see that there is a strong relationship between Age and Income, as the points lie close together and increase in order. Thus we can conclude that Age and Income have a strong positive correlation.


Calculation of the correlation coefficient gives 0.978833742, which is very close to +1, indicating a strong positive correlation.
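The same coefficient can be reproduced in Python. A minimal sketch, assuming NumPy and the Age and Income values transcribed from the sample dataset above:

import numpy as np

age = [32, 45, 28, 52, 36, 40, 29, 55, 33, 47, 41, 38, 31, 49, 27, 54, 39, 30, 56, 35, 48, 42, 37, 34, 51]
income = [45000, 65000, 35000, 80000, 55000, 70000, 40000, 90000, 47000, 75000, 60000, 52000, 48000, 85000, 30000, 92000, 58000, 42000, 96000, 55000, 73000, 65000, 50000, 49000, 82000]

# Pearson correlation coefficient between Age and Income
r = np.corrcoef(age, income)[0, 1]
print(r)  # should be close to the 0.9788 reported above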

2. Distribution of income by gender: Create a box plot to compare the distribution of income
between males and females. This could reveal any differences in the median, quartiles, and
outliers for each gender.

Fig 1.2 Distribution of income by gender

In order to plot the data we first sort the data by gender and then by income. Having two gender categories, we plot the income of each category in a box plot.

A box plot gives us the five-number summary: minimum, first quartile, median, third quartile and maximum.

⚫ The box drawn spans from the first quartile to the third quartile.
⚫ A vertical line goes through the box at the median.
⚫ The whiskers go from each quartile to the minimum or maximum.
            MIN     Q1      MEDIAN  Q3      MAX     IQR     Lower Outlier Limit  Upper Outlier Limit
Male        42000   54250   67500   81250   92000   27000   13750                121750
Difference  42000   12250   13250   13750   10750
Female      30000   45000   50000   60000   96000   15000   22500                82500
Difference  30000   15000   5000    10000   22000

Table 1.1 Calculation of the five-number summary shown in the box plot chart


For calculating the outlier limits,

IQR = Q3 - Q1, where IQR stands for Interquartile Range
Lower Outlier Limit = Q1 - (1.5 * IQR)
Upper Outlier Limit = Q3 + (1.5 * IQR)
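As an illustration, a minimal Python sketch, assuming NumPy and the male incomes transcribed from the sample dataset, that computes the five-number summary and the outlier limits of Table 1.1:

import numpy as np

# Incomes of the 12 male respondents from the sample dataset
male_income = [65000, 80000, 70000, 90000, 75000, 52000,
               85000, 92000, 42000, 55000, 65000, 49000]

q1, median, q3 = np.percentile(male_income, [25, 50, 75])  # same interpolation as Excel's QUARTILE
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr  # values below this are potential outliers
upper_limit = q3 + 1.5 * iqr  # values above this are potential outliers

print(min(male_income), q1, median, q3, max(male_income), iqr, lower_limit, upper_limit)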

3. Distribution of income by education level: Create a box plot to compare the distribution of
income for each level of education. This could reveal any differences in the median, quartiles,
and outliers for each education level.

Fig 1.3 Distribution of income by Education level

4. Relationship between education level and marital status: Create a contingency table and
calculate the chi-square test statistic to see if there is a relationship between education level
and marital status. This could reveal whether certain education levels are more or less likely to
be associated with certain marital statuses.

H0: There is no significant relation between education level and marital status.
Level of Significance : 5% (0.05)

Contingency table :
Marital Status   Associate's  Bachelor's  Doctorate  High School  Master's  Grand Total
Divorced         0            1           0          1            1         3
Married          0            5           3          1            4         13
Single           2            4           0          2            1         9
Grand Total      2            10          3          4            6         25


Table 1.2 Contingency Table

Expected Frequencies Table :
Marital Status   Associate's  Bachelor's  Doctorate  High School  Master's  Grand Total
Divorced         0.24         1.2         0.36       0.48         0.72      3
Married          1.04         5.2         1.56       2.08         3.12      13
Single           0.72         3.6         1.08       1.44         2.16      9
Grand Total      2            10          3          4            6         25
Table 1.3 Expected Frequency Table

Chi-Square = 0.645865939

Level of significance = 5%
Degrees of freedom = (3 - 1) x (5 - 1) = 8
Critical value of chi-square at 5% with 8 degrees of freedom = 15.507

Here, 0.645865939 < 15.507

Thus, the null hypothesis is not rejected.
Conclusion : There is no significant relation between education level and marital status.
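The same test can also be run in Python. A minimal sketch, assuming SciPy and the observed counts of Table 1.2:

from scipy.stats import chi2_contingency

# Observed counts (rows: Divorced, Married, Single;
# columns: Associate's, Bachelor's, Doctorate, High School, Master's)
observed = [[0, 1, 0, 1, 1],
            [0, 5, 3, 1, 4],
            [2, 4, 0, 2, 1]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)  # the null hypothesis is rejected only if p_value < 0.05
print(expected)            # should match Table 1.3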

5. Relationship between age and education level: Create a histogram to see the distribution of
ages for each education level. This could reveal any differences or similarities in the age
distribution across education levels.

Fig 1.5 Relationship between age and education level

6. Distribution of employment status by ethnicity: Create a stacked bar chart to compare the
distribution of each ethnicity group across different employment statuses. This could reveal any
differences or similarities in the employment status of different ethnicity groups.


Fig 1.6 Distribution of employee status by ethnicity

7. Distribution of employment status by gender: Students could create a contingency table and
calculate the chi-square test statistic to see if there is a relationship between gender and
employment status. This could reveal whether certain genders are more or less likely to be
employed.

H0: There is no significant relation between gender and employment status.
Level of Significance : 5% (0.05)

Contingency table :
Gender        Employed  Self-Employed  Unemployed  Grand Total
Female        9         2              2           13
Male          9         2              1           12
Grand Total   18        4              3           25
Table 1.4 Contingency Table

Expected Frequencies Table :
Gender        Employed  Self-Employed  Unemployed  Grand Total
Female        9.36      2.08           1.56        13
Male          8.64      1.92           1.44        12
Grand Total   18        4              3           25
Table 1.5 Expected Frequency Table

Chi-Square = 0.863378835

Level of significance = 5%
Degrees of freedom = (2 - 1) x (3 - 1) = 2
Critical value of chi-square at 5% with 2 degrees of freedom = 5.991

Here, 0.863378835 < 5.991

Thus, the null hypothesis is not rejected.
Conclusion : There is no significant relation between gender and employment status.

Observations / Program:

Relationship between age and income: Create a scatter plot to see if there is a relationship between age and income. Also calculate the correlation coefficient to determine the strength and direction of the relationship.

import matplotlib.pyplot as plt

x = [32, 45, 28, 52, 36, 40, 29, 55, 33, 47, 41, 38, 31, 49, 27, 54, 39, 30, 56, 35, 48, 42, 37, 34, 51]
y = [45000, 65000, 35000, 80000, 55000, 70000, 40000, 90000, 47000, 75000, 60000, 52000, 48000, 85000, 30000, 92000, 58000, 42000, 96000, 55000, 73000, 65000, 50000, 49000, 82000]

plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Relation between Age and Income")
plt.scatter(x, y)
plt.show()

Output:

Fig 1.9 Relation between Age and Income

Conclusion:

In this lab, students learned how to explore and visualize data using mathematical and statistical
tools such as histograms, box plots, scatter plots, and correlation matrices. These tools are useful in
identifying patterns and relationships in data, and in making informed decisions based on data
analysis. The skills students have learned in this lab will be helpful in their future studies and careers in data analysis.
Quiz:

1. What are the measures of central tendency? Provide examples and explain when each
measure is appropriate to use.

Answer : Mean, Median and Mode are the measures of central tendency. These statistics indicate
where most values in a distribution fall and are also referred to as the central location of a
distribution.


MEAN : The mean is usually the best measure of central tendency to use when your data
distribution is continuous and symmetrical, such as when your data is normally distributed.
Example : if we would like to know how many hours on average an employee spends at training
in a year, we can find the mean training hours of a group of employees.

MEDIAN : The median is generally a better measure of the center when there are extreme values
or outliers because it is not affected by the precise numerical values of the outliers. The median is
usually preferred to other measures of central tendency when your data set is skewed (i.e., forms a
skewed distribution) or you are dealing with ordinal data.
Example : consider the wages of staff at a factory below:

Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that
this mean value might not be the best way to accurately reflect the typical salary of a worker, as
most workers have salaries in the $12k to 18k range. The mean is being skewed by the two large
salaries. Therefore, in this situation we usually prefer the median over the mean (or mode), because when the data is skewed the median best retains its central position and is not as strongly influenced by the skewed values.
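A minimal sketch, assuming NumPy, showing how the two large salaries pull the mean far above the median for this data:

import numpy as np

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]  # salaries in thousands of dollars

print(np.mean(salaries))    # 30.7, inflated by the two large salaries
print(np.median(salaries))  # 15.5, closer to the typical salary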

MODE: The mode is used for categorical data where we wish to know which is the most common category.
Example : We can see in Fig 1.10 that the most common form of transport in this particular data set is the bus. However, the mode leaves us with problems when two or more values share the highest frequency.

Fig 1.10 Distinct transport form

2. How can you calculate the correlation coefficient between two variables using
mathematical and statistical tools? Interpret the correlation coefficient value.

Answer : The covariance of two variables divided by the product of their standard deviations gives
Pearson’s correlation coefficient. It is usually represented by ρ (rho). ρ (X,Y) = cov (X,Y) / σX.σY.
If x and y are the two variables of interest, then the correlation coefficient can be calculated using the formula

r = [ n∑xy - (∑x)(∑y) ] / sqrt( [ n∑x² - (∑x)² ] [ n∑y² - (∑y)² ] )

Here,
n = Number of values or elements
∑x = Sum of the 1st values list
∑y = Sum of the 2nd values list
∑xy = Sum of the products of the 1st and 2nd values
∑x² = Sum of squares of the 1st values
∑y² = Sum of squares of the 2nd values
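A minimal sketch of this formula, assuming NumPy and two small made-up lists of values; the result should match np.corrcoef:

import numpy as np

x = np.array([1, 2, 3, 4, 5])  # hypothetical 1st values list
y = np.array([2, 4, 5, 4, 5])  # hypothetical 2nd values list
n = len(x)

numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
denominator = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))
r = numerator / denominator
print(r, np.corrcoef(x, y)[0, 1])  # both values should be identical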

Fig 1.11 Interpretation of the correlation coefficient values

3. Explain the concept of skewness and kurtosis in statistics. How can you measure and
interpret these measures using mathematical and statistical tools?

Answer :
Skewness - Skewness is a statistical number that tells us whether a distribution is symmetric or not. Excel's SKEW function uses the sample formula

Skewness = [ n / ((n - 1)(n - 2)) ] * ∑ ( (xᵢ - x̄) / S )³

Here,
S = standard deviation, x̄ = mean, n = number of values or elements

If a distribution is symmetric, then the Skewness value is 0. If Skewness is greater than 0, the distribution is called right-skewed, meaning the right tail is longer than the left tail. If Skewness is less than 0, it is called left-skewed, meaning the left tail is longer than the right tail.

Fig 1.12 Interpretation of Skewness value

Kurtosis : Kurtosis is a statistical number that tells us whether a distribution is taller or shorter than a normal distribution. Excel's KURT function returns the sample excess kurtosis

Kurtosis = [ n(n + 1) / ((n - 1)(n - 2)(n - 3)) ] * ∑ ( (xᵢ - x̄) / S )⁴ - [ 3(n - 1)² / ((n - 2)(n - 3)) ]

Here,
S = standard deviation, x̄ = mean, n = number of values or elements

If a distribution is similar to the normal distribution, the Kurtosis value is 0. If Kurtosis is greater than 0, the distribution has a higher peak than the normal distribution. If Kurtosis is less than 0, it is flatter than a normal distribution.

There are three types of distributions:

⚫ Leptokurtic: Sharply peaked with fat tails, and less variable.
⚫ Mesokurtic: Medium peaked.
⚫ Platykurtic: Flattest peak and highly dispersed.


Fig 1.13 Interpretation of Kurtosis value
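Skewness and kurtosis can also be computed in Python. A minimal sketch, assuming pandas, whose skew() and kurt() methods use the same sample formulas as Excel's SKEW and KURT:

import pandas as pd

values = pd.Series([2, 8, 0, 4, 1, 9, 9, 0, 5, 3])  # hypothetical sample values

print(values.skew())  # > 0: right-skewed, < 0: left-skewed
print(values.kurt())  # excess kurtosis: > 0 more peaked than normal, < 0 flatter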

Suggested References:
1. "Python for Data Analysis" by Wes McKinney
2. "Data Visualization with Python and Matplotlib" by Benjamin Root

Rubrics wise marks obtained

Understanding of Problem   Analysis of the Problem   Capability of writing program   Documentation   Total
02                         02                        05                              01              10


Experiment No: 2
Date:

AIM: Study of Measures of Central Tendency, Correlation, Percentile, Decile, Quartile, Measure of Variation, and Measure of Shape (Skewness and Kurtosis) with Excel Functions

Relevant CO: CO1, CO2

Objective:
The objective of this lab practical is to provide students with hands-on experience in using Excel
functions to explore and analyze a sample data sheet. Students will learn to calculate measures of
central tendency, correlation, percentile, decile, quartile, measure of variation, and measure of shape
using Excel functions. Additionally, students will learn to create visualizations to better understand
the data.

Materials:
- Computer with Microsoft Excel installed
- Sample data sheet (provided below, or a dataset may be provided by the subject teacher)

Sample Data Sheet:
StudentID  Test1 Score  Test2 Score  Age  Gender
1 85 92 19 Male
2 92 87 20 Female
3 78 80 18 Male
4 85 89 19 Male
5 90 95 21 Female
6 75 82 18 Male
7 83 87 20 Female
8 92 90 19 Male
9 80 85 18 Female
10 87 88 20 Female
Procedure:

Part 1: Measures of Central Tendency
1. Open the sample data sheet in Excel.
2. Calculate the mean, median, and mode for the test 1 score column using Excel functions.
3. Calculate the mean, median, and mode for the test 2 score column using Excel functions.
4. Write a brief interpretation of the results.
         Test1 Score   Test2 Score
Mean     84.7          87.5
Median   85            87.5
Mode     85            87


Table 2.1 Calculation of mean , median, mode

⚫ To find the mean we use the AVERAGE function:
=AVERAGE(B2:B11) for the Test1 Score column and =AVERAGE(C2:C11) for the Test2 Score column

⚫ To find the median we use the MEDIAN function:
=MEDIAN(B2:B11) for Test1 Score and =MEDIAN(C2:C11) for Test2 Score

⚫ To find the mode we use the MODE function:
=MODE(B2:B11) for Test1 Score and =MODE(C2:C11) for Test2 Score

Part 2: Correlation
1. Calculate the correlation between test 1 score and test 2 score using Excel functions.

⚫ Correlation can be found using the PEARSON function:
=PEARSON(B2:B11,C2:C11)
Here, B2:B11 is the range of Test1 Score and C2:C11 is the range of Test2 Score.

2. Create a scatter plot to visualize the relationship between test 1 score and test 2 score.

Fig. 2.1 Relation between test 1 and test 2

3. Write a brief interpretation of the results.

The correlation coefficient comes out to 0.744684233, which is positive and fairly close to 1, indicating a strong positive correlation between Test1 and Test2. The scatter graph gives a clear visualization of this relation.

Part 3: Percentile, Decile, and Quartile


1. Calculate the 25th, 50th, and 75th percentiles using Excel functions for both test 1 score and
test 2 score columns.
2. Calculate the 3rd, 4th, and 7th deciles (the 30th, 40th, and 70th percentiles) using Excel functions for both test 1 score and test 2 score columns.


3. Calculate the first and third quartiles using Excel functions for both test 1 score and test 2
score columns.
4. Create a box plot for both test 1 score and test 2 score columns.

Fig. 2.2 Box plot for test 1 and test 2

5. Write a brief interpretation of the results.

⚫ To find a percentile we use the PERCENTILE function:
=PERCENTILE(B2:B11, 0.25) for the 25th percentile of Test1 Score (use C2:C11 for Test2 Score)

⚫ To find a decile we also use the PERCENTILE function, since the k-th decile equals the (10 x k)-th percentile:
=PERCENTILE(B2:B11, 0.3) for the 3rd decile (30th percentile), and similarly for the other deciles

⚫ To find a quartile we use the QUARTILE function:
=QUARTILE(B2:B11, 1) for the first quartile and =QUARTILE(B2:B11, 3) for the third quartile
         Percentile                Decile                  Quartile
         25th    50th   75th      30th   40th   70th      1st     3rd
Test 1   80.75   85     89.25     82.1   84.2   87.9      80.75   89.25
Test 2   85.5    87.5   89.75     86.4   87     89.3      85.5    89.75
Table 2.2 Calculation of Percentile, Decile and Quartile

Part 4: Measure of Variation


1. Calculate the range, inter-quartile distance, variance, and standard deviation for both test 1
score and test 2 score columns using Excel functions.
2. Write a brief interpretation of the results.

⚫ To find the range we subtract the minimum from the maximum:
=MAX(B2:B11)-MIN(B2:B11) for Test1 Score (use C2:C11 for Test2 Score)
⚫ To find the inter-quartile distance we subtract the 1st quartile from the 3rd quartile


⚫ To find the variance we use the VAR function:
=VAR(B2:C11) (computed here over both score columns together)
⚫ To find the standard deviation we use the STDEV function:
=STDEV(B2:C11)

              Inter-Quartile  Min  Max  Range
Test1 Score   8.5             75   92   17
Test2 Score   4.25            80   95   15

Variance (both columns together) = 27.46315789
Standard Deviation (both columns together) = 5.240530307

Table 2.3 Calculation of Inter-Quartile Range, Range, Variance and Standard Deviation

Part 5: Measure of Shape


1. Calculate the skewness and kurtosis for both test 1 score and test 2 score columns using
Excel functions.
2. Write a brief interpretation of the results.

⚫ To find skewness we use the SKEW function:
=SKEW(B2:B11) for Test1 Score and =SKEW(C2:C11) for Test2 Score
⚫ To find kurtosis we use the KURT function:
=KURT(B2:B11) for Test1 Score and =KURT(C2:C11) for Test2 Score

            Test1 Score    Test2 Score
Skewness    -0.270863675   -0.113215642
Kurtosis    -0.924830004   -0.046667406
Table 2.4 Calculation of Skewness and Kurtosis

Quiz:
1) What Excel function can be used to calculate the mean of a dataset?
a) AVERAGE
b) MEDIAN
c) MODE
d) STANDARDIZE

ANSWER : a) AVERAGE

2) What does the correlation coefficient measure in terms of the relationship between two
variables?
a) Strength of the linear relationship
b) Variability of the data
c) Difference between mean and median
d) Skewness of the distribution

ANSWER : a) Strength of the linear relationship


Suggested References:
1. "Microsoft Excel Data Analysis and Business Modeling" by Wayne L. Winston
2. "Excel 2021: Data Analysis and Business Modeling" by Wayne L. Winston

Rubrics wise marks obtained

Understanding of Problem   Analysis of the Problem   Capability of writing program   Documentation   Total
02                         02                        05                              01              10


Experiment No: 3
Date:

AIM: Study of Basics of Python data types, NumPy, Matplotlib, Pandas.

Relevant CO: CO1, CO2

Objective:
The objective of this lab practical is to gain hands-on experience with NumPy, Matplotlib, and
Pandas libraries to manipulate and visualize data. Through this practical, students will learn how to
use different functions of these libraries to perform various data analysis tasks.

Materials Used:
- Python programming environment
- NumPy library
- Matplotlib library
- Pandas library
- Dataset file (provided by faculty)
//Example of dataset file like sales_Data.csv
o Date: Date of sale
o Product: Name of the product sold
o Units Sold: Number of units sold
o Revenue: Total revenue generated from the sale
o Region: Geographic region where the sale took place
o Salesperson: Name of the salesperson who made the sale

Procedures:

Part 1: NumPy
1. Import the NumPy library into Python.
2. Create a NumPy array with the following specifications:
a. Dimensions: 5x5
b. Data type: integer
c. Values: random integers between 1 and 100
3. Reshape the array into a 1x25 array and calculate the mean, median, variance, and standard
deviation using NumPy functions.
4. Generate a random integer array of length 10 and find the percentile, decile, and quartile values
using NumPy functions.

Part 2: Matplotlib
1. Import the Matplotlib library into Python.
2. Create a simple bar chart using the following data:
a. X-axis values: ['A', 'B', 'C', 'D']
b. Y-axis values: [10, 20, 30, 40]
3. Customize the plot by adding a title, axis labels, and changing the color and style of the bars.
4. Create a pie chart using the following data:
a. Labels: ['Red', 'Blue', 'Green', 'Yellow']
b. Values: [20, 30, 10, 40]
5. Customize the pie chart by adding a title, changing the colors of the slices, and adding a legend.


Part 3: Pandas
1. Import the Pandas library into Python.
2. Load the "sales_data.csv" file into a Pandas data frame.
3. Calculate the following statistics for the Units Sold and Revenue columns:
a. Mean
b. Median
c. Variance
d. Standard deviation
4. Group the data frame by Product and calculate the mean, median, variance, and standard deviation
of Units Sold and Revenue for each product using Pandas functions.
5. Create a line chart to visualize the trend of Units Sold and Revenue over time for each product.

Interpretation/Program/code:
import numpy as np
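# Steps 2-3: create a 5x5 array of random integers between 1 and 100, then reshape it to 1x25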
random_array = np.random.randint(1, 101, (5, 5), dtype=int)
reshaped_array = random_array.reshape(1, 25)
mean_value = np.mean(reshaped_array)
median_value = np.median(reshaped_array)
variance_value = np.var(reshaped_array)
std_deviation_value = np.std(reshaped_array)

# 4. Generate a random integer array of length 10 and find percentile, decile, and quartile values
random_integers = np.random.randint(1, 101, 10, dtype=int)

percentile_values = np.percentile(random_integers, [25, 50, 75])


decile_values = np.percentile(random_integers, np.arange(10, 101, 10))
quartile_values = np.percentile(random_integers, [25, 50, 75])

print("Random 5x5 array:\n", random_array)


print("\nReshaped 1x25 array:\n", reshaped_array)
print("\nMean:", mean_value)
print("Median:", median_value)
print("Variance:", variance_value)
print("Standard Deviation:", std_deviation_value)
print("\nRandom 10-element array:", random_integers)
print("\nPercentile values:", percentile_values)
print("Decile values:", decile_values)
print("Quartile values:", quartile_values)

Output:
Random 5x5 array:
[[61 73 42 26 7]
[14 92 70 72 71]
[31 98 87 81 83]
[85 87 24 7 88]


[40 4 29 97 30]]

Reshaped 1x25 array:


[[61 73 42 26 7 14 92 70 72 71 31 98 87 81 83 85 87 24 7 88 40 4 29 97
30]]

Mean: 55.96
Median: 70.0
Variance: 981.9584000000001
Standard Deviation: 31.336215470282944

Random 10-element array: [49 11 68 80 94 27 20 14 60 58]

Percentile values: [21.75 53.5 66. ]


Decile values: [13.7 18.8 24.9 40.2 53.5 58.8 62.4 70.4 81.4 94. ]
Quartile values: [21.75 53.5 66. ]

Interpretation/Program/code:
import matplotlib.pyplot as plt

# 2. Create a simple bar chart


x_values = ['A', 'B', 'C', 'D']
y_values = [10, 20, 30, 40]
plt.bar(x_values, y_values, color='skyblue', edgecolor='black', linestyle='--')

# 3. Customize the bar chart


plt.title('Simple Bar Chart')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True, linestyle='--', alpha=0.6)
plt.ylim(0, 50) # Set the y-axis limit

# 4. Create a pie chart


labels = ['Red', 'Blue', 'Green', 'Yellow']
values = [20, 30, 10, 40]
colors = ['red', 'blue', 'green', 'yellow']
plt.figure() # Create a new figure for the pie chart
plt.pie(values, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
# 5. Customize the pie chart
plt.title('Pie Chart')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle
plt.legend()
plt.show()


Output:

Interpretation/Program/code:

# Step 1: Import Pandas library


import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Load the "sales_data.csv" file into a Pandas data frame


data = pd.read_csv(r"D:\Semester-5\DS\SaleData.csv")  # raw string avoids backslash-escape issues in the Windows path

# Step 3: Calculate statistics for Units Sold and Revenue columns


units_sold_mean = data['Units Sold'].mean()
units_sold_median = data['Units Sold'].median()
units_sold_variance = data['Units Sold'].var()
units_sold_stddev = data['Units Sold'].std()
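# The Revenue column may contain currency symbols or commas, so the next step keeps only digits and the decimal point before converting it to float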

data['Revenue'] = data['Revenue'].str.replace(r'[^0-9.]', '', regex=True).astype(float)

revenue_mean = data['Revenue'].mean()
revenue_median = data['Revenue'].median()
revenue_variance = data['Revenue'].var()
revenue_stddev = data['Revenue'].std()

# Step 4: Group the data frame by Product and calculate statistics


grouped_data = data.groupby('Product')[['Units Sold', 'Revenue']].agg(['mean', 'median', 'var', 'std'])

# Step 5: Create a line chart to visualize the trend over time for each product
for product, product_data in data.groupby('Product'):
plt.figure(figsize=(10, 5))
plt.plot(product_data['Date'], product_data['Units Sold'], label='Units Sold', marker='o')
plt.plot(product_data['Date'], product_data['Revenue'], label='Revenue', marker='o')


plt.title(f'{product} Sales Over Time')


plt.xlabel('Date')
plt.ylabel('Units Sold / Revenue')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Output:


Conclusion:
In conclusion, this lab practical provided hands-on experience with NumPy, Matplotlib, and Pandas
libraries in Python for data manipulation and visualization. These libraries have wide-ranging
applications in various fields, enabling researchers and analysts to gain insights from large datasets
quickly and efficiently. Through exercises such as calculating statistical measures and visualizing
data using charts, we explored the functionality and flexibility of these powerful data analysis tools.
Overall, gaining proficiency in these libraries equips individuals to tackle complex data analysis
challenges and contribute to their respective fields of study or industries.

Quiz:

1. What is the difference between a list and a tuple in Python?


In Python, both lists and tuples are used to store collections of items, but they have some key
differences:

1. Mutability:
- List: Lists are mutable, which means you can change their contents (add, remove, or modify
elements) after they are created. You can use methods like `append()`, `remove()`, and `pop()`
to modify a list.
- Tuple: Tuples are immutable, which means once you create a tuple, you cannot change its
elements. You can't add or remove elements from a tuple, nor can you modify the existing
elements.

2. Syntax:
- List: Lists are created using square brackets `[...]`. For example: `my_list = [1, 2, 3]`
- Tuple: Tuples are created using parentheses `(...)`. For example: `my_tuple = (1, 2, 3)` or
even without parentheses: `my_tuple = 1, 2, 3`


3. Performance:
- Lists can be slightly slower than tuples in terms of iteration and access time because of their
mutability. When you modify a list, it may require resizing the underlying data structure, which
can introduce some overhead.
- Tuples, being immutable, are generally faster for iteration and access, and they consume
slightly less memory than lists.

4. Use Cases:
- Lists are typically used when you have a collection of items that may change or need to be
modified over time. For example, a list of tasks to-do that you can add or remove items from.
- Tuples are often used when you have a collection of items that should not change. For
example, a tuple might be used to represent a set of coordinates (x, y) or a record in a database.

5. Packing and Unpacking:


- Tuples are often used for "packing" and "unpacking" values. For example, you can return
multiple values from a function as a tuple and then unpack those values when calling the
function.
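A minimal sketch illustrating these differences (mutability, immutability, and tuple packing/unpacking):

```python
my_list = [1, 2, 3]
my_list.append(4)        # lists are mutable: my_list is now [1, 2, 3, 4]

my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 99     # tuples are immutable: this raises a TypeError
except TypeError as err:
    print("Cannot modify a tuple:", err)

def min_max(values):
    return min(values), max(values)   # two values packed into a tuple

low, high = min_max([5, 3, 9, 1])     # tuple unpacked into two variables
print(low, high)                      # 1 9
```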

2. How can you use NumPy to generate an array of random numbers?


NumPy provides a variety of functions to generate arrays of random numbers. You can use these
functions to create arrays with random values according to different distributions. Here are
some common ways to generate random number arrays using NumPy:
1. **Random Values from a Uniform Distribution**:
- To generate random numbers between 0 and 1 from a uniform distribution:

```python
import numpy as np
random_array = np.random.rand(5, 5)  # Creates a 5x5 array of random values
```

2. **Random Integers**:
- To generate random integers within a specified range:

```python
import numpy as np
random_integers = np.random.randint(1, 100, size=10)  # Generates 10 random integers between 1 and 100
```

3. **Standard Normal Distribution (Gaussian)**:


- To generate random numbers from a standard normal distribution (mean=0, standard
deviation=1):
```python
import numpy as np

random_values = np.random.randn(10)  # Generates an array of 10 random numbers from a standard normal distribution
```


4. **Custom Normal Distribution**:


- To generate random numbers from a custom normal distribution with a specified mean and
standard deviation:
```python
import numpy as np
mean = 5
std_dev = 2
random_values = mean + std_dev * np.random.randn(10)  # Generates an array of 10 random numbers from a custom normal distribution
```
5. **Random Numbers from Other Distributions**:
- NumPy also provides functions to generate random numbers from other distributions like
exponential, Poisson, binomial, and more. For example:
```python
import numpy as np

exponential_values = np.random.exponential(scale=2, size=10)  # Generates 10 random values from an exponential distribution with scale=2
```

Suggested References:
1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway

Rubrics wise marks obtained

Understanding of Problem   Analysis of the Problem   Capability of writing program   Documentation   Total
02                         02                        05                              01              10


Experiment No: 4
Date:

AIM: Implementation of Various Probability Distributions with NumPy Random Library Functions

Relevant CO: - CO3


Objective:
The objective of this lab practical is to gain an understanding of various probability distributions
and implement those using NumPy random library functions.

Materials Used:
- Python environment (Anaconda, Jupyter Notebook, etc.)
- NumPy library

Procedure:
1. Introduction to Probability Distributions:
o Probability theory is the branch of mathematics that deals with the study of random events
or phenomena. In probability theory, a probability distribution is a function that describes
the likelihood of different outcomes in a random process. Probability distributions can be
categorized into two types: discrete and continuous.
o Discrete probability distributions are used when the possible outcomes of a random
process are countable and can be listed. The most commonly used discrete probability
distributions are Bernoulli, Binomial, and Poisson distributions.
o Continuous probability distributions are used when the possible outcomes of a random
process are not countable and can take any value within a certain range. The most
commonly used continuous probability distributions are Normal and Exponential
distributions.
o Each probability distribution has its own set of properties, such as mean, variance,
skewness, and kurtosis. Mean represents the average value of the random variable,
variance represents how much the values vary around the mean, skewness represents the
degree of asymmetry of the distribution, and kurtosis represents the degree of peakedness
or flatness of the distribution.
o Probability distributions are widely used in fields such as finance, engineering, physics,
and social sciences to model real-world phenomena and make predictions about future
events. Understanding different probability distributions and their properties is an
important tool for analyzing data and making informed decisions.

2. Implementation of Probability Distributions using NumPy random library functions:

#python
import numpy as np
import matplotlib.pyplot as plt

# Generate 1000 random numbers following a normal distribution with mean 0 and standard deviation 1
normal_dist = np.random.normal(0, 1, 1000)

# Calculate the mean and standard deviation of the distribution


mean = np.mean(normal_dist)
std_dev = np.std(normal_dist)


# Generate 1000 random numbers following a Poisson distribution with lambda 5


poisson_dist = np.random.poisson(5, 1000)

# Calculate the mean and variance of the Poisson distribution


poisson_mean = np.mean(poisson_dist)
poisson_var = np.var(poisson_dist)

# Plot the PDF and CDF of the normal distribution
from scipy.stats import norm, poisson  # exact PDF/CDF/PMF functions

x = np.sort(normal_dist)
plt.hist(normal_dist, bins=30, density=True, alpha=0.5)
plt.plot(x, norm.pdf(x, mean, std_dev), linewidth=2)   # probability density function
plt.plot(x, norm.cdf(x, mean, std_dev), linewidth=2)   # cumulative distribution function
plt.show()

# Plot the PMF and CDF of the Poisson distribution
k = np.arange(0, 16)
plt.hist(poisson_dist, bins=15, density=True, alpha=0.5)
plt.plot(k, poisson.pmf(k, poisson_mean), linewidth=2)  # probability mass function
plt.plot(k, poisson.cdf(k, poisson_mean), linewidth=2)  # cumulative distribution function
plt.show()

In this example, we generate 1000 random numbers following a normal distribution with mean 0
and standard deviation 1 using the `np.random.normal()` function. We then calculate the mean and
standard deviation of the distribution using the `np.mean()` and `np.std()` functions.

We also generate 1000 random numbers following a Poisson distribution with lambda 5 using the
`np.random.poisson()` function. We calculate the mean and variance of the Poisson distribution
using the `np.mean()` and `np.var()` functions.

We then plot the probability density function (PDF) and cumulative distribution function (CDF) of
both distributions using the `plt.hist()` and `plt.plot()` functions from the Matplotlib library.

3. Exercise:
- Generate a dataset of your choice or given by faculty with a given probability distribution using
NumPy random library functions
- Plot the probability density function and cumulative distribution function for the generated data
- Calculate the descriptive statistics of the generated data

Interpretation/Program/code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set a random seed for reproducibility


np.random.seed(42)

# Generate a dataset with a normal distribution


mean = 5.0


std_dev = 2.0
sample_size = 1000

data = np.random.normal(mean, std_dev, sample_size)

# Plot the Probability Density Function (PDF)


plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')
x = np.linspace(mean - 4 * std_dev, mean + 4 * std_dev, 100)
pdf = stats.norm.pdf(x, mean, std_dev)
plt.plot(x, pdf, 'k', linewidth=2)
plt.title("Probability Density Function (PDF)")

# Plot the Cumulative Distribution Function (CDF)


plt.subplot(1, 2, 2)
plt.hist(data, bins=30, density=True, cumulative=True, alpha=0.6, color='b')
cdf = stats.norm.cdf(x, mean, std_dev)
plt.plot(x, cdf, 'k', linewidth=2)
plt.title("Cumulative Distribution Function (CDF)")

plt.show()

# Calculate descriptive statistics


mean_data = np.mean(data)
std_dev_data = np.std(data)
median_data = np.median(data)
min_data = np.min(data)
max_data = np.max(data)

print("Descriptive Statistics:")
print(f"Mean: {mean_data:.2f}")
print(f"Standard Deviation: {std_dev_data:.2f}")
print(f"Median: {median_data:.2f}")
print(f"Minimum: {min_data:.2f}")
print(f"Maximum: {max_data:.2f}")
Output:


Descriptive Statistics:
Mean: 5.04
Standard Deviation: 1.96
Median: 5.05
Minimum: -1.48
Maximum: 12.71

Conclusion:
This lab practical provided an opportunity to explore and implement various probability
distributions using NumPy random library functions. By understanding and applying different
probability distributions, one can model real-world phenomena and make predictions about future
events. With the knowledge gained in this lab practical, student will be equipped to work with
probability distributions and analyze data in a wide range of fields, including finance, engineering,
and social sciences.

Quiz:

1. Which NumPy function can be used to generate random numbers from a normal distribution?

a) numpy.random.uniform
b) numpy.random.poisson
c) numpy.random.normal
d) numpy.random.exponential

Ans : c) numpy.random.normal

2. What is the purpose of the probability density function (PDF) in probability distributions?

a) To calculate the cumulative probability


b) To generate random numbers
c) To visualize the distribution
d) To calculate the probability of a specific value

Ans : d) To calculate the probability of a specific value

Suggested References:
1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained

Understanding of Problem   Analysis of the Problem   Capability of writing program   Documentation   Total
02                         02                        05                              01              10


Experiment-5
Date:

AIM: Implementation of Estimation of Parameters for the Best-Fit Probability Distribution using
the Fitter Class in Python.

Relevant CO: - CO3

Objectives: The objective of this lab practical is to learn how to estimate the parameters for the
best-fit probability distribution for a given dataset using the Fitter class in Python.
Materials Used:

1. Python 3.x
2. Jupyter Notebook
3. NumPy library
4. Fitter library

Theory:

Dataset:
Consider the following dataset, which represents the heights of individuals in centimeters:
170, 165, 180, 172, 160, 175, 168, 155, 185, 190, 162, 178, 168, 172, 180, 160, 165, 172, 168, 175
Procedure:

1. Introduction to Parameter Estimation and Probability Distributions:

Probability distributions provide a mathematical framework for describing the likelihood of


different outcomes or events in a dataset. Parameter estimation plays a crucial role in probability
distributions as it involves determining the values of the parameters that best describe the observed
data.

Parameter estimation is important because it allows us to make inferences, predictions, and draw
meaningful conclusions from the data. By estimating the parameters, we can effectively model and
analyze various phenomena, summarizing complex datasets in a more simplified and interpretable
manner.

The concept of the best-fit probability distribution refers to finding the distribution that provides the
closest match to the observed data. The best-fit distribution is determined by estimating the
parameters in such a way that the observed data exhibits the highest likelihood or best matches the
underlying characteristics of the data. Selecting the best-fit distribution helps us understand the
data's behavior, make accurate predictions, and gain insights into its properties.

Commonly used probability distributions include the normal (Gaussian) distribution, uniform
distribution, exponential distribution, Poisson distribution, and binomial distribution. Each
distribution has its own characteristics and applications in various fields.

Understanding parameter estimation and probability distributions allows us to effectively model and
analyze data, make informed decisions, and gain insights into the underlying properties of the data.
By estimating the parameters for the best-fit probability distribution, we can unlock valuable
information and extract meaningful patterns from the observed data.


2. Installation of Required Libraries:


- Install the necessary libraries, including NumPy and Fitter, using the appropriate package
manager.

3. Loading and Preparing the Dataset:


- Load the dataset from a file or use the provided dataset.
- Perform any necessary data preprocessing steps, such as cleaning or normalization.

4. Estimating Parameters for Best-Fit Probability Distribution using Fitter Class:


- Import the required libraries and instantiate the Fitter class.
- Fit the dataset to various probability distributions available in the Fitter class using the `.fit()`
method.
- Determine the best-fit distribution based on goodness-of-fit metrics, such as AIC (Akaike
Information Criterion) or BIC (Bayesian Information Criterion).
- Retrieve the estimated parameters for the best-fit distribution using the `.summary()` method.

5. Visualization of Best-Fit Distribution:


- Plot the histogram of the dataset.
- Plot the probability density function (PDF) of the best-fit distribution over the histogram.

6. Interpretation and Analysis:


- Interpret the estimated parameters of the best-fit distribution.
- Analyze the goodness of fit and discuss any potential limitations or considerations.

7. Conclusion:
- Summarize the importance of parameter estimation and the best-fit distribution in data analysis.
- Highlight the capabilities of the Fitter class in Python for automating the estimation of
parameters.
- Discuss potential applications and further exploration in different domains.

Interpretation/Program/code:
# Step 2: Import necessary libraries
import numpy as np
from fitter import Fitter
import matplotlib.pyplot as plt
from scipy.stats import norm

# Step 3: Loading and Preparing the Dataset


# Your dataset
heights = [170, 165, 180, 172, 160, 175, 168, 155, 185, 190, 162, 178, 168, 172, 180, 160, 165, 172,
168, 175]

# Step 4: Estimating Parameters for Best-Fit Probability Distribution


# Fit the dataset to the normal distribution
params = norm.fit(heights)
mean, std = params

# Step 5: Visualization of Best-Fit Distribution


# Plot the histogram of the dataset


plt.hist(heights, bins=10, density=True, alpha=0.5, color='b', label='Histogram')

# Plot the probability density function (PDF) of the best-fit distribution (normal)
x = np.linspace(min(heights), max(heights), 100)
y = norm.pdf(x, mean, std)
plt.plot(x, y, 'r-', lw=2, label='Best Fit Distribution (Normal)')

plt.legend()
plt.title('Best Fit Distribution: Normal')
plt.xlabel('Height (cm)')
plt.ylabel('Probability')
plt.show()

# Step 6: Interpretation and Analysis


print("Best Fit Distribution: Normal")
print(f"Estimated Parameters - Mean: {mean}, Standard Deviation: {std}")

# Step 7: Conclusion
print("Parameter estimation and best-fit distribution selection are essential for data analysis.")
print("The Fitter library simplifies the process of finding the best-fit distribution.")
Output:

Best Fit Distribution: Normal


Estimated Parameters - Mean: 171.0, Standard Deviation: 8.608135686662937
Parameter estimation and best-fit distribution selection are essential for data analysis.
The Fitter library simplifies the process of finding the best-fit distribution.
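Note that the program above estimates the normal-distribution parameters directly with scipy's norm.fit. A minimal sketch of how the Fitter class itself, as described in the procedure, could be used to compare several candidate distributions, assuming the fitter package is installed (pip install fitter):

from fitter import Fitter

heights = [170, 165, 180, 172, 160, 175, 168, 155, 185, 190,
           162, 178, 168, 172, 180, 160, 165, 172, 168, 175]

# Fit a few candidate distributions and rank them by goodness of fit
f = Fitter(heights, distributions=['norm', 'lognorm', 'gamma', 'expon', 'uniform'])
f.fit()
f.summary()                                   # table and plot of the best-fitting candidates
print(f.get_best(method='sumsquare_error'))   # parameters of the best-fit distribution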

Conclusion:

In this example, we have a dataset of heights of individuals. We use the Fitter class from the `fitter`
library to estimate the parameters for the best-fit probability distribution.

We instantiate the Fitter class with the dataset `data`. Then, we use the `.fit()` method to fit the data
to various distributions available in the Fitter class. The `.fit()` method automatically estimates the
parameters for each distribution and selects the best-fit distribution based on the goodness-of-fit


metrics.

Finally, we retrieve the best-fit distribution using the `.get_best()` method and print the summary of
the distribution using the `.summary()` method. We also plot the histogram of the dataset and
overlay the probability density function (PDF) of the best-fit distribution using the `.plot_pdf()`
method.
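This sketch assumes the `heights` list from the program above; the hand-picked candidate distribution list is an illustrative choice, not a requirement of the library.

# Sketch: automated distribution fitting with the fitter library (candidate list is illustrative)
from fitter import Fitter

f = Fitter(heights, distributions=['norm', 'lognorm', 'gamma', 'expon'])
f.fit()                    # fit every candidate distribution to the data
print(f.summary())         # ranked goodness-of-fit table (recent fitter versions include aic/bic columns)
print(f.get_best())        # parameters of the best-fitting distribution, e.g. {'norm': {...}}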

Note: Before running the code, make sure you have the `numpy`, `scipy`, and `matplotlib` libraries installed; the Fitter sketch above additionally requires the `fitter` library, which can be installed with pip: `pip install fitter`.

Through this practical, we learned the importance of parameter estimation in probability distributions and the significance of selecting the best-fit distribution for accurate modeling and analysis. The Fitter class provides a convenient and efficient way to fit a dataset to many candidate distributions and evaluate their goodness of fit using metrics such as AIC or BIC.

Quiz:

1. What is the purpose of the Fitter class in Python?

a) To fit a probability distribution to a given dataset


b) To generate random numbers from a probability distribution
c) To calculate descriptive statistics of a dataset
d) To visualize the probability distribution of a dataset

Ans : a) To fit a probability distribution to a given dataset

2. Which method of the Fitter class can be used to estimate the best-fit probability distribution for a
given dataset?

a) fit
b) predict
c) evaluate
d) transform

Ans : a) fit

Suggested References:-

1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway

Rubrics wise marks obtained

Understanding of Problem | Analysis of the Problem | Capability of writing program | Documentation | Total
02 | 02 | 05 | 01 | 10

Experiment-6
Date:

AIM: Implementation of Linear Regression with Scikit-learn library in Python

Relevant CO: - CO4

Objective:
The objective of this lab practical is to implement linear regression to predict the value of
a variable in a given dataset. Linear regression is a statistical technique used to model the
relationship between a dependent variable and one or more independent variables. In this
lab, we will explore how to build a linear regression model and use it to make predictions.

Materials Used:
- Python 3.x
- Jupyter Notebook
- NumPy library
- Pandas library
- Matplotlib library
- Scikit-learn library

Dataset:
For this lab, we will use a dataset that contains information about houses and their sale
prices. The dataset has the following columns:

- `Area` (in square feet): Represents the area of the house.


- `Bedrooms`: Number of bedrooms in the house.
- `Bathrooms`: Number of bathrooms in the house.
- `Garage Cars`: Number of cars that can be accommodated in the garage.
- `Sale Price` (in dollars): Represents the sale price of the house.

Area, Bedrooms, Bathrooms, Garage Cars, Sale Price


2000,3,2,2,250000
1800,4,3,2,280000
2200,3,2,2,265000
1500,2,1,1,200000
2400,4,3,3,320000
1900,3,2,2,275000
1700,3,2,1,230000
2100,4,3,2,295000

Procedure:

1. Introduction to Linear Regression:

Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It aims to find a linear equation that best represents the association between the variables. Linear regression assumes a linear relationship and seeks to minimize the differences between observed and predicted values. It has applications in prediction, understanding correlations, and making data-driven decisions.
The equation for a simple linear regression model can be represented as:

y = β0 + β1*x + ε

where:

y is the dependent variable
x is the independent variable
β0 is the y-intercept (the value of y when x = 0)
β1 is the slope (the change in y for a unit change in x)
ε represents the error term or residual

(A small numerical sketch of fitting this equation follows.)
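The sketch below fits the equation with NumPy's polyfit; the x and y values are made up for illustration and are not the lab dataset.

# Sketch: fitting y = β0 + β1*x by least squares on illustrative data
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # illustrative independent variable
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])    # illustrative dependent variable

beta1, beta0 = np.polyfit(x, y, 1)          # degree-1 fit returns the slope first, then the intercept
print("Intercept (β0):", beta0)
print("Slope (β1):", beta1)

residuals = y - (beta0 + beta1 * x)         # ε: the differences the fit seeks to minimize
print("Residuals:", residuals)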

2. Importing Required Libraries and Loading the Dataset:


- Import the necessary libraries, including NumPy, Pandas, Matplotlib, and Scikit-learn.
- Load the dataset into a Pandas DataFrame using the appropriate function or by reading
from a file.

3. Exploratory Data Analysis:


- Perform exploratory data analysis to gain insights into the dataset.
- Analyze the distribution and statistical summary of the variables.
- Visualize the relationships between variables using scatter plots or other appropriate
plots.

4. Data Preprocessing:
- Handle missing values, if any, by imputation or removal.
- Convert categorical variables into numerical representations, if required.
- Split the dataset into input features (independent variables) and the target variable
(dependent variable).

5. Splitting the Dataset into Training and Testing Sets:


- Split the dataset into training and testing sets to evaluate the model's performance.
- Typically, use a 70-30 or 80-20 split for training and testing, respectively.

6. Building the Linear Regression Model:


- Import the LinearRegression class from Scikit-learn.
- Instantiate the LinearRegression model.
- Fit the model to the training data using the `.fit()` method.

7. Model Evaluation and Prediction:


- Evaluate the model's performance using appropriate evaluation metrics, such as mean squared error (MSE) or R-squared. (A short hand-computed sketch of these metrics follows this step.)
- Make predictions on the testing data using the `.predict()` method.
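The sketch below computes both metrics by hand on made-up actual and predicted values, to show what Scikit-learn's mean_squared_error and r2_score calculate.

# Sketch: computing MSE and R-squared by hand on illustrative values
import numpy as np

y_true = np.array([250000.0, 280000.0, 200000.0])   # illustrative actual sale prices
y_hat = np.array([255000.0, 270000.0, 210000.0])    # illustrative predicted sale prices

mse = np.mean((y_true - y_hat) ** 2)                 # mean squared error
ss_res = np.sum((y_true - y_hat) ** 2)               # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
r2 = 1 - ss_res / ss_tot                             # R-squared: share of variance explained
print("MSE:", mse)
print("R-squared:", r2)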

8. Visualization of Results:
- Visualize the actual values versus the predicted values using scatter plots or other
suitable plots.
- Plot the regression line to show the relationship between the independent and dependent
variables.


Interpretation/Program/code:
# Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Loading the Dataset


data = pd.read_csv(r"D:\Semester-5\DS\P5.csv")  # replace with the actual path to the dataset CSV if needed

# Exploratory Data Analysis


# Display the first few rows of the dataset
print(data.head())

# Analyze the distribution and summary statistics of the variables


print(data.describe())

# Visualize relationships between variables


plt.scatter(data['Area'], data['Sale Price'])
plt.xlabel('Area (sq. ft)')
plt.ylabel('Sale Price ($)')
plt.title('Scatter Plot of Area vs. Sale Price')
plt.show()

# Data Preprocessing
# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values:")
print(missing_values)

# If there are missing values, handle them using imputation or removal

# Splitting the Dataset into Training and Testing Sets


X = data[['Area', 'Bedrooms', 'Bathrooms', 'Garage Cars']]
y = data['Sale Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Building the Linear Regression Model


model = LinearRegression()
model.fit(X_train, y_train)

# Model Evaluation and Prediction


y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

# Visualization of Results
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Sale Price ($)")
plt.ylabel("Predicted Sale Price ($)")
plt.title("Actual vs. Predicted Sale Price")
plt.show()

# Plot the model's predictions against Area (sorted by Area so the line is drawn left to right);
# note that the predictions also depend on Bedrooms, Bathrooms, and Garage Cars.
order = np.argsort(X_test['Area'].values)
plt.scatter(data['Area'], data['Sale Price'])
plt.plot(X_test['Area'].values[order], y_pred[order], color='red', linewidth=3)
plt.xlabel('Area (sq. ft)')
plt.ylabel('Sale Price ($)')
plt.title('Linear Regression: Area vs. Sale Price')
plt.show()

Output:

Area Bedrooms Bathrooms Garage Cars Sale Price


0 2000 3 2 2 250000
1 1800 4 3 2 280000
2 2200 3 2 2 265000
3 1500 2 1 1 200000
4 2400 4 3 3 320000

Area Bedrooms Bathrooms Garage Cars Sale Price


count 8.000000 8.000000 8.000000 8.00000 8.000000
mean 1950.000000 3.250000 2.250000 1.87500 264375.000000
std 287.849167 0.707107 0.707107 0.64087 37648.515433
min 1500.000000 2.000000 1.000000 1.00000 200000.000000
25% 1775.000000 3.000000 2.000000 1.75000 245000.000000
50% 1950.000000 3.000000 2.000000 2.00000 270000.000000
75% 2125.000000 4.000000 3.000000 2.00000 283750.000000
max 2400.000000 4.000000 3.000000 3.00000 320000.000000


Conclusion:

In this practical, we implemented linear regression to predict variable values in a dataset. By training a linear regression model with Python and Scikit-learn, we obtained predictions from the relationships between the variables; with only eight rows in the dataset the evaluation metrics should be read cautiously, but the workflow is the same for larger datasets. Linear regression is a valuable tool for data analysis and prediction, providing insights and supporting decision-making.
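A brief follow-up sketch of reading the fitted model's parameters, assuming the `model` and `X` variables defined in the program above:

# Sketch: inspecting the fitted linear model (assumes `model` and `X` from the program above)
print("Intercept:", model.intercept_)
for name, coef in zip(X.columns, model.coef_):
    # Each coefficient is the expected change in Sale Price for a one-unit increase in that
    # feature, holding the other features fixed.
    print(f"Coefficient for {name}: {coef:.2f}")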

Quiz:

1. Which scikit-learn function is used to create a linear regression model object in Python?

a) sklearn.linear_model.LinearRegression
b) sklearn.preprocessing.StandardScaler
c) sklearn.model_selection.train_test_split
d) sklearn.metrics.mean_squared_error

Ans : a) sklearn.linear_model.LinearRegression


2. What is the purpose of the coefficient of determination (R-squared) in linear regression?

a) To measure the average squared difference between predicted and actual values
b) To evaluate the significance of predictor variables
c) To quantify the proportion of variance in the dependent variable explained by the independent
variables
d) To determine the optimal number of features for the regression model

Ans : c) To quantify the proportion of variance in the dependent variable explained by the independent variables

Suggested References:-
1. "Pattern Recognition and Machine Learning" by Christopher M. Bishop
2. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
3. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
4. Data Science For Dummies by Lillian Pierson, Jake Porway

Rubrics wise marks obtained

Understanding of Problem | Analysis of the Problem | Capability of writing program | Documentation | Total
04 | 04 | 10 | 04 | 20

Experiment-7
Date:

AIM: Implementation of Logistic Regression with Scikit-learn library in Python

Relevant CO:- CO4

Objective:
The objective of this lab practical is to implement logistic regression using Scikit-learn
library in Python. Logistic regression is a popular classification algorithm used to model
the relationship between input variables and categorical outcomes. In this lab, we will
explore how to build a logistic regression model and use it for classification tasks.

Materials Used:
- Python 3.x
- Jupyter Notebook
- Scikit-learn library
- Pandas library
- NumPy library
- Matplotlib library

Dataset:
For this lab, we will use a dataset that contains information about customers and whether
they churned or not from a telecommunications company. The dataset has the following
columns:

- `CustomerID`: Unique identifier for each customer


- `Gender`: Gender of the customer (Male/Female)
- `Age`: Age of the customer
- `Income`: Income of the customer
- `Churn`: Binary variable indicating whether the customer churned (1) or not (0)

CustomerID,Gender,Age,Income,Churn
1,Male,32,50000,0
2,Female,28,35000,0
3,Male,45,80000,1
4,Male,38,60000,0
5,Female,20,20000,1
6,Female,55,75000,0
7,Male,42,90000,0
8,Female,29,40000,1

Procedure:

1. Introduction to Logistic Regression:


Logistic regression is a classification algorithm used to predict binary outcomes or
probabilities. It models the relationship between input features and the probability of an
event occurring. By applying the logistic function, it maps the linear regression output to
a value between 0 and 1, allowing for classification based on predicted probabilities.
Logistic regression is interpretable, handles categorical and continuous features, and finds applications in various domains. It is a fundamental and effective approach for binary classification tasks. A short sketch of the logistic (sigmoid) mapping follows.
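The sketch below applies the sigmoid to a handful of illustrative scores; the values are made up for demonstration and are not the lab dataset.

# Sketch: the logistic (sigmoid) function on illustrative linear-model scores
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])   # illustrative outputs of the linear part
print(sigmoid(scores))                            # roughly [0.05, 0.27, 0.50, 0.73, 0.95]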
2. Importing Required Libraries and Loading the Dataset:
- Import the necessary libraries, including Scikit-learn, Pandas, NumPy, and Matplotlib.
- Load the dataset into a Pandas DataFrame using the appropriate function or by reading
from a file.

3. Exploratory Data Analysis:


- Perform exploratory data analysis to understand the dataset.
- Analyze the distribution of variables, detect any missing values, and handle them if
necessary.
- Visualize the relationships between variables using plots and charts.

4. Data Preprocessing:
- Split the dataset into input features (independent variables) and the target variable
(dependent variable).
- Convert categorical variables into numerical representations using one-hot encoding or
label encoding.
- Split the dataset into training and testing sets for model evaluation.

5. Building the Logistic Regression Model:


- Import the Logistic Regression class from Scikit-learn.
- Instantiate the Logistic Regression model.
- Fit the model to the training data using the `.fit()` method.

6. Model Evaluation and Prediction:


- Evaluate the model's performance using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score. (A short hand-computed sketch of these metrics follows this step.)
- Make predictions on the testing data using the `.predict()` method.
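The sketch below derives accuracy, precision, recall, and F1 from confusion-matrix counts; the true and predicted labels are illustrative only.

# Sketch: classification metrics derived from confusion-matrix counts (illustrative labels)
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 0])   # illustrative actual churn labels
y_pred = np.array([0, 1, 1, 1, 0, 0])   # illustrative predicted labels

tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)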

7. Visualization of Results:
- Visualize the model's performance using confusion matrix, ROC curve, or other suitable visualizations.
- Plot the decision boundary to demonstrate the classification boundaries. (A hedged sketch of such a plot follows; the program below draws the ROC curve only.)
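The sketch below draws a decision boundary for a two-feature model fitted on the Age and Income columns of the lab dataset; refitting a separate, scaled model just for this plot is an assumption made for simplicity and is not part of the program that follows.

# Sketch: decision boundary of a two-feature logistic regression (Age vs. Income)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two-feature view of the lab data (Age, Income) with the churn labels
X2 = np.array([[32, 50000], [28, 35000], [45, 80000], [38, 60000],
               [20, 20000], [55, 75000], [42, 90000], [29, 40000]])
y2 = np.array([0, 0, 1, 0, 1, 0, 0, 1])

# Scaling keeps the solver well behaved when features are on very different scales
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X2, y2)

# Predict the class over a grid covering the feature ranges, then shade the two regions
ages = np.linspace(X2[:, 0].min() - 2, X2[:, 0].max() + 2, 200)
incomes = np.linspace(X2[:, 1].min() - 5000, X2[:, 1].max() + 5000, 200)
aa, ii = np.meshgrid(ages, incomes)
zz = clf.predict(np.c_[aa.ravel(), ii.ravel()]).reshape(aa.shape)

plt.contourf(aa, ii, zz, alpha=0.3, cmap='coolwarm')
plt.scatter(X2[:, 0], X2[:, 1], c=y2, cmap='coolwarm', edgecolors='k')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Logistic Regression Decision Boundary (illustrative)')
plt.show()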

Interpretation/Program/code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, roc_curve, roc_auc_score, auc)

# Step 3: Load the dataset


data = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Age': [32, 28, 45, 38, 20, 55, 42, 29],
    'Income': [50000, 35000, 80000, 60000, 20000, 75000, 90000, 40000],
    'Churn': [0, 0, 1, 0, 1, 0, 0, 1]
})

# Step 4: Data Preprocessing


# Split the dataset into features (X) and target variable (y)
X = data[['Age', 'Income']] # Selecting 'Age' and 'Income' as features
y = data['Churn']

# Perform one-hot encoding for the 'Gender' column


data = pd.get_dummies(data, columns=['Gender'], drop_first=True)
X = data[['Age', 'Income', 'Gender_Male']]

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: Building the Logistic Regression Model


# Create the Logistic Regression model
model = LogisticRegression()

# Fit the model to the training data


model.fit(X_train, y_train)

# Step 6: Model Evaluation and Prediction


# Evaluate the model's performance
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Confusion Matrix:\n", conf_matrix)

# Step 7: Visualization of Results


# Visualize model performance with ROC curve
y_probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

Output:
Accuracy: 0.0
Precision: 0.0
Recall: 0.0
F1-Score: 0.0
Confusion Matrix:
[[0 3]
[0 0]]

8. Conclusion:

Logistic regression is a powerful classification algorithm that models the relationship between input features and binary outcomes or probabilities. By utilizing the logistic function, it provides interpretable predictions and is applicable in various domains. In this run the hold-out test set (three of the eight rows) happened to contain only non-churn customers and the model misclassified all of them, which is why every reported metric is 0.0; with such a tiny dataset the split and the metrics are not meaningful, and a larger dataset would be needed for a reliable evaluation. Logistic regression remains a valuable tool for binary classification tasks, offering simplicity, interpretability, and effectiveness in predicting outcomes based on input features.

Quiz:

1. Which scikit-learn function is used to create a logistic regression model object in Python?

a) sklearn.linear_model.LogisticRegression
b) sklearn.preprocessing.StandardScaler
c) sklearn.model_selection.train_test_split
d) sklearn.metrics.accuracy_score

Ans : a) sklearn.linear_model.LogisticRegression

2. In logistic regression, what does the sigmoid function do?

a) Maps the predicted values to binary classes
b) Calculates the log-odds of the target variable
c) Determines the optimal threshold for classification
d) Measures the goodness of fit of the logistic regression model

Ans : b) Calculates the log-odds of the target variable

Suggested References:-
1. "Pattern Recognition and Machine Learning" by Christopher M. Bishop
2. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
3. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
4. Data Science For Dummies by Lillian Pierson, Jake Porway

Rubrics wise marks obtained

Understanding of Problem | Analysis of the Problem | Capability of writing program | Documentation | Total
04 | 04 | 10 | 02 | 20

Experiment No: 8
Date:

AIM: Implementation of Decision Tree for Student Classification

Relevant CO :- CO4

Objective:

The objective of this lab practical is to implement a decision tree algorithm to classify
students as either average or clever based on given student data. Decision trees are widely
used in machine learning and data mining for classification and regression tasks. In this
lab, we will explore how to build a decision tree model and use it to classify students based
on their attributes.

Materials Used:
- Python 3.x
- Jupyter Notebook
- Scikit-learn library
- Pandas library
- NumPy library
- Matplotlib library

Dataset:
For this lab, we will use a dataset that contains information about students and their
performance. The dataset has the following columns:

- `Age`: Age of the student


- `StudyHours`: Number of hours the student studies per day
- `PreviousGrade`: Grade achieved in the previous exam
- `Result`: Classification label indicating whether the student is average (0) or clever (1)

Procedure:

1. Introduction to Decision Trees:


- Decision trees are widely used in machine learning for classification tasks. They make decisions based on splitting criteria (such as Gini impurity or entropy) and feature importance. In this lab, we will implement a decision tree algorithm to classify students as average or clever based on their attributes, such as age, study hours, and previous grades. (A small sketch of the Gini impurity computation follows this step.)
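The sketch below computes the Gini impurity of a parent node and of one candidate split; the class labels are illustrative only.

# Sketch: Gini impurity of a node and of a candidate split (illustrative labels)
import numpy as np

def gini(labels):
    # Gini impurity = 1 - sum of squared class proportions (0 means a pure node)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])     # illustrative Result labels at a node
left, right = parent[:4], parent[4:]            # one candidate split of that node
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print("Parent impurity:", gini(parent))         # about 0.47
print("Split impurity:", weighted)              # about 0.19; the tree prefers splits that lower impurity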

2. Importing Required Libraries and Loading the Dataset:


- Import the necessary libraries, including Scikit-learn, Pandas, NumPy, and Matplotlib.
- Load the dataset into a Pandas DataFrame using the appropriate function or by reading
from a file.

3. Exploratory Data Analysis:


- Perform exploratory data analysis to understand the dataset.
- Analyze the distribution of variables, detect any missing values, and handle them if
necessary.


- Visualize the relationships between variables using plots and charts.

4. Data Preprocessing:
- Split the dataset into input features (independent variables) and the target variable
(dependent variable).
- Convert categorical variables into numerical representations using one-hot encoding or
label encoding.
- Split the dataset into training and testing sets for model evaluation.

5. Building the Decision Tree Model:


- Import the DecisionTreeClassifier class from Scikit-learn.
- Instantiate the DecisionTreeClassifier model with the desired parameters.
- Fit the model to the training data using the `.fit()` method.

6. Model Evaluation and Prediction:


- Evaluate the model's performance using appropriate evaluation metrics such as
accuracy, precision, recall, and F1-score.
- Make predictions on the testing data using the `.predict()` method.

7. Visualization of the Decision Tree:


- Visualize the decision tree using tree plotting techniques available in Scikit-learn or
other visualization libraries.
- Interpret the decision tree structure and analyze the important features.

Interpretation/Program/code:
# Step 1: Introduction to Decision Trees
# Decision Tree Classification for student performance

# Step 2: Importing Required Libraries and Loading the Dataset


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
# Evaluation metrics used in Step 6
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Step 3: Load the dataset


data = pd.DataFrame({
'Age': [18, 20, 22, 21, 19, 23, 24, 25, 27, 28],
'StudyHours': [2, 3, 4, 6, 5, 2, 3, 7, 6, 4],
'PreviousGrade': ['C', 'C', 'B', 'B', 'C', 'D', 'B', 'A', 'B', 'A'],
'Result': [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
})

# Step 4: Data Preprocessing


# Split the dataset into features (X) and target variable (y)
X = data[['Age', 'StudyHours', 'PreviousGrade']]
y = data['Result']

# Perform one-hot encoding for the 'PreviousGrade' column


X = pd.get_dummies(X, columns=['PreviousGrade'], drop_first=True)

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: Building the Decision Tree Model


# Create the Decision Tree Classifier model
model = DecisionTreeClassifier()

# Fit the model to the training data


model.fit(X_train, y_train)

# Step 6: Model Evaluation and Prediction


# Evaluate the model's performance
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print(classification_report(y_test, y_pred))

# Step 7: Visualization of the Decision Tree


# Visualize the decision tree
plt.figure(figsize=(12, 6))
plot_tree(model, feature_names=X.columns.tolist(), class_names=['Average', 'Clever'],
filled=True, rounded=True, fontsize=10)
plt.show()

Output:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0
precision recall f1-score support

0 1.00 1.00 1.00 1


1 1.00 1.00 1.00 2

accuracy 1.00 3
macro avg 1.00 1.00 1.00 3
weighted avg 1.00 1.00 1.00 3

Conclusion:
The decision tree classified the held-out students correctly, which on a dataset of only ten rows mainly demonstrates the workflow rather than real-world accuracy. Decision trees provide interpretable results and can be used in various domains for classification tasks, and the fitted model offers insights into the important features contributing to the classification (a short sketch of reading those importances follows). This lab demonstrates the practical application of decision trees for student classification.
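A short sketch of reading those importance scores, assuming the `model` and `X` variables defined in the program above:

# Sketch: per-feature importance scores of the fitted tree (assumes `model` and `X` from above)
for name, importance in zip(X.columns, model.feature_importances_):
    # Each score is the share of impurity reduction attributed to splits on that feature
    print(f"{name}: {importance:.3f}")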


Quiz:

1. In decision tree classification, what is the main objective of the splitting criterion?

a) To maximize the number of features used for classification


b) To minimize the number of nodes in the decision tree
c) To maximize the accuracy of classification
d) To minimize the entropy or Gini impurity of the resulting subsets

Ans : d) To minimize the entropy or Gini impurity of the resulting subsets

2. What is the purpose of pruning in decision tree algorithms?

a) To remove irrelevant features from the dataset


b) To prevent overfitting and improve generalization of the decision tree
c) To improve the interpretability of the decision tree
d) To reduce the time complexity of the decision tree algorithm

Ans : b) To prevent overfitting and improve generalization of the decision tree

Suggested References:-
1. "Pattern Recognition and Machine Learning" by Christopher M. Bishop
2. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
3. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
4. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained

Understanding of Problem | Analysis of the Problem | Capability of writing program | Documentation | Total
02 | 02 | 05 | 01 | 10