A Laboratory Manual for
Data Science
(3151608)
B.E. Semester 5
(Information Technology Department)
L.D. College of Engineering, Ahmedabad
Directorate of Technical Education, Gandhinagar, Gujarat
Certificate
Preface
The main purpose of any laboratory, practical, or field work is to enhance the required skills and to build students' ability to solve real-world problems by developing relevant competencies in the psychomotor domain. With this in view, GTU has designed a competency-focused, outcome-based curriculum for its engineering degree programmes in which sufficient weightage is given to practical work. This reflects the importance of skill enhancement and encourages students, instructors, and faculty members to utilize every second of the time allotted for practicals to achieve the relevant outcomes by actually performing the experiments, rather than treating them as merely study-type exercises. For effective implementation of a competency-focused, outcome-based curriculum, every practical must be carefully designed to serve as a tool for developing and enhancing the competencies that industry requires of every student. These psychomotor skills are very difficult to develop through the traditional chalk-and-board content delivery method in the classroom. Accordingly, this lab manual is designed to focus on industry-defined relevant outcomes, rather than the old practice of conducting practicals merely to prove concepts and theory.
Data Science is a rapidly growing field that combines statistical and computational techniques to
extract knowledge and insights from data. The goal of this lab manual is to provide students with
hands-on experience in using data science tools and techniques to analyze and interpret real-world
data.
This manual is designed to accompany a course in Data Science and assumes a basic knowledge
of programming concepts and statistical analysis. The labs are structured to guide students
through the process of collecting, cleaning, analyzing, and visualizing data, using popular
programming languages and software tools such as Python, R, SQL, and Tableau.
Each lab in this manual consists of a set of instructions that guide students through a specific data
analysis project. The labs are organized in a progressive sequence, with each lab building on the
skills and concepts covered in the previous lab. The exercises within each lab are designed to be
completed in a single class session, with additional time required for preparation and follow-up
analysis.
Throughout the manual, we emphasize the importance of critical thinking and data ethics,
providing guidance on how to analyze data responsibly and communicate findings effectively. By
the end of this manual, students will have gained a solid foundation in data science and be well-
equipped to apply these skills to real-world problems.
| Sr. No. | Objective(s) of Experiment | CO-1 | CO-2 | CO-3 | CO-4 |
| 1. | Exploration and Visualization Using Mathematical and Statistical Tools | √ | √ | | |
| 2. | Study of Measures of Central Tendency, Correlation, Percentile, Decile, Quartile, Measure of Variation, and Measure of Shape (Skewness and Kurtosis) with Excel Functions | √ | √ | | |
| 3. | Study of Basics of Python data types, NumPy, Matplotlib, Pandas | √ | √ | | |
| 4. | Implementation of Various Probability Distributions with NumPy Random Library Functions | | √ | | |
| 5. | Implementation of Estimation of Parameters for the Best-Fit Probability Distribution using the Fitter Class in Python | | √ | | |
| 6. | Implementation of Linear Regression with Scikit-learn library in Python | | | √ | |
| 7. | Implementation of Logistic Regression with Scikit-learn library in Python | | | √ | |
| 8. | Implementation of Decision Tree for Student Classification | | | | √ |
Guidelines for Faculty members
1. Course Coordinator / Faculty should provide guidelines and demonstrate the practical to the students, covering all its features.
2. Course Coordinator / Faculty shall explain the basic concepts/theory related to the experiment to the students before starting each practical.
3. Involve all students in the performance of each experiment.
4. Course Coordinator / Faculty is expected to share the skills and competencies to be developed in the students and to ensure that these skills and competencies are developed after the completion of the experimentation.
5. Course Coordinator / Faculty should give students the opportunity for hands-on experience after the demonstration.
6. Course Coordinator / Faculty may impart additional knowledge and skills that are not covered in the manual but are expected of the students by the concerned industry.
7. Give practical assignments and assess the students' performance based on the assigned tasks, checking whether the work is carried out as per the instructions.
8. Course Coordinator / Faculty is expected to refer to the complete curriculum of the course and follow the guidelines for implementation.
Guidelines for Students
1. Students are expected to listen carefully to all the theory classes delivered by the faculty members and understand the COs, course content, teaching and examination scheme, skill set to be developed, etc.
2. Students will have to perform the experiments as per the practical list given.
3. Students have to show the output of each program in their practical file.
4. Students are instructed to submit the practical list as per the given sample list shown on the next page.
5. Students should develop a habit of submitting the experimentation work as per the schedule, and they should be well prepared for the same.
Index
(Progressive Assessment Sheet)

| Sr. No. | Objective(s) of Experiment | Page No. | Date of performance | Date of submission | Assessment Marks | Sign. of Faculty with date |
| 1. | Exploration and Visualization Using Mathematical and Statistical Tools (10 Marks) | | | | | |
| 2. | Study of Measures of Central Tendency, Correlation, Percentile, Decile, Quartile, Measure of Variation, and Measure of Shape (Skewness and Kurtosis) with Excel Functions (10 Marks) | | | | | |
| 3. | Study of Basics of Python data types, NumPy, Matplotlib, Pandas (10 Marks) | | | | | |
| 4. | Implementation of Various Probability Distributions with NumPy Random Library Functions (10 Marks) | | | | | |
| 5. | Implementation of Estimation of Parameters for the Best-Fit Probability Distribution using the Fitter Class in Python (10 Marks) | | | | | |
| 6. | Implementation of Linear Regression with Scikit-learn library in Python (20 Marks) | | | | | |
| 7. | Implementation of Logistic Regression with Scikit-learn library in Python (20 Marks) | | | | | |
| 8. | Implementation of Decision Tree for Student Classification (10 Marks) | | | | | |
| | Total | | | | 100 | |
Experiment No: 1
Date:
AIM: Data Exploration and Visualization Using Mathematical and Statistical Tools
Introduction:
Data exploration and visualization are important steps in the data analysis process. In this lab,
students will learn how to explore and visualize data using mathematical and statistical tools such
as histograms, box plots, scatter plots, and correlation matrices. Students will also learn how to use
Excel/R to perform these analyses.
Objectives:
Materials:
Procedure:
This dataset includes information on age, gender, income, education level, marital status, employment status, and industry for a sample of 25 individuals. The data can be used to explore and visualize various relationships and patterns, such as the relationship between age and income, or the distribution of income by education level. A few more relationships and patterns that can be explored and visualized using the provided sample dataset:
1. Relationship between age and income: Create a scatter plot to see if there is a relationship
between age and income. Also calculate the correlation coefficient to determine the strength and
direction of the relationship.
Looking at the graph, all the points lie close together and follow an increasing trend, so we can conclude that Age and Income have a strong positive correlation.
Calculating the correlation coefficient gives 0.978833742, which is very close to +1, indicating a strong positive correlation.
2. Distribution of income by gender: Create a box plot to compare the distribution of income
between males and females. This could reveal any differences in the median, quartiles, and
outliers for each gender.
To plot the data, we first sort it by gender and then by income. With the two gender categories, we plot the income values for each category as a box plot. A box plot gives the five-number summary: minimum, first quartile, median, third quartile, and maximum.
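The same comparison can be reproduced in Python; a minimal sketch, assuming two hypothetical income lists split by gender:

import matplotlib.pyplot as plt

# Hypothetical income values per gender (substitute the dataset's actual values)
male_income = [45000, 35000, 70000, 90000, 60000, 48000, 92000, 42000]
female_income = [65000, 55000, 40000, 75000, 52000, 85000, 58000, 96000]

plt.boxplot([male_income, female_income], labels=['Male', 'Female'])
plt.ylabel('Income')
plt.title('Distribution of Income by Gender')
plt.show()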
3. Distribution of income by education level: Create a box plot to compare the distribution of
income for each level of education. This could reveal any differences in the median, quartiles,
and outliers for each education level.
4. Relationship between education level and marital status: Create a contingency table and
calculate the chi-square test statistic to see if there is a relationship between education level
and marital status. This could reveal whether certain education levels are more or less likely to
be associated with certain marital statuses.
H0: There is no significant relationship between education level and marital status.
Level of significance: 5% (0.05)
Contingency table:

| Marital Status | Associate's | Bachelor's | Doctorate | High School | Master's | Grand Total |
| Divorced | | 1 | | 1 | 1 | 3 |
| Married | | 5 | 3 | 1 | 4 | 13 |
| Single | 2 | 4 | | 2 | 1 | 9 |
| Grand Total | 2 | 10 | 3 | 4 | 6 | 25 |
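The chi-square statistic for this table can also be computed in Python; a minimal sketch using SciPy (an alternative to the Excel workflow), with the observed counts taken from the table above:

from scipy.stats import chi2_contingency

# Observed counts (rows: Divorced, Married, Single;
# columns: Associate's, Bachelor's, Doctorate, High School, Master's)
observed = [[0, 1, 0, 1, 1],
            [0, 5, 3, 1, 4],
            [2, 4, 0, 2, 1]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, p-value = {p:.3f}, dof = {dof}")
# Reject H0 at the 5% level only if the p-value is below 0.05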
5. Relationship between age and education level: Create a histogram to see the distribution of
ages for each education level. This could reveal any differences or similarities in the age
distribution across education levels.
6. Distribution of employment status by ethnicity: Create a stacked bar chart to compare the
distribution of each ethnicity group across different employment statuses. This could reveal any
differences or similarities in the employment status of different ethnicity groups.
7. Distribution of employment status by gender: Students could create a contingency table and
calculate the chi-square test statistic to see if there is a relationship between gender and
employment status. This could reveal whether certain genders are more or less likely to be
employed.
Observations / Program:
Relationship between age and income: create a scatter plot to see if there is a relationship between age and income, and calculate the correlation coefficient to determine the strength and direction of the relationship.

import numpy as np
import matplotlib.pyplot as plt

# Ages and incomes of the 25 individuals in the sample dataset
x = [32, 45, 28, 52, 36, 40, 29, 55, 33, 47, 41, 38, 31, 49, 27, 54, 39, 30, 56, 35, 48, 42, 37, 34, 51]
y = [45000, 65000, 35000, 80000, 55000, 70000, 40000, 90000, 47000, 75000,
     60000, 52000, 48000, 85000, 30000, 92000, 58000, 42000, 96000, 55000,
     73000, 65000, 50000, 49000, 82000]

# Correlation coefficient between age and income
print("Correlation coefficient:", np.corrcoef(x, y)[0, 1])

plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Relation between Age and Income")
plt.scatter(x, y)
plt.show()

Output:
In this lab, students learned how to explore and visualize data using mathematical and statistical
tools such as histograms, box plots, scatter plots, and correlation matrices. These tools are useful in
identifying patterns and relationships in data, and in making informed decisions based on data
analysis. The skills students have learned in this lab will be helpful in their future studies and careers in data analysis.
Quiz:
1. What are the measures of central tendency? Provide examples and explain when each
measure is appropriate to use.
Answer : Mean, Median and Mode are the measures of central tendency. These statistics indicate
where most values in a distribution fall and are also referred to as the central location of a
distribution.
MEAN : The mean is usually the best measure of central tendency to use when your data
distribution is continuous and symmetrical, such as when your data is normally distributed.
Example : if we would like to know how many hours on average an employee spends at training
in a year, we can find the mean training hours of a group of employees.
MEDIAN : The median is generally a better measure of the center when there are extreme values
or outliers because it is not affected by the precise numerical values of the outliers. The median is
usually preferred to other measures of central tendency when your data set is skewed (i.e., forms a
skewed distribution) or you are dealing with ordinal data.
Example : consider the wages of staff at a factory:

| Staff | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Salary | 15k | 18k | 16k | 14k | 15k | 15k | 12k | 17k | 90k | 95k |

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value may not accurately reflect the typical salary of a worker, as most workers have salaries in the $12k to $18k range; the mean is being pulled up by the two large salaries. In such skewed situations we therefore usually prefer the median over the mean (or mode): the median retains the central position and is not as strongly influenced by the skewed values.
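The effect of the two outliers can be verified quickly in Python; a short sketch using the salary values above (in $k):

import numpy as np

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]
print("Mean:", np.mean(salaries))      # 30.7, pulled up by the two outliers
print("Median:", np.median(salaries))  # 15.5, closer to the typical salary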
2. How can you calculate the correlation coefficient between two variables using
mathematical and statistical tools? Interpret the correlation coefficient value.
Answer : Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. It is usually represented by ρ (rho):

ρ(X, Y) = cov(X, Y) / (σX · σY)

If x and y are the two variables of discussion, the sample correlation coefficient can be calculated using the formula

r = (n·∑xy − ∑x·∑y) / √[(n·∑x² − (∑x)²) · (n·∑y² − (∑y)²)]

Here,
n = number of values or elements
∑x = sum of the 1st list of values
∑y = sum of the 2nd list of values
∑xy = sum of the products of paired 1st and 2nd values
∑x² = sum of squares of the 1st values
∑y² = sum of squares of the 2nd values

The coefficient lies between −1 and +1: values near +1 indicate a strong positive linear relationship, values near −1 a strong negative one, and values near 0 little or no linear relationship.
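The formula can be cross-checked against NumPy's built-in function; a small sketch with illustrative data:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Correlation coefficient computed from the formula above
n = len(x)
r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x ** 2) - np.sum(x) ** 2) * (n * np.sum(y ** 2) - np.sum(y) ** 2))
print("Formula:", r)

# Cross-check with NumPy's built-in correlation matrix
print("np.corrcoef:", np.corrcoef(x, y)[0, 1])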
3. Explain the concept of skewness and kurtosis in statistics. How can you measure and
interpret these measures using mathematical and statistical tools?
Answer :
Skewness: skewness is a statistic that tells us whether a distribution is symmetric or not. The sample skewness (as computed by Excel's SKEW function) is

Skewness = [n / ((n − 1)(n − 2))] · ∑((xᵢ − x̄) / S)³

Kurtosis: kurtosis is a statistic that tells us whether a distribution is taller (more peaked) or flatter than a normal distribution. The sample excess kurtosis (Excel's KURT function) is

Kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] · ∑((xᵢ − x̄) / S)⁴ − 3(n − 1)² / ((n − 2)(n − 3))

Here,
S = standard deviation, x̄ = mean, n = number of values or elements

Positive skewness indicates a longer right tail and negative skewness a longer left tail; positive excess kurtosis indicates a more peaked, heavier-tailed distribution, and negative values a flatter one.
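These measures can also be computed in Python; a short sketch using SciPy (note that scipy.stats.kurtosis returns excess kurtosis by default, so a normal distribution gives 0):

import numpy as np
from scipy.stats import skew, kurtosis

data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 25])
print("Skewness:", skew(data))      # positive: long right tail
print("Kurtosis:", kurtosis(data))  # excess kurtosis relative to normal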
Suggested References:
1. "Python for Data Analysis" by Wes McKinney
2. "Data Visualization with Python and Matplotlib" by Benjamin Root
Rubrics wise marks obtained: 02 02 05 01 10
Experiment No: 2
Date:
AIM: Study of Measures of Central Tendency, Correlation, Percentile, Decile, Quartile, Measure of Variation, and Measure of Shape (Skewness and Kurtosis) with Excel Functions

Relevant CO: CO1, CO2
Objective:
The objective of this lab practical is to provide students with hands-on experience in using Excel
functions to explore and analyze a sample data sheet. Students will learn to calculate measures of
central tendency, correlation, percentile, decile, quartile, measure of variation, and measure of shape
using Excel functions. Additionally, students will learn to create visualizations to better understand
the data.
Materials:
- Computer with Microsoft Excel installed
- Sample data sheet (provided below, or a dataset may be provided by the subject teacher)

Sample Data Sheet:

| StudentID | Test1 Score | Test2 Score | Age | Gender |
| 1 | 85 | 92 | 19 | Male |
| 2 | 92 | 87 | 20 | Female |
| 3 | 78 | 80 | 18 | Male |
| 4 | 85 | 89 | 19 | Male |
| 5 | 90 | 95 | 21 | Female |
| 6 | 75 | 82 | 18 | Male |
| 7 | 83 | 87 | 20 | Female |
| 8 | 92 | 90 | 19 | Male |
| 9 | 80 | 85 | 18 | Female |
| 10 | 87 | 88 | 20 | Female |
Procedure:
Part 2: Correlation
1. Calculate the correlation between test 1 score and test 2 score using Excel functions.
2. Create a scatter plot to visualize the relationship between test 1 score and test 2 score.
Calculating the correlation coefficient (e.g., with Excel's =CORREL(B2:B11, C2:C11)) gives 0.744684233, a positive value indicating a fairly strong positive correlation between the Test1 and Test2 scores. A scatter chart gives a clear visualization of the relationship.
3. Calculate the first and third quartiles using Excel functions for both test 1 score and test 2
score columns.
4. Create a box plot for both test 1 score and test 2 score columns.
| | Inter-Quartile Range | Min | Max | Range | Variance | Standard Deviation |
| Test1 Score | 8.5 | 75 | 92 | 17 | 27.46315789 | 5.240530307 |
| Test2 Score | 4.25 | 80 | 95 | 15 | | |

Table 2.3: Calculation of Inter-Quartile Range, Range, Variance, and Standard Deviation
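The same measures can be cross-checked in Python; a minimal sketch using the Test1 scores from the sample data sheet:

import numpy as np

test1 = np.array([85, 92, 78, 85, 90, 75, 83, 92, 80, 87])

q1, q3 = np.percentile(test1, [25, 75])
print("Inter-quartile range:", q3 - q1)
print("Range:", test1.max() - test1.min())
print("Sample variance:", test1.var(ddof=1))
print("Sample std. deviation:", test1.std(ddof=1))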
Quiz:
1) What Excel function can be used to calculate the mean of a dataset?
a) AVERAGE
b) MEDIAN
c) MODE
d) STANDARDIZE
ANSWER : a) AVERAGE
2) What does the correlation coefficient measure in terms of the relationship between two
variables?
a) Strength of the linear relationship
b) Variability of the data
c) Difference between mean and median
d) Skewness of the distribution
ANSWER : a) Strength of the linear relationship
Suggested References:
1. "Microsoft Excel Data Analysis and Business Modeling" by Wayne L. Winston
2. "Excel 2021: Data Analysis and Business Modeling" by Wayne L. Winston
Rubrics wise marks obtained: 02 02 05 01 10
Experiment No: 3
Date:
AIM: Study of Basics of Python data types, NumPy, Matplotlib, Pandas
Objective:
The objective of this lab practical is to gain hands-on experience with NumPy, Matplotlib, and
Pandas libraries to manipulate and visualize data. Through this practical, students will learn how to
use different functions of these libraries to perform various data analysis tasks.
Materials Used:
- Python programming environment
- NumPy library
- Matplotlib library
- Pandas library
- Dataset file (provided by faculty)
// Example dataset file, e.g. sales_data.csv, with the following columns:
o Date: Date of sale
o Product: Name of the product sold
o Units Sold: Number of units sold
o Revenue: Total revenue generated from the sale
o Region: Geographic region where the sale took place
o Salesperson: Name of the salesperson who made the sale
Procedures:
Part 1: NumPy
1. Import the NumPy library into Python.
2. Create a NumPy array with the following specifications:
a. Dimensions: 5x5
b. Data type: integer
c. Values: random integers between 1 and 100
3. Reshape the array into a 1x25 array and calculate the mean, median, variance, and standard
deviation using NumPy functions.
4. Generate a random integer array of length 10 and find the percentile, decile, and quartile values
using NumPy functions.
Part 2: Matplotlib
1. Import the Matplotlib library into Python.
2. Create a simple bar chart using the following data:
a. X-axis values: ['A', 'B', 'C', 'D']
b. Y-axis values: [10, 20, 30, 40]
3. Customize the plot by adding a title, axis labels, and changing the color and style of the bars.
4. Create a pie chart using the following data:
a. Labels: ['Red', 'Blue', 'Green', 'Yellow']
b. Values: [20, 30, 10, 40]
5. Customize the pie chart by adding a title, changing the colors of the slices, and adding a legend.
Part 3: Pandas
1. Import the Pandas library into Python.
2. Load the "sales_data.csv" file into a Pandas data frame.
3. Calculate the following statistics for the Units Sold and Revenue columns:
a. Mean
b. Median
c. Variance
d. Standard deviation
4. Group the data frame by Product and calculate the mean, median, variance, and standard deviation
of Units Sold and Revenue for each product using Pandas functions.
5. Create a line chart to visualize the trend of Units Sold and Revenue over time for each product.
Interpretation/Program/code:
import numpy as np

# 2. Create a 5x5 array of random integers between 1 and 100
random_array = np.random.randint(1, 101, (5, 5))
print("Random 5x5 array:\n", random_array)

# 3. Reshape into a 1x25 array and compute summary statistics
reshaped_array = random_array.reshape(1, 25)
print("Mean:", np.mean(reshaped_array))
print("Median:", np.median(reshaped_array))
print("Variance:", np.var(reshaped_array))
print("Standard Deviation:", np.std(reshaped_array))

# 4. Random integer array of length 10: percentile, decile, and quartile values
random_integers = np.random.randint(1, 101, 10)
print("90th percentile:", np.percentile(random_integers, 90))
print("Deciles:", np.percentile(random_integers, np.arange(10, 100, 10)))
print("Quartiles:", np.percentile(random_integers, [25, 50, 75]))
Output:
Random 5x5 array:
[[61 73 42 26 7]
[14 92 70 72 71]
[31 98 87 81 83]
[85 87 24 7 88]
[40 4 29 97 30]]
Mean: 55.96
Median: 70.0
Variance: 981.9584000000001
Standard Deviation: 31.336215470282944
Interpretation/Program/code:
import matplotlib.pyplot as plt
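# The chart-construction code is not shown in the original; the following is a
# minimal sketch of the Part 2 charts (colors, styles, and titles are
# illustrative choices).

# 2-3. Bar chart with a title, axis labels, and custom bar color/style
categories = ['A', 'B', 'C', 'D']
values = [10, 20, 30, 40]
plt.bar(categories, values, color='skyblue', edgecolor='black')
plt.title('Simple Bar Chart')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()

# 4-5. Pie chart with a title, custom slice colors, and a legend
labels = ['Red', 'Blue', 'Green', 'Yellow']
sizes = [20, 30, 10, 40]
colors = ['red', 'blue', 'green', 'yellow']
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%')
plt.title('Simple Pie Chart')
plt.legend(labels, loc='best')
plt.show()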
Output:
Interpretation/Program/code:
import pandas as pd
import matplotlib.pyplot as plt

# Load the sales data into a data frame
data = pd.read_csv("sales_data.csv")

# Step 3: statistics for the Revenue column (repeat likewise for 'Units Sold')
revenue_mean = data['Revenue'].mean()
revenue_median = data['Revenue'].median()
revenue_variance = data['Revenue'].var()
revenue_stddev = data['Revenue'].std()

# Step 4: per-product statistics
print(data.groupby('Product')[['Units Sold', 'Revenue']].agg(['mean', 'median', 'var', 'std']))

# Step 5: Create a line chart to visualize the trend over time for each product
for product, product_data in data.groupby('Product'):
    plt.figure(figsize=(10, 5))
    plt.plot(product_data['Date'], product_data['Units Sold'], label='Units Sold', marker='o')
    plt.plot(product_data['Date'], product_data['Revenue'], label='Revenue', marker='o')
    plt.title(product)
    plt.legend()
    plt.show()
Output:
Conclusion:
In conclusion, this lab practical provided hands-on experience with NumPy, Matplotlib, and Pandas
libraries in Python for data manipulation and visualization. These libraries have wide-ranging
applications in various fields, enabling researchers and analysts to gain insights from large datasets
quickly and efficiently. Through exercises such as calculating statistical measures and visualizing
data using charts, we explored the functionality and flexibility of these powerful data analysis tools.
Overall, gaining proficiency in these libraries equips individuals to tackle complex data analysis
challenges and contribute to their respective fields of study or industries.
Quiz:
1. What is the difference between a list and a tuple in Python?
Answer:
1. Mutability:
- List: Lists are mutable, which means you can change their contents (add, remove, or modify
elements) after they are created. You can use methods like `append()`, `remove()`, and `pop()`
to modify a list.
- Tuple: Tuples are immutable, which means once you create a tuple, you cannot change its
elements. You can't add or remove elements from a tuple, nor can you modify the existing
elements.
2. Syntax:
- List: Lists are created using square brackets `[...]`. For example: `my_list = [1, 2, 3]`
- Tuple: Tuples are created using parentheses `(...)`. For example: `my_tuple = (1, 2, 3)` or
even without parentheses: `my_tuple = 1, 2, 3`
3. Performance:
- Lists can be slightly slower than tuples in terms of iteration and access time because of their
mutability. When you modify a list, it may require resizing the underlying data structure, which
can introduce some overhead.
- Tuples, being immutable, are generally faster for iteration and access, and they consume
slightly less memory than lists.
4. Use Cases:
- Lists are typically used when you have a collection of items that may change or need to be
modified over time. For example, a list of tasks to-do that you can add or remove items from.
- Tuples are often used when you have a collection of items that should not change. For
example, a tuple might be used to represent a set of coordinates (x, y) or a record in a database.
2. How can you generate random numbers using NumPy?
Answer:
1. **Random Floats**:
```python
import numpy as np
random_array = np.random.rand(5, 5)  # Creates a 5x5 array of random values
```
2. **Random Integers**:
- To generate random integers within a specified range:
```python
import numpy as np
random_integers = np.random.randint(1, 101, size=10)  # 10 random integers between 1 and 100
```
Suggested References:
1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway
Experiment No: 4
Date:
AIM: Implementation of Various Probability Distributions with NumPy Random Library Functions
Materials Used:
- Python environment (Anaconda, Jupyter Notebook, etc.)
- NumPy library
Procedure:
1. Introduction to Probability Distributions:
o Probability theory is the branch of mathematics that deals with the study of random events
or phenomena. In probability theory, a probability distribution is a function that describes
the likelihood of different outcomes in a random process. Probability distributions can be
categorized into two types: discrete and continuous.
o Discrete probability distributions are used when the possible outcomes of a random
process are countable and can be listed. The most commonly used discrete probability
distributions are Bernoulli, Binomial, and Poisson distributions.
o Continuous probability distributions are used when the possible outcomes of a random
process are not countable and can take any value within a certain range. The most
commonly used continuous probability distributions are Normal and Exponential
distributions.
o Each probability distribution has its own set of properties, such as mean, variance,
skewness, and kurtosis. Mean represents the average value of the random variable,
variance represents how much the values vary around the mean, skewness represents the
degree of asymmetry of the distribution, and kurtosis represents the degree of peakedness
or flatness of the distribution.
o Probability distributions are widely used in fields such as finance, engineering, physics,
and social sciences to model real-world phenomena and make predictions about future
events. Understanding different probability distributions and their properties is an
important tool for analyzing data and making informed decisions.
# Python
import numpy as np
import matplotlib.pyplot as plt

# Generate 1000 random numbers following a normal distribution
# with mean 0 and standard deviation 1
normal_dist = np.random.normal(0, 1, 1000)
In this example, we generate 1000 random numbers following a normal distribution with mean 0
and standard deviation 1 using the `np.random.normal()` function. We then calculate the mean and
standard deviation of the distribution using the `np.mean()` and `np.std()` functions.
We also generate 1000 random numbers following a Poisson distribution with lambda 5 using the
`np.random.poisson()` function. We calculate the mean and variance of the Poisson distribution
using the `np.mean()` and `np.var()` functions.
We then plot the probability density function (PDF) and cumulative distribution function (CDF) of
both distributions using the `plt.hist()` and `plt.plot()` functions from the Matplotlib library.
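A minimal sketch completing the code block as described above (the empirical-CDF construction is one reasonable choice):

# Generate 1000 random numbers following a Poisson distribution with lambda = 5
poisson_dist = np.random.poisson(5, 1000)
print("Normal mean:", np.mean(normal_dist), "std:", np.std(normal_dist))
print("Poisson mean:", np.mean(poisson_dist), "variance:", np.var(poisson_dist))

# PDF: density histograms of both samples
plt.hist(normal_dist, bins=30, density=True, alpha=0.5, label='Normal')
plt.hist(poisson_dist, bins=30, density=True, alpha=0.5, label='Poisson')
plt.title('Probability Density Functions')
plt.legend()
plt.show()

# CDF: plot each sorted sample against the cumulative fraction of points
for sample, name in [(normal_dist, 'Normal'), (poisson_dist, 'Poisson')]:
    xs = np.sort(sample)
    plt.plot(xs, np.arange(1, len(xs) + 1) / len(xs), label=name)
plt.title('Cumulative Distribution Functions')
plt.legend()
plt.show()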
3. Exercise:
- Generate a dataset of your choice or given by faculty with a given probability distribution using
NumPy random library functions
- Plot the probability density function and cumulative distribution function for the generated data
- Calculate the descriptive statistics of the generated data
Interpretation/Program/code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Normal sample; mean 5 and std 2 are chosen to match the reported output
mean, std_dev, sample_size = 5.0, 2.0, 1000
data = np.random.normal(mean, std_dev, sample_size)

# Plot the PDF and CDF of the generated data
x = np.linspace(data.min(), data.max(), 200)
plt.hist(data, bins=30, density=True, alpha=0.5, label='Sample')
plt.plot(x, stats.norm.pdf(x, mean, std_dev), label='PDF')
plt.plot(x, stats.norm.cdf(x, mean, std_dev), label='CDF')
plt.legend()
plt.show()

# Descriptive statistics of the generated data
mean_data, std_dev_data = np.mean(data), np.std(data)
median_data = np.median(data)
min_data, max_data = np.min(data), np.max(data)

print("Descriptive Statistics:")
print(f"Mean: {mean_data:.2f}")
print(f"Standard Deviation: {std_dev_data:.2f}")
print(f"Median: {median_data:.2f}")
print(f"Minimum: {min_data:.2f}")
print(f"Maximum: {max_data:.2f}")
Output:
Descriptive Statistics:
Mean: 5.04
Standard Deviation: 1.96
Median: 5.05
Minimum: -1.48
Maximum: 12.71
Conclusion:
This lab practical provided an opportunity to explore and implement various probability
distributions using NumPy random library functions. By understanding and applying different
probability distributions, one can model real-world phenomena and make predictions about future
events. With the knowledge gained in this lab practical, student will be equipped to work with
probability distributions and analyze data in a wide range of fields, including finance, engineering,
and social sciences.
Quiz:
1. Which NumPy function can be used to generate random numbers from a normal distribution?
a) numpy.random.uniform
b) numpy.random.poisson
c) numpy.random.normal
d) numpy.random.exponential
Ans : c) numpy.random.normal
2. What is the purpose of the probability density function (PDF) in probability distributions?
Ans : The PDF describes the relative likelihood of a continuous random variable taking a particular value; the area under the PDF over an interval gives the probability that the variable falls in that interval, and the total area under the curve equals 1.
Suggested References:
1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained
Experiment-5
Date:
AIM: Implementation of Estimation of Parameters for the Best-Fit Probability Distribution using
the Fitter Class in Python.
Objectives: The objective of this lab practical is to learn how to estimate the parameters for the
best-fit probability distribution for a given dataset using the Fitter class in Python.
Materials Used:
1. Python 3.x
2. Jupyter Notebook
3. NumPy library
4. Fitter library
Theory:
Dataset:
Consider the following dataset, which represents the heights of individuals in centimeters:
170, 165, 180, 172, 160, 175, 168, 155, 185, 190, 162, 178, 168, 172, 180, 160, 165, 172, 168, 175
Procedure:
Parameter estimation is important because it allows us to make inferences, predictions, and draw
meaningful conclusions from the data. By estimating the parameters, we can effectively model and
analyze various phenomena, summarizing complex datasets in a more simplified and interpretable
manner.
The concept of the best-fit probability distribution refers to finding the distribution that provides the
closest match to the observed data. The best-fit distribution is determined by estimating the
parameters in such a way that the observed data exhibits the highest likelihood or best matches the
underlying characteristics of the data. Selecting the best-fit distribution helps us understand the
data's behavior, make accurate predictions, and gain insights into its properties.
Commonly used probability distributions include the normal (Gaussian) distribution, uniform
distribution, exponential distribution, Poisson distribution, and binomial distribution. Each
distribution has its own characteristics and applications in various fields.
Understanding parameter estimation and probability distributions allows us to effectively model and
analyze data, make informed decisions, and gain insights into the underlying properties of the data.
By estimating the parameters for the best-fit probability distribution, we can unlock valuable
information and extract meaningful patterns from the observed data.
7. Conclusion:
- Summarize the importance of parameter estimation and the best-fit distribution in data analysis.
- Highlight the capabilities of the Fitter class in Python for automating the estimation of
parameters.
- Discuss potential applications and further exploration in different domains.
Interpretation/Program/code:
# Step 2: Import necessary libraries
import numpy as np
from fitter import Fitter
import matplotlib.pyplot as plt
from scipy.stats import norm

# Step 3: Load the height dataset given in the procedure
heights = np.array([170, 165, 180, 172, 160, 175, 168, 155, 185, 190,
                    162, 178, 168, 172, 180, 160, 165, 172, 168, 175])

# Step 4: Fit candidate distributions (an illustrative shortlist) and report the best fit
f = Fitter(heights, distributions=['norm', 'uniform', 'expon'])
f.fit()
print(f.get_best())

# Step 5: Estimate the normal parameters and plot the data histogram
mean, std = norm.fit(heights)
plt.hist(heights, bins=8, density=True, alpha=0.6, label='Data')
# Plot the probability density function (PDF) of the best-fit distribution (normal)
x = np.linspace(min(heights), max(heights), 100)
y = norm.pdf(x, mean, std)
plt.plot(x, y, 'r-', lw=2, label='Best Fit Distribution (Normal)')
plt.legend()
plt.title('Best Fit Distribution: Normal')
plt.xlabel('Height (cm)')
plt.ylabel('Probability')
plt.show()
# Step 7: Conclusion
print("Parameter estimation and best-fit distribution selection are essential for data analysis.")
print("The Fitter library simplifies the process of finding the best-fit distribution.")
Output:
Conclusion:
In this example, we have a dataset of heights of individuals. We use the Fitter class from the `fitter`
library to estimate the parameters for the best-fit probability distribution.
We instantiate the Fitter class with the dataset `data`. Then, we use the `.fit()` method to fit the data
to various distributions available in the Fitter class. The `.fit()` method automatically estimates the
parameters for each distribution and selects the best-fit distribution based on the goodness-of-fit
metrics.
Finally, we retrieve the best-fit distribution using the `.get_best()` method and print the summary of
the distribution using the `.summary()` method. We also plot the histogram of the dataset and
overlay the probability density function (PDF) of the best-fit distribution using the `.plot_pdf()`
method.
Note: Before running the code, make sure you have the `numpy`, `fitter`, and `matplotlib` libraries
installed. You can install the `fitter` library using pip: `pip install fitter`.
Through this practical, we learned the importance of parameter estimation in probability
distributions and the significance of selecting the best-fit distribution for accurate modeling and
analysis. The Fitter class provided a convenient and efficient way to fit the dataset to various
distributions and evaluate their goodness of fit using metrics such as AIC or BIC.
Quiz:
2. Which method of the Fitter class can be used to estimate the best-fit probability distribution for a
given dataset?
a) fit
b) predict
c) evaluate
d) transform
Ans : a) fit
Suggested References:
1. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
2. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
3. Data Science For Dummies by Lillian Pierson, Jake Porway
Experiment-6
Date:
AIM: Implementation of Linear Regression with Scikit-learn library in Python
Objective:
The objective of this lab practical is to implement linear regression to predict the value of
a variable in a given dataset. Linear regression is a statistical technique used to model the
relationship between a dependent variable and one or more independent variables. In this
lab, we will explore how to build a linear regression model and use it to make predictions.
Materials Used:
- Python 3.x
- Jupyter Notebook
- NumPy library
- Pandas library
- Matplotlib library
- Scikit-learn library
Dataset:
For this lab, we will use a dataset that contains information about houses and their sale
prices. The dataset has the following columns:
Procedure:
y = β0 + β1·x + ε

where:
- y is the dependent (target) variable and x is the independent (predictor) variable
- β0 is the intercept and β1 is the slope coefficient
- ε is the random error term
4. Data Preprocessing:
- Handle missing values, if any, by imputation or removal.
- Convert categorical variables into numerical representations, if required.
- Split the dataset into input features (independent variables) and the target variable
(dependent variable).
8. Visualization of Results:
- Visualize the actual values versus the predicted values using scatter plots or other
suitable plots.
- Plot the regression line to show the relationship between the independent and dependent
variables.
Interpretation/Program/code:
# Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Data Preprocessing
# Load the dataset (the file name is illustrative; use the dataset provided by faculty)
data = pd.read_csv("house_prices.csv")

# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values:")
print(missing_values)
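# The feature split, train/test split, training, and prediction steps fall
# across the page break; the following is a minimal sketch, assuming a numeric
# feature column 'Area' and target column 'Sale Price' (hypothetical names).
X = data[['Area']]
y = data['Sale Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the linear regression model and predict on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Model Evaluation
mse = mean_squared_error(y_test, y_pred)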
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Visualization of Results
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Sale Price ($)")
plt.ylabel("Predicted Sale Price ($)")
plt.title("Actual vs. Predicted Sale Price")
plt.show()
Output:
Conclusion:
Quiz:
1. Which scikit-learn function is used to create a linear regression model object in Python?
a) sklearn.linear_model.LinearRegression
b) sklearn.preprocessing.StandardScaler
c) sklearn.model_selection.train_test_split
d) sklearn.metrics.mean_squared_error
Ans : a) sklearn.linear_model.LinearRegression
2. What is the purpose of the mean squared error (MSE) metric in evaluating a regression model?
a) To measure the average squared difference between predicted and actual values
b) To evaluate the significance of predictor variables
c) To quantify the proportion of variance in the dependent variable explained by the independent
variables
d) To determine the optimal number of features for the regression model
Ans : a) To measure the average squared difference between predicted and actual values
Suggested References:
1. "Pattern Recognition and Machine Learning" by Christopher M. Bishop
2. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
3. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
4. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained: 04 04 10 04 20
Experiment-7
Date:
AIM: Implementation of Logistic Regression with Scikit-learn library in Python
Objective:
The objective of this lab practical is to implement logistic regression using Scikit-learn
library in Python. Logistic regression is a popular classification algorithm used to model
the relationship between input variables and categorical outcomes. In this lab, we will
explore how to build a logistic regression model and use it for classification tasks.
Materials Used:
- Python 3.x
- Jupyter Notebook
- Scikit-learn library
- Pandas library
- NumPy library
- Matplotlib library
Dataset:
For this lab, we will use a dataset that contains information about customers and whether
they churned or not from a telecommunications company. The dataset has the following
columns:
CustomerID,Gender,Age,Income,Churn
1,Male,32,50000,0
2,Female,28,35000,0
3,Male,45,80000,1
4,Male,38,60000,0
5,Female,20,20000,1
6,Female,55,75000,0
7,Male,42,90000,0
8,Female,29,40000,1
Procedure:
4. Data Preprocessing:
- Split the dataset into input features (independent variables) and the target variable
(dependent variable).
- Convert categorical variables into numerical representations using one-hot encoding or
label encoding.
- Split the dataset into training and testing sets for model evaluation.
7. Visualization of Results:
- Visualize the model's performance using confusion matrix, ROC curve, or other suitable
visualizations.
- Plot the decision boundary to demonstrate the classification boundaries.
Interpretation/Program/code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, roc_curve, roc_auc_score, auc)
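# The data-loading, training, and evaluation steps are elided by the page
# break; the following is a minimal sketch consistent with the printed output,
# assuming the 8-row sample dataset above (split size and seed are
# illustrative assumptions).
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Age': [32, 28, 45, 38, 20, 55, 42, 29],
    'Income': [50000, 35000, 80000, 60000, 20000, 75000, 90000, 40000],
    'Churn': [0, 0, 1, 0, 1, 0, 0, 1],
})

# Encode the categorical Gender column and split features/target
X = pd.get_dummies(data[['Gender', 'Age', 'Income']], drop_first=True)
y = data['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the logistic regression model and predict on the test set
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluation metrics (zero_division=0 avoids errors on tiny test sets)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
conf_matrix = confusion_matrix(y_test, y_pred)
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)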
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()
Output:
Accuracy: 0.0
Precision: 0.0
Recall: 0.0
F1-Score: 0.0
Confusion Matrix:
[[0 3]
[0 0]]
8. Conclusion:
With only eight records and a three-sample test set, the reported metrics are unstable (all test samples were misclassified); this illustrates that logistic regression needs substantially more data for a reliable evaluation.
Quiz:
1. Which scikit-learn function is used to create a logistic regression model object in Python?
a) sklearn.linear_model.LogisticRegression
b) sklearn.preprocessing.StandardScaler
c) sklearn.model_selection.train_test_split
d) sklearn.metrics.accuracy_score
Ans : a) sklearn.linear_model.LogisticRegression
Suggested References:
1. "Pattern Recognition and Machine Learning" by Christopher M. Bishop
2. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
3. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
4. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained: 04 04 10 02 20
Experiment No: 8
Date:
AIM: Implementation of Decision Tree for Student Classification
Relevant CO: CO4
Objective:
The objective of this lab practical is to implement a decision tree algorithm to classify
students as either average or clever based on given student data. Decision trees are widely
used in machine learning and data mining for classification and regression tasks. In this
lab, we will explore how to build a decision tree model and use it to classify students based
on their attributes.
Materials Used:
- Python 3.x
- Jupyter Notebook
- Scikit-learn library
- Pandas library
- NumPy library
- Matplotlib library
Dataset:
For this lab, we will use a dataset that contains information about students and their
performance. The dataset has the following columns:
Procedure:
4. Data Preprocessing:
- Split the dataset into input features (independent variables) and the target variable
(dependent variable).
- Convert categorical variables into numerical representations using one-hot encoding or
label encoding.
- Split the dataset into training and testing sets for model evaluation.
Interpretation/Program/code:
# Step 1: Introduction to Decision Trees
# Decision Tree Classification for student performance
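# The dataset construction, preprocessing, and training steps are elided by
# the page break; the following is a minimal sketch (the student attributes
# and values below are hypothetical assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# Hypothetical student data: study hours, attendance (%), test score, label
data = pd.DataFrame({
    'StudyHours': [2, 6, 1, 7, 3, 8, 2, 5, 6, 1],
    'Attendance': [70, 95, 60, 98, 75, 90, 65, 85, 92, 55],
    'TestScore': [55, 88, 45, 92, 60, 90, 50, 80, 85, 40],
    'Label': ['average', 'clever', 'average', 'clever', 'average',
              'clever', 'average', 'clever', 'clever', 'average'],
})

# Split features/target and create training and testing sets
X = data[['StudyHours', 'Attendance', 'TestScore']]
y = data['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the decision tree and predict on the test set
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='clever', zero_division=0)
recall = recall_score(y_test, y_pred, pos_label='clever', zero_division=0)
f1 = f1_score(y_test, y_pred, pos_label='clever', zero_division=0)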
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print(classification_report(y_test, y_pred))
Output:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0
precision recall f1-score support
accuracy 1.00 3
macro avg 1.00 1.00 1.00 3
weighted avg 1.00 1.00 1.00 3
Conclusion:
- The implementation of the decision tree algorithm proved effective in classifying
students as average or clever based on their attributes. Decision trees provide interpretable
results and can be used in various domains for classification tasks. The decision tree model
offers insights into the important features contributing to the classification. This lab also reinforced the end-to-end workflow of preparing data, training a classifier, and evaluating it with standard metrics.
1. In decision tree classification, what is the main objective of the splitting criterion?
Ans : To select, at each node, the feature and threshold that best separate the classes, i.e., that maximize the purity of the resulting child nodes (for example, by maximizing information gain or minimizing Gini impurity).
Suggested References:
1. "Pattern Recognition and Machine Learning" by Christopher M. Bishop
2. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley India
3. V.K. Jain, Data Science & Analytics, Khanna Book Publishing, New Delhi
4. Data Science For Dummies by Lillian Pierson, Jake Porway
Rubrics wise marks obtained