Data Science Lab: NumPy & Pandas Guide

The document outlines an exercise to install and explore Python packages including NumPy, SciPy, Jupyter, Statsmodels, and Pandas for data analysis. It details procedures for installation, feature exploration, and practical examples using the Iris dataset and a sample DataFrame. The results indicate successful execution of the programs demonstrating various functionalities of the mentioned packages.


PANIMALAR ENGINEERING COLLEGE

Department of CSE
Reg no: 211422104360
EX NO : 1

DATE:

Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages. Reading data from text file, Excel and the web. Exploring various
commands for doing descriptive analytics on Iris dataset.

AIM:
To install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.

PROCEDURE:
Installation
NumPy, SciPy, Jupyter, Statsmodels, and Pandas can be easily installed using Python's package
manager, pip. Open a terminal or command prompt and type the following commands one by one:
 pip install numpy
 pip install scipy
 pip install jupyter
 pip install statsmodels
 pip install pandas
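Once installed, the packages can be verified from a Python prompt by importing each one and printing its version; a minimal sketch (the exact version numbers will vary by machine):

```python
import importlib

# Report the installed version of each package, or note that it is missing.
# Jupyter is omitted since it is launched from the shell with `jupyter notebook`.
for pkg in ["numpy", "scipy", "statsmodels", "pandas"]:
    try:
        mod = importlib.import_module(pkg)
        print(pkg, mod.__version__)
    except ImportError:
        print(pkg, "is not installed")
```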
Explore the Features:
 NumPy: NumPy is a fundamental package for scientific computing with Python. It
provides support for arrays, matrices, and high-level mathematical functions to operate on
these arrays.
 SciPy: SciPy is built on top of NumPy and provides additional functionality for scientific
computing. It includes modules for optimization, integration, interpolation, linear algebra,
and more.
 Jupyter: Jupyter is a web-based interactive computing platform that allows you to create
and share documents containing live code, equations, visualizations, and narrative text.
 Statsmodels: Statsmodels is a Python module that provides classes and functions for
estimating many different statistical models, as well as for conducting statistical tests and
exploring data.

21CS1711 DATA SCIENCE AND ANALYTICS LAB 1


 Pandas: Pandas is a powerful data analysis and manipulation library for Python. It
provides data structures like Series and DataFrame, which are ideal for working with
structured data.

a. Exploring numpy

import numpy as np
print("==== 1. Array creation ====")
arr1 = np.array([1,2,3,4])
print("Array from list : ", arr1)
arr2 = np.zeros((2,3))
print("Array of zeroes : \n", arr2)
arr3 = np.ones((3,2))
print("Array of ones : ", arr3)
arr4 = np.arange(0,10,2)
print("Array with range : ", arr4)
arr5 = np.linspace(0,1,5)
print("Array with linspace : ", arr5)
print("\n==== 2. Array operations ====")
arr6 = np.array([1,2,3,4])
arr7 = np.array([5,6,7,8])
sum_arr = arr6 + arr7
print("Array addition : ", sum_arr)
prod_arr = arr6 * arr7
print("Array multiplication : ", prod_arr)
exp_arr = arr6 ** 2
print("Array exponentiation : ", exp_arr)
print("\n===== 3. Indexing and slicing====")
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
element = matrix[1,2]
print("Element at [1,2] : ", element)
sub_matrix = matrix[0:2,1:3]
print("Sub-matrix : \n", sub_matrix)

mask = matrix > 5
print("Elements greater than 5 : \n", matrix[mask])
print("\n====4. Broadcasting ====")
arr8 = np.array([1,2,3])
arr9 = np.array([[10],[20],[30]])
broadcasted_result = arr8 + arr9
print("Broadcasted result: \n", broadcasted_result)
print("\n==== 5. Linear algebra====")
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
matmul_result = np.matmul(a, b)
print("Matrix multiplication : \n", matmul_result)
det_a = np.linalg.det(a)
print("Determinant of matrix a : ", det_a)
print("\n==== 6. Statistical operations ====")
arr10 = np.array([1,2,3,4,5])
mean_val = np.mean(arr10)
print("Mean:", mean_val)
std_val = np.std(arr10)
print("Standard deviation :", std_val)
median_val = np.median(arr10)
print("Median:", median_val)

OUTPUT:
==== 1. Array creation ====
Array from list : [1 2 3 4]
Array of zeroes :
[[0. 0. 0.]
[0. 0. 0.]]
Array of ones : [[1. 1.]
[1. 1.]
[1. 1.]]
Array with range : [0 2 4 6 8]
Array with linspace : [0. 0.25 0.5 0.75 1. ]

==== 2. Array operations ====


Array addition : [ 6 8 10 12]
Array multiplication : [ 5 12 21 32]
Array exponentiation : [ 1 4 9 16]

===== 3. Indexing and slicing====

Element at [1,2] : 6
Sub-matrix :
[[2 3]
[5 6]]
Elements greater than 5 :
[6 7 8 9]

====4. Broadcasting ====


Broadcasted result:
[[11 12 13]
[21 22 23]
[31 32 33]]

==== 5. Linear algebra====


Matrix multiplication :
[[19 22]
[43 50]]
Determinant of matrix a : -2.0000000000000004

==== 6. Statistical operations ====


Mean: 3.0
Standard deviation : 1.4142135623730951
Median: 3.0

RESULT:
Hence, the above program for exploring the features of NumPy has been written and
executed successfully.

b. Exploring pandas

import pandas as pd
import numpy as np
print("\n==== 1. Create DataFrame ====")
data = {
    'Name':['Alice','Bob','Charlie','David','Edward'],
    'Age':[24,27,22,32,29],
    'City':['New York','Los Angeles','Chicago','Houstan','Phoenix'],
    'Salary':[70000,80000,120000,90000,100000]
}
df = pd.DataFrame(data)
print("DataFrame created from a dictionary:\n", df)
print("\n==== 2. Indexing operations ====")
age_column = df['Age']
print("Age column :\n", age_column)
row_2 = df.iloc[2]
print("\nRow 2:\n", row_2)
row_label = df.loc[1]
print("\nRow with label 1:\n", row_label)
print("\n==== 3. Filtering and Conditions ====")
filtered_df = df[df['Age']>25]
print("Filtered DataFrame (Age>25):\n", filtered_df)
filtered_df_multi_cond = df[(df['Age']>25) & (df['Salary']<100000)]
print("Filtered DataFrame (Age>25 and Salary<100000):\n", filtered_df_multi_cond)
print("\n==== 4. Summary statistics====")
summary_stats = df.describe()
print("Summary statistics of numeric columns :\n", summary_stats)
mean_salary = df['Salary'].mean()
print("\nMean salary:", mean_salary)
max_salary = df['Salary'].max()
print("\nMaximum salary:", max_salary)
print("\n==== 5. Grouping data ====")

grouped_by_city = df.groupby('City')['Salary'].mean()
print("\nAverage salary grouped by city :\n", grouped_by_city)
print("\n==== 6. Sorting data====")
sorted_by_salary = df.sort_values(by='Salary', ascending=False)
print("DataFrame sorted by Salary(descending):\n", sorted_by_salary)
sorted_by_age = df.sort_values(by='Age', ascending=True)
print("DataFrame sorted by Age(ascending):\n", sorted_by_age)
print("\n==== 7. Adding and removing columns====")
df['Experience'] = [2,5,1,8,4]
print("\nDataFrame with 'Experience' column added:\n", df)
df_dropped = df.drop(columns=['Experience'])
print("\nDataFrame after dropping 'Experience' column:\n", df_dropped)
print("\n===== 8. Merging DataFrames====")
data2 = {
    'Name':['Alice','Bob','Charlie','David','Edward'],
    'Department':['HR','IT','Finance','Marketing','Sales']
}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df, df2, on='Name')
print("Merged DataFrame:\n", merged_df)
df_with_na = df.copy()
print("\n==== 9. Handling missing data====")
df_with_na.loc[1,'Salary'] = np.nan
print("DataFrame with missing data:\n", df_with_na)
df_filled = df_with_na.fillna({'Salary': df['Salary'].mean()})
print("\nDataFrame after filling missing data:\n", df_filled)
df_dropped_na = df_with_na.dropna()
print("\nDataFrame after dropping rows with missing data:\n", df_dropped_na)

OUTPUT:
==== 1. Create DataFrame ====
DataFrame created from a dictionary:
Name Age City Salary
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
2 Charlie 22 Chicago 120000

3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000

==== 2. Indexing operations ====


Age column :
0 24
1 27
2 22
3 32
4 29
Name: Age, dtype: int64

Row 2:
Name Charlie
Age 22
City Chicago
Salary 120000
Name: 2, dtype: object

Row with label 1:


Name Bob
Age 27
City Los Angeles
Salary 80000
Name: 1, dtype: object

==== 3. Filtering and Conditions ====


Filtered DataFrame (Age>25):
Name Age City Salary
1 Bob 27 Los Angeles 80000
3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000
Filtered DataFrame (Age>25 and Salary<100000):
Name Age City Salary
1 Bob 27 Los Angeles 80000
3 David 32 Houstan 90000

==== 4. Summary statistics====


Summary statistics of numeric columns :
Age Salary
count 5.000000 5.000000
mean 26.800000 92000.000000
std 3.962323 19235.384062
min 22.000000 70000.000000
25% 24.000000 80000.000000
50% 27.000000 90000.000000
75% 29.000000 100000.000000
max 32.000000 120000.000000

Mean salary: 92000.0

Maximum salary: 120000

==== 5. Grouping data ====

Average salary grouped by city :


City
Chicago 120000.0
Houstan 90000.0

Los Angeles 80000.0
New York 70000.0
Phoenix 100000.0
Name: Salary, dtype: float64

==== 6. Sorting data====


DataFrame sorted by Salary(descending):
Name Age City Salary
2 Charlie 22 Chicago 120000
4 Edward 29 Phoenix 100000
3 David 32 Houstan 90000
1 Bob 27 Los Angeles 80000
0 Alice 24 New York 70000
DataFrame sorted by Age(ascending):
Name Age City Salary
2 Charlie 22 Chicago 120000
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
4 Edward 29 Phoenix 100000
3 David 32 Houstan 90000

==== 7. Adding and removing columns====

DataFrame with 'Experience' column added:


Name Age City Salary Experience
0 Alice 24 New York 70000 2
1 Bob 27 Los Angeles 80000 5
2 Charlie 22 Chicago 120000 1
3 David 32 Houstan 90000 8
4 Edward 29 Phoenix 100000 4

DataFrame after dropping 'Experience' column:


Name Age City Salary
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
2 Charlie 22 Chicago 120000
3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000

===== 8. Merging DataFrames====


Merged DataFrame:
Name Age City Salary Experience Department
0 Alice 24 New York 70000 2 HR
1 Bob 27 Los Angeles 80000 5 IT
2 Charlie 22 Chicago 120000 1 Finance
3 David 32 Houstan 90000 8 Marketing
4 Edward 29 Phoenix 100000 4 Sales

==== 9. Handling missing data====


DataFrame with missing data:
Name Age City Salary Experience
0 Alice 24 New York 70000.0 2
1 Bob 27 Los Angeles NaN 5
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4

DataFrame after filling missing data:


Name Age City Salary Experience

0 Alice 24 New York 70000.0 2
1 Bob 27 Los Angeles 92000.0 5
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4

DataFrame after dropping rows with missing data:


Name Age City Salary Experience
0 Alice 24 New York 70000.0 2
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4

RESULT:
Hence, the above program for exploring the various features of Pandas has been
written and executed successfully.

c. Exploring SciPy and Statsmodels
import statsmodels.api as sm
import pandas as pd
# 1. Create sample dataset
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [35, 37, 40, 43, 45, 50, 53, 55, 58, 60]
}
df = pd.DataFrame(data)
print("Sample Data:\n", df)
X = df['Hours_Studied']
y = df['Exam_Score']
X = sm.add_constant(X)  # Add intercept term
model = sm.OLS(y, X).fit()
df['Predicted_Score'] = model.predict(X)
print("\nPredicted Scores:\n", df)

OUTPUT:

RESULT:
Hence, the above program for exploring the various features of Statsmodels has been
written and executed successfully.

d. Reading data from text file, excel and the web

i. To read data from a text file (such as a CSV file), Pandas' read_csv() function is used.
import pandas as pd
# Read data from a CSV file
df = pd.read_csv('data.csv')   # replace with the path to your CSV file
# Display the first few rows of the DataFrame
print(df.head())
ii. Reading data from an Excel file is similar to reading from a text file, but Pandas'
read_excel() function is used.
# Read data from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')   # replace with the path to your Excel file
# Display the first few rows of the DataFrame
print(df.head())
iii. Pandas also allows reading data directly from a URL.
# Read data from a URL
url = '...'  # replace with the URL of a CSV file on the web
df = pd.read_csv(url)
# Display the first few rows of the DataFrame
print(df.head())
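The first output below shows a plain text file rather than a CSV; for such non-delimited files, Python's built-in open() is the simpler tool. A minimal sketch, assuming a file named sample.txt (the name and contents are illustrative):

```python
# Write a small sample file, then read it back line by line
with open('sample.txt', 'w') as f:
    f.write('Hello, world!\n')
    f.write('This is a sample text file.\n')

with open('sample.txt') as f:
    for line in f:
        print(line, end='')
```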

OUTPUT
Contents of the text file:
Hello, world!
This is a sample text file.

Output after reading the text file:
Hello, world!
This is a sample text file.


Output after reading the Excel file


Name Age Gender
0 Alice 30 Female
1 Bob 25 Male
2 Charlie 35 Male

Output after reading data from a URL


Column1 Column2 Column3
0 value value value
1 value value value
2 value value value
3 value value value
4 value value value

Result:
Thus the programs to read data from a text file, Excel and the web were executed.

e. Exploring various commands for doing descriptive analytics on iris dataset.

i. Importing relevant libraries


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
from sklearn import metrics
%matplotlib inline
ii. Loading & printing iris data
df = pd.read_csv('Iris.csv')   # path to the Iris dataset CSV file
df.head()
iris_data = pd.read_csv('Iris.csv')
print(iris_data)
iii. Displaying the top rows of the dataset with their columns.
The function head() will display the top rows of the dataset; the default is 5, i.e. it shows
the top 5 rows when no argument is given.
iris_data.head()
print(iris_data.head())
iv. Displaying the shape of the dataset.
The shape of a dataset gives the total number of rows (entries) and the total number of
columns (features) it contains.
iris_data.shape
print(iris_data.shape)
v. Summary of the DataFrame's information
info() provides a concise summary of the DataFrame: the number of non-null values in
each column, the data type of each column, and the memory usage.
iris_data.info()
vi. Statistical Insight

The describe() function applies basic statistical computations to the dataset, such as
extreme values, the count of data points, and the standard deviation. Missing or NaN values
are automatically skipped. describe() gives a good picture of the distribution of the data.
iris_data.describe()
print(iris_data.describe())
vii. Checking For Duplicate Entries
The duplicated() method in pandas returns a boolean Series indicating duplicate rows in a
DataFrame. It marks each row as True if it is a duplicate of a previous row, and False
otherwise.
iris_data.duplicated()
print(iris_data[iris_data.duplicated()])
viii. Checking the balance
The value_counts() method in pandas counts the occurrences of unique values in a
Series. This code prints the counts of unique species in the 'Species' column of the
DataFrame iris_data; each unique species name is listed along with its count.
df.value_counts("Species")
print(iris_data['Species'].value_counts())

DATA VISUALIZATION
ix. Species count
A count plot visualizes the count of each species in the Iris dataset using seaborn.
# Set the title for the plot
plt.title('Species Count')
# Create a count plot to visualize the count of each species
sns.countplot(iris_data['Species'])
# Display the plot
plt.show()
x. Univariate analysis
Univariate analysis involves examining the distribution and characteristics of a single
variable. It helps in understanding the basic properties of individual variables in the dataset.

sns.scatterplot(x='x_variable', y='y_variable', hue='hue_variable', data=df,
                s=marker_size)
COMPARISON OF DIFFERENT SPECIES DEPENDING ON SEPAL WIDTH AND
LENGTH.
plt.figure(figsize=(17,9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(x=iris_data['sepal_length'], y=iris_data['sepal_width'],
                hue=iris_data['species'], s=50)
COMPARISON OF DIFFERENT SPECIES DEPENDING ON PETAL WIDTH AND
LENGTH.
plt.figure(figsize=(16,9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(x=iris_data['petal_length'], y=iris_data['petal_width'],
                hue=iris_data['species'], s=50)
xi. Bi-variate Analysis
sns.pairplot(data=df, hue='hue_variable', height=plot_height)
sns.pairplot(iris_data, hue="species", height=4)

OUTPUT
Printing iris data

Displaying the top rows

Displaying the shape of the dataset


The dataframe contains 6 columns and 150 rows.
Summary of the DataFrame's information

Statistical Insight


Checking For Duplicate Entries

Checking the balance

Visualizing the target column

Comparison of different species depending on sepal width and length.

Comparison of different species depending on petal width and length

Bi-variate Analysis

Result:
Thus the programs for performing descriptive analytics on the Iris dataset and exploring
the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were executed.

EX NO: 2
DATE:

Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
a. Univariate analysis
b. Bivariate analysis
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.

a. Univariate analysis

Aim
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Univariate analysis.
Procedure
Univariate analysis involves the examination of a single variable at a time. It focuses on
understanding the distribution and characteristics of that variable without considering any
relationships with other variables.
Load the datasets: Load the diabetes dataset from UCI and the Pima Indians Diabetes dataset
into pandas DataFrames.
Univariate analysis: Calculate the frequency, mean, median, mode, variance, standard
deviation, skewness, and kurtosis for each variable in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = '...'  # URL of the Pima Indians Diabetes CSV file (elided in the source)
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
           'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)
print(df.head())
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
uci_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
uci_df['target'] = diabetes.target
print(uci_df.head())
import matplotlib.pyplot as plt
import seaborn as sns
# Example for Pima
for col in df.columns[:-1]:  # exclude Outcome
    sns.histplot(df[col], kde=True)
    plt.title(f'Pima: {col}')
    plt.show()
# Example for UCI
for col in uci_df.columns[:-1]:  # exclude target
    sns.histplot(uci_df[col], kde=True)
    plt.title(f'UCI: {col}')
    plt.show()
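The procedure above also names frequency, mean, median, mode, variance, standard deviation, skewness, and kurtosis; these can all be computed per column with pandas. A minimal sketch on synthetic stand-in data (the column name 'Glucose' and the values are illustrative):

```python
import pandas as pd

# Synthetic stand-in for one numeric column of the Pima dataset
s = pd.Series([85, 90, 92, 100, 105, 110, 140, 160, 180, 90], name='Glucose')

stats = {
    'mean': s.mean(),
    'median': s.median(),
    'mode': s.mode().iloc[0],   # first mode if there are several
    'variance': s.var(),        # sample variance (ddof=1)
    'std': s.std(),
    'skewness': s.skew(),       # positive here: long right tail
    'kurtosis': s.kurt(),       # excess kurtosis (normal distribution is about 0)
}
for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```

The same calls apply column-wise to a whole DataFrame, e.g. df.skew(numeric_only=True).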
OUTPUT


Result:
Thus the program for performing the Univariate analysis is executed.

b. Bivariate Analysis:

Aim:
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Bivariate analysis.

Procedure
 Import Necessary Libraries:
Import pandas for data manipulation and seaborn/matplotlib for visualization.
 Load the Dataset:
Download the dataset from the UCI repository or use a direct URL.
 Explore the Data:
Get familiar with the dataset; check for missing values, data types, and the distribution of
variables.
 Compare Features Against the Outcome:
Plot each feature against the target variable using box plots to see how its distribution
differs between outcome groups.
 Bin the Continuous Target:
For the UCI dataset, bin the continuous target into Low/Medium/High groups so that
features can be compared across them.
 Visualize Correlations:
Compute the correlation matrix between variables and visualize it as a heatmap.

import seaborn as sns
import matplotlib.pyplot as plt
# Loop through all columns except 'Outcome'
for col in df.columns[:-1]:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x='Outcome', y=col, data=df)
    plt.title(f'{col} vs Outcome (Pima)')
    plt.show()
uci_df['target_bin'] = pd.qcut(uci_df['target'], q=3, labels=['Low', 'Medium', 'High'])

for col in uci_df.columns[:-2]:  # exclude 'target' and 'target_bin'
    plt.figure(figsize=(6, 4))
    sns.boxplot(x='target_bin', y=col, data=uci_df)
    plt.title(f'{col} vs Target Bin (UCI)')
    plt.show()
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation between Variables")
plt.show()
# Drop the 'target_bin' column before calculating correlation
plt.figure(figsize=(12,8))
sns.heatmap(uci_df.drop(columns=['target_bin']).corr(), annot=True, cmap='coolwarm')
plt.title('UCI Diabetes Dataset Correlation')
plt.show()
OUTPUT


Result:
Thus the program for performing the Bivariate analysis is executed.

c. Multivariate Analysis:
Aim:
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Multivariate analysis.
Procedure:
 Load the Pima Indians Diabetes dataset from the UCI repository.
 Split the dataset into features (X) and the target variable (y).
 Split the data into training and testing sets using 80% for training and 20% for testing.
 Train a multivariate regression model using scikit-learn's LinearRegression.
 Predict the outcome on the test set.
 Evaluate the model using Mean Squared Error (MSE).
Program
Multivariate Analysis using PIMA DATASETS
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
X_pima = df.drop(columns=['Glucose', 'Outcome'])
y_pima = df['Glucose']
X_train, X_test, y_train, y_test = train_test_split(X_pima, y_pima, test_size=0.2,
                                                    random_state=42)
model_pima = LinearRegression()
model_pima.fit(X_train, y_train)
y_pred_pima = model_pima.predict(X_test)
print("PIMA:")
print("R² Score:", r2_score(y_test, y_pred_pima))
print("MSE:", mean_squared_error(y_test, y_pred_pima))
OUTPUT
PIMA:
R² Score: 0.135973429426115
MSE: 869.4777053181734

Multivariate Analysis using UCI DATASETS


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import r2_score, mean_squared_error
# Drop the 'target_bin' column before splitting the data
X_uci = uci_df.drop(columns=['target', 'target_bin'])
y_uci = uci_df['target']
X_train, X_test, y_train, y_test = train_test_split(X_uci, y_uci, test_size=0.2,
                                                    random_state=42)
model_uci = LinearRegression()
model_uci.fit(X_train, y_train)
y_pred_uci = model_uci.predict(X_test)
print("UCI:")
print("R² Score:", r2_score(y_test, y_pred_uci))
print("MSE:", mean_squared_error(y_test, y_pred_uci))
OUTPUT
UCI:
R² Score: 0.4526027629719195
MSE: 2900.193628493482

Result:
Thus the program for performing the Multivariate analysis is executed.


EX NO : 3
DATE:

APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS

Aim
To explore various plotting functions on UCI data sets.

Procedure:
 Download a dataset from the UCI Machine Learning Repository.
 Save it in Downloads or any other folder and install the required packages.
 Apply the following commands on the dataset.
 The output will be displayed.

Program:
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
data_url = "..."  # UCI Iris dataset URL (elided in the source)
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(data_url, names=names)
# Plot a histogram of sepal length
plt.hist(dataset['sepal-length'], bins=10)
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
plt.show()
# Plot a scatter plot of sepal length vs sepal width
plt.scatter(dataset['sepal-length'], dataset['sepal-width'])
plt.xlabel('Sepal Length')

plt.ylabel('Sepal Width')
plt.title('Scatter Plot of Sepal Length vs Sepal Width')
plt.show()
# Plot a box plot of petal length for each class
dataset.boxplot(column='petal-length', by='class')
plt.title('Box Plot of Petal Length for Each Class')
plt.xlabel('Class')
plt.ylabel('Petal Length')
plt.show()
# Plot a bar chart of the mean petal width for each class
class_means = dataset.groupby('class')['petal-width'].mean()
class_means.plot(kind='bar')
plt.title('Mean Petal Width for Each Class')
plt.xlabel('Class')
plt.ylabel('Mean Petal Width')
plt.show()

Output:

Result:
Thus the program to explore various plotting functions on UCI data sets is performed.


EX NO: 4
DATE:

IMPLEMENT DECISION TREE CLASSIFICATION

Aim
To write a program to perform decision tree classification.

Procedure
 Loads the Iris dataset.
 Splits the dataset into features (X) and the target variable (y).
 Splits the data into training and testing sets using 80% for training and 20% for testing.
 Trains a decision tree classifier using scikit-learn's DecisionTreeClassifier.
 Predicts the class labels on the test set.
 Evaluates the model's accuracy using accuracy_score.
 Prints a classification report containing precision, recall, F1-score, and support for each
class using classification_report.

Program
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load the dataset
data_url = "..."  # UCI Iris dataset URL (elided in the source)
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris_data = pd.read_csv(data_url, names=names)
# Split features and target variable
X = iris_data.drop('class', axis=1)
y = iris_data['class']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train decision tree classifier
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)
# Predictions on the test set
predictions = classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
# Classification report
print("Classification Report:")
print(classification_report(y_test, predictions))

Output:

Result:
Thus a program to perform decision tree classification on the Iris dataset is verified.


EX NO: 5
DATE:

IMPLEMENT CLUSTERING TECHNIQUES

Aim
To write a program to implement clustering techniques.

Procedure
 Imports necessary libraries: numpy for numerical operations, matplotlib for plotting,
make_blobs to generate sample data, and KMeans for the K-means clustering algorithm.
 Generates sample data using make_blobs function. You can replace this with your own
dataset.
 Initializes K-Means with the desired number of clusters.
 Fits the KMeans model to the data and predicts cluster labels.
 Plots the data points with different colors representing different clusters.
 Plots the centroids of the clusters in red.

Program
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Apply KMeans clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Plotting centroids
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()

OUTPUT:

Result:
Thus a program to implement clustering techniques is verified.
