PANIMALAR ENGINEERING COLLEGE
Department of CSE
Reg no: 211422104360
EX NO : 1
DATE:
Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages. Reading data from text file, Excel and the web. Exploring various
commands for doing descriptive analytics on Iris dataset.
AIM:
To install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
PROCEDURE:
Installation
NumPy, SciPy, Jupyter, Statsmodels, and Pandas can be easily installed using Python's package
manager, pip. Open a terminal or command prompt and type the following commands one by one:
pip install numpy
pip install scipy
pip install jupyter
pip install statsmodels
pip install pandas
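After installation, a quick sanity check is to import each package and print its version; a minimal sketch (jupyter is best verified from the shell with `jupyter --version`, so it is omitted here):

```python
import importlib

# Print the installed version of each package, or flag it as missing.
versions = {}
for name in ["numpy", "scipy", "statsmodels", "pandas"]:
    try:
        mod = importlib.import_module(name)
        versions[name] = getattr(mod, "__version__", "unknown")
    except ImportError:
        versions[name] = "NOT installed"
    print(name, versions[name])
```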
Explore the Features:
NumPy: NumPy is a fundamental package for scientific computing with Python. It
provides support for arrays, matrices, and high-level mathematical functions to operate on
these arrays.
SciPy: SciPy is built on top of NumPy and provides additional functionality for scientific
computing. It includes modules for optimization, integration, interpolation, linear algebra,
and more.
Jupyter: Jupyter is a web-based interactive computing platform that allows you to create
and share documents containing live code, equations, visualizations, and narrative text.
Statsmodels: Statsmodels is a Python module that provides classes and functions for
estimating many different statistical models, as well as for conducting statistical tests and
exploring data.
21CS1711 DATA SCIENCE AND ANALYTICS LAB 1
Pandas: Pandas is a powerful data analysis and manipulation library for Python. It
provides data structures like Series and DataFrame, which are ideal for working with
structured data.
a. Exploring numpy
import numpy as np
print("==== 1. Array creation ====")
arr1 = np.array([1, 2, 3, 4])
print("Array from list : ", arr1)
arr2 = np.zeros((2, 3))
print("Array of zeroes : \n", arr2)
arr3 = np.ones((3, 2))
print("Array of ones : ", arr3)
arr4 = np.arange(0, 10, 2)
print("Array with range : ", arr4)
arr5 = np.linspace(0, 1, 5)
print("Array with linspace : ", arr5)
print("\n==== 2. Array operations ====")
arr6 = np.array([1, 2, 3, 4])
arr7 = np.array([5, 6, 7, 8])
sum_arr = arr6 + arr7
print("Array addition : ", sum_arr)
prod_arr = arr6 * arr7
print("Array multiplication : ", prod_arr)
exp_arr = arr6 ** 2
print("Array exponentiation : ", exp_arr)
print("\n===== 3. Indexing and slicing====")
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
element = matrix[1, 2]
print("Element at [1,2] : ", element)
sub_matrix = matrix[0:2, 1:3]
print("Sub-matrix : \n", sub_matrix)
mask = matrix > 5
print("Elements greater than 5 : \n", matrix[mask])
print("\n====4. Broadcasting ====")
arr8 = np.array([1, 2, 3])
arr9 = np.array([[10], [20], [30]])
broadcasted_result = arr8 + arr9
print("Broadcasted result: \n", broadcasted_result)
print("\n==== 5. Linear algebra====")
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
matmul_result = np.dot(a, b)
print("Matrix multiplication : \n", matmul_result)
det_a = np.linalg.det(a)
print("Determinant of matrix a : ", det_a)
print("\n==== 6. Statistical operations ====")
arr10 = np.array([1, 2, 3, 4, 5])
mean_val = np.mean(arr10)
print("Mean:", mean_val)
std_val = np.std(arr10)
print("Standard deviation :", std_val)
median_val = np.median(arr10)
print("Median:", median_val)
OUTPUT:
==== 1. Array creation ====
Array from list : [1 2 3 4]
Array of zeroes :
[[0. 0. 0.]
[0. 0. 0.]]
Array of ones : [[1. 1.]
[1. 1.]
[1. 1.]]
Array with range : [0 2 4 6 8]
Array with linspace : [0. 0.25 0.5 0.75 1. ]
==== 2. Array operations ====
Array addition : [ 6 8 10 12]
Array multiplication : [ 5 12 21 32]
Array exponentiation : [ 1 4 9 16]
===== 3. Indexing and slicing====
Element at [1,2] : 6
Sub-matrix :
[[2 3]
[5 6]]
Elements greater than 5 :
[6 7 8 9]
====4. Broadcasting ====
Broadcasted result:
[[11 12 13]
[21 22 23]
[31 32 33]]
==== 5. Linear algebra====
Matrix multiplication :
[[19 22]
[43 50]]
Determinant of matrix a : -2.0000000000000004
==== 6. Statistical operations ====
Mean: 3.0
Standard deviation : 1.4142135623730951
Median: 3.0
RESULT:
Hence, the above program for exploring the features of NumPy has been written and
executed successfully.
b. Exploring pandas
import pandas as pd
import numpy as np
print("\n==== 1. Create DataFrame ====")
data={
'Name':['Alice','Bob','Charlie','David','Edward'],
'Age':[24,27,22,32,29],
'City':['New York','Los Angeles','Chicago','Houstan','Phoenix'],
'Salary':[70000,80000,120000,90000,100000]
}
df = pd.DataFrame(data)
print("DataFrame created from a dictionary:\n", df)
print("\n==== 2. Indexing Operations ====")
age_column = df['Age']
print("Age column :\n", age_column)
row_2 = df.iloc[2]
print("\nRow 2:\n", row_2)
row_label = df.loc[1]
print("\nRow with label 1:\n", row_label)
print("\n==== 3. Filtering and Conditions ====")
filtered_df = df[df['Age'] > 25]
print("Filtered DataFrame (Age>25):\n", filtered_df)
filtered_df_multi_cond = df[(df['Age'] > 25) & (df['Salary'] < 100000)]
print("Filtered DataFrame (Age>25 and Salary<100000):\n", filtered_df_multi_cond)
print("\n==== 4. Summary statistics====")
summary_stats = df.describe()
print("Summary statistics of numeric columns :\n", summary_stats)
mean_salary = df['Salary'].mean()
print("\nMean salary:", mean_salary)
max_salary = df['Salary'].max()
print("\nMaximum salary:", max_salary)
print("\n==== 5. Grouping data ====")
grouped_by_city = df.groupby('City')['Salary'].mean()
print("\nAverage salary grouped by city :\n", grouped_by_city)
print("\n==== 6. Sorting data====")
sorted_by_salary = df.sort_values(by='Salary', ascending=False)
print("DataFrame sorted by Salary(descending):\n", sorted_by_salary)
sorted_by_age = df.sort_values(by='Age', ascending=True)
print("DataFrame sorted by Age(ascending):\n", sorted_by_age)
print("\n==== 7. Adding and removing columns====")
df['Experience'] = [2, 5, 1, 8, 4]
print("\nDataFrame with 'Experience' column added:\n", df)
df_dropped = df.drop(columns=['Experience'])
print("\nDataFrame after dropping 'Experience' column:\n", df_dropped)
print("\n===== 8. Merging DataFrames====")
data2 = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales']
}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df, df2, on='Name')
print("Merged DataFrame:\n", merged_df)
df_with_na = df.copy()
print("\n==== 9. Handling missing data====")
df_with_na.loc[1, 'Salary'] = np.nan
print("DataFrame with missing data:\n", df_with_na)
df_filled = df_with_na.fillna({'Salary': df['Salary'].mean()})
print("\nDataFrame after filling missing data:\n", df_filled)
df_dropped_na = df_with_na.dropna()
print("\nDataFrame after dropping rows with missing data:\n", df_dropped_na)
OUTPUT:
==== 1. Create DataFrame ====
DataFrame created from a dictionary:
Name Age City Salary
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
2 Charlie 22 Chicago 120000
3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000
==== 2. Indexing Operations ====
Age column :
0 24
1 27
2 22
3 32
4 29
Name: Age, dtype: int64
Row 2:
Name Charlie
Age 22
City Chicago
Salary 120000
Name: 2, dtype: object
Row with label 1:
Name Bob
Age 27
City Los Angeles
Salary 80000
Name: 1, dtype: object
==== 3. Filtering and Conditions ====
Filtered DataFrame (Age>25):
Name Age City Salary
1 Bob 27 Los Angeles 80000
3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000
Filtered DataFrame (Age>25 and Salary<100000):
Name Age City Salary
1 Bob 27 Los Angeles 80000
3 David 32 Houstan 90000
==== 4. Summary statistics====
Summary statistics of numeric columns :
Age Salary
count 5.000000 5.000000
mean 26.800000 92000.000000
std 3.962323 19235.384062
min 22.000000 70000.000000
25% 24.000000 80000.000000
50% 27.000000 90000.000000
75% 29.000000 100000.000000
max 32.000000 120000.000000
Mean salary: 92000.0
Maximum salary: 120000
==== 5. Grouping data ====
Average salary grouped by city :
City
Chicago 120000.0
Houstan 90000.0
Los Angeles 80000.0
New York 70000.0
Phoenix 100000.0
Name: Salary, dtype: float64
==== 6. Sorting data====
DataFrame sorted by Salary(descending):
Name Age City Salary
2 Charlie 22 Chicago 120000
4 Edward 29 Phoenix 100000
3 David 32 Houstan 90000
1 Bob 27 Los Angeles 80000
0 Alice 24 New York 70000
DataFrame sorted by Age(ascending):
Name Age City Salary
2 Charlie 22 Chicago 120000
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
4 Edward 29 Phoenix 100000
3 David 32 Houstan 90000
==== 7. Adding and removing columns====
DataFrame with 'Experience' column added:
Name Age City Salary Experience
0 Alice 24 New York 70000 2
1 Bob 27 Los Angeles 80000 5
2 Charlie 22 Chicago 120000 1
3 David 32 Houstan 90000 8
4 Edward 29 Phoenix 100000 4
DataFrame after dropping 'Experience' column:
Name Age City Salary
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
2 Charlie 22 Chicago 120000
3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000
===== 8. Merging DataFrames====
Merged DataFrame:
Name Age City Salary Experience Department
0 Alice 24 New York 70000 2 HR
1 Bob 27 Los Angeles 80000 5 IT
2 Charlie 22 Chicago 120000 1 Finance
3 David 32 Houstan 90000 8 Marketing
4 Edward 29 Phoenix 100000 4 Sales
==== 9. Handling missing data====
DataFrame with missing data:
Name Age City Salary Experience
0 Alice 24 New York 70000.0 2
1 Bob 27 Los Angeles NaN 5
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4
DataFrame after filling missing data:
Name Age City Salary Experience
0 Alice 24 New York 70000.0 2
1 Bob 27 Los Angeles 92000.0 5
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4
DataFrame after dropping rows with missing data:
Name Age City Salary Experience
0 Alice 24 New York 70000.0 2
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4
RESULT:
Hence, the above program for exploring the various features of Pandas has been
written and executed successfully.
c. Exploring scipy and statsmodels
import statsmodels.api as sm
import pandas as pd
# 1. Create sample dataset
data = {
'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Exam_Score': [35, 37, 40, 43, 45, 50, 53, 55, 58, 60]
}
df = pd.DataFrame(data)
print("Sample Data:\n", df)
X = df['Hours_Studied']
y = df['Exam_Score']
X = sm.add_constant(X) # Add intercept term
model = sm.OLS(y, X).fit()
df['Predicted_Score'] = model.predict(X)
print("\nPredicted Scores:\n", df)
OUTPUT:
RESULT:
Hence, the above program for exploring the various features of Statsmodels has been
written and executed successfully.
d. Reading data from a text file, Excel and the web
i. To read data from a text file (such as a CSV file), Pandas' read_csv() function is used.
import pandas as pd
# Read data from a CSV file
df = pd.read_csv('data.csv')  # illustrative file name
# Display the first few rows of the DataFrame
print(df.head())
ii. Reading data from an Excel file is similar to reading from a text file, but Pandas'
read_excel() function is used.
# Read data from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # illustrative file name
# Display the first few rows of the DataFrame
print(df.head())
iii. Pandas can also read data directly from a URL.
# Read data from a URL
url = 'https://example.com/data.csv'  # illustrative URL; use any direct link to a CSV file
df = pd.read_csv(url)
# Display the first few rows of the DataFrame
print(df.head())
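For a plain, non-delimited text file (like the sample shown in the output below), Python's built-in open() is enough; a minimal sketch (the file name is illustrative):

```python
# Write and then read back a plain text file (illustrative file name).
with open("sample.txt", "w") as f:
    f.write("Hello, world!\nThis is a sample text file.\n")

with open("sample.txt") as f:
    content = f.read()
print(content)
```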
OUTPUT
Sample text file:
Hello, world!
This is a sample text file.
Output after reading the text file:
Hello, world!
This is a sample text file.
Sample Excel file:
Output after reading the Excel file
Name Age Gender
0 Alice 30 Female
1 Bob 25 Male
2 Charlie 35 Male
Output after reading the data from the URL
Column1 Column2 Column3
0 value value value
1 value value value
2 value value value
3 value value value
4 value value value
Result:
Thus the programs to read data from a text file, Excel and the web are executed.
e. Exploring various commands for doing descriptive analytics on iris dataset.
i. Importing relevant libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
from sklearn import metrics
%matplotlib inline
ii. Loading & printing iris data
iris_data = pd.read_csv('Iris.csv')  # illustrative file name for the Iris CSV
print(iris_data)
iii. Displaying the top rows of the dataset with their columns.
The head() function displays the top rows of the dataset; its default value is 5, that is, it
shows the top 5 rows when no argument is given to it.
print(iris_data.head())
iv. Displaying the shape of the dataset.
The shape of the dataset gives the total number of rows or entries and the total
number of columns or features of that particular dataset.
print(iris_data.shape)
v. Summary of the DataFrame's information
The info() method provides a concise summary of the DataFrame. This includes
the number of non-null values in each column, the data type of each column, and memory
usage.
print(iris_data.info())
vi. Statistical Insight
The describe() function applies basic statistical computations on the dataset like extreme
values, count of data points, standard deviation, etc. Any missing value or NaN value is
automatically skipped. The describe() function gives a good picture of the distribution of data.
print(iris_data.describe())
vii. Checking For Duplicate Entries
The duplicated() method in pandas returns a boolean Series indicating duplicate rows in a
DataFrame. It marks each row as True if it is a duplicate of a previous row, and False
otherwise.
print(iris_data[iris_data.duplicated()])
viii. Checking the balance
The value_counts() method in pandas is used to count the occurrences of unique values in a
Series. This code will print the counts of unique species in the 'Species' column of the
DataFrame iris_data. Each unique species name will be listed along with its count.
print(iris_data['Species'].value_counts())
DATA VISUALIZATION
ix. Species count
A count plot is used to visualize the number of samples of each species in the Iris dataset
using seaborn.
# Set the title for the plot
plt.title('Species Count')
# Create a count plot to visualize the count of each species
sns.countplot(x='Species', data=iris_data)
# Display the plot
plt.show()
x. Univariate analysis
Univariate analysis involves examining the distribution and characteristics of a single
variable. It helps in understanding the basic properties of individual variables in the dataset.
sns.scatterplot(x='x_variable', y='y_variable', hue='hue_variable', data=df, s=marker_size)
COMPARISON OF DIFFERENT SPECIES DEPENDING ON SEPAL WIDTH AND
LENGTH.
plt.figure(figsize=(17, 9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(x=iris_data['sepal_length'], y=iris_data['sepal_width'], hue=iris_data['species'], s=50)
plt.show()
COMPARISON OF DIFFERENT SPECIES DEPENDING ON PETAL WIDTH AND
LENGTH.
plt.figure(figsize=(16, 9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(x=iris_data['petal_length'], y=iris_data['petal_width'], hue=iris_data['species'], s=50)
plt.show()
xi. Bi-variate Analysis
sns.pairplot(data=df, hue='hue_variable', height=plot_height)
sns.pairplot(iris_data, hue="species", height=4)
OUTPUT
Printing iris data
Displaying the top rows
Displaying the shape of the dataset
The DataFrame contains 150 rows and 6 columns.
Summary of the DataFrame's information
Statistical Insight
Checking For Duplicate Entries
Checking the balance
Visualizing the target column
Comparison of different species depending on sepal width and length.
Comparison of different species depending on petal width and length
Bi-variate Analysis
Result:
Thus the program to do the descriptive analytics on Iris dataset and exploring the
features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages are executed.
EX NO: 2
DATE:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
a. Univariate analysis
b. Bivariate analysis
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
a. Univariate analysis
Aim
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Univariate analysis.
Procedure
Univariate analysis involves the examination of a single variable at a time. It focuses on
understanding the distribution and characteristics of that variable without considering any
relationships with other variables.
Load the datasets: Load the diabetes dataset from UCI and the Pima Indians Diabetes dataset
into pandas DataFrames.
Univariate analysis: Calculate the frequency, mean, median, mode, variance, standard
deviation, skewness, and kurtosis for each variable in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'  # commonly used mirror of the Pima dataset
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
           'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)
print(df.head())
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
uci_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
uci_df['target'] = diabetes.target
print(uci_df.head())
import matplotlib.pyplot as plt
import seaborn as sns
# Example for Pima
for col in df.columns[:-1]:  # exclude Outcome
    sns.histplot(df[col], kde=True)
    plt.title(f'Pima: {col}')
    plt.show()
# Example for UCI
for col in uci_df.columns[:-1]:  # exclude target
    sns.histplot(uci_df[col], kde=True)
    plt.title(f'UCI: {col}')
    plt.show()
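The procedure above also lists mean, median, mode, variance, standard deviation, skewness and kurtosis; pandas exposes each directly. A minimal sketch on a hypothetical numeric sample (with the real datasets, apply the same calls to each column of df):

```python
import pandas as pd

# Hypothetical numeric sample standing in for one dataset column.
s = pd.Series([1, 2, 2, 3, 4, 5, 5, 5, 6, 7])

stats = {
    "mean": s.mean(),
    "median": s.median(),
    "mode": s.mode().iloc[0],      # first mode if there are ties
    "variance": s.var(),
    "std": s.std(),
    "skewness": s.skew(),
    "kurtosis": s.kurt(),
}
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```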
OUTPUT
Result:
Thus the program for performing the Univariate analysis is executed.
b. Bivariate Analysis:
Aim:
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Bivariate analysis.
Procedure
Import Necessary Libraries:
Import pandas for data manipulation and seaborn/matplotlib for visualization.
Load the Datasets:
Use the Pima and UCI diabetes DataFrames loaded in the previous section.
Explore Pairwise Relationships:
Draw box plots of each feature against the outcome to see how its distribution
differs between the outcome classes.
Bin the Continuous Target:
For the UCI dataset, bin the continuous target into Low/Medium/High groups with
pd.qcut so it can be used as a grouping variable.
Compute Correlations:
Draw a correlation heatmap for each dataset to quantify the pairwise linear
relationships between variables.
import seaborn as sns
import matplotlib.pyplot as plt
# Loop through all columns except 'Outcome'
for col in df.columns[:-1]:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x='Outcome', y=col, data=df)
    plt.title(f'{col} vs Outcome (Pima)')
    plt.show()
uci_df['target_bin'] = pd.qcut(uci_df['target'], q=3, labels=['Low', 'Medium', 'High'])
for col in uci_df.columns[:-2]:  # exclude 'target' and 'target_bin'
    plt.figure(figsize=(6, 4))
    sns.boxplot(x='target_bin', y=col, data=uci_df)
    plt.title(f'{col} vs Target Bin (UCI)')
    plt.show()
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation between Variables")
plt.show()
# Drop the 'target_bin' column before calculating correlation
plt.figure(figsize=(12, 8))
sns.heatmap(uci_df.drop(columns=['target_bin']).corr(), annot=True, cmap='coolwarm')
plt.title('UCI Diabetes Dataset Correlation')
plt.show()
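SciPy can complement the box plots with a single numeric measure of bivariate association; a sketch using scipy.stats.pearsonr on illustrative paired values (hypothetical numbers, not the real datasets):

```python
from scipy import stats

# Illustrative paired samples (hypothetical values, not the real datasets).
glucose = [85, 95, 110, 126, 140, 155, 170, 182]
risk_score = [0.10, 0.15, 0.30, 0.45, 0.55, 0.70, 0.80, 0.90]

# Pearson correlation coefficient and the two-sided p-value.
r, p = stats.pearsonr(glucose, risk_score)
print(f"Pearson r = {r:.3f}, p-value = {p:.6f}")
```

A value of r near 1 indicates a strong positive linear relationship between the pair.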
OUTPUT
Result:
Thus the program for performing the Bivariate analysis is executed.
c. Multivariate Analysis:
Aim:
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Multivariate analysis.
Procedure:
Load the Pima Indians Diabetes dataset from the UCI repository.
Split the dataset into features (X) and the target variable (y).
Split the data into training and testing sets using 80% for training and 20% for testing.
Train a multivariate regression model using scikit-learn's LinearRegression.
Predict the outcome on the test set.
Evaluate the model using Mean Squared Error (MSE).
Program
Multivariate Analysis using PIMA DATASETS
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
X_pima = df.drop(columns=['Glucose', 'Outcome'])
y_pima = df['Glucose']
X_train, X_test, y_train, y_test = train_test_split(X_pima, y_pima, test_size=0.2,
random_state=42)
model_pima = LinearRegression()
model_pima.fit(X_train, y_train)
y_pred_pima = model_pima.predict(X_test)
print("PIMA:")
print("R² Score:", r2_score(y_test, y_pred_pima))
print("MSE:", mean_squared_error(y_test, y_pred_pima))
OUTPUT
PIMA:
R² Score: 0.135973429426115
MSE: 869.4777053181734
Multivariate Analysis using UCI DATASETS
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
# Drop the 'target_bin' column before splitting the data
X_uci = uci_df.drop(columns=['target', 'target_bin'])
y_uci = uci_df['target']
X_train, X_test, y_train, y_test = train_test_split(X_uci, y_uci, test_size=0.2,
random_state=42)
model_uci = LinearRegression()
model_uci.fit(X_train, y_train)
y_pred_uci = model_uci.predict(X_test)
print("UCI:")
print("R² Score:", r2_score(y_test, y_pred_uci))
print("MSE:", mean_squared_error(y_test, y_pred_uci))
OUTPUT
UCI:
R² Score: 0.4526027629719195
MSE: 2900.193628493482
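Point (d) of the exercise asks for a comparison of the two analyses; a minimal side-by-side print using the scores from the runs above (hard-coded here, so the numbers only illustrate the comparison):

```python
# R² and MSE values copied from the two runs above.
results = {
    "PIMA": {"r2": 0.1360, "mse": 869.48},
    "UCI":  {"r2": 0.4526, "mse": 2900.19},
}
for name, m in results.items():
    print(f"{name}: R2 = {m['r2']:.4f}, MSE = {m['mse']:.2f}")
# The UCI model explains more variance (higher R²). The MSE values are not
# directly comparable because the two targets are on different scales.
```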
Result:
Thus the program for performing the Multivariate analysis is executed.
EX NO : 3
DATE:
APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS
Aim
To explore various plotting functions on UCI data sets.
Procedure:
Download a dataset from the UCI Machine Learning Repository (e.g. the Iris dataset).
Save it in Downloads or any other folder and install the required packages.
Apply the following commands on the dataset.
The output will be displayed.
Program:
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(data_url, names=names)
# Plot a histogram of sepal length
plt.hist(dataset['sepal-length'], bins=10)
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
plt.show()
# Plot a scatter plot of sepal length vs sepal width
plt.scatter(dataset['sepal-length'], dataset['sepal-width'])
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Scatter Plot of Sepal Length vs Sepal Width')
plt.show()
# Plot a box plot of petal length for each class
dataset.boxplot(column='petal-length', by='class')
plt.title('Box Plot of Petal Length for Each Class')
plt.xlabel('Class')
plt.ylabel('Petal Length')
plt.show()
# Plot a bar chart of the mean petal width for each class
class_means = dataset.groupby('class')['petal-width'].mean()
class_means.plot(kind='bar')
plt.title('Mean Petal Width for Each Class')
plt.xlabel('Class')
plt.ylabel('Mean Petal Width')
plt.show()
Output:
Result:
Thus the program to explore various plotting functions on UCI data sets is performed.
EX NO: 4
DATE:
IMPLEMENT DECISION TREE CLASSIFICATION
Aim
To write a program to perform decision tree classification.
Procedure
Loads the Iris dataset.
Splits the dataset into features (X) and the target variable (y).
Splits the data into training and testing sets using 80% for training and 20% for testing.
Trains a decision tree classifier using scikit-learn's DecisionTreeClassifier.
Predicts the class labels on the test set.
Evaluates the model's accuracy using accuracy_score.
Prints a classification report containing precision, recall, F1-score, and support for each
class using classification_report.
Program
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load the dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris_data = pd.read_csv(data_url, names=names)
# Split features and target variable
X = iris_data.drop('class', axis=1)
y = iris_data['class']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train decision tree classifier
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)
# Predictions on the test set
predictions = classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
# Classification report
print("Classification Report:")
print(classification_report(y_test, predictions))
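The learned split rules can also be printed in text form with sklearn.tree.export_text; a sketch using scikit-learn's bundled copy of the Iris data (to avoid the download):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed rules readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Human-readable split rules learned by the tree.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```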
Output:
Result:
Thus a program to perform decision tree classification on Iris dataset is verified.
EX NO: 5
DATE:
IMPLEMENT CLUSTERING TECHNIQUES
Aim
To write a program to implement clustering techniques.
Procedure
Imports necessary libraries: numpy for numerical operations, matplotlib for plotting,
make_blobs to generate sample data, and KMeans for the K-means clustering algorithm.
Generates sample data using make_blobs function. You can replace this with your own
dataset.
Initializes K-Means with the desired number of clusters.
Fits the KMeans model to the data and predicts cluster labels.
Plots the data points with different colors representing different clusters.
Plots the centroids of the clusters in red.
Program
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Apply KMeans clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
# Plotting centroids
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()
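The number of clusters was fixed at 4 above; when it is unknown, the elbow method is a common heuristic. A sketch (same synthetic blobs) that tracks the inertia as k grows:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Within-cluster sum of squares for k = 1..8; the bend ("elbow") suggests k.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

The inertia always decreases as k grows; the point where the decrease flattens out (here, around k = 4) is the suggested cluster count.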
OUTPUT:
Result:
Thus a program to implement clustering techniques is verified.