PANIMALAR ENGINEERING COLLEGE
Department of CSE
Reg no: 211422104360
EX NO : 1
DATE:
Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and
Pandas packages. Reading data from text file, Excel and the web. Exploring various
commands for doing descriptive analytics on Iris dataset.
AIM:
To install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
PROCEDURE:
Installation
NumPy, SciPy, Jupyter, Statsmodels, and Pandas can be easily installed using Python's package
manager, pip. Open a terminal or command prompt and type the following commands one by one:
pip install numpy
pip install scipy
pip install jupyter
pip install statsmodels
pip install pandas
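After installation, a quick sanity check is to import each package and print its version; a minimal sketch (jupyter is best verified from the shell with `jupyter --version`, so it is omitted here):

```python
import importlib

# Print the installed version of each package, or flag it as missing.
versions = {}
for name in ["numpy", "scipy", "statsmodels", "pandas"]:
    try:
        mod = importlib.import_module(name)
        versions[name] = getattr(mod, "__version__", "unknown")
    except ImportError:
        versions[name] = "NOT installed"
    print(name, versions[name])
```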
Explore the Features:
NumPy: NumPy is a fundamental package for scientific computing with Python. It
provides support for arrays, matrices, and high-level mathematical functions to operate on
these arrays.
SciPy: SciPy is built on top of NumPy and provides additional functionality for scientific
computing. It includes modules for optimization, integration, interpolation, linear algebra,
and more.
Jupyter: Jupyter is a web-based interactive computing platform that allows you to create
and share documents containing live code, equations, visualizations, and narrative text.
Statsmodels: Statsmodels is a Python module that provides classes and functions for
estimating many different statistical models, as well as for conducting statistical tests and
exploring data.
21CS1711 DATA SCIENCE AND ANALYTICS LAB 1
Pandas: Pandas is a powerful data analysis and manipulation library for Python. It
provides data structures like Series and DataFrame, which are ideal for working with
structured data.
a. Exploring numpy
import numpy as np
print("==== 1. Array creation ====")
arr1 = np.array([1, 2, 3, 4])
print("Array from list : ", arr1)
arr2 = np.zeros((2, 3))
print("Array of zeroes : \n", arr2)
arr3 = np.ones((3, 2))
print("Array of ones : ", arr3)
arr4 = np.arange(0, 10, 2)
print("Array with range : ", arr4)
arr5 = np.linspace(0, 1, 5)
print("Array with linspace : ", arr5)
print("\n==== 2. Array operations ====")
arr6 = np.array([1, 2, 3, 4])
arr7 = np.array([5, 6, 7, 8])
sum_arr = arr6 + arr7
print("Array addition : ", sum_arr)
prod_arr = arr6 * arr7
print("Array multiplication : ", prod_arr)
exp_arr = arr6 ** 2
print("Array exponentiation : ", exp_arr)
print("\n===== 3. Indexing and slicing====")
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
element = matrix[1, 2]
print("Element at [1,2] : ", element)
sub_matrix = matrix[0:2, 1:3]
print("Sub-matrix : \n", sub_matrix)
mask = matrix > 5
print("Elements greater than 5 : \n", matrix[mask])
print("\n====4. Broadcasting ====")
arr8 = np.array([1, 2, 3])
arr9 = np.array([[10], [20], [30]])
broadcasted_result = arr8 + arr9
print("Broadcasted result: \n", broadcasted_result)
print("\n==== 5. Linear algebra====")
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
matmul_result = np.dot(a, b)
print("Matrix multiplication : \n", matmul_result)
det_a = np.linalg.det(a)
print("Determinant of matrix a : ", det_a)
print("\n==== 6. Statistical operations ====")
arr10 = np.array([1, 2, 3, 4, 5])
mean_val = np.mean(arr10)
print("Mean:", mean_val)
std_val = np.std(arr10)
print("Standard deviation :", std_val)
median_val = np.median(arr10)
print("Median:", median_val)
OUTPUT:
==== 1. Array creation ====
Array from list : [1 2 3 4]
Array of zeroes :
[[0. 0. 0.]
[0. 0. 0.]]
Array of ones : [[1. 1.]
[1. 1.]
[1. 1.]]
Array with range : [0 2 4 6 8]
Array with linspace : [0. 0.25 0.5 0.75 1. ]
==== 2. Array operations ====
Array addition : [ 6 8 10 12]
Array multiplication : [ 5 12 21 32]
Array exponentiation : [ 1 4 9 16]
===== 3. Indexing and slicing====
Element at [1,2] : 6
Sub-matrix :
[[2 3]
[5 6]]
Elements greater than 5 :
[6 7 8 9]
====4. Broadcasting ====
Broadcasted result:
[[11 12 13]
[21 22 23]
[31 32 33]]
==== 5. Linear algebra====
Matrix multiplication :
[[19 22]
[43 50]]
Determinant of matrix a : -2.0000000000000004
==== 6. Statistical operations ====
Mean: 3.0
Standard deviation : 1.4142135623730951
Median: 3.0
RESULT:
Hence, the above program for exploring the features of NumPy has been written and
executed successfully.
b. Exploring pandas
import pandas as pd
import numpy as np
print("\n==== 1. Create DataFrame ====")
data={
'Name':['Alice','Bob','Charlie','David','Edward'],
'Age':[24,27,22,32,29],
'City':['New York','Los Angeles','Chicago','Houstan','Phoenix'],
'Salary':[70000,80000,120000,90000,100000]
}
df = pd.DataFrame(data)
print("DataFrame created from a dictionary:\n", df)
print("\n==== 2. Indexing Operations ====")
age_column = df['Age']
print("Age column :\n", age_column)
row_2 = df.iloc[2]
print("\nRow 2:\n", row_2)
row_label = df.loc[1]
print("\nRow with label 1:\n", row_label)
print("\n==== 3. Filtering and Conditions ====")
filtered_df = df[df['Age'] > 25]
print("Filtered DataFrame (Age>25):\n", filtered_df)
filtered_df_multi_cond = df[(df['Age'] > 25) & (df['Salary'] < 100000)]
print("Filtered DataFrame (Age>25 and Salary<100000):\n", filtered_df_multi_cond)
print("\n==== 4. Summary statistics====")
summary_stats = df.describe()
print("Summary statistics of numeric columns :\n", summary_stats)
mean_salary = df['Salary'].mean()
print("\nMean salary:", mean_salary)
max_salary = df['Salary'].max()
print("\nMaximum salary:", max_salary)
print("\n==== 5. Grouping data ====")
grouped_by_city = df.groupby('City')['Salary'].mean()
print("\nAverage salary grouped by city :\n", grouped_by_city)
print("\n==== 6. Sorting data====")
sorted_by_salary = df.sort_values(by='Salary', ascending=False)
print("DataFrame sorted by Salary(descending):\n", sorted_by_salary)
sorted_by_age = df.sort_values(by='Age', ascending=True)
print("DataFrame sorted by Age(ascending):\n", sorted_by_age)
print("\n==== 7. Adding and removing columns====")
df['Experience'] = [2, 5, 1, 8, 4]
print("\nDataFrame with 'Experience' column added:\n", df)
df_dropped = df.drop(columns=['Experience'])
print("\nDataFrame after dropping 'Experience' column:\n", df_dropped)
print("\n===== 8. Merging DataFrames====")
data2 = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales']
}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df, df2, on='Name')
print("Merged DataFrame:\n", merged_df)
df_with_na = df.copy()
print("\n==== 9. Handling missing data====")
df_with_na.loc[1, 'Salary'] = np.nan
print("DataFrame with missing data:\n", df_with_na)
df_filled = df_with_na.fillna({'Salary': df['Salary'].mean()})
print("\nDataFrame after filling missing data:\n", df_filled)
df_dropped_na = df_with_na.dropna()
print("\nDataFrame after dropping rows with missing data:\n", df_dropped_na)
OUTPUT:
==== 1. Create DataFrame ====
DataFrame created from a dictionary:
Name Age City Salary
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
2 Charlie 22 Chicago 120000
3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000
==== 2. Indexing Operations ====
Age column :
0 24
1 27
2 22
3 32
4 29
Name: Age, dtype: int64
Row 2:
Name Charlie
Age 22
City Chicago
Salary 120000
Name: 2, dtype: object
Row with label 1:
Name Bob
Age 27
City Los Angeles
Salary 80000
Name: 1, dtype: object
==== 3. Filtering and Conditions ====
Filtered DataFrame (Age>25):
Name Age City Salary
1 Bob 27 Los Angeles 80000
3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000
Filtered DataFrame (Age>25 and Salary<100000):
Name Age City Salary
1 Bob 27 Los Angeles 80000
3 David 32 Houstan 90000
==== 4. Summary statistics====
Summary statistics of numeric columns :
Age Salary
count 5.000000 5.000000
mean 26.800000 92000.000000
std 3.962323 19235.384062
min 22.000000 70000.000000
25% 24.000000 80000.000000
50% 27.000000 90000.000000
75% 29.000000 100000.000000
max 32.000000 120000.000000
Mean salary: 92000.0
Maximum salary: 120000
==== 5. Grouping data ====
Average salary grouped by city :
City
Chicago 120000.0
Houstan 90000.0
Los Angeles 80000.0
New York 70000.0
Phoenix 100000.0
Name: Salary, dtype: float64
==== 6. Sorting data====
DataFrame sorted by Salary(descending):
Name Age City Salary
2 Charlie 22 Chicago 120000
4 Edward 29 Phoenix 100000
3 David 32 Houstan 90000
1 Bob 27 Los Angeles 80000
0 Alice 24 New York 70000
DataFrame sorted by Age(ascending):
Name Age City Salary
2 Charlie 22 Chicago 120000
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
4 Edward 29 Phoenix 100000
3 David 32 Houstan 90000
==== 7. Adding and removing columns====
DataFrame with 'Experience' column added:
Name Age City Salary Experience
0 Alice 24 New York 70000 2
1 Bob 27 Los Angeles 80000 5
2 Charlie 22 Chicago 120000 1
3 David 32 Houstan 90000 8
4 Edward 29 Phoenix 100000 4
DataFrame after dropping 'Experience' column:
Name Age City Salary
0 Alice 24 New York 70000
1 Bob 27 Los Angeles 80000
2 Charlie 22 Chicago 120000
3 David 32 Houstan 90000
4 Edward 29 Phoenix 100000
===== 8. Merging DataFrames====
Merged DataFrame:
Name Age City Salary Experience Department
0 Alice 24 New York 70000 2 HR
1 Bob 27 Los Angeles 80000 5 IT
2 Charlie 22 Chicago 120000 1 Finance
3 David 32 Houstan 90000 8 Marketing
4 Edward 29 Phoenix 100000 4 Sales
==== 9. Handling missing data====
DataFrame with missing data:
Name Age City Salary Experience
0 Alice 24 New York 70000.0 2
1 Bob 27 Los Angeles NaN 5
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4
DataFrame after filling missing data:
Name Age City Salary Experience
0 Alice 24 New York 70000.0 2
1 Bob 27 Los Angeles 92000.0 5
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4
DataFrame after dropping rows with missing data:
Name Age City Salary Experience
0 Alice 24 New York 70000.0 2
2 Charlie 22 Chicago 120000.0 1
3 David 32 Houstan 90000.0 8
4 Edward 29 Phoenix 100000.0 4
RESULT:
Hence, the above program for exploring the various features of Pandas has been
written and executed successfully.
c. Exploring scipy and statsmodels
import statsmodels.api as sm
import pandas as pd
# 1. Create sample dataset
data = {
'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Exam_Score': [35, 37, 40, 43, 45, 50, 53, 55, 58, 60]
}
df = pd.DataFrame(data)
print("Sample Data:\n", df)
X = df['Hours_Studied']
y = df['Exam_Score']
X = sm.add_constant(X) # Add intercept term
model = sm.OLS(y, X).fit()
df['Predicted_Score'] = model.predict(X)
print("\nPredicted Scores:\n", df)
OUTPUT:
RESULT:
Hence, the above program for exploring the various features of Statsmodels has been
written and executed successfully.
d. Reading data from a text file, Excel and the web
i. To read data from a text file (such as a CSV file), Pandas' read_csv() function is used.
import pandas as pd
# Read data from a CSV file
df = pd.read_csv('data.csv')  # illustrative file name
# Display the first few rows of the DataFrame
print(df.head())
ii. Reading data from an Excel file is similar to reading from a text file, but Pandas'
read_excel() function is used.
# Read data from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # illustrative file name
# Display the first few rows of the DataFrame
print(df.head())
iii. Pandas can also read data directly from a URL.
# Read data from a URL
url = 'https://example.com/data.csv'  # illustrative URL; use any direct link to a CSV file
df = pd.read_csv(url)
# Display the first few rows of the DataFrame
print(df.head())
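For a plain, non-delimited text file (like the sample shown in the output below), Python's built-in open() is enough; a minimal sketch (the file name is illustrative):

```python
# Write and then read back a plain text file (illustrative file name).
with open("sample.txt", "w") as f:
    f.write("Hello, world!\nThis is a sample text file.\n")

with open("sample.txt") as f:
    content = f.read()
print(content)
```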
OUTPUT
Sample text file:
Hello, world!
This is a sample text file.
Output after reading the text file:
Hello, world!
This is a sample text file.
Sample Excel file:
Output after reading the Excel file
Name Age Gender
0 Alice 30 Female
1 Bob 25 Male
2 Charlie 35 Male
Output after reading the data from the URL
Column1 Column2 Column3
0 value value value
1 value value value
2 value value value
3 value value value
4 value value value
Result:
Thus the programs to read data from a text file, Excel and the web are executed.
e. Exploring various commands for doing descriptive analytics on iris dataset.
i. Importing relevant libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
from sklearn import metrics
%matplotlib inline
ii. Loading & printing iris data
iris_data = pd.read_csv('Iris.csv')  # illustrative file name for the Iris CSV
print(iris_data)
iii. Displaying the top rows of the dataset with their columns.
The head() function displays the top rows of the dataset; its default value is 5, that is, it
shows the top 5 rows when no argument is given to it.
print(iris_data.head())
iv. Displaying the shape of the dataset.
The shape of the dataset gives the total number of rows or entries and the total
number of columns or features of that particular dataset.
print(iris_data.shape)
v. Summary of the DataFrame's information
The info() method provides a concise summary of the DataFrame. This includes
the number of non-null values in each column, the data type of each column, and memory
usage.
print(iris_data.info())
vi. Statistical Insight
The describe() function applies basic statistical computations on the dataset like extreme
values, count of data points, standard deviation, etc. Any missing value or NaN value is
automatically skipped. The describe() function gives a good picture of the distribution of data.
print(iris_data.describe())
vii. Checking For Duplicate Entries
The duplicated() method in pandas returns a boolean Series indicating duplicate rows in a
DataFrame. It marks each row as True if it is a duplicate of a previous row, and False
otherwise.
print(iris_data[iris_data.duplicated()])
viii. Checking the balance
The value_counts() method in pandas is used to count the occurrences of unique values in a
Series. This code will print the counts of unique species in the 'Species' column of the
DataFrame iris_data. Each unique species name will be listed along with its count.
print(iris_data['Species'].value_counts())
DATA VISUALIZATION
ix. Species count
A count plot is used to visualize the number of samples of each species in the Iris dataset
using seaborn.
# Set the title for the plot
plt.title('Species Count')
# Create a count plot to visualize the count of each species
sns.countplot(x='Species', data=iris_data)
# Display the plot
plt.show()
x. Univariate analysis
Univariate analysis involves examining the distribution and characteristics of a single
variable. It helps in understanding the basic properties of individual variables in the dataset.
sns.scatterplot(x='x_variable', y='y_variable', hue='hue_variable', data=df, s=marker_size)
COMPARISON OF DIFFERENT SPECIES DEPENDING ON SEPAL WIDTH AND
LENGTH.
plt.figure(figsize=(17, 9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(x=iris_data['sepal_length'], y=iris_data['sepal_width'], hue=iris_data['species'], s=50)
plt.show()
COMPARISON OF DIFFERENT SPECIES DEPENDING ON PETAL WIDTH AND
LENGTH.
plt.figure(figsize=(16, 9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(x=iris_data['petal_length'], y=iris_data['petal_width'], hue=iris_data['species'], s=50)
plt.show()
xi. Bi-variate Analysis
sns.pairplot(data=df, hue='hue_variable', height=plot_height)
sns.pairplot(iris_data, hue="species", height=4)
OUTPUT
Printing iris data
Displaying the top rows
Displaying the shape of the dataset
The DataFrame contains 150 rows and 6 columns.
Summary of the DataFrame's information
Statistical Insight
Checking For Duplicate Entries
Checking the balance
Visualizing the target column
Comparison of different species depending on sepal width and length.
Comparison of different species depending on petal width and length
Bi-variate Analysis
Result:
Thus the program to do the descriptive analytics on Iris dataset and exploring the
features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages are executed.
EX NO: 2
DATE:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
a. Univariate analysis
b. Bivariate analysis
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
a. Univariate analysis
Aim
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Univariate analysis.
Procedure
Univariate analysis involves the examination of a single variable at a time. It focuses on
understanding the distribution and characteristics of that variable without considering any
relationships with other variables.
Load the datasets: Load the diabetes dataset from UCI and the Pima Indians Diabetes dataset
into pandas DataFrames.
Univariate analysis: Calculate the frequency, mean, median, mode, variance, standard
deviation, skewness, and kurtosis for each variable in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'  # commonly used mirror of the Pima dataset
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
           'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)
print(df.head())
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
uci_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
uci_df['target'] = diabetes.target
print(uci_df.head())
import matplotlib.pyplot as plt
import seaborn as sns
# Example for Pima
for col in df.columns[:-1]:  # exclude Outcome
    sns.histplot(df[col], kde=True)
    plt.title(f'Pima: {col}')
    plt.show()
# Example for UCI
for col in uci_df.columns[:-1]:  # exclude target
    sns.histplot(uci_df[col], kde=True)
    plt.title(f'UCI: {col}')
    plt.show()
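The procedure above also lists mean, median, mode, variance, standard deviation, skewness and kurtosis; pandas exposes each directly. A minimal sketch on a hypothetical numeric sample (with the real datasets, apply the same calls to each column of df):

```python
import pandas as pd

# Hypothetical numeric sample standing in for one dataset column.
s = pd.Series([1, 2, 2, 3, 4, 5, 5, 5, 6, 7])

stats = {
    "mean": s.mean(),
    "median": s.median(),
    "mode": s.mode().iloc[0],      # first mode if there are ties
    "variance": s.var(),
    "std": s.std(),
    "skewness": s.skew(),
    "kurtosis": s.kurt(),
}
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```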
OUTPUT
Result:
Thus the program for performing the Univariate analysis is executed.
b. Bivariate Analysis:
Aim:
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Bivariate analysis.
Procedure
Import Necessary Libraries:
Import pandas for data manipulation and seaborn/matplotlib for visualization.
Load the Datasets:
Use the Pima and UCI diabetes DataFrames loaded in the previous section.
Explore Pairwise Relationships:
Draw box plots of each feature against the outcome to see how its distribution
differs between the outcome classes.
Bin the Continuous Target:
For the UCI dataset, bin the continuous target into Low/Medium/High groups with
pd.qcut so it can be used as a grouping variable.
Compute Correlations:
Draw a correlation heatmap for each dataset to quantify the pairwise linear
relationships between variables.
import seaborn as sns
import matplotlib.pyplot as plt
# Loop through all columns except 'Outcome'
for col in df.columns[:-1]:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x='Outcome', y=col, data=df)
    plt.title(f'{col} vs Outcome (Pima)')
    plt.show()
uci_df['target_bin'] = pd.qcut(uci_df['target'], q=3, labels=['Low', 'Medium', 'High'])
for col in uci_df.columns[:-2]:  # exclude 'target' and 'target_bin'
    plt.figure(figsize=(6, 4))
    sns.boxplot(x='target_bin', y=col, data=uci_df)
    plt.title(f'{col} vs Target Bin (UCI)')
    plt.show()
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation between Variables")
plt.show()
# Drop the 'target_bin' column before calculating correlation
plt.figure(figsize=(12, 8))
sns.heatmap(uci_df.drop(columns=['target_bin']).corr(), annot=True, cmap='coolwarm')
plt.title('UCI Diabetes Dataset Correlation')
plt.show()
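SciPy can complement the box plots with a single numeric measure of bivariate association; a sketch using scipy.stats.pearsonr on illustrative paired values (hypothetical numbers, not the real datasets):

```python
from scipy import stats

# Illustrative paired samples (hypothetical values, not the real datasets).
glucose = [85, 95, 110, 126, 140, 155, 170, 182]
risk_score = [0.10, 0.15, 0.30, 0.45, 0.55, 0.70, 0.80, 0.90]

# Pearson correlation coefficient and the two-sided p-value.
r, p = stats.pearsonr(glucose, risk_score)
print(f"Pearson r = {r:.3f}, p-value = {p:.6f}")
```

A value of r near 1 indicates a strong positive linear relationship between the pair.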
OUTPUT
Result:
Thus the program for performing the Bivariate analysis is executed.
c. Multivariate Analysis:
Aim:
To use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the Multivariate analysis.
Procedure:
Load the Pima Indians Diabetes dataset from the UCI repository.
Split the dataset into features (X) and the target variable (y).
Split the data into training and testing sets using 80% for training and 20% for testing.
Train a multivariate regression model using scikit-learn's LinearRegression.
Predict the outcome on the test set.
Evaluate the model using Mean Squared Error (MSE).
Program
Multivariate Analysis using PIMA DATASETS
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
X_pima = df.drop(columns=['Glucose', 'Outcome'])
y_pima = df['Glucose']
X_train, X_test, y_train, y_test = train_test_split(X_pima, y_pima, test_size=0.2,
random_state=42)
model_pima = LinearRegression()
model_pima.fit(X_train, y_train)
y_pred_pima = model_pima.predict(X_test)
print("PIMA:")
print("R² Score:", r2_score(y_test, y_pred_pima))
print("MSE:", mean_squared_error(y_test, y_pred_pima))
OUTPUT
PIMA:
R² Score: 0.135973429426115
MSE: 869.4777053181734
Multivariate Analysis using UCI DATASETS
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
# Drop the 'target_bin' column before splitting the data
X_uci = uci_df.drop(columns=['target', 'target_bin'])
y_uci = uci_df['target']
X_train, X_test, y_train, y_test = train_test_split(X_uci, y_uci, test_size=0.2,
random_state=42)
model_uci = LinearRegression()
model_uci.fit(X_train, y_train)
y_pred_uci = model_uci.predict(X_test)
print("UCI:")
print("R² Score:", r2_score(y_test, y_pred_uci))
print("MSE:", mean_squared_error(y_test, y_pred_uci))
OUTPUT
UCI:
R² Score: 0.4526027629719195
MSE: 2900.193628493482
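Point (d) of the exercise asks for a comparison of the two analyses; a minimal side-by-side print using the scores from the runs above (hard-coded here, so the numbers only illustrate the comparison):

```python
# R² and MSE values copied from the two runs above.
results = {
    "PIMA": {"r2": 0.1360, "mse": 869.48},
    "UCI":  {"r2": 0.4526, "mse": 2900.19},
}
for name, m in results.items():
    print(f"{name}: R2 = {m['r2']:.4f}, MSE = {m['mse']:.2f}")
# The UCI model explains more variance (higher R²). The MSE values are not
# directly comparable because the two targets are on different scales.
```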
Result:
Thus the program for performing the Multivariate analysis is executed.
EX NO : 3
DATE:
APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS
Aim
To explore various plotting functions on UCI data sets.
Procedure:
Download a dataset from the UCI Machine Learning Repository (e.g. the Iris dataset).
Save it in Downloads or any other folder and install the required packages.
Apply the following commands on the dataset.
The output will be displayed.
Program:
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(data_url, names=names)
# Plot a histogram of sepal length
plt.hist(dataset['sepal-length'], bins=10)
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.title('Histogram of Sepal Length')
plt.show()
# Plot a scatter plot of sepal length vs sepal width
plt.scatter(dataset['sepal-length'], dataset['sepal-width'])
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Scatter Plot of Sepal Length vs Sepal Width')
plt.show()
# Plot a box plot of petal length for each class
dataset.boxplot(column='petal-length', by='class')
plt.title('Box Plot of Petal Length for Each Class')
plt.xlabel('Class')
plt.ylabel('Petal Length')
plt.show()
# Plot a bar chart of the mean petal width for each class
class_means = dataset.groupby('class')['petal-width'].mean()
class_means.plot(kind='bar')
plt.title('Mean Petal Width for Each Class')
plt.xlabel('Class')
plt.ylabel('Mean Petal Width')
plt.show()
Output:
Result:
Thus the program to explore various plotting functions on UCI data sets is performed.
EX NO: 4
DATE:
IMPLEMENT DECISION TREE CLASSIFICATION
Aim
To write a program to perform decision tree classification.
Procedure
Loads the Iris dataset.
Splits the dataset into features (X) and the target variable (y).
Splits the data into training and testing sets using 80% for training and 20% for testing.
Trains a decision tree classifier using scikit-learn's DecisionTreeClassifier.
Predicts the class labels on the test set.
Evaluates the model's accuracy using accuracy_score.
Prints a classification report containing precision, recall, F1-score, and support for each
class using classification_report.
Program
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load the dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris_data = pd.read_csv(data_url, names=names)
# Split features and target variable
X = iris_data.drop('class', axis=1)
y = iris_data['class']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train decision tree classifier
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)
# Predictions on the test set
predictions = classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
# Classification report
print("Classification Report:")
print(classification_report(y_test, predictions))
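The learned split rules can also be printed in text form with sklearn.tree.export_text; a sketch using scikit-learn's bundled copy of the Iris data (to avoid the download):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed rules readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Human-readable split rules learned by the tree.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```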
Output:
Result:
Thus a program to perform decision tree classification on Iris dataset is verified.
EX NO: 5
DATE:
IMPLEMENT CLUSTERING TECHNIQUES
Aim
To write a program to implement clustering techniques.
Procedure
Imports necessary libraries: numpy for numerical operations, matplotlib for plotting,
make_blobs to generate sample data, and KMeans for the K-means clustering algorithm.
Generates sample data using make_blobs function. You can replace this with your own
dataset.
Initializes K-Means with the desired number of clusters.
Fits the KMeans model to the data and predicts cluster labels.
Plots the data points with different colors representing different clusters.
Plots the centroids of the clusters in red.
Program
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Apply KMeans clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
# Plotting centroids
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()
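The number of clusters was fixed at 4 above; when it is unknown, the elbow method is a common heuristic. A sketch (same synthetic blobs) that tracks the inertia as k grows:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Within-cluster sum of squares for k = 1..8; the bend ("elbow") suggests k.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

The inertia always decreases as k grows; the point where the decrease flattens out (here, around k = 4) is the suggested cluster count.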
OUTPUT:
Result:
Thus a program to implement clustering techniques is verified.