VIKAS VIDYA EDUCATION TRUST'S
Lords Universal College
Department of Computer Science
CERTIFICATE
This is to certify that Mr./Ms. of
Uni. Exam No. ___________ (____ Semester) has satisfactorily completed
Practical, in the subject of as a
part of B.Sc. in Computer Science Program during the academic year 20 -
20_________.
Place:
Date:
Subject In-charge                    Co-ordinator, Department of Computer Science
Signature of External Examiner
INDEX

Sr No.  Date      Practical                                                          Signature

1   23/01/25  Data Frames and Basic Data Pre-processing
    ● Read data from CSV and JSON files into a data frame.
    ● Perform basic data pre-processing tasks such as handling missing values and outliers.
    ● Manipulate and transform data using functions like filtering, sorting, and grouping.

2   30/01/25  Feature Scaling and Dummification
    ● Apply feature-scaling techniques like standardization and normalization to numerical features.
    ● Perform feature dummification to convert categorical variables into numerical representations.

3   30/01/25  Regression and Its Types
    ● Implement simple linear regression using a dataset.
    ● Explore and interpret the regression model coefficients and goodness-of-fit measures.
    ● Extend the analysis to multiple linear regression and assess the impact of additional predictors.

4   06/02/25  Logistic Regression and Decision Tree
    ● Build a logistic regression model to predict a binary outcome.
    ● Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
    ● Construct a decision tree model and interpret the decision rules for classification.

5   06/02/25  K-Means Clustering
    ● Apply the K-Means algorithm to group similar data points into clusters.
    ● Determine the optimal number of clusters using the elbow method or silhouette analysis.
    ● Visualize the clustering results and analyze the cluster characteristics.

6   13/02/25  Principal Component Analysis (PCA)
    ● Perform PCA on a dataset to reduce dimensionality.
    ● Evaluate the explained variance and select the appropriate number of principal components.
    ● Visualize the data in the reduced-dimensional space.

7   20/02/25  Introduction to Excel
    ● Perform conditional formatting on a dataset using various criteria.
    ● Create a pivot table to analyze and summarize data.
    ● Use the VLOOKUP function to retrieve information from a different worksheet or table.
    ● Perform what-if analysis using Goal Seek to determine input values for desired output.

8   20/02/25  Hypothesis Testing
    ● Formulate null and alternative hypotheses for a given problem.
    ● Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
    ● Interpret the results and draw conclusions based on the test outcomes.
Practical No: 01 - Data Frames and Basic Data Pre-processing
Aim: Read data from CSV and JSON files into a data frame. Perform basic data pre-processing tasks such as handling missing values and outliers. Manipulate and transform data using functions like filtering, sorting, and grouping.
In [1]: import pandas as pd
import numpy as np
# Reading a CSV file into a DataFrame
df=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")
print(df.head()) # Display the first 5 rows of the DataFrame
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
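The aim also asks for reading JSON; a minimal sketch, assuming an Iris-style JSON file exists at the path shown (the filename here is hypothetical):

# Reading a JSON file into a DataFrame (hypothetical path)
df_json = pd.read_json(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.json")
print(df_json.head())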
In [2]: # Step 2: Basic Data Exploration
df.head()
Out[2]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
In [3]: df.tail()
Out[3]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
145 146 6.7 3.0 5.2 2.3 Iris-virginica
146 147 6.3 2.5 5.0 1.9 Iris-virginica
147 148 6.5 3.0 5.2 2.0 Iris-virginica
148 149 6.2 3.4 5.4 2.3 Iris-virginica
149 150 5.9 3.0 5.1 1.8 Iris-virginica
In [4]: df.describe()
Out[4]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 75.500000 5.843333 3.054000 3.758667 1.198667
std 43.445368 0.828066 0.433594 1.764420 0.763161
min 1.000000 4.300000 2.000000 1.000000 0.100000
25% 38.250000 5.100000 2.800000 1.600000 0.300000
50% 75.500000 5.800000 3.000000 4.350000 1.300000
75% 112.750000 6.400000 3.300000 5.100000 1.800000
max 150.000000 7.900000 4.400000 6.900000 2.500000
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [6]: df.shape
Out[6]: (150, 6)
In [7]: df.dtypes
Out[7]: Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object
In [8]: df.columns
Out[8]: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
'Species'],
dtype='object')
In [9]: # Step 3: Checking for Missing Values
# Checking for missing values in each column of the CSV DataFrame
missing_values = df.isnull().sum()
print("\nMissing Values in CSV Data:")
print(missing_values)
Missing Values in CSV Data:
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
In [10]: # Drop the non-numeric Species column so the numeric fill/outlier steps below work column-wise
df = df.drop(columns=['Species'])
In [11]: # Step 4: Handling Missing Values
# We will fill missing values in columns with the mean of the column
# (You could also drop missing rows or use other strategies depending on your needs.)
fill = df.fillna(df.mean())
print("\nFilled Missing Values with Mean (CSV Data):")
print(df.head())
Filled Missing Values with Mean (CSV Data):
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
In [12]: # Alternatively, you can drop rows with missing values:
# df_csv_dropped = df_csv.dropna()
# print("\nDropped Rows with Missing Values (CSV Data):")
# print(df_csv_dropped.head())
In [13]: # Step 5: Handling Outliers
# Here we will calculate Z-scores and remove rows where Z-score is greater than 3
z_scores = np.abs((fill - fill.mean()) / fill.std())
z_scores
Out[13]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1.714797 0.897674 1.028611 1.336794 1.308593
1 1.691780 1.139200 0.124540 1.336794 1.308593
2 1.668762 1.380727 0.336720 1.393470 1.308593
3 1.645745 1.501490 0.106090 1.280118 1.308593
4 1.622728 1.018437 1.259242 1.336794 1.308593
... ... ... ... ... ...
145 1.622728 1.034539 0.124540 0.816888 1.443121
146 1.645745 0.551486 1.277692 0.703536 0.918985
147 1.668762 0.793012 0.124540 0.816888 1.050019
148 1.691780 0.430722 0.797981 0.930239 1.443121
149 1.714797 0.068433 0.124540 0.760212 0.787951
150 rows × 5 columns
In [14]: # Remove rows where any Z-score is greater than 3 (outliers)
do = fill[(z_scores < 3).all(axis=1)]
print("\nData After Removing Outliers (CSV Data):")
print(do.head())
Data After Removing Outliers (CSV Data):
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
In [15]: # Step 6: Filtering Data (Example: Select rows where a column value is greater tha
threshold_value = 3 # Example threshold value
filter = do[do['SepalLengthCm'] > threshold_value]
print(f"\nFiltered Data (Rows with column_name > {threshold_value}):")
print(filter)
Filtered Data (Rows with column_name > 3):
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
[149 rows x 5 columns]
In [16]: # Step 7: Sorting Data (Sorting by a column in descending order)
df_sorted = filter.sort_values(by='SepalWidthCm', ascending=False)
print(df_sorted.head())
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
33 34 5.5 4.2 1.4 0.2
32 33 5.2 4.1 1.5 0.1
14 15 5.8 4.0 1.2 0.2
16 17 5.4 3.9 1.3 0.4
5 6 5.4 3.9 1.7 0.4
In [17]: # Step 8: Grouping Data (Example: Group by a column and calculate aggregates of other columns)
df_grouped = df_sorted.groupby('SepalLengthCm').agg({
    'PetalLengthCm': 'mean',  # Mean petal length within each sepal-length group
    'PetalWidthCm': 'sum'     # Sum of petal width within each sepal-length group
}).reset_index()  # Reset index to avoid a multi-index
print("\nGrouped Data (Mean and Sum for Each Group):")
print(df_grouped.head())
Grouped Data (Mean and Sum for Each Group):
SepalLengthCm PetalLengthCm PetalWidthCm
0 4.3 1.100000 0.1
1 4.4 1.333333 0.6
2 4.5 1.300000 0.3
3 4.6 1.325000 0.9
4 4.7 1.450000 0.4
Practical No: 02 - Feature Scaling and Dummification
Part I: Apply feature-scaling techniques like standardization and normalization to numerical features.
In [1]: import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
In [2]: df = pd.read_csv(r"C:\Users\DELL\Desktop\wine.csv")
df
Out[2]: Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.pheno
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.2
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.2
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.3
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.2
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.3
... ... ... ... ... ... ... ... ...
173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.5
174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.4
175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.4
176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.5
177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.5
178 rows × 14 columns
In [3]: df1 = pd.read_csv(r"C:\Users\DELL\Desktop\wine.csv", usecols=[0, 1, 2], skiprows=1)
df1.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
df1
Original DataFrame:
Out[3]: classlabel Alcohol Malic Acid
0 1 13.20 1.78
1 1 13.16 2.36
2 1 14.37 1.95
3 1 13.24 2.59
4 1 14.20 1.76
... ... ... ...
172 3 13.71 5.65
173 3 13.40 3.91
174 3 13.27 4.28
175 3 13.17 2.59
176 3 14.13 4.10
177 rows × 3 columns
MinMax Scaler
MinMax Scaler shrinks the data into a given range, usually 0 to 1: the minimum of each feature is mapped to 0 and the maximum to 1. It transforms the data by scaling feature values to that range without changing the shape of the original distribution.
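For reference, MinMax scaling computes x' = (x - min) / (max - min) per column; a minimal sketch of the equivalent manual computation on one column (assuming df1 is loaded as above):

# Manual MinMax scaling of the Alcohol column (matches MinMaxScaler for that column)
col = df1['Alcohol']
print(((col - col.min()) / (col.max() - col.min())).head())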
In [ ]: scaling=MinMaxScaler()
scaled_value=scaling.fit_transform(df1[['Alcohol','Malic Acid']])
df1[['Alcohol','Malic Acid']]=scaled_value
print("\n Dataframe after MinMax Scaling")
df1
StandardScaler
StandardScaler is a scikit-learn preprocessing technique that standardizes features by removing the mean and scaling to unit variance: each feature is transformed to have a mean of zero and a standard deviation of one. Putting all features on the same scale prevents any single feature from dominating the learning process because of its larger magnitude.
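For reference, standardization computes z = (x - mean) / std per column; a minimal sketch of the equivalent manual computation (note StandardScaler divides by the population standard deviation, ddof=0):

# Manual standardization of the Alcohol column (matches StandardScaler for that column)
col = df1['Alcohol']
print(((col - col.mean()) / col.std(ddof=0)).head())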
In [4]: scaling=StandardScaler()
scaled_standardvalue=scaling.fit_transform(df1[['Alcohol','Malic Acid']])
df1[['Alcohol','Malic Acid']]=scaled_standardvalue
print("\n Dataframe after Standard Scaling")
df1
Dataframe after Standard Scaling
Out[4]: classlabel Alcohol Malic Acid
0 1 0.255824 -0.501624
1 1 0.206229 0.018020
2 1 1.706501 -0.349315
3 1 0.305420 0.224086
4 1 1.495719 -0.519543
... ... ... ...
172 3 0.888171 2.965658
173 3 0.503803 1.406725
174 3 0.342617 1.738222
175 3 0.218628 0.224086
176 3 1.408926 1.576953
177 rows × 3 columns
Part II: Perform feature dummification to convert categorical variables into numerical representations.
In [5]: import pandas as pd
from sklearn.preprocessing import LabelEncoder
In [6]: iris=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")
iris
Out[6]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3 Iris-virginica
146 147 6.3 2.5 5.0 1.9 Iris-virginica
147 148 6.5 3.0 5.2 2.0 Iris-virginica
148 149 6.2 3.4 5.4 2.3 Iris-virginica
149 150 5.9 3.0 5.1 1.8 Iris-virginica
150 rows × 6 columns
In [7]: le=LabelEncoder()
iris['code']=le.fit_transform(iris.Species)
iris
Out[7]:      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species  code
0             1            5.1           3.5            1.4           0.2     Iris-setosa     0
1             2            4.9           3.0            1.4           0.2     Iris-setosa     0
2             3            4.7           3.2            1.3           0.2     Iris-setosa     0
3             4            4.6           3.1            1.5           0.2     Iris-setosa     0
4             5            5.0           3.6            1.4           0.2     Iris-setosa     0
...         ...            ...           ...            ...           ...             ...   ...
145         146            6.7           3.0            5.2           2.3  Iris-virginica     2
146         147            6.3           2.5            5.0           1.9  Iris-virginica     2
147         148            6.5           3.0            5.2           2.0  Iris-virginica     2
148         149            6.2           3.4            5.4           2.3  Iris-virginica     2
149         150            5.9           3.0            5.1           1.8  Iris-virginica     2
150 rows × 7 columns
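LabelEncoder assigns ordinal integer codes; dummification in the one-hot sense is usually done with pd.get_dummies. A minimal sketch on the same Species column:

# One-hot (dummy) encoding of the Species column
iris_dummies = pd.get_dummies(iris, columns=['Species'], prefix='Species')
print(iris_dummies.head())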
Practical No: 03 - Regression and Its Types
Aim: Implement simple linear regression using a dataset. Explore and interpret the regression model coefficients and goodness-of-fit measures.
In [1]: import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score
In [2]: df = pd.read_csv(r"C:\Users\DELL\Downloads\fetch_california_housing.csv")
df.head()
Out[2]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitud
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.2
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.2
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.2
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.2
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.2
In [3]: df.tail()
Out[3]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Long
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 -
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 -
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -
In [4]: df.shape
Out[4]: (20640, 9)
In [5]: df.size
Out[5]: 185760
In [6]: df.describe()
Out[6]: MedInc HouseAge AveRooms AveBedrms Population AveOccup
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.00000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744 3.07065
std 1.899822 12.585558 2.474173 0.473911 1132.462122 10.38605
min 0.499900 1.000000 0.846154 0.333333 3.000000 0.69230
25% 2.563400 18.000000 4.440716 1.006079 787.000000 2.42974
50% 3.534800 29.000000 5.229129 1.048780 1166.000000 2.81811
75% 4.743250 37.000000 6.052381 1.099526 1725.000000 3.28226
max 15.000100 52.000000 141.909091 34.066667 35682.000000 1243.33333
In [7]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 MedHouseVal 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
In [8]: df.dtypes
Out[8]: MedInc float64
HouseAge float64
AveRooms float64
AveBedrms float64
Population float64
AveOccup float64
Latitude float64
Longitude float64
MedHouseVal float64
dtype: object
In [9]: #import ssl
#ssl._create_default_https_context = ssl._create_unverified_context
housing = fetch_california_housing()
# Convert to DataFrame
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df.head() # Print first few rows
Out[9]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitud
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.2
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.2
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.2
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.2
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.2
In [10]: housing_df['PRICE']=housing.target
X=housing_df[['AveRooms']]
y=housing_df[['PRICE']]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
In [11]: model=LinearRegression()
model.fit(X_train,y_train)
Out[11]: LinearRegression()
In [12]: mse=mean_squared_error(y_test,model.predict(X_test))
r2=r2_score(y_test,model.predict(X_test))
In [13]: print("Mean Squared Error: ", mse)
print("R-squared: ",r2)
print("Intercept: ",model.intercept_)
print("Co-efficient: ",model.coef_)
Mean Squared Error: 1.2923314440807299
R-squared: 0.013795337532284901
Intercept: [1.65476227]
Co-efficient: [[0.07675559]]
Part II: Extend the analysis to multiple
linear regression and assess the impact of
additional predictors.
In [14]: X = housing_df.drop('PRICE',axis=1)
y = housing_df['PRICE']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
In [15]: model = LinearRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
In [16]: mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)
In [17]: print("Mean Squared Error:",mse)
print("R-squared:",r2)
print("Intercept:",model.intercept_)
print("Coefficient:",model.coef_)
Mean Squared Error: 0.555891598695244
R-squared: 0.5757877060324511
Intercept: -37.023277706064064
Coefficient: [ 4.48674910e-01 9.72425752e-03 -1.23323343e-01 7.83144907e-01
-2.02962058e-06 -3.52631849e-03 -4.19792487e-01 -4.33708065e-01]
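To assess the impact of each predictor, the coefficients can be paired with their feature names; a minimal sketch using the fitted model and X from above:

# Label each coefficient with its feature name for easier interpretation
coef_table = pd.Series(model.coef_, index=X.columns)
print(coef_table)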
Practical No: 04 - Logistic Regression and Decision Tree
Part I: Build a logistic regression model to predict a binary outcome. Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
In [1]: import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
In [2]: df=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")
df
Out[2]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3 Iris-virginica
146 147 6.3 2.5 5.0 1.9 Iris-virginica
147 148 6.5 3.0 5.2 2.0 Iris-virginica
148 149 6.2 3.4 5.4 2.3 Iris-virginica
149 150 5.9 3.0 5.1 1.8 Iris-virginica
150 rows × 6 columns
In [3]: # Keep only two classes (note: Species holds string labels, so '!= 2' does not actually drop a class)
df1 = df[df['Species'] != 2]
df1
Out[3]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3 Iris-virginica
146 147 6.3 2.5 5.0 1.9 Iris-virginica
147 148 6.5 3.0 5.2 2.0 Iris-virginica
148 149 6.2 3.4 5.4 2.3 Iris-virginica
149 150 5.9 3.0 5.1 1.8 Iris-virginica
150 rows × 6 columns
In [4]: # Keep only two classes (intended filter; since Species holds string labels such as 'Iris-virginica',
# '!= 2' leaves all three classes in place -- filter on the label string for a true binary problem)
df = df[df['Species'] != 2]
# Define features and target
X = df.drop('Species', axis=1)
y = df['Species']
In [5]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\lin
ear_model\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=
1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Out[5]: LogisticRegression()
In [6]: # Predictions
y_pred_logistic = logistic_model.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_logistic))
print("\nClassification Report")
print(classification_report(y_test, y_pred_logistic))
Accuracy: 1.0
Classification Report
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 10
Iris-versicolor 1.00 1.00 1.00 9
Iris-virginica 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
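precision_score and recall_score (imported above) can report the metrics directly; because three classes remain in this run, an averaging mode is required, e.g.:

# Macro-averaged precision and recall (use average='binary' for a true two-class split)
print("Precision:", precision_score(y_test, y_pred_logistic, average='macro'))
print("Recall:", recall_score(y_test, y_pred_logistic, average='macro'))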
Part II: Construct a decision tree model
and interpret the decision rules for
classification.
In [7]: from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [8]: model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred_tree = model.predict(X_test)
y_pred_tree
Out[8]: array(['Iris-versicolor', 'Iris-setosa', 'Iris-virginica',
'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
'Iris-virginica', 'Iris-setosa', 'Iris-setosa'], dtype=object)
In [9]: # Print Decision Tree Metrics
print("\nDecision Tree Metrics")
print("Accuracy: ", accuracy_score(y_test, y_pred_tree))
print("\nClassification Report")
print(classification_report(y_test, y_pred_tree))
Decision Tree Metrics
Accuracy: 1.0
Classification Report
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 10
Iris-versicolor 1.00 1.00 1.00 9
Iris-virginica 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
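To interpret the decision rules, the fitted tree can be printed as nested if/else conditions; a minimal sketch using sklearn.tree.export_text with the model fitted above:

from sklearn.tree import export_text
# Print the learned decision rules of the fitted tree
print(export_text(model, feature_names=list(X.columns)))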
Practical No: 05 - K-Means Clustering
Aim: Apply the K-Means algorithm to group similar data points into clusters. Determine the optimal number of clusters using the elbow method or silhouette analysis. Visualize the clustering results and analyze the cluster characteristics.
In [1]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
In [2]: # Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Step 1: Elbow Method to find the optimal number of clusters
inertia = []
K_range = range(1, 11)
In [3]: for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertia.append(kmeans.inertia_)
In [4]: # Plot Elbow Curve
plt.plot(K_range, inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
In [5]: # Step 2: Apply K-Means with the chosen k (let's pick k=4)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X)
In [6]: # Step 3: Visualize Clustering Results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', edgecolors='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering')
plt.legend()
plt.show()
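The aim also mentions silhouette analysis; a minimal sketch using sklearn.metrics.silhouette_score on the same synthetic X (higher scores indicate better-separated clusters):

from sklearn.metrics import silhouette_score
# Silhouette score for k = 2..10 on the synthetic data
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")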
Practical No: 06 - Principal Component Analysis (PCA)
Aim: Perform PCA on a dataset to reduce dimensionality. Evaluate the explained variance and select the appropriate number of principal components. Visualize the data in the reduced-dimensional space.
In [1]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
In [2]: # Load dataset (Iris dataset)
data = load_iris()
X = data.data # Features
y = data.target # Labels
In [3]: # Standardize the data (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Perform PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
In [4]: # Evaluate explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
In [5]: # Plot explained variance
plt.figure(figsize=(6, 4))
plt.plot(range(1, len(explained_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.grid(True)
plt.show()
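To select the number of components programmatically, one common rule keeps the smallest number that explains at least 95% of the variance; a minimal sketch using the cumulative_variance computed above:

# Smallest number of components whose cumulative explained variance reaches 95%
n_components_95 = int(np.argmax(cumulative_variance >= 0.95)) + 1
print("Components needed for 95% variance:", n_components_95)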
In [6]: # Choose first two principal components for visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)
# Scatter plot of the first two principal components
plt.figure(figsize=(6, 4))
plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA: 2D Projection of Data')
plt.colorbar(label='Class Label')
plt.grid(True)
plt.show()
PRACTICAL 7
AIM: Introduction to Excel. Perform various operations in Excel.
A. Perform conditional formatting on a dataset using various criteria.
Step 1: Go to Conditional Formatting > Greater Than.
Step 2: Enter the "greater than" filter value, for example 2000.
Step 3: Go to Data Bars > Solid Fill in Conditional Formatting.
B. Create a pivot table to analyse and summarize data.
Steps:
Step 1: Select the entire table and go to the Insert tab > PivotChart > PivotChart.
Step 2: Select "New worksheet" in the Create PivotChart window.
Step 3: Select and drag attributes into the boxes below.
C. Use the VLOOKUP function to retrieve information from a different worksheet or table.
Steps:
Step 1: Click on an empty cell and type the following formula.
=VLOOKUP(B3,B3:D3,1,TRUE)
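For reference, the general syntax is =VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]); in practice the table_array should span the full lookup table (for example B2:E21, as in Part C below) rather than a single row, and FALSE is used for an exact match.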
D. Perform what-if analysis using Goal Seek to determine input values for desired output.
Step 1: In the Data tab, go to What-If Analysis > Goal Seek.
Step 2: Fill in the information in the window accordingly and click OK.
PART B
Aim: Create a pivot table in Excel for the following analysis and visualize the data using a pivot chart.
Steps:
1) First create the Excel data.
2) Select the entire dataset.
3) Select the PivotChart & PivotTable option from the Insert tab.
4) Select the table range and the cell where you want the output.
5) Answer the following questions:
i) Find the total sales.
ii) Find the sum of sales, color-wise.
iii) Find the sum of units.
iv) Find region-wise total sales and total units.
PART C
Aim: Apply VLOOKUP functions to retrieve information for the following queries.
Q.1) Find the part name for part number "A002".
1.) =VLOOKUP(B3,B2:E21,2,FALSE)
Q.2) Find the supplier ID for part name "Ball Joint".
2.) =VLOOKUP("Ball joint",C2:F21,4,FALSE)
Q.3) Find the part price for part name "Muffler".
Q.4) Find the status of part number "A008".
PRACTICAL 8
AIM: Conduct hypothesis testing using appropriate statistical tests (e.g., t-test, chi-square test).
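No notebook cells were included for this practical; a minimal sketch in Python using scipy.stats, with illustrative sample data (not from the original worksheet):

import numpy as np
from scipy import stats

# One-sample t-test: H0: the population mean equals 5.8 (illustrative data and hypothesis)
sample = np.array([5.1, 4.9, 4.7, 5.0, 5.4, 5.8, 6.0, 5.5, 5.2, 4.8])
t_stat, p_val = stats.ttest_1samp(sample, popmean=5.8)
print("t-statistic:", t_stat, "p-value:", p_val)

# Chi-square test of independence on an illustrative 2x2 contingency table
table = np.array([[20, 30], [25, 25]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print("chi-square:", chi2, "p-value:", p)

# Interpret the results at a 5% significance level
alpha = 0.05
print("Reject H0 (t-test)?", p_val < alpha)
print("Reject H0 (chi-square)?", p < alpha)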