UNIT II
UNIT II: Data Preprocessing: An Overview, Data Cleaning, Data Integration, Data
Reduction, Data Transformation and Data Discretization.
Data preprocessing is the process of preparing raw data for analysis by cleaning and
transforming it into a usable format. In data mining it refers to preparing raw data for mining
by performing tasks like cleaning, transforming, and organizing it into a format suitable for
mining algorithms.
Data preprocessing is important because raw data is often messy, inconsistent, and unsuitable
for direct use in analysis or model training. Without it, your results can be misleading,
inaccurate, or completely useless.
1. Ensures Data Quality
Real-world datasets often contain missing values, duplicates, or outliers.
Preprocessing detects and fixes these issues, ensuring your model works on clean and
reliable data.
2. Improves Accuracy of Models
Machine learning algorithms are sensitive to noise, irrelevant features, and scale
differences.
By normalizing, standardizing, or selecting features, preprocessing improves prediction
accuracy.
3. Handles Inconsistencies
Different sources may store data in different formats (e.g., “Male/Female” vs. “M/F”).
Preprocessing makes data consistent and comparable.
4. Reduces Complexity
Dimensionality reduction (e.g., PCA) or feature selection removes redundant or
irrelevant data, making models faster and easier to train.
5. Prevents Bias from Bad Data
Incorrect or unbalanced datasets can bias models toward wrong conclusions.
Preprocessing ensures fairer, more representative datasets.
6. Saves Time in the Long Run
Clean, well-structured data is easier to reuse and share for multiple experiments.
Steps in Data Preprocessing
Some key steps in data preprocessing are Data Cleaning, Data Integration, Data
Transformation, and Data Reduction.
Data Cleaning: It is the process of identifying and correcting errors or inconsistencies in the
dataset. It involves handling missing values, removing duplicates, and correcting incorrect or
outlier data to ensure the dataset is accurate and reliable. Clean data is essential for effective
analysis, as it improves the quality of results and enhances the performance of data models.
Missing Values: These occur when data is absent from a dataset. You can either ignore
the rows with missing data or fill the gaps manually, with the attribute mean, or with
the most probable value. This keeps the dataset accurate and complete for analysis.
Noisy Data: It refers to irrelevant or incorrect data that is difficult for machines to
interpret, often caused by errors in data collection or entry. It can be handled in several
ways:
o Binning Method: The sorted data is divided into equal-sized segments (bins), and
each bin is smoothed by replacing its values with the bin mean or the bin boundary values.
o Regression: Data can be smoothed by fitting it to a regression function, either
linear or multiple, to predict values.
o Clustering: This method groups similar data points together; values that fall
outside the clusters can be treated as outliers (noise) and removed. These techniques
help remove noise and improve data quality.
Removing Duplicates: It involves identifying and eliminating repeated data entries
to ensure accuracy and consistency in the dataset. This process prevents errors and
ensures reliable analysis by keeping only unique records.
Data Integration
It involves merging data from various sources into a single, unified dataset. It can be
challenging due to differences in data formats, structures, and meanings. Techniques like
record linkage and data fusion help in combining data efficiently, ensuring consistency and
accuracy.
Data Transformation: It involves converting data into a format suitable for analysis.
Common techniques include normalization, which scales data to a common range;
standardization, which adjusts data to have zero mean and unit variance; and discretization,
which converts continuous data into discrete categories. These techniques help prepare the
data for more accurate analysis.
Data Normalization: The process of scaling data to a common range to ensure
consistency across variables.
Discretization: Converting continuous data into discrete categories for easier
analysis.
Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.
Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to
provide a higher-level view for better understanding and analysis.
Data Reduction: It reduces the dataset's size while maintaining key information. This can
be done through feature selection, which chooses the most relevant features, and feature
extraction, which transforms the data into a lower-dimensional space while preserving
important details. It uses various reduction techniques such as:
Dimensionality Reduction (e.g., Principal Component Analysis): A technique that
reduces the number of variables in a dataset while retaining its essential information.
Numerosity Reduction: Reducing the number of data points by methods like
sampling to simplify the dataset without losing critical patterns.
Data Compression: Reducing the size of data by encoding it in a more compact form,
making it easier to store and process.
Data cleaning: Data cleaning (also called data cleansing or data scrubbing) is the process
of detecting and correcting (or removing) errors, inconsistencies, and inaccuracies in a
dataset so it becomes accurate, complete, and reliable for analysis.
Importance & Benefits of Data Cleaning
1. Improves Data Accuracy
Removes errors, duplicates, and inconsistencies so your dataset truly reflects reality.
Example: Changing “Indai” → “India” ensures correct analysis.
Benefit: Reliable results and better decision-making.
2. Enhances Model Performance
Clean, consistent data helps machine learning algorithms learn patterns more
effectively.
Noisy or inconsistent data can confuse the model.
Benefit: Higher accuracy, precision, and recall.
3. Improves Consistency Across Systems
Standardizing formats (dates, text, units) ensures smooth integration with other
systems.
Benefit: Easier merging of datasets from different sources.
4. Helps in Compliance & Governance
Clean, well-maintained datasets are essential for following legal requirements like
GDPR or HIPAA.
Benefit: Avoids legal risks and penalties.
Sources of Missing Values
Missing values occur when certain data points are not recorded, lost, or not applicable. They
can be introduced at data collection, entry, storage, or processing stages.
Data Entry Errors
Data Collection Problems
Data Loss During Transfer or Storage
Inapplicable or Irrelevant Data
Equipment or Sensor Failure
Data Filtering or Cleaning
Privacy or Confidentiality Restrictions
Data Cleaning Steps:
1) Data Inspection (Identify Problems)
Understand the dataset structure and quality.
Check for missing values, duplicates, outliers, inconsistencies.
Example: Using df.info() and df.describe() in Python to spot anomalies.
2) Remove or Handle Duplicate Data
Detect and delete identical records.
Example: df.drop_duplicates() in Pandas.
3) Handle Missing Values
Methods:
o Remove rows/columns with too many missing values.
o Fill in (impute) using mean, median, mode, or prediction models.
Example: Filling missing "age" with the median value.
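A minimal pandas sketch of these imputation choices (the small "age" column below is made up for illustration):
import pandas as pd
# Hypothetical column with missing ages
df = pd.DataFrame({'age': [25, None, 32, 40, None, 29]})
df['age_median'] = df['age'].fillna(df['age'].median())  # impute with the median
df['age_mean'] = df['age'].fillna(df['age'].mean())      # impute with the mean
print(df)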
4) Correct Structural Errors
Fix typos, inconsistent capitalization, wrong formats.
Example: “M”, “Male”, “male” → “Male”.
5) Handle Outliers
Detect using statistical methods (Z-score, IQR).
Decide to remove, cap, or transform them.
Example: Removing height values like 500 cm.
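A small sketch of IQR-based outlier removal (the "height_cm" values are invented; 500 cm plays the role of the outlier):
import pandas as pd
df = pd.DataFrame({'height_cm': [160, 172, 168, 500, 175, 158]})
q1, q3 = df['height_cm'].quantile([0.25, 0.75])
iqr = q3 - q1
# Keep only values inside the usual 1.5 * IQR fences; the 500 cm entry is dropped
mask = df['height_cm'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[mask])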
6) Standardize Data Formats
Make sure all dates, numbers, and text follow the same format.
Example: Convert MM/DD/YYYY to YYYY-MM-DD.
7) Validate Data Consistency
Check relationships between columns.
Example: If Purchase Date < Registration Date, it’s invalid.
8) Final Review & Export
Verify that data is clean.
Save cleaned data for analysis or modelling.
Handling Noisy Data:
Binning
Data binning or bucketing is a data preprocessing method used to minimize the effects of
small observation errors. The original data values are divided into small intervals known as
bins and then they are replaced by a general value calculated for that bin. This has a
smoothing effect on the input data and may also reduce the chances of overfitting in the case
of small datasets.
Types of Binning Techniques
Binning can be broadly categorized into two main types based on how the bins are defined:
1. Equal-Width Binning
Each bin has an equal width, determined by dividing the range of the data into n intervals.
Formula: Width = (max − min) / n
Advantages: Simple to implement and easy to understand.
Disadvantages: May result in bins with highly uneven data distribution.
2. Equal-Frequency Binning
Each bin contains approximately the same number of data points.
Advantages: Ensures balanced bin sizes, avoiding sparse bins.
Disadvantages: The bin width may vary significantly.
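Both types can be sketched with pandas (the sample values are made up; pd.cut gives equal-width bins, pd.qcut gives equal-frequency bins):
import pandas as pd
values = pd.Series([3, 6, 8, 5, 10, 12, 14, 11, 13, 40])
equal_width = pd.cut(values, bins=3)   # same interval width, possibly uneven counts
equal_freq = pd.qcut(values, q=3)      # roughly the same number of points per bin
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())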
Applications of Binning
1. Data Preprocessing: Often used to prepare data for machine learning models by
converting continuous variables into categorical ones.
2. Anomaly Detection: Helps identify anomalies or outliers by binning data and
analyzing the distributions.
3. Data Visualization: Used in histograms and bar charts to represent the frequency
distribution of data.
4. Feature Engineering: Creates categorical features that can enhance the performance
of certain machine learning models.
Data Integration:
Data integration is the process of combining data from multiple sources into a unified view
so it can be used for analysis, reporting, or machine learning.
It’s a key step in data preprocessing when working with datasets from different places.
Challenges in Data Integration
1. Schema Differences
o Same information stored in different formats or field names.
o Example: "Customer_ID" in one file vs. "CustID" in another.
2. Data Redundancy
o Duplicate records across sources.
o Example: Same customer record in both sales and marketing databases.
3. Data Inconsistency
o Conflicting values for the same attribute.
o Example: Customer age is 30 in one dataset and 32 in another.
4. Different Units or Scales
o Example: Weight recorded in pounds vs. kilograms.
5. Missing Values
o Some sources may not have certain attributes.
Example:
import pandas as pd
from sklearn.datasets import load_iris
# Load iris dataset as DataFrame
iris = load_iris(as_frame=True)
df_iris = iris.frame
# Create dummy sales data (same number of rows = 150)
df_sales = pd.DataFrame({
'OrderID': range(1001, 1001 + len(df_iris)),
'Product': ['Pen', 'Book', 'Pencil'] * 50,
'Amount': [50, 150, 75] * 50
})
# Combine column-wise
combined = pd.concat([df_iris, df_sales], axis=1)
print(combined.head())
Output: the first five rows of the combined DataFrame (the Iris feature columns and target alongside OrderID, Product, and Amount).
Data Reduction:
Data Reduction refers to the process of reducing the volume of data while maintaining its
informational quality. It involves methods for minimizing, summarizing, or simplifying data
while preserving its fundamental properties for storage or analysis.
Dimensionality Reduction:
Dimensionality reduction is the process of reducing the number of input variables (features)
in a dataset while keeping as much useful information as possible.
It’s mainly used to simplify data, speed up processing, and avoid problems like overfitting.
Types of Dimensionality Reduction
Feature Selection
Selecting the most important features and discarding the rest.
Methods:
o Filter methods (Correlation, Chi-square test)
o Wrapper methods (Forward Selection, Backward Elimination)
o Embedded methods (Lasso, Decision Tree feature importance)
Correlation:
Correlation measures the strength and direction of the relationship between two variables.
Values range from –1 to +1.
o +1 → Perfect positive relationship (when one increases, the other increases).
o –1 → Perfect negative relationship (when one increases, the other decreases).
o 0 → No relationship.
Common measure: Pearson correlation coefficient (for numerical data).
Formula for Pearson correlation (r):
r = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² × Σ(y_i − ȳ)² )
Example:
Height and weight → High positive correlation (~0.85).
Age and gaming hours → Negative correlation (~–0.40).
Uses in preprocessing:
Detect multicollinearity (too-high correlation) → remove redundant features.
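A quick sketch of computing a Pearson correlation matrix with pandas (using the Iris features as sample data):
import pandas as pd
from sklearn.datasets import load_iris
df = load_iris(as_frame=True).frame.iloc[:, :4]   # the four numeric features
corr_matrix = df.corr(method='pearson')           # pairwise Pearson correlations
print(corr_matrix.round(2))
# Feature pairs with |r| close to 1 (e.g. petal length vs. petal width)
# are candidates for removal as redundant.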
2. Chi-Square Test
The Chi-Square (χ²) test checks whether there is a significant relationship between
two categorical variables.
It compares observed frequencies (from data) with expected frequencies (if there was
no relationship).
Formula:
χ² = Σ (O_i − E_i)² / E_i
Where:
O_i = Observed frequency
E_i = Expected frequency
Steps:
1. Create a contingency table of observed counts.
2. Calculate expected counts assuming no relationship.
3. Apply the formula and compare the result with the Chi-Square critical value (from a
statistical table) or use p-value.
Example:
Does gender (Male/Female) affect product preference (A/B)?
If p-value < 0.05 → Significant relationship.
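A minimal sketch of this test using scipy.stats.chi2_contingency; the gender vs. product-preference counts below are invented for illustration:
import pandas as pd
from scipy.stats import chi2_contingency
# Hypothetical contingency table of observed counts
observed = pd.DataFrame({'Product_A': [30, 20],
                         'Product_B': [10, 40]},
                        index=['Male', 'Female'])
chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-square:", round(chi2, 2), "p-value:", round(p_value, 4))
if p_value < 0.05:
    print("Significant relationship between gender and product preference.")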
Feature Extraction
Transforming existing features into a smaller set of new features.
Methods:
o PCA (Principal Component Analysis) → Creates new uncorrelated
variables (principal components).
PCA (Principal Component Analysis): PCA is a dimensionality reduction technique used
to convert a large set of variables into a smaller set of new variables (called principal
components) while preserving as much important information (variance) as possible.
Steps in PCA
1. Standardize the Data
o Ensure all features have the same scale (mean = 0, variance = 1).
2. Compute the Covariance Matrix
o Measures how features vary together.
3. Calculate Eigenvalues & Eigenvectors
o Eigenvalues → amount of variance explained by each principal component.
o Eigenvectors → directions of the principal components.
4. Sort Components by Variance
o Rank them by eigenvalues in descending order.
5. Select Top k Components
o Choose enough components to explain desired variance (e.g., 95%).
6. Transform the Data
o Multiply original data by selected eigenvectors to get the reduced dataset.
Example:
# PCA on Iris Dataset - Complete Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# 1. Load Iris Dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target # Add species as target
species_names = iris.target_names
# 2. Standardize the Features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.iloc[:, :-1]) # exclude target column
# 3. Apply PCA (reduce to 2 components for visualization)
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)
# 4. Create a DataFrame for PCA results
df_pca_result = pd.DataFrame(data=df_pca, columns=['PC1', 'PC2'])
df_pca_result['species'] = df['species']
# 5. Explained Variance
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", sum(pca.explained_variance_ratio_))
# 6. Visualization
plt.figure(figsize=(8,6))
for target, color, label in zip([0, 1, 2], ['red', 'green', 'blue'], species_names):
    subset = df_pca_result[df_pca_result['species'] == target]
    plt.scatter(subset['PC1'], subset['PC2'], c=color, label=label, alpha=0.7, edgecolors='k')
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.2f}% variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.2f}% variance)")
plt.title("PCA on Iris Dataset")
plt.legend()
plt.grid(True)
plt.show()
# 7. Component Loadings (Contribution of each feature)
loadings = pd.DataFrame(pca.components_,
columns=iris.feature_names,
index=['PC1', 'PC2'])
print("\nFeature contributions to each principal component:")
print(loadings)
Data Compression
Data compression is the process of encoding information in fewer bits than it originally
occupied. This is mainly achieved through methods that eliminate duplication and other
extraneous information.
Lossless Data Compression
Lossless data compression guarantees that the decompressed data is identical to the original
data. It works best for text and data files where precision matters.
Huffman coding: Builds a frequency-sorted binary tree to assign shorter codes to more frequent values.
Run-length encoding (RLE): Compresses runs of repeated data values into a value and a count.
Huffman Coding
Huffman coding is a lossless data compression algorithm that assigns shorter binary codes to
more frequent symbols and longer codes to less frequent symbols, reducing overall storage
space.
Working steps:
Count the frequency of each symbol in the data.
Build a priority queue (min-heap) of nodes (each node represents a symbol +
frequency).
Combine the two smallest frequency nodes into a new node (sum of frequencies).
Repeat until there is only one tree (Huffman Tree).
Assign binary codes:
Left branch → 0
Right branch → 1
Encode the data using these variable-length codes.
Example:
Step 1: Count Frequencies
Symbol   Frequency
A        2
B        2
C        1
Step 2: Build Huffman Tree
1. Take smallest two:
o C (1) + A (2) → New node (3)
2. Combine with B (2):
o B (2) + Node(3) → Root (5)
Step 3: Assign Codes
Go Left → 0
Go Right → 1
Tree might look like:
(5)
/ \
B(2) (3)
/ \
C(1) A(2)
Codes:
B → 0
C → 10
A → 11
Step 4: Encode
Text: AABBC
A → 11
A → 11
B → 0
B → 0
C → 10
Encoded:
11 11 0 0 10 → 11110010
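A minimal Python sketch of Huffman coding using heapq; it reproduces the AABBC example above (the exact codes can vary with tie-breaking, but the encoded length stays the same):
import heapq
from collections import Counter

def huffman_codes(text):
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    if len(heap) == 1:                      # edge case: only one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # smallest frequency -> left branch (0)
        f2, _, right = heapq.heappop(heap)  # next smallest -> right branch (1)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "AABBC"
codes = huffman_codes(text)
print(codes)                                 # e.g. {'B': '0', 'C': '10', 'A': '11'}
print("".join(codes[ch] for ch in text))     # 11110010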
RLE:
Run-Length Encoding compresses data by storing repeated characters as a single value plus
the number of repeats.
It works best on data with many consecutive repeated values (e.g., images with large blocks
of the same color).
Example:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Sort by species to ensure runs
df = df.sort_values(by='species').reset_index(drop=True)
# RLE function
def run_length_encode_column(column):
    encoding = []
    prev = column[0]
    count = 1
    for value in column[1:]:
        if value == prev:
            count += 1
        else:
            encoding.append((prev, count))
            prev = value
            count = 1
    encoding.append((prev, count))
    return encoding
# Run RLE on species
rle_species = run_length_encode_column(df['species'].tolist())
# Convert RLE result to DataFrame
rle_df = pd.DataFrame(rle_species, columns=['species', 'count'])
# --- Visualization 1: Bar Chart of Species Count from RLE ---
plt.figure(figsize=(6, 4))
plt.bar(rle_df['species'], rle_df['count'], color=['green', 'blue', 'red'])
plt.title("RLE: Species Counts in Iris Dataset")
plt.xlabel("Species")
plt.ylabel("Count")
plt.show()
Data Transformation:
Data Transformation is the process of converting data from its original format into a more
useful, consistent, or structured format for analysis, storage, or further processing. It's like
"cleaning and reshaping" your data so it's ready for use in machine learning or statistical
models.
Smoothing: Smoothing removes noise (random variations) from data to reveal patterns more
clearly. One common method is binning with mean smoothing.
Ex: Data: 3, 6, 8, 5, 10, 12, 14, 11, 13
Step 1: Apply Equal-Depth Binning
We split the data into 3 bins of three values each.
Bin 1: 3, 6, 8
Bin 2: 5, 10, 12
Bin 3: 14, 11, 13
Step 2: Mean Smoothing
Replace each value in a bin with the mean of that bin.
Bin 1 mean: (3 + 6 + 8) / 3 = 5.67 → values become 5.67, 5.67, 5.67
Bin 2 mean: (5 + 10 + 12) / 3 = 9 → values become 9, 9, 9
Bin 3 mean: (14 + 11 + 13) / 3 = 12.67 → values become 12.67, 12.67, 12.67
Step 3: Smoothed Data
Original: 3, 6, 8, 5, 10, 12, 14, 11, 13
Smoothed: 5.67, 5.67, 5.67, 9, 9, 9, 12.67, 12.67, 12.67
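The same bin-mean smoothing can be sketched with pandas (three consecutive values per bin, as in the worked example):
import pandas as pd
data = pd.Series([3, 6, 8, 5, 10, 12, 14, 11, 13])
bins = data.index // 3                            # bin id: three values per bin
smoothed = data.groupby(bins).transform('mean')   # replace each value with its bin mean
print(smoothed.round(2).tolist())                 # [5.67, 5.67, 5.67, 9.0, 9.0, 9.0, 12.67, 12.67, 12.67]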
Aggregation
Aggregation is the process of combining two or more attributes (or records) into a single,
more meaningful attribute. It is used to:
Summarize data
Reduce data size
Show higher-level trends rather than individual details
Example 1 – Attribute Aggregation
Dataset:
Sales_Q1 Sales_Q2 Sales_Q3 Sales_Q4
1000 1200 1100 1300
900 950 1000 1050
After Aggregation (Total Annual Sales):
Annual_Sales
4600
3900
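A small pandas sketch of this quarterly-to-annual aggregation (the figures match the table above):
import pandas as pd
sales = pd.DataFrame({'Sales_Q1': [1000, 900], 'Sales_Q2': [1200, 950],
                      'Sales_Q3': [1100, 1000], 'Sales_Q4': [1300, 1050]})
sales['Annual_Sales'] = sales.sum(axis=1)   # row-wise total across the four quarters
print(sales[['Annual_Sales']])              # 4600 and 3900, as in the table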
Generalization
o Replaces low-level data with higher-level concepts.
o Example: Changing age into age group (20–29, 30–39, etc.).
Normalization (Scaling)
Data normalization is a technique used in data mining to transform the values of a dataset into
a common scale. This is important because many machine learning algorithms are sensitive to
the scale of the input features and can produce better results when the data is normalized.
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such
as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Methods of Data Normalization
1. Decimal Scaling
2. Min-Max Normalization
3. z-Score Normalization (Zero-Mean Normalization)
1. Decimal Scaling Method For Normalization
It normalizes data by moving the decimal point of its values. Each data value v_i is divided
by 10^j to obtain the normalized value v_i':
v_i' = v_i / 10^j
where j is the smallest integer such that max(|v_i'|) < 1.
Example :
Let the input data be: -10, 201, 301, -401, 501, 601, 701. To normalize the above data:
Step 1: Maximum absolute value in the given data: 701
Step 2: Divide each value by 1000 (i.e., j = 3)
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
Min-Max Normalization
In this technique of data normalization, a linear transformation is performed on the original
data. The minimum and maximum values of the attribute are fetched and each value is replaced
according to the following formula:
v' = ((v − min(A)) / (max(A) − min(A))) × (new_max(A) − new_min(A)) + new_min(A)
where A is the attribute, min(A) and max(A) are the minimum and maximum values of A
respectively, v is the old value of each entry, v' is the new value, and new_min(A),
new_max(A) are the boundary values of the required range.
Z-score normalization
In this technique, values are normalized using the mean and standard deviation of attribute A:
v' = (v − Ā) / σ_A
where v and v' are the old and new values of each entry respectively, Ā is the mean of A,
and σ_A is its standard deviation.
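A minimal sketch of Min-Max and z-score normalization using scikit-learn's MinMaxScaler and StandardScaler (the "income" values are made up):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.DataFrame({'income': [20000, 35000, 50000, 80000, 120000]})
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']]).ravel()    # scaled to [0, 1]
df['income_zscore'] = StandardScaler().fit_transform(df[['income']]).ravel()  # mean 0, unit variance
print(df.round(3))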
Discretization:
Data discretization is the process of converting continuous numerical data into a set of finite
intervals (bins) or categories. It’s a type of data reduction that makes the data simpler to
analyze, especially for algorithms that work better with discrete values (e.g., decision trees,
association rules).
Types of Discretization Methods
1. Unsupervised Discretization (No class label used)
Equal-width binning
o Divides range into equal-sized intervals.
o Example: Age range 20–50 → bins of width 10 → [20–30), [30–40), [40–50]
Equal-frequency binning
o Each bin has roughly the same number of data points.
Clustering-based
o Use k-means or other clustering to group values.
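These unsupervised strategies can be sketched with scikit-learn's KBinsDiscretizer, whose strategy parameter selects equal-width ('uniform'), equal-frequency ('quantile'), or clustering-based ('kmeans') binning; the age values below are made up:
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
ages = np.array([[22], [25], [31], [38], [45], [47], [52], [61]])
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(disc.fit_transform(ages).ravel())   # bin index (0, 1 or 2) assigned to each age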
2. Supervised Discretization (Uses class label to guide binning)
Entropy-based discretization (e.g., ID3, C4.5)
o Split points chosen to maximize information gain.
ChiMerge
o Uses chi-square test to merge intervals until a stopping criterion is met.
Encoding (for Categorical Data)
o Converts categorical values into numerical format.
o Methods:
One-Hot Encoding: One-hot encoding is a data transformation technique used to convert
categorical (non-numeric) data into a numerical format so that machine learning models can
process it.
It works by creating a separate binary column for each category and marking the presence (1)
or absence (0) of that category.
Original Data:
Color
Red
Green
Blue
Red
After One-Hot Encoding:
Color_Red Color_Green Color_Blue
1 0 0
0 1 0
0 0 1
1 0 0
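A one-line sketch of the same encoding with pandas.get_dummies:
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})
one_hot = pd.get_dummies(df, columns=['Color'], dtype=int)
print(one_hot)   # Color_Blue, Color_Green, Color_Red columns of 0/1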
Label Encoding:
Label encoding is a data transformation technique that converts categorical values into numeric
codes by assigning a unique integer to each category. Unlike one-hot encoding, it doesn’t create
new columns—it replaces each category with a number.
Example
Original Data:
Color
Red
Green
Blue
Red
After Label Encoding:
Color
2
1
0
2
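A small sketch using scikit-learn's LabelEncoder, which numbers categories alphabetically (Blue = 0, Green = 1, Red = 2), matching the table above:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})
df['Color_encoded'] = LabelEncoder().fit_transform(df['Color'])
print(df)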
Attribute Construction (Feature Engineering)
o Creating new attributes from existing ones.
o Example: From “Date of Birth” → create “Age” attribute.
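A tiny sketch of this with pandas (the birth dates and the reference date are invented; age is taken as a simple year difference):
import pandas as pd
df = pd.DataFrame({'dob': ['1995-04-12', '2001-09-30', '1988-01-05']})
df['dob'] = pd.to_datetime(df['dob'])
df['age'] = pd.Timestamp('2025-01-01').year - df['dob'].dt.year   # simple year difference
print(df)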