Data Transformation in Data Mining
Last Updated: 12 Feb, 2025
Data transformation in data mining is the process of converting raw data into a format suitable for analysis and modeling. It also helps ensure that the data is free of errors and inconsistencies. The goal of data transformation is to prepare the data for mining so that useful insights and knowledge can be extracted from it.
Data Transformation Techniques
Data transformation involves several techniques:
1. Smoothing
Smoothing removes noise from the dataset using algorithms such as binning, regression, or moving averages. It highlights the important features present in the data and makes patterns easier to detect and predict. During collection, data can be processed to eliminate or reduce variance and other forms of noise. The idea behind data smoothing is to capture simple, gradual changes that help reveal trends and patterns. This helps analysts or traders who need to digest large volumes of data and would otherwise miss patterns hidden in the noise.
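As a hedged illustration (the numbers and bin size below are made up, not taken from any particular dataset), the following sketch smooths a small list of noisy values by equal-frequency binning and replacing each value with its bin mean:

```python
# Minimal sketch of smoothing by bin means: sort the noisy values,
# split them into equal-size bins, and replace each value by its bin mean.

def smooth_by_bin_means(values, bin_size):
    """Replace each value with the mean of the bin it falls into."""
    sorted_vals = sorted(values)
    smoothed = []
    for i in range(0, len(sorted_vals), bin_size):
        bin_vals = sorted_vals[i:i + bin_size]
        bin_mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([round(bin_mean, 2)] * len(bin_vals))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # illustrative values
print(smooth_by_bin_means(prices, bin_size=3))
# [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0, 30.33, 30.33, 30.33]
```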
2. Aggregation
Aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple sources that are integrated into a single description for analysis. This is a crucial step, since the accuracy of the resulting insights depends heavily on the quantity and quality of the data used; gathering accurate, high-quality data in sufficient quantity is necessary to produce relevant results. Aggregated data is useful for decisions on everything from financing and product strategy to pricing, operations, and marketing. For example, daily sales data may be aggregated to compute monthly and annual totals.
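A minimal sketch of this idea, assuming pandas is available and using made-up column names and sales figures:

```python
import pandas as pd

# Hypothetical daily sales records; the column names and figures are assumptions.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11", "2024-02-25"]),
    "amount": [120.0, 80.0, 150.0, 200.0],
})

# Aggregate daily rows into monthly totals, then into an annual total.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual = sales.groupby(sales["date"].dt.year)["amount"].sum()

print(monthly)
print(annual)
```

Summing is only one choice of aggregate; means, counts, or other summary statistics could be used in the same way.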
Read more about Aggregation in detail.
3. Discretization
Discretization transforms continuous data into a set of small intervals. Many real-world data mining activities involve continuous attributes, yet many existing data mining frameworks cannot handle them directly. Even when a task can manage a continuous attribute, its efficiency can improve significantly by replacing the continuous values with discrete ones. For example, numeric ranges such as (1-10, 11-20), or age converted to young, middle age, senior.
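As an illustrative sketch (the ages, bin edges, and labels below are assumptions, not prescribed values), continuous ages can be discretized into labeled intervals with pandas:

```python
import pandas as pd

# Hypothetical ages; bin edges and labels are chosen only for illustration.
ages = pd.Series([15, 22, 38, 45, 61, 73])

# Cut the continuous values into intervals (0, 30], (30, 60], (60, 90]
age_groups = pd.cut(ages, bins=[0, 30, 60, 90],
                    labels=["young", "middle age", "senior"])
print(age_groups.tolist())
# ['young', 'young', 'middle age', 'middle age', 'senior', 'senior']
```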
Read more about Discretization in detail.
4. Attribute Construction
New attributes are constructed from the given set of attributes and added to assist the mining process. This simplifies the original data and makes mining more efficient.
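A small sketch, with hypothetical attribute names, showing how new attributes such as area and price per square metre can be constructed from existing ones:

```python
import pandas as pd

# Hypothetical housing records; the attribute names and values are assumptions.
houses = pd.DataFrame({
    "length_m": [10, 12, 8],
    "width_m": [6, 5, 7],
    "price": [150000, 160000, 120000],
})

# Construct new attributes from the existing ones to aid mining.
houses["area_m2"] = houses["length_m"] * houses["width_m"]
houses["price_per_m2"] = houses["price"] / houses["area_m2"]
print(houses)
```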
5. Generalization
Generalization converts low-level data attributes to high-level attributes using a concept hierarchy. For example, age given in numerical form (22, 25) can be converted to categorical values (young, old), and categorical attributes such as house addresses may be generalized to higher-level concepts such as town or country.
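A minimal sketch of climbing a concept hierarchy, assuming a hand-built city-to-country mapping and arbitrary age cut-offs:

```python
# Low-level values (cities) are replaced by a higher-level concept (country).
# The hierarchy below is an assumption made for illustration.
city_to_country = {
    "Mumbai": "India",
    "Delhi": "India",
    "Paris": "France",
    "Lyon": "France",
}

addresses = ["Mumbai", "Paris", "Delhi", "Lyon"]
print([city_to_country[city] for city in addresses])  # ['India', 'France', 'India', 'France']

# Numeric attributes can be generalized the same way, e.g. age -> age group.
def age_group(age):
    return "young" if age < 30 else "middle age" if age < 60 else "senior"

print([age_group(a) for a in [22, 25, 64]])  # ['young', 'young', 'senior']
```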
6. Normalization
Data normalization converts the values of data variables into a common range. The following techniques are used for normalization; a small code sketch applying all three follows the list:
- Min-Max Normalization:
- This transforms the original data linearly.
- Suppose that min_A is the minimum and max_A is the maximum value of an attribute A.
- v is the original value of the attribute that you want to normalize.
- v' is the new value obtained after normalization (here mapped into the range [0, 1]).
v' = (v - min_A) / (max_A - min_A)
- Z-Score Normalization:
- In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean and standard deviation of A.
- A value v of attribute A is normalized to v' using the formula below:
v' = (v - mean(A)) / (standard deviation(A))
- Decimal Scaling:
- It normalizes the values of an attribute by shifting the position of their decimal point.
- The number of positions the decimal point is moved is determined by the maximum absolute value of attribute A.
- A value v of attribute A is normalized to v' by computing
v' = v / 10^j
- where j is the smallest integer such that Max(|v'|) < 1.
- Suppose the values of an attribute P range from -99 to 99.
- The maximum absolute value of P is 99.
- To normalize, we divide each value by 100 (i.e., j = 2, the number of digits in the largest absolute value), so the values become 0.99, 0.98, and so on.
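The sketch below applies the three formulas above to the same toy attribute; the values are illustrative only:

```python
import statistics

P = [-99, -50, 0, 45, 99]   # illustrative attribute values

# Min-max normalization to [0, 1]
min_p, max_p = min(P), max(P)
min_max = [(v - min_p) / (max_p - min_p) for v in P]

# Z-score normalization
mean_p = statistics.mean(P)
std_p = statistics.pstdev(P)          # population standard deviation
z_score = [(v - mean_p) / std_p for v in P]

# Decimal scaling: divide by 10^j, where j is the number of digits in the
# largest absolute value (smallest j such that max(|v'|) < 1); here j = 2.
j = len(str(int(max(abs(v) for v in P))))
decimal_scaled = [v / (10 ** j) for v in P]

print([round(v, 2) for v in min_max])
print([round(v, 2) for v in z_score])
print(decimal_scaled)
```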
Read about Normalization in detail.