Data Transformation
FROM customers
Data Cleaning (Handling NULL values) Data Type Conversion
FROM orders;
Data Aggregation Data Filtering
GROUP BY customer_id;
Merging Data from Multiple Sources
FROM customers c
import pandas as pd
# Sample extracted data
df = pd.DataFrame(data)
# Transformations
print(df)
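The sample data and the transformation steps are not shown in the fragment above. As a minimal sketch of what such a script might look like, the following uses hypothetical order rows (the column names `order_id`, `customer_id`, `amount`, and `order_date` are assumptions, not taken from the slides) and applies NULL handling, type conversion, and a derived column:

```python
import pandas as pd

# Hypothetical extracted data; every column name here is an assumption.
data = {
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 101],
    "amount": ["19.99", None, "5.00"],
    "order_date": ["2024-01-05", "2024-01-07", "2024-02-01"],
}
df = pd.DataFrame(data)

# Transformations
# Type conversion: text -> float, then replace missing amounts with 0.0.
df["amount"] = pd.to_numeric(df["amount"]).fillna(0.0)
# Type conversion: text -> datetime.
df["order_date"] = pd.to_datetime(df["order_date"])
# Derived column: year-month of the order, useful for later grouping.
df["order_month"] = df["order_date"].dt.strftime("%Y-%m")

print(df)
```

Replacing missing amounts with zero is only one possible policy; dropping the row or flagging it for review may be more appropriate depending on the data.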
Loading Data into a Data Warehouse
INSERT INTO sales_summary (customer_id, total_spent, last_purchase_date)
FROM staging_orders
GROUP BY customer_id;
Loading Multiple Rows at Once
INSERT INTO customers (customer_id, first_name, last_name, email)
VALUES
Loading from a CSV File
COPY customers FROM '/path/to/customers.csv'
Handling Duplicates with ON CONFLICT (PostgreSQL)
INSERT INTO customers (customer_id, first_name, last_name, email)
ON CONFLICT (customer_id)
Loading Data into a SQL Database
import pandas as pd
from sqlalchemy import create_engine
# Sample transformed data
data = {'customer_id': [101, 102, 103],
        'first_name': ['Alice', 'Bob', 'Charlie'],
        'last_name': ['Smith', 'Jones', 'Brown'],
        'email': ['[email protected]', '[email protected]', '[email protected]']}
df = pd.DataFrame(data)
# Database connection
engine = create_engine('postgresql://user:password@localhost:5432/mydatabase')
# Load data into the database
df.to_sql('customers', engine, if_exists='append', index=False)
Data Transformation
Topics
● Introduction
● What is Data Transformation?
○ Definition
○ Key Goals of Data Transformation
● Types of Data Transformation
○ Structural Transformation
○ Data Cleaning & Standardization
○ Data Enrichment
○ Data Aggregation & Summarization
○ Data Normalization & Scaling
○ Data Anonymization & Masking
● Data Transformation in ETL (Extract, Transform, Load)
● Data Transformation Tools
○ Open-Source Tools
○ Commercial Tools
● Challenges in Data Transformation
● Best Practices for Data Transformation
● Use Cases of Data Transformation
Introduction
Data transformation is a critical process in data integration, analytics, and warehousing.
It involves converting, cleaning, structuring, and enriching raw data into a format that is
suitable for analysis, reporting, and decision-making.
Transformation ensures that data from diverse sources is standardized, normalized, and
compatible for further processing.
Definition
Data transformation refers to the conversion of data from one format, structure, or value
representation to another to make it suitable for analysis or integration.
Key Goals of Data Transformation
Standardization – Convert data into a uniform format.
Structural Transformation
Changes the organization or format of data.
● Column Splitting – Splitting a single column into multiple columns (e.g., "Full
Name" → "First Name" & "Last Name").
● Column Merging – Combining multiple columns into one (e.g., "City" & "State"
→ "Location").
● Pivoting & Unpivoting – Changing row-column relationships (e.g., converting
row-based data into column-based format).
● Schema Mapping – Aligning different database schemas to ensure consistency
across sources.
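The four operations above can be sketched in pandas. This is an illustrative example only: the tables, column names, and the rename mapping are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    "full_name": ["Alice Smith", "Bob Jones"],
    "city": ["Austin", "Denver"],
    "state": ["TX", "CO"],
})

# Column splitting: "full_name" -> "first_name" and "last_name" (split on the first space).
people[["first_name", "last_name"]] = people["full_name"].str.split(" ", n=1, expand=True)

# Column merging: "city" and "state" -> "location".
people["location"] = people["city"] + ", " + people["state"]

# Pivoting: one row per (region, quarter) becomes one row per region.
sales_long = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 110],
})
sales_wide = sales_long.pivot(index="region", columns="quarter", values="revenue")

# Unpivoting: back to one row per (region, quarter).
sales_back = sales_wide.reset_index().melt(
    id_vars="region", var_name="quarter", value_name="revenue"
)

# Schema mapping: rename a source's columns to the target schema's names.
legacy = pd.DataFrame({"cust_no": [101], "mail": ["a@example.com"]})
mapped = legacy.rename(columns={"cust_no": "customer_id", "mail": "email"})
```

Splitting on the first space is a simplification; names with several parts usually need a more careful rule.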
Data Cleaning & Standardization
Improves accuracy, consistency, and completeness of data.
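A minimal pandas sketch of common cleaning and standardization steps follows. The sample rows and the choice of "Unknown" as a fill value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw records with stray whitespace, mixed case, a duplicate, and a missing name.
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name": ["  alice ", "BOB", "BOB", None],
    "country": ["us", "US", "US", "de"],
})

clean = raw.copy()
# Standardize text: trim whitespace and use consistent casing.
clean["name"] = clean["name"].str.strip().str.title()
clean["country"] = clean["country"].str.upper()
# Handle missing values with an explicit placeholder.
clean["name"] = clean["name"].fillna("Unknown")
# Remove exact duplicate rows.
clean = clean.drop_duplicates().reset_index(drop=True)
```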
Data Enrichment
Enhancing data by adding more context or information.
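One common form of enrichment is joining records against a reference table. The sketch below assumes a hypothetical `countries` lookup that maps a country code to a sales region; a left join keeps every order, even when no match exists.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "country_code": ["US", "DE", "XX"],
})

# Hypothetical reference data used to add context to each order.
countries = pd.DataFrame({
    "country_code": ["US", "DE"],
    "region": ["Americas", "EMEA"],
})

# Left join: unmatched codes (here "XX") get a missing region instead of being dropped.
enriched = orders.merge(countries, on="country_code", how="left")
```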
Data Aggregation & Summarization
Consolidates data for easier analysis.
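As an illustration, the following sketch computes a per-customer summary with pandas. The output columns `total_spent` and `last_purchase_date` mirror the `sales_summary` table named earlier in the slides; the input rows are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "amount": [20.0, 5.0, 12.5],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-01", "2024-01-07"]),
})

# One row per customer: total spend and most recent purchase date.
summary = orders.groupby("customer_id", as_index=False).agg(
    total_spent=("amount", "sum"),
    last_purchase_date=("order_date", "max"),
)
```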
Data Normalization & Scaling
Rescales numeric values to a common range or distribution so that features measured in
different units can be compared.
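Two widely used rescaling methods are min-max scaling, x' = (x - min) / (max - min), and z-score standardization, z = (x - mean) / std. A short sketch with hypothetical values:

```python
import pandas as pd

ages = pd.Series([20, 30, 40, 60], name="age")

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
min_max = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score standardization: zero mean and unit (population) standard deviation.
z_score = (ages - ages.mean()) / ages.std(ddof=0)
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range; z-scores are often preferred when the data contains them.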
Data Anonymization & Masking
Protects sensitive information for privacy compliance (GDPR, HIPAA).
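The sketch below shows two simple techniques using only the Python standard library: replacing an identifier with a salted hash, and masking all but the last digits of a card number. The function names and the salt value are hypothetical. Note that salted hashing is pseudonymization rather than full anonymization, so whether it satisfies a given regulation is a separate legal question.

```python
import hashlib

# Hypothetical salt; in practice this would be a secret kept outside the code.
SALT = "replace-with-a-secret-salt"

def pseudonymize(value: str, salt: str = SALT) -> str:
    """Replace a value with a salted SHA-256 digest (same input -> same token)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def mask_card(number: str, visible: int = 4) -> str:
    """Hide all but the last `visible` digits of a card number."""
    digits = number.replace(" ", "")
    return "*" * (len(digits) - visible) + digits[-visible:]

token = pseudonymize("alice@example.com")
masked = mask_card("4111 1111 1111 1111")
```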
Data Transformation in ETL (Extract, Transform, Load)
Data transformation is the middle stage of ETL processes:
1. Extract – Collects raw data from multiple sources (databases, files, APIs).
2. Transform – Cleans, restructures, enriches, and converts data into a usable format.
3. Load – Stores transformed data in a target system (data warehouse, analytics
platform).
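The three stages can be sketched end to end with the Python standard library. Everything here is illustrative: the inline CSV stands in for a real source, an in-memory SQLite database stands in for the warehouse, and the function and table names are assumptions.

```python
import csv
import io
import sqlite3

# Stand-in for a real source file or API response.
RAW_CSV = """customer_id,amount
101,20.00
101,5.00
102,
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types, treat missing amounts as 0, total per customer."""
    totals = {}
    for row in rows:
        customer_id = int(row["customer_id"])
        amount = float(row["amount"]) if row["amount"] else 0.0
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return sorted(totals.items())

def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary "
        "(customer_id INTEGER PRIMARY KEY, total_spent REAL)"
    )
    conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT customer_id, total_spent FROM sales_summary ORDER BY customer_id"
).fetchall()
```

Keeping each stage in its own function makes it easier to test the transformation logic without touching the source or the target system.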
Data Transformation Tools
Open-Source Tools
● Apache NiFi – Automates data flow between systems.
● Talend Open Studio – Provides a visual ETL pipeline builder.
● Pentaho Data Integration (PDI) – Supports ETL and real-time data processing.
● dbt (Data Build Tool) – Specializes in transforming data within cloud warehouses.
Commercial Tools
● Informatica PowerCenter – Advanced ETL and data governance features.
● Microsoft SQL Server Integration Services (SSIS) – Handles ETL for Microsoft
environments.
● AWS Glue – Serverless ETL for cloud data transformation.
● Google Dataflow – Batch and real-time data transformation in Google Cloud.
Challenges in Data Transformation
Scalability Issues – Handling large volumes of data can slow down processing.
Data Quality & Consistency – Incomplete or incorrect data can impact analytics.
Best Practices for Data Transformation
Define Clear Transformation Rules – Ensure well-documented logic for data changes.
Automate Data Cleaning – Use scripts or tools to handle missing and inconsistent values.
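One way to combine both practices is to keep the transformation rules in a single documented table and apply them with a small helper. This is a sketch of that idea, not a prescribed design; the `RULES` table, the `apply_rules` helper, and the column names are all hypothetical.

```python
import pandas as pd

# Each rule: column -> (human-readable description, function applied to the column).
RULES = {
    "email": ("trim whitespace and lower-case", lambda s: s.str.strip().str.lower()),
    "signup_date": ("parse ISO date strings", lambda s: pd.to_datetime(s)),
    "plan": ("default missing plan to 'free'", lambda s: s.fillna("free")),
}

def apply_rules(df, rules):
    """Apply every rule to its column and return a new DataFrame."""
    out = df.copy()
    for column, (_description, func) in rules.items():
        out[column] = func(out[column])
    return out

users = pd.DataFrame({
    "email": [" Alice@Example.com ", "bob@example.com"],
    "signup_date": ["2024-01-05", "2024-02-01"],
    "plan": ["pro", None],
})
cleaned = apply_rules(users, RULES)
```

Because the descriptions live next to the code, the same table can be printed or exported as documentation of the transformation logic.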
Use Cases of Data Transformation
Business Intelligence (BI) & Reporting
● Convert sales transactions into structured reports.
● Aggregate data for KPIs (Key Performance Indicators).
● Standardize data from multiple regions and currencies.
Machine Learning & AI
● Normalize and scale data for model training.
● Clean and remove noise for better accuracy.
● Convert text data into structured features (NLP).
Data Warehousing
● Merge customer data from multiple sources.
● Summarize historical records for trend analysis.
● Ensure schema consistency across databases.
Compliance & Security
● Anonymize personal data for GDPR compliance.
● Mask sensitive financial information.
● Redact personally identifiable information in healthcare records.