Data Transformation

The document provides an overview of data extraction, transformation, and loading (ETL) processes using MySQL and Python with pandas. It covers various aspects of data manipulation, including data cleaning, aggregation, merging, and loading into databases, as well as best practices and challenges in data transformation. Additionally, it discusses the importance of data transformation in analytics and compliance, highlighting tools and use cases.

Extracting Data from a MySQL Database

SELECT customer_id, first_name, last_name, email, created_at
FROM customers
WHERE created_at >= '2024-01-01';

Data Cleaning (Handling NULL values)

SELECT customer_id, first_name, last_name,
       COALESCE(email, '[email protected]') AS email
FROM customers;

Data Type Conversion

SELECT order_id,
       CAST(order_date AS DATE) AS order_date,
       CAST(total_amount AS DECIMAL(10,2)) AS total_amount
FROM orders;

Data Aggregation

SELECT customer_id, SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;

Data Filtering

SELECT * FROM customers
WHERE email LIKE '%@%' AND LENGTH(phone_number) >= 10;

Merging Data from Multiple Sources

SELECT c.customer_id, c.first_name, c.last_name,
       o.order_id, o.order_date, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

Transforming Data in Python with pandas

import pandas as pd

# Sample extracted data
data = {'customer_id': [1, 2, 3],
        'name': ['Alice', 'Bob', 'Charlie'],
        'email': ['[email protected]', None, '[email protected]'],
        'total_spent': ['100.5', '200.75', 'NULL']}
df = pd.DataFrame(data)

# Transformations
df['email'] = df['email'].fillna('[email protected]')  # Handle NULL values (avoids chained inplace=True)
df['total_spent'] = pd.to_numeric(df['total_spent'], errors='coerce').fillna(0)  # Convert and handle errors

print(df)

Loading Data into a Data Warehouse

INSERT INTO sales_summary (customer_id, total_spent, last_purchase_date)
SELECT customer_id, SUM(total_amount), MAX(order_date)
FROM staging_orders
GROUP BY customer_id;

Loading Multiple Rows at Once

INSERT INTO customers (customer_id, first_name, last_name, email)
VALUES
    (101, 'Alice', 'Smith', '[email protected]'),
    (102, 'Bob', 'Jones', '[email protected]'),
    (103, 'Charlie', 'Brown', '[email protected]');

Loading from a CSV File (PostgreSQL COPY)

COPY customers FROM '/path/to/customers.csv'
DELIMITER ',' CSV HEADER;

Handling Duplicates with ON CONFLICT (PostgreSQL)

INSERT INTO customers (customer_id, first_name, last_name, email)
VALUES (101, 'Alice', 'Smith', '[email protected]')
ON CONFLICT (customer_id)
DO UPDATE SET email = EXCLUDED.email;

Loading Data into a SQL Database

import pandas as pd
from sqlalchemy import create_engine

# Sample transformed data
data = {'customer_id': [101, 102, 103],
        'first_name': ['Alice', 'Bob', 'Charlie'],
        'last_name': ['Smith', 'Jones', 'Brown'],
        'email': ['[email protected]', '[email protected]', '[email protected]']}
df = pd.DataFrame(data)

# Database connection
engine = create_engine('postgresql://user:password@localhost:5432/mydatabase')

# Load data into the database
df.to_sql('customers', engine, if_exists='append', index=False)

Data Transformation
Topics
● Introduction
● What is Data Transformation?
○ Definition
○ Key Goals of Data Transformation
● Types of Data Transformation
○ Structural Transformation
○ Data Cleaning & Standardization
○ Data Enrichment
○ Data Aggregation & Summarization
○ Data Normalization & Scaling
○ Data Anonymization & Masking
● Data Transformation in ETL (Extract, Transform, Load)
● Data Transformation Tools
○ Open-Source Tools
○ Commercial Tools
● Challenges in Data Transformation
● Best Practices for Data Transformation
● Use Cases of Data Transformation

Introduction
Data transformation is a critical process in data integration, analytics, and warehousing.

It involves converting, cleaning, structuring, and enriching raw data into a format that is
suitable for analysis, reporting, and decision-making.

Transformation ensures that data from diverse sources is standardized, normalized, and
compatible for further processing.

Definition

Data transformation refers to the conversion of data from one format, structure, or value
representation to another to make it suitable for analysis or integration.

It is an essential step in ETL (Extract, Transform, Load) pipelines, enabling consistency,
accuracy, and usability of data.

Key Goals of Data Transformation
Standardization – Convert data into a uniform format.

Normalization – Reduce data redundancy and inconsistency.

Data Cleaning – Remove errors, duplicates, and inconsistencies.

Aggregation – Summarize data for better analysis.

Data Enrichment – Add missing values or context.

Anonymization – Protect sensitive information for compliance.

Structural Transformation

Changes the organization or format of data, as sketched in the example below.

● Column Splitting – Splitting a single column into multiple columns (e.g., "Full Name" → "First Name" & "Last Name").
● Column Merging – Combining multiple columns into one (e.g., "City" & "State" → "Location").
● Pivoting & Unpivoting – Changing row-column relationships (e.g., converting row-based data into column-based format).
● Schema Mapping – Aligning different database schemas to ensure consistency across sources.

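A minimal pandas sketch of the first three operations; the column names (full_name, city, state) and the small sales table are hypothetical:

import pandas as pd

df = pd.DataFrame({'full_name': ['Alice Smith', 'Bob Jones'],
                   'city': ['Austin', 'Denver'],
                   'state': ['TX', 'CO']})

# Column splitting: full_name -> first_name and last_name
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)

# Column merging: city and state -> location
df['location'] = df['city'] + ', ' + df['state']

# Pivoting: row-based data into a column-based layout (melt() reverses it)
sales = pd.DataFrame({'region': ['East', 'East', 'West'],
                      'month': ['Jan', 'Feb', 'Jan'],
                      'amount': [100, 120, 90]})
wide = sales.pivot(index='region', columns='month', values='amount')
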
Data Cleaning & Standardization

Improves accuracy, consistency, and completeness of data; a short pandas sketch follows the list.

● Handling Missing Values – Filling gaps using interpolation, mean/mode imputation, or removing incomplete records.
● Deduplication – Identifying and removing duplicate records.
● Standardizing Formats – Ensuring consistent date formats (e.g., MM-DD-YYYY → YYYY-MM-DD), unit conversions (cm → m).
● Correcting Inconsistencies – Fixing spelling errors, typos, or naming discrepancies (New York vs. NY).

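A minimal sketch of these steps; the columns (signup_date, height_cm, city) are hypothetical:

import pandas as pd

df = pd.DataFrame({'signup_date': ['03-15-2024', '03-16-2024'],
                   'height_cm': [172.0, None],
                   'city': ['New York', 'NY']})

# Handling missing values: mean imputation for a numeric column
df['height_cm'] = df['height_cm'].fillna(df['height_cm'].mean())

# Standardizing formats: MM-DD-YYYY -> YYYY-MM-DD and cm -> m
df['signup_date'] = pd.to_datetime(df['signup_date'], format='%m-%d-%Y').dt.strftime('%Y-%m-%d')
df['height_m'] = df['height_cm'] / 100

# Correcting inconsistencies: map known aliases to one canonical value
df['city'] = df['city'].replace({'NY': 'New York'})

# Deduplication: drop exact duplicate rows
df = df.drop_duplicates()
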
Data Enrichment

Enhancing data by adding more context or information; a small merge example follows the list.

● Geocoding – Adding latitude/longitude based on addresses.
● Merging External Data – Integrating demographic or financial data for better insights.
● Text Tokenization – Breaking text into meaningful components for NLP applications.

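A sketch of merging external data; the demographics table and its columns are hypothetical:

import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2],
                          'zip_code': ['10001', '80202']})

# Hypothetical external demographic data keyed by zip_code
demographics = pd.DataFrame({'zip_code': ['10001', '80202'],
                             'median_income': [85000, 72000]})

# A left join keeps every customer and adds the external context
enriched = customers.merge(demographics, on='zip_code', how='left')
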
Data Aggregation & Summarization

Consolidates data for easier analysis; a pandas sketch follows the list.

● Summing Up Values – Computing total sales, revenue, or user counts.
● Averaging & Statistical Measures – Finding mean, median, standard deviation.
● Grouping & Binning – Categorizing continuous variables into discrete ranges (e.g., age groups).

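A sketch reusing the orders shape from the earlier SQL examples (the age column is hypothetical, added for the binning step):

import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 1, 2],
                       'total_amount': [100.0, 50.0, 200.0],
                       'age': [23, 23, 41]})

# Summing and statistical measures per customer
summary = orders.groupby('customer_id')['total_amount'].agg(['sum', 'mean', 'median'])

# Binning: categorize a continuous variable into discrete age groups
orders['age_group'] = pd.cut(orders['age'], bins=[0, 18, 30, 50, 120],
                             labels=['<18', '18-30', '31-50', '50+'])
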
Data Normalization & Scaling

Ensures values are within a specific range for better comparison; each method is sketched below.

● Min-Max Scaling – Rescales values between 0 and 1.
● Z-score Normalization – Rescales values to mean 0 and standard deviation 1.
● Log Transformation – Helps normalize skewed data distributions.

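The three methods on a toy Series; a sketch only (production pipelines often use scikit-learn scalers instead):

import numpy as np
import pandas as pd

s = pd.Series([2.0, 5.0, 9.0, 120.0])

# Min-max scaling: rescale values into [0, 1]
min_max = (s - s.min()) / (s.max() - s.min())

# Z-score normalization: mean 0, standard deviation 1
z_score = (s - s.mean()) / s.std()

# Log transformation: compress a right-skewed distribution
logged = np.log1p(s)  # log(1 + x), still defined at zero
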
Data Anonymization & Masking

Protects sensitive information for privacy compliance (GDPR, HIPAA); see the sketch after the list.

● Tokenization – Replacing PII (Personally Identifiable Information) with pseudonyms.
● Generalization – Reducing granularity of details (e.g., exact age 28 → age group 20-30).
● Data Masking – Hiding confidential data (John Doe → J*** D***).

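A sketch of masking and generalization (tokenization normally relies on a secure lookup table or service, omitted here):

import pandas as pd

df = pd.DataFrame({'name': ['John Doe'], 'age': [28]})

# Data masking: keep only the first letter of each word (John Doe -> J*** D***)
def mask_name(name):
    return ' '.join(word[0] + '***' for word in name.split())

df['name'] = df['name'].apply(mask_name)

# Generalization: replace exact age with an age band (28 -> 20-30)
df['age_group'] = pd.cut(df['age'], bins=[0, 20, 30, 40],
                         labels=['0-20', '20-30', '30-40'])
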
Data Transformation in ETL (Extract, Transform, Load)

Data transformation is the middle stage of ETL processes (an end-to-end sketch follows the list):

1. Extract – Collects raw data from multiple sources (databases, files, APIs).
2. Transform – Cleans, restructures, enriches, and converts data into a usable format.
3. Load – Stores transformed data in a target system (data warehouse, analytics platform).

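A minimal end-to-end sketch, assuming the staging_orders and sales_summary tables and the connection string from the earlier slides:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydatabase')

# Extract: pull raw rows from the staging table
raw = pd.read_sql('SELECT customer_id, total_amount, order_date FROM staging_orders', engine)

# Transform: fix types, then aggregate per customer
raw['total_amount'] = pd.to_numeric(raw['total_amount'], errors='coerce').fillna(0)
summary = raw.groupby('customer_id', as_index=False).agg(
    total_spent=('total_amount', 'sum'),
    last_purchase_date=('order_date', 'max'))

# Load: append the result to the warehouse table
summary.to_sql('sales_summary', engine, if_exists='append', index=False)
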
Data Transformation Tools

Open-Source Tools

● Apache NiFi – Automates data flow between systems.
● Talend Open Studio – Provides a visual ETL pipeline builder.
● Pentaho Data Integration (PDI) – Supports ETL and real-time data processing.
● dbt (Data Build Tool) – Specializes in transforming data within cloud warehouses.

Commercial Tools

● Informatica PowerCenter – Advanced ETL and data governance features.
● Microsoft SQL Server Integration Services (SSIS) – Handles ETL for Microsoft environments.
● AWS Glue – Serverless ETL for cloud data transformation.
● Google Dataflow – Batch and real-time data transformation in Google Cloud.

Challenges in Data Transformation
Scalability Issues – Handling large volumes of data can slow down processing.

Schema Evolution – Changing data structures over time requires updates.

Data Quality & Consistency – Incomplete or incorrect data can impact analytics.

Performance Optimization – Transformation pipelines must be efficient.

Security & Compliance – Sensitive data requires masking and encryption.

Best Practices for Data Transformation

Define Clear Transformation Rules – Ensure well-documented logic for data changes.

Automate Data Cleaning – Use scripts or tools to handle missing and inconsistent values.

Monitor Data Pipelines – Implement logging and alerts for failures.

Use Parallel Processing – Speed up transformations for large datasets.

Ensure Data Governance – Follow compliance regulations for data handling.

Validate Transformed Data – Regularly compare output with expected results (see the sketch below).

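One way to automate the last point; a sketch that checks a transformed frame against simple expectations, using column names from the earlier examples:

import pandas as pd

def validate(df: pd.DataFrame) -> None:
    # Keys must stay unique after the transformation
    assert df['customer_id'].is_unique, 'duplicate customer_id values'
    # Required columns must not contain NULLs
    assert df['email'].notna().all(), 'NULL emails survived cleaning'
    # Values must stay within their expected domain
    assert (df['total_spent'] >= 0).all(), 'negative total_spent found'

validate(pd.DataFrame({'customer_id': [1, 2],
                       'email': ['a@example.com', 'b@example.com'],
                       'total_spent': [100.5, 200.75]}))
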
Use Cases of Data Transformation

Business Intelligence (BI) & Reporting

● Convert sales transactions into structured reports.
● Aggregate data for KPIs (Key Performance Indicators).
● Standardize data from multiple regions and currencies.

Data Warehousing

● Merge customer data from multiple sources.
● Summarize historical records for trend analysis.
● Ensure schema consistency across databases.

Machine Learning & AI

● Normalize and scale data for model training.
● Clean and remove noise for better accuracy.
● Convert text data into structured features (NLP).

Compliance & Security

● Anonymize personal data for GDPR compliance.
● Mask sensitive financial information.
● Redact personally identifiable information in healthcare records.
