Lecture Notes Ch1
INTRODUCTION TO DATA
ENGINEERING
INTRODUCTION
• Data has become a crucial asset for businesses and organizations in the digital
era.
• Data refers to information, collected from various sources and in various formats,
that can be processed and analyzed to derive insights and enable data-driven
decision making.
• Sources of data include operational systems, databases, APIs, IoT devices,
websites, social media and more. Data can be structured, unstructured or semi-
structured.
• Big data has become a popular buzzword, fueled by technological advancements
in both software and hardware.
INTRODUCTION (CONT.)
Key uses of data in business:
• Customer analytics - Analyze customer behaviors, preferences and trends to improve marketing
and customer experiences, e.g. recommendation engines.
• Operational analytics - Optimize business operations such as supply chain, logistics and asset
utilization by using data to find inefficiencies.
• Fraud detection - Identify anomalies and patterns in transactions to detect fraudulent activities.
• Risk management - Use data modeling and simulations to identify, quantify and mitigate business
risks.
• Predictive analytics - Make data-driven forecasts and predictions about future outcomes using
techniques like machine learning.
• Personalization - Customize products, services and content by analyzing individual customer data
and preferences.
• Data-driven decision making - Leverage insights from data analysis to guide business strategy and
important decisions.
AN OVERVIEW OF BIG DATA
• Big data refers to extremely large, complex datasets that traditional data processing
tools cannot easily store, manage or analyze. Big data has the following key
characteristics:
– Volume - Size of datasets in terabytes and petabytes. For example, a petabyte of
server log data.
– Velocity - Speed at which data is generated and processed. For example, millions
of financial transactions per minute.
– Variety - Different structured, semi-structured and unstructured data types like
databases, text, audio, video.
– Veracity - Inconsistency and uncertainty around data accuracy and quality.
– Value - Deriving business value from large datasets using analytics and machine
learning.
• Big data technologies include Hadoop, Spark, Kafka, NoSQL databases and data lakes.
PATHS INTO DATA ENGINEERING
• Start as a data analyst or business analyst and make the move to data
engineering. Build on SQL, ETL, reporting skills.
• Complete a data engineering bootcamp or certification program to gain
specialized skills. Programs focus on tools like Hadoop, Spark, SQL, NoSQL,
Python.
• Develop proficiency with essential data engineering tools like SQL, Python,
Hadoop, Spark, Kafka, Airflow through courses, certifications and practice.
• Stay up-to-date on trends in data tech like containers, streaming, cloud
platforms to align with industry needs.
A COMPARISON OF DATA SCIENTIST, DATA
ENGINEER, AND DATA ANALYST ROLES
• Data Scientist
– Focus: Extracting insights through advanced analytics and ML
– Skills: Statistics, ML algorithms, Python/R, storytelling
– Tools: Jupyter, RStudio, PyTorch, TensorFlow, scikit-learn
– Example task: Build a churn prediction model using customer data
• Data Engineer
– Focus: Building data infrastructure - pipelines, warehouses
– Skills: Programming, data modeling, ETL, cloud platforms
– Tools: SQL, Hadoop, Spark, Airflow, Kafka, AWS/GCP
– Example task: Create an ETL pipeline to move data from PostgreSQL to Redshift
• Data Analyst
– Focus: Deriving business insights through reporting and visualization
– Skills: SQL, statistics, communication, focus on business needs
– Tools: Looker, Tableau, Excel, Power BI
– Example task: Create a sales KPI dashboard in Tableau connected to a SQL database
SOME KEY DATA ENGINEERING TOOLS
PostgreSQL:
• Open source relational database management system
• Supports both SQL for structured queries and JSON for flexibility
• Used as primary database for many web and analytics applications
• Handles large datasets while ensuring ACID compliance
• Extensible through additional modules and plugins
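To make the SQL-plus-JSON point concrete, here is a minimal Python sketch using the psycopg2 driver. The connection parameters and the products table with its JSONB attributes column are hypothetical.

```python
import psycopg2  # assumes the psycopg2 package is installed

# Hypothetical connection parameters, for illustration only.
conn = psycopg2.connect(host="localhost", dbname="shop",
                        user="app", password="secret")
cur = conn.cursor()

# A structured SQL query combined with PostgreSQL's JSON operators:
# ->> extracts a field as text from the (hypothetical) JSONB column.
cur.execute("""
    SELECT id, name, attributes->>'color' AS color
    FROM products
    WHERE attributes->>'category' = %s
""", ("electronics",))

for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```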
Apache Kafka:
• Distributed streaming platform for publishing and subscribing to data streams
• Provides durable and fault-tolerant messaging system
• Used for building real-time data pipelines between systems
• Enables scalable data ingestion and integration
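The publish/subscribe model is easiest to see in code. Below is a minimal sketch using the kafka-python client; the broker address, the pos-events topic, and the event payload are illustrative assumptions.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # assumes kafka-python is installed

# Publish a point-of-sale event to a hypothetical "pos-events" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("pos-events", {"store_id": 42, "sku": "A-100", "qty": 3})
producer.flush()

# Subscribe to the same topic (typically from a separate process).
consumer = KafkaConsumer(
    "pos-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # loops until interrupted
```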
SOME KEY DATA ENGINEERING TOOLS
(CONT.)
Apache Spark:
• Open source distributed general-purpose cluster computing framework
• Used for large-scale data processing, batch analytics, machine learning
• Integrates with Kafka, Cassandra, HBase and other big data tech
• Programming interfaces for Java, Scala, Python, R
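As a small taste of the PySpark API, here is a hedged sketch of a batch aggregation job; the input file and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-batch").getOrCreate()

# Read a hypothetical CSV of raw sales records into a distributed DataFrame.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate units sold per product - a typical batch-analytics step.
summary = sales.groupBy("product_id").agg(F.sum("qty").alias("units_sold"))
summary.show()

spark.stop()
```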
Other tools:
• MongoDB: Popular open source NoSQL database
• Airflow: Workflow orchestration and pipeline scheduling
• Tableau, Power BI: Business intelligence and visualization
• AWS, GCP: Cloud infrastructure and services
This gives a high-level overview of some foundational open source tools like PostgreSQL,
Kafka and Spark, as well as commercial cloud data services. Many more exist in the data
engineering ecosystem.
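For a flavor of the NoSQL side mentioned above, here is a quick sketch using MongoDB's Python driver (pymongo); the database, collection, and document are hypothetical.

```python
from pymongo import MongoClient  # assumes pymongo is installed

client = MongoClient("mongodb://localhost:27017")
events = client["shop"]["events"]  # a schemaless collection

# Documents need no predefined schema - fields can vary per document.
events.insert_one({"type": "page_view", "user_id": 7, "path": "/cart"})
print(events.count_documents({"type": "page_view"}))
```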
DATA PIPELINES AND ARCHITECTURE
1. Data extraction - Extract relevant data from various sources like databases,
APIs, files, streams etc.
2. Data validation - Validate and filter data to remove anomalies, errors, duplicate
values.
3. Data transformation - Transform data by applying business logic, aggregations,
merging datasets etc.
4. Data loading - Load processed data into target databases, data warehouses,
lakes etc.
5. Data modeling - Apply structure, create data models optimized for downstream
uses.
6. Data analysis - Analyze data to gain insights through analytics, machine
learning, visualization.
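The stages above can be compressed into a toy, single-machine sketch. The following Python example uses pandas with hypothetical file names and columns; real pipelines distribute these steps across the tools described earlier.

```python
import pandas as pd  # writing parquet also assumes pyarrow is installed

# 1-2. Extract and validate: read raw orders, drop duplicates and bad rows.
raw = pd.read_csv("raw_orders.csv")  # hypothetical source file
clean = raw.drop_duplicates(subset="order_id").dropna(subset=["amount"])

# 3. Transform: apply business logic - daily revenue per store.
clean["order_date"] = pd.to_datetime(clean["order_date"]).dt.date
daily = (clean.groupby(["store_id", "order_date"], as_index=False)["amount"]
              .sum()
              .rename(columns={"amount": "daily_revenue"}))

# 4-5. Load and model: write a structured table for downstream analysis.
daily.to_parquet("daily_revenue.parquet", index=False)
```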
AN EXAMPLE OF A COMMON DATA
PIPELINE
• Data Source:
– Transactional database storing point-of-sale (POS) data from all stores
• Data Ingestion:
– POS data replicated to AWS S3 data lake using AWS Data Pipeline tool
• Data Processing:
– Spark job triggered daily to transform POS data
– Join to product catalog to enrich data
– Aggregate daily sales by product, store, region
– Generate summary statistics like total sales, units sold
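A hedged PySpark sketch of the daily job described above might look as follows; the lake paths, table layouts, and column names are all illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-pos-agg").getOrCreate()

pos = spark.read.parquet("s3://retail-lake/raw/pos/")           # raw POS events
catalog = spark.read.parquet("s3://retail-lake/ref/products/")  # product catalog

# Enrich POS events with product attributes, then aggregate daily sales.
enriched = pos.join(catalog, on="product_id", how="left")
daily = (enriched.groupBy("sale_date", "product_id", "store_id", "region")
                 .agg(F.sum("amount").alias("total_sales"),
                      F.sum("qty").alias("units_sold")))

daily.write.mode("overwrite").parquet("s3://retail-lake/agg/daily_sales/")
spark.stop()
```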
AN EXAMPLE OF A COMMON DATA
PIPELINE (CONT.)
• Data Storage:
– Aggregated data stored back in data lake
– Also loaded into Redshift data warehouse
• Data Consumption:
– Business analysts access data in Redshift to generate reports
– Data scientists access data in data lake to build models
– Executives access dashboards visualizing sales data
In this retail pipeline, raw transaction data is extracted, aggregated, and loaded
into analytics systems for use across the organization, powering data-driven
business decisions.
THE MODERN DATA PIPELINE DIAGRAM
[Diagram not reproduced: data flows from sources through ingestion, compute and storage to a serving layer, coordinated by orchestration, with monitoring and security throughout - see the components below.]
KEY ARCHITECTURAL COMPONENTS OF A
TYPICAL DATA PIPELINE
Data Sources:
• Applications, databases, APIs, files that produce and provide raw data to be consumed
downstream.
• Examples: CRM systems, transactional databases, IoT devices, web server logs.
Data Ingestion:
• Processes and tools for collecting and moving data from sources into the pipeline.
• Example: Streaming frameworks like Kafka and Flink, or batch ETL tools like Talend.
Compute:
• Infrastructure to run data processing workloads and applications.
• Example: Provisioned VMs, Docker containers, serverless platforms.
KEY ARCHITECTURAL COMPONENTS OF A
TYPICAL DATA PIPELINE (CONT.)
Storage:
• Data stores to land raw data and serve processed data.
• Example: Cloud storage like S3, Azure Blob, analytical databases like Snowflake.
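Landing a processed file in cloud storage is often a one-liner. A minimal boto3 sketch, with a hypothetical bucket and key:

```python
import boto3  # assumes boto3 is installed and AWS credentials are configured

s3 = boto3.client("s3")
# Upload a local file to a hypothetical bucket/key in the data lake.
s3.upload_file("daily_revenue.parquet", "retail-lake", "agg/daily_revenue.parquet")
```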
Orchestration:
• Tools to schedule and sequence pipeline tasks and manage workflows.
• Example: Workflow engines like Apache Airflow, Azkaban.
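A minimal Airflow DAG sketch shows how tasks are scheduled and sequenced; the task bodies, DAG name, and schedule here are placeholders (Airflow 2.4+ style).

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from sources")      # placeholder task body

def transform():
    print("apply business logic")            # placeholder task body

def load():
    print("write results to the warehouse")  # placeholder task body

with DAG(
    dag_id="nightly_sales_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3  # extract runs first, then transform, then load
```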
Serving Layer:
• Interfaces to deliver analyzed data for consumption by applications.
• Example: APIs, reports, dashboards, ML models.
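One common serving pattern is a small REST API over the processed data. A minimal sketch using FastAPI, with a hypothetical parquet file as the backing store:

```python
from fastapi import FastAPI
import pandas as pd

app = FastAPI()

@app.get("/sales/daily")
def daily_sales():
    # Read the (hypothetical) aggregated table and return it as JSON records.
    df = pd.read_parquet("daily_revenue.parquet")
    return df.to_dict(orient="records")
```

Run with an ASGI server such as uvicorn, e.g. `uvicorn serve:app` (assuming the file is saved as serve.py).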
KEY ARCHITECTURAL COMPONENTS OF A
TYPICAL DATA PIPELINE (CONT.)
Monitoring:
• Tracking pipeline runs, performance metrics, data quality, lineage.
• Example: Logging, metrics collection, data profiling.
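In practice, monitoring often starts with simple per-run data-quality checks. A toy sketch with pandas; the file, thresholds, and column names are illustrative assumptions.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Collect simple quality metrics to log or send to a metrics system."""
    return {
        "row_count": len(df),
        "null_amounts": int(df["amount"].isna().sum()),
        "duplicate_ids": int(df["order_id"].duplicated().sum()),
    }

df = pd.read_parquet("orders.parquet")  # hypothetical pipeline output
metrics = profile(df)
assert metrics["null_amounts"] == 0, "null values found in amount column"
print(metrics)
```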
Security:
• Authentication, access control, encryption of data in transit and at rest.
These components work together to build a complete pipeline that
delivers business value.