DATA ENGINEERING

INTRODUCTION TO DATA
ENGINEERING
INTRODUCTION

• Data has become a crucial asset for businesses and organizations in the digital
era.
• Data refers to information collected from various sources and formats that can be
processed and analyzed to derive insights and enable data-driven decision
making.
• Sources of data include operational systems, databases, APIs, IoT devices,
websites, social media and more. Data can be structured, unstructured or semi-
structured.
• Big data has become a popular buzzword, fueled by technological advancements
in both software and hardware.
INTRODUCTION (CONT.)
Key uses of data in business:
• Customer analytics - Analyze customer behaviors, preferences, and trends to improve marketing and
customer experiences, e.g. recommendation engines.
• Operational analytics - Optimize business operations like supply chain, logistics, asset utilization by
finding inefficiencies using data.
• Fraud detection - Identify anomalies and patterns in transactions to detect fraudulent activities.
• Risk management - Use data modeling and simulations to identify, quantify and mitigate business
risks.
• Predictive analytics - Make data-driven forecasts and predictions about future outcomes using
techniques like machine learning.
• Personalization - Customize products, services, content by analyzing individual customer data and
preferences.
• Data-driven decision making - Leverage insights from data analysis to guide business strategy and
important decisions.
AN OVERVIEW OF BIG DATA

• Big data refers to extremely large, complex datasets that traditional data processing
tools cannot easily store, manage or analyze. Big data has the following key
characteristics:
– Volume - Size of datasets in terabytes and petabytes. For example, a petabyte of
server log data.
– Velocity - Speed at which data is generated and processed. For example, millions
of financial transactions per minute.
– Variety - Different structured, semi-structured and unstructured data types like
databases, text, audio, video.
– Veracity - Inconsistency and uncertainty around data accuracy and quality.
– Value - Deriving business value from large datasets using analytics and machine
learning.

• Big data technologies include Hadoop, Spark, Kafka, NoSQL databases, and data lakes.
AN OVERVIEW OF BIG DATA (CONT.)

• Big data cannot be used as a direct input for decision-making due to its complexity. It must be interpreted, and in order to do that, organizations require competent individuals who can use the appropriate data as a starting point, correctly apply data analysis techniques, and interpret the data in the context of organizational policies and the business environment.
• Data engineers build and maintain the foundation for data-driven
decision making in organizations through reliable and scalable data
infrastructure and pipelines.
TYPES OF DATA (STRUCTURED)

• Structured Data (key characteristics):


– Organized in a predefined data model like rows and columns
– Stored in tables with relationships, like relational databases
– Each field has a strict data type like string, integer, date, etc.
– Queries and analysis can be easily performed on structured data using simple
declarative languages like SQL
• Examples: SQL databases, spreadsheet tables
o Numeric data like salaries, ages, temperatures
o Categorical data like gender, product IDs, zip codes
o Timestamps, dates, times
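
To make this concrete, here is a minimal sketch of querying structured data with SQL from Python using the built-in sqlite3 module; the employees table and its values are hypothetical examples of typed rows and columns.

# A minimal sketch: declarative SQL over structured, typed data.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL, hired DATE)")
conn.execute("INSERT INTO employees VALUES (1, 'Alice', 72000, '2021-03-15')")
conn.execute("INSERT INTO employees VALUES (2, 'Bob', 65000, '2022-07-01')")

# A simple declarative query over rows and columns with strict data types
for row in conn.execute("SELECT name, salary FROM employees WHERE salary > 70000"):
    print(row)                              # ('Alice', 72000.0)
conn.close()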
TYPES OF DATA (SEMI-STRUCTURED)

• Semi-Structured Data (key characteristics):


– It does not conform to a fixed schema like structured data. New fields can be added flexibly.
– Common semi-structured data formats include JSON, XML, YAML, CSV, etc.
– Querying and analyzing semi-structured data is more challenging than structured data, but easier
than unstructured data.
– Techniques like XPath can be used to query XML data, while JSON data can be processed in Python with the json library or stored and queried in MongoDB.
– Tools like Apache Spark support handling of semi-structured data using underlying structs and arrays.
– Semi-structured data is common when there is a need for human-readable formatting along with
easy information exchange between systems.
– Examples include web APIs that return JSON data, health records in XML format, product feeds in CSV
format.
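
As a small illustration of the json library mentioned above, the following sketch parses a hypothetical API response; the field names are invented for the example.

import json

# A hypothetical API response: fields can vary from record to record,
# which is what makes this data semi-structured.
payload = '{"order_id": 1001, "customer": {"name": "Dana"}, "items": ["pen", "notebook"]}'

record = json.loads(payload)                 # parse JSON text into Python objects
print(record["customer"]["name"])            # Dana
print(len(record["items"]))                  # 2

# Missing fields can be handled gracefully rather than violating a fixed schema
print(record.get("coupon_code", "none"))     # none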
TYPES OF DATA (UNSTRUCTURED)

• Unstructured Data (key characteristics):


– Unstructured data does not have identifiable fields to separate different elements. It is free-flowing.
– It cannot be easily searched or queried using simple declarative statements.
– Common unstructured data types include text, images, audio, video, social media posts, emails,
presentations, webpages and more.
– Unstructured data analysis requires techniques like text mining, image analysis, speech recognition,
natural language processing.
– Since it lacks structure, storing unstructured data is more challenging than structured data. Object
stores, data lakes are common solutions.
– Transforming unstructured data requires developing metadata, extracting entities, tagging elements, etc. to make it more structured.
– By most estimates, over 80% of the data generated today is unstructured.
– Unstructured data can provide valuable signals for use cases like sentiment analysis, content
recommendations, search optimization.
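
As a rough illustration of text mining on unstructured data, the sketch below counts word frequencies in a short piece of free text using only the Python standard library; real systems would apply much richer NLP techniques.

import re
from collections import Counter

review = "Great product, great price. Delivery was slow but support was great."

tokens = re.findall(r"[a-z']+", review.lower())     # crude tokenization
counts = Counter(tokens)
print(counts.most_common(3))                        # e.g. [('great', 3), ('was', 2), ...]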
OVERVIEW OF DATA ENGINEERING

• Data engineering is a crucial discipline within the realm of data management and analytics. It encompasses the processes, techniques, and tools used to design, develop, and maintain the architecture, pipelines, and infrastructure required for collecting, storing, processing, and delivering data to be used for various analytical and operational purposes.
• Data engineering focuses on ensuring that data is organized,
accessible, and ready for analysis, helping organizations derive
insights and make informed decisions.
ROLES AND RESPONSIBILITIES OF A DATA
ENGINEER
• Design and build data pipeline architecture - Architect the flow of data from diverse
sources to destinations to meet business needs. Select the right systems and tools.
• Develop data ingestion processes - Implement workflows and systems to extract,
validate, and integrate data from sources like databases, APIs, files.
• Transform and cleanse data - Transform data to make it analysis-ready by cleansing,
standardizing, enriching, and shaping data as required.
• Develop and maintain data storage systems - Build and operate database systems like
PostgreSQL, MySQL, MongoDB, HBase to store data for applications and analytics.
• Build data processing systems - Develop batch and real-time processing systems on
distributed platforms like Hadoop, Spark, Flink, Kafka Streams.
ROLES AND RESPONSIBILITIES OF A DATA
ENGINEER
• Create data modeling structures - Model relational and non-relational data to prepare
it for analysis use cases.
• Automate and schedule data pipelines - Use workflow scheduler tools like Airflow to
automate data pipeline tasks and processes (see the sketch at the end of this slide).
• Monitor, analyze and troubleshoot - Continuously monitor data pipelines, optimize
performance, resolve issues.
• Implement security and compliance - Apply security controls around data access,
encryption, masking and anonymization to meet regulations.
Data engineers build and maintain the foundation for data-driven decision making in
organizations through reliable and scalable data infrastructure and pipelines.
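
Below is a minimal, illustrative Apache Airflow sketch of an automated daily pipeline, as referenced above; the DAG name, schedule, and placeholder tasks are assumptions made for the example, not part of the lecture material.

# A minimal Airflow sketch (assumed names and schedule): one DAG that
# extracts, transforms, and loads data once per day.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass   # placeholder: pull data from a source system

def transform():
    pass   # placeholder: cleanse and shape the data

def load():
    pass   # placeholder: write results to the warehouse

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # called schedule_interval on older Airflow versions
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3            # dependencies: extract, then transform, then load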
PATHWAYS TO BECOME A DATA ENGINEER

• Start as a data analyst or business analyst and make the move to data
engineering. Build on SQL, ETL, reporting skills.
• Complete a data engineering bootcamp or certification program to gain
specialized skills. Programs focus on tools like Hadoop, Spark, SQL, NoSQL,
Python.
• Develop proficiency with essential data engineering tools like SQL, Python,
Hadoop, Spark, Kafka, Airflow through courses, certifications and practice.
• Stay up-to-date on trends in data tech like containers, streaming, cloud
platforms to align with industry needs.
A COMPARISON OF DATA SCIENCE, DATA
ENGINEERING, AND DATA ANALYST ROLES
• Data Scientist
 Focuses on extracting insights through advanced analytics and ML
 Skills: Statistics, ML algorithms, Python/R, storytelling
 Tools: Jupyter, RStudio, PyTorch, TensorFlow, scikit-learn
 Example task: Build churn prediction model using customer data
• Data Engineer
 Focuses on building data infrastructure - pipelines, warehouses
 Skills: Programming, data modeling, ETL, cloud platforms
 Tools: SQL, Hadoop, Spark, Airflow, Kafka, AWS/GCP
 Example task: Create ETL pipeline to move data from PostgreSQL to Redshift
• Data Analyst
 Focuses on deriving business insights through reporting and visualization
 Skills: SQL, statistics, communication, focus on business needs
 Tools: Looker, Tableau, Excel, Power BI
 Example task: Create sales KPI dashboard in Tableau connected to SQL database
SOME KEY DATA ENGINEERING TOOLS
PostgreSQL:
• Open source relational database management system
• Supports both SQL for structured queries and JSON for flexibility
• Used as primary database for many web and analytics applications
• Handles large datasets while ensuring ACID compliance
• Extensible through additional modules and plugins
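
As a small illustration, the sketch below queries PostgreSQL from Python using the psycopg2 driver; the connection details and the sales table are hypothetical.

import psycopg2

# Assumed credentials and database; replace with real connection settings.
conn = psycopg2.connect(host="localhost", dbname="analytics",
                        user="etl_user", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("SELECT store_id, SUM(amount) FROM sales GROUP BY store_id")
    for store_id, total in cur.fetchall():
        print(store_id, total)
conn.close()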
Apache Kafka:
• Distributed streaming platform for publishing and subscribing to data streams
• Provides durable and fault-tolerant messaging system
• Used for building real-time data pipelines between systems
• Enables scalable data ingestion and integration
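
The following sketch publishes one event to a Kafka topic with the third-party kafka-python client; the broker address, topic name, and payload are assumptions made for the example.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # JSON-encode messages
)
producer.send("pos-transactions", {"store": 12, "sku": "A-100", "amount": 19.99})
producer.flush()   # block until the message is actually delivered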
SOME KEY DATA ENGINEERING TOOLS
(CONT.)
Apache Spark:
• Open source distributed general-purpose cluster computing framework
• Used for large-scale data processing, batch analytics, machine learning
• Integrates with Kafka, Cassandra, HBase and other big data tech
• Programming interfaces for Java, Scala, Python, R
Other tools:
• MongoDB: Popular open source NoSQL database
• Airflow: Workflow orchestration and pipeline scheduling
• Tableau, Power BI: Business intelligence and visualization
• AWS, GCP: Cloud infrastructure and services
This gives a high-level overview of some foundational open source tools like PostgreSQL,
Kafka and Spark, as well as commercial cloud data services. Many more exist in the data
engineering ecosystem.
DATA PIPELINES AND ARCHITECTURE

• A data pipeline is the end-to-end flow of data from a raw, unprocessed state to a cleaned and analyzed state ready for applications and decision making. The key stages include:
Data Extraction → Data Validation → Data Transformation → Data Loading → Data Modeling → Data Analysis → Data Monitoring → Data Governance
DATA PIPELINES STAGES

1. Data extraction - Extract relevant data from various sources like databases,
APIs, files, streams etc.
2. Data validation - Validate and filter data to remove anomalies, errors, duplicate
values.
3. Data transformation - Transform data by applying business logic, aggregations,
merging datasets etc.
4. Data loading - Load processed data into target databases, data warehouses,
lakes etc.
5. Data modeling - Apply structure, create data models optimized for downstream
uses.
6. Data analysis - Analyze data to gain insights through analytics, machine
learning, visualization.
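
A toy sketch of stages 1-4 in plain Python is shown below; real pipelines would use the distributed tools covered in this chapter, and the sample records are invented for illustration.

# 1. Data extraction: these rows stand in for data pulled from a source system.
raw_rows = [
    {"order_id": 1, "region": "North", "amount": "120.50"},
    {"order_id": 2, "region": "North", "amount": "not-a-number"},   # bad record
    {"order_id": 3, "region": "South", "amount": "80.00"},
]

# 2. Data validation: drop rows whose amount is not numeric.
def is_valid(row):
    try:
        float(row["amount"])
        return True
    except ValueError:
        return False

validated = [r for r in raw_rows if is_valid(r)]

# 3. Data transformation: aggregate sales by region.
totals = {}
for r in validated:
    totals[r["region"]] = totals.get(r["region"], 0.0) + float(r["amount"])

# 4. Data loading: here we just print; a real pipeline would write to a warehouse.
print(totals)                      # {'North': 120.5, 'South': 80.0}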
AN EXAMPLE OF A COMMON DATA
PIPELINE
• Data Source:
– Transactional database storing point-of-sale (POS) data from all stores
• Data Ingestion:
– POS data replicated to AWS S3 data lake using AWS Data Pipeline tool
• Data Processing:
– Spark job triggered daily to transform POS data
– Join to product catalog to enrich data
– Aggregate daily sales by product, store, region
– Generate summary statistics like total sales, units sold
AN EXAMPLE OF A COMMON DATA
PIPELINE (CONT.)
• Data Storage:
– Aggregated data stored back in data lake
– Also loaded into Redshift data warehouse
• Data Consumption:
– Business analysts access data in Redshift to generate reports
– Data scientists access data in data lake to build models
– Executives access dashboards visualizing sales data
In this retail pipeline, raw transaction data is extracted, aggregated, and loaded
into analytics systems for usage across the organization. This powers data-driven
business decisions.
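
A hedged PySpark sketch of the daily processing step might look like the following; the S3 paths, column names, and file formats are assumptions rather than details from the slides.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_pos_aggregation").getOrCreate()

pos = spark.read.parquet("s3://example-data-lake/pos/2024-06-01/")       # raw POS data
catalog = spark.read.parquet("s3://example-data-lake/product_catalog/")  # product catalog

daily_sales = (
    pos.join(catalog, on="product_id", how="left")           # enrich with product details
       .groupBy("sale_date", "product_id", "store_id", "region")
       .agg(F.sum("amount").alias("total_sales"),
            F.sum("quantity").alias("units_sold"))
)

# Write aggregates back to the lake; loading into Redshift would be a separate step.
daily_sales.write.mode("overwrite").parquet("s3://example-data-lake/daily_sales/")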
THE MODERN DATA PIPELINE DIAGRAM
KEY ARCHITECTURAL COMPONENTS OF A
TYPICAL DATA PIPELINE
Data Sources:
• Applications, databases, APIs, files that produce and provide raw data to be consumed
downstream.
• Examples: CRM systems, transactional databases, IoT devices, web server logs.
Data Ingestion:
• Processes and tools for collecting and moving data from sources into the pipeline.
• Examples: Streaming frameworks like Kafka and Flink; batch ETL tools like Talend.
Compute:
• Infrastructure to run data processing workloads and applications.
• Examples: Provisioned VMs, Docker containers, serverless platforms.
KEY ARCHITECTURAL COMPONENTS OF A
TYPICAL DATA PIPELINE (CONT.)
Storage:
• Data stores to land raw data and serve processed data.
• Example: Cloud storage like S3, Azure Blob, analytical databases like Snowflake.
Orchestration:
• Tools to schedule, sequence pipeline tasks and manage workflows.
• Example: Workflow engines like Apache Airflow, Azkaban.
Serving Layer:
• Interfaces to deliver analyzed data for consumption by applications.
• Example: APIs, reports, dashboards, ML models.
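
As one possible serving-layer sketch, the example below exposes aggregated results through a small Flask API; the endpoint, port, and hard-coded data are purely illustrative.

from flask import Flask, jsonify

app = Flask(__name__)

# In a real pipeline this would be read from the warehouse or data lake.
DAILY_SALES = {"North": 20500.75, "South": 18300.10}

@app.route("/api/daily-sales")
def daily_sales():
    return jsonify(DAILY_SALES)     # consumers: dashboards, reports, other services

if __name__ == "__main__":
    app.run(port=8000)              # for local experimentation only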
KEY ARCHITECTURAL COMPONENTS OF A
TYPICAL DATA PIPELINE (CONT.)
Monitoring:
• Tracking pipeline runs, performance metrics, data quality, lineage.
• Example: Logging, metrics collection, data profiling.
Security:
• Authentication, access control, encryption of data in transit and at rest.
These components work together to build a complete pipeline that
delivers business value.
