Foundations of Data Engineering
Thomas Neumann
Introduction
About this Lecture
The goal of this lecture is to teach the standard tools and techniques for
large-scale data processing.
Related keywords include:
• Big Data
• cloud computing
• scalable data processing
• ...
We start with an overview, and then dive into individual topics.
Goals and Scope
Note that this lecture emphasizes practical usage (after the introduction):
• we cover many different approaches and techniques
• but all of them will be used in a practical manner, both in the exercises
and in the lecture
We cover both concepts and usage:
• what software layers are used to handle Big Data?
• what are the principles behind this software?
• which kind of software would one use for which data problem?
• how do I use the software for a concrete problem?
Some Pointers to Literature
These are not required for the course, but may be useful for reference:
• The Datacenter as a Computer: An Introduction to the Design of
Warehouse-Scale Machines
• Hadoop: The Definitive Guide
• Big Data Processing with Apache Spark
• Big Data Infrastructure course by Peter Boncz
The Age of Big Data
• 1,527,877 GB per minute ≈ 1,500 TB per minute ≈ 1,000 disk drives per minute (a 20 m stack every minute)
• 4 zettabytes ≈ 3 billion drives
“Big Data”
The Data Economy
Data Disrupting Science
Scientific paradigms:
• Observing
• Modeling
• Simulating
• Collecting and Analyzing Data
Big Data
Big Data is a relative term
• if things are breaking, you have Big Data
  – Big Data is not always petabytes in size
  – Big Data for informatics is not the same as for Google
• Big Data is often hard to understand
  – a model explaining it might be as complicated as the data itself
  – this has implications for science
• the game may be the same, but the rules are completely different
  – what used to work needs to be reinvented in a different context
Big Data Challenges (1/3)
• Volume
  – data larger than a single machine (CPU, RAM, disk)
  – infrastructures and techniques that scale by using more machines
  – Google led the way in mastering “cluster data processing”
• Velocity
• Variety
Supercomputers?
• take the top two supercomputers in the world today
  – Tianhe-2 (Guangzhou, China): cost US$390 million
  – Titan (Oak Ridge National Laboratory, US): cost US$97 million
• assume an expected lifetime of five years and compute the cost per hour
  – Tianhe-2: US$8,220
  – Titan: US$2,214
• this is just for the machine showing up at the door
  – operational costs (e.g. running, maintenance, power) are not factored in
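The per-hour figures follow from amortizing the purchase price over the machine's lifetime; a minimal sketch (Titan's quoted figure reproduces exactly; rounding and lifetime assumptions can shift the result slightly):

```python
# Amortize a machine's purchase price over its expected lifetime.
def cost_per_hour(price_usd, lifetime_years=5):
    hours = lifetime_years * 365 * 24   # 43,800 hours in five years
    return price_usd / hours

print(f"Titan: US${int(cost_per_hour(97_000_000)):,} per hour")  # US$2,214
```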
Let’s rent a supercomputer for an hour!
Amazon Web Services charges US$1.60 per hour for a large instance
• an 880-instance cluster would cost US$1,408 per hour
• data costs US$0.15 per GB to upload
  – assume we want to upload 1 TB
  – this would cost US$153
• the resulting setup would rank #146 among the world’s top-500 machines
• total cost: US$1,561 per hour
• search for: LINPACK 880 server
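These numbers are simple arithmetic (using 1 TB = 1,024 GB, which matches the US$153 figure):

```python
# Hourly cost of an 880-instance cluster plus a 1 TB upload,
# at the prices quoted above.
compute = 880 * 1.60    # US$1,408 per hour
upload  = 1024 * 0.15   # US$153.60 for 1 TB at US$0.15/GB
print(f"total: US${int(compute + upload):,}")  # US$1,561
```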
Supercomputing vs. Cluster Computing
• Supercomputing
  – focus on performance (biggest, fastest), at any cost!
  – oriented towards the [secret] government sector / scientific computing
  – programming effort seems less relevant
  – Fortran + MPI: months to develop and debug programs
  – GPU, i.e. computing with graphics cards
  – FPGA, i.e. casting computation in hardware circuits
  – assumes high-quality, stable hardware
• Cluster Computing
  – use a network of many computers to create a ‘supercomputer’
  – oriented towards business applications
  – uses cheap servers (or even desktops), unreliable hardware
  – software must make the unreliable parts reliable
  – focus on economics (bang for the buck)
  – programming effort counts, a lot! No time to lose on debugging..
Cloud Computing vs Cluster Computing
• Cluster Computing
  – solving large tasks with more than one machine
  – parallel database systems (e.g. Teradata, Vertica)
  – NoSQL systems
  – Hadoop / MapReduce
• Cloud Computing
  – machines operated by a third party in large data centers
  – sysadmin, electricity, backup, maintenance externalized
  – rent access by the hour
  – renting machines (Linux boxes): Infrastructure-as-a-Service
  – renting systems (Redshift SQL): Platform-as-a-Service
  – renting a software solution (Salesforce): Software-as-a-Service
• independent concepts, but they are often combined!
Economics of Cloud Computing
• a major argument for Cloud Computing is pricing:
  – we could own our machines
  – ... and pay for electricity, cooling, operators
  – ... and allocate enough capacity to deal with peak demand
  – since machines rarely operate at more than 30% capacity, we are paying for wasted resources
• pay-as-you-go rental model
  – rent machine instances by the hour
  – pay for storage by space/month
  – pay for bandwidth by volume transferred
• no other costs
• this makes computing a commodity
  – just like other commodity services (sewage, electricity, etc.)
• some caveats though; we look at them later
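To make the 30%-utilization argument concrete, here is a toy owning-vs-renting comparison; all prices are hypothetical illustrations, not actual provider rates:

```python
# Toy owning-vs-renting comparison (made-up prices).
own_rate    = 0.50   # US$/h, amortized purchase + power + ops, paid 24/7
rent_rate   = 1.60   # US$/h, rented instance, paid only while in use
utilization = 0.30   # machines rarely run above 30% capacity

hours_per_year = 24 * 365
owning  = own_rate * hours_per_year                  # pay even when idle
renting = rent_rate * hours_per_year * utilization   # pay only for use
print(f"owning:  US${owning:,.0f}/year")   # US$4,380/year
print(f"renting: US${renting:,.0f}/year")  # US$4,205/year
```

With these numbers the break-even point is at 0.50/1.60 ≈ 31% utilization: below it renting wins even at more than three times the hourly price, above it owning becomes cheaper again.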
Cloud Computing: Provisioning
We can quickly scale resources as demand dictates
• high demand: more instances
• low demand: fewer instances
Elastic provisioning is crucial
• Target (US retailer) uses Amazon Web Services (AWS) to host [Link]
• during massive spikes (November 28, 2009, “Black Friday”) [Link] is unavailable
• remember your panic when Facebook was down?
[figure: demand over time vs. provisioned capacity, showing under- and overprovisioning]
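A minimal sketch of elastic vs. fixed provisioning, with hypothetical demand figures: fixed provisioning must be sized for the peak, while elastic provisioning tracks demand period by period:

```python
# Hypothetical request rates, with one Black-Friday-style spike.
demand = [10, 12, 15, 90, 14, 11, 10]
capacity_per_instance = 5   # requests one instance can serve

fixed_fleet = max(demand) // capacity_per_instance + 1        # sized for peak
elastic     = [d // capacity_per_instance + 1 for d in demand]

print("fixed instance-hours:  ", fixed_fleet * len(demand))   # 133
print("elastic instance-hours:", sum(elastic))                # 38
```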
Cloud Computing: some rough edges
• some provider hosts our data
  – but we can only access it using proprietary (non-standard) APIs
  – lock-in makes customers vulnerable to price increases and dependent upon the provider
  – local laws (e.g. privacy) might prohibit externalizing data processing
• providers may control our data in unexpected ways:
  – July 2009: Amazon remotely removed books from Kindles
  – Twitter prevents exporting tweets more than 3,200 posts back
  – Facebook locks user data in
  – paying customers forced off Picasa towards Google Plus
• anti-terror laws mean that providers have to grant access to governments
  – this privilege can be overused
Privacy and Security
• people will not use Cloud Computing if trust is eroded
  – who can access it? governments? other people?
  – Snowden is the Chernobyl of Big Data
  – privacy guarantees need to be clearly stated and kept to
• privacy breaches
  – numerous examples of Web mail accounts hacked
  – many, many cases of (UK) governmental data loss
  – TJX Companies Inc. (2007): 45 million credit and debit card numbers stolen
  – every day there seems to be another instance of private data being leaked to the public
High performance and low latency
• how quickly data moves around the network
  – total system latency is a function of memory, CPU, disk, and network
  – CPU speed is often only a minor aspect
• examples
  – algorithmic trading (put the data centre near the exchange); whoever can execute a trade the fastest wins
  – simulations of physical systems
  – search results
  – Google 2006: increasing page load time by 0.5 seconds produces a 20% drop in traffic
  – Amazon 2007: for every 100 ms increase in load time, sales decrease by 1%
  – Google’s web search rewards pages that load quickly
Big Data Challenges (2/3)
• Volume
• Velocity
  – endless stream of new events
  – no time for heavy indexing (new data arrives continuously)
  – led to the development of data stream technologies
• Variety
Big Streaming Data
• storing it is not really a problem: disk space is cheap
• efficiently accessing it and deriving results can be hard
• visualising it can be next to impossible
• repeated observations
  – what makes Big Data big are repeated observations
  – mobile phones report their locations every 15 seconds
  – people post more than 100 million tweets a day
  – the Web changes every day
  – potentially we need unbounded resources
  – repeated observations motivate streaming algorithms
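One classic streaming technique motivated here is reservoir sampling: keep a uniform random sample of k items from a stream of unknown, potentially unbounded length, using only O(k) memory. A minimal sketch:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Uniform random sample of k items from a stream, in O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # replacement becomes rarer over time
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=5))
```

The stream is consumed once and never stored; only the k-item reservoir lives in memory, which is exactly what repeated, unbounded observations demand.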
Big Data Challenges (3/3)
• Volume
• Velocity
• Variety
  – dirty, incomplete, inconclusive data (e.g. text in tweets)
  – semantic complications:
    · AI techniques needed, not just database queries
    · data mining, data cleaning, text analysis
    · techniques from other DEA lectures should be used in Big Data
  – technical complications:
    · skewed value distributions and “power laws”
    · complex graph structures, expensive random access
    · complicates cluster data processing (difficult to partition equally)
    · localizing data by attaching pieces where you need them makes Big Data even bigger
Power laws
• Big Data typically obeys a power law
• modelling the head is easy, but may not be representative of the full population
  – dealing with the full population might imply Big Data (e.g., selling all books, not just blockbusters)
• processing Big Data might reveal power laws
  – most items take a small amount of time to process
  – a few items take a lot of time to process
• understanding the nature of data is key
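A quick sketch of how top-heavy a Zipf-distributed population is (frequencies proportional to 1/rank, a standard idealization of a power law):

```python
# Zipf's law with exponent 1: the k-th most popular item has weight 1/k.
n_items = 10_000
freq = [1.0 / rank for rank in range(1, n_items + 1)]
total = sum(freq)

head_share = sum(freq[:100]) / total   # the 100 best-sellers
print(f"top 100 of {n_items:,} items hold {head_share:.0%} of all sales")  # ≈ 53%
```

Modelling just those 100 head items is easy, but the other 9,900 still carry roughly half the volume, which is exactly the tail a full-population model must cover.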
Skewed Data
• distributed computation is a natural way to tackle Big Data
  – MapReduce encourages sequential, disk-based, localised processing of data
  – MapReduce operates over a cluster of machines
• one consequence of power laws is uneven allocation of data to nodes
  – the head might go to one or two nodes
  – the tail would spread over all other nodes
  – all workers on the tail would finish quickly
  – the head workers would be a lot slower
• power laws can turn parallel algorithms into sequential algorithms
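A small sketch of that effect: hash-partition a Zipf-skewed key stream over four workers; whichever worker owns the hottest key receives a disproportionate share of all records (keys and counts are synthetic):

```python
from collections import Counter
from zlib import crc32

# Zipf-ish synthetic stream: key 1 occurs 10,000 times, key k 10,000//k times.
records = []
for rank in range(1, 101):
    records += [f"key{rank}"] * (10_000 // rank)

# Hash partitioning sends every copy of a key to the same worker.
load = Counter(crc32(key.encode()) % 4 for key in records)
print("records per worker:", sorted(load.values(), reverse=True))
```

Because all 10,000 copies of the hottest key land on one worker, that worker alone dominates the runtime while the tail workers finish early and idle.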
Summary
Introduced the notion of Big Data and the three V’s
Explained Super-/Cluster-/Cloud Computing
We will come back to these in the lecture, but we will start simple
• given a complex data set, what should you do to analyze it?
• start with simple approaches, then become more and more complex
• we finish with cloud-scale computing, which is not always appropriate
• Big Data is not the same for everybody
Notes on the Technical Side
We will use a lot of tools during this lecture
• we concentrate on free and/or open-source tools
• they are generally available for all major platforms
• we strongly suggest using a Linux system, though
• ideally a recent Ubuntu/Debian system
• other systems should work too, but you are on your own
• using a virtual machine is OK, and might be easier than a native Linux system