Data Engineering Fundamentals Explained

This document provides an introduction to the foundations of data engineering. It discusses the rise of big data due to increasing data volumes and variety, describes the challenges of big data (volume, velocity, and variety), and compares approaches such as supercomputing, cluster computing, and cloud computing. It also discusses the economics of cloud computing and issues around privacy and security in the cloud.


Foundations of Data Engineering

Thomas Neumann

Introduction

About this Lecture

The goal of this lecture is to teach the standard tools and techniques for
large-scale data processing.

Related keywords include:
• Big Data
• cloud computing
• scalable data processing
• ...
We start with an overview, and then dive into individual topics.


Goals and Scope

Note that this lecture emphasizes practical usage (after the introduction):
• we cover many different approaches and techniques
• but all of them will be used in a practical manner, both in exercises and in the lecture

We cover both concepts and usage:
• what software layers are used to handle Big Data?
• what are the principles behind this software?
• which kind of software would one use for which data problem?
• how do I use the software for a concrete problem?


Some Pointers to Literature

These are not required for the course, but might be useful for reference:
• The Datacenter as a Computer: An Introduction to the Design of
Warehouse-Scale Machines
• Hadoop: The Definitive Guide
• Big Data Processing with Apache Spark
• Big Data Infrastructure course by Peter Boncz


The Age of Big Data

• 1,527,877 GB/m = 1,500 TB/m = 1,000 drives/m = 20 m stack/m
• 4 zettabytes = 3 billion drives
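The numbers above can be checked with back-of-the-envelope arithmetic. The sketch below assumes (hypothetically) 1.5 TB drives about 2 cm thick; those two constants are not in the slides, only implied by the figures.

```python
# Back-of-the-envelope check of the slide's numbers.
DRIVE_TB = 1.5            # assumed drive capacity (TB)
DRIVE_THICKNESS_M = 0.02  # assumed drive thickness (m)

gb = 1_527_877
tb = gb / 1000                        # ~1,500 TB
drives = tb / DRIVE_TB                # ~1,000 drives
stack_m = drives * DRIVE_THICKNESS_M  # ~20 m stack

zettabytes = 4
total_tb = zettabytes * 1e9           # 1 ZB = 10^9 TB
total_drives = total_tb / DRIVE_TB    # ~2.7 billion, i.e. "3 billion" drives
```

With these assumed drive parameters, all four figures on the slide come out within rounding.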
Introduction

“Big Data”


The Data Economy


Data Disrupting Science

Scientific paradigms:
• Observing
• Modeling
• Simulating
• Collecting and Analyzing Data


Big Data

Big Data is a relative term:
• if things are breaking, you have Big Data
  – Big Data is not always petabytes in size
  – Big Data for Informatics is not the same as for Google
• Big Data is often hard to understand
  – a model explaining it might be as complicated as the data itself
  – this has implications for Science
• the game may be the same, but the rules are completely different
  – what used to work needs to be reinvented in a different context


Big Data Challenges (1/3)

• Volume
  – data larger than a single machine (CPU, RAM, disk)
  – infrastructures and techniques that scale by using more machines
  – Google led the way in mastering “cluster data processing”
• Velocity
• Variety


Supercomputers?

• take the top two supercomputers in the world today
  – Tianhe-2 (Guangzhou, China): cost US$390 million
  – Titan (Oak Ridge National Laboratory, US): cost US$97 million
• assume an expected lifetime of five years and compute cost per hour
  – Tianhe-2: US$8,220
  – Titan: US$2,214
• this is just for the machine showing up at the door
  – operational costs (e.g., running, maintenance, power, etc.) are not factored in
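The amortization behind these hourly figures can be sketched as follows; the five-year lifetime comes from the slide, everything else is simple division (the exact Tianhe-2 figure depends on the assumed price and lifetime):

```python
# Amortized hourly hardware cost over an assumed five-year lifetime
# (operational costs such as power and maintenance are ignored).
HOURS_IN_5_YEARS = 5 * 365 * 24   # 43,800 hours

def cost_per_hour(price_usd):
    """Spread the purchase price evenly over the machine's lifetime."""
    return price_usd / HOURS_IN_5_YEARS

titan = cost_per_hour(97e6)       # ~US$2,214 per hour, matching the slide
tianhe2 = cost_per_hour(390e6)    # ~US$8,900 per hour with these assumptions
```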


Let’s rent a supercomputer for an hour!

Amazon Web Services charges US$1.60 per hour for a large instance:
• a cluster of 880 large instances would cost US$1,408 per hour
• data costs US$0.15 per GB to upload
  – assume we want to upload 1 TB
  – this would cost US$153
• the resulting setup would rank #146 in the world’s top-500 machines
• total cost: US$1,561 for the first hour
• search for: LINPACK 880 server
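The slide's arithmetic can be reproduced directly from its per-hour and per-GB prices:

```python
# Reproducing the slide's rental arithmetic.
INSTANCE_USD_PER_HOUR = 1.60
UPLOAD_USD_PER_GB = 0.15

instances = 880
cluster_per_hour = instances * INSTANCE_USD_PER_HOUR  # US$1,408 per hour
upload_1tb = 1024 * UPLOAD_USD_PER_GB                 # ~US$153, a one-off cost
total_first_hour = cluster_per_hour + upload_1tb      # ~US$1,561
```

Note that the upload cost is paid once, not per hour, which is why only the first hour totals US$1,561.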


Supercomputing vs. Cluster Computing

• Supercomputing
  – focus on performance (biggest, fastest), at any cost!
  – oriented towards the [secret] government sector / scientific computing
  – programming effort seems less relevant
  – Fortran + MPI: months to develop and debug programs
  – GPU, i.e. computing with graphics cards
  – FPGA, i.e. casting computation in hardware circuits
  – assumes high-quality, stable hardware
• Cluster Computing
  – use a network of many computers to create a ‘supercomputer’
  – oriented towards business applications
  – use cheap servers (or even desktops), unreliable hardware
  – software must make the unreliable parts reliable
  – focus on economics (bang for the buck)
  – programming effort counts, a lot! No time to lose on debugging.


Cloud Computing vs Cluster Computing

• Cluster Computing
  – solving large tasks with more than one machine
  – parallel database systems (e.g. Teradata, Vertica)
  – NoSQL systems
  – Hadoop / MapReduce
• Cloud Computing
  – machines operated by a third party in large data centers
  – sysadmin, electricity, backup, maintenance externalized
  – rent access by the hour
  – renting machines (Linux boxes): Infrastructure-as-a-Service
  – renting systems (Redshift SQL): Platform-as-a-Service
  – renting a software solution (Salesforce): Software-as-a-Service
• independent concepts, but they are often combined!

Economics of Cloud Computing

• a major argument for Cloud Computing is pricing:
  – we could own our machines
  – ... and pay for electricity, cooling, operators
  – ... and allocate enough capacity to deal with peak demand
  – since machines rarely operate at more than 30% capacity, we are paying for wasted resources
• pay-as-you-go rental model
  – rent machine instances by the hour
  – pay for storage by space/month
  – pay for bandwidth by space/hour
• no other costs
• this makes computing a commodity
  – just like other commodity services (sewage, electricity, etc.)
• some caveats though; we look at them later
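The 30%-utilization argument can be made concrete by comparing cost per *useful* machine-hour. The prices below are illustrative assumptions (only the 30% figure comes from the slide):

```python
# Effective cost per useful machine-hour: owning vs renting.
OWN_USD_PER_HOUR = 0.50    # assumed amortized cost of an owned machine
RENT_USD_PER_HOUR = 1.60   # assumed on-demand rental price
UTILIZATION = 0.30         # machines rarely exceed ~30% utilization

own_effective = OWN_USD_PER_HOUR / UTILIZATION  # idle hours are paid for too
rent_effective = RENT_USD_PER_HOUR              # pay only for hours used
```

With these numbers the owned machine costs about US$1.67 per useful hour, more than the rental price, even though its sticker price is roughly a third of the rental rate.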


Cloud Computing: Provisioning

We can quickly scale resources as demand dictates:
• high demand: more instances
• low demand: fewer instances

Elastic provisioning is crucial. Remember your panic when Facebook was down?

Example: Target (US retailer) uses Amazon Web Services (AWS) to host [Link]; during massive spikes (November 28, 2009, “Black Friday”) [Link] is unavailable.

[Figure: demand over time against a fixed provisioning level, showing periods of underprovisioning and overprovisioning]
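The provisioning trade-off can be sketched numerically with a toy 24-hour demand curve; the demand values and the per-instance price below are illustrative assumptions, not from the slide:

```python
# Fixed provisioning at peak vs elastic (hour-by-hour) provisioning.
USD_PER_INSTANCE_HOUR = 1.60

demand = [10] * 20 + [100] * 4  # a quiet day with a 4-hour spike (instances needed)

# Fixed: provision for the peak all day, to never be underprovisioned.
fixed_cost = max(demand) * len(demand) * USD_PER_INSTANCE_HOUR

# Elastic: rent exactly what each hour needs.
elastic_cost = sum(demand) * USD_PER_INSTANCE_HOUR
```

For this spiky demand curve, elastic provisioning costs a quarter of peak provisioning while serving every request.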


Cloud Computing: some rough edges

• some provider hosts our data
  – but we can only access it using proprietary (non-standard) APIs
  – lock-in makes customers vulnerable to price increases and dependent upon the provider
  – local laws (e.g. privacy) might prohibit externalizing data processing
• providers may control our data in unexpected ways:
  – July 2009: Amazon remotely removed books from Kindles
  – Twitter prevents exporting tweets more than 3,200 posts back
  – Facebook locks user data in
  – paying customers forced off Picasa towards Google Plus
• anti-terror laws mean that providers have to grant access to governments
  – this privilege can be overused


Privacy and Security

• people will not use Cloud Computing if trust is eroded
  – who can access it?
  – governments? other people?
  – Snowden is the Chernobyl of Big Data
  – privacy guarantees need to be clearly stated and kept to
• privacy breaches
  – numerous examples of Web mail accounts hacked
  – many, many cases of (UK) governmental data loss
  – TJX Companies Inc. (2007): 45 million credit and debit card numbers stolen
  – every day there seems to be another instance of private data being leaked to the public


High performance and low latency

• how quickly data moves around the network
  – total system latency is a function of memory, CPU, disk and network
  – CPU speed is often only a minor aspect
• examples
  – algorithmic trading (put the data centre near the exchange); whoever can execute a trade the fastest wins
  – simulations of physical systems
  – search results
  – Google 2006: increasing page load time by 0.5 seconds produces a 20% drop in traffic
  – Amazon 2007: for every 100 ms increase in load time, sales decrease by 1%
  – Google’s web search rewards pages that load quickly
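The two rules of thumb above can be written as simple estimators. Treating them as linear is an assumption; the slide only gives one data point for each:

```python
# Linear extrapolation of the slide's latency rules of thumb (an assumption).
def traffic_drop_google(extra_seconds):
    """Google 2006: +0.5 s page load time -> 20% traffic drop."""
    return 0.20 * (extra_seconds / 0.5)

def sales_drop_amazon(extra_ms):
    """Amazon 2007: +100 ms load time -> 1% sales drop."""
    return 0.01 * (extra_ms / 100)
```

Under this linear reading, an extra half-second costs Amazon-style sites around 5% of sales.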


Big Data Challenges (2/3)

• Volume
• Velocity
  – endless stream of new events
  – no time for heavy indexing (new data arrives continuously)
  – led to the development of data stream technologies
• Variety


Big Streaming Data

• storing it is not really a problem: disk space is cheap
• efficiently accessing it and deriving results can be hard
• visualising it can be next to impossible
• repeated observations
  – what makes Big Data big are repeated observations
  – mobile phones report their locations every 15 seconds
  – people post more than 100 million tweets a day
  – the Web changes every day
  – potentially we need unbounded resources
  – repeated observations motivate streaming algorithms
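A minimal example of the streaming idea: a running mean that folds in each observation and then discards it, using constant memory no matter how many events arrive. This is a generic illustration, not a technique from the slides:

```python
# A streaming statistic in O(1) memory: the mean of an unbounded stream.
class RunningMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        """Fold one observation into the mean without storing it."""
        self.n += 1
        self.mean += (x - self.mean) / self.n

# A stand-in for a million incoming events.
rm = RunningMean()
for x in (float(i % 7) for i in range(1_000_000)):
    rm.update(x)
```

After a million observations the object still holds just two numbers, which is exactly why repeated observations call for streaming algorithms.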


Big Data Challenges (3/3)

• Volume
• Velocity
• Variety
  – dirty, incomplete, inconclusive data (e.g. text in tweets)
  – semantic complications:
    – AI techniques needed, not just database queries
    – data mining, data cleaning, text analysis
    – techniques from other DEA lectures should be used in Big Data
  – technical complications:
    – skewed value distributions and “power laws”
    – complex graph structures, expensive random access
    – complicates cluster data processing (difficult to partition equally)
    – localizing data by attaching pieces where you need them makes Big Data even bigger


Power laws

• Big Data typically obeys a power law
• modelling the head is easy, but may not be representative of the full population
  – dealing with the full population might imply Big Data (e.g., selling all books, not just blockbusters)
• processing Big Data might reveal power laws
  – most items take a small amount of time to process
  – a few items take a lot of time to process
• understanding the nature of data is key
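A quick way to see head vs tail is a Zipf-like distribution, where item k has weight 1/k. The 1/k weighting is a standard model, used here as an illustrative assumption:

```python
# Zipf-like popularity: item k has weight 1/k.
N = 100_000
weights = [1 / k for k in range(1, N + 1)]
total = sum(weights)

# Share of total "mass" held by the top 1% of items (the head).
head = sum(weights[: N // 100]) / total
```

Here the top 1% of items carry roughly 60% of the total mass, so a model fitted to the head alone describes most of the volume but almost none of the items.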



Skewed Data

• distributed computation is a natural way to tackle Big Data
  – MapReduce encourages sequential, disk-based, localised processing of data
  – MapReduce operates over a cluster of machines
• one consequence of power laws is uneven allocation of data to nodes
  – the head might go to one or two nodes
  – the tail would spread over all other nodes
  – all workers on the tail would finish quickly
  – the head workers would be a lot slower
• power laws can turn parallel algorithms into sequential algorithms
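This effect can be simulated: partition Zipf-skewed work across workers by key and compare the busiest worker with the average. The workload shape and the `k % WORKERS` stand-in hash are illustrative assumptions:

```python
import collections

# Partitioning skewed work across 10 workers by key.
WORKERS = 10

work = []  # one entry per unit of work, labelled with its item key
for k in range(1, 1001):
    work += [k] * (10_000 // k)  # item k needs ~10000/k units (Zipf-like head)

# k % WORKERS stands in for a hash partitioner.
load = collections.Counter(k % WORKERS for k in work)
slowest = max(load.values())          # the worker holding the head key
average = len(work) / WORKERS         # the ideal balanced load
```

The worker that receives the head key ends up with far more than the average load, so the whole job waits on it, which is exactly how a power law erodes parallel speed-up.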


Summary

Introduced the notion of Big Data and the three V’s.

Explained super-, cluster-, and cloud computing.

We will come back to these topics in the lecture, but we will start simple:
• given a complex data set, what should you do to analyze it?
• start with simple approaches, then become more and more complex
• we finish with cloud-scale computing, though it is not always appropriate
• Big Data is not the same for everybody


Notes on the Technical Side

We will use a lot of tools during this lecture:
• we concentrate on free and/or open-source tools
• in general, they are available for all major platforms
• we strongly suggest using a Linux system, though
• ideally a recent Ubuntu/Debian system
• other systems should work, too, but you are on your own
• using a Virtual Machine is OK and might be easier than a native Linux system

