Foundations of Data Engineering
Thomas Neumann
Introduction
About this Lecture
The goal of this lecture is to teach the standard tools and techniques for
large-scale data processing.
Related keywords include:
• Big Data
• cloud computing
• scalable data processing
• ...
We start with an overview, and then dive into individual topics.
Goals and Scope
Note that this lecture emphasizes practical usage (after the introduction):
• we cover many different approaches and techniques
• but all of them will be used in a practical manner, both in the exercises
and in the lecture
We cover both concepts and usage:
• what software layers are used to handle Big Data?
• what are the principles behind this software?
• which kind of software would one use for which data problem?
• how do I use the software for a concrete problem?
Some Pointers to Literature
These are not required for the course, but may be useful for reference:
• The Datacenter as a Computer: An Introduction to the Design of
Warehouse-Scale Machines
• Hadoop: The Definitive Guide
• Big Data Processing with Apache Spark
• Big Data Infrastructure course by Peter Boncz
The Age of Big Data
• 1,527,877 GB per minute ≈ 1,500 TB per minute ≈ 1,000 disk drives per minute (a 20 m stack every minute)
• 4 zettabytes ≈ 3 billion drives
“Big Data”
The Data Economy
Data Disrupting Science
Scientific paradigms:
• Observing
• Modeling
• Simulating
• Collecting and Analyzing Data
Big Data
Big Data is a relative term
• if things are breaking, you have Big Data
  – Big Data is not always petabytes in size
  – Big Data for informatics is not the same as for Google
• Big Data is often hard to understand
  – a model explaining it might be as complicated as the data itself
  – this has implications for science
• the game may be the same, but the rules are completely different
  – what used to work needs to be reinvented in a different context
Big Data Challenges (1/3)
• Volume
  – data larger than a single machine (CPU, RAM, disk)
  – infrastructures and techniques that scale by using more machines
  – Google led the way in mastering “cluster data processing”
• Velocity
• Variety
Supercomputers?
• take the top two supercomputers in the world today
  – Tianhe-2 (Guangzhou, China): cost US$390 million
  – Titan (Oak Ridge National Laboratory, US): cost US$97 million
• assume an expected lifetime of five years and compute the cost per hour
  – Tianhe-2: US$8,220
  – Titan: US$2,214
• this is just for the machine showing up at the door
  – operational costs (e.g. running, maintenance, power) are not factored in
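The per-hour figures follow from amortizing the purchase price over the machine's lifetime; a minimal sketch (Titan's quoted figure reproduces exactly; rounding and lifetime assumptions can shift the result slightly):

```python
# Amortize a machine's purchase price over its expected lifetime.
def cost_per_hour(price_usd, lifetime_years=5):
    hours = lifetime_years * 365 * 24   # 43,800 hours in five years
    return price_usd / hours

print(f"Titan: US${int(cost_per_hour(97_000_000)):,} per hour")  # US$2,214
```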
Let’s rent a supercomputer for an hour!
Amazon Web Services charges US$1.60 per hour for a large instance
• an 880-instance cluster would cost US$1,408 per hour
• data costs US$0.15 per GB to upload
  – assume we want to upload 1 TB
  – this would cost US$153
• the resulting setup would rank #146 among the world’s top-500 machines
• total cost: US$1,561 per hour
• search for: LINPACK 880 server
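These numbers are simple arithmetic (using 1 TB = 1,024 GB, which matches the US$153 figure):

```python
# Hourly cost of an 880-instance cluster plus a 1 TB upload,
# at the prices quoted above.
compute = 880 * 1.60    # US$1,408 per hour
upload  = 1024 * 0.15   # US$153.60 for 1 TB at US$0.15/GB
print(f"total: US${int(compute + upload):,}")  # US$1,561
```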
Supercomputing vs. Cluster Computing
• Supercomputing
  – focus on performance (biggest, fastest), at any cost!
  – oriented towards the [secret] government sector / scientific computing
  – programming effort seems less relevant
  – Fortran + MPI: months to develop and debug programs
  – GPU, i.e. computing with graphics cards
  – FPGA, i.e. casting computation in hardware circuits
  – assumes high-quality, stable hardware
• Cluster Computing
  – use a network of many computers to create a ‘supercomputer’
  – oriented towards business applications
  – uses cheap servers (or even desktops), unreliable hardware
  – software must make the unreliable parts reliable
  – focus on economics (bang for the buck)
  – programming effort counts, a lot! No time to lose on debugging..
Cloud Computing vs Cluster Computing
• Cluster Computing
  – solving large tasks with more than one machine
  – parallel database systems (e.g. Teradata, Vertica)
  – NoSQL systems
  – Hadoop / MapReduce
• Cloud Computing
  – machines operated by a third party in large data centers
  – sysadmin, electricity, backup, maintenance externalized
  – rent access by the hour
  – renting machines (Linux boxes): Infrastructure-as-a-Service
  – renting systems (Redshift SQL): Platform-as-a-Service
  – renting a software solution (Salesforce): Software-as-a-Service
• independent concepts, but they are often combined!
Economics of Cloud Computing
• a major argument for Cloud Computing is pricing:
  – we could own our machines
  – ... and pay for electricity, cooling, operators
  – ... and allocate enough capacity to deal with peak demand
  – since machines rarely operate at more than 30% capacity, we are paying for wasted resources
• pay-as-you-go rental model
  – rent machine instances by the hour
  – pay for storage by space/month
  – pay for bandwidth by volume transferred
• no other costs
• this makes computing a commodity
  – just like other commodity services (sewage, electricity, etc.)
• some caveats though; we look at them later
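To make the 30%-utilization argument concrete, here is a toy owning-vs-renting comparison; all prices are hypothetical illustrations, not actual provider rates:

```python
# Toy owning-vs-renting comparison (made-up prices).
own_rate    = 0.50   # US$/h, amortized purchase + power + ops, paid 24/7
rent_rate   = 1.60   # US$/h, rented instance, paid only while in use
utilization = 0.30   # machines rarely run above 30% capacity

hours_per_year = 24 * 365
owning  = own_rate * hours_per_year                  # pay even when idle
renting = rent_rate * hours_per_year * utilization   # pay only for use
print(f"owning:  US${owning:,.0f}/year")   # US$4,380/year
print(f"renting: US${renting:,.0f}/year")  # US$4,205/year
```

With these numbers the break-even point is at 0.50/1.60 ≈ 31% utilization: below it renting wins even at more than three times the hourly price, above it owning becomes cheaper again.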
Cloud Computing: Provisioning
We can quickly scale resources as demand dictates
• high demand: more instances
• low demand: fewer instances
Elastic provisioning is crucial
• Target (US retailer) uses Amazon Web Services (AWS) to host [Link]
• during massive spikes (November 28, 2009, “Black Friday”) [Link] is unavailable
• remember your panic when Facebook was down?
[figure: demand over time vs. provisioned capacity, showing under- and overprovisioning]
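A minimal sketch of elastic vs. fixed provisioning, with hypothetical demand figures: fixed provisioning must be sized for the peak, while elastic provisioning tracks demand period by period:

```python
# Hypothetical request rates, with one Black-Friday-style spike.
demand = [10, 12, 15, 90, 14, 11, 10]
capacity_per_instance = 5   # requests one instance can serve

fixed_fleet = max(demand) // capacity_per_instance + 1        # sized for peak
elastic     = [d // capacity_per_instance + 1 for d in demand]

print("fixed instance-hours:  ", fixed_fleet * len(demand))   # 133
print("elastic instance-hours:", sum(elastic))                # 38
```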
Cloud Computing: some rough edges
• some provider hosts our data
  – but we can only access it using proprietary (non-standard) APIs
  – lock-in makes customers vulnerable to price increases and dependent upon the provider
  – local laws (e.g. privacy) might prohibit externalizing data processing
• providers may control our data in unexpected ways:
  – July 2009: Amazon remotely removed books from Kindles
  – Twitter prevents exporting tweets more than 3,200 posts back
  – Facebook locks user data in
  – paying customers forced off Picasa towards Google Plus
• anti-terror laws mean that providers have to grant access to governments
  – this privilege can be overused
Privacy and Security
• people will not use Cloud Computing if trust is eroded
  – who can access it? governments? other people?
  – Snowden is the Chernobyl of Big Data
  – privacy guarantees need to be clearly stated and kept to
• privacy breaches
  – numerous examples of Web mail accounts hacked
  – many, many cases of (UK) governmental data loss
  – TJX Companies Inc. (2007): 45 million credit and debit card numbers stolen
  – every day there seems to be another instance of private data being leaked to the public
High performance and low latency
• how quickly data moves around the network
  – total system latency is a function of memory, CPU, disk, and network
  – CPU speed is often only a minor aspect
• examples
  – algorithmic trading (put the data centre near the exchange); whoever can execute a trade the fastest wins
  – simulations of physical systems
  – search results
  – Google 2006: increasing page load time by 0.5 seconds produces a 20% drop in traffic
  – Amazon 2007: for every 100 ms increase in load time, sales decrease by 1%
  – Google’s web search rewards pages that load quickly
Big Data Challenges (2/3)
• Volume
• Velocity
  – endless stream of new events
  – no time for heavy indexing (new data arrives continuously)
  – led to the development of data stream technologies
• Variety
Big Streaming Data
• storing it is not really a problem: disk space is cheap
• efficiently accessing it and deriving results can be hard
• visualising it can be next to impossible
• repeated observations
  – what makes Big Data big are repeated observations
  – mobile phones report their locations every 15 seconds
  – people post more than 100 million tweets a day
  – the Web changes every day
  – potentially we need unbounded resources
  – repeated observations motivate streaming algorithms
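One classic streaming technique motivated here is reservoir sampling: keep a uniform random sample of k items from a stream of unknown, potentially unbounded length, using only O(k) memory. A minimal sketch:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Uniform random sample of k items from a stream, in O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # replacement becomes rarer over time
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=5))
```

The stream is consumed once and never stored; only the k-item reservoir lives in memory, which is exactly what repeated, unbounded observations demand.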
Big Data Challenges (3/3)
• Volume
• Velocity
• Variety
  – dirty, incomplete, inconclusive data (e.g. text in tweets)
  – semantic complications:
    · AI techniques needed, not just database queries
    · data mining, data cleaning, text analysis
    · techniques from other DEA lectures should be used in Big Data
  – technical complications:
    · skewed value distributions and “power laws”
    · complex graph structures, expensive random access
    · complicates cluster data processing (difficult to partition equally)
    · localizing data by attaching pieces where you need them makes Big Data even bigger
Power laws
• Big Data typically obeys a power law
• modelling the head is easy, but may not be representative of the full population
  – dealing with the full population might imply Big Data (e.g., selling all books, not just blockbusters)
• processing Big Data might reveal power laws
  – most items take a small amount of time to process
  – a few items take a lot of time to process
• understanding the nature of data is key
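A quick sketch of how top-heavy a Zipf-distributed population is (frequencies proportional to 1/rank, a standard idealization of a power law):

```python
# Zipf's law with exponent 1: the k-th most popular item has weight 1/k.
n_items = 10_000
freq = [1.0 / rank for rank in range(1, n_items + 1)]
total = sum(freq)

head_share = sum(freq[:100]) / total   # the 100 best-sellers
print(f"top 100 of {n_items:,} items hold {head_share:.0%} of all sales")  # ≈ 53%
```

Modelling just those 100 head items is easy, but the other 9,900 still carry roughly half the volume, which is exactly the tail a full-population model must cover.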
Skewed Data
• distributed computation is a natural way to tackle Big Data
  – MapReduce encourages sequential, disk-based, localised processing of data
  – MapReduce operates over a cluster of machines
• one consequence of power laws is uneven allocation of data to nodes
  – the head might go to one or two nodes
  – the tail would spread over all other nodes
  – all workers on the tail would finish quickly
  – the head workers would be a lot slower
• power laws can turn parallel algorithms into sequential algorithms
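A small sketch of that effect: hash-partition a Zipf-skewed key stream over four workers; whichever worker owns the hottest key receives a disproportionate share of all records (keys and counts are synthetic):

```python
from collections import Counter
from zlib import crc32

# Zipf-ish synthetic stream: key 1 occurs 10,000 times, key k 10,000//k times.
records = []
for rank in range(1, 101):
    records += [f"key{rank}"] * (10_000 // rank)

# Hash partitioning sends every copy of a key to the same worker.
load = Counter(crc32(key.encode()) % 4 for key in records)
print("records per worker:", sorted(load.values(), reverse=True))
```

Because all 10,000 copies of the hottest key land on one worker, that worker alone dominates the runtime while the tail workers finish early and idle.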
Summary
Introduced the notion of Big Data and the three V’s
Explained Super-/Cluster-/Cloud Computing
We will come back to these in the lecture, but we will start simple
• given a complex data set, what should you do to analyze it?
• start with simple approaches, then become more and more complex
• we finish with cloud-scale computing, which is not always appropriate
• Big Data is not the same for everybody
Notes on the Technical Side
We will use a lot of tools during this lecture
• we concentrate on free and/or open-source tools
• they are generally available for all major platforms
• we strongly suggest using a Linux system, though
• ideally a recent Ubuntu/Debian system
• other systems should work too, but you are on your own
• using a virtual machine is OK, and might be easier than a native Linux system