Big Data Platform Elements - Part 1
CIS 415 Lecture 3
Hina Arora
Announcements
We have a Grader!
o Anirudh Dhawan ([email protected])
o Office Hours: Thur 10am-12pm; BA Suite 318
Show of hands: did you complete last week's required readings?
o Contents of Lecture-1 Deck and any supplemental notes you took in class
o Review: Vocabulary Section in Lecture-1 Deck
o Review: List of common data applications here
o Watch: The Beauty of Data Visualization
Big Data Platform Elements
[Figure: Big Data Platforms and their four elements: Virtualization, Cloud Computing, Parallel Programming, and Map Reduce]
What will we cover today?
Virtualization
Cloud Computing
Virtualization
What is Virtualization?
Virtualization means that Applications can use a resource without
any concern for where it resides, what the technical interface is,
how it has been implemented, which platform it uses, and how
much of it is available
~Rick F. Van der Lans in Data Virtualization for Business Intelligence Systems
We'll look at a few different types of virtualization:
o Server Virtualization (can be HW-level or OS-level virtualization)
o Storage Virtualization
o Network Virtualization
o Desktop Virtualization
o Application Virtualization
Server Virtualization: HW-level virtualization
Ability to run multiple Virtual Machines (VMs or guests) on a single Physical
Machine (host).
Each Virtual Machine emulates the underlying physical hardware and has an
Operating System (OS).
Guest VMs are almost completely isolated from each other.
Each guest VM can run a different OS.
Hypervisors (or Virtual Machine Monitors or VMMs) are used to create and run
VMs. There are two types of hypervisors:
o Type-1, Native or Bare-metal Hypervisors:
Run directly on the host's hardware.
Example: Hyper-V Hypervisor.
[Figure: Type-1 stack, top to bottom: VMs (App, Bins/Libs, Guest OS) / Hypervisor Type-1 / Server]
o Type-2 or Hosted Hypervisors:
Run on the host's OS.
Example: VMware Player, VirtualBox
[Figure: Type-2 stack, top to bottom: VMs (App, Bins/Libs, Guest OS) / Hypervisor Type-2 / Host OS / Server]
Reference: https://en.wikipedia.org/wiki/Hypervisor
Server Virtualization provides improved utilization and scalability.
Server Virtualization: OS-level virtualization
Ability to run multiple isolated Containers (user-space instances or guests)
on a single Physical Machine (host).
Containers do not emulate the underlying HW and don't have their own OS
(they share the host OS). This lighter footprint allows hosts to support a
higher density of guest Containers (compared with guest VMs). On the flip
side, sharing the host OS raises security concerns.
Containers can also share binaries and libraries with other Containers.
Each Container typically runs a single Application.
Example: Docker
[Figure: Container stack, top to bottom: Containers (App, Bins/Libs) / Docker / Host OS / Server]
Reference: https://en.wikipedia.org/wiki/Operating-system-level_virtualization
Review: Storage Definitions
Block
o A sequence of bytes.
o Storage systems typically provide access to blocks.
o The OS typically abstracts other logical views like files and records.
Striping
o Sequential blocks of data are stored on different physical storage devices in (typically) round-robin fashion.
o Example: Disk1 <A, C, E>; Disk2 <B, D, F>
o Striping is useful when data is requested faster than a single storage device can deliver it. Striping data across multiple storage devices
allows for concurrent access to data, thereby improving performance.
Mirroring
o Replication of data onto separate disks in real time.
o Example: Disk1 <A, B, C>; Disk2 <A, B, C>
o Improves data redundancy and reliability.
Parity
o Data on a crashed disk can be reconstructed from the data on the other disks (using the XOR operation).
o Example: Disk1 <A:11010011>; Disk2 <B:10011001>; Disk3 <PAB:01001010>
Essentially, PAB = A XOR B, so if any one disk crashes, you can reconstruct its contents by XORing the other two.
o Improves data redundancy
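The parity example above can be checked with a few lines of Python. This is a minimal sketch that treats each disk block as a single byte; the function name `xor_parity` is made up for illustration.

```python
def xor_parity(*blocks: int) -> int:
    """Return the XOR of all given blocks (each block is one byte here)."""
    parity = 0
    for block in blocks:
        parity ^= block
    return parity

# Values taken from the slide example.
A = 0b11010011
B = 0b10011001

P_AB = xor_parity(A, B)
print(format(P_AB, "08b"))  # 01001010

# If Disk1 crashes, A is rebuilt from the two surviving disks.
recovered_A = xor_parity(B, P_AB)
# If Disk2 crashes, B is rebuilt the same way.
recovered_B = xor_parity(A, P_AB)
print(recovered_A == A, recovered_B == B)  # True True
```

The same idea extends to more than two data disks: the parity block is the XOR of all data blocks, and any single missing block is the XOR of everything that survived.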
File System:
o Controls how data is managed, stored and retrieved.
o Without a file system, we would just have a large blob of data with no way to identify different connected pieces of information.
o File systems are organized around groups of data called files, and groups of files called directories or folders.
o Distributed file systems are file systems that are spread across multiple servers.
Reference: Wikipedia
Storage Virtualization
Data is abstracted into what appears to be a single storage unit, while the physical
storage actually spans multiple heterogeneous devices and often locations
Storage Virtualization provides location independence, improved utilization,
performance, reliability and availability
Example: RAID (redundant array of independent/inexpensive disks)
Popular RAID Types
Building blocks: Striping (provides excellent performance), Mirroring (provides excellent redundancy), Parity (provides good redundancy)
RAID 0
o Striping: Yes | Mirroring: No | Parity: No
o Minimum Number of Disks: 2
o Example (Disk Blocks): Disk 1 <A, C, E>; Disk 2 <B, D, F>
o Comments: Excellent Performance. No Redundancy. Do not use for critical applications.
RAID 1
o Striping: No | Mirroring: Yes | Parity: No
o Minimum Number of Disks: 2
o Example (Disk Blocks): Disk 1 <A, B, C>; Disk 2 <A, B, C>
o Comments: Good Performance. Excellent Redundancy.
RAID 5
o Striping: Yes | Mirroring: No | Parity: Yes (Distributed Parity)
o Minimum Number of Disks: 3
o Example (Disk Blocks): Disk 1 <A, C, PEF>; Disk 2 <B, PCD, E>; Disk 3 <PAB, D, F>
o Comments: Good Performance. Good Redundancy. Most cost effective. Fast Reads; Slow Writes.
RAID 10
o Striping: Yes | Mirroring: Yes | Parity: No
o Minimum Number of Disks: 4
o Example (Disk Blocks): Disk 1 <A, C, E>; Disk 2 <A, C, E>; Disk 3 <B, D, F>; Disk 4 <B, D, F>
o Comments: Excellent Performance. Excellent Redundancy. Great for mission critical applications.
Reference: https://en.wikipedia.org/wiki/RAID
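To make the block layouts concrete, here is a small sketch that places the blocks A..F on disks for each RAID type. It only models where blocks go, not real disk I/O; the function names are hypothetical, and the rotating parity position in `raid5` is one common convention chosen here so the output lines up with the example layout in this deck.

```python
def raid0(blocks, n_disks=2):
    """Striping: block i goes to disk i % n_disks (round-robin)."""
    disks = [[] for _ in range(n_disks)]
    for i, block in enumerate(blocks):
        disks[i % n_disks].append(block)
    return disks

def raid1(blocks, n_disks=2):
    """Mirroring: every disk holds a full copy of the data."""
    return [list(blocks) for _ in range(n_disks)]

def raid5(blocks, n_disks=3):
    """Striping with distributed parity.

    Each stripe holds n_disks - 1 data blocks plus one parity label,
    and the parity position moves to a different disk on each stripe.
    """
    disks = [[] for _ in range(n_disks)]
    per_stripe = n_disks - 1
    for start in range(0, len(blocks), per_stripe):
        stripe = blocks[start:start + per_stripe]
        stripe_no = start // per_stripe
        parity_disk = (n_disks - 1 - stripe_no) % n_disks
        data = iter(stripe)
        for d in range(n_disks):
            if d == parity_disk:
                disks[d].append("P" + "".join(stripe))
            else:
                block = next(data, None)
                if block is not None:
                    disks[d].append(block)
    return disks

def raid10(blocks, n_pairs=2):
    """Stripe across mirrored pairs: each stripe set is stored twice."""
    stripes = raid0(blocks, n_pairs)
    return [list(s) for s in stripes for _ in range(2)]

blocks = list("ABCDEF")
print(raid0(blocks))   # [['A', 'C', 'E'], ['B', 'D', 'F']]
print(raid5(blocks))   # [['A', 'C', 'PEF'], ['B', 'PCD', 'E'], ['PAB', 'D', 'F']]
print(raid10(blocks))  # [['A', 'C', 'E'], ['A', 'C', 'E'], ['B', 'D', 'F'], ['B', 'D', 'F']]
```

Note how RAID 5 never puts a stripe's parity on the same disk as that stripe's data, which is what lets any single disk be rebuilt from the others.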
Review: Network Definitions
Local Area Network (LAN):
o A computer network with interconnected devices within a limited geographical area such as a house or building.
Wide Area Network (WAN):
o A computer network that spans large geographical areas
IP Address
o Address of a device participating in a network
o IPv4: 32 bits | IPv6: 128 bits
o Example: 11000000.10101000.00000101.10000010 (192.168.5.130)
o Higher order bits determine network (indicated by subnet mask), and lower order bits determine host (device)
Subnetting:
o Dividing a network into smaller parts
o This affects the total number of hosts that can be addressed
Switch:
o Connects devices together on a computer network
Router
o Carries traffic from one network/subnet to the other
o Routers maintain routing tables to determine whether traffic is meant for this LAN, a connected LAN or a different network.
o Example: the home router connects home computers to the internet (these are similar networks since they both share the TCP/IP protocol)
Reference: Wikipedia
Image Source: http://netprivateer.com/lanwan.h
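The IP address and subnetting definitions above can be explored with Python's standard `ipaddress` module. This is a sketch only: the address is the one from the slide, but the /24 subnet mask is an assumption made for illustration, since the deck does not state one.

```python
import ipaddress

# Address from the slide, shown in dotted-binary form.
addr = ipaddress.ip_address("192.168.5.130")
bits = format(int(addr), "032b")
dotted_binary = ".".join(bits[i:i + 8] for i in range(0, 32, 8))
print(dotted_binary)  # 11000000.10101000.00000101.10000010

# ASSUMPTION: a /24 subnet mask (255.255.255.0), not given in the deck.
iface = ipaddress.ip_interface("192.168.5.130/24")
network = iface.network
print(network)          # 192.168.5.0/24 (higher order bits: the network)
print(network.netmask)  # 255.255.255.0

# Lower order bits identify the host within that network.
host_part = int(iface.ip) & int(network.hostmask)
print(host_part)        # 130

# Usable hosts: all addresses minus the network and broadcast addresses.
usable_hosts = network.num_addresses - 2
print(usable_hosts)     # 254

# Subnetting: split the /24 into two /25 networks.
subnets = list(network.subnets(prefixlen_diff=1))
print(subnets[0], subnets[1])           # 192.168.5.0/25 192.168.5.128/25
print(subnets[0].num_addresses - 2)     # 126 usable hosts in each half
print(addr in subnets[1])               # True
```

Splitting the network in two borrows one host bit for the network part, so each smaller network can address fewer hosts (126 instead of 254).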
Network Virtualization
Creation of logical, virtual networks that are decoupled from the (limitations of) underlying
physical hardware.
Example: VLAN, VPN
o Virtual Local Area Network (VLAN)
Allows for grouping of hosts within a virtual LAN regardless of geographical location
Provides scalability, flexibility, simplified administration, and security
Image Source: link
o Virtual Private Network (VPN)
Securely extends a private network over a public network such as the internet
Users can remotely communicate with the private network as though they were
directly connected to it with the same functionality, security and administrative
policies
Provides flexibility, simplified administration, and security
Image Source: https://en.wikipedia.org/wiki/Virtual_private_ne
(Remote) Desktop Virtualization
Enables access to applications on a remote OS using a virtual desktop.
The remote OS carries the application and data, and only the display, keyboard, and mouse
information are communicated with the local client device.
Users (on the local client devices) must establish a session and be connected with the
remote server to access the application.
Makes installation, upgrades and management of applications easier for IT.
Two kinds: RDS, VDI
Remote Desktop Services (RDS) aka Terminal Services
o Provides remote desktop to multiple users on a Host OS
o Provides users session-based isolation (session virtualization) - users share Host OS
o Users have no admin privileges on the host OS
o Can support higher user density
Virtual Desktop Infrastructure (VDI)
o Provides remote desktop to multiple users on Guest OSs
o Provides users VM-based isolation - each user gets a dedicated Guest OS
o Users have admin privileges on the Guest OS
o Supports lower user density
Application Virtualization
Application Virtualization separates the Application from the OS, so Applications can
be more easily deployed and delivered.
The application is packaged and streamed from the server down the network to the
client and, instead of being installed on the client device, is executed on the local
device in a virtual bubble that is completely isolated from the client OS.
Applications are streamed intelligently.
o Only required parts are streamed as and when they are used.
o Once the application has been streamed, it is cached on the client device so it doesn't have
to be streamed every time a user uses it on the client. This also means the application can
be used even when the client is not connected to the server.
o When an application upgrade is available, the server copy is upgraded, and the upgrades are
streamed down to the clients the next time the application is used on the client.
Makes installation, upgrades and management of applications easier for IT.
Examples: VMware ThinApp, Citrix XenApp and Microsoft App-V
Reference: http://blogs.msdn.com/b/ianm/archive/2010/06/11/microsoft-virtual-desktop-101-making-sense-of-vdi-rds-app-v-med-v-and-
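The streaming behavior described on this slide (stream parts on demand, cache them, work offline, pick up upgrades on next use) can be sketched as a toy model. Everything below is hypothetical: the classes `AppServer` and `StreamingClient` are invented for illustration and do not reflect how ThinApp, XenApp or App-V are actually implemented. Dropping the whole cache on upgrade is a simplification.

```python
class AppServer:
    """Hypothetical server holding a packaged application as named parts."""

    def __init__(self, version, parts):
        self.version = version
        self.parts = dict(parts)

    def fetch(self, name):
        return self.parts[name]


class StreamingClient:
    """Hypothetical client that streams parts on demand and caches them."""

    def __init__(self):
        self.cache = {}
        self.cached_version = None
        self.fetch_count = 0

    def use(self, part, server=None):
        # server=None models a client that is not connected to the server.
        if server is not None and server.version != self.cached_version:
            self.cache.clear()  # upgrade available: discard stale parts
            self.cached_version = server.version
        if part not in self.cache:
            if server is None:
                raise LookupError(f"{part!r} is not cached and the client is offline")
            self.cache[part] = server.fetch(part)  # stream only what is needed
            self.fetch_count += 1
        return self.cache[part]


server = AppServer("1.0", {"editor": "editor-v1", "spellcheck": "spell-v1"})
client = StreamingClient()

first = client.use("editor", server)    # streamed from the server
second = client.use("editor", server)   # served from the local cache
offline = client.use("editor")          # still works while disconnected

server = AppServer("2.0", {"editor": "editor-v2", "spellcheck": "spell-v2"})
upgraded = client.use("editor", server)  # upgrade streamed on next use
print(first, offline, upgraded, client.fetch_count)  # editor-v1 editor-v1 editor-v2 2
```

Note that "spellcheck" is never streamed in this run because it is never used, which mirrors the "only required parts" point above.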
Cloud Computing
Have you used Applications Hosted on the Cloud?
What are some characteristics these applications have in common*?
You typically sign up for service (free with ads, free trial, or subscription)
You connect to the internet for access
You don't need to install application software, and version upgrades are
pushed seamlessly
You expect reliable, on-demand, self-service of the application
You expect the ability to upgrade instantaneously (e.g., more storage, no ads,
etc.)
You rely on the service provider for infrastructure (e.g., you don't set up a mail
server)
You rely on the service provider for security and privacy
You rely on the service provider for backup and recovery
*Note: a lot of these services come with client apps; we are not considering
that scenario here.
What is Cloud Computing?
Cloud computing is a model for enabling convenient, on-demand network
access to a shared pool of configurable computing resources (e.g., networks,
servers, storage, applications, and services) that can be rapidly provisioned
and released with minimal management effort or service provider interaction.
Key enabling technologies include: (1) fast wide-area networks, (2) powerful,
inexpensive server computers, and (3) high-performance virtualization for
commodity hardware.
Source: http://www.nist.gov/itl/cloud/
http://www.intel.com/content/www/us/en/cloud-computing/cloud-101-video.html
Deployment Models
There are 3 basic deployment models in cloud computing:
Private Cloud
o Two kinds of private clouds:
On-Prem Private Cloud: On-Prem Data Center + Network Virtualization + Cloud Orchestration Software
Externally Hosted Private Cloud (also called Virtual Private Cloud): Logically isolated, user-defined, and user-controlled portion of a 3rd party hosted cloud (like AWS or Microsoft).
o Provides high degree of Control
o Good for highly-sensitive data and applications
Public Cloud
o Third-Party Provides Cloud Services (3 different service models - IaaS, PaaS, or SaaS)
o Typically pay-as-you-go model (you pay for what you use)
o Service Provider held to agreed upon availability, reliability, privacy and security standards
o Provides high degree of Scalability
o Example: Amazon AWS, Microsoft Azure, Google Cloud
Hybrid Cloud
o Combination of Private and Public Cloud
o Allows you to pick desired level of Control vs Scalability
Service Models
There are 4 basic service models in cloud computing, based on what
parts of the stack the User controls vs what the Cloud Provider
manages.
Private: User controls everything from the networking
to the applications. Example: user's on-premises
datacenter.
IaaS: User controls the application down to the
underlying OS, and the Cloud Provider manages the
virtualization layer and the hardware. Example: getting
a virtual server in the cloud.
PaaS: User controls application and data, and the Cloud
Provider provisions the underlying supporting
infrastructure, typically including operating system,
programming-language execution environment,
database, and web servers. This allows developers to
focus on application development instead of worrying
about underlying hardware and software layers.
SaaS: User gains access to application software and
databases. Cloud providers install and operate
application software, and manage the infrastructure and
platforms that run the applications. Example: O365 in
the cloud.
* Note: "Managed by Microsoft" is just an example; it's essentially the cloud provider of your choice.
Reference: https://en.wikipedia.org/wiki/Cloud_computing
Image Source: http://cloudcomputing.sys-con.com/node/2932
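The split between what the User controls and what the Cloud Provider manages can be written down as a small lookup. This is a sketch under assumptions: the nine layer names in `STACK` are a commonly used breakdown that is not spelled out in this deck, and the boundaries per model follow the descriptions above.

```python
# ASSUMPTION: a typical layering of the stack, bottom to top.
STACK = [
    "networking", "storage", "servers", "virtualization",
    "operating system", "middleware", "runtime", "data", "applications",
]

# Lowest layer the user controls in each service model (None = nothing).
LOWEST_USER_LAYER = {
    "Private": "networking",      # user controls everything
    "IaaS": "operating system",   # provider: virtualization and hardware
    "PaaS": "data",               # user: application and data only
    "SaaS": None,                 # provider runs the whole stack
}

def managed_by(model: str, layer: str) -> str:
    """Return 'user' or 'provider' for a given service model and layer."""
    lowest = LOWEST_USER_LAYER[model]
    if lowest is None:
        return "provider"
    return "user" if STACK.index(layer) >= STACK.index(lowest) else "provider"

for model in LOWEST_USER_LAYER:
    user_layers = [layer for layer in STACK if managed_by(model, layer) == "user"]
    print(f"{model}: user controls {user_layers or 'nothing'}")
```

Reading the output from Private down to SaaS shows the user's share of the stack shrinking while the provider's share grows.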
Key Characteristics
On-demand self-service:
A consumer can provision computing capabilities as needed, automatically, without requiring human interaction
with each service provider.
Device and location independence:
Users can access service using a web browser regardless of location or device used (e.g., PC, mobile phone).
Resource pooling:
Computing resources are pooled to serve multiple consumers, with different physical and virtual resources
dynamically assigned and reassigned according to consumer demand.
Scalability and elasticity:
Dynamic on-demand provisioning of resources on a fine-grained, self-service basis in near real-time without
users having to engineer for peak loads.
Measured service:
Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and
consumer of the utilized service.
Reference: http://www.nist.gov/itl/cloud/ and https://
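A quick back-of-the-envelope sketch shows why elasticity matters when you pay for what you use. All numbers below are hypothetical (made-up demand figures and a made-up capacity per server); the point is only the comparison between provisioning for peak load and provisioning on demand.

```python
import math

def instances_needed(demand: float, capacity_per_instance: float) -> int:
    """Smallest number of instances that can serve the demand (at least 1)."""
    return max(1, math.ceil(demand / capacity_per_instance))

# HYPOTHETICAL demand per hour (requests/sec) and capacity of one instance.
hourly_demand = [120, 80, 60, 400, 950, 300]
CAPACITY_PER_INSTANCE = 100

# Elastic: scale the number of instances up and down each hour.
elastic = [instances_needed(d, CAPACITY_PER_INSTANCE) for d in hourly_demand]
print(elastic)  # [2, 1, 1, 4, 10, 3]

# Fixed: engineer for the peak and keep that many instances all the time.
peak = max(elastic)
fixed_instance_hours = peak * len(hourly_demand)
elastic_instance_hours = sum(elastic)
print(fixed_instance_hours, elastic_instance_hours)  # 60 21
```

Because the service is measured, the elastic plan is billed for 21 instance-hours here instead of the 60 that a peak-sized fixed deployment would tie up.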
Advantages and Risks
Advantages
o Scalability and elasticity by design (dynamic on-demand provisioning of resources)
o Convenience by design (device and location independence)
o Continuous Availability by design (on-demand self-service)
o Improved Reliability due to use of multiple redundant sites
o Faster Deployment since infrastructure set up is quick, and software integration is easier
o Cost Reduction due to savings on sunk cost of infrastructure, licenses, and maintenance
Risks
o Limited Control over infrastructure, software, and data
o Security and Privacy of data is at the mercy of the Service Provider
o Dependency on the Provider can lead to vendor lock-in and migration challenges
o Downtime of service can occur due to Service Provider outage or network access issues
Reference: https://en.wikipedia.org/wiki/Cloud_computing
What did we learn today?
Four key elements make up big data platforms:
o Virtualization, Cloud Computing, Parallel Programming and Map Reduce.
Virtualization means that Applications can use a resource without any concern
for where it resides, what the technical interface is, how it has been
implemented, which platform it uses, and how much of it is available.
o Virtualization can occur at different levels of the stack: Server, Storage, Network, Desktop
and Application.
Cloud computing is a model for enabling convenient, on-demand network
access to a shared pool of configurable computing resources (e.g., networks,
servers, storage, applications, and services) that can be rapidly provisioned
and released with minimal management effort or service provider interaction.
o Three Deployment Models: Private, Public, Hybrid.
o Four Service Models: Private, IaaS, PaaS, SaaS.
o There are Advantages and Risks involved in Cloud Computing that one must be aware of.
Required Readings for this Lecture
Contents of this Deck
o Note: Anything I've linked to as Source, Reference, or Optional Reading in the deck is not required reading.
Supplemental notes you take during class
Homework - spend 5-10 minutes on each of these sites: Amazon AWS, Microsoft Azure,
Google Cloud
o Do you now see a number of familiar terms on these sites?
o What deployment models do they cover?
o What service models do they cover?
o Note how they all have very similar competing offers (including free trials to improve adoption).