DS5460: Big Data Scaling
Week 1: Course Introduction and Cloud Computing
Instructor: Dana Zhang
What will the course cover?
• We will learn to analyze/process different types of data:
- High-dimensional data
- Labeled data
- Large-volume data
• We will learn to use different models of computation:
- MapReduce
- Large-scale machine learning algorithms (PySpark)
Tentative Course Schedule
Large-scale Computing
• Large-scale computing for data mining
- Problems of commodity hardware
• Challenges:
- How do we distribute computation?
- How can we make it easy to write distributed programs?
An Idea and a Solution
• Issue:
- Copying data over a network takes time
• Idea:
- Bring computation to data
- Store files multiple times for reliability
• Storage Infrastructure – File system
- Google: GFS, Hadoop: HDFS
• Programming model
- MapReduce
- Spark
Course Logistics
Course Staff
Instructor: Prof. Peng “Dana” Zhang
Email: [Link]@[Link]
Office: DSI A2110
Office Hours: MW 1:30pm – 2:30pm
Also available by appointment
Teaching Assistants:
Siyuan “Jasmine” Guo ([Link]@[Link]) – head TA
Siyu Yang ([Link]@[Link])
Sunny Bhatt ([Link]@[Link])
Course Tools
• We will post course content and announcements to Brightspace, so
check it regularly (at least once every 24 hours)
• We will use Piazza for Q&A – you will receive an email invitation to
enroll in Piazza
• Gradescope will be used for all assignment submissions and for
sharing graded feedback
Course Material
• Brightspace
- Course syllabus
- Lecture slides, in-class demo notebooks
- Homework documents and related files
• Optional textbook:
- Mining of Massive Datasets by A. Rajaraman, J. Ullman, and J. Leskovec
- Published by Cambridge University Press and available for free at [Link]
Assessments
• Homework Assignments 35%
- Will be announced in class
- Submit via Gradescope
- The use of GenAI is allowed, but you must write up your own solutions in your
own words and indicate how GenAI was used, if applicable.
• In-class participation 15%
- There will be live coding exercises and/or quizzes in most weeks (usually
Wednesdays). The submission deadline may vary:
- some will be due within 3 hours after class ends
- some will be due at the end of the class day
• Midterm 25%:
- In class exam
- The exact date will be confirmed two weeks in advance
• Final Project 25%:
- Final project will be released after the midterm.
- Final projects can be done in groups of 1-3 people. We encourage you to form a
group of at least 2 members, as the project can be quite lengthy. We will discuss
project details after the midterm.
Late Policy
• Each student has 5 penalty-free late (calendar) days that can be
used for any assignments except for the midterm and final project.
• These late days can be used at any time without needing to provide
a reason.
• Assignments turned in after the due date will use up late days in 24-hour
increments. For example:
• If an assignment is up to 24 hours late, it uses 1 full late day.
• If an assignment is 24-48 hours late, it uses 2 full late days.
• After a student has used all 5 late days, any additional late submission
will receive a 20% deduction from the grade per calendar day late.
For example,
• An assignment turned in 2 days late after all late days are used would receive
a 40% grade deduction.
Extra Credit
• Extra credit (proportional to your contribution during class)
- Students may receive bonus points by answering questions in class or
engaging in class discussions.
- The maximum amount of extra credit that can be applied to a student’s
overall grade is 5%.
- The extra credit earned will be sent to you in week 13 by the head TA.
Intro to Cloud Computing
What is a Cloud?
A distributed system with huge storage and plenty of compute cycles:
• Servers (compute nodes in racks)
• Switches connecting racks
• Network
• Storage nodes
• Software
A rudimentary example of a cloud
Task: Big data (TB) processing
Desired spec: 128GB RAM, 10TB storage
Built from many commodity nodes (16GB RAM / 500GB storage each) connected by a network
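The node count for a cluster like this follows from a quick back-of-the-envelope calculation (a sketch; the 128GB/10TB requirement and the 16GB/500GB node spec come from the slide above):

```python
import math

# Desired spec for the big-data task (from the slide)
required_ram_gb = 128
required_storage_gb = 10 * 1000  # 10 TB

# Per-node spec of one commodity machine (from the slide)
node_ram_gb = 16
node_storage_gb = 500

# Nodes needed to satisfy each resource independently
nodes_for_ram = math.ceil(required_ram_gb / node_ram_gb)              # 8 nodes
nodes_for_storage = math.ceil(required_storage_gb / node_storage_gb)  # 20 nodes

# The cluster must satisfy both constraints at once,
# so the binding constraint (storage) decides the size.
nodes_needed = max(nodes_for_ram, nodes_for_storage)
print(nodes_needed)  # 20
```

Here storage, not RAM, is the binding constraint, which is typical for data-heavy workloads on commodity hardware.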
Data Centers
(Photos of Facebook and Google data centers.)
What is Cloud Computing?
Cloud computing:
• Internet-based computing in which large groups of remote servers are
networked so as to allow sharing of data-processing tasks, centralized
data storage, and online access to computer services or resources.
• Any computer-related task that is done entirely over the Internet.
• Allows users to use software without owning the hardware.
• Everything is done remotely; nothing is saved locally.
History
• 1950s & '60s: it was theorized that the world would have
cloud computing
“computation may someday be organized as a
public utility” – John McCarthy, 1960
• 1990s: start of VPNs and efficient infrastructure
• 2000s: Amazon builds efficient servers / AWS
Commercial Cloud Providers
• Private Cloud Providers (customized for company)
• Public Cloud (accessible to anyone):
- AWS: Amazon Web Services
• S3 (Simple Storage Service) – store data sets; pay per GB/month
• EC2 (Elastic compute cloud) – run apps, pay per CPU hour
- Microsoft Azure (Azure storage, Virtual Machines)
- Google Cloud (Dataproc, Google App Engine/Compute Engine,
Google Cloud Storage, etc)
- Other (Rightscale, Salesforce, Oracle, Cloudera, VMWare etc.)
Virtualization
• Virtualization is the fundamental technology that powers cloud computing.
Cloud Services
Commercial Clouds provide:
• Infrastructure as a Service (IaaS)
• The most basic service; users manage most
components themselves
• Platform as a Service (PaaS)
• Users are given software and hardware
automatically but can develop their own
apps
• Software as a Service (SaaS)
• All software and hardware are
encapsulated
• User only knows their own access point
Cloud Challenges
• Equipment Failures
• With so many machines, a steady rate of failures is expected and constant
maintenance is required
• Scalability
• Cloud needs to be able to add more servers
• Asynchronous processing
• Clocks of different servers cannot all be synchronized to each other
• Concurrency
• Many machines may try to access the same data
Scale Out, Not Scale Up
• Scale up = grow your cluster by replacing machines with more powerful
ones
• Not cost-effective
• Requires frequently replacing machines with the latest model
• Scale out = grow your cluster by adding more off-the-shelf machines
• Cheaper
• Faster machines gradually replace older machines
Storage
• Requires stable storage
• Solution = Distributed File System
• Google GFS
• Hadoop HDFS
• They can manage huge files (TB)
• Multiple copies of each file improve availability
Big Data in Cloud
• Big Data = a collection of data so large that it is impossible to process on
one computer
• Requires a number of application solutions designed to deal with this
problem
• Bring computation to data!
• Store files multiple times for reliability!
Distributed File System
• Primary node (NameNode in Hadoop’s HDFS)
• Stores metadata about where files are stored
• Might be replicated
• Chunk servers
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks
• Client library for file access
• Talks to the primary node to find chunk servers
• Connects directly to chunk servers to access data
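The chunking and replication numbers above translate directly into raw storage cost. A minimal sketch (the 64MB chunk size and 3x replication are the typical values named in the slide; `chunk_storage` is a hypothetical helper, not part of HDFS or GFS):

```python
import math

def chunk_storage(file_size_mb, chunk_size_mb=64, replication=3):
    """Number of chunks and total raw storage (in MB) for one file,
    using the typical chunk size and replication factor from the slide."""
    n_chunks = math.ceil(file_size_mb / chunk_size_mb)
    raw_mb = n_chunks * chunk_size_mb * replication
    return n_chunks, raw_mb

# A 1 TB file split into 64 MB chunks, each replicated 3x:
chunks, raw = chunk_storage(1_000_000)
print(chunks, raw)  # 15625 chunks, 3,000,000 MB (~3 TB) of raw storage
```

This is why replication factors matter for capacity planning: 3x replication triples the disk footprint of every file in exchange for availability.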
Cloud Solutions: MapReduce
• MapReduce: an early distributed computing programming model
• MapReduce is a style of programming designed for:
1. Easy parallel programming
2. Invisible management of hardware and software failures
3. Easy management of very-large-scale data
• It has several implementations, including Hadoop, Spark (used in this
course), Flink, and the original Google implementation just called
“MapReduce”
• A classic MapReduce example: the word count application
- Each record is processed independently
- Records can be processed in parallel
- The input set can be split into many map tasks
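The word-count pipeline can be sketched in plain Python, one function per MapReduce phase (a simplified single-machine sketch; in a real MapReduce system the map and reduce tasks run on many machines and the framework handles the shuffle):

```python
from collections import defaultdict

def map_phase(record):
    """Map: process one record (line) independently,
    emitting a (word, 1) pair for each word."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Each record could be handled by a separate map task in parallel.
records = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because `map_phase` looks at one record at a time, records can be split across any number of map tasks, which is exactly the property the slide highlights.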
Before the next class
• Homework 0: Ungraded but required before the next class
• Visit [Link] and sign up for a Google Cloud trial
account
• The trial will give you $300 free credit for 3 months
• You will need to sign up with a personal credit card, but it will not charge you
unless you 'activate' the full account
The practical skills
• We will be using Google Cloud for most in-class exercises and
homework assignments, so lots of hands-on practice for you!
• You will therefore need to create an account first
Microsoft Azure
• We will also be switching to MS Azure later in the semester after your
trial with GCP ends
• One more highly sought-after technology under your belt