Big Data Scaling: Cloud Computing Course

The DS5460 course focuses on big data scaling, covering data analysis, large-scale computing, and cloud computing concepts. Students will learn about MapReduce, machine learning algorithms, and distributed file systems while engaging in hands-on exercises using Google Cloud and Microsoft Azure. Assessments include homework, participation, a midterm, and a final project, with a late policy allowing for five penalty-free late days.

Uploaded by

keqingliush
DS5460: Big Data Scaling

Week 1: Course Introduction and Cloud Computing

Instructor: Dana Zhang


What will the course cover?
• We will learn to analyze/process different types of data:
- Data is high dimensional
- Data is labeled
- Data is large in volume

• We will learn to use different models of computation:
- MapReduce
- Large-scale machine learning algorithms (PySpark)
Tentative Course Schedule
Large-scale Computing
• Large-scale computing for data mining
- Problems of commodity hardware

• Challenges:
- How do we distribute computation?
- How can we make it easy to write distributed programs?
An idea and a Solution
• Issue:
- Copying data over a network takes time
• Idea:
- Bring computation to data
- Store files multiple times for reliability
• Storage Infrastructure – File system
- Google: GFS, Hadoop: HDFS
• Programming model
- MapReduce
- Spark
Course Logistics
Course Staff
Instructor: Prof. Peng “Dana” Zhang
Email: [Link]@[Link]
Office: DSI A2110
Office Hours: MW 1:30pm – 2:30pm
Also available by appointment

Teaching Assistants:
Siyuan “Jasmine” Guo ([Link]@[Link]) – head TA
Siyu Yang ([Link]@[Link])
Sunny Bhatt ([Link]@[Link])
Course Tools
• We will post course content and announcements to Brightspace (hence check it regularly, at least every 24 hours)

• We will use Piazza for Q&A – you will receive an email to enroll in Piazza

• Gradescope will be used for all assignment submission and sharing graded feedback
Course Material
• Brightspace
- Course syllabus
- Lecture slides, in-class demo notebooks
- Homework documents and related files

• Optional textbook:
- Mining of Massive Datasets by A. Rajaraman, J. Ullman, and J. Leskovec
- Published by Cambridge University Press and available for free at [Link]
Assessments
• Homework Assignments 35%
- Will be announced in class
- Submit via Gradescope
- The use of GenAI is allowed, but you must write up your own solutions in your own words and indicate how GenAI was used, if applicable.

• In-class participation 15%
- There will be live coding exercises and/or quizzes in most weeks (usually Wednesdays). The submission deadline may vary:
- some will be due within 3 hours after class ends
- some will be due at the end of the class day
Assessments
• Midterm 25%:
- In class exam
- The exact date will be confirmed two weeks in advance

• Final Project 25%:
- Final project will be released after the midterm.
- Final projects can be done in groups of 1-3 people. We encourage you to form a group of at least 2 members, as the project can be quite lengthy. We will discuss project details after the midterm.
Late Policy
• Each student has 5 penalty-free late (calendar) days that can be
used for any assignments except for the midterm and final project.
• These late days can be used at any time without needing to provide
a reason.
• Assignments turned in after the due date will use up late days in 24-hour increments. For example:
• If an assignment is up to 24 hours late, it uses 1 full late day.
• If an assignment is 24-48 hours late, it uses 2 full late days.
Late Policy
• After a student has used all 5 late days, any additional late submission
will receive a 20% deduction from the grade per calendar day late.
For example,
• An assignment turned in 2 days late after all late days are used would receive
a 40% grade deduction.
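
The late-day arithmetic above can be sketched in a few lines of Python. This is only an illustration, not an official grading script: the function name `late_outcome` is hypothetical, and it assumes that "calendar days late" are counted in the same 24-hour increments as late days, and that days not covered by remaining free late days are each penalized.

```python
import math

FREE_LATE_DAYS = 5       # penalty-free late days per student (from the slides)
DEDUCTION_PER_DAY = 20   # percent deducted per day once late days run out


def late_outcome(hours_late, free_days_left=FREE_LATE_DAYS):
    """Return (free_days_used, percent_deduction) for one submission."""
    if hours_late <= 0:
        return 0, 0
    # Late days are consumed in full 24-hour increments.
    days_late = math.ceil(hours_late / 24)
    free_used = min(days_late, free_days_left)
    # Assumption: any day not covered by a free late day costs 20%.
    penalized_days = days_late - free_used
    deduction = min(100, DEDUCTION_PER_DAY * penalized_days)
    return free_used, deduction


print(late_outcome(10))      # up to 24 hours late, free days available
print(late_outcome(30))      # 24-48 hours late, free days available
print(late_outcome(48, 0))   # 2 days late with no free days left
```

The last call corresponds to the example on the slide: two days late with all late days used gives a 40% deduction.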
Extra Credit
• Extra credit (proportional to your contribution during class)
- Students may receive bonus points by answering questions in class or
engaging in class discussions.
- The maximum amount of extra credit that can be applied to a student’s overall grade is 5%.
- The extra credit earned will be sent to you in week 13 by the head TA
Intro to Cloud Computing
What is a Cloud?
A distributed system with huge storage and plenty of compute cycles:

• Servers (compute nodes in racks)
• Switches connecting racks
• Network
• Storage nodes
• Software
A rudimentary example of a cloud
Task: Big data (TB) processing
Desired spec: 128GB RAM, 10TB storage
[Figure: several commodity machines, each with 16GB RAM/500GB storage, connected by a network]
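
As a rough back-of-the-envelope check, the number of 16GB/500GB machines needed to reach the desired spec can be computed as below. This is a sketch under stated assumptions: the helper `nodes_needed` is hypothetical, 10TB is taken as 10,000GB, and replication and system overhead are ignored.

```python
import math


def nodes_needed(target_ram_gb, target_storage_gb, node_ram_gb, node_storage_gb):
    """Smallest number of identical nodes that meets both targets."""
    by_ram = math.ceil(target_ram_gb / node_ram_gb)
    by_storage = math.ceil(target_storage_gb / node_storage_gb)
    # Both constraints must hold, so the larger count decides.
    return max(by_ram, by_storage)


# Desired spec: 128GB RAM and 10TB (assumed 10,000GB) of storage,
# built from machines with 16GB RAM and 500GB storage each.
print(nodes_needed(128, 10_000, 16, 500))  # 20
```

Under these assumptions storage is the binding constraint: 8 machines would cover the RAM, but 20 are needed for the storage.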
Data Centers

[Images: Facebook and Google data centers; source links not recoverable]
What is Cloud Computing?
Cloud computing:
• Internet-based computing in which large groups of remote servers are
networked so as to allow sharing of data-processing tasks, centralized
data storage, and online access to computer services or resources.
• Any computer related task that is done entirely on the Internet.
• Allows users to deal with the software without having the hardware.
• Everything is done remotely, nothing is saved locally.
History

• 50's & 60's: theorized that the world would have cloud computing
“computation may someday be organized as a public utility” – John McCarthy, 1960

• 90's: start of VPNs and efficient infrastructure

• 00's: Amazon builds efficient servers/AWS
Commercial Cloud Providers
• Private Cloud Providers (customized for company)
• Public Cloud (accessible to anyone):
- AWS: Amazon Web Services
• S3 (Simple Storage Service) – store data sets pay per GB/month
• EC2 (Elastic Compute Cloud) – run apps, pay per CPU hour
- Microsoft Azure (Azure Storage, Virtual Machines)
- Google Cloud (Dataproc, Google App Engine/Compute Engine, Google Cloud Storage, etc.)
- Other (RightScale, Salesforce, Oracle, Cloudera, VMware, etc.)
Virtualization
• Virtualization is the fundamental technology that powers cloud computing.
Cloud Services
Commercial clouds provide:
• Infrastructure as a Service (IaaS)
- Basic: service users manage most components
• Platform as a Service (PaaS)
- Users are given software and hardware automatically but can develop their own apps
• Software as a Service (SaaS)
- All software and hardware are encapsulated
- User only knows their own access point
Cloud Challenges
• Equipment Failures
• With so many machines, steady rate of failures is expected and constant
maintenance is required
• Scalability
• Cloud needs to be able to add more servers
• Asynchronous processing
• Clocks of different servers cannot all be synchronized to each other
• Concurrency
• Many machines may try to access the same data
Scale Out, Not Scale Up
• Scale up = grow your cluster by replacing machines with more powerful ones
• Not cost-effective
• Need to replace machines with the latest model, often

• Scale out = grow your cluster by adding more off-the-shelf machines
• Cheaper
• Faster machines gradually replace older machines
Storage
• Require stable storage
• Solution = Distributed File System
• Google GFS
• Hadoop HDFS
• They can manage huge files (TB)
• Multiple copies of each file improve availability
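
To see why multiple copies improve availability, consider a simplified model (an assumption for illustration, not something stated in the slides): each node holding a replica is unavailable independently with probability p, so a file is unreachable only if all of its replicas are down at once. The value p = 0.1 below is hypothetical.

```python
def availability(p_node_down, replicas):
    """Probability that at least one replica is reachable,
    assuming independent node failures."""
    return 1 - p_node_down ** replicas


# Hypothetical per-node unavailability of 10%.
for r in (1, 2, 3):
    print(f"{r} replica(s): {availability(0.1, r):.4f}")
```

Real failures are often correlated (for example, a whole rack losing power), so this model is an optimistic estimate rather than a guarantee.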
Big Data in Cloud
• Big Data = a collection of data so large that it is impossible to process on one computer
• Requires a number of application solutions designed to deal with this problem

• Bring computation to data!
• Store files multiple times for reliability!
Distributed File System
• Primary node (Name Node in Hadoop’s HDFS)
• Stores metadata about where files are stored
• Might be replicated

• Chunk servers
• File is split into contiguous chunks
• Typically each chunk is 16-64MB
• Each chunk replicated (usually 2x or 3x)
• Try to keep replicas in different racks

• Client library for file access
• Talks to the primary node to find chunk servers
• Connects directly to chunk servers to access data
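
The chunking and replica placement described above can be sketched in plain Python. This is a toy model, not how GFS or HDFS is actually implemented: the function names, the rack and server names, and the round-robin placement rule are all assumptions made for illustration. The returned dictionary plays the role of the metadata kept by the primary node.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64MB, the upper end of the range on the slide
REPLICAS = 3


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs for the contiguous chunks of a file."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, servers_by_rack, replicas=REPLICAS):
    """Map chunk id -> list of chunk servers, one replica per rack."""
    racks = sorted(servers_by_rack)
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        chosen = []
        for r in range(replicas):
            # Consecutive rack indices are distinct because replicas <= len(racks).
            rack = racks[(chunk_id + r) % len(racks)]
            servers = servers_by_rack[rack]
            chosen.append(servers[chunk_id % len(servers)])
        placement[chunk_id] = chosen
    return placement


# Hypothetical cluster: three racks with two chunk servers each.
cluster = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"], "rack-c": ["c1", "c2"]}
chunks = split_into_chunks(200 * 1024 * 1024)        # a 200MB file
metadata = place_replicas(len(chunks), cluster)       # kept by the primary node
print(len(chunks), metadata[0])
```

A client would first ask the primary node for `metadata[chunk_id]` and then read the bytes directly from one of the listed chunk servers.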
Cloud Solutions: MapReduce
• MapReduce: Early Distributed Computing Programming Model
Cloud Solutions: MapReduce
• MapReduce is a style of programming designed for:
1. Easy parallel programming
2. Invisible management of hardware and software failures
3. Easy management of very-large-scale data

• It has several implementations, including Hadoop, Spark (used in this course), Flink, and the original Google implementation, just called “MapReduce”
Cloud Solutions: MapReduce
• A classic MapReduce example: word count application
- Each record is processed independently
- Records can be processed in parallel
- Input set can be split into many map tasks
[Figure: word count data flow; source link not recoverable]
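
The word count example can be imitated on a single machine to show the map, shuffle, and reduce phases. This is a minimal sketch in plain Python, not Hadoop or PySpark code, and the function names are chosen here only for illustration.

```python
from collections import defaultdict


def map_fn(record):
    """Map: emit (word, 1) for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(records):
    # Each record is mapped independently, so in a real cluster
    # the map calls could run in parallel on different machines.
    mapped = (pair for record in records for pair in map_fn(record))
    return dict(reduce_fn(word, counts) for word, counts in shuffle(mapped).items())


print(word_count(["the cat", "the dog"]))
```

Because `map_fn` looks at one record at a time and `reduce_fn` at one key at a time, the input can be split into many map tasks without changing the result.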
Before the next class
• Homework 0: Ungraded but required before the next class

• Visit [Link] and sign up for a Google Cloud trial account
• The trial will give you $300 free credit for 3 months
• You will need to sign up with a personal credit card, but it will not be charged unless you 'activate' the full account
The practical skills
• We will be using Google Cloud for most in-class exercises and
homework assignments, so lots of hands-on practice for you!

• You will therefore need to create an account first
Microsoft Azure
• We will also be switching to MS Azure later in the semester after your
trial with GCP ends

• One more highly sought-after technology under your belt
