Big Data Analytics (BDA)
Name of the Faculty:
Shankar Venkatagiri
Affiliation:
(Institution / Organisation & Designation)
Associate Professor, IIM Bangalore
Teaching Area:
(such as Finance & Accounting; Marketing; Production
& Operations Management; Strategy)
Information Systems
This course may be offered to:
(PGP, FPM, PGPEM, PGPPM, EPGP)
http://www.iimb.ernet.in/programmes
PGP open to all programs
Credits (No. of hours):
(3 credits = 30 classroom hours; 1.5 credits = 15
classroom hours; session = 90 minutes)
3 credits
Term / Quarter:
(Starting April /June /September/December)
Term 4 of PGP
Course Type:
(Regular: staggered across the term;
Workshop¹: 3-5 continuous days)
Regular
There shall be 2 sessions by guest speakers
Additional information required from the industry
Are there any financial implications to this course?
Travel and board for guest speakers, if applicable.
¹ Workshop course: Please provide reasons as to why the course is being offered in workshop mode and why it cannot be offered
as a regular course (that is, spread over 10 weeks). As an institution, IIMB prefers courses offered in the regular mode, since it
results in a better learning experience for the students and avoids overlapping of courses.
Big Data Analytics
Course Summary
The above image from Google Trends shows how “big data” is a phenomenon that is less than a decade old as
of this writing, and whose usage appears to have already peaked. This course looks beyond the hype of big
data², into the opportunities and challenges for businesses in a world driven by information that must be
extracted from large, heterogeneous data sources.
There are multiple dimensions to big data, the most obvious of them being volume (as in big!). Starting with
the 1970s, data was organised into tabular entities, with linking relationships. Commercial database systems
such as Oracle’s RDBMS and open source MySQL implemented this E-R model of “structured” data. During
the 1980s, data warehouses helped store and analyse large amounts of data, so that managers could make
informed business decisions.
The advent of the Internet during the 1990s was a big bang event in the world of data. Notably, web pages did
not follow a set format. The first decade of the 2000s witnessed the rise of social networks such as Facebook
and Twitter, whose messages did not adhere to a fixed structure. Consequently, special analytical techniques
had to be devised to accommodate this variety of “unstructured” data, consisting of textual content, audio,
imagery, video and so on.
In the present decade, watches, phones, and a host of other sensors have come to dominate the realm of data
generation. With storage on the cloud becoming cheap and communication scaling up to 5G speeds, the age of
IoT has finally dawned upon us. Computational platforms will have to deal with real-time data that arrive at
tremendous velocity. Handling this aspect of big data is key to the success of companies such as Uber and
Tesla.
The ubiquity of mobile devices like phones has its disadvantages. On the one hand, information about anything
is literally at one’s fingertips. On the other hand, recent advances in AI have made it effortless to create content
that appears credible and can be passed off as real news. The 2016 US election as well as the Brexit vote were
influenced by targeted campaigns on Facebook, whose veracity was in question.
This course, titled Big Data Analytics, has been specifically designed to cover these dimensions.
Pre-requisites, inclusion/exclusion criteria (if any)
² Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International
Journal of Information Management, 35(2), 137-144.
Students who have not programmed earlier may struggle with some of the coding, compared to
others who have prior experience. Even though we shall dedicate two sessions to cover Python basics,
students are expected to further strengthen their proficiency in the language via a suitable MOOC.
Consulting websites such as Stack Overflow will be considered de rigueur. One can always pick up
Python using the tutorial available on DataCamp. Students will be able to work with software packages (e.g.
Hadoop, Spark) independently on their laptops.
Learning Objectives
The course is designed with the following specific objectives and learning outcomes:
a. Explore the 4 traditional Vs of big data – volume, velocity, variety and veracity
b. Execute computations on distributed platforms that facilitate big data analytics
c. Examine applications of big data to finance, transportation and healthcare
Pedagogy
The course employs a mix of lecture and hands-on exploration in class. The book by Sandy Ryza et al., titled
Advanced Analytics with Spark, 2nd Edition, O’Reilly (2017), will play a central role in the course.
Students are required to bring their laptops to every session. With each technology platform, we shall cover the
basic syntax and usage, and then try out an application. To minimise incompatibilities among operating
systems, we shall make extensive use of a Docker image.
For larger scale examples, students will learn how to avail of cloud offerings from Amazon or Microsoft.
Guest speakers from these companies shall address the students on the features of their platforms.
A term project (max. group size of 4) will help bring all the concepts together.
Course Evaluation & Grading Pattern
Midterm 30%
Final 30%
Quizzes (2) -----
Project 30%
Class Participation 10%
Constructing exams is an arduous task. There shall be no make-ups for any component.
Please note that the project shall involve multiple submissions and a face-to-face viva. Any act of plagiarism –
either in code or in written form – shall be subject to strict disciplinary action as laid down by the norms of
IIMB. You know the rules.
Session-wise plan
Session Topic
1&2 An Overview of Big Data
The popular book Big Data by Viktor Mayer-Schönberger & Kenneth Cukier
guides our discussion of major shifts in thinking about big data, facilitated by
access to greater computational power and cheap storage.
Video: Big Data: A revolution that will transform how we live
http://www.youtube.com/watch?v=bYS_4CWu3y8
3&4 Introduction to Python
This is a brief introduction to the Python language using the browser-friendly
Jupyter environment. We will adopt a containerised approach (Docker),
which supports a unified interface across multiple operating systems. The
sessions go over standard libraries such as NumPy and pandas. For
visualisation, we will cover Matplotlib, Seaborn and Plotly.
Preparation:
1. Altintas, Ilkay, and Porter, Leo. Python for Data Science – a free
MOOC offering https://www.edx.org/course/python-for-data-science-1
2. There will be a short quiz next week to test your grasp of the language
5 VOLUME: Google Cloud and BigQuery
Businesses that support users at scale are moving their applications to the
cloud. In this session, we go over structured data representation and querying
tables using SQL. Next, we discuss Google’s cloud offerings, and execute
SQL-style queries on large, structured datasets with BigQuery.
Students must have Google Cloud accounts, free for a period of 2 months.
Video: Google BigQuery introduction by Jordan Tigani
https://www.youtube.com/watch?v=kKBnFsNWwYM
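To make the idea of querying structured tables concrete, here is a minimal sketch of an SQL aggregation. It assumes Python's built-in sqlite3 module as a local stand-in for BigQuery; the trips table and its rows are hypothetical and invented purely for illustration. The same GROUP BY pattern carries over to SQL-style queries on much larger datasets.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips (city, fare) VALUES (?, ?)",
    [("Bangalore", 120.0), ("Bangalore", 80.0), ("Mumbai", 200.0)],
)

# Aggregate query: number of trips and average fare per city.
query = """
    SELECT city, COUNT(*) AS n_trips, AVG(fare) AS avg_fare
    FROM trips
    GROUP BY city
    ORDER BY city
"""
rows = conn.execute(query).fetchall()
for city, n_trips, avg_fare in rows:
    print(city, n_trips, avg_fare)
conn.close()
```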
6 Distributed Computing
The paradigm of large-scale computing has evolved from parallel processing
on multiple processors to virtualisation on a single server to cloud computing.
In this session, we explore computational paradigms enshrined in
MapReduce.
Reading:
Dean, J. and Ghemawat, S. MapReduce: simplified data processing on
large clusters. Communications of the ACM 51, no. 1 (2008): 107-113.
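The MapReduce paradigm can be sketched on a single machine with the classic word count. This is an illustrative sketch only: the function names are our own, and a real framework would run the map and reduce phases in parallel across a cluster and perform the shuffle over the network.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data at scale"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be distributed without changing the logic.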
7&8 The Hadoop Ecosystem
The Hadoop platform enables batch processes to be executed in parallel on a
cluster of machines. We dedicate these sessions to understanding HDFS and
using Hive to make SQL-like calls on large datasets.
As preparation, students must set aside two hours to go through the following
video series. A short quiz shall be held to assess their understanding.
Video Series: Introduction to Hadoop and MapReduce
https://www.youtube.com/watch?v=44K_bzTL_SM
9 Spark Overview with Scala
While Hadoop is excellent at distributing data storage across a cluster, its
MapReduce jobs run in batch mode and write intermediate results to disk; a
nimbler approach is to keep the working data in memory between steps.
Apache Spark is a powerful computational platform that works with data
loaded into memory.
We go over the architecture of Spark and provide an overview of the
platform in Scala. We then perform a data de-duplication task for a German
hospital, with data stored on the local drive as well as on HDFS.
Reading: Chapters 1 & 2 of Ryza et al.
10 Business Applications
Processing geospatial data efficiently lies at the heart of transportation and
drives the business models of companies like Ola/Uber. We use our
knowledge of Spark to analyse taxi trip data. In our next illustration, we learn
how to estimate financial risk using Monte Carlo Simulation.
Reading: Chapters 8 & 9 of Ryza et al.
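As a rough illustration of Monte Carlo risk estimation, the sketch below estimates a one-day Value at Risk (VaR). It assumes, as a deliberate simplification, that daily portfolio returns are normally distributed with a given mean and standard deviation; the parameter values are hypothetical, and the function is our own rather than the approach taken in the reading.

```python
import random

def simulate_var(mean, stdev, n_trials=10_000, confidence=0.95, seed=42):
    """Estimate Value at Risk as a positive loss fraction of portfolio value."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    # Simulate many possible one-day returns and sort them, worst first.
    returns = sorted(rng.gauss(mean, stdev) for _ in range(n_trials))
    # The (1 - confidence) quantile of simulated returns is the loss threshold.
    index = int((1 - confidence) * n_trials)
    return -returns[index]

# Hypothetical parameters: 0.05% mean daily return, 2% daily volatility.
var_95 = simulate_var(mean=0.0005, stdev=0.02)
print(f"95% one-day VaR: {var_95:.2%} of portfolio value")
```

Each trial is independent of the others, which is why simulations of this kind parallelise naturally on a cluster.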
11 & 12 Spark Overview with Python
In the first session, we provide an overview of PySpark, and learn how to
work with different types of data via aggregations and joins. We access data
stored on various sources (CSV/JSON/Parquet) and observe differences in
efficiency.
The second session is dedicated to the versatile SparkSQL mechanism.
Reading: SQL at scale with SparkSQL and DataFrames – D. Sarkar (2018)
https://towardsdatascience.com/sql-at-scale-with-apache-spark-sql-and-dataframes-concepts-architecture-and-examples-c567853a702f
13 & 14 Big Data Platforms on the Cloud (Guest Speakers)
While it is possible to purpose a computer for big data processing, real world
applications demand scale. In separate sessions, representatives from Amazon
and Microsoft will introduce students to features of their cloud platforms and
demonstrate how big data tasks are carried out in a professional setting.
Insights gained from these sessions will help students to formulate project
ideas and execute them on a large-scale platform.
15 & 16 Analytics with Big Data
Spark’s ML and MLlib libraries support a vast collection of supervised and
unsupervised algorithms at scale.
In the first session, we show how to apply decision trees to analyse the
responses of a customer satisfaction survey. In our next illustration, we apply
Alternating Least Squares (ALS) to the Audioscrobbler dataset and make
music recommendations.
In the second session, we consider an example in cybersecurity and learn how
to detect network intrusions with k-means clustering. We then carry out a
Latent Semantic Analysis (LSA) on Wikipedia in an application of natural
language processing.
Readings:
1. SFO Survey Machine Learning Example
https://docs.databricks.com/_static/notebooks/decision-trees-sfo-airport-survey-example.html
2. Chapters 3, 4, 5 & 6 of Ryza et al.
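The idea behind detecting intrusions with k-means can be sketched as follows: cluster the observed traffic, then flag any point that lies unusually far from its nearest centroid. This is a plain-Python sketch rather than Spark MLlib; the two-feature data points, the naive initialisation and the threshold rule are all assumptions made for illustration.

```python
import math

def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

def kmeans(points, k, n_iter=20):
    centroids = list(points[:k])  # naive initialisation: the first k points
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)[0]].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

# Hypothetical two-feature summaries of network connections.
traffic = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.9), (0.9, 1.1), (8.1, 7.9), (7.9, 8.2)]
centroids = kmeans(traffic, k=2)
# Assumed rule: anything farther out than every known point is suspicious.
threshold = max(nearest(p, centroids)[1] for p in traffic)

def is_anomaly(point):
    return nearest(point, centroids)[1] > threshold

print(is_anomaly((4.5, 4.5)), is_anomaly((1.1, 1.0)))
```

In practice the number of clusters and the threshold would be tuned on the data, and features would be normalised before distances are computed.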
17 VELOCITY: Streaming Applications
Having addressed the volume dimension of big data, we next turn to velocity.
Spark strongly supports stream processing, which is the act of continuously
ingesting real-time data to compute or update a result. This is a key requirement
for tasks like generating notifications, processing credit card activity, and
more recently, IoT sensor data.
In this session, we demonstrate the streaming features of Spark with Kafka, a
distributed streaming platform.
Reading: IoT: Real-time Data Processing and Analytics using Spark / Kafka
https://github.com/yugokato/Spark-and-Kafka_IoT-Data-Processing-and-Analytics
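The essence of stream processing, updating a result as each record arrives instead of recomputing it from scratch, can be sketched without any streaming platform. The generator below is an illustrative assumption, not Spark or Kafka code: it maintains a running mean per sensor over a hypothetical sequence of readings.

```python
from collections import defaultdict

def running_averages(events):
    """Yield (sensor_id, updated_mean) after each incoming reading."""
    count = defaultdict(int)
    mean = defaultdict(float)
    for sensor_id, value in events:
        count[sensor_id] += 1
        # Incremental mean update: past readings need not be stored.
        mean[sensor_id] += (value - mean[sensor_id]) / count[sensor_id]
        yield sensor_id, mean[sensor_id]

# Hypothetical sensor readings arriving one at a time.
stream = iter([("s1", 20.0), ("s2", 30.0), ("s1", 22.0), ("s1", 24.0)])
updates = list(running_averages(stream))
for sensor_id, value in updates:
    print(sensor_id, value)
```

The state kept per sensor is constant in size, which is what makes this style of computation feasible on unbounded streams.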
18 VARIETY: Unstructured Data
Having addressed the volume and velocity dimensions of big data, we turn
our attention to variety. A large proportion of real-world big data is
unstructured. One classic example of this is network data, which arises from
diverse domains such as social media, epidemiology, and so on.
In this session, we use the Gephi package to load and explore networked datasets
and develop an analytic appreciation for graphs.
19 Social Networks
The first part of this century has witnessed the emergence of social networks.
We examine the role of selection and socialisation in community formation.
We then provide an overview of the PageRank algorithm.
Reading:
Borgatti, S. P., Mehra, A., Brass, D. J. and Labianca, G. Network analysis in the
social sciences. Science 323, no. 5916 (2009): 892-895.
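PageRank can be sketched as a power iteration over a link graph. The version below is a minimal single-machine sketch; the damping factor of 0.85 is the commonly used value, and the four-page graph is hypothetical.

```python
def pagerank(links, damping=0.85, n_iter=50):
    """links maps each node to the list of nodes it points to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(n_iter):
        # Every node receives a small baseline share of rank.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D all point (directly or not) towards C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

A node ranks highly when it is pointed to by many nodes, or by a few nodes that themselves rank highly.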
20 Distributed Graphs
Spark’s GraphX is a library to handle network datasets. We cover basic
classes and operations that help us build graphs. Next, we learn how to obtain
quantities such as triangle count, shortest paths, connected components, etc.
Finally, we apply our knowledge of PageRank to identify hubs in a large
bikeshare dataset.
Reading: Chapter 7 of Ryza et al.
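To make two of these graph quantities concrete, the sketch below computes connected components and the triangle count for a small undirected graph. It is plain Python rather than GraphX, and the edge list is hypothetical; a distributed library computes the same quantities over graphs too large for one machine.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def connected_components(adj):
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from an unvisited node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components

def triangle_count(adj):
    # Each triangle is seen once from each of its three corners.
    corners = sum(
        1 for node in adj for a, b in combinations(adj[node], 2) if b in adj[a]
    )
    return corners // 3

# Hypothetical edge list: one triangle with a tail, plus a separate pair.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")]
adj = build_adjacency(edges)
components = connected_components(adj)
triangles = triangle_count(adj)
print(components, triangles)
```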
21 Neural Networks
This session provides an introduction to artificial neural networks, and shows
how Google’s TensorFlow platform can facilitate the distributed processing of
large-scale imagery using deep convolutional nets. This capability powers the
business models of automobile companies like Tesla, and healthcare offerings
from outfits like IBM.
22 VERACITY: Targeted Marketing
Around the turn of this century, the field of mass communications took a
sharp turn: user generated content was at the heart of this revolution. The
mechanism of targeted advertising by the likes of Google was vastly
improved upon by Facebook, with its incisive insights into the intimate details
of an individual.
However, the user’s privacy was sacrificed along the way, and one’s personal
data became a marketable commodity. An outfit by the name of Cambridge
Analytica capitalised on Facebook’s knowledge of its users, and played a
destructive role in the 2016 US elections as well as the Brexit campaign.
This final session is dedicated to discussing the fourth dimension of veracity.
Reading: Grassegger, H. & Krogerus, M. The Data That Turned the World
Upside Down. Motherboard (Jan 2017)
Bookshelf
1. Easley, David, & Kleinberg, Jon. Networks, Crowds, and Markets. Cambridge University Press (2010)
2. Mayer-Schönberger, V., & Cukier, K. Big data: A revolution that will transform how we
live, work, and think. Houghton Mifflin Harcourt (2013)
3. Ryza, Sandy et al. Advanced Analytics with Spark, 2nd Edition. O’Reilly (2017)