
Big Data Engineering Curriculum

This document outlines the curriculum for a Post Graduate Program in Big Data Engineering. It includes introductory content on big data concepts and industry applications. Core topics covered include data structures like hash tables, bloom filters, and k-d trees. The curriculum also focuses on algorithm design, including techniques like divide-and-conquer and distributed algorithms using MapReduce. Exercises are included to apply concepts in designing and implementing algorithms and data structures.

Uploaded by

venkivlr
Copyright
© All Rights Reserved

IN COLLABORATION

WITH

POST GRADUATE PROGRAM IN


BIG DATA ENGINEERING
PROGRAM CURRICULUM

For Prep Sessions + Batch Start Dates:


Please refer to [Link]

Note: This curriculum is subject to change based on inputs from BITS Pilani and Industry

COURSE MODULE NAME SESSION SESSION NAME

PREPARATORY CONTENT

INTRODUCTION TO BIG DATA

MOTIVATION AND CONCEPTS IN BIG DATA

UNDERSTAND WHAT IS BIG DATA?

SOURCES OF BIG DATA

BIG DATA VS NORMAL DATA

CHARACTERISTICS OF BIG DATA - VOLUME, VARIETY, VELOCITY

DATA MODELS - STRUCTURED, SEMI-STRUCTURED AND UNSTRUCTURED DATA

INDUSTRY APPLICATIONS OF BIG DATA

CASE STUDY 1: CHURN PREDICTION (MOBILE COMPANIES OR CREDIT CARD COMPANIES LIKE AMEX ETC.)

CASE STUDY 2: PRODUCT RECOMMENDATIONS ON A RETAIL WEBSITE (E-BAY/SNAPDEAL ETC.)

CASE STUDY 3: GETTING RELEVANT SEARCH RESULTS USING GOOGLE SEARCH

DICTIONARY DATA STRUCTURE: LOOKUP TIME- ARRAY, SORTED ARRAY; PREPROCESSING TIME &
AMORTIZED COST

PREPROCESSING TIME - SORTING TIME, SORTING WITH A KNOWN RANGE

HASHTABLE: MOTIVATION - EXPECTED LOOKUP TIME; COLLISIONS AND DESIGN OF SEPARATELY
CHAINED HASHTABLES

HASHTABLE - LOAD FACTOR, SIZING, AND RE-HASHING

DATA STRUCTURES (LINEAR)

POLYMORPHISM - GENERICS / TEMPLATES; HASHTABLE FOR ANY TYPE OF VALUES

USING HASHTABLES: COLLECTIONS LIBRARY IN JAVA; HASHTABLE API

KEY-VALUE PAIRS AND HASHMAPS; USING HASHMAPS: HASHMAP API

EXERCISE: USE HASHMAPS AND HASHTABLES IN A SMALL APPLICATION (E.G. CONTACTS LIST IN
A MOBILE PHONE)
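
A minimal sketch of what the contacts-list exercise above might look like, assuming a plain name-to-phone-number mapping backed by `java.util.HashMap`. The class name `ContactsList` and its methods are hypothetical and chosen only for illustration; the curriculum does not prescribe a design.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical class for illustration; not part of the curriculum material.
class ContactsList {
    // Key = contact name, value = phone number.
    private final Map<String, String> phoneByName = new HashMap<>();

    // Adds a contact, or overwrites the number if the name already exists.
    void addOrUpdate(String name, String phone) {
        phoneByName.put(name, phone);
    }

    // Expected constant-time lookup; empty Optional when the name is unknown.
    Optional<String> lookup(String name) {
        return Optional.ofNullable(phoneByName.get(name));
    }

    // Returns true only if a contact was actually removed.
    boolean remove(String name) {
        return phoneByName.remove(name) != null;
    }

    int size() {
        return phoneByName.size();
    }

    public static void main(String[] args) {
        ContactsList contacts = new ContactsList();
        contacts.addOrUpdate("Asha", "555-0101");
        contacts.addOrUpdate("Ravi", "555-0102");
        contacts.addOrUpdate("Asha", "555-0199"); // same key: the value is replaced
        System.out.println(contacts.lookup("Asha").orElse("not found"));
        System.out.println(contacts.lookup("Meera").orElse("not found"));
    }
}
```

Swapping `HashMap` for the older synchronized `java.util.Hashtable` should require changing only the field initializer, since both implement `Map`.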

DATA STRUCTURE DESIGN - REVIEW & ADVANCED TOPICS

BLOOM FILTERS: MOTIVATION AND USE CASES

BLOOM FILTERS: DESIGN

BLOOM FILTERS: TIME VS. FALSE-POSITIVE TRADEOFF

POLYMORPHISM - INHERITANCE - EXAMPLE

POLYMORPHISM - GENERICS / TEMPLATES; HASHTABLE FOR ANY TYPE OF VALUES


DATA STRUCTURES & ALGORITHMS FOR BIG DATA

BLOOM FILTERS: IMPLEMENTATION IN JAVA - USING INHERITANCE

EXERCISE: IMPLEMENT A BLOOM FILTER IN JAVA AND USE IT TO PROCESS LARGE DATA SET ON
DISK (I.E. FILES) AND MEASURE FALSE POSITIVES
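
As a rough starting point for the Bloom filter exercise above, the sketch below uses a `java.util.BitSet` and derives its probe positions by double hashing from `String.hashCode()`. The class name `BloomFilterSketch`, the hash-mixing constant, and the in-memory test data are all assumptions made for illustration; the curriculum's own version uses inheritance and reads its data from files on disk, neither of which is shown here.

```java
import java.util.BitSet;

// Hypothetical class for illustration; sizes and hash mixing are assumptions.
class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // i-th probe position via double hashing: (h1 + i * h2) mod numBits.
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // odd second hash
        return Math.floorMod(h1 + i * h2, numBits);           // always in [0, numBits)
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1 << 14, 4);
        for (int i = 0; i < 1000; i++) {
            filter.add("member-" + i);
        }
        // Probe with keys that were never inserted and count false positives.
        int falsePositives = 0;
        int probes = 1000;
        for (int i = 0; i < probes; i++) {
            if (filter.mightContain("other-" + i)) {
                falsePositives++;
            }
        }
        System.out.println("False-positive rate: " + (double) falsePositives / probes);
    }
}
```

Varying `numBits` and `numHashes` in `main` is one way to observe the time vs. false-positive tradeoff listed in the sessions above.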

KD TREES - USAGE AND USE CASES

KD TREES - DESIGN

KD TREES - IMPLEMENTATION

EXERCISE: IMPLEMENTATION OF KD TREES IN JAVA
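
The k-d tree exercise above could begin from something like the following sketch, which supports only insertion and nearest-neighbour search on `double[]` points. The class name `KdTree` and the choice to cycle the splitting axis by depth are illustrative assumptions; balancing, deletion, and range queries are left out.

```java
import java.util.Arrays;

// Hypothetical class for illustration; unbalanced, insert + nearest only.
class KdTree {
    private static final class Node {
        final double[] point;
        Node left, right;
        Node(double[] point) { this.point = point; }
    }

    private final int k; // number of dimensions
    private Node root;

    KdTree(int k) { this.k = k; }

    void insert(double[] point) {
        root = insert(root, point, 0);
    }

    private Node insert(Node node, double[] point, int depth) {
        if (node == null) return new Node(point.clone());
        int axis = depth % k; // splitting axis cycles with depth
        if (point[axis] < node.point[axis]) {
            node.left = insert(node.left, point, depth + 1);
        } else {
            node.right = insert(node.right, point, depth + 1);
        }
        return node;
    }

    // Returns a copy of the stored point closest to target, or null if empty.
    double[] nearest(double[] target) {
        if (root == null) return null;
        return nearest(root, target, 0, root.point).clone();
    }

    private double[] nearest(Node node, double[] target, int depth, double[] best) {
        if (node == null) return best;
        if (sqDist(node.point, target) < sqDist(best, target)) best = node.point;
        int axis = depth % k;
        double diff = target[axis] - node.point[axis];
        Node near = diff < 0 ? node.left : node.right;
        Node far = diff < 0 ? node.right : node.left;
        best = nearest(near, target, depth + 1, best);
        // Visit the far side only if the splitting plane is closer than the best so far.
        if (diff * diff < sqDist(best, target)) {
            best = nearest(far, target, depth + 1, best);
        }
        return best;
    }

    private static double sqDist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        KdTree tree = new KdTree(2);
        double[][] points = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
        for (double[] p : points) tree.insert(p);
        System.out.println(Arrays.toString(tree.nearest(new double[] {9, 2}))); // prints [8.0, 1.0]
    }
}
```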

ALGORITHM DESIGN - REVIEW OF BASICS

TOP-DOWN DESIGN - REVIEW: CHARACTERISTICS & PRAGMATICS

DIVIDE-AND-CONQUER DESIGN - REVIEW: CHARACTERISTICS AND PRAGMATICS

EXERCISE: DESIGN AN ALGORITHM USING DIVIDE-AND-CONQUER AND IDENTIFY / ANALYZE IMPLICATIONS

DISTRIBUTED ALGORITHMS - DESIGN & PERFORMANCE

ABSTRACT MACHINE MODEL AND DESIGN APPROACH

DIVIDE-AND-CONQUER DESIGN FOR DISTRIBUTED EXECUTION - EXAMPLE

PERFORMANCE MODEL FOR DISTRIBUTED ALGORITHMS - SPEEDUP; COMMUNICATION COST

EXERCISE: ADAPT A DIVIDE-AND-CONQUER DESIGN FOR DISTRIBUTED EXECUTION; ESTIMATE THE
SPEEDUP AND THE COMMUNICATION COST

SPMD PROGRAMMING - MAP: USE CASE AND EXAMPLES


DISTRIBUTED ALGORITHMS
PERFORMANCE ANALYSIS AND ISSUES IN USING MAP

EXERCISE: DESIGN AN ALGORITHM USING MAP (*2)

ALGORITHM DESIGN USING MAP-REDUCE

SPMD PROGRAMMING - TREE PARALLELISM - REDUCE: USE CASE AND EXAMPLES

PERFORMANCE ANALYSIS AND ISSUES IN USING REDUCE

EXERCISE: DESIGN AN ALGORITHM USING REDUCE (*2)

MAP-REDUCE PROGRAMMING: COMPOSING MAP AND REDUCE - EXAMPLES

EXERCISE: DESIGN AN ALGORITHM USING MAP-REDUCE

MAP-REDUCE PROGRAMMING: ITERATIVE MAP-REDUCE

EXERCISE: DESIGN AN ITERATIVE ALGORITHM USING MAP-REDUCE
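
To make the "composing map and reduce" sessions above concrete, here is a single-machine sketch of word count written in plain Java collections. It is not the Hadoop API: the class `WordCountSketch` and its `map`, `reduce`, and `run` methods are hypothetical stand-ins that only mimic the three phases (map, group by key, reduce) in one process.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of map -> shuffle -> reduce.
class WordCountSketch {
    // Map phase: one input line becomes a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce phase: all values collected for one key are summed.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // "Shuffle": group every emitted value under its key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Apply reduce once per key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("big data big ideas", "data pipelines")));
    }
}
```

In a distributed setting the calls to `map` would run on separate nodes and the grouping step would involve network transfer, which is where the communication cost discussed in the earlier sessions comes in.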



VIRTUALIZATION TECHNOLOGY & INFRASTRUCTURE

VIRTUAL MACHINES

VIRTUALIZATION TECHNOLOGY
WHY IS IT IMPORTANT?
HOW IS IT USED?
HOW TO SET UP A VIRTUAL MACHINE ON LOCAL MACHINE
TOOLS REQUIRED
SET UP A CLOUDERA/HORTONWORKS VIRTUAL MACHINE WITH THE REQUIRED ENVIRONMENT

AMAZON EC2

WHAT IS AMAZON EC2?
HOW TO ACCESS EC2 FROM AWS PORTAL
HOW TO CREATE AN EC2 VIRTUAL MACHINE HOSTED ON AWS AND ACCESS IT
SET UP CLOUDERA / HORTONWORKS VIRTUAL MACHINE ON AWS

DISTRIBUTED COMPUTING ENVIRONMENT FOR BIG DATA

CHARACTERISTICS OF DISTRIBUTED SYSTEMS: LOCAL VS. REMOTE, FAILURE AND RELIABILITY,
SCALABILITY

CLUSTERS AS DISTRIBUTED SYSTEMS - ARCHITECTURE, CHARACTERISTICS, AND VARIANTS

WHAT IS HADOOP? HADOOP CLUSTERS - FEATURES

STRUCTURED / UNSTRUCTURED DATA - EXAMPLES, USE CASES

PERSISTENT STORE / IN MEMORY ACCESS / STREAMING ACCESS / REAL-TIME PROCESSING -
DIFFERENCES IN ACCESS AND PROCESSING

HDFS COMPONENTS AND ARCHITECTURE - BLOCKS AND NODES

HDFS COMMANDS AND COMMAND LINE INTERFACE

HDFS JAVA API & USAGE

HDFS - BASIC FILE INPUT/OUTPUT

DATA AND STORAGE

STORAGE & LOAD BALANCING

(A) PERFORM OPERATIONS ON FILE SYSTEM USING HDFS COMMANDS

(B) PERFORM A SIMPLE TRANSFORMATION READING FROM A LARGE FILE WRITING ONTO
MULTIPLE SMALL FILES; AND VICE VERSA USING HDFS API

DISTRIBUTED PROCESSING

SETTING UP KEY-VALUE PAIRS, IDENTIFYING MAP TASKS AND REDUCE TASKS, CONNECTING MAP TASKS
TO REDUCE TASKS

WRITING A PROGRAM USING MULTIPLE MAPPERS AND REDUCERS - EXAMPLES

PERFORMANCE AND SCALABILITY; DECIDING NUMBER OF MAPPERS AND REDUCERS -
SCHEDULING AND TUNING
PLATFORMS FOR BIG DATA

SORTING AND JOINS IN THE MAP-REDUCE MODEL

DISTRIBUTED PROCESSING OF DATA

WRITE A MR PROGRAM OF MEDIUM COMPLEXITY (REQUIRES USE OF BUILT-IN MR FEATURES,
REQUIRES SCHEDULING AND/OR LOAD BALANCING)

PIG: FEATURES AND TYPICAL USAGE - EXPRESSIONS, STATEMENTS, TYPES, AND SCHEMAS

PROGRAMMING WITH PIG - EXAMPLES

ADVANCED FEATURES OF PIG

WRITE A PIG PROGRAM FOR A SIMPLE TASK

(A) RETRIEVE AND STORE DATA FROM HDFS AND HBASE (BOTH OPTIONS);
(B) PROCESS DATA USING MAP-REDUCE ON HADOOP AS WELL AS USING PIG;
(C) MEASURE YOUR PERFORMANCE AND SCALABILITY

MOTIVATION - IN-MEMORY PROCESSING; USAGE COMPARISON WITH HADOOP

ARCHITECTURE OF IN-MEMORY PROCESSING WITH SPARK

JAVA PROGRAMMING ON SPARK - INTRODUCTION AND EXAMPLE

PROGRAMMING ON SPARK: RDDS


IN-MEMORY DISTRIBUTED PROCESSING

PROGRAMMING ON SPARK: DATAFRAMES AND DATASETS

PROGRAMMING ON SPARK - EXAMPLE

WRITE A PROGRAM ON SPARK FOR A BIG DATA PROCESSING TASK TO DEMONSTRATE SCALABILITY
AND HIGH LEVEL PROGRAMMING FEATURES OF SPARK; COMPARISON WITH HADOOP MR -
PERFORMANCE, ABSTRACTION/PROGRAMMABILITY

DATA STORE ON THE CLOUD

INTRODUCTION

OBJECT STORE, SQL AND NOSQL DATABASES ON THE CLOUD

AMAZON S3

WHAT IS OBJECT STORE
SETTING UP S3 AND UNDERSTANDING RELATED TERMINOLOGIES

DYNAMODB

WHAT IS DYNAMODB?
CONCEPTS OF TABLES, ELEMENTS AND ATTRIBUTES
KEY-VALUE PAIR
OPERATIONS
ADD, RETRIEVE, MODIFY, DELETE DATA

SET UP S3 STORAGE ON AWS AND STORE/RETRIEVE ACCESS.

SET UP A SIMPLE DB AND PERFORM ADD, RETRIEVE, MODIFY AND DELETE OPERATIONS ON DATA




DATA WAREHOUSING AND ETL FUNDAMENTALS

ETL

CONCEPT OF ETL
ETL VS ELT

DATA WAREHOUSING

FACTS AND DIMENSION TABLES

EXERCISE: FORM FACTS AND DIMENSION TABLES FOR GIVEN SCENARIO

RELATIONAL VS. MULTI DIMENSIONAL DATA REPRESENTATION

EXERCISE: IDENTIFY WHICH REPRESENTATION TO USE FOR GIVEN SCENARIO

REPORTS, DASHBOARD AND SCORE CARD

ETL - RELEVANCE IN THE BIG DATA SCENARIO - DATA LAKES

EXERCISE: DETERMINE WHETHER TO DESIGN A SOLUTION WITH DATA LAKE OR DATA WAREHOUSING

INTRODUCTION TO DATA INGESTION

CONCEPTS OF DATA INGESTION - WHAT IS DATA INGESTION?

SOURCES OF STRUCTURED/UNSTRUCTURED DATA/REAL TIME/STREAMING DATA TARGET?

MOTIVATION AND USAGE OF SQOOP - IMPORT DATA TO HADOOP; DIFFERENT DATA / FILE FORMATS

SQOOP AND MAPREDUCE - THE IMPORT PROCESS

SQOOP - PERFORMANCE: IMPORTING LARGE OBJECTS

SQOOP: DATA INGESTION IN HADOOP


EXERCISE: PERFORM DATA INGESTION USING SQOOP; [SHOULD INVOLVE INTERACTION WITH
HADOOP MR; HIVE; HDFS]

ADDITIONAL READING: DATA EXPORT OPERATION USING SQOOP

ADDITIONAL EXERCISE: PERFORM EXPORT DATA OPERATION TO RDBMS SOURCE WITH SQOOP

DATA INGESTION FOR STRUCTURED / UNSTRUCTURED DATA

EVENTS AND FLOWS: MOTIVATION FOR FLUME

USING FLUME - INGESTION OF EVENTS / LOG DATA; INGESTION OF STREAMING DATA

FLUME: INGESTION OF EVENTS DATA

FLUME - FLOWS (MULTI-HOP, CONSOLIDATION, REPLICATION, MULTIPLEXING) AND
CONFIGURATION (MULTI-AGENT, FAN-OUT)

FLUME - (SELECT) SOURCES AND CONFIGURATION

COMPLEX EVENT PROCESSING - LOG PROCESSING USING FLUME

COMPLEX EVENT PROCESSING - CASE STUDY


PROCESSING BIG DATA: ETL & BATCH PROCESSING

EXERCISE: ANALYZE A (PARTIAL) CASE STUDY - IDENTIFY ISSUES AND SUGGEST SOLUTIONS WITH
DESIGN/CONFIGURATION

ASSIGNMENT 3 - PART I: PERFORM DATA INGESTION USING FLUME FOR A COMPLEX EVENT
PROCESSING REQUIREMENT; [SHOULD INVOLVE CONFIGURATION WITH MULTIPLE SOURCES,
COMPLEX FLOWS]

INTRODUCTION TO HIVE - INTERFACES, METASTORE

HIVE VS. RELATIONAL DATABASE SYSTEMS - SCHEMA

HIVE: FILE FORMATS


HIVE
QUERYING IN HIVE (HIVEQL) - TYPES AND OPERATORS/FUNCTIONS; TABLES, PARTITIONS, AND
STORAGE FORMATS

COMPLEX QUERIES: E.G. JOINS - MAP JOINS

EXERCISES ON HIVE COMMANDS, HIVEQL; COMPARISON WITH RELATIONAL QUERIES


DATA TRANSFORMATION AND BATCH PROCESSING

QUERY OPTIMIZATION

NEED FOR QUERY OPTIMIZATION; SOME OPTIMIZATION TECHNIQUES (2 OR 3 - ORC FILE, CBO,
VECTORIZATION OR BUCKETING)

EXERCISE: COMPARISON OF QUERIES WITH/WITHOUT OPTIMIZATION

BATCH PROCESSING WITH HIVE


BATCH PROCESSING
ASSIGNMENT EXERCISES ON BATCH PROCESSING OF DATA INGESTED AND PRE-PROCESSED /
PREPARED IN PART I

NOSQL DATABASE

HBASE

NEED AND USE OF HBASE - SCHEMAS AND QUERIES

HBASE - JAVA API

COMPARISON WITH TRADITIONAL RELATIONAL DATABASE SYSTEMS

(A) PERFORM SIMPLE QUERIES ON AN UNSTRUCTURED DATABASE USING HBASE API

(B) IDENTIFY ISSUES IN USING A NOSQL DATABASE SUCH AS HBASE

(C) COMPARE AND CONTRAST USE OF HBASE WITH A TRADITIONAL RDBMS SYSTEM

HIVE V/S HBASE

COMPARING HIVE AND HBASE - USE CASES - HIVE/HBASE. WHEN TO/NOT-TO USE HIVE / HBASE

WORKFLOW MANAGEMENT

OOZIE - WORKFLOW ENGINE FOR HADOOP

WORKFLOWS; DAGS FOR MODELING / DEPICTING WORKFLOWS; WORKFLOWS ON HADOOP;
MOTIVATION FOR A WORKFLOW ENGINE

OOZIE WORKFLOW SPECIFICATION: WORKFLOW NODES - CONTROL FLOW NODES AND ACTION NODES

OOZIE SPECIFICATION: CONTROL FLOW NODES: SIMPLE (START, KILL, END) AND COMPLEX (DECISION,
FORK-JOIN) NODES

OOZIE SPECIFICATION: ACTIONS: MAP-REDUCE ACTION, PIG ACTION, HDFS ACTION, JAVA ACTION

EXERCISES USING OOZIE - SPECIFY AND RUN A COMPLEX WORKFLOW ON HADOOP [INVOLVING
MAP-REDUCE, HDFS, AND JAVA ACTIONS WITH MULTIPLE CONTROL-FLOWS INVOLVED]

INTRODUCTION TO STREAMING DATA

WHAT ARE STREAMING DATA? WHERE DO THESE DATA COME FROM? [OPERATIONAL MONITORING,
WEB, ECOMMERCE, SOCIAL MEDIA, ETC.]

DATA COLLECTION, DATA PROCESSING, STORAGE, DELIVERY

STREAMING DATA, PROCESSING & AVAILABILITY, LATENCY, SCALABILITY

SOCIAL MEDIA DATA
TWITTER SENTIMENT ANALYSIS

EXERCISES ON FUNDAMENTALS ON REAL TIME STREAM PROCESSING

USING APACHE FLUME AS A DATA SOURCE


REAL TIME/STREAM DATA PROCESSING

KAFKA SINGLE-BROKER CLUSTER SETUP

CREATE KAFKA TOPIC, PRODUCER AND CONSUMER

KAFKA

HANDLING REAL-TIME DATA FEEDS

APACHE KAFKA CONNECT

KAFKA STREAMS

INTEGRATION OF KAFKA WITH STORM, SPARK, ELASTICSEARCH

STORM

ELEMENTS OF A STREAM PROCESSING SYSTEM & STORM CLUSTER

COORDINATION, PARTITION AND MERGE OF DATA FROM SOURCES, TRANSACTIONS HANDLING

ZOOKEEPER, NIMBUS, SUPERVISORS

STREAM PROCESSING WITH STORM

TRIDENT (STORM DSL)

TRIDENT (STORM DSL) - OVERVIEW & CASE STUDY

TRIDENT STREAMS – FILTERS AND FUNCTIONS – GROUPING OPERATIONS – AGGREGATION

USING TRIDENT TO PROCESS DATA STREAMS

REAL TIME PROCESSING OF ECOMMERCE DATA

INTRODUCTION & SPARK STREAM API

MOTIVATION FOR SPARK STREAMING ADD-ON ON SPARK, STREAMING PROCESS

UNDERSTANDING STREAMING CONTEXT, STRUCTURE OF STREAMING APPLICATION, STARTING / CHECK-POINTING /
STOPPING STREAMS

EXERCISES ON USING SPARK STREAM API

STREAMING ON SPARK

UNDERSTANDING & PROCESSING OF DSTREAM

INTRODUCTION TO DSTREAM ABSTRACTION, CREATION AND MANIPULATION OF DSTREAM

TRANSFORMATIONS, SIMPLE AGGREGATIONS, AGGREGATIONS ON KEY-VALUE PAIRS, WINDOWING
OPERATIONS, OPERATIONS ON WINDOWS

DSTREAM OPERATIONS

TRENDING TOPICS IN TWITTER


CASE STUDY & EXAMPLE
REAL TIME PROCESSING OF ECOMMERCE DATA (EXTENDING PREV. CASE STUDY)

ANALYTICS PROBLEMS FROM ECOMMERCE / SOCIAL MEDIA / WEB / FINANCE

REGRESSION, CLASSIFICATION, CLUSTERING

IDENTIFYING LEARNING TASKS FOR GIVEN CASES

FEATURES, SUPPORTED LEARNING TASKS

API, TYPES (VECTOR, LABELLED POINTS, RATING), ALGORITHMS


INTRODUCTION & OVERVIEW OF MLLIB

IMPORTING DATA SETS IN MLLIB AND PERFORMING OPERATIONS ON FEATURES

CREATING CHARTS FOR A GIVEN DATA SET

SAMPLE TASKS INVOLVING REGRESSION

OUTLINE OF MATHEMATICAL FORMULATION


REGRESSION
USING TRAIN/TRAINREGRESSOR/PREDICT METHODS

SAVING AND LOADING MODELS


BIG DATA ANALYTICS

SAMPLE TASKS INVOLVING CLASSIFICATION

INTRODUCTION TO BIG DATA ANALYTIC

CLASSIFICATION & FORM OF CLASSIFIER MODELS

LINE / CURVE / PLANE SEPARATING CLASSES, REPRESENTING CLASSES USING A TREE, PROBABILISTIC
REPRESENTATION OF CLASSES

CLASSIFIER PERFORMANCE, DISCUSSIONS

EXERCISES ON CONCEPTS COVERED IN FORMS OF CLASSIFIER MODELS

USING TRAIN / TRAINCLASSIFIER / PREDICT METHODS

EXERCISES ON FITTING CLASSIFICATION MODEL, VISUALIZING AND INTERPRETING RESULTS

SAMPLE TASKS INVOLVING CLUSTERING

FORMULATION OF TASK: NOTION OF SIMILARITY

SAMPLE TASKS INVOLVING CLASSIFICATION

CLUSTERING

UNSUPERVISED GROUPING, K-MEANS CLUSTERING, DECIDING ON K

USING TRAIN / KMEANSMODEL/ PREDICT METHODS

MLLIB WORKING EXAMPLE (WITH QLIKVIEW FOR VISUALIZATION)

ON CLUSTERING, VISUALIZING AND INTERPRETING RESULTS

CREDIT CARD FRAUD DETECTION

CUSTOMER SEGMENTATION FOR TARGETED MARKETING
