Big Data Engineering Curriculum
Note: This curriculum is subject to change based on inputs from BITS Pilani and Industry
INDUSTRY APPLICATIONS OF BIG DATA
CASE STUDY 1: CHURN PREDICTION (MOBILE COMPANIES OR CREDIT CARD COMPANIES LIKE AMEX ETC.)
CASE STUDY 2: PRODUCT RECOMMENDATIONS ON A RETAIL WEBSITE (EBAY/SNAPDEAL ETC.)
DATA STRUCTURES (LINEAR)
DICTIONARY DATA STRUCTURE: LOOKUP TIME - ARRAY, SORTED ARRAY; PREPROCESSING TIME & AMORTIZED COST
POLYMORPHISM - GENERICS / TEMPLATES; HASHTABLE FOR ANY TYPE OF VALUES
EXERCISE: USE HASHMAPS AND HASHTABLES IN A SMALL APPLICATION (E.G. CONTACTS LIST IN
A MOBILE PHONE)
EXERCISE: IMPLEMENT A BLOOM FILTER IN JAVA AND USE IT TO PROCESS LARGE DATA SET ON
DISK (I.E. FILES) AND MEASURE FALSE POSITIVES
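As an illustration of the Bloom filter exercise above, here is a minimal sketch. The exercise itself asks for Java and for data read from files on disk; this sketch uses Python and in-memory strings purely to show the idea. The class name `BloomFilter`, the helper `measure_false_positive_rate`, the double-hashing scheme over SHA-256, and the `user-`/`guest-` keys are all assumptions made for illustration, not part of the curriculum.

```python
import hashlib
import math


class BloomFilter:
    """Bit-array Bloom filter; positions come from double hashing one SHA-256 digest."""

    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def measure_false_positive_rate(bloom, absent_items):
    """Fraction of items known to be absent that the filter still reports as present."""
    absent_items = list(absent_items)
    hits = sum(1 for item in absent_items if item in bloom)
    return hits / len(absent_items)


# Insert 10,000 keys, then probe with 10,000 keys that were never inserted.
bloom = BloomFilter(expected_items=10_000, false_positive_rate=0.01)
for i in range(10_000):
    bloom.add(f"user-{i}")
observed = measure_false_positive_rate(bloom, (f"guest-{i}" for i in range(10_000)))
```

For the on-disk variant, the same `add` and membership calls could be fed line by line from a file, so that only the bit array has to stay in memory.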
KD TREES - DESIGN
KD TREES - IMPLEMENTATION
DISTRIBUTED ALGORITHMS - DESIGN & PERFORMANCE
PERFORMANCE MODEL FOR DISTRIBUTED ALGORITHMS - SPEEDUP; COMMUNICATION COST
EXERCISE: ADAPT A DIVIDE-AND-CONQUER DESIGN FOR DISTRIBUTED EXECUTION; ESTIMATE THE
SPEEDUP AND THE COMMUNICATION COST
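One way to approach the estimate in the exercise above is a simple cost model. The sketch below assumes that the work divides evenly across workers, that communication is not overlapped with computation, and that communication cost is a per-message latency plus a bandwidth term. The function names and all parameter values are hypothetical, chosen only to make the arithmetic concrete.

```python
def estimate_parallel_time(serial_time, workers, messages, latency,
                           bytes_exchanged, bandwidth):
    """Estimated run time: evenly divided compute plus non-overlapped communication."""
    compute = serial_time / workers
    communication = messages * latency + bytes_exchanged / bandwidth
    return compute + communication


def estimate_speedup(serial_time, workers, messages, latency,
                     bytes_exchanged, bandwidth):
    """Speedup = serial time / estimated parallel time."""
    parallel_time = estimate_parallel_time(
        serial_time, workers, messages, latency, bytes_exchanged, bandwidth)
    return serial_time / parallel_time


# Hypothetical numbers: 100 s of serial work on 10 workers, 20 messages at
# 10 ms latency each, and 1 GB exchanged over a 100 MB/s link.
speedup = estimate_speedup(
    serial_time=100.0, workers=10, messages=20, latency=0.01,
    bytes_exchanged=1e9, bandwidth=1e8)
```

Under these assumed numbers the compute share is 10 s and the communication share is 10.2 s, so the estimated speedup is roughly 4.95 rather than the ideal 10.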
VIRTUALIZATION TECHNOLOGY
WHY IS IT IMPORTANT?
HOW IS IT USED?
AMAZON EC2
HOW TO CREATE AN EC2 VIRTUAL MACHINE HOSTED ON AWS AND ACCESS IT
SET UP A CLOUDERA / HORTONWORKS VIRTUAL MACHINE ON AWS.
DISTRIBUTED PROCESSING OF DATA
SETTING UP KEY-VALUE PAIRS, IDENTIFYING MAP TASKS AND REDUCE TASKS, CONNECTING MAP TASKS
WRITE AN MR PROGRAM OF MEDIUM COMPLEXITY (REQUIRES USE OF BUILT-IN MR FEATURES,
REQUIRES SCHEDULING AND/OR LOAD BALANCING)
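The flow of key-value pairs through map and reduce tasks can be sketched in plain Python. This is a single-process simulation for illustration only; it does not use the Hadoop API, and the names `map_task`, `shuffle`, `reduce_task`, and `run_job` are hypothetical. Word counting is used as the example job.

```python
from collections import defaultdict


def map_task(line):
    """Map: emit one (word, 1) pair per word in the input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map over every line, shuffle by key, then reduce each group."""
    intermediate = [pair for line in lines for pair in map_task(line)]
    return dict(reduce_task(key, values)
                for key, values in shuffle(intermediate).items())


counts = run_job(["big data", "big cluster"])
```

In a real cluster the map calls would run on different nodes and the shuffle would move data over the network, which is where the scheduling and load-balancing concerns named above come in.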
PIG: FEATURES AND TYPICAL USAGE - EXPRESSIONS, STATEMENTS, TYPES, AND SCHEMAS
(A) RETRIEVE AND STORE DATA FROM HDFS AND HBASE (BOTH OPTIONS);
(B) PROCESS DATA USING MAP-REDUCE ON HADOOP AS WELL AS USING PIG;
(C) MEASURE YOUR PERFORMANCE AND SCALABILITY
WRITE A PROGRAM ON SPARK FOR A BIG DATA PROCESSING TASK TO DEMONSTRATE SCALABILITY
AND HIGH LEVEL PROGRAMMING FEATURES OF SPARK; COMPARISON WITH HADOOP MR -
PERFORMANCE, ABSTRACTION/PROGRAMMABILITY
OPERATIONS
ADD, RETRIEVE, MODIFY, DELETE DATA
DYNAMODB
ETL
CONCEPT OF ETL
ETL VS ELT
EXERCISE: DETERMINE WHETHER TO DESIGN A SOLUTION WITH DATA LAKE OR DATA WAREHOUSING
MOTIVATION AND USAGE OF SQOOP - IMPORT DATA TO HADOOP; DIFFERENT DATA / FILE FORMATS
ADDITIONAL EXERCISE: PERFORM EXPORT DATA OPERATION TO RDBMS SOURCE WITH SQOOP
DATA INGESTION FOR STRUCTURED / UNSTRUCTURED DATA
EVENTS AND FLOWS: MOTIVATION FOR FLUME
USING FLUME - INGESTION OF EVENTS / LOG DATA; INGESTION OF STREAMING DATA
EXERCISE: ANALYZE A (PARTIAL) CASE STUDY - IDENTIFY ISSUES AND SUGGEST SOLUTIONS WITH
DESIGN/CONFIGURATION
ASSIGNMENT 3 - PART I: PERFORM DATA INGESTION USING FLUME FOR A COMPLEX EVENT
PROCESSING REQUIREMENT; [SHOULD INVOLVE CONFIGURATION WITH MULTIPLE SOURCES,
COMPLEX FLOWS]
HIVE VS HBASE
COMPARING HIVE AND HBASE - USE CASES - HIVE/HBASE. WHEN TO/NOT-TO USE HIVE / HBASE
WORKFLOW MANAGEMENT: OOZIE - WORKFLOW ENGINE FOR HADOOP
OOZIE WORKFLOW SPECIFICATION: WORKFLOW NODES - CONTROL FLOW NODES AND ACTION NODES
OOZIE SPECIFICATION: CONTROL FLOW NODES: SIMPLE (START, KILL, END) AND COMPLEX (DECISION,
FORK-JOIN) NODES
OOZIE SPECIFICATION: ACTIONS: MAP-REDUCE ACTION, PIG ACTION, HDFS ACTION, JAVA ACTION
EXERCISES USING OOZIE - SPECIFY AND RUN A COMPLEX WORKFLOW ON HADOOP [INVOLVING
MAP-REDUCE, HDFS, AND JAVA ACTIONS WITH MULTIPLE CONTROL-FLOWS INVOLVED]
WHAT ARE STREAMING DATA? WHERE DO THESE DATA COME FROM? [OPERATIONAL MONITORING,
WEB, ECOMMERCE, SOCIAL MEDIA, ETC.]
KAFKA STREAMS
TRIDENT (STORM DSL) & CASE STUDY
TRIDENT (STORM DSL) - OVERVIEW
USING TRIDENT TO PROCESS DATA STREAMS
INTRODUCTION & UNDERSTANDING SPARK STREAM API
STREAMING CONTEXT, STRUCTURE OF STREAMING APPLICATION, STARTING / CHECK-POINTING /
STOPPING STREAMS
UNDERSTANDING & PROCESSING OF DSTREAM OPERATIONS
DSTREAM OPERATIONS [TRANSFORMATIONS, SIMPLE AGGREGATIONS, AGGREGATIONS ON KEY-VALUE PAIRS], WINDOWING,
OPERATIONS ON WINDOWS
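Windowed aggregation on key-value pairs can be illustrated without a cluster. The sketch below is a plain-Python simulation, not the Spark Streaming API: each inner list stands in for one micro-batch, and the window length and slide interval are counted in batches rather than seconds. The names `reduce_batch` and `windowed_totals` are hypothetical.

```python
from collections import Counter, deque


def reduce_batch(batch):
    """Aggregate one micro-batch of (key, value) pairs into per-key totals."""
    totals = Counter()
    for key, value in batch:
        totals[key] += value
    return totals


def windowed_totals(batches, window_length, slide_interval):
    """Yield per-key totals over the last `window_length` batches,
    emitting a result once every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    for index, batch in enumerate(batches, start=1):
        window.append(reduce_batch(batch))
        if index % slide_interval == 0:
            combined = Counter()
            for batch_totals in window:
                combined.update(batch_totals)
            yield dict(combined)


batches = [
    [("a", 1), ("b", 1)],
    [("a", 2)],
    [("b", 5)],
    [("a", 1)],
]
sliding = list(windowed_totals(batches, window_length=2, slide_interval=1))
tumbling = list(windowed_totals(batches, window_length=2, slide_interval=2))
```

With a slide interval equal to the window length the windows do not overlap; with a smaller slide interval each batch contributes to more than one result.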
LINE / CURVE / PLANE SEPARATING CLASSES, REPRESENTING CLASSES USING A TREE, PROBABILISTIC