EliteDataEngineeringProgramCurriculum (1)

The Elite Data Engineering Program, led by Sumit Mittal, offers a 21-week curriculum covering big data concepts, distributed storage, and distributed processing with Apache Spark. Key topics include data pipelines, Spark SQL, performance tuning, and real-time data processing, along with a hands-on PySpark project. The program also covers Git, CI/CD practices, Apache Hive, and data modeling, followed by Spark Structured Streaming, Kafka, and AWS cloud services, and closes with interview preparation.

Uploaded by

connect.shivam29
Copyright © All Rights Reserved

Elite Data Engineering Program

YOUR PATH TO DATA ENGINEERING SUCCESS

Newly Launched Elite Program

By Sumit Mittal
CURRICULUM
WEEK 1 : BIG DATA - THE BIG PICTURE

> INTRODUCTION TO BIG DATA
> COMPARISON BETWEEN MONOLITHIC AND DISTRIBUTED SYSTEMS
> HADOOP: EVOLUTION, OVERVIEW AND CORE COMPONENTS
> CHALLENGES WITH HADOOP
> COMPARISON BETWEEN ON-PREMISE AND CLOUD
> ADVANTAGES OF CLOUD
> TYPES OF CLOUD
> INTRODUCTION TO APACHE SPARK
> DATABASE VS DATA WAREHOUSE VS DATA LAKE
> INTRODUCTION TO DATABASE
> INTRODUCTION TO DATA WAREHOUSE
> DATA ENGINEERING FLOW
> DATA PIPELINE ON HADOOP
> DATA PIPELINE WORKFLOW VISUALIZATION FOR ON-PREMISE
> DATA PIPELINE WORKFLOW VISUALIZATION FOR CLOUD
> CATEGORIES OF COMPUTATION
> SERVERLESS COMPUTING
> SERVERFUL COMPUTING
> HDFS ARCHITECTURE
> ROLE OF DATA ENGINEERS
> TRADITIONAL WAYS OF PROCESSING DATA AND THEIR CHALLENGES

WEEK 2 : DISTRIBUTED STORAGE FUNDAMENTALS
> HDFS OVERVIEW
- READING FILE FROM HDFS
- BLOCK SIZE IN HDFS
- NAME NODE FEDERATION
- RACK MECHANISM
- FAULT TOLERANCE
> INTRODUCTION TO PRACTICE LAB
> LINUX COMMANDS
- ABSOLUTE VS RELATIVE PATH
- NAVIGATING THE FILE SYSTEM
- VIEWING THE FILE CONTENT
- WORKING WITH FILES, SEARCHING AND FILTERING
> HDFS COMMANDS
- LIST THE CONTENT OF FILES WITH DIFFERENT PARAMETERS
- CREATE FOLDERS AND FILES IN HDFS
- MOVE DATA FROM LOCAL TO HDFS
- MOVE DATA FROM HDFS TO LOCAL
- MOVE DATA FROM ONE HDFS LOCATION TO ANOTHER
> HDFS VS CLOUD DATA LAKE
> DISTRIBUTED PROCESSING
- INTRODUCTION TO MAPREDUCE
- PRINCIPLE OF DATA LOCALITY

WEEK 3 : DISTRIBUTED PROCESSING FUNDAMENTALS - APACHE SPARK
> DISTRIBUTED PROCESSING
- MAPREDUCE
- CHANGING THE NUMBER OF REDUCERS
- WORKFLOW OF A MAPREDUCE JOB
- USE CASE : FINDING MAXIMUM TEMPERATURE USING MAPREDUCE
- WHAT IS SHUFFLE?
- WHAT IS SORT?
- WHAT IS PARTITION?
- ADVANTAGE OF LOCAL AGGREGATION
- USE CASE : CLASSICAL INDUSTRY USE CASE OF MAPREDUCE
- PRACTICAL : HOW TO RUN A MAPREDUCE PROGRAM
> APACHE SPARK:
- CHALLENGES OF MAPREDUCE
- BRIEF INTRODUCTION OF APACHE SPARK
- UNDERSTANDING OF SPARK EXECUTION PLAN
- VISUALIZATION OF RDD
- WHAT IS DAG?
- ADVANTAGE OF SPARK BEING LAZY
- REAL TIME EXAMPLE : EXECUTING WORD COUNT PROGRAM ON PYSPARK

WEEK 4 : APACHE SPARK CORE API
> PYTHON BASICS
- NORMAL VS LAMBDA FUNCTION
- PYTHON MAP FUNCTION VS SPARK MAP TRANSFORMATION
- HIGHER ORDER FUNCTION
> PYSPARK USE CASE
> UNDERSTANDING MAP, REDUCE, REDUCEBYKEY, FILTER, SORTBY, DISTINCT, TAKE, COLLECT
> REAL TIME EXAMPLE : FINDING FREQUENCY OF EACH WORD IN FILE USING PYSPARK
> PARALLELIZE AND ITS USE
> COUNTBYVALUE
> UNDERSTANDING PARTITIONS IN AN RDD
> CATEGORIES OF SPARK TRANSFORMATION (WIDE & NARROW)
> VISUALIZATION OF SPARK JOBS ON HISTORY SERVER
- STAGE, TASK AND JOBS
- RELATION BETWEEN JOBS AND ACTIONS.
- RELATION BETWEEN STAGES AND WIDE TRANSFORMATION.
- RELATION BETWEEN TASK AND PARTITIONS.
> REDUCE VS REDUCEBYKEY
> REDUCEBYKEY VS GROUPBYKEY
> JOINS IN SPARK
> BROADCAST JOIN AND ITS WORKING
> REPARTITION, COALESCE, AND THEIR APPLICATIONS
> UNDERSTANDING OF CACHE

WEEK 5 : SPARK HIGHER LEVEL APIS - DATAFRAMES & SPARK SQL
> INTRODUCTION TO HIGHER LEVEL APIS IN APACHE SPARK
- DATAFRAMES
- SPARK SQL
> WHY ARE HIGHER LEVEL APIS MORE PERFORMANT
> WORKING OF DATAFRAMES
> CREATION OF SPARK SQL TABLE FROM DATAFRAME AND VICE VERSA
> CREATION OF SPARK TABLE
> TYPES OF TABLE
- MANAGED TABLE
- EXTERNAL TABLE
- MANAGED TABLE VS EXTERNAL TABLE
> USE CASE OF DATAFRAMES & SPARK SQL
> SPARK OPTIMIZATION
- APPLICATION CODE LEVEL OPTIMIZATION
- CLUSTER LEVEL OPTIMIZATION
- SPARK EXECUTORS
- THIN EXECUTORS
- FAT EXECUTORS
> RIGHT STRATEGY FOR CREATING CONTAINERS

WEEK 6 : SPARK DATAFRAME TRANSFORMATIONS
> CHALLENGES OF SCHEMA INFERENCE
> INTRODUCTION TO SCHEMA ENFORCEMENT
- SAMPLING RATIO
- WAYS TO ENFORCE SCHEMA(SCHEMA DDL & STRUCT TYPE)
> DIFFERENT WAYS OF HANDLING DATE FORMATS
> DATAFRAME READ MODES
> DIFFERENT WAYS OF CREATING A DATAFRAME
> CONVERSION OF RDD TO DATAFRAME AND ITS DIFFERENT APPROACHES
> HANDLING NESTED SCHEMA
> DATAFRAME TRANSFORMATION (SELECT VS SELECTEXPR)
> REMOVAL OF DUPLICATES FROM DATAFRAME
> SPARK SESSION
> DEPLOYMENT MODES(CLIENT MODE VS CLUSTER MODE)
WEEK 7 : APACHE SPARK - CACHING
> ACCESSING SPARK UI AND RESOURCE MANAGER
> UNDERSTANDING THE SPARK UI
> UNDERSTANDING CACHE AND PERSIST
> IMPORTANCE OF CACHE
> PRACTICAL APPLICATIONS OF CACHING
> SERIALIZED VS DESERIALIZED
> THE MECHANISM OF CACHING
> UNDERSTANDING SPARK CODE EXECUTION (PARSED, ANALYZED, OPTIMIZED LOGICAL PLAN)
> IN-MEMORY TABLE CACHE
> NODE_LOCAL VS PROCESS_LOCAL
> SIGNIFICANCE OF DYNAMIC ALLOCATION
> CACHING SPARK TABLE
> SPARK CATALOG, MANAGED & EXTERNAL TABLE
> CACHE PERFORMANCE
> UN-PERSIST AND ITS USE
> IMPORTANCE OF STORING CACHED DATA IN OTHER DATAFRAME
> PREDICATE PUSH DOWN
> DIFFERENT WAYS OF CACHING
> TYPES OF FILE FORMATS
> INTRODUCTION TO PERSIST
> VARIOUS STORAGE LEVELS PROVIDED BY THE PERSIST

WEEK 8 : SPARK ARCHITECTURE & AGGREGATE FUNCTIONS
> YARN ARCHITECTURE
- RESOURCE MANAGER
- APPLICATION MASTER
- NODE MANAGER
- CONTAINER
- UBER MODE
- YARN RESOURCE MANAGER UI
> WORKING OF SPARK ON YARN ARCHITECTURE
> SPARK ARCHITECTURE
> SPARK JOB IN CLIENT MODE
> SPARK JOB IN CLUSTER MODE
> WAYS OF ACCESSING COLUMNS IN PYSPARK
- COLUMN STRING
- COLUMN OBJECT
- COLUMN EXPRESSION
> OVERVIEW OF AGGREGATE FUNCTION
> SIMPLE AGGREGATE
> GROUPING AGGREGATE
> WINDOWING AGGREGATE
> WINDOWING FUNCTIONS
- RANK
- DENSE_RANK
- ROW_NUMBER
- LEAD
- LAG
> ANALYZING A LOG FILE
> PIVOT TABLE AND ITS CREATION
WEEK 9 : APACHE SPARK INTERNALS & DATAFRAME PARTITIONS
> READING DATAFRAMES
> DATAFRAME READ MODES
- PERMISSIVE
- DROPMALFORMED
- FAILFAST
> DATAFRAME WRITE MODES
- OVERWRITE
- IGNORE
- APPEND
- ERRORIFEXISTS
> PARTITIONBY CLAUSE
> UNDERSTANDING OF BUCKETING & ITS PERFORMANCE GAINS
> ACCESSING SPARK UI IN DATABRICKS COMMUNITY EDITION
> SPARK INTERNALS
> DISABLING DYNAMIC EXECUTOR ALLOCATION
> SPARK-SUBMIT AT A HIGH-LEVEL
> INITIAL NUMBER OF PARTITIONS IN A DATAFRAME
> CALCULATING THE INITIAL NUMBER OF PARTITIONS FOR A SINGLE NON-SPLITTABLE FILE
> CALCULATING THE INITIAL NUMBER OF PARTITIONS FOR MULTIPLE FILES

WEEK 10 : SPARK OPTIMIZATIONS & PERFORMANCE TUNING - 1
> INTERNALS OF GROUPBY
> NORMAL VS BROADCAST JOIN
> DIFFERENT TYPES OF JOINS
- INNER JOIN
- LEFT OUTER JOIN
- RIGHT OUTER JOIN
- FULL OUTER JOIN
- LEFT SEMI JOIN
- LEFT ANTI JOIN
> PARTITION SKEW
> 3 USE CASES : OPTIMIZATIONS
> SIGNIFICANCE OF AQE(ADAPTIVE QUERY EXECUTION)
> JOIN STRATEGIES IN APACHE SPARK
> BROADCAST HASH JOIN
> SORT MERGE JOIN
> SHUFFLE HASH JOIN
> OPTIMIZING JOIN OF 2 LARGE TABLES - BUCKETING

WEEK 11 : SPARK OPTIMIZATIONS & PERFORMANCE TUNING - 2
> MEMORY MANAGEMENT IN APACHE SPARK
> SORT AGGREGATE VS HASH AGGREGATE
> VARIOUS PLANS IN APACHE SPARK
- PARSED LOGICAL PLAN
- ANALYZED LOGICAL PLAN
- OPTIMIZED LOGICAL PLAN
- PHYSICAL PLAN
> CATALYST OPTIMIZER
> INTRODUCTION TO FILE FORMATS & COMPRESSION TECHNIQUES
- ROW BASED FILE FORMATS
- COLUMN BASED FILE FORMATS
> SPECIALIZED FILE FORMATS
- AVRO
- ORC
- PARQUET
> SCHEMA EVOLUTION
> COMPRESSION TECHNIQUES
- SNAPPY
- LZO
- GZIP
- BZIP

WEEK 12 - 13 : PYSPARK PROJECT IMPLEMENTATION AND BEST PRACTICES
> KEY ELEMENTS OF A BIG DATA PROJECT
> EXAMPLE PROBLEM STATEMENTS
> AGILE METHODOLOGY
> PYSPARK PROJECT
- FINANCE DOMAIN
- ARCHITECTURAL SOLUTION
- UNDERSTANDING THE DATASETS
- DATA CLEANING
> UNDERSTANDING THE PROJECT IMPLEMENTATION LOGIC
> PERMANENT TABLE CREATION ON CLEANED DATA
> ACCESS PATTERNS
> IDENTIFYING THE BAD DATA
> SEGREGATING THE IDENTIFIED BAD DATA FROM THE NORMAL DATA
> PROCESSING AND STORING THE FINAL RESULTS
> PROJECT STRUCTURING & EXECUTION
- VIRTUAL ENVIRONMENT SETUP
> UNIT TESTING
- IDENTIFYING AND WRITING UNIT TEST CASES
- FIXTURE
- TEARDOWN | YIELD
- FIXTURE TO CHECK IF THE CALCULATED RESULTS MATCH EXPECTED RESULTS
- MARKERS
- PARAMETERIZED GENERIC TEST CASES
> IMPLEMENTING LOGGING LEVEL IN APACHE SPARK
WEEK 14 : GIT | GITHUB | CICD
> OVERVIEW OF GIT & GITHUB
> SETUP
- GITHUB ACCOUNT CREATION
- GIT INSTALLATION
- VS CODE IDE INSTALLATION
> IMPORTANT GIT COMMANDS
> SCENARIO 1 : PROJECT CREATION THROUGH GITHUB (REMOTE)
> SCENARIO 2 : PROJECT CREATION THROUGH GIT (LOCAL)
> BRANCHES IN GIT
> REVERTING BACK TO THE PREVIOUS CODE BASE
> SCENARIO 3 : WORKING ON EXISTING PROJECT (FORK COMMAND)
> GIT STASH COMMAND
> HANDLING MERGE CONFLICTS
> CONTINUOUS INTEGRATION & CONTINUOUS DEPLOYMENT - CICD
- BRANCHING STRATEGY & STAGES OF CICD
- SETUP : DEPLOYING AND CONFIGURING JENKINS SERVER
- BRANCHING STRUCTURE
- JENKINS CONFIGURATIONS
- CREATING SAMPLE JENKINS PIPELINE
- BUILD | TEST | PACKAGE & DEPLOY : JENKINS PIPELINE FOR PROJECT

WEEK 15 : APACHE HIVE
> INTRODUCTION TO APACHE HIVE & PRACTICALS
> APACHE HIVE TABLES
- MANAGED
- EXTERNAL
> HIVE OPTIMIZATIONS
- PARTITIONING
- BUCKETING
- JOIN OPTIMIZATIONS
> HIVE TRANSACTIONAL TABLES
> ACID PROPERTIES
> SPARK-HIVE INTEGRATION
> HIVE MSCK REPAIR
WEEK 16 : DATA MODELING AND SYSTEM DESIGN
> WHAT IS DATA MODELING | NORMAL FORMS
> NORMALIZATION - OLTP SYSTEMS
> MODELING DATA WAREHOUSE (DWH)
- FACT TABLE
- DIMENSION TABLE
> SURROGATE KEY
> STEPS INVOLVED IN DIMENSIONAL MODELING
> OPTIMIZING THE DATA MODELING PROCESS
- CHOOSING THE RIGHT GRAIN
> DIMENSION TABLE VS ONE BIG TABLE (OBT)
> SLOWLY CHANGING DIMENSIONS (SCD TYPES)
> SCD TYPE-2 IMPLEMENTATION
WEEK 17 - 18 : APACHE SPARK STRUCTURED STREAMING
> KINDS OF PROCESSING
> WHAT IS REAL-TIME PROCESSING
> BATCH PROCESSING VS REAL-TIME STREAMING
> SPARK STREAMING DATA
> STRUCTURED STREAMING IN-DEPTH
> BENEFITS OF SPARK STRUCTURED STREAMING
> TYPES OF DATA SOURCES
> STREAMING JOINS
> STREAMING DATAFRAME
> SPARK DISCRETIZED STREAM : DSTREAM
> IS SPARK A REAL-TIME STREAMING ENGINE
> STREAM PROCESSING IN SPARK
> TRANSFORMED DSTREAM
> UNDERSTANDING PRODUCER AND CONSUMER
> PRACTICAL ON REAL-TIME PROCESSING
> STREAM TRANSFORMATION
> STATELESS TRANSFORMATIONS
> STATEFUL TRANSFORMATIONS
> WINDOW TRANSFORMATIONS
> UPDATESTATEBYKEY, REDUCEBYKEYANDWINDOW, REDUCEBYWINDOW, COUNTBYWINDOW
> TYPES OF WINDOWS - TUMBLING TIME WINDOW & SLIDING TIME WINDOW
> WINDOW OPERATIONS | BATCH INTERVAL
> WINDOW SIZE | SLIDING INTERVAL
> WHAT IS STRUCTURED STREAMING
> REQUIREMENT OF STRUCTURED STREAMING
> LIMITATIONS OF SPARK STREAMING
> BENEFITS OF SPARK STRUCTURED STREAMING
> DYNAMICALLY SETTING THE SHUFFLE PARTITIONS
> DATASTREAM WRITER OUTPUT MODES
> DATASTREAM OUTPUT MODES - APPEND, UPDATE & COMPLETE
> SPARK STREAMING GRACEFUL SHUTDOWN
> HOW DOES SPARK STREAMING CODE EXECUTE INTERNALLY
> TYPES OF TRIGGERS - UNSPECIFIED, TIME-INTERVAL, ONE-TIME & CONTINUOUS
> TYPES OF DATA SOURCES - SOCKET, RATE, FILE & KAFKA SOURCE
> TYPES OF SPARK STREAMING OUTPUT DATA OPTIONS
> TYPES OF AGGREGATIONS

WEEK 19 : KAFKA
> INTRODUCTION TO KAFKA - STREAMING PLATFORM
> KAFKA ARCHITECTURE
> KAFKA KEY CONCEPTS
> CLUSTER | NODES | BROKERS | TOPICS
> CONSUMER | PRODUCER | LOGS | PARTITIONS
> INSTALLING MULTI-NODE KAFKA CLUSTER
> WRITING KAFKA PRODUCER AND CONSUMER
> SCALING UP THE KAFKA CLUSTER
> CONCEPT OF PARTITION GROUPS
> LEADER AND FOLLOWER PARTITION
> COMMAND LINE PRODUCER AND CONSUMER
> REPLICATION CONCEPT FOR FAULT TOLERANCE
> HOW DATA IS STORED IN BROKERS
> LOG SEGMENTS, MESSAGE OFFSETS, MESSAGE INDEX
> WRITING KAFKA PRODUCER, CONSUMER
> INTEGRATING KAFKA WITH SPARK STRUCTURED STREAMING
> BUILDING STREAMING PIPELINE (STRUCTURED STREAMING WITH KAFKA)
> END-TO-END REAL-TIME STREAMING USE CASE USING KAFKA

WEEK 20 - 21 : IMPORTANT AWS CLOUD SERVICES
> AWS EMR (ELASTIC MAPREDUCE)
> LAUNCH EMR CLUSTER USING ADVANCED OPTIONS
> KINDS OF NODES IN CLUSTER
> HOW TO CREATE A VM
> TYPES OF EC2 INSTANCES
> RUNNING SPARK CODE ON EMR
> HOW TO TRACK YOUR JOB
> AWS S3
> AWS STORAGE
> COPY FILE FROM S3 TO LOCAL
> AWS COMMAND LINE INTERFACE
> AWS ATHENA
> WHEN DO WE REQUIRE ATHENA
> WHAT PROBLEM ATHENA SOLVES
> ATHENA PRICING
> ATHENA PRACTICAL DEMONSTRATION
> HOW TO MINIMIZE DATA SCANNING IN ATHENA
> AWS GLUE - DATA CATALOG | CRAWLERS
> INFERRING SCHEMA AUTOMATICALLY USING AWS GLUE
> CONNECTING TO DATA STORE
> USING CRAWLERS FOR CATALOG TABLES
> OVERVIEW OF WORKING WITH GLUE JOBS
> ADDING NEW JOBS IN GLUE
> TRIGGERING JOBS AND SCHEDULING
> AWS REDSHIFT
> BENEFITS AND USE CASES OF REDSHIFT
> REDSHIFT ARCHITECTURE
> TYPES OF NODES
> REDSHIFT SPECTRUM
> REDSHIFT FAULT TOLERANCE
> REDSHIFT SORT KEYS
> REDSHIFT DISTRIBUTION STYLES
> LAMBDA FUNCTIONS
> END-TO-END REAL-TIME USE CASE USING AWS CLOUD SERVICES

GET INTERVIEW READY
> RESUME | LINKEDIN | NAUKRI PROFILE BUILDING AND OPTIMIZATION
> SAMPLE INTERVIEW QUESTIONS

Contact
[email protected]
https://trendytech.in/
