Data Engineering Roadmap Overview

The document outlines the key concepts and technologies needed in a data engineering roadmap, including programming languages like Python, data analysis tools like Pandas, distributed computing frameworks like Apache Hadoop and Spark, SQL, NoSQL databases, data warehousing, cloud services, and more. It provides a comprehensive overview of the data engineering stack and skills.

Uploaded by

Prem

Data Engineering Roadmap

• Programming Languages
• Python (Recommended)

• Java

• Scala

• Data Exploration Libraries (Python Only)


• Pandas

• NumPy

• Matplotlib

• Operating Systems & Scripting


• Linux/Unix Commands

• Shell Scripting

• Cron Jobs

• Data Structures & Algorithms (Easy-Medium Level Only)


• Arrays

• Strings

• Linked List

• Stack

• Queue

• Tree (Basics)

• Graph (Basics)

• Dynamic Programming

• Searching

• Sorting

• Database Management Systems


• Understand RDBMS and its use cases

• Schema Types
• ER Diagram

• ACID Properties

• Transactions

• Concurrency Control

• Deadlock

• Indexing

• Hashing

• Normalization Forms

• Views

• Stored Procedures

• SQL
• Basics of DDL, DML, DCL

• All Types Of Joins

• Subqueries

• Group By

• Case-When Statement

• Common Table Expression (With Clause)

• Window Functions

• Pivoting
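
As a small illustration of two of the SQL topics above (a Common Table Expression and a window function), here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the function name top_order_per_customer are hypothetical and chosen only for this example; it also assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3


def top_order_per_customer(rows):
    """Return each customer's largest order using a CTE and ROW_NUMBER()."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table for illustration only.
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    query = """
        WITH ranked AS (
            SELECT customer,
                   amount,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer ORDER BY amount DESC
                   ) AS rn
            FROM orders
        )
        SELECT customer, amount
        FROM ranked
        WHERE rn = 1
        ORDER BY customer
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [("alice", 30.0), ("alice", 75.5), ("bob", 12.0), ("bob", 8.0)]
print(top_order_per_customer(sample))  # [('alice', 75.5), ('bob', 12.0)]
```

The same result could be written with GROUP BY and MAX; the window-function form is shown because it extends naturally to "top N per group" questions.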

• Big Data Terminology
• What is Big Data?

• 5 V’s of Big Data

• Distributed Computation

• Distributed Storage

• Vertical vs Horizontal Scaling

• Commodity Hardware

• Clusters

• File formats
a. CSV

b. JSON

c. Avro
d. Parquet

e. ORC

• Types of Data
a. Structured

b. Unstructured

c. Semi-structured

• Data Warehousing
• OLAP vs OLTP

• Dimension Tables

• Fact Tables

• Star Schema

• Snowflake Schema

• Warehouse Design Questions

• Slowly Changing Dimensions (SCD)
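
To make the Slowly Changing Dimensions item more concrete, the following is a rough sketch of one common variant, SCD Type 2, in which a changed attribute closes the current row and appends a new versioned row instead of overwriting it. The row layout (key, attrs, start_date, end_date) and the function name apply_scd2 are assumptions made for this example, not a standard schema.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Apply an SCD Type 2 update to an in-memory dimension table.

    `dimension` is a list of dict rows (hypothetical layout). The open row
    for `key` has end_date None. If the attributes changed, that row is
    closed and a new open row is appended; otherwise nothing happens.
    """
    current = next(
        (r for r in dimension if r["key"] == key and r["end_date"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # No change, so history is left untouched.
        current["end_date"] = effective  # Close the old version.
    dimension.append(
        {
            "key": key,
            "attrs": dict(new_attrs),
            "start_date": effective,
            "end_date": None,
        }
    )
    return dimension


customers = []
apply_scd2(customers, 1, {"city": "Pune"}, date(2023, 1, 1))
apply_scd2(customers, 1, {"city": "Delhi"}, date(2024, 1, 1))
for row in customers:
    print(row["attrs"]["city"], row["start_date"], row["end_date"])
```

In a real warehouse this logic would typically be expressed as a MERGE or an insert-plus-update in SQL; the in-memory version only shows the row-versioning idea.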

• Big Data Frameworks
• Apache Hadoop (Architecture Understanding Most Important)
a. HDFS

b. MapReduce (coding part not needed)

c. YARN

• Apache Hive
• How to load data in different file formats

• Internal Tables

• External Tables

• Querying table data stored in HDFS

• Partitioning

• Bucketing

• Map-Side Join

• Sort-Merge Join

• UDFs in Hive

• SerDe in Hive

• Apache Spark (Most Important)


• Spark Core

• Spark SQL

• Spark Streaming

• Apache Flink (Real-Time Data Processing)

• Apache Sqoop

• Apache NiFi

• Apache Flume

• Schedulers/Workflow Managers
• Apache Airflow

• Apache NiFi

• Azkaban

• NoSQL Databases
• HBase

• Cassandra

• Elasticsearch

• MongoDB

• Messaging Queue
• Apache Kafka

• Dashboarding Tools


• Tableau

• Power BI

• Grafana

• Kibana

• Big Data Services in the Cloud (AWS)


• On-demand Machines
• AWS EC2

• Access Management
• AWS IAM

• For Storing and Accessing Credentials


• AWS Secrets Manager

• Distributed File Storage


• AWS S3

• Transactional Database Services


• AWS RDS

• AWS Athena (serverless SQL query service over S3, not a transactional database)

• Data Warehousing Service


• AWS Redshift

• NoSQL Database Services


• AWS DynamoDB

• Serverless
• AWS Lambda

• ETL Services
• AWS Glue

• Scheduler
• AWS CloudWatch Events (now Amazon EventBridge)

• Distributed Data Computation


• AWS EMR

• Messaging Queue
• AWS SNS

• AWS SQS

• Real-Time Data Processing


• AWS Kinesis
