notes - Copy (2)
aggregate functions -
avg, sum, min, max computed over a window defined with partition by; order by is not needed here (see the sketch below).
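A minimal sketch, assuming a hypothetical DataFrame with "dept" and "salary" columns, of aggregate functions over a window built only from partitionBy:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, sum, min, max}

val spark = SparkSession.builder().appName("window-agg").master("local[*]").getOrCreate()
import spark.implicits._

val emp = Seq(("IT", 100), ("IT", 200), ("HR", 150)).toDF("dept", "salary")
val byDept = Window.partitionBy("dept")   // no orderBy is needed for these aggregates

emp.select($"dept", $"salary",
  avg("salary").over(byDept).as("avg_salary"),
  sum("salary").over(byDept).as("sum_salary"),
  min("salary").over(byDept).as("min_salary"),
  max("salary").over(byDept).as("max_salary")
).show()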
In bucketing, the bucket columns determine how data is distributed and help prevent data shuffle. Based on the value of one or more columns, each record is allocated to one of a predefined number of buckets.
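A minimal sketch of bucketing when writing a table, assuming a hypothetical DataFrame df with a "user_id" column; records with the same bucket-column value always land in the same bucket, so later joins on that column can avoid a shuffle:

// df is assumed to exist already; bucketBy only works with saveAsTable
df.write
  .bucketBy(8, "user_id")        // hash of user_id decides which of the 8 buckets a row goes to
  .sortBy("user_id")
  .saveAsTable("users_bucketed") // hypothetical table name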
Spark window functions are used to calculate results such as rank, row number, etc.
Spark SQL supports 3 kinds of window functions: ranking functions, analytic functions and aggregate functions.
ranking functions -
row_number() with partition by and order by - this gives sequential row numbers within each partition, in that order.
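A minimal sketch of row_number with partitionBy and orderBy, reusing the hypothetical emp DataFrame and spark session from the aggregate-functions sketch above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// rows are numbered 1, 2, 3, ... within each dept, ordered by salary descending
val byDeptOrdered = Window.partitionBy("dept").orderBy($"salary".desc)
emp.withColumn("row_num", row_number().over(byDeptOrdered)).show()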
A Scala case class comes with a default apply() method which handles object construction. A Scala case class also has all vals, which means its fields are immutable. Syntax - case class <classname>(<regular parameters>). To create a Scala object of a case class, we don't use the keyword 'new', because the default apply() method handles the creation of objects.
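A minimal sketch (hypothetical class) of the case class points above:

case class Employee(name: String, dept: String)

val e = Employee("Asha", "IT")   // no 'new' keyword - the generated apply() constructs the object
// e.name = "Ravi"               // would not compile: case class parameters are immutable vals
println(e)                       // prints Employee(Asha,IT)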
Rack awareness in Hadoop is the concept of choosing a DataNode close to the client that raised the read/write request, thereby reducing network traffic.
An edge node is a computer that acts as an end user portal for communication with
other nodes in cluster computing. Edge nodes are also sometimes called gateway
nodes or edge communication nodes. In a Hadoop cluster, three types of nodes exist:
master, worker and edge nodes. Master nodes oversee the storage of data in HDFS and coordinate key operations, such as running parallel computations on the data using MapReduce. Worker nodes make up most of the virtual machines in a Hadoop cluster and perform the job of storing the data and running computations. Each worker node runs the DataNode and TaskTracker services, which receive instructions from the master nodes.
Speculative execution means that Hadoop does not try to fix slow tasks, since the cause (misconfiguration, hardware issues, etc.) is hard to detect; instead, it launches a parallel/backup copy of each task that is running slower than expected on a faster node.
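Speculative execution is controlled through configuration. In Hadoop MapReduce the relevant mapred-site.xml properties are mapreduce.map.speculative and mapreduce.reduce.speculative; a minimal sketch of Spark's analogous setting:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-demo")
  .master("local[*]")
  .config("spark.speculation", "true")            // launch backup copies of slow tasks
  .config("spark.speculation.multiplier", "1.5")  // how many times slower than the median a task must be to count as slow
  .getOrCreate()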
groupByKey receives key-value pairs and groups the records for each key.
reduceByKey has similar functionality.
While both reduceByKey and groupByKey produce the same answer, the reduceByKey version works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
On the other hand, when calling groupByKey all the key-value pairs are shuffled around, which is a lot of unnecessary data being transferred over the network.
reduceByKey uses a combiner to pre-aggregate the data, while groupByKey doesn't.
eg -
val testRdd = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("C", 1)))
testRdd.reduceByKey(_ + _).collect()               // ("A", 2), ("B", 1), ("C", 1)
testRdd.groupByKey().mapValues(_.toList).collect() // ("A", List(1, 1)), ("B", List(1)), ("C", List(1))
To load data into a table from a CSV file when the schema is not known up front:
val pagecount = sqlContext.read.format("csv")
  .option("delimiter", " ")
  .option("quote", "")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets")
optimization in spark -
- serialization - Spark uses the Java serializer by default. We can set it to Kryo, which can give up to 10x better performance than the Java one (see the sketch after this list).
- API selection: Spark provides three kinds of API to work with - RDD, DataFrame, Dataset. RDD is used for low-level operations with little optimization. DataFrame is the best choice in most cases due to its Catalyst optimizer and low garbage collection (GC) overhead. Dataset is highly type-safe and uses encoders; it uses Tungsten for serialization in a binary format.
- Broadcasting, persisting, file format selection, minimal use of ByKey operations, and repartition/coalesce to handle parallelism.
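A minimal sketch (hypothetical Event class) of switching from the default Java serializer to Kryo and registering the classes that get shuffled most:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Event(id: Long, name: String)   // hypothetical class registered with Kryo

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))   // avoids writing full class names with every object

val spark = SparkSession.builder().config(conf).getOrCreate()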
RDD - Resilient because RDDs are immutable (can't be modified once created) and fault tolerant; Distributed because the data is distributed across the cluster; Dataset because it holds data.
A DAG (directed acyclic graph) is the representation of the way Spark will execute your program - each vertex of the graph is a separate operation and the edges represent the dependencies between operations. Your program (and thus the DAG that represents it) may operate on multiple entities (RDDs, DataFrames, etc.). RDD lineage is just the portion of the DAG (one or more operations) that leads to the creation of that particular RDD.
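A quick way to inspect an RDD's lineage is toDebugString; a minimal sketch, assuming an existing SparkContext sc (e.g. the one provided by spark-shell):

val words = sc.parallelize(Seq("a", "b", "a"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(counts.toDebugString)   // prints the chain of parent RDDs that leads to counts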
Spark architecture -
In your master node, you have the driver program, which drives your application.
The code you are writing behaves as a driver program or if you are using the
interactive shell, the shell acts as the driver program. Inside the driver program,
the first thing you do is create a Spark Context. Assume that the Spark
context is a gateway to all the Spark functionalities. It is similar to your
database connection. Any command you execute in your database goes through the
database connection. Likewise, anything you do on Spark goes through Spark context.
Now, this Spark context works with the cluster manager to manage various jobs. The
driver program & Spark context takes care of the job execution within the cluster.
A job is split into multiple tasks which are distributed over the worker node.
Anytime an RDD is created in Spark context, it can be distributed across various
nodes and can be cached there.
Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs on the worker nodes, which then return the results to the Spark context.
The Spark context takes the job, breaks it into tasks and distributes them to the worker nodes. The tasks work on the partitioned RDDs, perform operations, collect the results and return them to the main Spark context.
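A minimal sketch of a driver program: it creates the Spark context, and each action is broken into tasks that run on the worker nodes (the app name and input path are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // the driver creates the SparkContext - the gateway to all Spark functionality
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))
    val counts = sc.textFile("input.txt")        // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                        // runs as tasks on the worker nodes
    counts.collect().foreach(println)            // results come back to the driver
    sc.stop()
  }
}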
optimization techniques -
-> make sure all the dimension tables and small-sized tables are broadcast while reading, so that when these tables are used in join statements there is less data transmission while the application is running (see the sketch after this list).
-> in case of SQL or Hive queries, try to avoid sub-query/select statements, especially in join statements; instead we can replace those sub-queries with dataframes. Also, select only the required columns in the select statement. Based on requirements, filter out the non-mandatory data while reading large tables.
-> make sure hive tables are partitioned or bucketed based on the requirements and
use case of the tables.
-> Persist or cache intermediate dataframes as per requirement.
-> you can use kryo serialiser instead of the default one
-> if there is not much customization required to the data, then we can stick to using DataFrames instead of RDDs.
-> monitor the resource manager and application master of the Spark application while it is running and identify bottlenecks. Based on that, we can identify which dataframes are taking longer to execute and use repartition accordingly.
-> shuffle partition and parallelism
-> in case the job is failing at any point due to a stage failure, we can experiment with adding a checkpoint directory as well.
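A minimal sketch combining a few of the techniques above, assuming an existing SparkSession spark and hypothetical table and column names - broadcast the small dimension table, select only the required columns, filter early, persist the reused intermediate DataFrame, and tune shuffle partitions:

import org.apache.spark.sql.functions.{broadcast, col}
import org.apache.spark.storage.StorageLevel

spark.conf.set("spark.sql.shuffle.partitions", "200")   // tune to the data volume

val sales = spark.table("sales")                         // hypothetical large fact table
  .select("order_id", "customer_id", "amount")           // only the required columns
  .filter(col("amount") > 0)                             // drop non-mandatory data while reading

val customers = spark.table("dim_customer")              // hypothetical small dimension table

val enriched = sales.join(broadcast(customers), Seq("customer_id"))   // broadcast avoids a shuffle join
enriched.persist(StorageLevel.MEMORY_AND_DISK)           // reused downstream, so persist it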
Introduce yourself -
Hi, I am Swastika! I graduated from SRM University, Chennai in 2020 with a Bachelor of Technology in Computer Science. After completing my graduation, I started working as a Big Data Engineer for an organization called Infoepts Technologies, and I have been working there ever since. In my role as a Big Data Engineer, I have worked extensively with technologies like Spark, Spark SQL, Scala and Hive hosted on the AWS cloud platform. I also have a theoretical understanding of Hadoop. While working here, I have worked on enhancing, improving and delivering various ETL pipelines. I am really looking forward to being a part of an esteemed organization where I can enhance my knowledge and learn new technologies in the field of Big Data. Apart from my professional aspirations, in my free time I like reading books, cooking, and playing various sports and video games.
Next 5 years -
In the next 5 years, I see myself in a leadership role with more responsibility, where I am not only guiding a group of driven individuals but they also look up to me because of my skills. Moreover, I want to gain more skills in the field of Big Data and Data Science. To achieve these goals, I will use every opportunity I get to enhance my knowledge and learn more. And I believe joining TCS could give me the right kind of exposure and help me achieve this goal.
Why are you leaving your current job - I have been wanting to switch to a bigger organization where I can work with a team of skilled individuals and where leads support their associates. Moreover, I am looking for a more challenging and more responsible role; I want to work in a higher position with higher pay. Lastly, I have been wanting to relocate to Bangalore due to personal family reasons.
Tell me something that is not on your resume - I am sincere about my work and I don't take things related to my work lightly. I have been praised by my seniors on multiple occasions for doing a good job. I am also looking forward to joining TCS, as I want to increase my knowledge in the field of Big Data and get the opportunity to work on technologies in this field that are new to me. Apart from my professional story, personally, I was very good at different kinds of sports during my school days.
Tell me about an instance where you demonstrated leadership skills - The company I am working for expanded drastically last year in terms of hiring more associates, so a lot of new members have been introduced to the project I am working on. For the past few months I have been given the responsibility of getting the new joiners acquainted with the project structure that we follow in our team and also helping them with their tasks. As a result, for the past few months, on top of my own tasks, I have been guiding all the new joiners with their problems and with technical and business understanding. I am also making sure that their tasks are getting delivered within the stipulated timeframe.