
Delta is a data format based on parquet. It can be hosted on any cloud platform. It brings ACID transactions to Spark. Some of its features are -
Caching - it creates local copies of data on worker nodes. This helps in avoiding remote reads during execution.
Time travel - a delta table keeps a history of all the changes that were made to the table in the past. This history is a transaction log which can also be queried for a specific timestamp. Data can be restored to a previous version using a timestamp.
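
a minimal sketch of time travel, assuming the delta-spark library is available and a delta table already exists at the hypothetical path /tmp/delta/events:

// read the current version of the table
val current = spark.read.format("delta").load("/tmp/delta/events")

// read the table as of an older version number
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

// read the table as it was at a given timestamp
val asOfTs = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/tmp/delta/events")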

ORC format stores data in the form of stripes, where each stripe contains an index, row data and a footer. Data is stored under the partition if one is provided.


In bucketing, bucket columns determine data partitioning and prevent data shuffle. Based on the value of one or more bucketing columns, data is allocated to a predefined number of buckets.
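
for example, a minimal sketch of writing a bucketed table from Spark (df is an existing DataFrame; the table and column names here are made up):

// distribute rows into 8 buckets based on the hash of user_id
df.write
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")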

Spark window functions are used to calculate results such as rank, row number, etc.
Spark SQL supports 3 kinds of window functions: ranking, analytic and aggregate functions.
ranking functions -
row_number() over (partition by ... order by ...) - gives row numbers in sequence within that partition and in that order.
rank - unlike row_number, this gives the same rank to rows with the same values in the order by columns.
analytic functions -
lead(1) - prints the value of the next row; for the last row the result is null.
aggregate functions -
avg, sum, min, max over a partition by. Order by is not needed here.
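
a small sketch of these window functions (df and the column names are made up):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, rank, lead, sum}

val w = Window.partitionBy("dept").orderBy("salary")

val result = df
  .withColumn("row_number", row_number().over(w))        // unique sequence per partition
  .withColumn("rank", rank().over(w))                    // same rank for ties
  .withColumn("next_salary", lead("salary", 1).over(w))  // null on the last row
  .withColumn("dept_total", sum("salary").over(Window.partitionBy("dept"))) // aggregate, no order by needed
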
A Scala case class has a default apply() method which handles object construction. A case class also has all vals, which means its fields are immutable. Syntax - case class <classname>(<regular parameters>). To create an object of a case class, we don't use the keyword 'new', because the default apply() method handles the creation of objects.
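
for example (a hypothetical Employee case class):

case class Employee(name: String, id: Int)

// no 'new' keyword - the generated apply() method creates the object
val e = Employee("Swastika", 1)

// fields are vals, so reassignment does not compile:
// e.id = 2   // error: reassignment to val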

Rack Awareness in Hadoop is the concept of choosing a nearby DataNode (closest to the client which has raised the read/write request), thereby reducing the network traffic.
An edge node is a computer that acts as an end user portal for communication with
other nodes in cluster computing. Edge nodes are also sometimes called gateway
nodes or edge communication nodes. In a Hadoop cluster, three types of nodes exist:
master, worker and edge nodes. Master nodes oversee the key operations, such as storing data in HDFS and running parallel computations on the data using MapReduce. The worker nodes comprise most of the virtual machines in a
Hadoop cluster, and perform the job of storing the data and running computations.
Each worker node runs the DataNode and TaskTracker services, which are used to
receive the instructions from the master nodes.
Speculative execution means that Hadoop as a whole doesn't try to fix slow tasks, as it is hard to detect the reason (misconfiguration, hardware issues, etc.); instead, it just launches a parallel/backup task, on faster nodes, for each task that is performing slower than expected.

we can create a temporary table/view using createOrReplaceTempView() and then use this view to create a table either in Hive or Athena.
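
a minimal sketch (the view and table names are made up):

// register the DataFrame as a temporary view
df.createOrReplaceTempView("staging_view")

// query it with Spark SQL, e.g. to materialize it as a Hive table
spark.sql("CREATE TABLE IF NOT EXISTS db.target_table AS SELECT * FROM staging_view")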

groupByKey receives key-value pairs and groups the records for each key.
reduceByKey also groups by key, but it additionally merges the values for each key with the supplied function.
While both reduceByKey and groupByKey will produce the same answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
On the other hand, when calling groupByKey, all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network.
reduceByKey uses a combiner to pre-aggregate the data while groupByKey doesn't.
eg - val testRdd = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("C", 1)))
reduceByKey(_ + _) o/p -> ("A", 2), ("B", 1), ("C", 1)
groupByKey o/p -> ("A", [1, 1]), ("B", [1]), ("C", [1])
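
the same example as runnable code (assuming an existing SparkContext sc):

val testRdd = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("C", 1)))

// combines values per key on each partition before the shuffle
val reduced = testRdd.reduceByKey(_ + _)            // ("A",2), ("B",1), ("C",1)

// shuffles every key-value pair first, then groups
val grouped = testRdd.groupByKey().mapValues(_.sum) // same totals, more data moved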

optimization techniques used in HIVE -


Partitioning, Bucketing, Using TEZ execution, using suitable file format, avoid
calculated fields in join and where clause
Both partitioning and bucketing are used for segregating and storing data for query optimization. The difference: partitioning splits data into separate directories based on the values of a particular column, while bucketing distributes data into a fixed number of files (buckets) based on the hash of a column.
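
a sketch of the difference in Hive DDL, run through spark.sql (assuming Spark with Hive support; the table and column names are made up):

spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (
    order_id BIGINT,
    amount   DOUBLE
  )
  PARTITIONED BY (order_date STRING)          -- one directory per date value
  CLUSTERED BY (order_id) INTO 16 BUCKETS     -- rows hashed into 16 files per partition
  STORED AS ORC
""")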

to load data into a table from a csv file when the schema is not known:
val pagecount = sqlContext.read.format("csv")
  .option("delimiter", " ")
  .option("quote", "")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets")

optimization in spark -
- serialization - Spark uses the Java serializer by default. We can set it to Kryo, which can give up to 10x better performance than the Java one.
- API selection: Spark introduced three types of API to work upon – RDD, DataFrame,
DataSet. RDD is used for low level operation with less optimization. DataFrame is
best choice in most cases due to its catalyst optimizer and low garbage collection
(GC) overhead. Dataset is highly type safe and uses encoders. It uses Tungsten for serialization in a binary format.
Broadcasting, persisting, file format selection, minimal use of ByKey operations,
repartition and coalesce to handle parallelism.
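
a sketch of two of these techniques (factDf and dimDf are hypothetical DataFrames):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("optimized-job")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // switch from Java serialization
  .getOrCreate()

// hint Spark to broadcast the small dimension table in the join
val joined = factDf.join(broadcast(dimDf), Seq("product_id"))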

Resilient because RDDs are immutable (can't be modified once created) and fault tolerant, Distributed because the data is distributed across the cluster, and Dataset because it holds data.
DAG (directed acyclic graph) is the representation of the way Spark will execute your program - each vertex on that graph is a separate operation and edges represent dependencies of each operation. Your program (thus the DAG that represents it) may operate on multiple entities (RDDs, Dataframes, etc). RDD lineage is just a portion of the DAG (one or more operations) that leads to the creation of that particular RDD.
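
lineage can be inspected on any RDD, for example (the input path is made up):

val words = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// prints the RDD lineage (the chain of parent RDDs and transformations)
println(words.toDebugString)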

difference between data frame and dataset


dataframes are a Spark SQL structure similar to a relational database table, that is, a row and column structure. Spark Datasets are an extension of the DataFrame API. A Dataset is fast and also provides a type-safe interface. Type safety means that the compiler validates the data types of all the columns in the dataset at compile time itself and throws an error if there is any mismatch in the data types.
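
a small sketch of the difference, assuming an existing SparkSession named spark and a hypothetical Person type:

case class Person(name: String, age: Int)

import spark.implicits._

val df = Seq(Person("Asha", 30), Person("Ravi", 25)).toDF()  // untyped Dataset[Row]
val ds = df.as[Person]                                       // typed Dataset[Person]

ds.map(_.age + 1)        // compiles: the compiler knows age is an Int
// df.map(_.age + 1)     // does not compile: Row has no member 'age'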

how is a file processed in map reduce?


MapReduce facilitates concurrent processing by splitting petabytes of data into
smaller chunks, and processing them in parallel on Hadoop commodity servers. In the
end, it aggregates all the data from multiple servers to return a consolidated
output back to the application.

Spark architecture -
In your master node, you have the driver program, which drives your application.
The code you are writing behaves as a driver program or if you are using the
interactive shell, the shell acts as the driver program. Inside the driver program,
the first thing you do is, you create a Spark Context. Assume that the Spark
context is a gateway to all the Spark functionalities. It is similar to your
database connection. Any command you execute in your database goes through the
database connection. Likewise, anything you do on Spark goes through Spark context.
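
a minimal sketch of a driver program creating the Spark context:

import org.apache.spark.sql.SparkSession

object MyDriver {
  def main(args: Array[String]): Unit = {
    // the driver creates the SparkSession / SparkContext - the gateway to Spark
    val spark = SparkSession.builder().appName("my-app").getOrCreate()
    val sc = spark.sparkContext

    // every action from here on is scheduled through this context
    println(sc.parallelize(1 to 100).sum())

    spark.stop()
  }
}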

Now, this Spark context works with the cluster manager to manage various jobs. The driver program & Spark context take care of the job execution within the cluster. A job is split into multiple tasks which are distributed over the worker nodes. Anytime an RDD is created in the Spark context, it can be distributed across various nodes and can be cached there.

Worker nodes are the slave nodes whose job is basically to execute the tasks. These tasks are executed on the partitioned RDDs in the worker nodes, and the results are returned back to the Spark Context.

Spark Context takes the job, breaks the job into tasks and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results and return them to the main Spark Context.

optimization techniques -
-> make sure all the dimension tables and other small tables are broadcast when reading, so that when these tables are used in join statements there is less data transmission over the network while the application is running.
-> in case of sql or hive queries try to avoid sub query/select statements
especially in join statements. Instead we can replace those sub queries with
dataframes. Also, select only the required columns in the select statement. Based
on requirements, filter out the non-mandatory data while reading large tables.
-> make sure hive tables are partitioned or bucketed based on the requirements and
use case of the tables.
-> Persist or cache intermediate dataframes as per requirement.
-> you can use kryo serialiser instead of the default one
-> if there is not much customization required on the data then we can stick to using dataframes instead of RDDs.
-> monitor the resource manager and application master of the Spark application while it is running to identify bottlenecks. Based on that, we can identify which dataframes are taking longer to execute and repartition them accordingly.
-> tune shuffle partitions and parallelism.
-> in case the job is failing at any point due to a stage failure, we can experiment with adding a checkpoint directory as well (see the sketch after this list).
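
a sketch of the last two points, assuming an existing SparkSession spark and DataFrame someDf (the directory path is made up):

// tune the number of partitions used after shuffles (default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "400")

// set a checkpoint directory so long lineages / failing stages can be truncated
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
val checkpointedDf = someDf.checkpoint()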

Why should we hire you -


I believe I have worked tirelessly for all the relevant skills that I have acquired over the past few years. I have all or most of the skills needed for the position I am interviewing for, which makes me certain that I will be a good fit and have an immediate impact once I start working for the company. On top of everything, I am also really excited to learn more and enhance my skill set while working with a team of talented individuals.

Introduce yourself -
Hi, I am Swastika! I graduated from SRM University, Chennai in the year 2020. I did my Bachelor of Technology in the field of Computer Science. After completing my graduation, I started working as a Big Data Engineer for an organization called Infoepts Technologies, and I have been working there ever since. In my role as a Big Data Engineer, I have worked extensively with technologies like Spark, Spark SQL, Scala and Hive hosted on the AWS cloud platform. I also have a theoretical understanding of Hadoop. While working here, I have worked on enhancing, improving and delivering various ETL pipelines. I am really looking forward to being a part of an esteemed organization where I can enhance my knowledge and learn new technologies in the field of Big Data. Apart from my professional aspirations, in my free time I like reading books, cooking, and playing various sports and video games.

Strengths and weakness -


Strengths: I am very reliable; I make sure that the tasks assigned to me get done within the deadline. I am sincere about my work and don't take it lightly. I am respectful towards my leads and colleagues. I am a fast learner.
Weakness: I can get overwhelmed or worked up sometimes when a lot of tasks are assigned to me. I can also get lazy when I see that the deadline is not close and I know I can finish the task sooner. I also get impatient sometimes, which can make me miss minor details in that moment.

Next 5 years -
In the next 5 years, I see myself in a leadership role with more responsibility, where I am not only guiding a group of driven individuals but they also look up to me because of my skills. Moreover, I want to attain more skills in the field of Big Data and Data Science. In order to achieve these goals, I will use all the possible opportunities I get to enhance my knowledge and learn more. And I believe joining TCS could give me the right kind of exposure and help me achieve this goal.

Why are you leaving your current job - I have been wanting to switch to a bigger organization where I can work with a team of skilled individuals and where leads support their associates. Moreover, I am looking for a more challenging and more responsible role; I want to work in a higher position with higher pay. Lastly, I have been wanting to relocate to Bangalore due to personal family reasons.

Tell me something that is not on your resume - I am sincere about my work and I don't take things related to my work lightly. I have also been praised by my seniors on multiple occasions for doing a good job. I am also looking forward to joining TCS as I want to increase my knowledge in the field of Big Data and get the opportunity to work on technologies that are new to me in this field. Apart from my professional story, personally I was very good at different kinds of sports during my school days.

Tell me about an instance where you demonstrated leadership skills - The company I am working for expanded drastically last year in terms of hiring more associates, so a lot of new members have been introduced to the project I am working on. For the past few months I have been given the responsibility of getting the new joiners acquainted with the project structure that we follow in our team and also helping them with their tasks. As a result, for the past few months, on top of my own tasks, I have been guiding all the new joiners with their problems and with technical and business understanding. I am also making sure that their tasks are getting delivered within the stipulated timeframe.
