notes - Copy (2)
aggregate functions -
avg, sum, min, max computed over a window defined with partition by; order by is not needed here (see the sketch below).
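A minimal sketch, assuming a hypothetical DataFrame with "dept" and "salary" columns, of aggregate functions over a window built only from partitionBy:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, sum, min, max}

val spark = SparkSession.builder().appName("window-agg").master("local[*]").getOrCreate()
import spark.implicits._

val emp = Seq(("IT", 100), ("IT", 200), ("HR", 150)).toDF("dept", "salary")
val byDept = Window.partitionBy("dept")   // no orderBy is needed for these aggregates

emp.select($"dept", $"salary",
  avg("salary").over(byDept).as("avg_salary"),
  sum("salary").over(byDept).as("sum_salary"),
  min("salary").over(byDept).as("min_salary"),
  max("salary").over(byDept).as("max_salary")
).show()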
In bucketing, the bucket columns determine how data is distributed and help prevent data shuffle. Based on the value of one or more columns, each record is allocated to one of a predefined number of buckets.
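A minimal sketch of bucketing when writing a table, assuming a hypothetical DataFrame df with a "user_id" column; records with the same bucket-column value always land in the same bucket, so later joins on that column can avoid a shuffle:

// df is assumed to exist already; bucketBy only works with saveAsTable
df.write
  .bucketBy(8, "user_id")        // hash of user_id decides which of the 8 buckets a row goes to
  .sortBy("user_id")
  .saveAsTable("users_bucketed") // hypothetical table name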
Spark window functions are used to calculate results such as rank, row number, etc.
Spark SQL supports 3 kinds of window functions: ranking functions, analytic functions and aggregate functions.
ranking functions -
row_number() with partition by and order by - this gives sequential row numbers within each partition, in that order.
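A minimal sketch of row_number with partitionBy and orderBy, reusing the hypothetical emp DataFrame and spark session from the aggregate-functions sketch above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// rows are numbered 1, 2, 3, ... within each dept, ordered by salary descending
val byDeptOrdered = Window.partitionBy("dept").orderBy($"salary".desc)
emp.withColumn("row_num", row_number().over(byDeptOrdered)).show()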
A Scala case class comes with a default apply() method which handles object construction. A Scala case class also has all vals, which means its fields are immutable. Syntax - case class <classname>(<regular parameters>). To create a Scala object of a case class, we don't use the keyword 'new', because the default apply() method handles the creation of objects.
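A minimal sketch (hypothetical class) of the case class points above:

case class Employee(name: String, dept: String)

val e = Employee("Asha", "IT")   // no 'new' keyword - the generated apply() constructs the object
// e.name = "Ravi"               // would not compile: case class parameters are immutable vals
println(e)                       // prints Employee(Asha,IT)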
Rack awareness in Hadoop is the concept of choosing a DataNode close to the client that raised the read/write request, thereby reducing network traffic.
An edge node is a computer that acts as an end user portal for communication with
other nodes in cluster computing. Edge nodes are also sometimes called gateway
nodes or edge communication nodes. In a Hadoop cluster, three types of nodes exist:
master, worker and edge nodes. Master nodes oversee the storage of data in HDFS and coordinate key operations, such as running parallel computations on the data using MapReduce. Worker nodes make up most of the virtual machines in a Hadoop cluster and perform the job of storing the data and running computations. Each worker node runs the DataNode and TaskTracker services, which receive instructions from the master nodes.
Speculative execution means that Hadoop does not try to fix slow tasks, since the cause (misconfiguration, hardware issues, etc.) is hard to detect; instead, it launches a parallel/backup copy of each task that is running slower than expected on a faster node.
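Speculative execution is controlled through configuration. In Hadoop MapReduce the relevant mapred-site.xml properties are mapreduce.map.speculative and mapreduce.reduce.speculative; a minimal sketch of Spark's analogous setting:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-demo")
  .master("local[*]")
  .config("spark.speculation", "true")            // launch backup copies of slow tasks
  .config("spark.speculation.multiplier", "1.5")  // how many times slower than the median a task must be to count as slow
  .getOrCreate()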
groupByKey receives key-value pairs and groups the records for each key.
reduceByKey has similar functionality.
While both reduceByKey and groupByKey produce the same answer, the reduceByKey version works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
On the other hand, when calling groupByKey all the key-value pairs are shuffled around, which is a lot of unnecessary data being transferred over the network.
reduceByKey uses a combiner to pre-aggregate the data, while groupByKey doesn't.
eg -
val testRdd = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("C", 1)))
testRdd.reduceByKey(_ + _).collect()               // ("A", 2), ("B", 1), ("C", 1)
testRdd.groupByKey().mapValues(_.toList).collect() // ("A", List(1, 1)), ("B", List(1)), ("C", List(1))
To load data into a table from a CSV file when the schema is not known up front:
val pagecount = sqlContext.read.format("csv")
  .option("delimiter", " ")
  .option("quote", "")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets")
optimization in spark -
- serialization - Spark uses the Java serializer by default. We can set it to Kryo, which can give up to 10x better performance than the Java one (see the sketch after this list).
- API selection: Spark provides three kinds of API to work with - RDD, DataFrame, Dataset. RDD is used for low-level operations with little optimization. DataFrame is the best choice in most cases due to its Catalyst optimizer and low garbage collection (GC) overhead. Dataset is highly type-safe and uses encoders; it uses Tungsten for serialization in a binary format.
- Broadcasting, persisting, file format selection, minimal use of ByKey operations, and repartition/coalesce to handle parallelism.
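A minimal sketch (hypothetical Event class) of switching from the default Java serializer to Kryo and registering the classes that get shuffled most:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Event(id: Long, name: String)   // hypothetical class registered with Kryo

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))   // avoids writing full class names with every object

val spark = SparkSession.builder().config(conf).getOrCreate()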
RDD - Resilient because RDDs are immutable (can't be modified once created) and fault tolerant; Distributed because the data is distributed across the cluster; Dataset because it holds data.
A DAG (directed acyclic graph) is the representation of the way Spark will execute your program - each vertex of the graph is a separate operation and the edges represent the dependencies between operations. Your program (and thus the DAG that represents it) may operate on multiple entities (RDDs, DataFrames, etc.). RDD lineage is just the portion of the DAG (one or more operations) that leads to the creation of that particular RDD.
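A quick way to inspect an RDD's lineage is toDebugString; a minimal sketch, assuming an existing SparkContext sc (e.g. the one provided by spark-shell):

val words = sc.parallelize(Seq("a", "b", "a"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(counts.toDebugString)   // prints the chain of parent RDDs that leads to counts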
Spark architecture -
In your master node, you have the driver program, which drives your application.
The code you are writing behaves as a driver program or if you are using the
interactive shell, the shell acts as the driver program. Inside the driver program,
the first thing you do is create a Spark Context. Assume that the Spark
context is a gateway to all the Spark functionalities. It is similar to your
database connection. Any command you execute in your database goes through the
database connection. Likewise, anything you do on Spark goes through Spark context.
Now, this Spark context works with the cluster manager to manage various jobs. The
driver program & Spark context takes care of the job execution within the cluster.
A job is split into multiple tasks which are distributed over the worker node.
Anytime an RDD is created in Spark context, it can be distributed across various
nodes and can be cached there.
Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs on the worker nodes, which then return the results to the Spark context.
The Spark context takes the job, breaks it into tasks and distributes them to the worker nodes. The tasks work on the partitioned RDDs, perform operations, collect the results and return them to the main Spark context.
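A minimal sketch of a driver program: it creates the Spark context, and each action is broken into tasks that run on the worker nodes (the app name and input path are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // the driver creates the SparkContext - the gateway to all Spark functionality
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))
    val counts = sc.textFile("input.txt")        // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                        // runs as tasks on the worker nodes
    counts.collect().foreach(println)            // results come back to the driver
    sc.stop()
  }
}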
optimization techniques -
-> make sure all the dimension tables and small-sized tables are broadcast while reading, so that when these tables are used in join statements there is less data transmission while the application is running (see the sketch after this list).
-> in case of SQL or Hive queries, try to avoid sub-query/select statements, especially in join statements; instead we can replace those sub-queries with dataframes. Also, select only the required columns in the select statement. Based on requirements, filter out the non-mandatory data while reading large tables.
-> make sure hive tables are partitioned or bucketed based on the requirements and
use case of the tables.
-> Persist or cache intermediate dataframes as per requirement.
-> you can use kryo serialiser instead of the default one
-> if there is not much customization required to the data, then we can stick to using DataFrames instead of RDDs.
-> monitor the resource manager and application master of the Spark application while it is running and identify bottlenecks. Based on that, we can identify which dataframes are taking longer to execute and use repartition accordingly.
-> shuffle partition and parallelism
-> in case the job is failing at any point due to a stage failure, we can experiment with adding a checkpoint directory as well.
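A minimal sketch combining a few of the techniques above, assuming an existing SparkSession spark and hypothetical table and column names - broadcast the small dimension table, select only the required columns, filter early, persist the reused intermediate DataFrame, and tune shuffle partitions:

import org.apache.spark.sql.functions.{broadcast, col}
import org.apache.spark.storage.StorageLevel

spark.conf.set("spark.sql.shuffle.partitions", "200")   // tune to the data volume

val sales = spark.table("sales")                         // hypothetical large fact table
  .select("order_id", "customer_id", "amount")           // only the required columns
  .filter(col("amount") > 0)                             // drop non-mandatory data while reading

val customers = spark.table("dim_customer")              // hypothetical small dimension table

val enriched = sales.join(broadcast(customers), Seq("customer_id"))   // broadcast avoids a shuffle join
enriched.persist(StorageLevel.MEMORY_AND_DISK)           // reused downstream, so persist it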
Introduce yourself -
Hi, I am Swastika! I graduated from SRM University, Chennai in 2020 with a Bachelor of Technology in Computer Science. After completing my graduation, I started working as a Big Data Engineer for an organization called Infoepts Technologies, and I have been working there ever since. In my role as a Big Data Engineer, I have worked extensively with technologies like Spark, Spark SQL, Scala and Hive hosted on the AWS cloud platform. I also have a theoretical understanding of Hadoop. While working here, I have worked on enhancing, improving and delivering various ETL pipelines. I am really looking forward to being a part of an esteemed organization where I can enhance my knowledge and learn new technologies in the field of Big Data. Apart from my professional aspirations, in my free time I like reading books, cooking, and playing various sports and video games.
Next 5 years -
In the next 5 years, I see myself in a leadership role with more responsibility, where I am not only guiding a group of driven individuals but they also look up to me because of my skills. Moreover, I want to gain more skills in the field of Big Data and Data Science. To achieve these goals, I will use every opportunity I get to enhance my knowledge and learn more. And I believe joining TCS could give me the right kind of exposure and help me achieve this goal.
Why are you leaving your current job - I have been wanting to switch to a bigger organization where I can work with a team of skilled individuals and where leads support their associates. Moreover, I am looking for a more challenging and more responsible role; I want to work in a higher position with higher pay. Lastly, I have been wanting to relocate to Bangalore due to personal family reasons.
Tell me something that is not on your resume - I am sincere about my work and I don't take things related to my work lightly. I have been praised by my seniors on multiple occasions for doing a good job. I am also looking forward to joining TCS, as I want to increase my knowledge in the field of Big Data and get the opportunity to work on technologies in this field that are new to me. Apart from my professional story, personally, I was very good at different kinds of sports during my school days.
Tell me about an instance where you demonstrated leadership skills - The company I am working for expanded drastically last year in terms of hiring more associates, so a lot of new members have been introduced to the project I am working on. For the past few months I have been given the responsibility of getting the new joiners acquainted with the project structure that we follow in our team and also helping them with their tasks. As a result, for the past few months, on top of my own tasks, I have been guiding all the new joiners with their problems and with technical and business understanding. I am also making sure that their tasks are getting delivered within the stipulated timeframe.