Bigdata Bits

Uploaded by Shreyansh Diwan

1) In Spark, a ______________________ is a read-only collection of objects partitioned across a
set of machines that can be rebuilt if a partition is lost.

A) Spark Streaming
B) Resilient Distributed Dataset (RDD)
C) FlatMap
D) Driver

2) Given the following definition of the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

The join operation joins two datasets: when called on datasets of type (K, V) and (K, W), it
returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

What is the output of joinrdd when the following code is run?

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))
val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))
val joinrdd = rdd1.join(rdd2)
joinrdd.collect

1) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)),
(s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))
2) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)),
(s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))
3) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)),
(s,(59,62)), (s,(54,61)), (s,(54,62)))
4) None of the mentioned
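The inner-join semantics behind this question can be sketched without a Spark cluster, using plain Scala collections. This is a hedged illustration only: joinPairs is a hypothetical helper written for this sketch, not a Spark API, though it mirrors what RDD.join computes per key.

```scala
// Minimal sketch of RDD join semantics over plain Scala Seqs.
// joinPairs is a hypothetical helper for illustration, not a Spark API.
object JoinSketch {
  def joinPairs[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k1, v) <- left
      (k2, w) <- right
      if k1 == k2 // inner join: keep only keys present on BOTH sides
    } yield (k1, (v, w))

  def main(args: Array[String]): Unit = {
    val rdd1 = Seq(("m", 55), ("m", 56), ("e", 57), ("e", 58), ("s", 59), ("s", 54))
    val rdd2 = Seq(("m", 60), ("m", 65), ("s", 61), ("s", 62), ("h", 63), ("h", 64))
    // "e" appears only on the left and "h" only on the right, so both are dropped;
    // "m" and "s" each contribute 2 x 2 = 4 pairs.
    joinPairs(rdd1, rdd2).foreach(println)
  }
}
```

Since keys "e" and "h" have no match on the other side, they cannot appear in the result, which rules out options 1 and 2.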

3) Consider the following statements in the context of Spark:

Statement 1: Spark gives you control over how your Resilient Distributed Datasets (RDDs) are
partitioned.

Statement 2: Spark lets you choose whether or not to persist a Resilient Distributed Dataset
(RDD) to disk.

A) Only statement 1 is true
B) Only statement 2 is true
C) Both statements are true
D) Both statements are false
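Both capabilities can be seen directly in the RDD API. A minimal sketch, assuming a spark-shell session where `sc` is the predefined SparkContext (the sample data here is made up for illustration):

```scala
// Hedged sketch: run inside spark-shell, where `sc` is already defined.
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Statement 1: control partitioning explicitly, e.g. hash-partition into 4 partitions.
val partitioned = pairs.partitionBy(new HashPartitioner(4))

// Statement 2: choose whether and where to persist, e.g. spill to disk only.
partitioned.persist(StorageLevel.DISK_ONLY)
```

partitionBy, HashPartitioner, persist, and StorageLevel.DISK_ONLY are all part of the standard Spark RDD API.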

4) ______________ leverages Spark Core's fast scheduling capability to perform streaming
analytics.

A) MLlib
B) Spark Streaming
C) GraphX
D) RDDs

5) ____________________ is a distributed graph processing framework on top of Spark.

A) MLlib
B) Spark Streaming
C) GraphX
D) All of the mentioned

6) Consider the following statements:

Statement 1: Scale out means growing your cluster capacity by replacing machines with more
powerful ones.

Statement 2: Scale up means incrementally growing your cluster capacity by adding more COTS
(commodity off-the-shelf) machines.

A) Only statement 1 is true
B) Only statement 2 is true
C) Both statements are true
D) Both statements are false

7) Which of the following is not a NoSQL database?

A) HBase
B) SQL Server
C) Cassandra
D) None of the mentioned

8) Which of the following is the simplest type of NoSQL database?

A) Key-value
B) Wide-column
C) Document
D) All of the mentioned

9) Point out the incorrect statement in the context of Cassandra:

A) It was originally designed at Facebook
B) It is a centralized key-value store
C) It is designed to handle large amounts of data across many commodity servers, providing
high availability with no single point of failure
D) It uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing