
Module – 5 Apache Spark

1. Explain Spark operations with an example.


Solution:
Apache Spark is an open-source, distributed computing framework that provides a fast and
general-purpose cluster computing system for big data processing. Spark operates on
distributed datasets and performs various operations to process and transform data. These
operations can be categorized into two types: transformations and actions.

TRANSFORMATIONS:
Transformations are operations that create a new dataset from an existing one. They are
executed lazily, which means that they are not computed immediately but instead build a
lineage of transformations that will be executed when an action is called. Here are some
common transformations:

map: Applies a function to each element of the dataset, producing a new dataset. For
example, if you have a dataset of numbers and you want to square each number (the
snippets below assume a SparkContext named sc has been created as shown here):
from pyspark import SparkContext
sc = SparkContext("local", "SparkOperationsExample")
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
squared_rdd = rdd.map(lambda x: x**2)

flatMap: Similar to map, but each input item can be mapped to zero or more output items.
For example, splitting lines of text into words:
lines = ["Hello world", "Spark is great", "Big Data"]
rdd = sc.parallelize(lines)
words_rdd = rdd.flatMap(lambda line: line.split(" "))

distinct: Returns a new dataset with distinct elements.


input_data = [1, 2, 2, 3, 3, 4, 4, 5]
rdd = sc.parallelize(input_data)
distinct_rdd = rdd.distinct()
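
Because transformations are lazy, none of the snippets above compute anything on their
own; Spark only records the lineage and runs it when an action is called. A minimal
sketch (reusing the SparkContext sc from the first example):

pipeline = sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x**2)   # nothing executes yet
print(pipeline.collect())   # the action triggers the whole lineage: [1, 4, 9, 16, 25]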

ACTIONS:
Actions are operations that trigger the execution of transformations and return a value to
the driver program or write data to an external storage system. Actions are the operations
that actually perform computation. Here are some common actions:

collect: Retrieves all the elements of the dataset and returns them to the driver program.
Use this with caution as it can be memory-intensive for large datasets.
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
result = rdd.collect()

count: Returns the number of elements in the dataset.


input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
count = rdd.count()

first: Returns the first element of the dataset.


input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
first_element = rdd.first()

2. Perform the following RDD operations


a. How to read multiple text files into an RDD

Solution:

from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "ReadTextFilesExample")

# Specify the directory containing the text files
input_directory = "dbfs:/FileStore/txtfile/*.txt"

# Use textFile to read the text files
text_files_rdd = sc.textFile(input_directory)

# Retrieve the entire contents of the RDD using collect()
all_elements = text_files_rdd.collect()

# Print each element in the RDD
for element in all_elements:
    print(element)

# Stop the SparkContext
sc.stop()
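
As a side note, if the file names are needed together with the file contents, an
alternative (run while the SparkContext above is still active, i.e. before sc.stop())
is wholeTextFiles, which returns (filename, content) pairs:

pairs_rdd = sc.wholeTextFiles("dbfs:/FileStore/txtfile/")
for path, content in pairs_rdd.collect():
    print(path)
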
b. Read a CSV file into an RDD
Solution:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("ReadCSVIntoDataFrameExample").getOrCreate()

# Specify the path to the CSV file
csv_file_path = "dbfs:/FileStore/Address.csv"

# Read the CSV file into a DataFrame
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the DataFrame
df.show()

# Stop the SparkSession
spark.stop()
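
Since the question asks for an RDD, the DataFrame above can be converted before
spark.stop() is called: df.rdd yields an RDD of Row objects, or the raw lines can be
read directly with textFile. A minimal sketch (same path as above):

rows_rdd = df.rdd                                          # RDD of Row objects
lines_rdd = spark.sparkContext.textFile(csv_file_path)     # RDD of raw CSV lines
fields_rdd = lines_rdd.map(lambda line: line.split(","))   # RDD of lists of column values
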

c. Ways to create an RDD


Solution:
There are three ways to create an RDD (a short sketch of each follows the list):

1. Parallelizing an existing collection.

2. Reading from external storage, such as a text file or CSV file.

3. Creating an RDD from another RDD by applying a transformation.
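
A minimal sketch of all three approaches (the file path is a placeholder):

from pyspark import SparkContext

sc = SparkContext("local", "CreateRDDExamples")

# 1. Parallelizing an existing collection
rdd_from_list = sc.parallelize([1, 2, 3, 4, 5])

# 2. Reading from external storage (placeholder path)
rdd_from_file = sc.textFile("dbfs:/FileStore/txtfile/sample.txt")

# 3. Creating an RDD from another RDD via a transformation
rdd_from_rdd = rdd_from_list.map(lambda x: x * 10)

print(rdd_from_rdd.collect())   # [10, 20, 30, 40, 50]

sc.stop()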

d. Create an empty RDD


Solution:
We can create an empty RDD using the parallelize method:

from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "EmptyRDDExample")

# Create an empty RDD
empty_rdd = sc.parallelize([])

# Perform operations on the empty RDD (for example, count the elements)
count = empty_rdd.count()

# Print the result
print(f"Number of elements in the empty RDD: {count}")

# Stop the SparkContext
sc.stop()
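
An equivalent alternative (while the SparkContext is still active) is emptyRDD, which
also produces an RDD with no elements:

empty_rdd = sc.emptyRDD()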

e. RDD Pair Functions

Solution (the examples below assume a SparkContext named sc has already been created):

reduceByKey(func)

Combines the values for each key using the specified reduction function.

data = [("apple", 3), ("banana", 2), ("apple", 5), ("banana", 1)]
rdd = sc.parallelize(data)
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
reduced_rdd.collect()   # e.g. [('apple', 8), ('banana', 3)] (order may vary)

sortByKey(ascending=True)

Sorts the RDD by its keys.

data = [("apple", 3), ("banana", 2), ("orange", 5), ("grape", 1)]
rdd = sc.parallelize(data)
sorted_rdd = rdd.sortByKey()
sorted_rdd.collect()   # [('apple', 3), ('banana', 2), ('grape', 1), ('orange', 5)]

mapValues(func)

Applies a function to each value in the RDD without changing the keys.

data = [("apple", 3), ("banana", 2), ("orange", 5)]
rdd = sc.parallelize(data)
mapped_rdd = rdd.mapValues(lambda x: x * 2)
mapped_rdd.collect()   # [('apple', 6), ('banana', 4), ('orange', 10)]

f. Generate a DataFrame from an RDD

Solution:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("RDDToDataFrameExample").getOrCreate()

# Create an RDD
data = [("apple", 3), ("banana", 2), ("apple", 5), ("banana", 1)]
rdd = spark.sparkContext.parallelize(data)

# Define the column names for the DataFrame
schema = ["fruit", "quantity"]

# Convert the RDD to a DataFrame
df = rdd.toDF(schema)

# Show the DataFrame
df.show()

# Stop the SparkSession
spark.stop()
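
An equivalent way to build the same DataFrame (run before spark.stop() is called) is
createDataFrame, which accepts the RDD and the column names directly:

df = spark.createDataFrame(rdd, schema)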

3. How does Spark shuffle work?

Spark shuffle is a crucial operation in Apache Spark that redistributes data across partitions to
facilitate certain transformations and operations. It plays a vital role in enabling efficient data
processing and execution of complex workloads.

Understanding Spark Shuffle

Spark shuffle typically occurs when data needs to be grouped or aggregated by key. It
involves several steps; a short code example that triggers a shuffle follows the list:
1. Map Phase: The input data is divided into partitions, and each partition is assigned to a Spark
executor. Each executor applies a mapper function to each record, generating key-value
pairs.

2. Sort Phase: The key-value pairs are sorted within each partition using a sorting algorithm.
This ensures that all records with the same key are grouped together.

3. Hash Partitioning: The sorted key-value pairs are hashed and distributed to a specified
number of shuffle partitions. This ensures that records with the same key end up in the same
partition, regardless of their original partition.

4. Reduce Phase: The shuffled key-value pairs are aggregated or reduced within each shuffle
partition using a reducer function. This consolidates the data associated with each key.

5. Write to Disk: The aggregated data is written to disk, typically in temporary files. This is
necessary when the data doesn't fit in memory or when multiple stages of a Spark job
require the same shuffled data.
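
To make this concrete, here is a minimal sketch in which reduceByKey triggers a shuffle
(the data, partition counts, and app name are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[2]", "ShuffleExample")

# Map phase: key-value pairs spread across 4 input partitions
data = [("apple", 1), ("banana", 1), ("apple", 1), ("orange", 1), ("banana", 1)]
rdd = sc.parallelize(data, 4)

# reduceByKey triggers a shuffle: values are combined locally within each input
# partition, then redistributed by key hash into the requested number of shuffle
# partitions (2 here) and reduced again, so all records with the same key meet.
counts = rdd.reduceByKey(lambda x, y: x + y, numPartitions=2)

print(counts.getNumPartitions())   # 2 shuffle partitions
print(counts.collect())            # e.g. [('banana', 2), ('apple', 2), ('orange', 1)]

sc.stop()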
