Apache Spark
TRANSFORMATIONS:
Transformations are operations that create a new dataset from an existing one. They are
executed lazily, which means that they are not computed immediately but instead build a
lineage of transformations that will be executed when an action is called. Here are some
common transformations:
map: Applies a function to each element of the dataset, producing a new dataset. For
example, if you have a dataset of numbers and you want to square each number:
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)  # sc is an existing SparkContext
squared_rdd = rdd.map(lambda x: x**2)
flatMap: Similar to map, but each input item can be mapped to zero or more output items.
For example, splitting lines of text into words:
lines = ["Hello world", "Spark is great", "Big Data"]
rdd = sc.parallelize(lines)
words_rdd = rdd.flatMap(lambda line: line.split(" "))
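As a quick illustration of lazy evaluation, chaining transformations only records the lineage; nothing is computed until an action runs. The sketch below assumes an existing SparkContext sc, and the RDD names are purely illustrative:
numbers = sc.parallelize([1, 2, 3, 4, 5])
# No computation happens yet; Spark only records the lineage of transformations
squared = numbers.map(lambda x: x ** 2)
shifted = squared.map(lambda x: x + 1)
# The action below triggers execution of the entire lineage at once
print(shifted.collect())  # [2, 5, 10, 17, 26]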
ACTIONS:
Actions are operations that trigger the execution of transformations and return a value to
the driver program or write data to an external storage system. Actions are the operations
that actually perform computation. Here are some common actions:
collect: Retrieves all the elements of the dataset and returns them to the driver program.
Use this with caution as it can be memory-intensive for large datasets.
input_data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(input_data)
result = rdd.collect()
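When only a sample of a large dataset is needed, an action such as take(n) avoids pulling everything to the driver. A small sketch, with an illustrative RDD:
# take(n) returns only the first n elements to the driver,
# which is far safer than collect() on a very large RDD
large_rdd = sc.parallelize(range(1000000))
first_five = large_rdd.take(5)
print(first_five)  # [0, 1, 2, 3, 4]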
SOLUTION:
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "ReadTextFilesExample")
input_directory = "dbfs:/FileStore/txtfile/*.txt"
text_files_rdd = sc.textFile(input_directory)
all_elements = text_files_rdd.collect()
for element in all_elements:
    print(element)
sc.stop()
b. Read CSV file into RDD
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ReadCSVIntoDataFrameExample").getOrCreate()
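The solution above only sets up the SparkSession; a minimal sketch of the read step itself, assuming a hypothetical file at dbfs:/FileStore/csvfile/data.csv, could look like this:
# Path is illustrative; adjust it to the actual CSV location
csv_path = "dbfs:/FileStore/csvfile/data.csv"
# Read the file as an RDD of lines, then split each line into fields
csv_rdd = spark.sparkContext.textFile(csv_path)
parsed_rdd = csv_rdd.map(lambda line: line.split(","))
print(parsed_rdd.collect())
Alternatively, spark.read.csv(csv_path) returns a DataFrame rather than an RDD, which matches the appName used above.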
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "EmptyRDDExample")
# Create an empty RDD
empty_rdd = sc.emptyRDD()
# Perform operations on the empty RDD (for example, count the elements)
count = empty_rdd.count()
# Print the result
print(f"Number of elements in the empty RDD: {count}")
Solutions:
reduceByKey(func)
Merges the values for each key using an associative and commutative reduce function.
data = [("a", 1), ("b", 2), ("a", 3)]  # example key-value pairs (illustrative)
rdd = sc.parallelize(data)
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
reduced_rdd.collect()
sortByKey(ascending=True)
Returns an RDD sorted by key, in ascending order by default.
data = [("b", 2), ("a", 1), ("c", 3)]  # example key-value pairs (illustrative)
rdd = sc.parallelize(data)
sorted_rdd = rdd.sortByKey()
sorted_rdd.collect()
mapValues(func)
Applies a function to each value in the RDD without changing the keys.
data = [("a", 1), ("b", 2), ("c", 3)]  # example key-value pairs (illustrative)
rdd = sc.parallelize(data)
mapped_rdd = rdd.mapValues(lambda x: x * 2)
mapped_rdd.collect()
f. Generate DataFrame from RDD
Solution:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("RDDToDataFrameExample").getOrCreate()
# Create an RDD of tuples (example data, purely illustrative)
data = [("Alice", 30), ("Bob", 25)]
rdd = spark.sparkContext.parallelize(data)
# Convert the RDD to a DataFrame, using a list of column names as the schema
schema = ["name", "age"]
df = rdd.toDF(schema)
df.show()
spark.stop()
Spark shuffle is a core operation in Apache Spark that redistributes data across partitions so that
certain transformations and operations can be carried out. It is essential for grouping, aggregating,
and joining data, although moving records between executors makes it a relatively expensive step in a job.
Spark shuffle typically occurs when data needs to be grouped or aggregated based on certain
keys. It involves several steps:
1. Map Phase: The input data is divided into partitions, and each partition is assigned to a Spark
executor. Each executor applies a mapper function to each record, generating key-value
pairs.
2. Sort Phase: The key-value pairs are sorted within each partition using a sorting algorithm.
This ensures that all records with the same key are grouped together.
3. Hash Partitioning: The sorted key-value pairs are hashed and distributed to a specified
number of shuffle partitions. This ensures that records with the same key end up in the same
partition, regardless of their original partition.
4. Reduce Phase: The shuffled key-value pairs are aggregated or reduced within each shuffle
partition using a reducer function. This consolidates the data associated with each key.
5. Write to Disk: The aggregated data is written to disk, typically in temporary files. This is
necessary when the data doesn't fit in memory or when multiple stages of a Spark job
require the same shuffled data.
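To make this concrete, a wide transformation such as reduceByKey triggers a shuffle. The sketch below assumes an existing SparkContext sc; the word data and the partition counts are purely illustrative:
# reduceByKey is a wide transformation: the shuffle moves all values for a
# given key into the same shuffle partition before they are reduced
words = sc.parallelize(["spark", "shuffle", "spark", "data", "shuffle"], 3)
pairs = words.map(lambda w: (w, 1))  # map phase: emit key-value pairs
counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=2)  # shuffle + reduce phase
print(counts.getNumPartitions())  # 2 shuffle partitions
print(counts.collect())  # e.g. [('spark', 2), ('shuffle', 2), ('data', 1)]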