
E10-1

This Notebook illustrates the use of "MAP-REDUCE" to calculate averages from the
data contained in nsedata.csv.

Task 1
You are required to review the code (refer to the SPARK document where necessary) and add comments / markup explaining the code in each cell. Also explain the role of each cell in the overall context of the solution to the problem (i.e. what the cell is trying to achieve in the overall scheme of things). You may create additional code in each cell to generate any debug output that you may need to complete this exercise.
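For example, a single short debug line such as the following keeps the output within the 5-line limit mentioned under Submission (a sketch only; rdd1 is the RDD name used later in this notebook):

# Sketch: print the first 5 elements of an RDD as short debug output
print(rdd1.take(5))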

Task 2
You are required to write code to solve the problem stated at the end of this Notebook.

Submission
Create and upload a PDF of this Notebook. BEFORE CONVERTING TO PDF and UPLOADING, ENSURE THAT YOU REMOVE / TRIM LENGTHY DEBUG OUTPUTS. Short debug outputs of up to
5 lines are acceptable.

# After making sure Spark is installed on the system, we import the
# 'findspark' library so that Python can locate the Spark installation
import findspark
# After successfully locating the library, we call the following function
# to make the Spark libraries available for import
findspark.init()

# Overall, this cell locates the Spark libraries for use in the rest of
# the program

# Now import pyspark, i.e. the Python library that enables us to work
# with Apache Spark
import pyspark
# This import statement provides access to the data types available in
# PySpark's SQL module
from pyspark.sql.types import *

# Overall, this cell lays the foundation for the upcoming usage of the
# data types and the PySpark libraries

# Initialise a SparkContext in PySpark with the application name "E10"
sc = pyspark.SparkContext(appName="E10")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/28 18:14:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

# Create an RDD from the CSV text file, with one element per line
rdd1 = sc.textFile("/home/hduser/spark/nsedata.csv")

# Apply a filter transformation: any element x containing the string
# "SYMBOL" (the header line) is removed from the RDD
rdd1 = rdd1.filter(lambda x: "SYMBOL" not in x)

# Using a lambda function, split each element x into substrings at the commas
rdd2 = rdd1.map(lambda x: x.split(","))
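As a quick illustration of what split(",") produces (a toy example with made-up values, not taken from nsedata.csv):

# Toy example: a CSV line becomes a list of field strings after split(",")
print("ABC,EQ,100.0,105.5,99.0,104.2".split(","))
# -> ['ABC', 'EQ', '100.0', '105.5', '99.0', '104.2']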

# Helper comment: the goal is to find the mean of the OPEN prices and the
# mean of the CLOSE prices in one batch of tasks

# From the single RDD we build two new key-value RDDs: for each element x,
# the OPEN column (x[2]) and the CLOSE column (x[5]) are converted to float
rdd_open = rdd2.map(lambda x: (x[0] + "_open", float(x[2])))
rdd_close = rdd2.map(lambda x: (x[0] + "_close", float(x[5])))
# x[0] + "_open" produces a key like "SYMBOL_open", i.e. the suffix is
# concatenated with the first element (the company symbol)
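A toy illustration of the key construction, using made-up values (not taken from the dataset):

# Toy example: building the "SYMBOL_open" / "SYMBOL_close" (K,V) pairs
row = ["ABC", "EQ", "100.0", "105.5", "99.0", "104.2"]
print((row[0] + "_open", float(row[2])), (row[0] + "_close", float(row[5])))
# -> ('ABC_open', 100.0) ('ABC_close', 104.2)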

# This unites the two individual RDDs created above into a single RDD
rdd_united = rdd_open.union(rdd_close)

# This transformation groups the RDD elements by key and applies the
# provided lambda function to combine (reduce) the values associated
# with the same key
reducedByKey = rdd_united.reduceByKey(lambda x, y: x + y)
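A toy illustration of how reduceByKey combines values sharing the same key (made-up values; the order of the collected pairs may vary):

# Toy example: reduceByKey sums the values per key, analogous to summing prices per symbol
toy = sc.parallelize([("ABC_open", 10.0), ("ABC_open", 12.0), ("XYZ_open", 5.0)])
print(toy.reduceByKey(lambda x, y: x + y).collect())
# -> [('ABC_open', 22.0), ('XYZ_open', 5.0)]  (order may vary)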

# Working on rdd_united, we map each of its elements to a new key-value
# pair, where the key is "SYMBOL_open" or "SYMBOL_close" and the value is 1
temp1 = rdd_united.map(lambda x: (x[0], 1)).countByKey()
# countByKey() counts the occurrences of each unique key and returns a
# Python dictionary of (key, count) pairs
countOfEachSymbol = sc.parallelize(temp1.items())
# Finally, the dictionary created in temp1 is transformed back into an
# RDD named countOfEachSymbol
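A toy illustration of countByKey (made-up values):

# Toy example: countByKey returns a dictionary of occurrences per key
toy = sc.parallelize([("ABC_open", 1), ("ABC_open", 1), ("XYZ_open", 1)])
print(dict(toy.countByKey()))
# -> {'ABC_open': 2, 'XYZ_open': 1}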

# symbol_sum_count is an RDD that combines, for each key, the sum of the
# prices and the count of occurrences: (key, (sum, count))
symbol_sum_count = reducedByKey.join(countOfEachSymbol)
# x[0] is the key (the symbol), x[1][0] retrieves the sum and x[1][1]
# retrieves the total count, so dividing them gives the average
averages = symbol_sum_count.map(lambda x: (x[0], x[1][0] / x[1][1]))
# This is the finalising step for calculating the average
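A toy illustration of the join-then-divide step (made-up values):

# Toy example: join pairs the sum with the count, then the map divides them
sums = sc.parallelize([("ABC_open", 22.0)])
counts = sc.parallelize([("ABC_open", 2)])
print(sums.join(counts).map(lambda x: (x[0], x[1][0] / x[1][1])).collect())
# -> [('ABC_open', 11.0)]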

# sortByKey() arranges the key-value pairs in ascending order of the key,
# i.e. the symbol name
averagesSorted = averages.sortByKey()

# Save the averages, in ascending order of symbol name, as text files in
# the directory /home/hduser/spark/averages
averagesSorted.saveAsTextFile("/home/hduser/spark/averages")

sc.stop()

Review the output files generated in the above step and copy the first
15 lines of any one of the output files into the cell below for reference.
Write your comments on the generated output
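One possible way to copy those lines (a sketch, not part of the original notebook; it assumes Spark's default output file name part-00000 inside the averages folder):

# Sketch: print the first 15 lines of one output part file
with open("/home/hduser/spark/averages/part-00000") as f:
    for i, line in enumerate(f):
        if i == 15:
            break
        print(line, end="")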
('BILPOWER_open', 46.46917454858125)
('BIL_close', 186.45586095392073)
('BIL_open', 186.1023848019402)
('BIMETAL_close', 266.0775303643725)
('BIMETAL_open', 267.27899797570853)
('BINANICEM_close', 87.07222222222221)
('BINANICEM_open', 87.1978835978836)
('BINANIIND_close', 117.72643492320127)
('BINANIIND_open', 118.23771220695232)
('BINDALAGRO_close', 36.00545234248788)
('BINDALAGRO_open', 36.078957996768985)
('BIOCON_close', 360.4194017784965)
('BIOCON_open', 361.0068714632174)
('BIRLACORPN_close', 326.19757477768786)
('BIRLACORPN_open', 326.5177445432499)
# As we can observe, the output is arranged in ascending alphabetical
# order of symbol name, as desired.
# The header line was removed earlier, so it does not interfere with the results.
# Each key also indicates whether the average is of the OPEN or the CLOSE price.

Task 2 - Problem Statement
Using the MAP-REDUCE strategy, write SPARK code that will create
the average of HIGH prices for all the traded companies, but only for
any 3 months of your choice. Create the appropriate (K,V) pairs so that
the averages are simultaneously calculated, as in the above example.
Create the output files such that the final data is sorted in descending
order of the company names.
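Before the DataFrame-based solution below, here is an alternative, purely RDD-based sketch of the month selection; it assumes TIMESTAMP is the 11th comma-separated field (x[10]) in a form like "03-JAN-2011" and HIGH is the 4th field (x[3]); both are assumptions, since the exact column layout of nsedata.csv is not shown here.

# Sketch only (assumed column positions and date format, see note above)
lines = sc.textFile("/home/hduser/spark/nsedata.csv").filter(lambda x: "SYMBOL" not in x)
rows = lines.map(lambda x: x.split(","))
chosen_months = ("JAN", "FEB", "MAR")
rdd_high_3m = (rows.filter(lambda x: x[10].split("-")[1] in chosen_months)
                   .map(lambda x: (x[0], float(x[3]))))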
import findspark
findspark.init()
import pyspark
from pyspark.sql.types import *
from pyspark.sql.functions import year, month

sc = pyspark.SparkContext(appName="E10")
ss = pyspark.sql.SparkSession(sc)

# Read the CSV into a DataFrame so that the TIMESTAMP column can be filtered
# by year and month (header=True takes the column names from the first line).
# If TIMESTAMP is not inferred as a date, it should first be converted with
# to_date() before year()/month() are applied.
df = ss.read.csv("/home/hduser/spark/nsedata.csv", header=True, inferSchema=True)

# Keep only the 3 chosen months (here: January to March of 2011)
start_month = 1
end_month = 3
filtered_df = df.filter((year("TIMESTAMP") == 2011) &
                        (month("TIMESTAMP") >= start_month) &
                        (month("TIMESTAMP") <= end_month))

# (K,V) pairs: key = company symbol, value = HIGH price, so the averages for
# all companies are calculated simultaneously, as in the Task 1 example
rdd_high = filtered_df.rdd.map(lambda x: (x["SYMBOL"], float(x["HIGH"])))

# Sum of the HIGH prices for each symbol
reducedByKey = rdd_high.reduceByKey(lambda x, y: x + y)

# Count of records for each symbol, converted back into an RDD
temp1 = rdd_high.map(lambda x: (x[0], 1)).countByKey()
countOfEachSymbol = sc.parallelize(temp1.items())

# Join the sum with the count, then divide to obtain the average HIGH price
sum_count = reducedByKey.join(countOfEachSymbol)
average_high = sum_count.mapValues(lambda x: x[0] / x[1])

# Sort in descending order of the company names
sorted_results = average_high.sortByKey(ascending=False)

23/10/28 18:37:54 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
Header length: 14, schema size: 12
CSV file: file:///home/hduser/spark/nsedata.csv
sorted_results.saveAsTextFile("/home/hduser/spark/average_high_price")

# Stop the SparkSession and the SparkContext (both must be called with
# parentheses, otherwise only the bound method object is displayed
# instead of the context actually being stopped)
ss.stop()
sc.stop()
