
E10-1

This Notebook illustrates the use of "MAP-REDUCE" to calculate averages from the
data contained in nsedata.csv.

Task 1
You are required to review the code (refer to the SPARK document where necessary) and add comments / markup explaining the code in each cell. Also explain the role of each cell in the overall context of the solution to the problem (i.e. what the cell is trying to achieve in the overall scheme of things). You may create additional code in each cell to generate any debug output that you may need to complete this exercise.
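For example, a single short debug line such as the following keeps the output within the 5-line limit mentioned under Submission (a sketch only; rdd1 is the RDD name used later in this notebook):

# Sketch: print the first 5 elements of an RDD as short debug output
print(rdd1.take(5))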

Task 2
You are required to write code to solve the problem stated at the end of this Notebook.

Submission
Create and upload a PDF of this Notebook. BEFORE CONVERTING TO PDF and UPLOADING, ENSURE THAT YOU REMOVE / TRIM LENGTHY DEBUG OUTPUTS. Short debug outputs of up to
5 lines are acceptable.

# After making sure Spark is installed on the system, we import the
# 'findspark' library so that Python can locate the Spark installation
import findspark
# After successfully locating the library, we call the following function
# to make the Spark libraries available for import
findspark.init()

# Overall, this cell locates the Spark libraries for use in the rest of
# the program

# Now import pyspark, i.e. the Python library that enables us to work
# with Apache Spark
import pyspark
# This import statement provides access to the data types available in
# PySpark's SQL module
from pyspark.sql.types import *

# Overall, this cell lays the foundation for the upcoming usage of the
# data types and the PySpark libraries

# Initialise a SparkContext in PySpark with the application name "E10"
sc = pyspark.SparkContext(appName="E10")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/28 18:14:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

# Create an RDD from the CSV text file, with one element per line
rdd1 = sc.textFile("/home/hduser/spark/nsedata.csv")

# Apply a filter transformation: any element x containing the string
# "SYMBOL" (the header line) is removed from the RDD
rdd1 = rdd1.filter(lambda x: "SYMBOL" not in x)

# Using a lambda function, split each element x into substrings at the commas
rdd2 = rdd1.map(lambda x: x.split(","))
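As a quick illustration of what split(",") produces (a toy example with made-up values, not taken from nsedata.csv):

# Toy example: a CSV line becomes a list of field strings after split(",")
print("ABC,EQ,100.0,105.5,99.0,104.2".split(","))
# -> ['ABC', 'EQ', '100.0', '105.5', '99.0', '104.2']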

# Helper comment: the goal is to find the mean of the OPEN prices and the
# mean of the CLOSE prices in one batch of tasks

# From the single RDD we build two new key-value RDDs: for each element x,
# the OPEN column (x[2]) and the CLOSE column (x[5]) are converted to float
rdd_open = rdd2.map(lambda x: (x[0] + "_open", float(x[2])))
rdd_close = rdd2.map(lambda x: (x[0] + "_close", float(x[5])))
# x[0] + "_open" produces a key like "SYMBOL_open", i.e. the suffix is
# concatenated with the first element (the company symbol)
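A toy illustration of the key construction, using made-up values (not taken from the dataset):

# Toy example: building the "SYMBOL_open" / "SYMBOL_close" (K,V) pairs
row = ["ABC", "EQ", "100.0", "105.5", "99.0", "104.2"]
print((row[0] + "_open", float(row[2])), (row[0] + "_close", float(row[5])))
# -> ('ABC_open', 100.0) ('ABC_close', 104.2)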

# This unites the two individual RDDs created above into a single RDD
rdd_united = rdd_open.union(rdd_close)

# This transformation groups the RDD elements by key and applies the
# provided lambda function to combine (reduce) the values associated
# with the same key
reducedByKey = rdd_united.reduceByKey(lambda x, y: x + y)
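A toy illustration of how reduceByKey combines values sharing the same key (made-up values; the order of the collected pairs may vary):

# Toy example: reduceByKey sums the values per key, analogous to summing prices per symbol
toy = sc.parallelize([("ABC_open", 10.0), ("ABC_open", 12.0), ("XYZ_open", 5.0)])
print(toy.reduceByKey(lambda x, y: x + y).collect())
# -> [('ABC_open', 22.0), ('XYZ_open', 5.0)]  (order may vary)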

# Working on rdd_united, we map each of its elements to a new key-value
# pair, where the key is "SYMBOL_open" or "SYMBOL_close" and the value is 1
temp1 = rdd_united.map(lambda x: (x[0], 1)).countByKey()
# countByKey() counts the occurrences of each unique key and returns a
# Python dictionary of (key, count) pairs
countOfEachSymbol = sc.parallelize(temp1.items())
# Finally, the dictionary created in temp1 is transformed back into an
# RDD named countOfEachSymbol
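A toy illustration of countByKey (made-up values):

# Toy example: countByKey returns a dictionary of occurrences per key
toy = sc.parallelize([("ABC_open", 1), ("ABC_open", 1), ("XYZ_open", 1)])
print(dict(toy.countByKey()))
# -> {'ABC_open': 2, 'XYZ_open': 1}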

# symbol_sum_count is an RDD that combines, for each key, the sum of the
# prices and the count of occurrences: (key, (sum, count))
symbol_sum_count = reducedByKey.join(countOfEachSymbol)
# x[0] is the key (the symbol), x[1][0] retrieves the sum and x[1][1]
# retrieves the total count, so dividing them gives the average
averages = symbol_sum_count.map(lambda x: (x[0], x[1][0] / x[1][1]))
# This is the finalising step for calculating the average
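A toy illustration of the join-then-divide step (made-up values):

# Toy example: join pairs the sum with the count, then the map divides them
sums = sc.parallelize([("ABC_open", 22.0)])
counts = sc.parallelize([("ABC_open", 2)])
print(sums.join(counts).map(lambda x: (x[0], x[1][0] / x[1][1])).collect())
# -> [('ABC_open', 11.0)]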

# sortByKey() arranges the key-value pairs in ascending order of the key,
# i.e. the symbol name
averagesSorted = averages.sortByKey()

# Save the averages, in ascending order of symbol name, as text files in
# the directory /home/hduser/spark/averages
averagesSorted.saveAsTextFile("/home/hduser/spark/averages")

sc.stop()

Review the output files generated in the above step and copy the first
15 lines of any one of the output files into the cell below for reference.
Write your comments on the generated output
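One possible way to copy those lines (a sketch, not part of the original notebook; it assumes Spark's default output file name part-00000 inside the averages folder):

# Sketch: print the first 15 lines of one output part file
with open("/home/hduser/spark/averages/part-00000") as f:
    for i, line in enumerate(f):
        if i == 15:
            break
        print(line, end="")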
('BILPOWER_open', 46.46917454858125)
('BIL_close', 186.45586095392073)
('BIL_open', 186.1023848019402)
('BIMETAL_close', 266.0775303643725)
('BIMETAL_open', 267.27899797570853)
('BINANICEM_close', 87.07222222222221)
('BINANICEM_open', 87.1978835978836)
('BINANIIND_close', 117.72643492320127)
('BINANIIND_open', 118.23771220695232)
('BINDALAGRO_close', 36.00545234248788)
('BINDALAGRO_open', 36.078957996768985)
('BIOCON_close', 360.4194017784965)
('BIOCON_open', 361.0068714632174)
('BIRLACORPN_close', 326.19757477768786)
('BIRLACORPN_open', 326.5177445432499)
# As we can observe, the output is arranged in ascending alphabetical
# order of symbol name, as desired.
# The header line was removed earlier, so it does not interfere with the results.
# Each key also indicates whether the average is of the OPEN or the CLOSE price.

Task 2 - Problem Statement
Using the MAP-REDUCE strategy, write SPARK code that will create
the average of HIGH prices for all the traded companies, but only for
any 3 months of your choice. Create the appropriate (K,V) pairs so that
the averages are simultaneously calculated, as in the above example.
Create the output files such that the final data is sorted in descending
order of the company names.
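Before the DataFrame-based solution below, here is an alternative, purely RDD-based sketch of the month selection; it assumes TIMESTAMP is the 11th comma-separated field (x[10]) in a form like "03-JAN-2011" and HIGH is the 4th field (x[3]); both are assumptions, since the exact column layout of nsedata.csv is not shown here.

# Sketch only (assumed column positions and date format, see note above)
lines = sc.textFile("/home/hduser/spark/nsedata.csv").filter(lambda x: "SYMBOL" not in x)
rows = lines.map(lambda x: x.split(","))
chosen_months = ("JAN", "FEB", "MAR")
rdd_high_3m = (rows.filter(lambda x: x[10].split("-")[1] in chosen_months)
                   .map(lambda x: (x[0], float(x[3]))))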
import findspark
findspark.init()
import pyspark
from pyspark.sql.types import *
from pyspark.sql.functions import year, month

sc = pyspark.SparkContext(appName="E10")
ss = pyspark.sql.SparkSession(sc)

# Read the CSV into a DataFrame so that the TIMESTAMP column can be filtered
# by year and month (header=True takes the column names from the first line).
# If TIMESTAMP is not inferred as a date, it should first be converted with
# to_date() before year()/month() are applied.
df = ss.read.csv("/home/hduser/spark/nsedata.csv", header=True, inferSchema=True)

# Keep only the 3 chosen months (here: January to March of 2011)
start_month = 1
end_month = 3
filtered_df = df.filter((year("TIMESTAMP") == 2011) &
                        (month("TIMESTAMP") >= start_month) &
                        (month("TIMESTAMP") <= end_month))

# (K,V) pairs: key = company symbol, value = HIGH price, so the averages for
# all companies are calculated simultaneously, as in the Task 1 example
rdd_high = filtered_df.rdd.map(lambda x: (x["SYMBOL"], float(x["HIGH"])))

# Sum of the HIGH prices for each symbol
reducedByKey = rdd_high.reduceByKey(lambda x, y: x + y)

# Count of records for each symbol, converted back into an RDD
temp1 = rdd_high.map(lambda x: (x[0], 1)).countByKey()
countOfEachSymbol = sc.parallelize(temp1.items())

# Join the sum with the count, then divide to obtain the average HIGH price
sum_count = reducedByKey.join(countOfEachSymbol)
average_high = sum_count.mapValues(lambda x: x[0] / x[1])

# Sort in descending order of the company names
sorted_results = average_high.sortByKey(ascending=False)

23/10/28 18:37:54 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
Header length: 14, schema size: 12
CSV file: file:///home/hduser/spark/nsedata.csv
sorted_results.saveAsTextFile("/home/hduser/spark/average_high_price")

# Stop the SparkSession and the SparkContext (both must be called with
# parentheses, otherwise only the bound method object is displayed
# instead of the context actually being stopped)
ss.stop()
sc.stop()
