Spark ML for Finance Professionals

MACHINE LEARNING WITH PYTHON FOR FINANCE PROFESSIONALS

EXERCISE: 1
We can use Python to count how many times a word is used within a sentence or document.
This can be useful for text analysis projects and other types of reporting.

PROGRAM:
sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
# Let's handle the capital letters by making everything lowercase
sentence.lower()
# Remove the question mark
sentence.replace("?", "")
# We can use "method chaining" to do both in one go!
sentence.lower().replace("?", "")
# You can use the split function to extract the words from the sentence
sentence.split()
# Let's put it all together and assign to a variable ready to use in the next step
words = sentence.lower().replace("?", "").split()
words
# Create a dictionary ready to hold the words (as keys) and their counts (as values)
word_counts = {}
# Loop through the words
for word in words:
    # Is the word in the dictionary yet?
    if word in word_counts:
        # If it is, add 1 to the current count
        word_counts[word] += 1
    else:
        # If it isn't, put it in with a count of 1
        word_counts[word] = 1
print(word_counts)
output:
{'how': 1, 'much': 1, 'wood': 2, 'would': 1, 'a': 2, 'woodchuck': 2,
'chuck': 2,
'if': 1, 'could': 1}
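The counting loop above can also be written with the standard library's `collections.Counter`, which builds the same word-to-count dictionary in one step; a minimal sketch using the same sentence:

```python
from collections import Counter

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

# Same cleaning as above: lowercase, strip the question mark, split into words
words = sentence.lower().replace("?", "").split()

# Counter builds the word -> count mapping in one step
word_counts = Counter(words)

print(word_counts["wood"])       # 2
print(word_counts["woodchuck"])  # 2
```

`Counter` also supports `word_counts.most_common(n)`, which is handy for the "top N words" task in the next exercise.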
Exercise: 2
1. Adapt your `count_words()` function from the previous exercise so that it strips out
punctuation using the code tip provided above.
2. Try out your adapted function with a long-form piece of text of your choice, using triple
quotes (`"""`) to store this information in a single multi-line string variable.
3. Find the top 20 words in your chosen text using the `sort_dictionary()` function provided
above. (This returns a list, so you can use list slicing to reduce it down to just the top 20.)

Program:
def sort_dictionary(input_dictionary, reverse=True):
    return sorted(input_dictionary.items(), key=lambda x: x[1], reverse=reverse)

def count_words(sentence):
    words = (
        sentence.lower()
        .replace("?", "")  # remove ?
        .replace(".", "")  # remove .
        .replace(",", "")  # remove ,
        .split()
    )
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts
sentence="""
HELLO WORLD WELCOME TO MACHINE LEARNING LAB"""
sort_dictionary(count_words(sentence))
OUTPUT:
[('hello', 1),
('world', 1),
('welcome', 1),
('to', 1),
('machine', 1),
('learning', 1),
('lab', 1)]
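Step 3 of the exercise asks for the top 20 words; since `sort_dictionary()` returns a list of `(word, count)` tuples, list slicing does the job. A short sketch with the two functions above and a made-up multi-line text:

```python
def count_words(sentence):
    words = sentence.lower().replace("?", "").replace(".", "").replace(",", "").split()
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def sort_dictionary(input_dictionary, reverse=True):
    return sorted(input_dictionary.items(), key=lambda x: x[1], reverse=reverse)

text = """the cat sat on the mat.
the cat slept."""

# Slicing keeps at most the first 20 (word, count) pairs
top_20 = sort_dictionary(count_words(text))[:20]
print(top_20[0])  # ('the', 3)
```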

Exercise: 3

So far we've been working with only 2019 data. Let's now read in the full Dream Destination
dataset containing 66k orders across a ten-year period from 2010-2019.
> 1. Using `pd.read_excel()` read in the `"Order Database"` sheet from the file `"Hotel
Industry - Order and Finance [Link]"`.
> 2. As before, assign it to a variable called `orders`.

PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# Read in the Dream Destination hotel data
orders = pd.read_excel("[Link]", sheet_name="Order Database")
orders.head(3)
finance = pd.read_excel("[Link]", sheet_name="[Link]")  # the matching finance sheet (details elided in the source)
df = pd.merge(left=orders, right=finance, on='Booking ID', how='left')
len(df)
df.columns
df.head()
df[['Total Booking Amount', 'Discount Amount', 'Net Sales']].head(3)
sns.barplot(x='Total Booking Amount', y='Origin Country', estimator=sum, data=df);

output:
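The `pd.merge()` call in the program joins the orders and finance sheets on their shared `Booking ID` key; a left join keeps every order even when no finance row matches. A toy sketch of the same pattern (the column values here are illustrative, not from the real dataset):

```python
import pandas as pd

orders = pd.DataFrame({"Booking ID": [1, 2, 3],
                       "Origin Country": ["UK", "US", "UK"]})
finance = pd.DataFrame({"Booking ID": [1, 2],
                        "Net Sales": [120.0, 250.0]})

# Left join: all 3 orders survive; booking 3 gets NaN for Net Sales
df = pd.merge(left=orders, right=finance, on="Booking ID", how="left")
print(len(df))  # 3
```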
Exercise: 4
> Create a Seaborn barplot showing Total Booking Amount (x-axis) by Origin Country (y-axis).

Program:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
sns.barplot(x='Total Booking Amount', y='Origin Country', estimator=sum, data=df)
plt.figure(figsize=(15,10))
# top_states: a pre-aggregated dataframe of revenue by state (its construction is not shown in the source)
sns.barplot(
    x='Revenue',
    y='State',
    hue='Origin Country',
    dodge=False,
    orient='h',
    data=top_states
);

Output:
APACHE SPARK FOR DATA ENGINEERING AND MACHINE LEARNING

Exercise: 1
Building and Training a Prediction Model using Linear Regression
a. Load a dataset (diamond dataset)
b. Identify the target column and data column
c. Build and Train a new linear regression model
d. Evaluate the model
e. Predict the Price of the diamond

AIM: To build and train a prediction model using linear regression.


a. Loading a dataset (diamond dataset)
b. Identifying the target column and data column
c. Building and Training a new linear regression model
d. Evaluating the model
e. Predicting the Price of the diamond

DATASET:
Diamonds dataset. Available on OpenML (dataset id 42225): [Link]
PROGRAM:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = "[Link]SkillsNetwork/datasets/[Link]"
df = pd.read_csv(data)
target = df["price"]
features = df[["carat", "depth"]]
lr = LinearRegression()
lr.fit(features, target)
print("Model Score:", lr.score(features, target))
print("Predicted price of the diamond:", lr.predict([[0.3, 60]]))
OUTPUT:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,normalize=False)

Model Score: 0.8506754571636563

Predicted price of the diamond: [244.95605225]
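What `LinearRegression.fit()` computes can be illustrated without scikit-learn: for a single feature, ordinary least squares has a closed-form slope and intercept. A minimal sketch on made-up, perfectly linear carat/price points (not the diamonds data):

```python
# Closed-form simple linear regression: y = intercept + slope * x
def fit_ols(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data: price grows linearly with carat (price = 2000 * carat)
carats = [0.2, 0.3, 0.4, 0.5]
prices = [400.0, 600.0, 800.0, 1000.0]
intercept, slope = fit_ols(carats, prices)
print(round(intercept), round(slope))  # 0 2000
```

With two features, as in the exercise, scikit-learn solves the multivariate version of the same least-squares problem.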


Exercise: 2
Build a Classifier Model using Logistic Regression
a. Load a Dataset
b. Identify the target column and data column
c. Build and Train a new classifier
d. Evaluate the model
e. Find out if a tumor is cancerous

AIM: Building a Classifier Model using Logistic Regression


a. Loading a Dataset
b. Identifying the target column and data column
c. Building and Training a new classifier
d. Evaluating the model
e. Finding out if a tumor is cancerous

Dataset: [Link]

PROGRAM:
import pandas as pd
from sklearn.linear_model import LogisticRegression
data = "[Link]SkillsNetwork/datasets/[Link]"
df = pd.read_csv(data)
target = df["diagnosis"]
features = df[['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
               'compactness_mean', 'concavity_mean', 'symmetry_mean']]
classifier = LogisticRegression()
classifier.fit(features, target)
classifier.score(features, target)
classifier.predict([[13.45, 86.6, 555.1, 0.1022, 0.08165, 0.03974, 0.1638]])
OUTPUT:
LogisticRegression()
0.8963093145869947
array(['Benign'], dtype=object)
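Under the hood, `LogisticRegression.predict()` applies the sigmoid function to a weighted sum of the features and thresholds the resulting probability at 0.5. A sketch with made-up weights and a bias (not the fitted model's actual coefficients), using two hypothetical features:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    # Weighted sum of features plus bias, squashed into a probability
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)  # probability of the positive (malignant) class
    return "Malignant" if p >= threshold else "Benign"

# Hypothetical weights for two features (e.g. radius_mean, area_mean)
weights = [0.4, 0.01]
bias = -12.0
print(predict_label([13.45, 555.1], weights, bias))  # Benign
```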
Exercise: 3
Connecting to Spark Cluster using SN Labs
A. Create a Spark Session
B. Load the dataset into a dataframe
C. Explore the data
D. Print the top 5 rows of the dataframe
E. Stop the Spark session

Aim: Connecting to Spark Cluster using SN Labs (external resource)


A. Creating a Spark Session
B. Loading the dataset into a dataframe
C. Exploring the data
D. Printing the top 5 rows of the dataframe
E. Stopping the Spark session

Dataset:
Download the dataset from: "[Link]/IBM-BD0231EN-SkillsNetwork/datasets/[Link]"
Program:
# import findspark
import findspark
findspark.init()
# import SparkSession
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("Getting Started with Spark").getOrCreate()
# Download the data file
!wget [Link]SkillsNetwork/datasets/[Link]
# Load the diamond dataset into a dataframe named diamond_data
diamond_data = spark.read.csv("[Link]", header=True, inferSchema=True)
# Print the schema of the dataframe
diamond_data.printSchema()
# Print the top 5 rows of the dataframe
diamond_data.head(5)
# Stop the Spark session
spark.stop()
OUTPUT:
root
|-- s: integer (nullable = true)
|-- carat: double (nullable = true)
|-- cut: string (nullable = true)
|-- color: string (nullable = true)
|-- clarity: string (nullable = true)
|-- depth: double (nullable = true)
|-- table: double (nullable = true)
|-- price: integer (nullable = true)
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- z: double (nullable = true)

[Row(s=1, carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55.0, price=326, x=3.95, y=3.98, z=2.43),
 Row(s=2, carat=0.21, cut='Premium', color='E', clarity='SI1', depth=59.8, table=61.0, price=326, x=3.89, y=3.84, z=2.31),
 Row(s=3, carat=0.23, cut='Good', color='E', clarity='VS1', depth=56.9, table=65.0, price=327, x=4.05, y=4.07, z=2.31),
 Row(s=4, carat=0.29, cut='Premium', color='I', clarity='VS2', depth=62.4, table=58.0, price=334, x=4.2, y=4.23, z=2.63),
 Row(s=5, carat=0.31, cut='Good', color='J', clarity='SI2', depth=63.3, table=58.0, price=335, x=4.34, y=4.35, z=2.75)]
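The `header=True, inferSchema=True` options tell Spark to treat the first CSV line as column names and to guess each column's type from its values (which is why `s` and `price` come out as integers and `carat` as a double). The same idea can be sketched with the standard `csv` module; this is a simplified illustration, not Spark's actual inference algorithm:

```python
import csv
import io

raw = "s,carat,cut,price\n1,0.23,Ideal,326\n2,0.21,Premium,326\n"

def infer(value):
    # Try integer, then float, fall back to string (simplified inference)
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # first row becomes the column names
rows = [dict(zip(header, map(infer, row))) for row in reader]
print(rows[0])  # {'s': 1, 'carat': 0.23, 'cut': 'Ideal', 'price': 326}
```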
Exercise: 4
Regression using SparkML
A. Create a spark session
B. Load the data in a csv file into a dataframe
C. Identify the label column and the input columns
D. Split the data
E. Build and Train a Linear Regression Model
F. Evaluate the model
AIM: Regression using SparkML
A. Create a spark session
B. Load the data in a csv file into a dataframe
C. Identify the label column and the input columns
D. Split the data
E. Build and Train a Linear Regression Model
F. Evaluate the model

Dataset: Download the dataset: [Link]/IBM-BD0231EN-SkillsNetwork/datasets/[Link]
Program:
# install libraries
!pip install pyspark==3.1.2 -q
!pip install findspark -q
# Ignore warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
# Import required libraries
import findspark
findspark.init()
from pyspark.sql import SparkSession
# import functions/classes for sparkml
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# import functions/classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator
# Create a spark session with appName "Diamond Price Prediction"
spark = SparkSession.builder.appName("Diamond Price Prediction").getOrCreate()
# Download the dataset
!wget [Link]SkillsNetwork/datasets/[Link]
# Load the dataset into a spark dataframe
diamond_data = spark.read.csv("[Link]", header=True, inferSchema=True)
# Display sample data from the dataset
diamond_data.show(5)
# Identify the label column and the input columns:
# use the price column as the label column
# use the columns carat, depth and table as features
# Assemble the columns carat, depth and table into a single column named "features"
assembler = VectorAssembler(inputCols=["carat", "depth", "table"], outputCol="features")
diamond_transformed_data = assembler.transform(diamond_data)
# Print the vectorized features and label columns
diamond_transformed_data.select("features", "price").show()
# Split the dataset into training and testing sets in the ratio of 70:30
(training_data, testing_data) = diamond_transformed_data.randomSplit([0.7, 0.3])
# Build a linear regression model and train it
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(training_data)
# Predict the values using the test data
predictions = model.transform(testing_data)
# Print the metrics: R squared, mean absolute error, root mean squared error
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE =", mae)
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE =", rmse)

Output:

+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| s|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 4| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
+----------------+-----+
| features|price|
+----------------+-----+
|[0.23,61.5,55.0]| 326|
|[0.21,59.8,61.0]| 326|
|[0.23,56.9,65.0]| 327|
|[0.29,62.4,58.0]| 334|
|[0.31,63.3,58.0]| 335|
|[0.24,62.8,57.0]| 336|
|[0.24,62.3,57.0]| 336|
|[0.26,61.9,55.0]| 337|
|[0.22,65.1,61.0]| 337|
|[0.23,59.4,61.0]| 338|
| [0.3,64.0,55.0]| 339|
|[0.23,62.8,56.0]| 340|
|[0.22,60.4,61.0]| 342|
|[0.31,62.2,54.0]| 344|
| [0.2,60.2,62.0]| 345|
|[0.32,60.9,58.0]| 345|
| [0.3,62.0,54.0]| 348|
| [0.3,63.4,54.0]| 351|
| [0.3,63.8,56.0]| 351|
| [0.3,62.7,59.0]| 351|
+----------------+-----+

R Squared = 0.8521786458835734
MAE = 993.2267089991868
RMSE = 1513.0984941194174
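The three metrics printed above have simple definitions, which `RegressionEvaluator` computes over the prediction column. A plain-Python sketch with a handful of made-up (label, prediction) pairs, not taken from the actual model output:

```python
import math

# Made-up (label, prediction) pairs for illustration
pairs = [(326, 330.0), (327, 325.0), (334, 332.0), (335, 338.0)]
labels = [y for y, _ in pairs]
n = len(pairs)
mean_y = sum(labels) / n

# MAE: average absolute error
mae = sum(abs(y - p) for y, p in pairs) / n
# RMSE: square root of the average squared error
rmse = math.sqrt(sum((y - p) ** 2 for y, p in pairs) / n)
# R squared: 1 - residual sum of squares / total sum of squares
r2 = 1 - sum((y - p) ** 2 for y, p in pairs) / sum((y - mean_y) ** 2 for y in labels)

print(round(mae, 2), round(rmse, 2))  # 2.75 2.87
```

RMSE penalizes large errors more heavily than MAE (note RMSE ≥ MAE here), which is why the model's RMSE above (1513) exceeds its MAE (993).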
Exercise: 5
ETL using Apache Spark
A. Extract
B. Transform
C. Load

Aim: ETL using Apache Spark


A. Extract
B. Transform
C. Load

Dataset: student_transformed.csv, provided with the IBM Skills Network BD0231EN lab files.

Program:

# Install required libraries
!pip install pyspark==3.1.2 -q
!pip install findspark -q

# Ignore warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, col, round

# Create SparkSession
spark = SparkSession.builder.appName("Exercises - ETL using Spark").getOrCreate()

# Extract
# Load data from student_transformed.csv into a dataframe
df = spark.read.csv("student_transformed.csv", header=True, inferSchema=True)
# display dataframe
df.show()

# Transform
# Convert cm to meters: divide the column height_cm by 100 into a new column height_meters
df = df.withColumn("height_meters", expr("height_cm / 100"))
# display dataframe
df.show()

# Create a column named bmi
# BMI = weight / (height * height)
# weight must be in kg, height must be in meters
df = df.withColumn("bmi", expr("weight_kg / (height_meters * height_meters)"))
# display dataframe
df.show()

# Drop the columns height_cm, weight_kg and height_meters
df = df.drop("height_cm", "weight_kg", "height_meters")
# display dataframe
df.show()

# Round the bmi column into a new column bmi_rounded
df = df.withColumn("bmi_rounded", round(col("bmi")))
df.show()

# Load
# Save the dataframe into a parquet file
# Write the data to a Parquet file, set the mode to overwrite
df.write.mode("overwrite").parquet("student_transformed.parquet")

# Stop the Spark session
spark.stop()

Output:

+--------+---------+---------+
| student|height_cm|weight_kg|
+--------+---------+---------+
|student6| 157.48| 38.55532|
|student3| 175.26| 43.09124|
|student2| 149.86| 45.3592|
|student7| 165.1| 36.28736|
|student1| 162.56| 40.82328|
|student5| 152.4| 36.28736|
+--------+---------+---------+
+--------+---------+---------+------------------+
| student|height_cm|weight_kg| height_meters|
+--------+---------+---------+------------------+
|student6| 157.48| 38.55532| 1.5748|
|student3| 175.26| 43.09124| 1.7526|
|student2| 149.86| 45.3592|1.4986000000000002|
|student7| 165.1| 36.28736| 1.651|
|student1| 162.56| 40.82328| 1.6256|
|student5| 152.4| 36.28736| 1.524|
+--------+---------+---------+------------------+

+--------+---------+---------+------------------+------------------+
| student|height_cm|weight_kg| height_meters| bmi|
+--------+---------+---------+------------------+------------------+
|student6| 157.48| 38.55532| 1.5748|15.546531093062187|
|student3| 175.26| 43.09124| 1.7526|14.028892161964118|
|student2| 149.86| 45.3592|1.4986000000000002|20.197328530250278|
|student7| 165.1| 36.28736| 1.651|13.312549228648752|
|student1| 162.56| 40.82328| 1.6256|15.448293591899683|
|student5| 152.4| 36.28736| 1.524|15.623755691955827|
+--------+---------+---------+------------------+------------------+

+--------+------------------+
| student| bmi|
+--------+------------------+
|student6|15.546531093062187|
|student3|14.028892161964118|
|student2|20.197328530250278|
|student7|13.312549228648752|
|student1|15.448293591899683|
|student5|15.623755691955827|
+--------+------------------+

+--------+------------------+-----------+
| student| bmi|bmi_rounded|
+--------+------------------+-----------+
|student6|15.546531093062187| 16.0|
|student3|14.028892161964118| 14.0|
|student2|20.197328530250278| 20.0|
|student7|13.312549228648752| 13.0|
|student1|15.448293591899683| 15.0|
|student5|15.623755691955827| 16.0|
+--------+------------------+-----------+
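The Transform step can be checked by hand: height in cm divided by 100 gives meters, and BMI is weight over height squared. A quick sketch reproducing student6's row from the tables above:

```python
# Reproduce student6's bmi from the output tables above
height_cm = 157.48
weight_kg = 38.55532

height_meters = height_cm / 100          # 1.5748
bmi = weight_kg / (height_meters ** 2)   # same formula the Spark expr() applies

print(round(bmi, 6))  # 15.546531
print(round(bmi))     # 16, the bmi_rounded value
```

Note that Spark's `round(col("bmi"))` returns a double (16.0 in the table), while Python's built-in `round` returns an int here; the numeric value is the same.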
