032 Linear Regression With Time Series Data

Uploaded by

Nguyễn Đăng

032-linear-regression-with-time-series-data

April 25, 2022

Linear Regression with Time Series Data

[1]: import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import pytz
from IPython.display import VimeoVideo
from pymongo import MongoClient
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

[2]: VimeoVideo("665412117", h="c39a50bd58", width=600)

[2]: <IPython.lib.display.VimeoVideo at 0x7f3a905397f0>

1 Prepare Data
1.1 Import
[3]: VimeoVideo("665412469", h="135f32c7da", width=600)

[3]: <IPython.lib.display.VimeoVideo at 0x7f3a90539cd0>

Task 3.2.1: Complete the code to create a client to connect to the MongoDB server, assign the
"air-quality" database to db, and assign the "nairobi" collection to nairobi.
• Create a client object for a MongoDB instance.
• Access a database using PyMongo.
• Access a collection in a database using PyMongo.
[4]: client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
nairobi = db["nairobi"]

[5]: VimeoVideo("665412480", h="c20ed3e570", width=600)

[5]: <IPython.lib.display.VimeoVideo at 0x7f39b4ad8910>

Task 3.2.2: Complete the wrangle function below so that the results from the database query
are read into the DataFrame df. Be sure that the index of df is the "timestamp" from the results.
• Create a DataFrame from a dictionary using pandas.
[6]: def wrangle(collection):
    # Query the database for PM2.5 ("P2") readings from site 29
    results = collection.find(
        {"metadata.site": 29, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )

    # Read the results into a DataFrame indexed by timestamp
    df = pd.DataFrame(results).set_index("timestamp")

    # Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")

    # Remove outliers
    df = df[df["P2"] < 500]

    # Resample to 1-hour windows, forward-fill missing values
    df = df["P2"].resample("1h").mean().ffill().to_frame()

    # Add lag feature
    df["P2.L1"] = df["P2"].shift(1)

    # Drop NaN rows
    df.dropna(inplace=True)

    return df

[7]: VimeoVideo("665412496", h="d757475f7c", width=600)

[7]: <IPython.lib.display.VimeoVideo at 0x7f39b4ad8400>

Task 3.2.3: Use your wrangle function to read the data from the nairobi collection into the
DataFrame df.
[8]: df = wrangle(nairobi)
df.head(10)
df.shape

[8]: (2927, 2)

[9]: # Check your work

assert any([isinstance(df, pd.DataFrame), isinstance(df, pd.Series)])
assert len(df) <= 32907
assert isinstance(df.index, pd.DatetimeIndex)

[10]: VimeoVideo("665412520", h="e03eefff07", width=600)

[10]: <IPython.lib.display.VimeoVideo at 0x7f39b4ad89d0>

Task 3.2.4: Add to your wrangle function so that the DatetimeIndex for df is localized to the
correct timezone, "Africa/Nairobi". Don’t forget to re-run all the cells above after you change
the function.
• Localize a timestamp to another timezone using pandas.
[11]: # Check your work
assert df.index.tzinfo == pytz.timezone("Africa/Nairobi")
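
The two-step localization used in wrangle can be seen on a tiny example. This is only a sketch: the timestamps below are made up, and it assumes (as the wrangle function does) that the stored timestamps are naive UTC values.

```python
import pandas as pd

# Hypothetical naive timestamps, standing in for the values read from MongoDB.
idx = pd.DatetimeIndex(["2018-09-01 00:00:00", "2018-09-01 01:00:00"])

# Step 1: declare that the naive timestamps are UTC (the clock values do not change).
idx_utc = idx.tz_localize("UTC")

# Step 2: convert to Nairobi time (UTC+3). The instants are the same, only the labels shift.
idx_nairobi = idx_utc.tz_convert("Africa/Nairobi")

print(idx_utc[0])
print(idx_nairobi[0])
```

Midnight UTC is shown as 03:00 in Nairobi. tz_localize attaches a timezone to naive timestamps, while tz_convert moves timestamps that already have one.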

1.2 Explore
[12]: VimeoVideo("665412546", h="97792cb982", width=600)

[12]: <IPython.lib.display.VimeoVideo at 0x7f39ae35d2e0>

Task 3.2.5: Create a boxplot of the "P2" readings in df.

• Create a boxplot using pandas.
[13]: fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].plot(kind="box", vert=False, title="Distribution of PM2.5 Readings", ax=ax);

[14]: VimeoVideo("665412573", h="b46049021b", width=600)

[14]: <IPython.lib.display.VimeoVideo at 0x7f39b415fcd0>

Task 3.2.6: Add to your wrangle function so that all "P2" readings above 500 are dropped from
the dataset. Don’t forget to re-run all the cells above after you change the function.
• Subset a DataFrame with a mask using pandas.
[15]: # Check your work
assert len(df) <= 32906
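
A minimal sketch of subsetting with a boolean mask, using made-up readings. Note that the mask in wrangle, `df["P2"] < 500`, also drops a reading of exactly 500, whereas the task wording ("above 500") would keep it; the names readings, strict and inclusive are illustrative only.

```python
import pandas as pd

# Hypothetical readings; 640.0 is an obvious outlier, 500.0 sits on the boundary.
readings = pd.DataFrame({"P2": [8.5, 12.0, 640.0, 9.1, 500.0]})

# Mask as written in wrangle: keeps only values strictly below 500.
strict = readings[readings["P2"] < 500]

# Alternative reading of the task: drop only values above 500.
inclusive = readings[readings["P2"] <= 500]

print(len(strict), len(inclusive))
```

On real sensor data the boundary case is unlikely to matter, but it is worth knowing which comparison the mask uses.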

[16]: VimeoVideo("665412594", h="e56c2f6839", width=600)

[16]: <IPython.lib.display.VimeoVideo at 0x7f39b41770d0>

Task 3.2.7: Create a time series plot of the "P2" readings in df.
• Create a line plot using pandas.
[17]: fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].plot(xlabel="Time", ylabel="PM2.5", title="PM2.5 Time Series", ax=ax)

[17]: <AxesSubplot:title={'center':'PM2.5 Time Series'}, xlabel='Time', ylabel='PM2.5'>

[18]: VimeoVideo("665412601", h="a16c5a73fc", width=600)

[18]: <IPython.lib.display.VimeoVideo at 0x7f39b4057bb0>

Task 3.2.8: Add to your wrangle function to resample df to provide the mean "P2" reading for
each hour. Use a forward fill to impute any missing values. Don’t forget to re-run all the cells
above after you change the function.
• Resample time series data in pandas.
• Impute missing time series values using pandas.

[19]: # Check your work
assert len(df) <= 2928
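
To see what resampling and forward filling do, here is a small sketch with three made-up readings and an empty hour in between. The values are hypothetical; "1h" is the lowercase spelling of the hourly alias, which newer pandas versions prefer over "1H".

```python
import pandas as pd

idx = pd.to_datetime([
    "2018-09-01 00:05", "2018-09-01 00:35",  # two readings in hour 00
    "2018-09-01 02:10",                      # no reading at all in hour 01
])
p2 = pd.Series([10.0, 14.0, 20.0], index=idx, name="P2")

# Mean reading per hour; the empty hour becomes NaN.
hourly = p2.resample("1h").mean()

# Forward fill: each NaN takes the last value seen before it.
filled = hourly.ffill()

print(hourly)
print(filled)
```

Hour 01 has no data, so its mean is NaN until the forward fill copies the hour 00 value into it.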

[20]: VimeoVideo("665412649", h="d2e99d2e75", width=600)

[20]: <IPython.lib.display.VimeoVideo at 0x7f39b4057e80>

Task 3.2.9: Plot the rolling average of the "P2" readings in df. Use a window size of 168 (the
number of hours in a week).
• What’s a rolling average?
• Calculate a rolling average in pandas.
• Create a line plot using pandas.
[21]: fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].rolling(168).mean().plot(ax=ax, ylabel="PM2.5", title="Weekly Rolling Average");
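
A rolling average replaces each point with the mean of the most recent window of observations. The sketch below uses a made-up series of 200 hourly values to show one consequence of a 168-hour window: the first 167 points have no full window behind them and come out as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series: the values 0, 1, 2, ..., 199.
idx = pd.date_range("2018-09-01", periods=200, freq="h")
p2 = pd.Series(np.arange(200, dtype=float), index=idx, name="P2")

# Mean of the current value and the 167 values before it.
weekly = p2.rolling(168).mean()

print(weekly.isna().sum())   # points without a full window
print(weekly.dropna().head())
```

This is why the weekly rolling average plot starts about a week after the first reading.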

[22]: VimeoVideo("665412693", h="c3bca16aff", width=600)

[22]: <IPython.lib.display.VimeoVideo at 0x7f39aeded400>

Task 3.2.10: Add to your wrangle function to create a column called "P2.L1" that contains the
mean "P2" reading from the previous hour. Since this new feature will create NaN values in your
DataFrame, be sure to also drop null rows from df.
• Shift the index of a Series in pandas.
• Drop rows with missing values from a DataFrame using pandas.
[23]: # Check your work
assert len(df) <= 11686
assert df.shape[1] == 2
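
The lag feature is easiest to follow on a handful of rows. This sketch uses four made-up hourly values; df_demo is an illustrative name, not part of the notebook.

```python
import pandas as pd

idx = pd.date_range("2018-09-01", periods=4, freq="h")
df_demo = pd.DataFrame({"P2": [10.0, 12.0, 11.0, 15.0]}, index=idx)

# shift(1) moves every value down one row, so each hour sees the previous hour's reading.
df_demo["P2.L1"] = df_demo["P2"].shift(1)

# The first row has no previous hour, so its lag is NaN and the row is dropped.
df_demo = df_demo.dropna()

print(df_demo)
```

One row is lost per lag step, which is why the row count falls by one after this feature is added.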

[24]: VimeoVideo("665412732", h="059e4088c5", width=600)

[24]: <IPython.lib.display.VimeoVideo at 0x7f39b405f160>

Task 3.2.11: Create a correlation matrix for df.

• Create a correlation matrix in pandas.
[25]: df.corr()

[25]: P2 P2.L1
P2 1.000000 0.650679
P2.L1 0.650679 1.000000
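
The off-diagonal entry of this matrix is the lag-1 autocorrelation of the series. As a sketch on synthetic data (the series below is simulated, not the Nairobi readings), the same number can be obtained from pandas' Series.autocorr:

```python
import numpy as np
import pandas as pd

# Simulated series in which each value depends partly on the previous one.
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
values = np.empty(500)
values[0] = noise[0]
for t in range(1, 500):
    values[t] = 0.6 * values[t - 1] + noise[t]
s = pd.Series(values, name="P2")

# Route 1: build the lag column and read the correlation matrix.
frame = pd.DataFrame({"P2": s, "P2.L1": s.shift(1)}).dropna()
from_matrix = frame.corr().loc["P2", "P2.L1"]

# Route 2: ask pandas for the lag-1 autocorrelation directly.
from_autocorr = s.autocorr(lag=1)

print(from_matrix, from_autocorr)
```

Both routes give the same value, so df.corr() on a series and its lag is a quick autocorrelation check.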

[26]: VimeoVideo("665412741", h="7439cb107c", width=600)

[26]: <IPython.lib.display.VimeoVideo at 0x7f39aee0c4f0>

Task 3.2.12: Create a scatter plot that shows PM2.5 mean reading for each hour as a function of
the mean reading from the previous hour. In other words, "P2.L1" should be on the x-axis, and
"P2" should be on the y-axis. Don’t forget to label your axes!
• Create a scatter plot using Matplotlib.
[27]: fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x=df["P2.L1"],y=df["P2"])
ax.plot([0,120],[0,120], linestyle="--", color="orange")
plt.xlabel("P2.L1")
plt.ylabel("P2")
plt.title("PM2.5 Autocorrelation")

[27]: Text(0.5, 1.0, 'PM2.5 Autocorrelation')

1.3 Split
[28]: VimeoVideo("665412762", h="a5eba496f7", width=600)

[28]: <IPython.lib.display.VimeoVideo at 0x7f39aed80370>

Task 3.2.13: Split the DataFrame df into the feature matrix X and the target vector y. Your
target is "P2".
• Subset a DataFrame by selecting one or more columns in pandas.
• Select a Series from a DataFrame in pandas.
[29]: target = "P2"
y = df[target]

# features=["P2.L1"]

# X_train= df[features]
# X_train.head()

X = df.drop(columns=target)

[30]: VimeoVideo("665412785", h="03118eda71", width=600)

[30]: <IPython.lib.display.VimeoVideo at 0x7f39aed737f0>

Task 3.2.14: Split X and y into training and test sets. The first 80% of the data should be in your
training set. The remaining 20% should be in the test set.
• Divide data into training and test sets in pandas.
[31]: cutoff = int(len(X) * 0.8)

X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]

X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]

#len(X_train) + len(X_test) == len(X)
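
The split is by position rather than random because the data is a time series: a shuffled split would let the model train on hours that come after the ones it is tested on. A small sketch with ten made-up rows (X_demo, y_demo and the other names are illustrative) shows the property to check:

```python
import pandas as pd

idx = pd.date_range("2018-09-01", periods=10, freq="h")
X_demo = pd.DataFrame({"P2.L1": range(10)}, index=idx)
y_demo = pd.Series(range(1, 11), index=idx, name="P2")

# First 80% of rows for training, the remaining 20% for testing.
cutoff = int(len(X_demo) * 0.8)
X_tr, y_tr = X_demo.iloc[:cutoff], y_demo.iloc[:cutoff]
X_te, y_te = X_demo.iloc[cutoff:], y_demo.iloc[cutoff:]

# Every training timestamp is earlier than every test timestamp.
print(X_tr.index.max() < X_te.index.min())
print(len(X_tr), len(X_te))
```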

2 Build Model
2.1 Baseline
Task 3.2.15: Calculate the baseline mean absolute error for your model.
• Calculate summary statistics for a DataFrame or Series in pandas.
[32]: y_pred_baseline = [(y_train.mean())] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

print("Mean P2 Reading:", round(y_train.mean(), 2))

print("Baseline MAE:", round(mae_baseline, 2))

Mean P2 Reading: 9.27

Baseline MAE: 3.89
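
The baseline always predicts the training mean, and its MAE is the average absolute distance of the readings from that mean. A sketch with four made-up values, computed by hand with NumPy instead of mean_absolute_error:

```python
import numpy as np

# Hypothetical target values.
y_true = np.array([7.0, 9.0, 11.0, 13.0])

# Baseline: predict the mean (10.0) for every observation.
baseline = np.full_like(y_true, y_true.mean())

# MAE: mean of the absolute errors |7-10|, |9-10|, |11-10|, |13-10|.
mae = np.mean(np.abs(y_true - baseline))

print(mae)
```

A model is only useful if its MAE is lower than this baseline figure.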

2.2 Iterate
Task 3.2.16: Instantiate a LinearRegression model named model, and fit it to your training
data.
• Instantiate a predictor in scikit-learn.
• Fit a model to training data in scikit-learn.
[33]: #Instantiate
model = LinearRegression()
#Fit

model.fit(X_train, y_train)

[33]: LinearRegression()
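
With a single feature, fitting a LinearRegression amounts to finding the intercept and slope of the line that minimizes squared error. The sketch below is not scikit-learn's code: it solves the same ordinary least squares problem with NumPy on made-up, noise-free numbers so the recovered values are easy to check.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 3.0 + 0.5 * x.
x = np.array([5.0, 8.0, 10.0, 12.0, 15.0])
y = 3.0 + 0.5 * x

# Design matrix with a column of ones for the intercept and a column for the feature.
A = np.column_stack([np.ones_like(x), x])

# Least squares solution; the first returned item holds [intercept, slope].
(intercept, slope), *_ = np.linalg.lstsq(A, y, rcond=None)

print(round(float(intercept), 2), round(float(slope), 2))
```

With real, noisy data the fitted line no longer passes through every point, but the same two numbers describe it.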

2.3 Evaluate
[34]: VimeoVideo("665412844", h="129865775d", width=600)

[34]: <IPython.lib.display.VimeoVideo at 0x7f39aed249d0>

Task 3.2.17: Calculate the training and test mean absolute error for your model.
• Generate predictions using a trained model in scikit-learn.
• Calculate the mean absolute error for a list of predictions in scikit-learn.
[35]: training_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("Training MAE:", round(training_mae, 2))
print("Test MAE:", round(test_mae, 2))

Training MAE: 2.46

Test MAE: 1.8

3 Communicate Results
Task 3.2.18: Extract the intercept and coefficient from your model.
• Access an object in a pipeline in scikit-learn.
[36]: intercept = round(model.intercept_,2)
coefficient = round(model.coef_[0],2)

print(f"P2 = {intercept} + ({coefficient} * P2.L1)")

P2 = 3.36 + (0.64 * P2.L1)

[37]: VimeoVideo("665412870", h="318d69683e", width=600)

[37]: <IPython.lib.display.VimeoVideo at 0x7f39aed33190>

Task 3.2.19: Create a DataFrame df_pred_test that has two columns: "y_test" and "y_pred".
The first should contain the true values for your test set, and the second should contain your
model’s predictions. Be sure the index of df_pred_test matches the index of y_test.
• Create a DataFrame from a dictionary using pandas.
[38]: df_pred_test = pd.DataFrame(
    {
        "y_test": y_test,
        "y_pred": model.predict(X_test),
    }
)
df_pred_test.head()

[38]: y_test y_pred

timestamp
2018-12-07 17:00:00+03:00 7.070000 8.478927
2018-12-07 18:00:00+03:00 8.968333 7.865485
2018-12-07 19:00:00+03:00 11.630833 9.076421
2018-12-07 20:00:00+03:00 11.525833 10.774814
2018-12-07 21:00:00+03:00 9.533333 10.707836
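
The predictions in this table can be checked by hand against the printed equation P2 = 3.36 + (0.64 * P2.L1). Because "P2.L1" is the previous hour's reading, the y_test value in one row is the input for the y_pred value in the next row. The sketch below copies the numbers from the output above; the intercept and coefficient are the rounded values, so only approximate agreement is expected.

```python
intercept, coefficient = 3.36, 0.64  # rounded values from the printed equation

# (previous-hour y_test, y_pred for the following hour), copied from the table.
rows = [
    (7.070000, 7.865485),    # 17:00 reading -> prediction for 18:00
    (8.968333, 9.076421),    # 18:00 reading -> prediction for 19:00
    (11.630833, 10.774814),  # 19:00 reading -> prediction for 20:00
    (11.525833, 10.707836),  # 20:00 reading -> prediction for 21:00
]

differences = []
for lag, y_pred in rows:
    by_hand = intercept + coefficient * lag
    differences.append(abs(by_hand - y_pred))
    print(round(by_hand, 2), round(y_pred, 2))
```

Each hand-computed value is within 0.05 of the model's prediction; the small gap comes from rounding the intercept and coefficient to two decimals.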

[39]: VimeoVideo("665412891", h="39d7356a26", width=600)

[39]: <IPython.lib.display.VimeoVideo at 0x7f39aed333d0>

Task 3.2.20: Create a time series line plot for the values in df_pred_test using plotly
express. Be sure that the y-axis is properly labeled as "P2".
• Create a line plot using plotly express.
[40]: fig = px.line(df_pred_test, labels={"value":"P2"})
fig.show()
