032 Linear Regression With Time Series Data
1 Prepare Data
1.1 Import
[3]: VimeoVideo("665412469", h="135f32c7da", width=600)
Task 3.2.1: Complete the code to create a client to connect to the MongoDB server, assign the
"air-quality" database to db, and assign the "nairobi" collection to nairobi.
• Create a client object for a MongoDB instance.
• Access a database using PyMongo.
• Access a collection in a database using PyMongo.
[4]: client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
nairobi = db["nairobi"]
Task 3.2.2: Complete the wrangle function below so that the results from the database query
are read into the DataFrame df. Be sure that the index of df is the "timestamp" from the results.
• Create a DataFrame from a dictionary using pandas.
[6]: def wrangle(collection):
    # Query the database for PM2.5 ("P2") readings from site 29
    results = collection.find(
        {"metadata.site": 29, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    df = pd.DataFrame(results).set_index("timestamp")
    # Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
    # Remove outliers
    df = df[df["P2"] < 500]
    return df
Task 3.2.3: Use your wrangle function to read the data from the nairobi collection into the
DataFrame df.
[8]: df = wrangle(nairobi)
df.head(10)
df.shape
[8]: (2927, 2)
[10]: VimeoVideo("665412520", h="e03eefff07", width=600)
Task 3.2.4: Add to your wrangle function so that the DatetimeIndex for df is localized to the
correct timezone, "Africa/Nairobi". Don’t forget to re-run all the cells above after you change
the function.
• Localize a timestamp to another timezone using pandas.
[11]: # Check your work
assert df.index.tzinfo == pytz.timezone("Africa/Nairobi")
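The two-step localization used in wrangle can be tried on its own. The following is a minimal sketch, assuming the stored timestamps are timezone-naive and recorded in UTC; the timestamps themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical naive timestamps, standing in for the values read from MongoDB.
idx = pd.DatetimeIndex(["2018-09-01 00:00:00", "2018-09-01 00:05:00"])

# Step 1: declare that the naive timestamps are in UTC.
idx_utc = idx.tz_localize("UTC")

# Step 2: convert the UTC timestamps to Nairobi local time (UTC+3).
idx_nairobi = idx_utc.tz_convert("Africa/Nairobi")

print(idx_nairobi)
```

tz_localize attaches a timezone without changing the clock time, while tz_convert shifts the clock time to the new zone. Calling tz_convert directly on a naive index raises an error, which is why both steps are needed.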
1.2 Explore
[12]: VimeoVideo("665412546", h="97792cb982", width=600)
Task 3.2.6: Add to your wrangle function so that all "P2" readings above 500 are dropped from
the dataset. Don’t forget to re-run all the cells above after you change the function.
• Subset a DataFrame with a mask using pandas.
[15]: # Check your work
assert len(df) <= 32906
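As a minimal sketch of subsetting with a mask, the example below uses a small made-up DataFrame (df_demo is a hypothetical name, not part of the lesson data).

```python
import pandas as pd

# Hypothetical readings, including one implausibly high value.
df_demo = pd.DataFrame({"P2": [12.0, 48.5, 650.0, 30.2]})

# The mask is a boolean Series: True where the reading should be kept.
mask = df_demo["P2"] < 500

# Indexing with the mask keeps only the rows where it is True.
df_clean = df_demo[mask]
print(df_clean)
```

Note that a strict `< 500` comparison also drops a reading of exactly 500. If only readings above 500 should go, `<= 500` matches the task wording more literally.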
Task 3.2.7: Create a time series plot of the "P2" readings in df.
• Create a line plot using pandas.
[17]: fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].plot(xlabel="Time", ylabel="PM2.5", title="PM2.5 Time Series", ax=ax)
Task 3.2.8: Add to your wrangle function to resample df to provide the mean "P2" reading for
each hour. Use a forward fill to impute any missing values. Don’t forget to re-run all the cells
above after you change the function.
• Resample time series data in pandas.
• Impute missing time series values using pandas.
[19]: # Check your work
assert len(df) <= 2928
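One way the resampling step might look is sketched below on made-up readings. The sample data and the names s, hourly, and df_hourly are assumptions for illustration. The "60min" frequency string is used here for an hourly bin because the one-letter hour alias differs between pandas versions.

```python
import pandas as pd

# Hypothetical irregular readings; note that no reading falls in the 02:00 hour.
idx = pd.DatetimeIndex(
    [
        "2018-09-01 00:00", "2018-09-01 00:20", "2018-09-01 00:40",
        "2018-09-01 01:00", "2018-09-01 01:30",
        "2018-09-01 03:10",
    ],
    tz="Africa/Nairobi",
)
s = pd.Series([10.0, 20.0, 30.0, 40.0, 60.0, 5.0], index=idx, name="P2")

# Mean reading per hour; hours without readings become NaN.
hourly = s.resample("60min").mean()

# Forward fill: each missing hour takes the last known hourly mean.
df_hourly = hourly.ffill().to_frame()
print(df_hourly)
```

Forward fill only uses past values, which suits time series work because it never leaks information from the future into an earlier row.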
Task 3.2.9: Plot the rolling average of the "P2" readings in df. Use a window size of 168 (the
number of hours in a week).
• What’s a rolling average?
• Calculate a rolling average in pandas.
• Create a line plot using pandas.
[21]: fig, ax = plt.subplots(figsize=(15, 6))
df["P2"].rolling(168).mean().plot(ax=ax, ylabel="PM2.5", title="Weekly Rolling Average");
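To see what a rolling average does, here is a minimal sketch with a window of 3 on a short made-up Series. The same logic applies to the 168-hour window above.

```python
import pandas as pd

# Hypothetical hourly values.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output value is the mean of the current value and the two before it.
rolling = s.rolling(3).mean()
print(rolling)
```

The first two entries are NaN because a full window of three values is not yet available. In the same way, a window of 168 leaves the first 167 hours without a value, so the plotted line starts about a week into the data.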
Task 3.2.10: Add to your wrangle function to create a column called "P2.L1" that contains the
mean "P2" reading from the previous hour. Since this new feature will create NaN values in your
DataFrame, be sure to also drop null rows from df.
• Shift the index of a Series in pandas.
• Drop rows with missing values from a DataFrame using pandas.
[23]: # Check your work
assert len(df) <= 11686
assert df.shape[1] == 2
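A lag feature of this kind could be built as in the sketch below. The data and the name df_demo are made up for illustration; only the column names "P2" and "P2.L1" come from the task.

```python
import pandas as pd

# Hypothetical hourly means.
idx = pd.date_range("2018-09-01", periods=4, freq="60min", tz="Africa/Nairobi")
df_demo = pd.DataFrame({"P2": [10.0, 12.0, 9.0, 15.0]}, index=idx)

# shift(1) moves every value down one row, so each row sees the previous hour.
df_demo["P2.L1"] = df_demo["P2"].shift(1)

# The first row has no previous hour, so its lag is NaN and the row is dropped.
df_demo = df_demo.dropna()
print(df_demo)
```

Dropping the first row is why the hourly DataFrame ends up one row shorter after this step.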
[24]: VimeoVideo("665412732", h="059e4088c5", width=600)
[25]:              P2     P2.L1
      P2     1.000000  0.650679
      P2.L1  0.650679  1.000000
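The cell that produced this table is not shown here. A table with this layout is what the pandas corr method returns, so the following sketch assumes that call, applied to made-up data rather than the lesson's DataFrame.

```python
import pandas as pd

# Hypothetical hourly means and their one-hour lag.
df_demo = pd.DataFrame({"P2": [10.0, 12.0, 9.0, 15.0, 14.0]})
df_demo["P2.L1"] = df_demo["P2"].shift(1)

# Pearson correlation between every pair of columns.
corr = df_demo.dropna().corr()
print(corr)
```

The diagonal is always 1 because each column is perfectly correlated with itself. The off-diagonal entry is the autocorrelation at a lag of one hour.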
Task 3.2.12: Create a scatter plot that shows the mean PM2.5 reading for each hour as a function of
the mean reading from the previous hour. In other words, "P2.L1" should be on the x-axis, and
"P2" should be on the y-axis. Don’t forget to label your axes!
• Create a scatter plot using Matplotlib.
[27]: fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x=df["P2.L1"],y=df["P2"])
ax.plot([0,120],[0,120], linestyle="--", color="orange")
plt.xlabel("P2.L1")
plt.ylabel("P2")
plt.title("PM2.5 Autocorrelation")
1.3 Split
[28]: VimeoVideo("665412762", h="a5eba496f7", width=600)
Task 3.2.13: Split the DataFrame df into the feature matrix X and the target vector y. Your
target is "P2".
• Subset a DataFrame by selecting one or more columns in pandas.
• Select a Series from a DataFrame in pandas.
[29]: target = "P2"
y = df[target]
# features=["P2.L1"]
# X_train= df[features]
# X_train.head()
X = df.drop(columns=target)
Task 3.2.14: Split X and y into training and test sets. The first 80% of the data should be in your
training set. The remaining 20% should be in the test set.
• Divide data into training and test sets in pandas.
[31]: cutoff = int(len(X) * 0.8)
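The cell above only computes the cutoff position. One way to finish the split is sketched below on made-up data; X_demo and y_demo are hypothetical stand-ins for X and y. The rows are sliced in order rather than shuffled, so the test set always comes after the training set in time.

```python
import pandas as pd

# Hypothetical feature matrix and target with a shared hourly index.
idx = pd.date_range("2018-09-01", periods=10, freq="60min")
X_demo = pd.DataFrame({"P2.L1": range(10)}, index=idx)
y_demo = pd.Series(range(1, 11), index=idx, name="P2")

# Position of the first test row: 80% of the way through the data.
cutoff = int(len(X_demo) * 0.8)

# Slice by position so that the original time order is preserved.
X_train, y_train = X_demo.iloc[:cutoff], y_demo.iloc[:cutoff]
X_test, y_test = X_demo.iloc[cutoff:], y_demo.iloc[cutoff:]
print(len(X_train), len(X_test))
```

A random split would let the model train on hours that come after some of the test hours, which would make the test score look better than it should.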
2 Build Model
2.1 Baseline
Task 3.2.15: Calculate the baseline mean absolute error for your model.
• Calculate summary statistics for a DataFrame or Series in pandas.
[32]: y_pred_baseline = [(y_train.mean())] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
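To make the baseline concrete, here is a small worked sketch with made-up target values (y_train_demo is a hypothetical name). The baseline model always predicts the training mean, and its MAE is the average distance of the readings from that mean.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical training targets.
y_train_demo = pd.Series([10.0, 20.0, 30.0, 40.0])

# The baseline predicts the same value, the mean, for every observation.
y_pred_baseline = [y_train_demo.mean()] * len(y_train_demo)
mae_baseline = mean_absolute_error(y_train_demo, y_pred_baseline)

print("Mean P2 Reading:", round(y_train_demo.mean(), 2))
print("Baseline MAE:", round(mae_baseline, 2))
```

Here the mean is 25 and the absolute errors are 15, 5, 5, and 15, so the baseline MAE is 10. A trained model is only useful if it beats this number.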
2.2 Iterate
Task 3.2.16: Instantiate a LinearRegression model named model, and fit it to your training
data.
• Instantiate a predictor in scikit-learn.
• Fit a model to training data in scikit-learn.
[33]: #Instantiate
model = LinearRegression()
#Fit
model.fit(X_train, y_train)
[33]: LinearRegression()
2.3 Evaluate
[34]: VimeoVideo("665412844", h="129865775d", width=600)
Task 3.2.17: Calculate the training and test mean absolute error for your model.
• Generate predictions using a trained model in scikit-learn.
• Calculate the mean absolute error for a list of predictions in scikit-learn.
[35]: training_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print("Training MAE:", round(training_mae, 2))
print("Test MAE:", round(test_mae, 2))
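The whole fit-and-evaluate loop can be run end to end on made-up data, as in the sketch below. The numbers are invented so that the target follows the lag feature closely. They are not the Nairobi readings, and the names ending in _demo are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical lag feature and a target that tracks it with small deviations.
X_demo = pd.DataFrame(
    {"P2.L1": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]}
)
noise = [0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.4, -0.4, 0.1, -0.1]
y_demo = 1 + 0.9 * X_demo["P2.L1"] + noise

# Chronological 80/20 split.
cutoff = int(len(X_demo) * 0.8)
X_train, y_train = X_demo.iloc[:cutoff], y_demo.iloc[:cutoff]
X_test, y_test = X_demo.iloc[cutoff:], y_demo.iloc[cutoff:]

# Baseline: always predict the training mean.
mae_baseline = mean_absolute_error(y_train, [y_train.mean()] * len(y_train))

# Fit on the training set only, then score both sets.
model = LinearRegression()
model.fit(X_train, y_train)
training_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print("Baseline MAE:", round(mae_baseline, 2))
print("Training MAE:", round(training_mae, 2))
print("Test MAE:", round(test_mae, 2))
```

Two comparisons matter here. Both MAE values should be below the baseline, and the test MAE should be close to the training MAE, since a much larger test error suggests the model does not generalize.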
3 Communicate Results
Task 3.2.18: Extract the intercept and coefficient from your model.
• Access an object in a pipeline in scikit-learn.
[36]: intercept = round(model.intercept_,2)
coefficient = round(model.coef_[0],2)
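With a single feature, the intercept and coefficient describe a straight line, and writing it out is a quick way to communicate the model. The sketch below fits a model to made-up points that lie exactly on a known line, so the recovered values can be checked by eye. The data and the printed equation format are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line P2 = 2 + 0.5 * P2.L1.
X_demo = pd.DataFrame({"P2.L1": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]})
y_demo = 2 + 0.5 * X_demo["P2.L1"]

model = LinearRegression()
model.fit(X_demo, y_demo)

# intercept_ is a scalar; coef_ holds one value per feature column.
intercept = round(model.intercept_, 2)
coefficient = round(model.coef_[0], 2)

print(f"P2 = {intercept} + ({coefficient} * P2.L1)")
```

Read this way, the coefficient says how much of the previous hour's reading carries over into the current hour, and the intercept is the prediction when the previous reading is zero.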
Task 3.2.19: Create a DataFrame df_pred_test that has two columns: "y_test" and "y_pred".
The first should contain the true values for your test set, and the second should contain your
model’s predictions. Be sure the index of df_pred_test matches the index of y_test.
• Create a DataFrame from a dictionary using pandas.
[38]: df_pred_test = pd.DataFrame(
{
"y_test": y_test,
"y_pred": model.predict(X_test)
}
)
df_pred_test.head()
Task 3.2.20: Create a time series line plot for the values in df_pred_test using plotly
express. Be sure that the y-axis is properly labeled as "P2".
• Create a line plot using plotly express.
[40]: fig = px.line(df_pred_test, labels={"value":"P2"})
fig.show()
Copyright © 2022 WorldQuant University. This content is licensed solely for personal use.
Redistribution or publication of this material is strictly prohibited.