Merged
Add Ridge Regression examples using Intel® Extension for Scikit-learn (
nandried authored Mar 5, 2025
commit 0ef48b9bbd1c76a02c4c3ce9adcaaed66a536616
373 changes: 373 additions & 0 deletions examples/notebooks/ridge_regression_bike_sharing.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,373 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c44da9e1-d76e-4956-8c19-9dade31e2bc5",
"metadata": {},
"source": [
"# Intel® Extension for Scikit-learn Ridge Regression for New York City Bike Share dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "963c5285-6470-474c-9550-6990029fa6a3",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from timeit import default_timer as timer\n",
"from sklearn import metrics\n",
"from sklearn.model_selection import train_test_split\n",
"import warnings\n",
"from sklearn.datasets import fetch_openml\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from IPython.display import HTML\n",
"\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"id": "421d17b2-a02f-464d-b196-9e46cd0bb396",
"metadata": {},
"source": [
"### Download the data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d0b94c65-bd7c-480e-bd56-e5adc711585f",
"metadata": {},
"outputs": [],
"source": [
"dataset = fetch_openml(data_id=43526, as_frame=True)"
]
},
{
"cell_type": "markdown",
"id": "bcb6b5bc-5a76-4741-b350-518cd9c43f9c",
"metadata": {},
"source": [
"### Preprocessing\n",
"Extract date parts from the trip timestamps, then encode the categorical features with LabelEncoder."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3995621-d516-470f-a8d1-78d635435bb7",
"metadata": {},
"outputs": [],
"source": [
"# Access the data as a DataFrame\n",
"data = dataset.frame\n",
"\n",
"# Convert date columns to datetime\n",
"data['Start_Time'] = pd.to_datetime(data['Start_Time'])\n",
"data['Stop_Time'] = pd.to_datetime(data['Stop_Time'])\n",
"\n",
"# Extract useful features from datetime columns\n",
"data['Start_Year'] = data['Start_Time'].dt.year\n",
"data['Start_Month'] = data['Start_Time'].dt.month\n",
"data['Start_Day'] = data['Start_Time'].dt.day\n",
"data['Start_Hour'] = data['Start_Time'].dt.hour\n",
"\n",
"data['Stop_Year'] = data['Stop_Time'].dt.year\n",
"data['Stop_Month'] = data['Stop_Time'].dt.month\n",
"data['Stop_Day'] = data['Stop_Time'].dt.day\n",
"data['Stop_Hour'] = data['Stop_Time'].dt.hour\n",
"\n",
"# Drop the original datetime columns\n",
"data = data.drop(columns=['Start_Time', 'Stop_Time'])\n",
"\n",
"# Encode categorical variables\n",
"for col in ['Start_Station_Name', 'End_Station_Name', 'Gender', 'User_Type']:\n",
"    le = LabelEncoder().fit(data[col])\n",
"    data[col] = le.transform(data[col])\n",
"\n",
"# Set the target variable\n",
"data['target'] = data['Trip_Duration']\n",
"\n",
"# Separate features and target\n",
"x = data.drop(columns=['target', 'Trip_Duration'])\n",
"y = data['target']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d21fbc59-1f27-4114-a54e-51a5263a5d31",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(661951, 22) (73551, 22) (661951,) (73551,)\n"
]
}
],
"source": [
"# Ensure x and y are defined and not None\n",
"if x is not None and y is not None:\n",
"    for col in ['User_Type', 'Gender']:\n",
"        if col in x.columns:\n",
"            le = LabelEncoder().fit(x[col])\n",
"            x[col] = le.transform(x[col])\n",
"        else:\n",
"            print(f\"Column {col} does not exist in the DataFrame.\")\n",
"\n",
"    # Split the data\n",
"    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)\n",
"    print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)\n",
"else:\n",
"    print(\"x or y is None. Please check your data.\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "862647f0-e3cc-486f-b40f-c9d013b2b973",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"\n",
"scaler_x = MinMaxScaler()\n",
"scaler_y = StandardScaler()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "7fabcc1c-1632-45bb-aa8e-cc6edd864a51",
"metadata": {},
"outputs": [],
"source": [
"y_train = y_train.to_numpy().reshape(-1, 1)\n",
"y_test = y_test.to_numpy().reshape(-1, 1)\n",
"\n",
"scaler_x.fit(x_train)\n",
"x_train = scaler_x.transform(x_train)\n",
"x_test = scaler_x.transform(x_test)\n",
"\n",
"scaler_y.fit(y_train)\n",
"y_train = scaler_y.transform(y_train).ravel()\n",
"y_test = scaler_y.transform(y_test).ravel()"
]
},
{
"cell_type": "markdown",
"id": "b6e77f12-027c-4519-940c-26c9b127daac",
"metadata": {},
"source": [
"### Patch original Scikit-learn with Intel® Extension for Scikit-learn\n",
"Intel® Extension for Scikit-learn (previously known as daal4py) contains drop-in replacement functionality for the stock Scikit-learn package. You can take advantage of the performance optimizations of Intel® Extension for Scikit-learn by adding just two lines of code before the usual Scikit-learn imports:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4fe10be8-83a0-4681-8ecc-6f8b09eb1e13",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Intel(R) Extension for Scikit-learn* enabled (https://github.com/uxlfoundation/scikit-learn-intelex)\n"
]
}
],
"source": [
"from sklearnex import patch_sklearn\n",
"\n",
"patch_sklearn()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "60c6aa9b-748b-4a02-87b1-600d1130c4d8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Intel® extension for Scikit-learn time: 0.04 s'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import Ridge\n",
"\n",
"# Ridge is fit with its default hyperparameters (alpha=1.0, fit_intercept=True);\n",
"# the unpatched run later in the notebook uses the same settings, so the two are comparable.\n",
"start = timer()\n",
"model = Ridge(random_state=0).fit(x_train, y_train)\n",
"train_patched = timer() - start\n",
"f\"Intel® extension for Scikit-learn time: {train_patched:.2f} s\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e3b5e2bc-a5fe-4f1b-90ab-b3119ad77b4b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Patched Scikit-learn MSE: 0.29078674972552815'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_predict = model.predict(x_test)\n",
"mse_metric_opt = metrics.mean_squared_error(y_test, y_predict)\n",
"f\"Patched Scikit-learn MSE: {mse_metric_opt}\""
]
},
{
"cell_type": "markdown",
"id": "177a3800-4ee6-421f-8cd0-149beebb460a",
"metadata": {},
"source": [
"### Train the same algorithm with original Scikit-learn\n",
"To disable the optimizations, call *unpatch_sklearn* and re-import the Ridge class."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7dbf40f4-7cb6-4ad5-a5d1-a07793780b23",
"metadata": {},
"outputs": [],
"source": [
"from sklearnex import unpatch_sklearn\n",
"\n",
"unpatch_sklearn()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "eb46f060-dbde-43a3-b173-6d1c02fd9b9c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Original Scikit-learn time: 0.19 s'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import Ridge\n",
"\n",
"start = timer()\n",
"model = Ridge(random_state=0).fit(x_train, y_train)\n",
"train_unpatched = timer() - start\n",
"f\"Original Scikit-learn time: {train_unpatched:.2f} s\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "6f55a543-d85a-4e76-95d1-4fef83b59d7e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Original Scikit-learn MSE: 0.29078674972650354'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_predict = model.predict(x_test)\n",
"mse_metric_original = metrics.mean_squared_error(y_test, y_predict)\n",
"f\"Original Scikit-learn MSE: {mse_metric_original}\""
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "aa1355e0-b70c-448b-99d8-eee19f2b1f36",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<h3>Compare the MSE of patched and original Scikit-learn</h3>MSE metric of patched Scikit-learn: 0.29078674972552815 <br>MSE metric of unpatched Scikit-learn: 0.29078674972650354 <br>Metrics ratio: 0.9999999999966457 <br><h3>With Scikit-learn-intelex patching you can:</h3><ul><li>Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);</li><li>Run training and prediction of Scikit-learn models faster;</li><li>Get similar model quality;</li><li>Get a speedup of <strong>4.7</strong> times.</li></ul>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML(\n",
"    f\"<h3>Compare the MSE of patched and original Scikit-learn</h3>\"\n",
"    f\"MSE metric of patched Scikit-learn: {mse_metric_opt} <br>\"\n",
"    f\"MSE metric of unpatched Scikit-learn: {mse_metric_original} <br>\"\n",
"    f\"Metrics ratio: {mse_metric_opt/mse_metric_original} <br>\"\n",
"    f\"<h3>With Scikit-learn-intelex patching you can:</h3>\"\n",
"    f\"<ul>\"\n",
"    f\"<li>Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);</li>\"\n",
"    f\"<li>Run training and prediction of Scikit-learn models faster;</li>\"\n",
"    f\"<li>Get similar model quality;</li>\"\n",
"    f\"<li>Get a speedup of <strong>{(train_unpatched/train_patched):.1f}</strong> times.</li>\"\n",
"    f\"</ul>\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
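Review note on the timing cells: each `fit` is timed exactly once, and with measurements as short as 0.04 s and 0.19 s a single run can be noticeably affected by background load or one-off start-up costs. One possible way to make the reported speedup steadier is to repeat each measurement and keep the smallest time. The sketch below uses only the standard library; `best_time` and `speedup` are hypothetical helper names for illustration, not part of the notebook or of Intel® Extension for Scikit-learn.

```python
from timeit import default_timer as timer


def best_time(fn, repeats=5):
    """Return the smallest wall-clock time in seconds over `repeats` calls of fn()."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        fn()
        elapsed = timer() - start
        best = min(best, elapsed)
    return best


def speedup(baseline_seconds, optimized_seconds):
    """Ratio of baseline time to optimized time; values above 1 mean faster."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    return baseline_seconds / optimized_seconds


if __name__ == "__main__":
    # Stand-in workload (an assumption for this sketch). In the notebook the
    # callable would instead be: lambda: Ridge(random_state=0).fit(x_train, y_train)
    elapsed = best_time(lambda: sum(i * i for i in range(10_000)), repeats=3)
    print(f"best of 3 runs: {elapsed:.6f} s")
```

In the notebook, `train_patched` and `train_unpatched` could each be assigned from `best_time(...)` in their respective cells, and the final summary could use `speedup(train_unpatched, train_patched)` in place of the inline division. Taking the minimum rather than the mean is a design choice here: it reports the least-disturbed run, which tends to be more repeatable for short workloads.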