|
26 | 26 | },
|
27 | 27 | "outputs": [],
|
28 | 28 | "source": [
|
29 |
| - "# Author: Pedro Morales <[email protected]>\n#\n# License: BSD 3 clause\n\nimport numpy as np\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split, GridSearchCV\n\nnp.random.seed(0)\n\n# Load data from https://www.openml.org/d/40945\nX, y = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\n\n# Alternatively X and y can be obtained directly from the frame attribute:\n# X = titanic.frame.drop('survived', axis=1)\n# y = titanic.frame['survived']" |
| 29 | + "# Author: Pedro Morales <[email protected]>\n#\n# License: BSD 3 clause" |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "code", |
| 34 | + "execution_count": null, |
| 35 | + "metadata": { |
| 36 | + "collapsed": false |
| 37 | + }, |
| 38 | + "outputs": [], |
| 39 | + "source": [ |
| 40 | + "import numpy as np\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split, GridSearchCV\n\nnp.random.seed(0)" |
| 41 | + ] |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "metadata": {}, |
| 46 | + "source": [ |
| 47 | + "Load data from https://www.openml.org/d/40945\n\n" |
| 48 | + ] |
| 49 | + }, |
| 50 | + { |
| 51 | + "cell_type": "code", |
| 52 | + "execution_count": null, |
| 53 | + "metadata": { |
| 54 | + "collapsed": false |
| 55 | + }, |
| 56 | + "outputs": [], |
| 57 | + "source": [ |
| 58 | + "X, y = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\n\n# Alternatively X and y can be obtained directly from the frame attribute:\n# X = titanic.frame.drop('survived', axis=1)\n# y = titanic.frame['survived']" |
| 59 | + ] |
| 60 | + }, |
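The commented alternative in the cell above splits features and target out of the OpenML ``frame`` attribute with ``drop`` and column selection. A minimal sketch of that pattern on a hypothetical stand-in frame (no network access; the values are illustrative, not the real Titanic data):

```python
import pandas as pd

# Hypothetical stand-in for titanic.frame: feature columns plus the target.
frame = pd.DataFrame(
    {
        "age": [22.0, 35.0, 58.0],
        "fare": [7.25, 8.05, 26.55],
        "survived": [0, 1, 1],
    }
)

# Same split as the commented lines above: drop the target to get X,
# select it to get y.
X = frame.drop("survived", axis=1)
y = frame["survived"]

print(list(X.columns))  # ['age', 'fare']
print(y.tolist())  # [0, 1, 1]
```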
| 61 | + { |
| 62 | + "cell_type": "markdown", |
| 63 | + "metadata": {}, |
| 64 | + "source": [ |
| 65 | + "Use ``ColumnTransformer`` by selecting columns by name\n\nWe will train our classifier with the following features:\n\nNumeric features:\n\n* ``age``: float;\n* ``fare``: float.\n\nCategorical features:\n\n* ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;\n* ``sex``: categories encoded as strings ``{'female', 'male'}``;\n* ``pclass``: ordinal integers ``{1, 2, 3}``.\n\nWe create the preprocessing pipelines for both numeric and categorical data.\nNote that ``pclass`` could be treated as either a categorical or a numeric\nfeature.\n\n" |
| 66 | + ] |
| 67 | + }, |
| 68 | + { |
| 69 | + "cell_type": "code", |
| 70 | + "execution_count": null, |
| 71 | + "metadata": { |
| 72 | + "collapsed": false |
| 73 | + }, |
| 74 | + "outputs": [], |
| 75 | + "source": [ |
| 76 | + "numeric_features = [\"age\", \"fare\"]\nnumeric_transformer = Pipeline(\n steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())]\n)\n\ncategorical_features = [\"embarked\", \"sex\", \"pclass\"]\ncategorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n\npreprocessor = ColumnTransformer(\n transformers=[\n (\"num\", numeric_transformer, numeric_features),\n (\"cat\", categorical_transformer, categorical_features),\n ]\n)" |
30 | 77 | ]
|
31 | 78 | },
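The preprocessor defined in the cell above can be exercised on its own before wiring it into a pipeline. A runnable sketch on a tiny synthetic frame (the rows are made up; the column names match the example): the two numeric columns are imputed and scaled, while the categorical columns are one-hot encoded into 3 + 2 + 3 indicator columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in rows for the Titanic columns used in the example.
df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 58.0],  # NaN exercises the imputer
        "fare": [7.25, 71.28, 8.05, 26.55],
        "embarked": ["S", "C", "S", "Q"],
        "sex": ["male", "female", "male", "female"],
        "pclass": [3, 1, 3, 2],
    }
)

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["age", "fare"]),
        ("cat", categorical_transformer, ["embarked", "sex", "pclass"]),
    ]
)

Xt = preprocessor.fit_transform(df)
# 2 scaled numeric columns + 3 (embarked) + 2 (sex) + 3 (pclass) one-hot columns
print(Xt.shape)  # (4, 10)
```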
|
32 | 79 | {
|
33 | 80 | "cell_type": "markdown",
|
34 | 81 | "metadata": {},
|
35 | 82 | "source": [
|
36 |
| - "## Use ``ColumnTransformer`` by selecting column by names\n We will train our classifier with the following features:\n\n Numeric Features:\n\n * ``age``: float;\n * ``fare``: float.\n\n Categorical Features:\n\n * ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;\n * ``sex``: categories encoded as strings ``{'female', 'male'}``;\n * ``pclass``: ordinal integers ``{1, 2, 3}``.\n\n We create the preprocessing pipelines for both numeric and categorical data.\n Note that ``pclass`` could either be treated as a categorical or numeric\n feature.\n\n" |
| 83 | + "Append the classifier to the preprocessing pipeline.\nNow we have a full prediction pipeline.\n\n" |
37 | 84 | ]
|
38 | 85 | },
|
39 | 86 | {
|
|
44 | 91 | },
|
45 | 92 | "outputs": [],
|
46 | 93 | "source": [
|
47 |
| - "numeric_features = [\"age\", \"fare\"]\nnumeric_transformer = Pipeline(\n steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())]\n)\n\ncategorical_features = [\"embarked\", \"sex\", \"pclass\"]\ncategorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n\npreprocessor = ColumnTransformer(\n transformers=[\n (\"num\", numeric_transformer, numeric_features),\n (\"cat\", categorical_transformer, categorical_features),\n ]\n)\n\n# Append classifier to preprocessing pipeline.\n# Now we have a full prediction pipeline.\nclf = Pipeline(\n steps=[(\"preprocessor\", preprocessor), (\"classifier\", LogisticRegression())]\n)\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))" |
| 94 | + "clf = Pipeline(\n steps=[(\"preprocessor\", preprocessor), (\"classifier\", LogisticRegression())]\n)\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))" |
48 | 95 | ]
|
49 | 96 | },
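The same fit/score pattern can be checked without downloading the Titanic data. A sketch using the bundled iris dataset instead (so the preprocessing reduces to scaling; the pipeline structure and the `score` call are the same):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled dataset, purely numeric, so a scaler stands in for the preprocessor.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("model score: %.3f" % score)
```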
|
50 | 97 | {
|
51 | 98 | "cell_type": "markdown",
|
52 | 99 | "metadata": {},
|
53 | 100 | "source": [
|
54 |
| - "## HTML representation of ``Pipeline`` (display diagram)\n When the ``Pipeline`` is printed out in a jupyter notebook an HTML\n representation of the estimator is displayed as follows:\n\n" |
| 101 | + "HTML representation of ``Pipeline`` (display diagram)\n\nWhen the ``Pipeline`` is printed out in a Jupyter notebook, an HTML\nrepresentation of the estimator is displayed:\n\n" |
55 | 102 | ]
|
56 | 103 | },
|
57 | 104 | {
|
|
62 | 109 | },
|
63 | 110 | "outputs": [],
|
64 | 111 | "source": [
|
65 |
| - "from sklearn import set_config\n\nset_config(display=\"diagram\")\nclf" |
| 112 | + "clf" |
66 | 113 | ]
|
67 | 114 | },
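Outside a notebook, the same HTML diagram can be produced programmatically with `sklearn.utils.estimator_html_repr`. A small sketch (the two-step pipeline here is illustrative):

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Ask scikit-learn for the diagram representation when estimators are displayed.
set_config(display="diagram")

clf = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)

# The HTML string a notebook would render as the interactive diagram.
html = estimator_html_repr(clf)
print("Pipeline" in html)  # True
```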
|
68 | 115 | {
|
69 | 116 | "cell_type": "markdown",
|
70 | 117 | "metadata": {},
|
71 | 118 | "source": [
|
72 |
| - "## Use ``ColumnTransformer`` by selecting column by data types\n When dealing with a cleaned dataset, the preprocessing can be automatic by\n using the data types of the column to decide whether to treat a column as a\n numerical or categorical feature.\n :func:`sklearn.compose.make_column_selector` gives this possibility.\n First, let's only select a subset of columns to simplify our\n example.\n\n" |
| 119 | + "Use ``ColumnTransformer`` by selecting columns by data type\n\nWhen dealing with a cleaned dataset, the preprocessing can be made automatic\nby using the data types of the columns to decide whether to treat a column as\na numerical or categorical feature.\n:func:`sklearn.compose.make_column_selector` gives this possibility.\nFirst, let's only select a subset of columns to simplify our\nexample.\n\n" |
73 | 120 | ]
|
74 | 121 | },
|
75 | 122 | {
|
|
123 | 170 | },
|
124 | 171 | "outputs": [],
|
125 | 172 | "source": [
|
126 |
| - "from sklearn.compose import make_column_selector as selector\n\npreprocessor = ColumnTransformer(\n transformers=[\n (\"num\", numeric_transformer, selector(dtype_exclude=\"category\")),\n (\"cat\", categorical_transformer, selector(dtype_include=\"category\")),\n ]\n)\nclf = Pipeline(\n steps=[(\"preprocessor\", preprocessor), (\"classifier\", LogisticRegression())]\n)\n\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))" |
| 173 | + "from sklearn.compose import make_column_selector as selector\n\npreprocessor = ColumnTransformer(\n transformers=[\n (\"num\", numeric_transformer, selector(dtype_exclude=\"category\")),\n (\"cat\", categorical_transformer, selector(dtype_include=\"category\")),\n ]\n)\nclf = Pipeline(\n steps=[(\"preprocessor\", preprocessor), (\"classifier\", LogisticRegression())]\n)\n\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))\nclf" |
127 | 174 | ]
|
128 | 175 | },
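A `make_column_selector` instance is itself callable on a DataFrame and returns the matching column names, which is what `ColumnTransformer` uses under the hood. A minimal sketch on a hypothetical frame with explicit `category` dtypes:

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical frame: two float columns and two category-dtype columns.
df = pd.DataFrame(
    {
        "age": [22.0, 35.0],
        "fare": [7.25, 8.05],
        "embarked": pd.Series(["S", "C"], dtype="category"),
        "sex": pd.Series(["male", "female"], dtype="category"),
    }
)

# Calling a selector on the frame returns the selected column names.
num_cols = selector(dtype_exclude="category")(df)
cat_cols = selector(dtype_include="category")(df)
print(num_cols)  # ['age', 'fare']
print(cat_cols)  # ['embarked', 'sex']
```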
|
129 | 176 | {
|
|
159 | 206 | "cell_type": "markdown",
|
160 | 207 | "metadata": {},
|
161 | 208 | "source": [
|
162 |
| - "## Using the prediction pipeline in a grid search\n Grid search can also be performed on the different preprocessing steps\n defined in the ``ColumnTransformer`` object, together with the classifier's\n hyperparameters as part of the ``Pipeline``.\n We will search for both the imputer strategy of the numeric preprocessing\n and the regularization parameter of the logistic regression using\n :class:`~sklearn.model_selection.GridSearchCV`.\n\n" |
| 209 | + "Using the prediction pipeline in a grid search\n\nGrid search can also be performed on the different preprocessing steps\ndefined in the ``ColumnTransformer`` object, together with the classifier's\nhyperparameters as part of the ``Pipeline``.\nWe will search for both the imputer strategy of the numeric preprocessing\nand the regularization parameter of the logistic regression using\n:class:`~sklearn.model_selection.GridSearchCV`.\n\n" |
163 | 210 | ]
|
164 | 211 | },
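Nested parameters are addressed with double-underscore names: pipeline step, then the `ColumnTransformer` transformer name, then the sub-step, then the parameter (e.g. `preprocessor__num__imputer__strategy`). A runnable sketch on synthetic numeric data (the columns and values are made up, not the Titanic dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic frame with missing values so the imputer strategy matters.
rng = np.random.RandomState(0)
X = pd.DataFrame(
    {"age": rng.normal(40, 10, 40), "fare": rng.normal(30, 5, 40)}
)
X.loc[::7, "age"] = np.nan
# Deterministic binary target: above/below the median fare.
y = (X["fare"] > X["fare"].median()).astype(int).to_numpy()

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
preprocessor = ColumnTransformer(
    transformers=[("num", numeric_transformer, ["age", "fare"])]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

# step__transformer__substep__param names reach into the nested objects.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10.0],
}
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```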
|
165 | 212 | {
|
|