
Commit db917ec

Pushing the docs to dev/ for branch: main, commit 071293d7ded1f7f9bc6c5466d014a000560e167c
1 parent 7fb4af4 commit db917ec

1,262 files changed (+6411 / -6262 lines changed)


dev/_downloads/ae2d0a2ad69c5df5b93e5ea5c87d56b2/plot_release_highlights_1_5_0.ipynb

+19-12
@@ -11,7 +11,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## FixedThresholdClassifier: Setting the decision threshold of a binary classifier\nAll binary classifiers of scikit-learn use a fixed decision threshold of 0.5 to\nconvert probability estimates (i.e. output of `predict_proba`) into class\npredictions. However, 0.5 is almost never the desired threshold for a given problem.\n:class:`~model_selection.FixedThresholdClassifier` allows to wrap any binary\nclassifier and set a custom decision threshold.\n\n"
+ "## FixedThresholdClassifier: Setting the decision threshold of a binary classifier\nAll binary classifiers of scikit-learn use a fixed decision threshold of 0.5\nto convert probability estimates (i.e. output of `predict_proba`) into class\npredictions. However, 0.5 is almost never the desired threshold for a given\nproblem. :class:`~model_selection.FixedThresholdClassifier` allows wrapping any\nbinary classifier and setting a custom decision threshold.\n\n"
  ]
  },
  {
@@ -22,7 +22,7 @@
  },
  "outputs": [],
  "source": [
- "from sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import confusion_matrix\n\nX, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)\nclassifier = LogisticRegression(random_state=0).fit(X, y)\n\nprint(\"confusion matrix:\\n\", confusion_matrix(y, classifier.predict(X)))"
+ "from sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import ConfusionMatrixDisplay\n\n\nX, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\nclassifier_05 = LogisticRegression(C=1e6, random_state=0).fit(X_train, y_train)\n_ = ConfusionMatrixDisplay.from_estimator(classifier_05, X_test, y_test)"
  ]
  },
  {
@@ -40,14 +40,14 @@
  },
  "outputs": [],
  "source": [
- "from sklearn.model_selection import FixedThresholdClassifier\n\nwrapped_classifier = FixedThresholdClassifier(classifier, threshold=0.1).fit(X, y)\n\nprint(\"confusion matrix:\\n\", confusion_matrix(y, wrapped_classifier.predict(X)))"
+ "from sklearn.model_selection import FixedThresholdClassifier\n\nclassifier_01 = FixedThresholdClassifier(classifier_05, threshold=0.1)\nclassifier_01.fit(X_train, y_train)\n_ = ConfusionMatrixDisplay.from_estimator(classifier_01, X_test, y_test)"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## TunedThresholdClassifierCV: Tuning the decision threshold of a binary classifier\nThe decision threshold of a binary classifier can be tuned to optimize a given\nmetric, using :class:`~model_selection.TunedThresholdClassifierCV`.\n\n"
+ "## TunedThresholdClassifierCV: Tuning the decision threshold of a binary classifier\nThe decision threshold of a binary classifier can be tuned to optimize a\ngiven metric, using :class:`~model_selection.TunedThresholdClassifierCV`.\n\nIt is particularly useful to find the best decision threshold when the model\nis meant to be deployed in a specific application context where we can assign\ndifferent gains or costs for true positives, true negatives, false positives,\nand false negatives.\n\nLet's illustrate this by considering an arbitrary case where:\n\n- each true positive gains 1 unit of profit, e.g. euro, year of life in good\n health, etc.;\n- true negatives gain or cost nothing;\n- each false negative costs 2;\n- each false positive costs 0.1.\n\nOur metric quantifies the average profit per sample, which is defined by the\nfollowing Python function:\n\n"
  ]
  },
  {
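
An aside on the hunk above: wrapping a fitted classifier in FixedThresholdClassifier(classifier_05, threshold=0.1) is conceptually close to thresholding the positive-class column of predict_proba by hand. A minimal sketch of that equivalence, assuming the classifier_05, classifier_01 and X_test objects from the diffed example are in scope; how scores exactly equal to the threshold are handled is an implementation detail and may differ at the boundary:

    import numpy as np

    # Manually apply the 0.1 threshold to the positive-class probability and
    # compare with the predictions of the wrapped classifier.
    manual_pred = (classifier_05.predict_proba(X_test)[:, 1] >= 0.1).astype(int)
    wrapped_pred = classifier_01.predict(X_test)
    print("agreement:", float((manual_pred == wrapped_pred).mean()))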
@@ -58,14 +58,14 @@
  },
  "outputs": [],
  "source": [
- "from sklearn.metrics import balanced_accuracy_score\n\n# Due to the class imbalance, the balanced accuracy is not optimal for the default\n# threshold. The classifier tends to over predict the majority class.\nprint(f\"balanced accuracy: {balanced_accuracy_score(y, classifier.predict(X)):.2f}\")"
+ "from sklearn.metrics import confusion_matrix\n\n\ndef custom_score(y_observed, y_pred):\n tn, fp, fn, tp = confusion_matrix(y_observed, y_pred, normalize=\"all\").ravel()\n return tp - 2 * fn - 0.1 * fp\n\n\nprint(\"Untuned decision threshold: 0.5\")\nprint(f\"Custom score: {custom_score(y_test, classifier_05.predict(X_test)):.2f}\")"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Tuning the threshold to optimize the balanced accuracy gives a smaller threshold\nthat allows more samples to be classified as the positive class.\n\n"
+ "It is interesting to observe that the average gain per prediction is negative\nwhich means that this decision system is making a loss on average.\n\nTuning the threshold to optimize this custom metric gives a smaller threshold\nthat allows more samples to be classified as the positive class. As a result,\nthe average gain per prediction improves.\n\n"
  ]
  },
  {
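
For intuition, the custom_score function added above is a cost-weighted sum of the normalized confusion-matrix entries. A quick hand check with made-up fractions (chosen for illustration, not taken from the example's output):

    # Hypothetical normalized confusion-matrix entries (fractions of all samples):
    # 85% true negatives, 2% false positives, 5% false negatives, 8% true positives.
    tn, fp, fn, tp = 0.85, 0.02, 0.05, 0.08
    print(tp - 2 * fn - 0.1 * fp)  # 0.08 - 0.10 - 0.002 = -0.022, a loss on average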
@@ -76,21 +76,21 @@
  },
  "outputs": [],
  "source": [
- "from sklearn.model_selection import TunedThresholdClassifierCV\n\ntuned_classifier = TunedThresholdClassifierCV(\n classifier, cv=5, scoring=\"balanced_accuracy\"\n).fit(X, y)\n\nprint(f\"new threshold: {tuned_classifier.best_threshold_:.4f}\")\nprint(\n f\"balanced accuracy: {balanced_accuracy_score(y, tuned_classifier.predict(X)):.2f}\"\n)"
+ "from sklearn.model_selection import TunedThresholdClassifierCV\nfrom sklearn.metrics import make_scorer\n\ncustom_scorer = make_scorer(\n custom_score, response_method=\"predict\", greater_is_better=True\n)\ntuned_classifier = TunedThresholdClassifierCV(\n classifier_05, cv=5, scoring=custom_scorer\n).fit(X, y)\n\nprint(f\"Tuned decision threshold: {tuned_classifier.best_threshold_:.3f}\")\nprint(f\"Custom score: {custom_score(y_test, tuned_classifier.predict(X_test)):.2f}\")"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- ":class:`~model_selection.TunedThresholdClassifierCV` also benefits from the\nmetadata routing support (`Metadata Routing User Guide<metadata_routing>`)\nallowing to optimze complex business metrics, detailed\nin `Post-tuning the decision threshold for cost-sensitive learning\n<sphx_glr_auto_examples_model_selection_plot_cost_sensitive_learning.py>`.\n\n"
+ "We observe that tuning the decision threshold can turn a machine\nlearning-based system that makes a loss on average into a beneficial one.\n\nIn practice, defining a meaningful application-specific metric might involve\nmaking those costs for bad predictions and gains for good predictions depend on\nauxiliary metadata specific to each individual data point such as the amount\nof a transaction in a fraud detection system.\n\nTo achieve this, :class:`~model_selection.TunedThresholdClassifierCV`\nleverages metadata routing support (`Metadata Routing User\nGuide<metadata_routing>`) allowing to optimize complex business metrics as\ndetailed in `Post-tuning the decision threshold for cost-sensitive\nlearning\n<sphx_glr_auto_examples_model_selection_plot_cost_sensitive_learning.py>`.\n\n"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Performance improvements in PCA\n:class:`~decomposition.PCA` has a new solver, \"covariance_eigh\", which is faster\nand more memory efficient than the other solvers for datasets with a large number\nof samples and a small number of features.\n\n"
+ "## Performance improvements in PCA\n:class:`~decomposition.PCA` has a new solver, `\"covariance_eigh\"`, which is\nup to an order of magnitude faster and more memory efficient than the other\nsolvers for datasets with many data points and few features.\n\n"
  ]
  },
  {
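
The paragraph added above defers metrics that depend on per-sample metadata (such as a transaction amount) to the linked cost-sensitive learning example. A minimal, self-contained sketch of that metadata-routing pattern; the business_gain function, the synthetic amount values and the gain/loss rules are hypothetical illustrations rather than part of this commit:

    import numpy as np
    import sklearn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import TunedThresholdClassifierCV

    sklearn.set_config(enable_metadata_routing=True)

    X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
    # Hypothetical per-sample metadata, e.g. transaction amounts.
    amount = np.random.default_rng(0).uniform(10, 500, size=y.shape[0])


    def business_gain(y_observed, y_pred, amount):
        # Gain the amount for each detected positive, lose it for each missed one.
        gain = np.where((y_observed == 1) & (y_pred == 1), amount, 0.0)
        loss = np.where((y_observed == 1) & (y_pred == 0), -amount, 0.0)
        return (gain + loss).mean()


    # Request the `amount` metadata so that cross-validation routes it to the scorer.
    business_scorer = make_scorer(business_gain).set_score_request(amount=True)

    tuned = TunedThresholdClassifierCV(
        LogisticRegression(), scoring=business_scorer, cv=3
    ).fit(X, y, amount=amount)
    print(f"Tuned decision threshold: {tuned.best_threshold_:.3f}")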
@@ -101,14 +101,14 @@
  },
  "outputs": [],
  "source": [
- "from sklearn.datasets import make_low_rank_matrix\nfrom sklearn.decomposition import PCA\n\nX = make_low_rank_matrix(\n n_samples=10_000, n_features=100, tail_strength=0.1, random_state=0\n)\n\npca = PCA(n_components=10).fit(X)\n\nprint(f\"explained variance: {pca.explained_variance_ratio_.sum():.2f}\")"
+ "from sklearn.datasets import make_low_rank_matrix\nfrom sklearn.decomposition import PCA\n\nX = make_low_rank_matrix(\n n_samples=10_000, n_features=100, tail_strength=0.1, random_state=0\n)\n\npca = PCA(n_components=10, svd_solver=\"covariance_eigh\").fit(X)\nprint(f\"Explained variance: {pca.explained_variance_ratio_.sum():.2f}\")"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "The \"full\" solver has also been improved to use less memory and allows to\ntransform faster. The \"auto\" option for the solver takes advantage of the\nnew solver and is now able to select an appropriate solver for sparse\ndatasets.\n\n"
+ "The new solver also accepts sparse input data:\n\n"
  ]
  },
  {
@@ -119,7 +119,14 @@
  },
  "outputs": [],
  "source": [
- "from scipy.sparse import random\n\nX = random(10000, 100, format=\"csr\", random_state=0)\n\npca = PCA(n_components=10, svd_solver=\"auto\").fit(X)"
+ "from scipy.sparse import random\n\nX = random(10_000, 100, format=\"csr\", random_state=0)\n\npca = PCA(n_components=10, svd_solver=\"covariance_eigh\").fit(X)\nprint(f\"Explained variance: {pca.explained_variance_ratio_.sum():.2f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `\"full\"` solver has also been improved to use less memory and allows\nfaster transformation. The default `svd_solver=\"auto\"`` option takes\nadvantage of the new solver and is now able to select an appropriate solver\nfor sparse datasets.\n\nSimilarly to most other PCA solvers, the new `\"covariance_eigh\"` solver can leverage\nGPU computation if the input data is passed as a PyTorch or CuPy array by\nenabling the experimental support for `Array API <array_api>`.\n\n"
  ]
  },
  {
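
The last added cell mentions that the "covariance_eigh" solver can run on GPU-backed inputs through the experimental Array API support. A minimal sketch of what that looks like, assuming PyTorch and the optional array-api-compat dependency are installed (neither is part of this commit); it falls back to CPU tensors when no CUDA device is available:

    import sklearn
    import torch
    from sklearn.decomposition import PCA

    # Opt in to the experimental Array API dispatch.
    sklearn.set_config(array_api_dispatch=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    X_torch = torch.rand(10_000, 100, device=device)

    # The fitted attributes live in the same array namespace/device as the input.
    pca = PCA(n_components=10, svd_solver="covariance_eigh").fit(X_torch)
    print(float(pca.explained_variance_ratio_.sum()))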

dev/_downloads/ba0cfc16d7953e1c2c6912b6beca1e91/plot_release_highlights_1_5_0.py

+87-40
@@ -24,89 +24,136 @@
  # %%
  # FixedThresholdClassifier: Setting the decision threshold of a binary classifier
  # -------------------------------------------------------------------------------
- # All binary classifiers of scikit-learn use a fixed decision threshold of 0.5 to
- # convert probability estimates (i.e. output of `predict_proba`) into class
- # predictions. However, 0.5 is almost never the desired threshold for a given problem.
- # :class:`~model_selection.FixedThresholdClassifier` allows to wrap any binary
- # classifier and set a custom decision threshold.
+ # All binary classifiers of scikit-learn use a fixed decision threshold of 0.5
+ # to convert probability estimates (i.e. output of `predict_proba`) into class
+ # predictions. However, 0.5 is almost never the desired threshold for a given
+ # problem. :class:`~model_selection.FixedThresholdClassifier` allows wrapping any
+ # binary classifier and setting a custom decision threshold.
  from sklearn.datasets import make_classification
+ from sklearn.model_selection import train_test_split
  from sklearn.linear_model import LogisticRegression
- from sklearn.metrics import confusion_matrix
+ from sklearn.metrics import ConfusionMatrixDisplay
+

- X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
- classifier = LogisticRegression(random_state=0).fit(X, y)
+ X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)
+ X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

- print("confusion matrix:\n", confusion_matrix(y, classifier.predict(X)))
+ classifier_05 = LogisticRegression(C=1e6, random_state=0).fit(X_train, y_train)
+ _ = ConfusionMatrixDisplay.from_estimator(classifier_05, X_test, y_test)

  # %%
  # Lowering the threshold, i.e. allowing more samples to be classified as the positive
  # class, increases the number of true positives at the cost of more false positives
  # (as is well known from the concavity of the ROC curve).
  from sklearn.model_selection import FixedThresholdClassifier

- wrapped_classifier = FixedThresholdClassifier(classifier, threshold=0.1).fit(X, y)
-
- print("confusion matrix:\n", confusion_matrix(y, wrapped_classifier.predict(X)))
+ classifier_01 = FixedThresholdClassifier(classifier_05, threshold=0.1)
+ classifier_01.fit(X_train, y_train)
+ _ = ConfusionMatrixDisplay.from_estimator(classifier_01, X_test, y_test)

  # %%
  # TunedThresholdClassifierCV: Tuning the decision threshold of a binary classifier
  # --------------------------------------------------------------------------------
- # The decision threshold of a binary classifier can be tuned to optimize a given
- # metric, using :class:`~model_selection.TunedThresholdClassifierCV`.
- from sklearn.metrics import balanced_accuracy_score
+ # The decision threshold of a binary classifier can be tuned to optimize a
+ # given metric, using :class:`~model_selection.TunedThresholdClassifierCV`.
+ #
+ # It is particularly useful to find the best decision threshold when the model
+ # is meant to be deployed in a specific application context where we can assign
+ # different gains or costs for true positives, true negatives, false positives,
+ # and false negatives.
+ #
+ # Let's illustrate this by considering an arbitrary case where:
+ #
+ # - each true positive gains 1 unit of profit, e.g. euro, year of life in good
+ #   health, etc.;
+ # - true negatives gain or cost nothing;
+ # - each false negative costs 2;
+ # - each false positive costs 0.1.
+ #
+ # Our metric quantifies the average profit per sample, which is defined by the
+ # following Python function:
+ from sklearn.metrics import confusion_matrix
+
+
+ def custom_score(y_observed, y_pred):
+     tn, fp, fn, tp = confusion_matrix(y_observed, y_pred, normalize="all").ravel()
+     return tp - 2 * fn - 0.1 * fp

- # Due to the class imbalance, the balanced accuracy is not optimal for the default
- # threshold. The classifier tends to over predict the majority class.
- print(f"balanced accuracy: {balanced_accuracy_score(y, classifier.predict(X)):.2f}")
+
+ print("Untuned decision threshold: 0.5")
+ print(f"Custom score: {custom_score(y_test, classifier_05.predict(X_test)):.2f}")

  # %%
- # Tuning the threshold to optimize the balanced accuracy gives a smaller threshold
- # that allows more samples to be classified as the positive class.
+ # It is interesting to observe that the average gain per prediction is negative
+ # which means that this decision system is making a loss on average.
+ #
+ # Tuning the threshold to optimize this custom metric gives a smaller threshold
+ # that allows more samples to be classified as the positive class. As a result,
+ # the average gain per prediction improves.
  from sklearn.model_selection import TunedThresholdClassifierCV
+ from sklearn.metrics import make_scorer

+ custom_scorer = make_scorer(
+     custom_score, response_method="predict", greater_is_better=True
+ )
  tuned_classifier = TunedThresholdClassifierCV(
-     classifier, cv=5, scoring="balanced_accuracy"
+     classifier_05, cv=5, scoring=custom_scorer
  ).fit(X, y)

- print(f"new threshold: {tuned_classifier.best_threshold_:.4f}")
- print(
-     f"balanced accuracy: {balanced_accuracy_score(y, tuned_classifier.predict(X)):.2f}"
- )
+ print(f"Tuned decision threshold: {tuned_classifier.best_threshold_:.3f}")
+ print(f"Custom score: {custom_score(y_test, tuned_classifier.predict(X_test)):.2f}")

  # %%
- # :class:`~model_selection.TunedThresholdClassifierCV` also benefits from the
- # metadata routing support (:ref:`Metadata Routing User Guide<metadata_routing>`)
- # allowing to optimze complex business metrics, detailed
- # in :ref:`Post-tuning the decision threshold for cost-sensitive learning
+ # We observe that tuning the decision threshold can turn a machine
+ # learning-based system that makes a loss on average into a beneficial one.
+ #
+ # In practice, defining a meaningful application-specific metric might involve
+ # making those costs for bad predictions and gains for good predictions depend on
+ # auxiliary metadata specific to each individual data point such as the amount
+ # of a transaction in a fraud detection system.
+ #
+ # To achieve this, :class:`~model_selection.TunedThresholdClassifierCV`
+ # leverages metadata routing support (:ref:`Metadata Routing User
+ # Guide<metadata_routing>`) allowing to optimize complex business metrics as
+ # detailed in :ref:`Post-tuning the decision threshold for cost-sensitive
+ # learning
  # <sphx_glr_auto_examples_model_selection_plot_cost_sensitive_learning.py>`.

  # %%
  # Performance improvements in PCA
  # -------------------------------
- # :class:`~decomposition.PCA` has a new solver, "covariance_eigh", which is faster
- # and more memory efficient than the other solvers for datasets with a large number
- # of samples and a small number of features.
+ # :class:`~decomposition.PCA` has a new solver, `"covariance_eigh"`, which is
+ # up to an order of magnitude faster and more memory efficient than the other
+ # solvers for datasets with many data points and few features.
  from sklearn.datasets import make_low_rank_matrix
  from sklearn.decomposition import PCA

  X = make_low_rank_matrix(
      n_samples=10_000, n_features=100, tail_strength=0.1, random_state=0
  )

- pca = PCA(n_components=10).fit(X)
+ pca = PCA(n_components=10, svd_solver="covariance_eigh").fit(X)
+ print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2f}")

- print(f"explained variance: {pca.explained_variance_ratio_.sum():.2f}")

  # %%
- # The "full" solver has also been improved to use less memory and allows to
- # transform faster. The "auto" option for the solver takes advantage of the
- # new solver and is now able to select an appropriate solver for sparse
- # datasets.
+ # The new solver also accepts sparse input data:
  from scipy.sparse import random

- X = random(10000, 100, format="csr", random_state=0)
+ X = random(10_000, 100, format="csr", random_state=0)

- pca = PCA(n_components=10, svd_solver="auto").fit(X)
+ pca = PCA(n_components=10, svd_solver="covariance_eigh").fit(X)
+ print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2f}")
+
+ # %%
+ # The `"full"` solver has also been improved to use less memory and allows
+ # faster transformation. The default `svd_solver="auto"`` option takes
+ # advantage of the new solver and is now able to select an appropriate solver
+ # for sparse datasets.
+ #
+ # Similarly to most other PCA solvers, the new `"covariance_eigh"` solver can leverage
+ # GPU computation if the input data is passed as a PyTorch or CuPy array by
+ # enabling the experimental support for :ref:`Array API <array_api>`.

  # %%
  # ColumnTransformer is subscriptable

dev/_downloads/scikit-learn-docs.zip

45.5 KB
Binary file not shown.

dev/_sources/auto_examples/applications/plot_cyclical_feature_engineering.rst.txt

+1-1

dev/_sources/auto_examples/applications/plot_digits_denoising.rst.txt

+1-1
