k-means clustering based knot-point screening #113

tq21 · 2024-04-19T22:25:38Z

This PR implements a k-means clustering based knot-point screening algorithm. At each basis function level, we run k-means clustering to obtain k knots. The set of knot points obtained this way may be better than the previous quantile-based discretization. The R implementation of k-means is very fast, so the overhead of running it is minimal, while memory use drops significantly because fewer knots are created. Coupled with variable-level screening (MARS-based), this method would be an ideal scalable version of HAL. The performance of k-means screening compared to the quantile-based approach is demonstrated in the simulations below.
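For readers who want to see the idea in isolation, here is a minimal sketch of knot selection for a single covariate. It is not the code in this PR: `kmeans_knots_1d` and `quantile_knots_1d` are hypothetical helper names, and using the cluster centers themselves as knots is an assumption made for illustration. Only base R's `stats::kmeans` and `stats::quantile` are used.

```r
# Hypothetical helper (not the PR's code): pick up to k knot points for one
# covariate by running k-means and using the cluster centers as knots.
kmeans_knots_1d <- function(x, k) {
  ux <- unique(x)
  # kmeans() cannot place more centers than there are distinct values,
  # so in that case every distinct value is kept as a knot.
  if (length(ux) <= k) {
    return(sort(ux))
  }
  fit <- stats::kmeans(matrix(x, ncol = 1), centers = k,
                       nstart = 5, iter.max = 50)
  sort(as.numeric(fit$centers))
}

# Quantile-based discretization for comparison: k interior quantiles.
# Ties in x can collapse quantiles, so fewer than k knots may come back.
quantile_knots_1d <- function(x, k) {
  probs <- seq(0, 1, length.out = k + 2)[-c(1, k + 2)]
  unique(as.numeric(stats::quantile(x, probs = probs)))
}

set.seed(42)
x <- rnorm(500)
km_knots <- kmeans_knots_1d(x, k = 10)
q_knots <- quantile_knots_1d(x, k = 10)
```

The guard on the number of distinct values matters for discrete or heavily tied covariates, where asking for more centers than distinct points would make `kmeans()` fail.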

When the marginal distributions of the covariates are skewed (bottom three plots), full HAL (using all bases) overfits the data in the dense region, potentially because of the large number of knots created from observations in that region (the little blue ticks at the bottom mark all knots created), and the quantile-based HAL fit fails to capture variation in the sparse data regions. The k-means screened algorithm produces a more balanced fit.
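The difference in knot placement on a skewed covariate can be checked with a small sketch. The exponential distribution, sample size, and k = 10 are assumptions chosen for illustration, not the settings of the simulations in this PR. The sketch only compares where the knots land and says nothing about the quality of the resulting HAL fit.

```r
set.seed(2024)
x <- rexp(1000)  # right-skewed covariate (assumed distribution)
k <- 10

# Quantile-based knots: k interior quantiles, which follow the data density
# and therefore cluster in the dense region near zero.
probs <- seq(0, 1, length.out = k + 2)[-c(1, k + 2)]
q_knots <- as.numeric(stats::quantile(x, probs = probs))

# k-means knots: cluster centers, which trade density against spread.
fit <- stats::kmeans(matrix(x, ncol = 1), centers = k,
                     nstart = 10, iter.max = 50)
km_knots <- sort(as.numeric(fit$centers))

# Compare how far each knot set reaches into the sparse right tail.
q90 <- as.numeric(stats::quantile(x, 0.9))
knot_summary <- data.frame(
  method = c("quantile", "k-means"),
  largest_knot = c(max(q_knots), max(km_knots)),
  knots_above_q90 = c(sum(q_knots > q90), sum(km_knots > q90))
)
print(knot_summary)
```

In this setup the largest quantile knot sits at the 10/11 sample quantile by construction, so the top 9% or so of the covariate range gets no knot at all, while k-means typically places several centers there.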

K-means has better prediction accuracy on test sets when the number of covariates increases and the covariate distributions are skewed. Because the total number of knots created by the quantile-based method cannot be predetermined, it is hard to match the number of knots from k-means exactly, so this simulation is purposely designed to be biased against k-means, which gets a thinner HAL design matrix. Even with this disadvantage, k-means still beats the quantile-based method, so it is more memory efficient. This might be because a quantile is a univariate measure, whereas we run k-means at each basis function level: for a basis of three variables, it selects k points in the three-dimensional space.
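A rough sketch of the multivariate case, again with hypothetical names (`kmeans_knots_level`, `X`, `cols`) and simulated data that are not taken from this PR: for a basis involving a given subset of covariates, k-means is run on those columns jointly, so each knot is a point in the joint space rather than a combination of marginal quantiles.

```r
# Hypothetical helper: choose up to k knot points for one basis-function
# level, i.e. for the subset of covariates `cols` that the basis involves.
kmeans_knots_level <- function(X, cols, k) {
  sub <- X[, cols, drop = FALSE]
  distinct <- unique(sub)
  # With no more than k distinct rows, keep them all as knots.
  if (nrow(distinct) <= k) {
    return(distinct)
  }
  fit <- stats::kmeans(sub, centers = k, nstart = 5, iter.max = 50)
  fit$centers  # k x length(cols) matrix, one knot per row
}

set.seed(7)
n <- 300
X <- cbind(x1 = rexp(n), x2 = rnorm(n), x3 = runif(n))

# Three-way basis: k points in three-dimensional space.
knots_123 <- kmeans_knots_level(X, cols = c("x1", "x2", "x3"), k = 20)
# Two-way basis: k points in the (x1, x2) plane.
knots_12 <- kmeans_knots_level(X, cols = c("x1", "x2"), k = 20)
dim(knots_123)
```

Note that k-means centers are cluster means and need not coincide with observed data points. Whether knots should be snapped to observations is a design choice this sketch does not address.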

tq21 added 3 commits April 19, 2024 14:50

k-means clustering based knot-point screening

749ce3e

minor fix for checking unique number of knots

2d11694

pam

a81af69
