@tq21 tq21 commented Apr 19, 2024

This PR implements a knot-point screening algorithm based on k-means clustering. At each basis function level, we run k-means clustering to get k knots. The set of knot points obtained this way may be better than the one from the previous quantile-based discretization. The R implementation of k-means is very fast, so the overhead of running it is minimal, while memory use drops significantly because of the reduced number of knots. Coupled with variable-level screening (MARS-based), this method would be an ideal scalable version of HAL. The performance of k-means screening compared to the quantile-based approach is demonstrated in the simulations below.
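To make the per-level procedure concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not the R code in this PR: the function names `kmeans_knots` and `screen_knots` are hypothetical, a "basis function level" is assumed to mean every subset of covariates up to `max_degree`, and the cluster centers themselves are assumed to serve as knots.

```python
import itertools
import random


def kmeans_knots(points, k, n_iter=50, seed=0):
    """Return up to k cluster centers (Lloyd's algorithm) to use as knots.

    `points` is a list of equal-length tuples. Assumption: the centers
    themselves are used as knots, not the nearest observed points.
    """
    rng = random.Random(seed)
    unique = sorted(set(points))
    if len(unique) <= k:
        return unique  # fewer distinct points than knots: keep them all
    centers = rng.sample(unique, k)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to its cluster mean
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def screen_knots(X, k, max_degree=2):
    """Run k-means separately for every covariate subset up to max_degree."""
    n_cov = len(X[0])
    knots = {}
    for degree in range(1, max_degree + 1):
        for cols in itertools.combinations(range(n_cov), degree):
            projected = [tuple(row[j] for j in cols) for row in X]
            knots[cols] = kmeans_knots(projected, k)
    return knots


# Usage: 200 observations of 3 skewed covariates, 5 knots per basis.
data_rng = random.Random(1)
X = [[data_rng.expovariate(1.0) for _ in range(3)] for _ in range(200)]
knots = screen_knots(X, k=5, max_degree=2)
```

In this sketch the number of knots per basis is fixed at `k` regardless of the sample size, which is where the memory savings described above would come from.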

  1. When the marginal distributions of the covariates are skewed (bottom three plots), full HAL (using all bases) overfits the data in the dense region, potentially because of the large number of knots created from observations in that region (the small blue ticks at the bottom mark all knots created), and the quantile-based HAL fit fails to capture variation in the sparse data regions. The k-means-screened algorithm produces a more balanced fit:
Screenshot 2024-04-19 at 3 05 51 PM
  2. K-means has better prediction accuracy on test sets when the number of covariates increases and the covariate distributions are skewed. Because the total number of knots created by the quantile-based method cannot be predetermined, it is hard to make the number of knots from k-means match exactly, so this simulation is purposely designed to be biased against k-means. However, even with this disadvantage, i.e. a thinner HAL design matrix, k-means still beats the quantile-based method, so it is more memory efficient. This might be because a quantile is a univariate measure, whereas we run k-means at each basis function level, so for a basis of three variables it selects k points in the three-dimensional space:
Screenshot 2024-04-19 at 3 06 45 PM
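The difference in knot placement on skewed data can be seen in a small one-dimensional sketch. It is a toy illustration in plain Python under stated assumptions, not the simulation code behind the plots: the helper names `quantile_knots` and `kmeans_knots_1d` are hypothetical, the sample is a deterministic set of standard-exponential quantiles, and k-means is started from the quantile knots.

```python
import math


def quantile_knots(xs, k):
    """k knots at evenly spaced quantile levels j/(k+1), j = 1..k."""
    s = sorted(xs)
    return [s[(j * len(s)) // (k + 1)] for j in range(1, k + 1)]


def kmeans_knots_1d(xs, k, n_iter=100):
    """1-D Lloyd's algorithm, started from the quantile knots."""
    centers = quantile_knots(xs, k)
    for _ in range(n_iter):
        # Assignment step: each observation goes to its nearest center.
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Update step: cluster means (an empty cluster keeps its center).
        updated = [sum(c) / len(c) if c else m
                   for c, m in zip(clusters, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


# Deterministic right-skewed sample: quantiles of a standard exponential.
n = 1000
xs = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

q_knots = quantile_knots(xs, 5)
km_knots = kmeans_knots_1d(xs, 5)
print("quantile knots:", [round(v, 2) for v in q_knots])
print("k-means knots: ", [round(v, 2) for v in km_knots])
```

On this sample the largest k-means knot sits further into the sparse right tail than the largest quantile knot, which matches the intuition in point 1 that quantile knots concentrate where the data are dense.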
