Rule extraction with Bayesian Logistic Regression

Hello!

I am a beginner in Bayesian modelling and am looking for advice on a specific implementation. I am working with small datasets, and I thought that setting priors from the posteriors of a model trained on similar data would help (I work with languages, so the idea is to train a model on a closely related language and then initialize another model with its posteriors).

However, my main task is to work out which factors affect the choice of answer, so I need some kind of feature selection / rule extraction algorithm.

I have been looking into Bayesian sparse logistic regression and into adapting this tutorial. However, maybe I am missing something and there are other ways of doing this?

I would appreciate any advice!

Can I assume you essentially have a linear model of the covariates (Bx) and a logistic transform?

If so, then with the fairly standard practice of putting a Normal prior on the betas you get L2 regularization for free, or you can get L1-style regularization by using a Laplace prior.
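Roughly something like this, using the current PyMC API (the toy data, prior scales, and variable names are just placeholders):

```python
import numpy as np
import pymc as pm

# Placeholder data: X is an (n, p) design matrix, y is a binary outcome.
X = np.random.randn(100, 5)
y = np.random.binomial(1, 0.5, size=100)

with pm.Model() as logistic_model:
    # Normal prior on the betas ~ L2 (ridge) shrinkage.
    # Swap in pm.Laplace("beta", mu=0, b=1, shape=X.shape[1]) for L1-style shrinkage.
    beta = pm.Normal("beta", mu=0, sigma=1, shape=X.shape[1])
    intercept = pm.Normal("intercept", mu=0, sigma=5)

    # Linear predictor plus logistic transform.
    p = pm.math.sigmoid(intercept + pm.math.dot(X, beta))
    pm.Bernoulli("obs", p=p, observed=y)

    idata = pm.sample()
```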

There is some (quite old, possibly out of date) code discussed here: How to add an L1 Regularization on the likelihood when use pymc3 to sample a MCMC, although I expect you can find newer discussions / code approaches here too.

There are a lot of ways to do this.

Unfortunately, this isn’t one of them. If you use these as priors and look at Bayesian posterior means, there is zero probability (measure zero) that you will get a posterior mean of exactly zero, because the posterior is a continuous distribution.

L1 (lasso) regularization can force actual zeros if you use maximum likelihood. L2 (ridge) won’t.
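A quick scikit-learn check of that claim, if it helps to see it concretely (toy data; the penalty strength C=0.5 is arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: only the first two covariates actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
ridge = LogisticRegression(penalty="l2", C=0.5).fit(X, y)

print("L1 zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # typically > 0
print("L2 zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # typically 0
```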

The only way to assign non-zero probability to zero in the posterior is to have non-zero probability at zero in the prior, which means a spike-and-slab prior. That is, you make the prior a mixture of a point mass at zero and a continuous density elsewhere. This lets the prior, and hence the posterior, assign probability mass to zero. Otherwise you never get posterior probability mass at zero in a Bayesian approach, because {0} is a measure-zero set under a continuous distribution.

[EDIT: But, you’re not going to be able to fit spike-and-slab with HMC/NUTS, because the marginalization is combinatorial (it will work with a handful of coefficients, but not more). You might be able to get PyMC to fit it by sampling slab/no-slab with a discrete sampler, but the problem there is that this is an NP-hard problem in general, so no way to guarantee you get reasonable answers everywhere in reasonable time.]
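For what it’s worth, a sketch of the slab/no-slab indicator idea in PyMC (toy data; the variable names and prior scales are just placeholders). PyMC will assign a binary sampler to the indicators and NUTS to the continuous parameters, but as noted above, don’t expect this to behave well beyond a handful of coefficients:

```python
import numpy as np
import pymc as pm

# Keep p small: the inclusion space is combinatorial (2^p configurations).
X = np.random.randn(100, 5)
y = np.random.binomial(1, 0.5, size=100)

with pm.Model() as spike_slab_model:
    # Inclusion indicators: 1 means the covariate gets the slab, 0 means exactly zero.
    inclusion = pm.Bernoulli("inclusion", p=0.5, shape=X.shape[1])

    # Slab: continuous density for the included coefficients.
    slab = pm.Normal("slab", mu=0, sigma=2, shape=X.shape[1])

    # Effective coefficient is exactly zero whenever the indicator is zero.
    beta = pm.Deterministic("beta", inclusion * slab)
    intercept = pm.Normal("intercept", mu=0, sigma=5)

    p = pm.math.sigmoid(intercept + pm.math.dot(X, beta))
    pm.Bernoulli("obs", p=p, observed=y)

    # PyMC picks a discrete sampler for `inclusion` and NUTS for the rest.
    idata = pm.sample()
```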

P.S. There’s an elastic net prior that combines the good parts of L1 (you can get actual zeros) and L2 (it keeps coefficients identified under collinearity). But again, it won’t work with Bayesian posterior inference, only with penalized maximum likelihood. And even then, you need a special optimizer to get an actual value of zero after finitely many iterations (it has to truncate at zero, otherwise it will see-saw).
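If you go the penalized-maximum-likelihood route, scikit-learn’s saga solver already does that truncation for you via a proximal step (the l1_ratio and C values here are arbitrary, and the toy data is the same kind of placeholder as above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Elastic net penalized maximum likelihood: l1_ratio mixes the L1 and L2 penalties.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.5, max_iter=5000)
enet.fit(X, y)
print("zeroed coefficients:", int((enet.coef_ == 0).sum()))
```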

Have you seen this continuous spike-and-slab prior? I played around with it a bit and it seemed to do reasonable things, but I admittedly don’t have a lot of experience working with shrinkage priors to know what “reasonable” is.

The horseshoe prior is also a continuous form of the spike-and-slab. The only problem is that if you have a continuous prior, you have a continuous posterior, so there’s no way to get non-zero probability mass at zero. Oddly, the paper you linked doesn’t seem to mention this. If all you care about is predictive performance, shrinking to nearly zero is good enough (after you take the scale of the covariates into account). But if you have a bajillion covariates and want to trim them for run-time speed, there’s a post-processing step you need to do.
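Here’s a rough sketch of what I mean, using a plain (non-regularized) horseshoe in PyMC; the toy data and the 0.1 trimming cutoff are purely illustrative, and the cutoff only makes sense on standardized covariates:

```python
import numpy as np
import pymc as pm

X = np.random.randn(100, 5)
y = np.random.binomial(1, 0.5, size=100)

with pm.Model() as horseshoe_model:
    # Horseshoe: global scale tau, local scales lam_j.
    tau = pm.HalfCauchy("tau", beta=1)
    lam = pm.HalfCauchy("lam", beta=1, shape=X.shape[1])
    beta = pm.Normal("beta", mu=0, sigma=tau * lam, shape=X.shape[1])
    intercept = pm.Normal("intercept", mu=0, sigma=5)

    p = pm.math.sigmoid(intercept + pm.math.dot(X, beta))
    pm.Bernoulli("obs", p=p, observed=y)

    # The heavy tails make sampling harder; a higher target_accept helps.
    idata = pm.sample(target_accept=0.95)

# Post-processing: the posterior never puts mass exactly at zero, so trim
# covariates whose posterior mean falls below an arbitrary cutoff.
beta_mean = idata.posterior["beta"].mean(dim=("chain", "draw")).values
keep = np.abs(beta_mean) > 0.1
print("covariates kept:", np.where(keep)[0])
```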
