License: CC BY 4.0
arXiv:2604.20322v1 [stat.ME] 22 Apr 2026

Zero-Inflated Logistic Regression Models with Shared Design: Identifiability, Existence of Estimates, and a Relabeling Rule

Yui Tomo
Department of Epidemiology, National Institute of Infectious Diseases, Japan Institute for Health Security, 1-23-1 Toyama, Shinjuku-ku, Tokyo 162-0052, Japan
E-mail: tomo.y@jihs.go.jp

Shinto Eguchi
The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan

Daisuke Yoneoka
Department of Epidemiology, National Institute of Infectious Diseases, Japan Institute for Health Security, 1-23-1 Toyama, Shinjuku-ku, Tokyo 162-0052, Japan

Abstract

The zero-inflated logistic regression model accommodates binary responses with excess zeros, which often arise from a latent mixture of susceptible and insusceptible subpopulations or asymmetric misclassification of the response. The model has two components: regression for the binary response and a latent binary indicator for the zero-inflation state. In applied settings, it is common to use the same design matrix for both components if there is no prior knowledge. However, this shared-design specification lacks guaranteed identifiability of the regression parameters, as established in prior works. This paper investigates the theoretical properties of the zero-inflated logistic regression model under the shared-design setting and computational methods for applications. First, to motivate the use of the zero-inflated model, we prove that ignoring the zero-inflation mechanism can lead to a sign flip in the pseudo-true coefficient value relative to the true value. We then establish sufficient conditions for the existence of the maximum likelihood estimate. As a main result, we establish that the model under the shared-design setting is identifiable up to exchange symmetry of the parameters for two components and that the expected log-likelihood has a unique maximizer on the resulting quotient space. The posterior bimodality is examined using a Pólya-Gamma Gibbs sampler with replica exchange. Finally, we propose a simple relabeling rule to select a single ordered parameter pair, and evaluate its performance through simulation studies and an application to self-reported diabetes data.

Keywords: Asymmetric misclassification; Data separation; Exchange symmetry; Pólya-Gamma augmentation; Replica exchange.

1 Introduction

The logistic regression model is a standard tool for binary outcomes and remains attractive due to its simplicity and interpretability. However, in many applied settings, the observed response contains more zeros than a standard logistic model can accommodate. For example, such excess zeros may arise from a latent mixture of susceptible and insusceptible subpopulations, such as biological immunity, or one-sided outcome misclassification, such as the failure to record an event due to delayed reporting. To accommodate these situations, a natural extension is a zero-inflated logistic regression model (Hall, 2000; Diop et al., 2011). This model expresses the observed response as a mixture of a standard logistic regression and a structural zero, where the latent binary indicator for the zero-inflation state is itself modeled via a logistic regression. We refer to these two regression components as the ordinary logistic regression component and the structural-zero component, respectively. This formulation captures complex data-generating mechanisms while retaining interpretability.

Earlier work includes the three-parameter logistic model in psychometrics and related methods in ecology and epidemiology (Wainer et al., 2007; Komori et al., 2016; Nagelkerke and Fidler, 2015). Furthermore, robust estimation approaches based on label-noise modeling or robust divergence have also been studied (Bootkrajang and Kabán, 2012, 2013; Hung et al., 2018; Fujisawa and Eguchi, 2008). From the perspective of excess zeros, these settings are closely connected: positive outcomes that are systematically recorded as zeros induce a structural-zero mechanism in the observed data. From the theoretical perspective, Diop et al. (2011) established sufficient conditions for identifiability of the zero-inflated logistic regression model. One such condition requires at least one continuous covariate that appears in one component but not in the other.

Although the zero-inflated logistic model has been explored both practically and theoretically, several gaps remain in the understanding of its theoretical properties. This study focuses on the practically important scenario in which no reliable prior information is available regarding which covariates should enter each component. A common empirical choice is then to use the same design matrix in both components of the model. In this case, the model faces a theoretical difficulty: once the two components share the same covariates, the likelihood becomes invariant under exchange of the two coefficient vectors. The model is therefore not identifiable as an ordered pair, and both optimization and posterior simulation may exhibit symmetric multiple modes. This symmetry is analogous to the label-switching phenomenon in mixture models (Frühwirth-Schnatter, 2006). However, such non-identifiability has yet to be characterized.

Another theoretical issue concerns the existence of the maximum likelihood estimate. For ordinary logistic regression, non-existence of the estimate under separation is well-known (Albert and Anderson, 1984; Silvapulle, 1981). For the zero-inflated logistic model, however, the analogous conditions have received less attention.

Our contributions are fourfold. First, we prove that model misspecification can lead to a sign flip in the pseudo-true parameter relative to the true coefficient. Second, we introduce a double-separation condition for the zero-inflated logistic model and derive sufficient conditions for the existence of the estimates. Third, we establish that the model under a shared design matrix is identifiable up to exchange symmetry and that, under mild regularity conditions, the expected log-likelihood has a unique maximizer on the resulting quotient space. Furthermore, we investigate the resulting multimodality numerically through a Pólya-Gamma Gibbs sampler integrated with replica exchange (Polson et al., 2013; Swendsen and Wang, 1986). Fourth, we propose a relabeling rule based on an estimate from the ordinary logistic regression model and conduct a numerical study.

The remainder of the paper is structured as follows. Section 2 introduces the model, studies the sign-flip phenomenon under misspecification, and presents sufficient conditions for the existence and non-existence of the estimates. Section 3 studies identifiability under the shared design setting, formalizes the exchange-symmetry structure, and provides a numerical illustration of the resulting posterior bimodality. Section 4 proposes a relabeling rule and conducts a simulation study. Section 5 presents an illustrative application to NHANES self-reported diabetes data. Section 6 discusses limitations and future work. Section 7 concludes the paper.

2 Zero-inflated Logistic Regression Model

In this section, we define the zero-inflated logistic regression model and study two basic aspects of the model before turning to the shared-design case. First, we examine the consequence of misspecification when zero inflation is ignored and a standard logistic regression model is fitted instead. Second, we consider the existence of maximum likelihood estimates and introduce separation-type conditions that provide sufficient criteria for non-existence and existence.

2.1 Model Definition

Let $y_{i}\in\{0,1\}$ ($i=1,\ldots,n$) denote the binary responses for $n$ observations, and let $x_{i}=\left(1,x_{i,1},\ldots,x_{i,d-1}\right)^{\top}\in\mathbb{R}^{d}$ denote the covariate vector. Let $\beta=\left(\beta_{0},\ldots,\beta_{d-1}\right)^{\top}\in\mathbb{R}^{d}$ be the corresponding parameter vector. We assume that a latent variable $h_{i}\in\{0,1\}$ determines whether the response is constrained to be zero, where $h_{i}=0$ indicates the zero state. Let $z_{i}=\left(1,z_{i,1},\ldots,z_{i,p-1}\right)^{\top}\in\mathbb{R}^{p}$ denote the covariate vector for $h_{i}$, and let $\gamma=\left(\gamma_{0},\ldots,\gamma_{p-1}\right)^{\top}\in\mathbb{R}^{p}$ denote the corresponding parameter vector. Let $F(\cdot)$ be the inverse logit function:

$$F(\mu):=\exp(\mu)/\left\{1+\exp(\mu)\right\},\quad\text{for}\quad\mu\in\mathbb{R}.$$

Then, the zero-inflated logistic regression model is defined as

$$p(y_{i}\mid x_{i},z_{i},\beta,\gamma)=q(h_{i}=1\mid z_{i},\gamma)\cdot p(y_{i}\mid x_{i},\beta)+q(h_{i}=0\mid z_{i},\gamma)\cdot\mathrm{I}(y_{i}=0),$$

where $p(y_{i}=1\mid x_{i},\beta)=F(\beta^{\top}x_{i})$ and $q(h_{i}=1\mid z_{i},\gamma)=F(\gamma^{\top}z_{i})$. We refer to the logistic regression components $p(y_{i}\mid x_{i},\beta)=F(\beta^{\top}x_{i})^{y_{i}}(1-F(\beta^{\top}x_{i}))^{1-y_{i}}$ and $q(h_{i}\mid z_{i},\gamma)=F(\gamma^{\top}z_{i})^{h_{i}}(1-F(\gamma^{\top}z_{i}))^{1-h_{i}}$ as the ordinary logistic regression component and the structural-zero component, respectively. In this context, $q(h_{i}=1\mid z_{i},\gamma)$ represents the probability that the $i$-th observation is not a structural zero, corresponding to the probability of susceptibility to the event. Then, we have

$$p(y_{i}\mid x_{i},z_{i},\beta,\gamma)=\left\{F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})\right\}^{y_{i}}\left\{1-F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})\right\}^{1-y_{i}}.$$
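The collapsed Bernoulli form can be checked by evaluating the mixture definition at each value of $y_{i}$. The short derivation below uses only the quantities defined in this subsection and is offered as a reading aid rather than as part of the original argument.

```latex
\begin{aligned}
p(y_{i}=1\mid x_{i},z_{i},\beta,\gamma)
  &= F(\gamma^{\top}z_{i})\,F(\beta^{\top}x_{i})
     + \{1-F(\gamma^{\top}z_{i})\}\cdot 0
   = F(\gamma^{\top}z_{i})\,F(\beta^{\top}x_{i}),\\
p(y_{i}=0\mid x_{i},z_{i},\beta,\gamma)
  &= F(\gamma^{\top}z_{i})\,\{1-F(\beta^{\top}x_{i})\}
     + \{1-F(\gamma^{\top}z_{i})\}\cdot 1
   = 1-F(\gamma^{\top}z_{i})\,F(\beta^{\top}x_{i}).
\end{aligned}
```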

Therefore, the log-likelihood function is defined as

$$L(\beta,\gamma)=\sum_{i=1}^{n}\left[y_{i}\log\left\{F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})\right\}+(1-y_{i})\log\left\{1-F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})\right\}\right].$$
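As a minimal sketch of how this log-likelihood might be evaluated numerically, the following NumPy function works on the log scale to limit overflow. The function names `log_sigmoid` and `zi_loglik` and the array layout (rows of `X` and `Z` are $x_{i}^{\top}$ and $z_{i}^{\top}$) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def log_sigmoid(u):
    """Stable log F(u) = -log(1 + exp(-u)) for the inverse logit F."""
    return -np.logaddexp(0.0, -u)

def zi_loglik(beta, gamma, X, Z, y):
    """Log-likelihood L(beta, gamma) of the zero-inflated logistic model.

    X : (n, d) design for the ordinary logistic component (hypothetical layout)
    Z : (n, p) design for the structural-zero component
    y : (n,) binary responses
    """
    # log{F(gamma'z_i) F(beta'x_i)} computed as a sum of two log-sigmoids
    log_pi = log_sigmoid(X @ beta) + log_sigmoid(Z @ gamma)
    # log{1 - F(gamma'z_i) F(beta'x_i)}; this is -inf only if the product
    # rounds to exactly 1, i.e. when both linear predictors are very large
    log_one_minus_pi = np.log1p(-np.exp(log_pi))
    return float(np.sum(y * log_pi + (1.0 - y) * log_one_minus_pi))
```

Working with `np.logaddexp` avoids forming `exp` of large positive arguments; other stabilizations are equally reasonable.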

2.2 Sign-Flip Phenomenon Under Misspecification

To motivate the use of the zero-inflated logistic regression model, we examine the consequences of ignoring zero-inflation and fitting a standard logistic regression model to data generated from the zero-inflated model. When the standard logistic regression model is applied to data with excess zeros, the resulting estimator can be severely biased. In particular, when a covariate is positively associated with the response $y_{i}$ but negatively associated with the latent indicator $h_{i}$ (or vice versa), the fitted logistic regression model may estimate a regression coefficient whose sign is opposite to the underlying true value. In this subsection, we formalize this phenomenon under a random-design setting.

Let $x\in\mathbb{R}^{d}$ and $z\in\mathbb{R}^{p}$ denote covariate vectors in the ordinary logistic regression component and the structural-zero component, respectively, and suppose that they share a common covariate indexed by $j$:

$$\begin{aligned}x&=(1,\tilde{x}_{j},\tilde{x}_{-j}^{\top})^{\top},\\ z&=(1,\tilde{z}_{j},\tilde{z}_{-j}^{\top})^{\top},\end{aligned}$$

where $\tilde{x}_{-j}\in\mathbb{R}^{d-2}$, $\tilde{z}_{-j}\in\mathbb{R}^{p-2}$, and the $j$th covariate is shared between the two vectors: $\tilde{x}_{j}=\tilde{z}_{j}$. The corresponding regression coefficients are decomposed as

$$\begin{aligned}\beta&=(\beta_{0},\beta_{j},\beta_{-j}^{\top})^{\top},\\ \gamma&=(\gamma_{0},\gamma_{j},\gamma_{-j}^{\top})^{\top}.\end{aligned}$$

Without loss of generality, we focus on a covariate that has a positive effect on $y$ but a negative effect on $h$, and let

$$\begin{aligned}\beta_{j}&=:a>0,\\ \gamma_{j}&=:c<0.\end{aligned}$$

Under the zero-inflated logistic regression model, the conditional event occurrence probability can be written as

$$\pi_{c}(x,z):=F(\beta_{0}+a\tilde{x}_{j}+\beta_{-j}^{\top}\tilde{x}_{-j})\,F(\gamma_{0}+c\tilde{x}_{j}+\gamma_{-j}^{\top}\tilde{z}_{-j}).$$

We then consider the ordinary logistic regression model where the conditional event occurrence probability is specified as $F(\theta_{0}+t\tilde{x}_{j}+\theta_{-j}^{\top}\tilde{x}_{-j})$. Here, $(\theta_{0},t,\theta_{-j})\in\mathbb{R}\times\mathbb{R}\times\mathbb{R}^{d-2}$ is the regression coefficient vector. Under the true zero-inflated model, the expected log-likelihood function of this misspecified logistic regression is given by

$$\mathcal{L}_{c}(\theta_{0},t,\theta_{-j}):=\mathbb{E}_{(x,z)}\left[\pi_{c}(x,z)\,\log F(\theta_{0}+t\tilde{x}_{j}+\theta_{-j}^{\top}\tilde{x}_{-j})+\{1-\pi_{c}(x,z)\}\,\log\{1-F(\theta_{0}+t\tilde{x}_{j}+\theta_{-j}^{\top}\tilde{x}_{-j})\}\right],$$

where the expectation is taken with respect to the joint distribution of $(x,z)$.

We then define the profile objective function

$$g_{c}(t):=\sup_{(\theta_{0},\theta_{-j})\in\mathbb{R}\times\mathbb{R}^{d-2}}\mathcal{L}_{c}(\theta_{0},t,\theta_{-j}),$$

and let $t^{*}(c)\in\arg\max_{t\in\mathbb{R}}g_{c}(t)$. The following theorem shows that, when the magnitude of the negative association in the zero-inflation component is sufficiently large, every pseudo-true value of the $j$th coefficient is negative.

Theorem 2.1.

Suppose that Assumption A.1 in Appendix A.1 holds. Then there exists a constant $C_{0}<0$ such that

$$c\leq C_{0}\quad\Longrightarrow\quad\arg\max_{t\in\mathbb{R}}g_{c}(t)\subset(-\infty,0).$$

See Appendix A.1 for the proof. This theorem implies that, although the true coefficient satisfies $\beta_{j}=a>0$, every pseudo-true value of the same coefficient under the misspecified conventional logistic regression model is negative when $\gamma_{j}=c$ is negative and $|c|$ is sufficiently large. Specifically, any choice $t^{*}(c)\in\arg\max_{t\in\mathbb{R}}g_{c}(t)$ satisfies $t^{*}(c)<0$.
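The sign flip can be illustrated with a small finite-sample simulation. The sketch below is not taken from the paper: the single standard normal covariate, the values $a=1$ and $c=-5$, the zero intercepts, the sample size, and the plain Newton-Raphson fit are all assumptions chosen for illustration, and a finite-sample fit is only a proxy for the pseudo-true value in Theorem 2.1.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Ordinary logistic regression by plain Newton-Raphson (no safeguards)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                          # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X   # observed information
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

def simulate_sign_flip(a=1.0, c=-5.0, n=20000, seed=0):
    """Generate zero-inflated data with beta_j = a > 0 and gamma_j = c < 0,
    then fit the misspecified ordinary logistic regression."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x])
    h = rng.random(n) < sigmoid(c * x)        # latent indicator, gamma = (0, c)
    y_star = rng.random(n) < sigmoid(a * x)   # ordinary component, beta = (0, a)
    y = (h & y_star).astype(float)            # observed response y = h * y_star
    return fit_logistic(X, y)

theta_hat = simulate_sign_flip()
# theta_hat[1] is the fitted slope of the misspecified model; the true beta_j is a = 1.
```

Under these assumed settings the fitted slope of the misspecified model comes out negative even though the true coefficient $a$ is positive, which is the direction of the effect described by the theorem.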

2.3 Maximum Likelihood Estimation

The maximum likelihood estimator (MLE) $(\hat{\beta}^{\top},\hat{\gamma}^{\top})^{\top}\in\mathbb{R}^{d+p}$ is defined as any maximizer of $L(\beta,\gamma)$. However, due to the structure of the zero-inflated logistic model, the existence of an estimate is not always guaranteed. To investigate this issue, we introduce the concept of double separation, an analogue of the separation condition for the standard logistic regression.

Definition 2.1 (Double separation).

The dataset $\{(y_{i},x_{i},z_{i})\}_{i=1}^{n}$ is said to satisfy double separation if there exist non-zero vectors $v\in\mathbb{R}^{d}$ and $w\in\mathbb{R}^{p}$ such that for every $i=1,\dots,n$,

$$\begin{cases}v^{\top}x_{i}\geq 0,\quad w^{\top}z_{i}\geq 0&\text{if }y_{i}=1,\\ v^{\top}x_{i}\leq 0,\quad w^{\top}z_{i}\leq 0&\text{if }y_{i}=0,\end{cases}$$

and at least one of the inequalities is strict for at least one observation: there exists $j\in\{1,\dots,n\}$ such that either $y_{j}=1$ with $(v^{\top}x_{j},w^{\top}z_{j})\neq(0,0)$ or $y_{j}=0$ with $(v^{\top}x_{j},w^{\top}z_{j})\neq(0,0)$.

Proposition 2.2.

If the dataset satisfies double separation, then the log-likelihood $L(\beta,\gamma)$ has no maximizer in $\mathbb{R}^{d}\times\mathbb{R}^{p}$.

See Appendix A.2 for the proof.
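To make Definition 2.1 concrete, the following sketch checks whether a given candidate pair $(v,w)$ certifies double separation for a dataset. The function name, the tolerance argument, and the array layout are hypothetical. The check only verifies a supplied candidate; searching for such a pair is a separate feasibility problem that is not attempted here.

```python
import numpy as np

def certifies_double_separation(v, w, X, Z, y, tol=0.0):
    """Return True if the given (v, w) satisfies Definition 2.1 on the data.

    X : (n, d) rows x_i', Z : (n, p) rows z_i', y : (n,) binary responses.
    tol is a hypothetical numerical tolerance (0.0 reproduces the definition).
    """
    v, w, y = np.asarray(v, float), np.asarray(w, float), np.asarray(y)
    if not (np.any(v != 0) and np.any(w != 0)):
        return False                        # both vectors must be non-zero
    sx, sz = X @ v, Z @ w                   # v'x_i and w'z_i for all i
    pos, neg = (y == 1), (y == 0)
    weak = (np.all(sx[pos] >= -tol) and np.all(sz[pos] >= -tol)
            and np.all(sx[neg] <= tol) and np.all(sz[neg] <= tol))
    # strictness: some observation has (v'x_j, w'z_j) != (0, 0)
    strict = np.any((np.abs(sx) > tol) | (np.abs(sz) > tol))
    return bool(weak and strict)
```

For instance, with a single covariate whose sign perfectly predicts the response and $z_{i}=x_{i}$, the pair $v=w=(0,1)^{\top}$ is a certificate, and Proposition 2.2 then implies that no maximizer exists.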

Furthermore, we introduce the following condition to guarantee the existence of a maximum likelihood estimate.

Definition 2.2 ($\varepsilon$–double–non-separation).

The dataset $\{(y_{i},x_{i},z_{i})\}_{i=1}^{n}$ is said to satisfy $\varepsilon$–double–non-separation if there exists a constant $\varepsilon>0$ such that

$$\inf_{(v,w):\|v\|^{2}+\|w\|^{2}=1}\max\left\{-\min_{i\in\{i:y_{i}=1\}}\{v^{\top}x_{i}+w^{\top}z_{i}\},~\max_{i\in\{i:y_{i}=0\}}\min\{v^{\top}x_{i},~w^{\top}z_{i}\}\right\}\geq\varepsilon.$$

We then establish the following proposition, which provides a sufficient condition for the existence of an estimate.

Proposition 2.3.

If the dataset satisfies $\varepsilon$–double–non-separation for some $\varepsilon>0$, then a maximizer of $L(\beta,\gamma)$ exists.

See Appendix A.3 for the proof.

The $\varepsilon$–double–non-separation condition implies that for any unit direction $(v,w)$, there exists at least one observation located at a signed distance of at least $\varepsilon$ from the separating hyperplane associated with that direction. In other words, the dataset maintains a uniform margin of width $\varepsilon$ that prevents double separation.
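The inner objective of Definition 2.2 is easy to evaluate at a fixed direction, which gives a rough diagnostic. In the sketch below, `margin` evaluates the max-term at one unit direction, and `margin_upper_bound` takes the smallest value over random unit directions. Both names and the random-search strategy are assumptions for illustration. Random search yields only an upper bound on the infimum: a non-positive value shows that the condition fails, whereas a positive value proves nothing.

```python
import numpy as np

def margin(v, w, X, Z, y):
    """Inner objective of Definition 2.2 at a direction (v, w).

    Assumes the data contain at least one y_i = 1 and one y_i = 0.
    """
    y = np.asarray(y)
    sx, sz = X @ v, Z @ w
    pos, neg = (y == 1), (y == 0)
    term_pos = -np.min(sx[pos] + sz[pos])            # -min over y_i = 1
    term_neg = np.max(np.minimum(sx[neg], sz[neg]))  # max-min over y_i = 0
    return float(max(term_pos, term_neg))

def margin_upper_bound(X, Z, y, n_dirs=2000, seed=0):
    """Smallest margin over random unit directions (upper bound on the inf)."""
    rng = np.random.default_rng(seed)
    d, p = X.shape[1], Z.shape[1]
    best = np.inf
    for _ in range(n_dirs):
        u = rng.standard_normal(d + p)
        u /= np.linalg.norm(u)                       # ||v||^2 + ||w||^2 = 1
        best = min(best, margin(u[:d], u[d:], X, Z, y))
    return best
```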

Even when $\varepsilon$–double–non-separation holds with $\varepsilon$ close to zero, the surface of the log-likelihood function may be nearly flat along certain directions, and numerical optimization may be unstable in practice. In such settings, penalized estimation approaches may be useful.

The non-existence result in this section parallels a well-known phenomenon in ordinary logistic regression, where data separation leads to the non-existence of the maximum likelihood estimate (Albert and Anderson, 1984; Silvapulle, 1981).

3 Shared-Design Model

We now focus on the case $p=d$ and $z_{i}=x_{i}$ for all $i=1,\ldots,n$. This is the configuration that arises when the analyst has no prior information with which to distinguish the covariates $x_{i}$ for the ordinary logistic regression component from the covariates $z_{i}$ for the structural-zero component and therefore uses the same design matrix in both components of the model. In that case, the conditional probability of $y_{i}$ is expressed as

$$p(y_{i}\mid x_{i},\beta,\gamma)=\left\{F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})\right\}^{y_{i}}\left\{1-F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})\right\}^{1-y_{i}},$$

for $i=1,\ldots,n$. We refer to this model as a shared-design model. The log-likelihood function is

$$L(\beta,\gamma)=\sum_{i=1}^{n}\left[y_{i}\log\left\{F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})\right\}+(1-y_{i})\log\left\{1-F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})\right\}\right],$$

which satisfies the exchange symmetry

$$L(\beta,\gamma)=L(\gamma,\beta).$$

Motivated by this symmetry, we define the equivalence relation

$$(\beta_{1},\gamma_{1})\sim(\beta_{2},\gamma_{2})\quad\Longleftrightarrow\quad(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})\quad\text{or}\quad(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2}),$$

and write $[\beta,\gamma]$ for the corresponding equivalence class.
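The exchange symmetry and the equivalence relation can be checked numerically. The sketch below evaluates the shared-design log-likelihood at $(\beta,\gamma)$ and $(\gamma,\beta)$ and tests membership in the same equivalence class. The function names and the use of `np.allclose` as the equality test are illustrative assumptions.

```python
import numpy as np

def shared_loglik(beta, gamma, X, y):
    """Shared-design log-likelihood L(beta, gamma) with z_i = x_i."""
    log_f = lambda u: -np.logaddexp(0.0, -u)         # log of inverse logit
    log_pi = log_f(X @ beta) + log_f(X @ gamma)      # log{F(g'x) F(b'x)}
    return float(np.sum(y * log_pi + (1 - y) * np.log1p(-np.exp(log_pi))))

def same_class(b1, g1, b2, g2):
    """True if (b1, g1) ~ (b2, g2): equal as pairs, or equal after swapping."""
    identical = np.allclose(b1, b2) and np.allclose(g1, g2)
    swapped = np.allclose(b1, g2) and np.allclose(g1, b2)
    return bool(identical or swapped)
```

Because the two coefficient vectors enter only through the product $F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})$, swapping them leaves every summand unchanged.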

3.1 Prior Work on Identifiability

Diop et al. (2011) established sufficient conditions for identifiability of the zero-inflated logistic regression model. A key condition in their argument is the availability of a continuous covariate that appears in one component but not the other. When the same covariates are used in both components, this source of asymmetry disappears. Therefore, the shared-design setting is a boundary case in which the ordinary guarantee of identifiability fails.

3.2 Identifiability under Shared Design

We formulate identifiability of the shared-design model. Let $x=(1,\tilde{x}_{-0}^{\top})^{\top}$ where $\tilde{x}_{-0}\in\mathbb{R}^{d-1}$. We first consider the following support condition.

  • (C1) The support of $\tilde{x}_{-0}$ contains a nonempty open subset $\mathcal{U}\subset\mathbb{R}^{d-1}$.

Under condition (C1), we obtain the following basic identifiability result.

Proposition 3.1.

Suppose that condition (C1) holds. Let $\beta_{1}=(\beta_{1,0},\beta_{1,-0}^{\top})^{\top}$ and $\gamma_{1}=(\gamma_{1,0},\gamma_{1,-0}^{\top})^{\top}$, and suppose that $\beta_{1,-0}$, $\gamma_{1,-0}$, and $\beta_{1,-0}+\gamma_{1,-0}$ are pairwise distinct. Suppose that for two parameter pairs $(\beta_{1},\gamma_{1})$ and $(\beta_{2},\gamma_{2})$,

$$p\left(y\mid x,\beta_{1},\gamma_{1}\right)=p\left(y\mid x,\beta_{2},\gamma_{2}\right)\tag{1}$$

holds for all $y\in\{0,1\}$ and all $x=(1,\tilde{x}_{-0}^{\top})^{\top}$ with $\tilde{x}_{-0}\in\mathcal{U}$. Then, we have either $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$ or $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$.

We next extend Proposition 3.1 to a mixed support for continuous and discrete covariates. Let $\tilde{x}_{-0}=(\tilde{x}_{-0}^{(1)\top},\tilde{x}_{-0}^{(2)\top})^{\top}$, where $\tilde{x}_{-0}^{(1)}\in\mathbb{R}^{r}$ and $\tilde{x}_{-0}^{(2)}\in\mathbb{R}^{s}$ with $r\geq 1$, $s\geq 0$, and $r+s=d-1$. Here, $\tilde{x}_{-0}^{(1)}$ denotes the subvector of continuously distributed covariates, and $\tilde{x}_{-0}^{(2)}$ denotes the subvector of discretely distributed covariates. We consider the following support condition.

  • (C2) There exists a subset $\mathcal{S}\subset\operatorname{supp}(\tilde{x}_{-0}^{(2)})$ such that $\operatorname{aff}(\mathcal{S})=\mathbb{R}^{s}$, and, for each $\xi\in\mathcal{S}$, the conditional support of $\tilde{x}_{-0}^{(1)}$ given $\tilde{x}_{-0}^{(2)}=\xi$ contains a nonempty open subset $\mathcal{U}_{\xi}\subset\mathbb{R}^{r}$.

When $s=0$, condition (C2) reduces to condition (C1). Let $\beta_{j,-0}=(\beta_{j,-0}^{(1)\top},\beta_{j,-0}^{(2)\top})^{\top}$ and $\gamma_{j,-0}=(\gamma_{j,-0}^{(1)\top},\gamma_{j,-0}^{(2)\top})^{\top}$, corresponding to the decomposition of $\tilde{x}_{-0}$ into $\tilde{x}_{-0}^{(j)}$ for $j=1,2$. Under condition (C2), we obtain the following extended result.

Theorem 3.2.

Suppose that condition (C2) holds. Let $\beta_{1}=(\beta_{1,0},\beta_{1,-0}^{\top})^{\top}$ and $\gamma_{1}=(\gamma_{1,0},\gamma_{1,-0}^{\top})^{\top}$, and suppose that $\beta_{1,-0}^{(1)}$, $\gamma_{1,-0}^{(1)}$, and $\beta_{1,-0}^{(1)}+\gamma_{1,-0}^{(1)}$ are pairwise distinct. Suppose that for two parameter pairs $(\beta_{1},\gamma_{1})$ and $(\beta_{2},\gamma_{2})$, (1) holds for all $y\in\{0,1\}$ and all $x=(1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}$ with $\xi\in\mathcal{S}$ and $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$. Then, we have either $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$ or $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$.

We now establish the following identifiability result.

Corollary 3.3.

Suppose that condition (C2) holds, and suppose that the shared-design model is correctly specified with true parameter pair $(\beta^{\ast},~\gamma^{\ast})$. Let $\beta^{\ast}=(\beta_{0}^{\ast},~\beta_{-0}^{\ast\top})^{\top}$ and $\gamma^{\ast}=(\gamma_{0}^{\ast},~\gamma_{-0}^{\ast\top})^{\top}$, where $\beta_{-0}^{\ast}=(\beta_{-0}^{\ast(1)\top},\beta_{-0}^{\ast(2)\top})^{\top}$ and $\gamma_{-0}^{\ast}=(\gamma_{-0}^{\ast(1)\top},\gamma_{-0}^{\ast(2)\top})^{\top}$, and suppose that $\beta_{-0}^{\ast(1)}$, $\gamma_{-0}^{\ast(1)}$, and $\beta_{-0}^{\ast(1)}+\gamma_{-0}^{\ast(1)}$ are pairwise distinct. Then the expected log-likelihood

$$\mathcal{L}(\beta,~\gamma):=\mathbb{E}_{(x,y)}\left[y\log\left\{F(x^{\top}\beta)F(x^{\top}\gamma)\right\}+(1-y)\log\left\{1-F(x^{\top}\beta)F(x^{\top}\gamma)\right\}\right]$$

is uniquely maximized on $(\mathbb{R}^{d}\times\mathbb{R}^{d})/{\sim}$ at the class $[\beta^{\ast},~\gamma^{\ast}]$.

See Appendix A.4 for the proofs of Proposition 3.1, Theorem 3.2, and Corollary 3.3. Proposition 3.1 gives the basic identifiability result under a fully continuous support, while Theorem 3.2 extends it to a mixed support for continuous and discrete covariates. Corollary 3.3 clarifies the inferential target under the shared-design setting. At the population level, the equivalence class $[\beta,\gamma]$ is uniquely identified, yet the model cannot distinguish which component corresponds to the event occurrence process and which to the structural-zero process. These results establish identifiability over the support of $x$ and do not imply the uniqueness of the likelihood maximizer in finite samples. Consequently, resolving this label ambiguity requires either external information or some relabeling rule.

These results also relate the shared-design model to the identifiability of finite mixture models (Teicher, 1963; Yakowitz and Spragins, 1968). In this context, identifiability is formulated as the uniqueness of the finite-mixture representation. Therefore, under an ordered-component parameterization, this corresponds to uniqueness up to permutation of component labels. In the present model, the relevant symmetry is not a literal permutation of mixture components but the exchange symmetry of the pair of parameter vectors $(\beta,\gamma)$ for both components. In this sense, $[\beta,\gamma]$ is the natural inferential target.

The condition that $\beta_{-0}^{\ast(1)}$, $\gamma_{-0}^{\ast(1)}$, and $\beta_{-0}^{\ast(1)}+\gamma_{-0}^{\ast(1)}$ are pairwise distinct, which is inherited from Theorem 3.2, is a condition on the true parameter vector. It fails only when $\beta_{-0}^{\ast(1)}=\gamma_{-0}^{\ast(1)}$, $\beta_{-0}^{\ast(1)}=0$, or $\gamma_{-0}^{\ast(1)}=0$. It is therefore a generic condition and can be assumed to hold in practice.

3.3 Numerical Confirmation of Bimodality

To illustrate the exchange symmetry established in Theorem 3.2 and Corollary 3.3, we examined the posterior distribution of $(\beta,\gamma)$ under the shared-design setting via a Markov chain Monte Carlo (MCMC) algorithm. We set the number of non-intercept covariates to $4$ ($d=p=5$), and considered three covariate designs: Scenario 1: all $4$ covariates were drawn from independent standard normal distributions; Scenario 2: all $4$ covariates were drawn from independent $\text{Bernoulli}(0.5)$ distributions; Scenario 3: the first two non-intercept covariates were drawn from independent standard normal distributions, and the remaining ones were drawn from independent $\text{Bernoulli}(0.5)$ distributions. The sample size was $2{,}000$. The data-generating mechanism and sampling setup are described in Appendix B. We developed a Pólya-Gamma Gibbs sampler with replica exchange, which is detailed in Appendix C (Polson et al., 2013; Swendsen and Wang, 1986).

Figure 1 displays the principal component analysis (PCA) plots of the posterior samples after $k$-means++ clustering with $k=2$ (Arthur and Vassilvitskii, 2007). Under the continuous and mixed designs (Scenarios 1 and 3), the posterior exhibited clear bimodality, with two well-separated clusters. Under the binary design (Scenario 2), in which all non-intercept covariates take values in the bounded discrete set $\{0,1\}$, the two clusters were not clearly distinguished in the PCA plots. This result is consistent with the failure of the binary design to satisfy condition (C2), so that identifiability is not guaranteed by Corollary 3.3. Posterior means of each cluster and further numerical details are provided in Appendices B and D.

Figure 1: PCA plots of posterior samples after $k$-means++ clustering ($k=2$). Each panel corresponds to a different covariate scenario: (a) Scenario 1 (continuous): all 4 non-intercept covariates were drawn from independent standard normal distributions; (b) Scenario 2 (binary): all 4 non-intercept covariates were drawn from independent $\text{Bernoulli}(0.5)$ distributions; (c) Scenario 3 (mixed): the first two non-intercept covariates were drawn from independent standard normal distributions, and the remaining ones were drawn from independent $\text{Bernoulli}(0.5)$ distributions.

4 Relabeling Rule

Section 3 shows that, under the shared-design setting, the likelihood identifies only the equivalence class $[\beta,\gamma]$. In many applications, however, investigators still need a single ordered pair because subsequent interpretation is conducted in terms of the associations with event occurrence or with zero-inflation, such as zeros due to misclassification. For that practical purpose, we introduce a simple relabeling rule.

4.1 Proposed Rule

The proposed rule proceeds as follows.

  1. Fit a standard logistic regression of $y$ on $x$ and denote the resulting coefficient vector by $\hat{\beta}_{\mathrm{LR}}$.

  2. Fit the zero-inflated logistic regression model with shared design and obtain one ordered maximizer $(\hat{\beta},\hat{\gamma})$.

  3. Form the exchange-symmetric solution $(\hat{\gamma},\hat{\beta})$.

  4. Choose the pair whose first component vector is closer to $\hat{\beta}_{\mathrm{LR}}$; that is, choose

     $$(\hat{\beta}^{\dagger},\hat{\gamma}^{\dagger}):=\operatorname*{arg\,min}_{(\beta,\gamma)\in\{(\hat{\beta},\hat{\gamma}),(\hat{\gamma},\hat{\beta})\}}\|\beta-\hat{\beta}_{\mathrm{LR}}\|_{2}.$$

Note that Theorem 2.1 shows that ordinary logistic regression can be biased under misspecification, to the point of reversing the sign of a coefficient. Therefore, while $\hat{\beta}_{\mathrm{LR}}$ serves as a convenient reference, it should not be interpreted as a definitive benchmark for the ordinary logistic regression component of the zero-inflated model. Whenever prior knowledge is available to distinguish the two components, that information should take precedence over this rule.
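The selection step of the rule (steps 3 and 4) reduces to comparing two Euclidean distances. A minimal sketch follows, assuming the three fitted vectors are already available as arrays. The function name `relabel` and the tie-breaking choice (keep the original order when the distances are equal) are assumptions, since the rule as stated does not address ties.

```python
import numpy as np

def relabel(beta_hat, gamma_hat, beta_lr):
    """Return the ordered pair whose first vector is closer to beta_lr.

    beta_hat, gamma_hat : one ordered maximizer of the shared-design model
    beta_lr             : coefficients from the ordinary logistic regression
    Ties keep the original order (an assumption, not part of the rule).
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    beta_lr = np.asarray(beta_lr, dtype=float)
    d_keep = np.linalg.norm(beta_hat - beta_lr)    # ||beta_hat - beta_LR||_2
    d_swap = np.linalg.norm(gamma_hat - beta_lr)   # ||gamma_hat - beta_LR||_2
    if d_swap < d_keep:
        return gamma_hat, beta_hat                 # exchange-symmetric solution
    return beta_hat, gamma_hat
```

Applied to posterior draws instead of a single maximizer, the same comparison could be made draw by draw, although that use is not described in this section.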

4.2 Simulation Study

We next investigate the behavior of the proposed relabeling rule through a simulation study.

Simulation Design.

We considered four scenarios that differ only in the intercept of the structural-zero component and, hence, in the probability of zero inflation. The coefficient vector for the ordinary logistic regression component was fixed at

$$\beta^{*}=(0.5,~1.0,~0.5,~0.5,~0.25)^{\top},$$

and the structural-zero coefficient was

$$\gamma^{*}=(\gamma_{0,\mathrm{int}},~-1.0,~-1.0,~0.5,~0.5)^{\top}.$$

We used four values of $\gamma_{0,\mathrm{int}}$. Specifically, we considered: (i) Very Low Mislabel ($\gamma_{0,\mathrm{int}}=4.3$), yielding approximately $3.8\%$ structural zeros; (ii) Low Mislabel ($\gamma_{0,\mathrm{int}}=3.0$), yielding approximately $10.3\%$ structural zeros; (iii) Moderate Mislabel ($\gamma_{0,\mathrm{int}}=1.7$), yielding approximately $23.4\%$ structural zeros; and (iv) High Mislabel ($\gamma_{0,\mathrm{int}}=1.0$), yielding approximately $33.4\%$ structural zeros. These scenarios were chosen to span the range of zero-inflation levels commonly encountered in applications, from nearly negligible to substantial proportions of structural zeros.

For each scenario, we generated $n=1{,}000$ observations with $d=5$ covariates including an intercept. The first element of $x_{i}$ was $1$, and the remaining elements were drawn independently from the standard normal distribution. Responses were generated from the zero-inflated logistic regression model as follows:

$$\begin{aligned}
h_{i} &\sim \operatorname{Bernoulli}\left(F(\gamma^{*\top}x_{i})\right),\\
y_{i}^{\ast} &\sim \operatorname{Bernoulli}\left(F(\beta^{*\top}x_{i})\right),\\
y_{i} &= h_{i}y_{i}^{\ast}.
\end{aligned}$$
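The data-generating mechanism above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function `simulate_zilr`, the seed, and the choice of the High Mislabel intercept are assumptions made for the example, not the authors' simulation code.

```python
import math
import random


def logistic(u):
    """Numerically stable logistic function F(u)."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def simulate_zilr(n, beta, gamma, rng):
    """Draw (x_i, y_i) with y_i = h_i * y_i^* and a shared design x_i."""
    data = []
    for _ in range(n):
        # Intercept followed by independent standard normal covariates.
        x = [1.0] + [rng.gauss(0.0, 1.0) for _ in range(len(beta) - 1)]
        eta_beta = sum(b * v for b, v in zip(beta, x))
        eta_gamma = sum(g * v for g, v in zip(gamma, x))
        h = 1 if rng.random() < logistic(eta_gamma) else 0      # h = 0 is a structural zero
        y_star = 1 if rng.random() < logistic(eta_beta) else 0  # latent binary response
        data.append((x, h * y_star))
    return data


rng = random.Random(0)  # illustrative seed
beta_true = [0.5, 1.0, 0.5, 0.5, 0.25]
gamma_true = [1.0, -1.0, -1.0, 0.5, 0.5]  # High Mislabel scenario
sample = simulate_zilr(1000, beta_true, gamma_true, rng)
share_ones = sum(y for _, y in sample) / len(sample)
print(f"observed share of y = 1: {share_ones:.3f}")
```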

We compared three estimation approaches: (i) Proposed approach: the zero-inflated model with the proposed relabeling rule in Section 4, (ii) The standard logistic regression approach: the ordinary logistic regression model ignoring zero inflation, and (iii) Naive zero-inflated model approach: the zero-inflated model without relabeling, retaining the first local maximizer returned by the optimization algorithm.

All methods used the same optimizer and settings: the L-BFGS-B algorithm with analytical gradients, a maximum of $1{,}000$ iterations, and random initialization from $\mathcal{N}(0,0.01I)$. An estimate was classified as unreasonable if at least one component exceeded ten times the absolute value of its true value. Each scenario was replicated $10{,}000$ times.
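To make the phrase "analytical gradients" concrete, the following sketch writes the negative log-likelihood of the shared-design model, with $\mathbb{P}(y_{i}=1\mid x_{i})=F(\beta^{\top}x_{i})F(\gamma^{\top}x_{i})$, together with its gradient, and checks the gradient against central finite differences. The function name `nll_and_grad`, the toy data, and the reading of $0.01I$ as a covariance matrix (standard deviation $0.1$) are assumptions for illustration. A pair of this form could be handed to an L-BFGS-B routine, for example `scipy.optimize.minimize` with `method="L-BFGS-B"` and `jac=True`, but the optimizer call itself is not shown here.

```python
import math
import random


def logistic(u):
    """Numerically stable logistic function F(u)."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def nll_and_grad(theta, xs, ys):
    """Negative log-likelihood and gradient; theta stacks (beta, gamma)."""
    d = len(theta) // 2
    beta, gamma = theta[:d], theta[d:]
    nll = 0.0
    grad = [0.0] * (2 * d)
    for x, y in zip(xs, ys):
        b = logistic(sum(c * v for c, v in zip(beta, x)))   # F(beta^T x)
        a = logistic(sum(c * v for c, v in zip(gamma, x)))  # F(gamma^T x)
        p = a * b
        nll -= math.log(p) if y == 1 else math.log1p(-p)
        w = (y - p) / (1.0 - p)  # common factor of the score
        for k in range(d):
            grad[k] -= w * (1.0 - b) * x[k]
            grad[d + k] -= w * (1.0 - a) * x[k]
    return nll, grad


rng = random.Random(1)  # illustrative seed and toy data
xs = [[1.0, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(50)]
ys = [rng.randint(0, 1) for _ in range(50)]
theta = [rng.gauss(0.0, 0.1) for _ in range(6)]  # sd 0.1, i.e. variance 0.01

_, grad = nll_and_grad(theta, xs, ys)
eps = 1e-6
max_err = 0.0
for k in range(len(theta)):
    up, down = theta[:], theta[:]
    up[k] += eps
    down[k] -= eps
    fd = (nll_and_grad(up, xs, ys)[0] - nll_and_grad(down, xs, ys)[0]) / (2 * eps)
    max_err = max(max_err, abs(fd - grad[k]))
print(f"largest gradient discrepancy: {max_err:.2e}")
```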

Results.

Tables 1–3 summarize the results, and Figures 2 and 3 show boxplots of the observed bias. Several patterns were clearly observed. First, for the parameter $\beta$, the bias of the estimates from the proposed approach remained more concentrated around zero than that of the estimates from the naive approach, whereas the estimates from the standard logistic regression approach exhibited a systematic negative shift that became larger as the zero-inflation proportion increased. Second, the proposed approach was more stable than the naive approach. Across the low-to-moderate zero-inflation scenarios, the biases of the relabeled estimates were smaller than those of the standard logistic regression, while their standard deviations remained moderate. For the parameter $\gamma$, the proposed approach also dominated the naive approach in most scenarios, especially for the parameters other than the intercept. A weakness appeared when structural zeros were very rare, probably because the intercept of the structural-zero component was weakly identified and highly variable. Third, the naive approach behaved as expected for a non-identifiable ordered parameterization: its estimates were obtained from both symmetric solutions. The concentration of the relabeled estimates around the true parameter values, in contrast to the spread of the naive estimates, suggests that the proposed relabeling rule was effective at selecting an appropriate representative from each equivalence class.

Table 3 shows that the proposed relabeling rule improved practical reliability. The proposed method produced reasonable estimates in more than $99\%$ of replications in all but the Very Low Mislabel scenario and consistently outperformed the naive zero-inflated model approach in terms of reasonable solutions.

Table 1: Simulation results based on $10{,}000$ replications: estimates of the parameters for the ordinary logistic regression component ($\beta$). Entries are Bias (SD).

| Scenario | Method | $\beta_{0}$ | $\beta_{1}$ | $\beta_{2}$ | $\beta_{3}$ | $\beta_{4}$ |
|---|---|---|---|---|---|---|
| Very Low Mislabel | Proposed | 0.102 (0.234) | -0.007 (0.231) | -0.003 (0.190) | -0.004 (0.114) | -0.002 (0.115) |
| Very Low Mislabel | Standard LR | -0.170 (0.071) | -0.217 (0.083) | -0.166 (0.076) | 0.005 (0.076) | 0.032 (0.072) |
| Very Low Mislabel | Naive ZILR | 1.100 (1.449) | -0.521 (0.861) | -0.395 (0.674) | -0.004 (0.243) | 0.064 (0.253) |
| Low Mislabel | Proposed | 0.055 (0.253) | -0.019 (0.274) | -0.013 (0.218) | -0.001 (0.109) | 0.001 (0.109) |
| Low Mislabel | Standard LR | -0.409 (0.068) | -0.455 (0.075) | -0.346 (0.069) | 0.011 (0.073) | 0.064 (0.069) |
| Low Mislabel | Naive ZILR | 1.175 (1.384) | -0.882 (1.059) | -0.667 (0.810) | 0.010 (0.184) | 0.119 (0.231) |
| Moderate Mislabel | Proposed | -0.005 (0.289) | -0.267 (0.592) | -0.207 (0.457) | 0.014 (0.121) | 0.043 (0.142) |
| Moderate Mislabel | Standard LR | -0.815 (0.068) | -0.724 (0.069) | -0.559 (0.068) | 0.032 (0.072) | 0.115 (0.070) |
| Moderate Mislabel | Naive ZILR | 0.751 (0.953) | -1.015 (1.127) | -0.767 (0.862) | 0.010 (0.167) | 0.135 (0.225) |
| High Mislabel | Proposed | -0.195 (0.284) | -0.648 (0.780) | -0.506 (0.597) | 0.035 (0.132) | 0.105 (0.162) |
| High Mislabel | Standard LR | -1.121 (0.070) | -0.860 (0.068) | -0.670 (0.069) | 0.050 (0.074) | 0.145 (0.072) |
| High Mislabel | Naive ZILR | 0.416 (0.802) | -1.021 (1.146) | -0.771 (0.867) | 0.007 (0.181) | 0.133 (0.225) |

Table 2: Simulation results based on $10{,}000$ replications: estimates of the parameters for the structural-zero component ($\gamma$). Entries are Bias (SD).

| Scenario | Method | $\gamma_{0}$ | $\gamma_{1}$ | $\gamma_{2}$ | $\gamma_{3}$ | $\gamma_{4}$ |
|---|---|---|---|---|---|---|
| Very Low Mislabel | Proposed | 1.144 (3.760) | -0.102 (1.478) | -0.140 (1.297) | 0.092 (0.674) | 0.086 (0.692) |
| Very Low Mislabel | Naive ZILR | -0.579 (3.822) | 0.677 (1.581) | 0.471 (1.324) | 0.060 (0.553) | -0.042 (0.579) |
| Low Mislabel | Proposed | 0.548 (1.899) | -0.118 (0.883) | -0.109 (0.733) | 0.047 (0.339) | 0.042 (0.348) |
| Low Mislabel | Naive ZILR | -0.802 (2.053) | 0.852 (1.262) | 0.632 (0.977) | 0.028 (0.270) | -0.094 (0.309) |
| Moderate Mislabel | Proposed | 0.345 (0.957) | 0.223 (1.070) | 0.160 (0.834) | 0.007 (0.215) | -0.018 (0.268) |
| Moderate Mislabel | Naive ZILR | -0.432 (1.057) | 0.981 (1.150) | 0.728 (0.882) | 0.011 (0.176) | -0.113 (0.228) |
| High Mislabel | Proposed | 0.553 (0.789) | 0.629 (1.350) | 0.484 (1.030) | -0.021 (0.219) | -0.092 (0.278) |
| High Mislabel | Naive ZILR | -0.071 (0.859) | 1.005 (1.163) | 0.752 (0.882) | 0.006 (0.181) | -0.120 (0.228) |
Figure 2: Boxplots of parameter bias (estimate minus true value) for the $\beta$ parameters across four scenarios based on $10{,}000$ simulation replications. Each panel represents a different scenario, with three estimation methods compared within panels. The true parameter values were set as $\beta_{0}=(0.5,~1.0,~0.5,~0.5,~0.25)^{\top}$.
Figure 3: Boxplots of parameter bias (estimate minus true value) for the $\gamma$ parameters across four scenarios based on $10{,}000$ simulation replications. Each panel represents a different scenario, with three estimation methods compared within panels. The true parameter values were set as $\gamma_{0}=(\gamma_{0,\mathrm{int}},~-1.0,~-1.0,~0.5,~0.5)^{\top}$, where $\gamma_{0,\mathrm{int}}=4.3,~3.0,~1.7,~1.0$ for each scenario.

Table 3: Convergence diagnostics across simulation scenarios based on $10{,}000$ replications. "Converged" indicates successful optimization with reasonable parameter estimates, "Unreasonable" denotes cases where at least one parameter estimate exceeded ten times its true value, and "Ratio" shows the percentage of successful convergence.

| Scenario | Method | Converged | Unreasonable | Total | Ratio (%) |
|---|---|---|---|---|---|
| Very Low Mislabel | Proposed | 9,156 | 844 | 10,000 | 91.6 |
| Very Low Mislabel | Standard LR | 10,000 | 0 | 10,000 | 100.0 |
| Very Low Mislabel | Naive ZILR | 7,249 | 2,751 | 10,000 | 72.5 |
| Low Mislabel | Proposed | 9,936 | 64 | 10,000 | 99.4 |
| Low Mislabel | Standard LR | 10,000 | 0 | 10,000 | 100.0 |
| Low Mislabel | Naive ZILR | 9,232 | 768 | 10,000 | 92.3 |
| Moderate Mislabel | Proposed | 9,985 | 15 | 10,000 | 99.9 |
| Moderate Mislabel | Standard LR | 10,000 | 0 | 10,000 | 100.0 |
| Moderate Mislabel | Naive ZILR | 9,936 | 64 | 10,000 | 99.4 |
| High Mislabel | Proposed | 9,985 | 15 | 10,000 | 99.9 |
| High Mislabel | Standard LR | 10,000 | 0 | 10,000 | 100.0 |
| High Mislabel | Naive ZILR | 9,955 | 43 | 10,000 | 99.6 |

5 Application to Actual Data

We illustrate the performance of the zero-inflated logistic model with shared design through an application to public data. We note that the data analysis in this section is intended only as a methodological illustration and not for clinical interpretation.

5.1 Dataset

We used the National Health and Nutrition Examination Survey (NHANES), the 2017–2018 public release (Centers for Disease Control and Prevention (CDC), 2017). The outcome $y_{i}$ for individual ID (SEQN) $i$ was self-reported diabetes status, constructed from the diabetes questionnaire as $y_{i}=1$ for respondents who reported a prior diabetes diagnosis ($\text{DIQ010}=1$) and $y_{i}=0$ otherwise ($\text{DIQ010}=2$). As covariates, we used insurance coverage (HIQ011), usual source of care (HUQ030), age (RIDAGEYR), body mass index (bmi; BMXBMI), and sex (RIAGENDR). To motivate a zero-inflated model, we compared self-reported status with an HbA1c-based variable. Specifically, we defined $d_{\mathrm{A1c},i}=1$ when HbA1c was at least $6.5\%$ ($\text{LBXGH}\geq 6.5$) and $d_{\mathrm{A1c},i}=0$ otherwise ($\text{LBXGH}<6.5$). We used samples with non-zero 2-year sample weights ($\text{WTMEC2YR}>0$). Table 4 summarizes the sample sizes. Among respondents with $d_{\mathrm{A1c},i}=1$ and non-missing self-report, the proportion with $y_{i}=0$ was about $21.4\%$. Although HbA1c is not a gold standard for diagnosis, these descriptive proportions suggest that self-reported diabetes can contain a non-negligible proportion of undiagnosed cases.

Table 4: Sample size summaries of NHANES data used to illustrate zero-inflated logistic regression. The last column reports the proportion of self-reported non-cases ($y_{i}=0$) among respondents with HbA1c $\geq 6.5\%$ ($d_{\mathrm{A1c},i}=1$) and observed self-reported diabetes status.

| Period | Interviewed participants | Sample weights $>0$ | HbA1c was observed | Self report was observed | Ratio of $y_{i}=0$ given $d_{\mathrm{A1c},i}=1$ |
|---|---|---|---|---|---|
| 2017–2018 | 9,254 | 8,704 | 6,045 | 8,709 | 0.214 |

5.2 Model Settings

We specified a shared-design model. Specifically, the covariates of the ordinary logistic regression component and the structural-zero component were set equal, with covariates

$$(1,~\text{insured},~\text{usualcare},~\text{age},~\text{bmi},~\text{female}),$$

where age and bmi were standardized. The model was estimated on the complete-case subset with $711$ samples. Because the aim of this section is methodological illustration, we did not incorporate survey weights.

5.3 Results

We obtained one ordered solution using the BFGS algorithm from a random initial value and then obtained the second by exchanging the two coefficient vectors. Table 5 shows the two solutions, A and B, denoted by $(\hat{\beta}_{\mathrm{A}},\hat{\gamma}_{\mathrm{A}})$ and $(\hat{\beta}_{\mathrm{B}},\hat{\gamma}_{\mathrm{B}})$, respectively. The resulting negative log-likelihood values were identical up to numerical precision: both evaluated to $1918.023$. Moreover, the exchange symmetry held to within numerical precision: $\max_{j\in\{1,\ldots,6\}}|\hat{\beta}_{\mathrm{A},j}-\hat{\gamma}_{\mathrm{B},j}|=2.98\times 10^{-7}$ and $\max_{j\in\{1,\ldots,6\}}|\hat{\gamma}_{\mathrm{A},j}-\hat{\beta}_{\mathrm{B},j}|=7.40\times 10^{-7}$, where $\hat{\beta}_{\mathrm{A},j}$ denotes the $j$th element of $\hat{\beta}_{\mathrm{A}}$.
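The equality of the two negative log-likelihood values is a direct consequence of the product form $F(\beta^{\top}x)F(\gamma^{\top}x)$ under a shared design. The following sketch illustrates this on synthetic data; the data, seed, and coefficient values are arbitrary assumptions and are unrelated to the NHANES analysis.

```python
import math
import random


def logistic(u):
    """Numerically stable logistic function F(u)."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def neg_log_lik(beta, gamma, xs, ys):
    """Negative log-likelihood with P(y = 1 | x) = F(beta^T x) F(gamma^T x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p_beta = logistic(sum(b * v for b, v in zip(beta, x)))
        p_gamma = logistic(sum(g * v for g, v in zip(gamma, x)))
        p = p_beta * p_gamma
        total -= math.log(p) if y == 1 else math.log1p(-p)
    return total


rng = random.Random(2)  # illustrative seed
xs = [[1.0, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
ys = [rng.randint(0, 1) for _ in range(200)]
beta = [0.4, -1.1, 0.7]   # arbitrary coefficient vectors
gamma = [1.5, 0.3, -0.6]

original = neg_log_lik(beta, gamma, xs, ys)
swapped = neg_log_lik(gamma, beta, xs, ys)  # exchange the two vectors
gap = abs(original - swapped)
print(f"NLL difference after swapping: {gap:.3e}")
```

Because the two factors enter the likelihood only through their product, any optimizer that returns one ordered solution implicitly yields the other.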

We applied the relabeling rule from Section 4. Solution B was selected because $\|\hat{\beta}_{\mathrm{B}}-\hat{\beta}_{\mathrm{LR}}\|_{2}^{2}=1.102$ was smaller than $\|\hat{\beta}_{\mathrm{A}}-\hat{\beta}_{\mathrm{LR}}\|_{2}^{2}=29.771$, where $\hat{\beta}_{\mathrm{LR}}$ denotes the estimate from the ordinary logistic regression, shown in Table 5.

Table 5: Two solutions $(\hat{\beta}_{\mathrm{A}},\hat{\gamma}_{\mathrm{A}})$ and $(\hat{\beta}_{\mathrm{B}},\hat{\gamma}_{\mathrm{B}})$, and the estimate from the ordinary logistic regression $\hat{\beta}_{\mathrm{LR}}$ for the NHANES illustration.

| Term | $\hat{\beta}_{\mathrm{A}}$ | $\hat{\gamma}_{\mathrm{A}}$ | $\hat{\beta}_{\mathrm{B}}$ | $\hat{\gamma}_{\mathrm{B}}$ | $\hat{\beta}_{\mathrm{LR}}$ |
|---|---|---|---|---|---|
| Intercept | 1.250 | -3.192 | -3.192 | 1.250 | -3.444 |
| insured | -0.413 | 0.138 | 0.138 | -0.413 | -0.111 |
| usualcare | 1.085 | 0.179 | 0.179 | 1.085 | 0.627 |
| age | -1.251 | 2.300 | 2.300 | -1.251 | 1.474 |
| bmi | 0.647 | 0.371 | 0.371 | 0.647 | 0.590 |
| female | -0.500 | -0.223 | -0.223 | -0.500 | -0.431 |
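The selection of solution B can be retraced from the rounded entries of Table 5. The sketch below assumes the rule as it is applied in this section, namely that the vector closer to $\hat{\beta}_{\mathrm{LR}}$ in squared Euclidean distance is labeled $\beta$; the helper names `squared_distance` and `relabel` are illustrative.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two coefficient vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def relabel(first, second, beta_lr):
    """Return (beta, gamma), taking as beta the vector closer to beta_lr."""
    if squared_distance(first, beta_lr) <= squared_distance(second, beta_lr):
        return first, second
    return second, first


# Rounded values from Table 5, in the order
# (Intercept, insured, usualcare, age, bmi, female).
beta_a = [1.250, -0.413, 1.085, -1.251, 0.647, -0.500]
gamma_a = [-3.192, 0.138, 0.179, 2.300, 0.371, -0.223]
beta_lr = [-3.444, -0.111, 0.627, 1.474, 0.590, -0.431]

dist_a = squared_distance(beta_a, beta_lr)
dist_b = squared_distance(gamma_a, beta_lr)  # beta_B coincides with gamma_A
beta_hat, gamma_hat = relabel(beta_a, gamma_a, beta_lr)
print(f"distance for A: {dist_a:.3f}, distance for B: {dist_b:.3f}")
```

Because the table entries are rounded to three decimals, the recomputed distances agree with the reported $29.771$ and $1.102$ only up to rounding error.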

6 Discussion

This study investigates the zero-inflated logistic regression model with shared design, in terms of a sign-flip phenomenon under misspecification, the existence of maximum likelihood estimates, identifiability of the regression parameters, computational methods for implementation, and a practical relabeling rule. The primary theoretical message is that non-identifiability in the shared-design setting is not unstructured. Under mild regularity conditions, the non-identifiability is reduced to the exchange symmetry of the two coefficient vectors. By considering the quotient space with respect to this symmetry, the expected log-likelihood has a unique maximizer. This result is useful for understanding the inherent inferential limits of the model.

A second contribution is the analysis of the existence of maximum likelihood estimates. The concepts of double separation and $\varepsilon$–double–non-separation introduced in this study extend the classical separation conditions for the ordinary logistic regression model. While these conditions do not provide a complete characterization, they offer tractable sufficient conditions for both the existence and non-existence of estimates. In particular, the results on non-existence explain why optimization algorithms may fail to converge even before considering the exchange symmetry of the regression parameters. Furthermore, as shown in Theorem 2.1, model misspecification can lead to a sign flip in the regression coefficients relative to their true values. This provides a formal warning against analyses that ignore zero-inflation structures.

Numerical results based on posterior sampling support the theoretical findings regarding identifiability. Bimodality was clearly observed in posterior distributions under the continuous and mixed designs. However, the binary design did not exhibit clear mode separation and showed increased numerical instability, probably because the design fails to satisfy condition (C2) and thereby lacks guaranteed identifiability. These results provide a practical guideline: analysts should exercise caution when interpreting estimates if all covariate values are restricted to a small number of support points in the covariate space. In terms of the sampling algorithm, because standard single-chain Gibbs samplers may become trapped in one of the modes, we employed the replica exchange method for efficient exploration of the parameter space.

The relabeling rule proposed in Section 4 serves as a heuristic for interpretation rather than a new source of identification. Its role is to provide a reproducible ordering rule when an ordered pair is required for subsequent interpretation or comparison with the results from the ordinary logistic regression. If external information is available to distinguish the ordinary logistic regression component from the structural-zero component, such information should take precedence over the proposed rule.

This study has several limitations. First, our theoretical results provide sufficient conditions for the existence and non-existence of the maximum likelihood estimate, rather than a complete characterization. Second, regarding the relabeling rule, it remains to be investigated whether alternative rules can improve performance when the reference ordinary logistic regression itself suffers from severe bias. Third, as the asymptotic theory for the model on the quotient space remains to be established, we do not perform formal statistical inference. These limitations suggest several directions for future research: a more refined characterization of the existence of the estimates, the formalization of asymptotic theory for parameters defined on the quotient space, and the development of relabeling rules that provide a reliable choice even when the reference logistic regression suffers from severe bias.

A further direction is to examine whether the theoretical results extend to other link functions for binary regression, such as the probit or the complementary log-log function. Theorem 3.2 exploits the specific logistic form $F(\mu)=\exp(\mu)/\{1+\exp(\mu)\}$, which reduces the identifiability condition to an equality of sums of exponential terms with distinct exponents. While such a representation is unavailable for other general link functions, the principle that the equality $F(\beta_{1}^{\top}x)F(\gamma_{1}^{\top}x)=F(\beta_{2}^{\top}x)F(\gamma_{2}^{\top}x)$ over an open set restricts the parameter pairs to an exchange-symmetric set may hold for a broader class of functions. Specifically, for real analytic link functions, an exchange-symmetry result may be established via the identity theorem, provided that the analytic form of the product function ensures that local equality implies global equivalence. Furthermore, because the results on the existence of the maximum likelihood estimate and the sign flip phenomenon under misspecification depend primarily on the monotonicity and boundedness of $F(\cdot)$ and the structure of the log-likelihood function, these properties are expected to hold across other link functions, probably with appropriate modifications to reflect specific characteristics of link functions, such as the asymmetric behavior of the complementary log-log function.

7 Concluding Remarks

When the same covariates are used for both components of the zero-inflated logistic regression model, the identifiability results established in the existing literature do not hold. We establish that this non-identifiability has a specific structure. Namely, under mild regularity conditions, it reduces to exchange symmetry, and the expected log-likelihood has a unique maximizer on the resulting quotient space. In addition, we introduce sufficient conditions for the existence and non-existence of the maximum likelihood estimate, demonstrate posterior bimodality through numerical experiments, and propose a simple relabeling rule for applications. We also establish a sign flip phenomenon under misspecification. These theoretical and numerical results provide practical guidance for applying the zero-inflated logistic regression model.

Funding

This research was supported by AMED under Grant Number JP223fa627001 (UTOPIA AI Research Discovery Program) and JSPS KAKENHI Grant Number 26K02664.

Data Availability

The NHANES data used in the application are available from the website of the Centers for Disease Control and Prevention: https://wwwn.cdc.gov/nchs/nhanes/.

Code Availability

The Python scripts for numerical studies and the R script for actual data application in this manuscript are available from the GitHub repository: https://github.com/t-yui/zero-inflated-logistic-shared-design.

Declaration of the Use of Generative AI and AI-assisted Technologies

The authors used ChatGPT (OpenAI), Claude (Anthropic) and Gemini (Google) to assist with developing scripts for simulation and application, and editing the English language during the preparation of this manuscript. The authors checked and edited the content and take full responsibility for this manuscript.

Appendix A Proofs

A.1 Proof of Theorem 2.1

We state the following regularity conditions.

Assumption A.1.

We assume that

  (A1) $\mathbb{E}[\|x\|^{2}]<\infty$.

  (A2) $\mathbb{P}(\tilde{x}_{j}>0)>0$ and $\mathbb{P}(\tilde{x}_{j}<0)>0$.

  (A3) $\mathbb{E}[\tilde{x}_{j}\mid\tilde{x}_{-j},\tilde{z}_{-j}]=0$ almost surely.

  (A4) For the fixed value of $c<0$ and every $t\in\mathbb{R}$, the map $(\theta_{0},\theta_{-j})\mapsto\mathcal{L}_{c}(\theta_{0},t,\theta_{-j})$ has a unique maximizer, and the profile objective function $g_{c}$ has at least one maximizer on $\mathbb{R}$.

For the fixed value of $c<0$ and each $t\in\mathbb{R}$, let $(\theta_{0}^{*}(t),\theta_{-j}^{*}(t))$ denote the maximizer in the definition of $g_{c}(t)$, and let

$$\begin{aligned}
\eta(t,x) &:= \theta_{0}^{*}(t)+t\tilde{x}_{j}+\theta_{-j}^{*}(t)^{\top}\tilde{x}_{-j},\\
\mu(t,x) &:= F(\eta(t,x)).
\end{aligned}$$

We first establish basic properties of the profile objective function gcg_{c}.

Lemma A.1.

Suppose that Assumption A.1 holds. Then, we have

  (i) The derivative of $g_{c}$ is $g_{c}^{\prime}(t)=\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(t,x))\,\tilde{x}_{j}\right]$.

  (ii) The function $g_{c}$ is concave with respect to $t$.

Proof of Lemma A.1.

We begin with (i). For fixed $(x,z)$, let

$$\ell(\eta) := \pi_{c}(x,z)\log F(\eta)+\{1-\pi_{c}(x,z)\}\log\{1-F(\eta)\}.$$

Using $F^{\prime}(\eta)=F(\eta)\{1-F(\eta)\}$, we have

$$\frac{\partial}{\partial\eta}\ell(\eta)=\pi_{c}(x,z)-F(\eta).$$

Since $\partial\eta/\partial t=\tilde{x}_{j}$, we obtain

$$\frac{\partial}{\partial t}\mathcal{L}_{c}(\theta_{0},t,\theta_{-j})=\mathbb{E}_{(x,z)}\left[\frac{\partial\ell(\eta)}{\partial\eta}\,\frac{\partial\eta}{\partial t}\right]=\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-F(\eta))\,\tilde{x}_{j}\right].$$

Moreover, since $0\leq\pi_{c}(x,z)\leq 1$ and $0\leq F(\eta)\leq 1$, we have

$$\left|\frac{\partial}{\partial t}\ell(\eta)\right|\leq|\tilde{x}_{j}|.$$

By Assumption A.1(A1), we have $\mathbb{E}[|\tilde{x}_{j}|]\leq\{\mathbb{E}[\tilde{x}_{j}^{2}]\}^{1/2}<\infty$, which justifies differentiating under the expectation. Therefore, by the uniqueness of the maximizer from Assumption A.1(A4), using Danskin's theorem, we obtain

$$\begin{aligned}
g_{c}^{\prime}(t) &= \frac{\partial}{\partial t}\mathcal{L}_{c}(\theta_{0},t,\theta_{-j})\Big|_{(\theta_{0},\theta_{-j})=(\theta^{*}_{0}(t),\theta^{*}_{-j}(t))}\\
&= \mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(t,x))\tilde{x}_{j}\right].
\end{aligned}$$

We then prove (ii). Fix $t_{1},~t_{2}\in\mathbb{R}$ and $\lambda\in[0,1]$. Let $(\theta^{*}_{0,r},\theta^{*}_{-j,r})\in\mathbb{R}\times\mathbb{R}^{d-2}$ be a maximizer of $\mathcal{L}_{c}(\cdot,t_{r},\cdot)$ for $r=1,~2$. Using the concavity of $\mathcal{L}_{c}$, we have

$$\begin{aligned}
g_{c}(\lambda t_{1}+(1-\lambda)t_{2})
&= \sup_{(\theta_{0},\theta_{-j})\in\mathbb{R}\times\mathbb{R}^{d-2}}\mathcal{L}_{c}(\theta_{0},\lambda t_{1}+(1-\lambda)t_{2},\theta_{-j})\\
&\geq \mathcal{L}_{c}(\lambda\theta^{*}_{0,1}+(1-\lambda)\theta^{*}_{0,2},~\lambda t_{1}+(1-\lambda)t_{2},~\lambda\theta^{*}_{-j,1}+(1-\lambda)\theta^{*}_{-j,2})\\
&\geq \lambda\mathcal{L}_{c}(\theta^{*}_{0,1},~t_{1},~\theta^{*}_{-j,1})+(1-\lambda)\mathcal{L}_{c}(\theta^{*}_{0,2},~t_{2},~\theta^{*}_{-j,2})\\
&= \lambda g_{c}(t_{1})+(1-\lambda)g_{c}(t_{2}).
\end{aligned}$$

Thus $g_{c}(t)$ is concave. ∎

The next lemma simplifies the derivative at t=0t=0.

Lemma A.2.

Suppose that Assumption A.1 holds. Then, we have

$$\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(0,x))\tilde{x}_{j}\right]=\mathbb{E}_{(x,z)}\left[\pi_{c}(x,z)\tilde{x}_{j}\right].$$

Proof of Lemma A.2.

At $t=0$, we have $\mu(0,x)=F(\theta^{*}_{0}(0)+\theta^{*}_{-j}(0)^{\top}\tilde{x}_{-j})=:\tilde{\mu}(\tilde{x}_{-j})$, which depends on $x$ only through $\tilde{x}_{-j}$. Thus, using (A3),

$$\mathbb{E}_{(x,z)}[\mu(0,x)\tilde{x}_{j}]=\mathbb{E}_{(\tilde{x}_{-j},\tilde{z}_{-j})}\left[\tilde{\mu}(\tilde{x}_{-j})\,\mathbb{E}[\tilde{x}_{j}\mid\tilde{x}_{-j},\tilde{z}_{-j}]\right]=0.$$

Therefore, we have

$$\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(0,x))\tilde{x}_{j}\right]=\mathbb{E}_{(x,z)}\left[\pi_{c}(x,z)\tilde{x}_{j}\right].$$

∎

Finally, we investigate the behavior of $f(c):=\mathbb{E}_{(x,z)}[\pi_{c}(x,z)\tilde{x}_{j}]$.

Lemma A.3.

Suppose that Assumption A.1 holds. Then, we have

  (i) For every $c<0$, $f^{\prime}(c)=\mathbb{E}_{(x,z)}[F(\beta^{\top}x)~F^{\prime}(\gamma^{\top}z)\tilde{x}_{j}^{2}]>0$.

  (ii) $\lim_{c\to-\infty}f(c)=\mathbb{E}_{(x,z)}[F(\beta^{\top}x)~\tilde{x}_{j}~\mathbf{1}\{\tilde{x}_{j}<0\}]<0$.

Proof of Lemma A.3.

We first prove (i). Since $\gamma^{\top}z=\gamma_{0}+c\tilde{x}_{j}+\gamma_{-j}^{\top}\tilde{z}_{-j}$, we have $\partial(\gamma^{\top}z)/\partial c=\tilde{x}_{j}$, and hence

$$\frac{\partial}{\partial c}\pi_{c}(x,z)=F(\beta^{\top}x)F^{\prime}(\gamma^{\top}z)\tilde{x}_{j}.$$

We then obtain

$$f^{\prime}(c)=\mathbb{E}_{(x,z)}\left[\frac{\partial}{\partial c}\{\pi_{c}(x,z)\tilde{x}_{j}\}\right]=\mathbb{E}_{(x,z)}\left[F(\beta^{\top}x)F^{\prime}(\gamma^{\top}z)\tilde{x}_{j}^{2}\right].$$

Since $F(\cdot)>0$, $F^{\prime}(\cdot)>0$, and $\tilde{x}_{j}^{2}\geq 0$ with $\mathbb{P}(\tilde{x}_{j}\neq 0)>0$ by (A2), the expectation is strictly positive and thus we have $f^{\prime}(c)>0$.

We next prove (ii). As $c\to-\infty$, we have

$$\gamma^{\top}z\to\begin{cases}+\infty,&\text{if}~\tilde{x}_{j}<0,\\ -\infty,&\text{if}~\tilde{x}_{j}>0,\\ \text{finite},&\text{if}~\tilde{x}_{j}=0,\end{cases}$$

and hence $F(\gamma^{\top}z)\tilde{x}_{j}\to\tilde{x}_{j}\mathbf{1}\{\tilde{x}_{j}<0\}$. Since $|\pi_{c}(x,z)\tilde{x}_{j}|\leq|\tilde{x}_{j}|$, which is integrable by (A1), the dominated convergence theorem gives

$$\lim_{c\to-\infty}f(c)=\mathbb{E}_{(x,z)}\left[F(\beta^{\top}x)\tilde{x}_{j}\mathbf{1}\{\tilde{x}_{j}<0\}\right].$$

On the event $\{\tilde{x}_{j}<0\}$, the integrand is strictly negative because $F(\beta^{\top}x)>0$ and $\tilde{x}_{j}<0$. Therefore, by (A2), we obtain $\lim_{c\to-\infty}f(c)<0$. ∎

We now prove Theorem 2.1.

Proof of Theorem 2.1.

By Lemma A.1(i) and Lemma A.2, the derivative of $g_{c}$ at $t=0$ is

$$g_{c}^{\prime}(0)=\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(0,x))\tilde{x}_{j}\right]=\mathbb{E}_{(x,z)}\left[\pi_{c}(x,z)\tilde{x}_{j}\right]=f(c).$$

By Lemma A.3(i), $f$ is strictly increasing in $c$, and by Lemma A.3(ii), $\lim_{c\to-\infty}f(c)<0$. Hence there exists a constant $C_{0}<0$ such that

$$c\leq C_{0}\quad\Longrightarrow\quad f(c)=g_{c}^{\prime}(0)<0.$$

Fix $c\leq C_{0}$. By Lemma A.1(ii), $g_{c}$ is concave. Therefore, for every $t>0$, we have $g_{c}(t)\leq g_{c}(0)+tg_{c}^{\prime}(0)<g_{c}(0)$, so no maximizer of $g_{c}$ can lie in $(0,\infty)$. Moreover, $t=0$ is not a maximizer because $g_{c}$ is differentiable at $0$ with $g_{c}^{\prime}(0)\neq 0$. Hence no maximizer of $g_{c}$ can lie in $[0,\infty)$. Since a maximizer exists by Assumption A.1(A4), we have $\arg\max_{t\in\mathbb{R}}g_{c}(t)\subset(-\infty,0)$. ∎
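The sign flip can also be observed in a small simulation. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: a single centered covariate, zero intercepts, a true coefficient of $1$ in the logistic component, and an illustrative value $c=-5$ for the coefficient of the same covariate in the structural-zero component. The ordinary logistic regression is fitted with a hand-written Newton iteration.

```python
import math
import random


def logistic(u):
    """Numerically stable logistic function F(u)."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def fit_logistic_1d(xs, ys, n_iter=25):
    """Newton's method for the logistic regression of y on (1, x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = logistic(b0 + b1 * x)
            w = mu * (1.0 - mu)
            g0 += y - mu          # score for the intercept
            g1 += (y - mu) * x    # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # solve the 2x2 Newton system
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


rng = random.Random(3)                  # illustrative seed
beta_slope, gamma_slope = 1.0, -5.0     # hypothetical true values
xs, ys = [], []
for _ in range(5000):
    x = rng.gauss(0.0, 1.0)
    h = rng.random() < logistic(gamma_slope * x)      # not a structural zero
    y_star = rng.random() < logistic(beta_slope * x)  # latent response
    xs.append(x)
    ys.append(1 if (h and y_star) else 0)

intercept_hat, slope_hat = fit_logistic_1d(xs, ys)
print(f"true slope: {beta_slope}, fitted slope ignoring zero inflation: {slope_hat:.3f}")
```

In this configuration the fitted slope takes the opposite sign to the true coefficient, which is the behavior Theorem 2.1 describes for sufficiently negative $c$.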

A.2 Proof of Proposition 2.2

We prove Proposition 2.2 as follows.

Proof of Proposition 2.2.

Assume that the data satisfy double separation, so there exist non-zero vectors $v\in\mathbb{R}^{d}$ and $w\in\mathbb{R}^{p}$ such that

$$\begin{aligned}
y_{i}=1 &\Longrightarrow v^{\top}x_{i}\geq 0,\quad w^{\top}z_{i}\geq 0,\\
y_{i}=0 &\Longrightarrow v^{\top}x_{i}\leq 0,\quad w^{\top}z_{i}\leq 0,
\end{aligned}$$

and at least one observation satisfies a strict inequality in the sense of the definition of double separation.

Fix $(\beta,\gamma)\in\mathbb{R}^{d}\times\mathbb{R}^{p}$, and define $\ell_{\beta,\gamma}(t):=L(\beta+tv,\gamma+tw)$ for $t\geq 0$. For each $i$, let

$$\begin{aligned}
a_{i}(t) &= F\left(\gamma^{\top}z_{i}+tw^{\top}z_{i}\right),\\
b_{i}(t) &= F\left(\beta^{\top}x_{i}+tv^{\top}x_{i}\right),\\
g_{i}(t) &= a_{i}(t)b_{i}(t).
\end{aligned}$$

Because $F^{\prime}(u)=F(u)\{1-F(u)\}$, we have

$$g_{i}^{\prime}(t)=g_{i}(t)\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right).$$

Therefore, we have

$$\begin{aligned}
\ell_{\beta,\gamma}^{\prime}(t) &= \sum_{i=1}^{n}\left\{\frac{y_{i}}{g_{i}(t)}-\frac{1-y_{i}}{1-g_{i}(t)}\right\}g_{i}^{\prime}(t)\\
&= \sum_{i=1}^{n}\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right).
\end{aligned}$$

If $y_{i}=1$, then

$$\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}=\frac{1-g_{i}(t)}{1-g_{i}(t)}=1,$$

and both $w^{\top}z_{i}\geq 0$ and $v^{\top}x_{i}\geq 0$. Hence, we obtain

$$\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right)\geq 0.$$

If $y_{i}=0$, then

$$\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}=-\frac{g_{i}(t)}{1-g_{i}(t)}<0,$$

while both $w^{\top}z_{i}\leq 0$ and $v^{\top}x_{i}\leq 0$. Hence, we again obtain

$$\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right)\geq 0.$$

Moreover, because at least one observation satisfies a strict inequality, we have

$$\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right)>0$$

for some $i$ and all $t\geq 0$. Indeed, if $y_{i}=1$ and either $w^{\top}z_{i}>0$ or $v^{\top}x_{i}>0$, then

$$(1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}>0,$$

because $0<a_{i}(t),b_{i}(t)<1$. Similarly, if $y_{i}=0$ and either $w^{\top}z_{i}<0$ or $v^{\top}x_{i}<0$, then

$$(1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}<0.$$

Therefore,

$$\ell_{\beta,\gamma}^{\prime}(t)>0\quad\text{for all}~t\geq 0.$$

Since $(\beta,\gamma)$ was arbitrary, every finite parameter point can be improved by moving a small positive amount along the direction $(v,w)$. Hence no finite point can be a maximizer of $L(\beta,\gamma)$, and the log-likelihood has no maximizer in $\mathbb{R}^{d}\times\mathbb{R}^{p}$. ∎
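The monotone increase of $\ell_{\beta,\gamma}(t)$ can be observed on a toy dataset. The sketch below assumes a shared design with an intercept and one covariate, $y_{i}=1$ exactly when the covariate is positive, and the direction $v=w=(0,1)^{\top}$, which satisfies the implications used in the proof; the data and starting point are hypothetical.

```python
import math


def logistic(u):
    """Numerically stable logistic function F(u)."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def log_lik(beta, gamma, xs, ys):
    """Log-likelihood with P(y = 1 | x) = F(beta^T x) F(gamma^T x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(sum(b * v for b, v in zip(beta, x)))
        p *= logistic(sum(g * v for g, v in zip(gamma, x)))
        total += math.log(p) if y == 1 else math.log1p(-p)
    return total


# Toy data: y_i = 1 exactly when the covariate is positive.
xs = [[1.0, c] for c in (-2.0, -1.2, -0.4, 0.3, 0.9, 1.7)]
ys = [0, 0, 0, 1, 1, 1]
v = w = [0.0, 1.0]                          # separating direction
beta0, gamma0 = [0.2, -0.5], [-0.3, 0.4]    # arbitrary finite starting point

values = []
for t in (0.0, 1.0, 2.0, 4.0, 8.0):
    beta_t = [b + t * d for b, d in zip(beta0, v)]
    gamma_t = [g + t * d for g, d in zip(gamma0, w)]
    values.append(log_lik(beta_t, gamma_t, xs, ys))
print([round(val, 4) for val in values])
```

The printed values increase with $t$, so no finite point on this ray can maximize the log-likelihood.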

A.3 Proof of Proposition 2.3

We prove Proposition 2.3 as follows.

Proof of Proposition 2.3.

Let $(\beta,\gamma)=t(v,w)$, $t:=\sqrt{\|\beta\|^{2}+\|\gamma\|^{2}}\geq 0$, and $\|v\|^{2}+\|w\|^{2}=1$. Because

$$F(u)\leq e^{u}\quad\text{and}\quad 1-F(u)\leq e^{-u},$$

we have

$$F(u_{1})F(u_{2})\leq e^{u_{1}+u_{2}},\quad\text{and}\quad 1-F(u_{1})F(u_{2})\leq e^{-u_{1}}+e^{-u_{2}},\tag{2}$$

for all $u_{1},u_{2}\in\mathbb{R}$.
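The second bound in (2) is stated without an intermediate step. One way to obtain it, offered here as a supplementary sketch rather than as the authors' own argument, is to split the complement of the product and use $0<F(u_{1})<1$:

```latex
\begin{align*}
1-F(u_{1})F(u_{2})
  &= \{1-F(u_{1})\} + F(u_{1})\{1-F(u_{2})\} \\
  &\leq \{1-F(u_{1})\} + \{1-F(u_{2})\} \\
  &\leq e^{-u_{1}} + e^{-u_{2}}.
\end{align*}
```

The last line uses $1-F(u)=1/(1+e^{u})\leq e^{-u}$, and the first bound in (2) follows directly from $F(u)=e^{u}/(1+e^{u})\leq e^{u}$.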

By $\varepsilon$–double–non-separation, for each unit direction $(v,w)$ at least one of the following holds:

  (a) there exists $i\in\{i:y_{i}=1\}$ such that $v^{\top}x_{i}+w^{\top}z_{i}\leq-\varepsilon$;

  (b) there exists $i\in\{i:y_{i}=0\}$ such that $\min\{v^{\top}x_{i},w^{\top}z_{i}\}\geq\varepsilon$.

If (a) holds, then for this $i\in\{i:y_{i}=1\}$, the first bound in (2) gives

$$\begin{aligned}
\log F(tv^{\top}x_{i})+\log F(tw^{\top}z_{i}) &= \log\{F(tv^{\top}x_{i})F(tw^{\top}z_{i})\}\\
&\leq t(v^{\top}x_{i}+w^{\top}z_{i})\leq-\varepsilon t.
\end{aligned}$$

Since every term of the log-likelihood is nonpositive, we have

$$L(tv,tw)\leq-\varepsilon t.$$

Next, if (b) holds, then for this $i\in\{i:y_{i}=0\}$, the second bound in (2) gives

$$\begin{aligned}
\log(1-F(tv^{\top}x_{i})F(tw^{\top}z_{i})) &\leq \log(e^{-tv^{\top}x_{i}}+e^{-tw^{\top}z_{i}})\\
&\leq \log 2-\varepsilon t.
\end{aligned}$$

Therefore, by the same argument, we have

$$L(tv,tw)\leq\log 2-\varepsilon t.$$

Consequently, we obtain

$$\sup_{(v,w):\|v\|^{2}+\|w\|^{2}=1}L(tv,tw)\leq\log 2-\varepsilon t\to-\infty\quad\text{as}\quad t\to\infty.\tag{3}$$

Since $\log 2-\varepsilon t\to-\infty$ as $t\to\infty$, there exists a sufficiently large radius $t_{0}>0$ such that $\log 2-\varepsilon t_{0}<L(0,0)$. We then define the closed ball

$$\mathcal{B}_{t_{0}}=\{(\beta^{\top},\gamma^{\top})^{\top}\in\mathbb{R}^{d+p}:\|(\beta^{\top},\gamma^{\top})^{\top}\|\leq t_{0}\}.$$

As $\mathcal{B}_{t_{0}}$ is a compact set and $L$ is continuous, $L$ attains its maximum over $\mathcal{B}_{t_{0}}$ at some point $(\hat{\beta},\hat{\gamma})^{\top}\in\mathcal{B}_{t_{0}}$. Furthermore, for any $(\beta^{\top},\gamma^{\top})^{\top}$ outside this ball, by (3), we have

$$L(\beta,\gamma)\leq\log 2-\varepsilon\|(\beta^{\top},\gamma^{\top})^{\top}\|<\log 2-\varepsilon t_{0}<L(0,0).$$

Since $L(0,0)\leq L(\hat{\beta},\hat{\gamma})$, it follows that $L(\beta,\gamma)<L(\hat{\beta},\hat{\gamma})$ for all $(\beta^{\top},\gamma^{\top})^{\top}\notin\mathcal{B}_{t_{0}}$. Therefore, $(\hat{\beta},\hat{\gamma})^{\top}$ is the global maximizer of $L(\beta,\gamma)$. ∎

A.4 Proofs of Proposition 3.1, Theorem 3.2, and Corollary 3.3

We first recall a standard linear-independence property of exponential functions.

Lemma A.4.

Let $\lambda_{1},\ldots,\lambda_{m}\in\mathbb{R}^{d-1}$ be distinct vectors, and let $\mathcal{U}\subset\mathbb{R}^{d-1}$ contain a nonempty open set. If

$$
\sum_{k=1}^{m}a_{k}\exp\left(\lambda_{k}^{\top}w\right)=0
$$

for all $w\in\mathcal{U}$, then $a_{1}=\cdots=a_{m}=0$.

Proof of Lemma A.4.

Choose $w_{0}$ in the interior of $\mathcal{U}$. Since the finitely many hyperplanes

$$
\left\{v\in\mathbb{R}^{d-1}:(\lambda_{k}-\lambda_{\ell})^{\top}v=0\right\},\quad k\neq\ell,\quad k,\ell\in\{1,\ldots,m\},
$$

do not cover $\mathbb{R}^{d-1}$, there exists $v\in\mathbb{R}^{d-1}$ such that $\lambda_{1}^{\top}v,\ldots,\lambda_{m}^{\top}v$ are pairwise distinct. For all sufficiently small $|t|$, we have $w_{0}+tv\in\mathcal{U}$, and hence

$$
0=\sum_{k=1}^{m}a_{k}\exp\left(\lambda_{k}^{\top}(w_{0}+tv)\right)=\sum_{k=1}^{m}a_{k}\exp\left(\lambda_{k}^{\top}w_{0}\right)\exp\left((\lambda_{k}^{\top}v)t\right).
$$

Since one-dimensional exponential functions with distinct exponents are linearly independent on any open interval, we obtain

$$
a_{k}\exp\left(\lambda_{k}^{\top}w_{0}\right)=0
$$

for $k=1,\ldots,m$. Therefore, we have $a_{k}=0$ for $k=1,\ldots,m$. ∎

We now prove Proposition 3.1.

Proof of Proposition 3.1.

Because (1) holds for all $y\in\{0,1\}$, it is equivalent to

$$
F\left(x^{\top}\beta_{1}\right)F\left(x^{\top}\gamma_{1}\right)=F\left(x^{\top}\beta_{2}\right)F\left(x^{\top}\gamma_{2}\right)
$$

for all $x=(1,\tilde{x}_{-0}^{\top})^{\top}$ with $\tilde{x}_{-0}\in\mathcal{U}$. Using $F(\mu)=1/(1+\exp(-\mu))$, we obtain

$$
\begin{aligned}
&\left\{1+\exp(-\beta_{1,0})\exp\left(-\beta_{1,-0}^{\top}\tilde{x}_{-0}\right)\right\}\left\{1+\exp(-\gamma_{1,0})\exp\left(-\gamma_{1,-0}^{\top}\tilde{x}_{-0}\right)\right\}\\
&\quad=\left\{1+\exp(-\beta_{2,0})\exp\left(-\beta_{2,-0}^{\top}\tilde{x}_{-0}\right)\right\}\left\{1+\exp(-\gamma_{2,0})\exp\left(-\gamma_{2,-0}^{\top}\tilde{x}_{-0}\right)\right\}
\end{aligned}
$$

for all $\tilde{x}_{-0}\in\mathcal{U}$. Therefore, we have

$$
\begin{aligned}
&\exp(-\beta_{1,0})\exp(-\beta_{1,-0}^{\top}\tilde{x}_{-0})+\exp(-\gamma_{1,0})\exp(-\gamma_{1,-0}^{\top}\tilde{x}_{-0})\\
&\quad+\exp(-\beta_{1,0}-\gamma_{1,0})\exp(-(\beta_{1,-0}+\gamma_{1,-0})^{\top}\tilde{x}_{-0})\\
&=\exp(-\beta_{2,0})\exp(-\beta_{2,-0}^{\top}\tilde{x}_{-0})+\exp(-\gamma_{2,0})\exp(-\gamma_{2,-0}^{\top}\tilde{x}_{-0})\\
&\quad+\exp(-\beta_{2,0}-\gamma_{2,0})\exp(-(\beta_{2,-0}+\gamma_{2,-0})^{\top}\tilde{x}_{-0})
\end{aligned}
\tag{4}
$$

for all $\tilde{x}_{-0}\in\mathcal{U}$. Since $\beta_{1,-0}$, $\gamma_{1,-0}$, and $\beta_{1,-0}+\gamma_{1,-0}$ are pairwise distinct, the left-hand side of (4) is a linear combination of three distinct exponential functions and can be written as

$$
\sum_{j=1}^{3}\exp(c_{1,j})\exp\left(-\lambda_{1,j}^{\top}\tilde{x}_{-0}\right),
$$

where $\{\lambda_{1,1},\lambda_{1,2},\lambda_{1,3}\}=\{\beta_{1,-0},\gamma_{1,-0},\beta_{1,-0}+\gamma_{1,-0}\}$ and $\{c_{1,1},c_{1,2},c_{1,3}\}=\{-\beta_{1,0},-\gamma_{1,0},-\beta_{1,0}-\gamma_{1,0}\}$. The right-hand side can be written as

$$
\sum_{j=1}^{m}\exp(c_{2,j})\exp\left(-\lambda_{2,j}^{\top}\tilde{x}_{-0}\right),\quad\text{for}\quad 1\leq m\leq 3,
$$

where $\lambda_{2,1},\ldots,\lambda_{2,m}$ are distinct. Therefore, we have

$$
\sum_{j=1}^{3}\exp(c_{1,j})\exp\left(-\lambda_{1,j}^{\top}\tilde{x}_{-0}\right)-\sum_{j=1}^{m}\exp(c_{2,j})\exp\left(-\lambda_{2,j}^{\top}\tilde{x}_{-0}\right)=0.
$$

By Lemma A.4, if the sets of vectors $\{\lambda_{1,1},\lambda_{1,2},\lambda_{1,3}\}$ and $\{\lambda_{2,1},\ldots,\lambda_{2,m}\}$ were not identical, then some exponent would appear on only one side, and its coefficient in the combined linear combination, which is either $\exp(c_{1,j})$ or $-\exp(c_{2,k})$, would have to be zero. This is a contradiction because the exponential function is strictly positive. Therefore, we obtain $m=3$ and

$$
\left\{\beta_{1,-0},\gamma_{1,-0},\beta_{1,-0}+\gamma_{1,-0}\right\}=\left\{\beta_{2,-0},\gamma_{2,-0},\beta_{2,-0}+\gamma_{2,-0}\right\}.
$$

Since $\beta_{1,-0}$, $\gamma_{1,-0}$, and $\beta_{1,-0}+\gamma_{1,-0}$ are pairwise distinct, $\beta_{1,-0}+\gamma_{1,-0}$ is the unique element in the set $\{\beta_{1,-0},\gamma_{1,-0},\beta_{1,-0}+\gamma_{1,-0}\}$ that can be expressed as the sum of the other two elements. (Footnote 1: The pairwise-distinctness condition implies that both $\beta_{1,-0}$ and $\gamma_{1,-0}$ are non-zero, and ensures that neither can be expressed as the sum of the other two elements in the set.) Because the sets $\{\beta_{1,-0},\gamma_{1,-0},\beta_{1,-0}+\gamma_{1,-0}\}$ and $\{\beta_{2,-0},\gamma_{2,-0},\beta_{2,-0}+\gamma_{2,-0}\}$ are identical, their unique sum elements must be equal. Therefore, we have

$$
\beta_{1,-0}+\gamma_{1,-0}=\beta_{2,-0}+\gamma_{2,-0},
$$

which implies that the sets of the remaining elements are also identical:

$$
\{\beta_{1,-0},\gamma_{1,-0}\}=\{\beta_{2,-0},\gamma_{2,-0}\}.
$$

If $\beta_{1,-0}=\beta_{2,-0}$ and $\gamma_{1,-0}=\gamma_{2,-0}$, then comparing the coefficients of the matching exponential functions in (4) gives $\exp(-\beta_{1,0})=\exp(-\beta_{2,0})$ and $\exp(-\gamma_{1,0})=\exp(-\gamma_{2,0})$, and hence $\beta_{1}=\beta_{2}$ and $\gamma_{1}=\gamma_{2}$. If instead $\beta_{1,-0}=\gamma_{2,-0}$ and $\gamma_{1,-0}=\beta_{2,-0}$, then we have $\exp(-\beta_{1,0})=\exp(-\gamma_{2,0})$ and $\exp(-\gamma_{1,0})=\exp(-\beta_{2,0})$, and hence $\beta_{1}=\gamma_{2}$ and $\gamma_{1}=\beta_{2}$. Therefore, we have either $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$ or $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$. ∎
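The exchange symmetry that Proposition 3.1 leaves unresolved can be seen numerically. The following is a minimal sketch, assuming a shared design so that both components use the same covariate vector; the parameter values, the covariate vector, and the helper names `logistic` and `success_prob` are hypothetical and chosen only for illustration.

```python
import math

def logistic(u):
    """Logistic function F(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def success_prob(x, beta, gamma):
    """P(y = 1 | x) = F(x'beta) * F(x'gamma) under a shared design."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    xg = sum(xi * gi for xi, gi in zip(x, gamma))
    return logistic(xb) * logistic(xg)

# Hypothetical parameter values and covariate vector (intercept first).
beta = [0.5, 1.0, -0.3]
gamma = [1.7, -1.0, 0.8]
x = [1.0, 0.4, -1.2]

p_original = success_prob(x, beta, gamma)  # parameters in the order (beta, gamma)
p_swapped = success_prob(x, gamma, beta)   # parameters exchanged
```

Because the response probability is a product of two logistic factors evaluated at the same `x`, `p_original` and `p_swapped` coincide, so the data cannot distinguish $(\beta,\gamma)$ from $(\gamma,\beta)$.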

We next prove Theorem 3.2.

Proof of Theorem 3.2.

Fix $\xi\in\mathcal{S}$. For $j=1,2$, define $\beta_{j,0}^{(\xi)}:=\beta_{j,0}+\beta_{j,-0}^{(2)\top}\xi$ and $\gamma_{j,0}^{(\xi)}:=\gamma_{j,0}+\gamma_{j,-0}^{(2)\top}\xi$. Then, for all $y\in\{0,1\}$ and all $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$, (1) can be expressed as

$$
\begin{aligned}
&p\left(y\mid(1,\tilde{x}_{-0}^{(1)\top})^{\top},(\beta_{1,0}^{(\xi)},\beta_{1,-0}^{(1)\top})^{\top},(\gamma_{1,0}^{(\xi)},\gamma_{1,-0}^{(1)\top})^{\top}\right)\\
&\quad=p\left(y\mid(1,\tilde{x}_{-0}^{(1)\top})^{\top},(\beta_{2,0}^{(\xi)},\beta_{2,-0}^{(1)\top})^{\top},(\gamma_{2,0}^{(\xi)},\gamma_{2,-0}^{(1)\top})^{\top}\right).
\end{aligned}
$$

Because $\mathcal{U}_{\xi}$ is a nonempty open subset of $\mathbb{R}^{r}$ and $\beta_{1,-0}^{(1)}$, $\gamma_{1,-0}^{(1)}$, $\beta_{1,-0}^{(1)}+\gamma_{1,-0}^{(1)}$ are pairwise distinct, Proposition 3.1 yields that, for each $\xi\in\mathcal{S}$, we have either

$$
(\beta_{1,-0}^{(1)},\gamma_{1,-0}^{(1)})=(\beta_{2,-0}^{(1)},\gamma_{2,-0}^{(1)})\quad\text{and}\quad(\beta_{1,0}^{(\xi)},\gamma_{1,0}^{(\xi)})=(\beta_{2,0}^{(\xi)},\gamma_{2,0}^{(\xi)}), \tag{5}
$$

or

$$
(\beta_{1,-0}^{(1)},\gamma_{1,-0}^{(1)})=(\gamma_{2,-0}^{(1)},\beta_{2,-0}^{(1)})\quad\text{and}\quad(\beta_{1,0}^{(\xi)},\gamma_{1,0}^{(\xi)})=(\gamma_{2,0}^{(\xi)},\beta_{2,0}^{(\xi)}). \tag{6}
$$

We claim that the same set of equations must hold for all $\xi\in\mathcal{S}$. Indeed, if (5) holds for some $\xi\in\mathcal{S}$ and (6) holds for some $\xi^{\prime}\in\mathcal{S}$, then $\beta_{1,-0}^{(1)}=\beta_{2,-0}^{(1)}=\gamma_{1,-0}^{(1)}$, which contradicts $\beta_{1,-0}^{(1)}\neq\gamma_{1,-0}^{(1)}$. Therefore, either (5) holds for all $\xi\in\mathcal{S}$, or (6) holds for all $\xi\in\mathcal{S}$.

First, suppose that (5) holds for all $\xi\in\mathcal{S}$. Then, we have

$$
\beta_{1,0}+\beta_{1,-0}^{(2)\top}\xi=\beta_{2,0}+\beta_{2,-0}^{(2)\top}\xi,\qquad
\gamma_{1,0}+\gamma_{1,-0}^{(2)\top}\xi=\gamma_{2,0}+\gamma_{2,-0}^{(2)\top}\xi,
$$

for all $\xi\in\mathcal{S}$. Hence, both of the following functions:

$$
\begin{aligned}
\xi&\mapsto\beta_{1,0}+\beta_{1,-0}^{(2)\top}\xi-\beta_{2,0}-\beta_{2,-0}^{(2)\top}\xi,\\
\xi&\mapsto\gamma_{1,0}+\gamma_{1,-0}^{(2)\top}\xi-\gamma_{2,0}-\gamma_{2,-0}^{(2)\top}\xi,
\end{aligned}
$$

are identically zero on $\mathcal{S}$. Since both functions are affine in $\xi$ and $\operatorname{aff}(\mathcal{S})=\mathbb{R}^{s}$, both functions are also identically zero on $\mathbb{R}^{s}$. Therefore, we have $\beta_{1,0}=\beta_{2,0}$, $\beta_{1,-0}^{(2)}=\beta_{2,-0}^{(2)}$, $\gamma_{1,0}=\gamma_{2,0}$, and $\gamma_{1,-0}^{(2)}=\gamma_{2,-0}^{(2)}$. Combining (5) with this, we obtain $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$.

Next, suppose that (6) holds for all $\xi\in\mathcal{S}$. Then, by a similar argument, we obtain $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$.

Therefore, we have either $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$ or $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$. ∎

We finally prove Corollary 3.3.

Proof of Corollary 3.3.

Let $\pi_{\beta,\gamma}(x):=F(x^{\top}\beta)F(x^{\top}\gamma)$ and $\pi^{\ast}(x):=\pi_{\beta^{\ast},\gamma^{\ast}}(x)$. Under correct specification, we have

$$
y\mid x\sim\operatorname{Bernoulli}\left(\pi^{\ast}(x)\right).
$$

Therefore, the expected log-likelihood function has the following expression:

$$
\mathcal{L}(\beta,\gamma)=\mathbb{E}_{x}\left[\pi^{\ast}(x)\log\pi_{\beta,\gamma}(x)+\{1-\pi^{\ast}(x)\}\log\{1-\pi_{\beta,\gamma}(x)\}\right],
$$

and, hence, we have

$$
\begin{aligned}
&\mathcal{L}(\beta,\gamma)-\mathcal{L}(\beta^{\ast},\gamma^{\ast})\\
&\quad=\mathbb{E}_{x}\left[\pi^{\ast}(x)\log\pi_{\beta,\gamma}(x)+\{1-\pi^{\ast}(x)\}\log\{1-\pi_{\beta,\gamma}(x)\}\right]\\
&\qquad-\mathbb{E}_{x}\left[\pi^{\ast}(x)\log\pi^{\ast}(x)+\{1-\pi^{\ast}(x)\}\log\{1-\pi^{\ast}(x)\}\right]\\
&\quad=-\mathbb{E}_{x}\left[\pi^{\ast}(x)\log\frac{\pi^{\ast}(x)}{\pi_{\beta,\gamma}(x)}+\{1-\pi^{\ast}(x)\}\log\frac{1-\pi^{\ast}(x)}{1-\pi_{\beta,\gamma}(x)}\right]\\
&\quad=-\mathbb{E}_{x}\left[\mathrm{KL}\left(\operatorname{Bernoulli}\left(\pi^{\ast}(x)\right)\,\middle\|\,\operatorname{Bernoulli}\left(\pi_{\beta,\gamma}(x)\right)\right)\right]\leq 0,
\end{aligned}
$$

for any $(\beta,\gamma)\in\mathbb{R}^{d}\times\mathbb{R}^{d}$. Here, $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence from the first probability distribution to the second. Thus $[\beta^{\ast},\gamma^{\ast}]$ is a maximizer on $(\mathbb{R}^{d}\times\mathbb{R}^{d})/{\sim}$.
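The identity between the expected log-likelihood gap and the negative Kullback-Leibler divergence can be checked pointwise at a fixed $x$. The sketch below is illustrative only: the probabilities are hypothetical values rather than quantities from the paper, and the function names are introduced here.

```python
import math

def bernoulli_kl(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) ) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def expected_loglik(p_true, p_model):
    """Pointwise expected log-likelihood: p* log p + (1 - p*) log(1 - p)."""
    return p_true * math.log(p_model) + (1.0 - p_true) * math.log(1.0 - p_model)

p_star = 0.3  # hypothetical true success probability at a fixed x
gaps = []
for p_model in [0.05, 0.3, 0.6, 0.95]:
    gap = expected_loglik(p_star, p_model) - expected_loglik(p_star, p_star)
    # Store the model value, the log-likelihood gap, and minus the KL divergence.
    gaps.append((p_model, gap, -bernoulli_kl(p_star, p_model)))
```

Each stored gap equals the corresponding negative divergence and is nonpositive, vanishing only when the model probability equals `p_star`, which mirrors the argument used in the proof.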

Now suppose that $(\beta^{\dagger},\gamma^{\dagger})\in\mathbb{R}^{d}\times\mathbb{R}^{d}$ also attains the maximum. Then, it must hold that

$$
\mathrm{KL}\left(\operatorname{Bernoulli}\left(\pi^{\ast}(x)\right)\,\middle\|\,\operatorname{Bernoulli}\left(\pi_{\beta^{\dagger},\gamma^{\dagger}}(x)\right)\right)=0,\quad\text{almost surely.}
$$

Therefore, we obtain

$$
\pi_{\beta^{\dagger},\gamma^{\dagger}}(x)=\pi^{\ast}(x),\quad\text{for almost every } x. \tag{7}
$$

Let $x=(1,\tilde{x}_{-0}^{(1)\top},\tilde{x}_{-0}^{(2)\top})^{\top}$. Fix $\xi\in\mathcal{S}$ and define

$$
\begin{aligned}
g_{\dagger,\xi}(\tilde{x}_{-0}^{(1)})&:=\pi_{\beta^{\dagger},\gamma^{\dagger}}\left((1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}\right),\\
g_{\ast,\xi}(\tilde{x}_{-0}^{(1)})&:=\pi^{\ast}\left((1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}\right).
\end{aligned}
$$

Both functions are continuous on $\mathbb{R}^{r}$. We claim that $g_{\dagger,\xi}(\tilde{x}_{-0}^{(1)})=g_{\ast,\xi}(\tilde{x}_{-0}^{(1)})$ holds for all $\xi\in\mathcal{S}$ and all $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$. Suppose not. Then there exist $\xi\in\mathcal{S}$ and $w\in\mathcal{U}_{\xi}$ such that $g_{\dagger,\xi}(w)\neq g_{\ast,\xi}(w)$. By continuity, there exists a nonempty open neighborhood $V\subset\mathcal{U}_{\xi}$ of $w$ such that the two functions remain different on $V$. Because $\xi\in\operatorname{supp}(\tilde{x}_{-0}^{(2)})$ and $w$ belongs to the conditional support of $\tilde{x}_{-0}^{(1)}$ given $\tilde{x}_{-0}^{(2)}=\xi$, we have $\mathbb{P}(\tilde{x}_{-0}^{(2)}=\xi,\ \tilde{x}_{-0}^{(1)}\in V)>0$. This contradicts (7). Therefore, we obtain

$$
\pi_{\beta^{\dagger},\gamma^{\dagger}}(x)=\pi^{\ast}(x)
$$

for all $x=(1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}$ with $\xi\in\mathcal{S}$ and $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$.

From the discussion above, we have

$$
p(y\mid x,\beta^{\dagger},\gamma^{\dagger})=p(y\mid x,\beta^{\ast},\gamma^{\ast})
$$

for all $y\in\{0,1\}$ and all $x=(1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}$ with $\xi\in\mathcal{S}$ and $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$. By Theorem 3.2, it follows that

$$
(\beta^{\dagger},\gamma^{\dagger})\sim(\beta^{\ast},\gamma^{\ast}).
$$

Therefore, the expected log-likelihood $\mathcal{L}(\beta,\gamma)$ is uniquely maximized on $(\mathbb{R}^{d}\times\mathbb{R}^{d})/{\sim}$ at the class $[\beta^{\ast},\gamma^{\ast}]$. ∎

Appendix B Details of Numerical Settings of Bimodality Confirmation

B.1 Posterior Distribution and Sampling Algorithm

We consider the posterior distribution given by

$$
\pi(\beta,\gamma,h\mid y,x,z)\propto\left\{\prod_{i=1}^{n}p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)\right\}p_{\mathrm{prior}}(\beta,\gamma), \tag{8}
$$

where $p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)$ is defined as

$$
p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)=\left\{F(\gamma^{\top}z_{i})\right\}^{h_{i}}\left\{1-F(\gamma^{\top}z_{i})\right\}^{1-h_{i}}\left\{F(\beta^{\top}x_{i})\right\}^{y_{i}}\left\{1-F(\beta^{\top}x_{i})\right\}^{h_{i}-y_{i}},
$$

and $p_{\mathrm{prior}}(\beta,\gamma)$ denotes the prior distribution. The complete-data likelihood can be separated into a logistic regression for $h$,

$$
h_{i}\mid z_{i}\sim\operatorname{Bernoulli}(F(\gamma^{\top}z_{i})),
$$

and a logistic regression for $y^{\ast}$,

$$
y_{i}^{\ast}\mid x_{i}\sim\operatorname{Bernoulli}(F(\beta^{\top}x_{i})),
$$

conditioned on $h_{i}=1$. Thus $h_{i}=0$ generates a structural zero, whereas $h_{i}=1$ allows the ordinary logistic regression. The observed outcome $y_{i}$ is then given by $y_{i}=h_{i}y_{i}^{\ast}$. We use this structure to construct an MCMC algorithm. We further employ the Pólya-Gamma augmentation (Polson et al., 2013) and combine the resulting Gibbs sampling algorithm with replica exchange so that the sampler can move between the multiple modes. The detailed algorithms are provided in Appendix C.

B.2 Data Generation and Sampling Setup

Numerical experiments were conducted under the shared-design setting. We considered three designs that differ only in the distribution of the covariates. In each scenario, we generated $n=2{,}000$ observations with $d=p=5$ covariates including an intercept. The true coefficient vectors were fixed at

$$
\begin{aligned}
\beta^{*}&=(0.5,~1.0,~0.5,~0.5,~0.25)^{\top},\\
\gamma^{*}&=(1.7,~-1.0,~-1.0,~0.5,~0.5)^{\top}.
\end{aligned}
$$

In Scenario 1, all non-intercept covariates were drawn from independent standard normal distributions. In Scenario 2, all non-intercept covariates were drawn from independent $\operatorname{Bernoulli}(0.5)$ distributions. In Scenario 3, the first two non-intercept covariates were drawn from independent standard normal distributions, and the remaining ones were drawn from independent $\operatorname{Bernoulli}(0.5)$ distributions. Given the covariates, we generated

$$
\begin{aligned}
h_{i}&\sim\operatorname{Bernoulli}\left(F(\gamma^{*\top}x_{i})\right),\\
y_{i}^{\ast}&\sim\operatorname{Bernoulli}\left(F(\beta^{*\top}x_{i})\right),
\end{aligned}
$$

independently, and set $y_{i}=h_{i}y_{i}^{\ast}$.
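The data-generating mechanism for Scenario 1 can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `simulate_shared_design` and the random seed are assumptions made here, not details taken from the paper.

```python
import numpy as np

def logistic(u):
    """Logistic function F(u) = 1 / (1 + exp(-u)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-u))

def simulate_shared_design(n, beta, gamma, rng):
    """Draw (X, h, y) with standard normal non-intercept covariates (Scenario 1)."""
    d = len(beta)
    X = np.column_stack([np.ones(n), rng.standard_normal((n, d - 1))])
    h = rng.binomial(1, logistic(X @ gamma))      # latent indicator; h = 0 is a structural zero
    y_star = rng.binomial(1, logistic(X @ beta))  # latent ordinary logistic response
    y = h * y_star                                # observed response
    return X, h, y

beta_true = np.array([0.5, 1.0, 0.5, 0.5, 0.25])
gamma_true = np.array([1.7, -1.0, -1.0, 0.5, 0.5])
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
X, h, y = simulate_shared_design(2000, beta_true, gamma_true, rng)
```

Scenarios 2 and 3 would differ only in how the non-intercept columns of `X` are drawn, for example by replacing the normal draws with Bernoulli(0.5) draws for the relevant columns.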

We placed weakly informative Gaussian priors on both coefficient vectors: $\beta\sim\mathcal{N}(0,100I_{d})$ and $\gamma\sim\mathcal{N}(0,100I_{p})$. Posterior sampling was performed via a Gibbs sampling algorithm with replica exchange using $20$ replicas. The temperature schedule followed a geometric progression $T_{m}=r^{m}$ with $r=1.05$, and replica exchange was attempted every $50$ iterations. The total number of MCMC iterations was set to $53{,}000$, with the first $3{,}000$ discarded as burn-in. To facilitate efficient sampling, Gibbs sampling based on Pólya-Gamma data augmentation was used for both the ordinary logistic regression component and the structural-zero component.
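The sampling schedule can be written down in a few lines. The sketch below assumes that the exponents of the geometric ladder start at 0, so that the coldest replica has temperature 1 as in Appendix C; this indexing and the helper name `geometric_ladder` are assumptions made for illustration.

```python
def geometric_ladder(n_replicas, ratio, first_exponent=0):
    """Temperatures ratio**k for k = first_exponent, ..., first_exponent + n_replicas - 1."""
    return [ratio ** (first_exponent + m) for m in range(n_replicas)]

temps = geometric_ladder(n_replicas=20, ratio=1.05)

n_iter, burn_in, swap_every = 53_000, 3_000, 50
n_kept = n_iter - burn_in             # draws retained after burn-in
n_swap_rounds = n_iter // swap_every  # number of replica-exchange attempts
```

Under these settings `n_kept` is 50,000, which agrees with the total cluster size reported for each scenario in Table 6.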

B.3 Sampling Results

To explore the structure of the posterior distribution, we applied the $k$-means++ clustering algorithm (Arthur and Vassilvitskii, 2007) with $k=2$ to the posterior samples. The samples, consisting of both $\beta$ and $\gamma$ parameters concatenated as $10$-dimensional vectors, were projected onto the first two principal components using PCA for visualization. See Figure 1 for the plots.
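The clustering and projection steps can be sketched with NumPy alone. This is an illustrative stand-in rather than the code used for the experiments: the draws below are synthetic points around two hypothetical modes related by swapping the two halves of the vector, and `kmeans_pp` and `pca_project` are names introduced here.

```python
import numpy as np

def kmeans_pp(samples, k, rng, n_iter=50):
    """Lloyd's algorithm with k-means++ seeding; returns (labels, centers)."""
    n = samples.shape[0]
    centers = [samples[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min([np.sum((samples - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(samples[rng.choice(n, p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = samples[labels == j].mean(axis=0)
    return labels, centers

def pca_project(samples, n_components=2):
    """Project centered samples onto the leading principal components via SVD."""
    centered = samples - samples.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic stand-in for posterior draws of the concatenated (beta, gamma) vector.
rng = np.random.default_rng(1)
mode = np.concatenate([np.full(5, 1.0), np.full(5, -1.0)])
draws = np.vstack([mode + 0.1 * rng.standard_normal((200, 10)),
                   mode[::-1] + 0.1 * rng.standard_normal((200, 10))])
labels, centers = kmeans_pp(draws, k=2, rng=rng)
scores = pca_project(draws)  # two-dimensional coordinates for plotting
```

With real posterior draws, the per-cluster means of `draws` would play the role of the cluster-wise posterior means, and `scores` would be plotted with points colored by `labels`.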

Table 6 summarizes the posterior means within each cluster along with cluster sizes and proportions. In Scenarios 1 and 3, the means of the two clusters exhibited an approximately symmetric structure, reflecting the exchange symmetry between $(\beta,\gamma)$ and $(\gamma,\beta)$. In Scenario 2, the posterior means from the smaller cluster were numerically large, especially in the structural-zero component, consistent with the failure of the binary design to satisfy the condition (C2), which results in a lack of guaranteed identifiability.

The trace plots and the histograms for the posterior distributions are provided in Appendix D.

Table 6: Posterior means of each parameter for the clusters in Scenarios 1–3.

| Parameter | True | Scenario 1, Cluster 1 | Scenario 1, Cluster 2 | Scenario 2, Cluster 1 | Scenario 2, Cluster 2 | Scenario 3, Cluster 1 | Scenario 3, Cluster 2 |
|---|---|---|---|---|---|---|---|
| $\beta_{0}$ | 0.5 | 0.325 | 2.886 | 9.700 | 17.168 | 0.202 | 2.713 |
| $\beta_{1}$ | 1.0 | 0.871 | -1.349 | 5.136 | 0.847 | 0.760 | -1.453 |
| $\beta_{2}$ | 0.5 | 0.355 | -1.572 | 3.180 | 5.330 | 0.234 | -1.283 |
| $\beta_{3}$ | 0.5 | 0.515 | 0.544 | 4.194 | -0.704 | 0.532 | 0.737 |
| $\beta_{4}$ | 0.25 | 0.282 | 0.668 | 8.992 | -0.187 | 0.574 | 0.158 |
| $\gamma_{0}$ | 1.7 | 2.413 | 0.200 | 0.280 | 0.265 | 2.794 | 0.216 |
| $\gamma_{1}$ | -1.0 | -1.154 | 0.803 | -0.214 | -0.204 | -1.491 | 0.773 |
| $\gamma_{2}$ | -1.0 | -1.413 | 0.284 | -0.622 | -0.622 | -1.307 | 0.249 |
| $\gamma_{3}$ | 0.5 | 0.524 | 0.516 | 0.570 | 0.575 | 0.749 | 0.532 |
| $\gamma_{4}$ | 0.5 | 0.625 | 0.292 | 0.470 | 0.485 | 0.154 | 0.575 |
| Cluster size | | 13,250 | 36,750 | 8,395 | 41,605 | 25,450 | 24,550 |
| Proportion | | 0.265 | 0.735 | 0.168 | 0.832 | 0.509 | 0.491 |

Appendix C Details of Sampling Algorithm

For each replica $m=1,\ldots,M$, let $T_{m}>0$ denote the temperature. The tempered posterior distribution is defined as

$$
\pi_{T_{m}}(\beta,\gamma,h\mid y,x,z)\propto\left\{\prod_{i=1}^{n}p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)\right\}^{1/T_{m}}p_{\mathrm{prior}}(\beta,\gamma),
$$

where

$$
p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)=\left\{F(z_{i}^{\top}\gamma)\right\}^{h_{i}}\left\{1-F(z_{i}^{\top}\gamma)\right\}^{1-h_{i}}\left\{F(x_{i}^{\top}\beta)\right\}^{y_{i}}\left\{1-F(x_{i}^{\top}\beta)\right\}^{h_{i}-y_{i}},
$$

for $y_{i}\leq h_{i}$. We assume Gaussian priors: $\beta\sim\mathcal{N}(b_{0},B_{0})$ and $\gamma\sim\mathcal{N}(g_{0},G_{0})$.

C.1 Pólya-Gamma Gibbs Sampling Step

For each replica $m$, we perform Gibbs sampling using the following steps.

Step 1. Updating $h_{i}$.

If $y_{i}=1$, then $h_{i}$ is deterministically set to $1$. For observations with $y_{i}=0$, we sample $h_{i}$ from

$$
h_{i}\mid\beta,\gamma,y_{i}=0\sim\operatorname{Bernoulli}(\mu_{i}^{(T_{m})}),
$$

where

$$
\begin{aligned}
\mu_{i}^{(T_{m})}&=\frac{\left[F(z_{i}^{\top}\gamma)\{1-F(x_{i}^{\top}\beta)\}\right]^{1/T_{m}}}{\left[F(z_{i}^{\top}\gamma)\{1-F(x_{i}^{\top}\beta)\}\right]^{1/T_{m}}+\left[1-F(z_{i}^{\top}\gamma)\right]^{1/T_{m}}}\\
&=F\!\left(\frac{z_{i}^{\top}\gamma-\log\{1+\exp(x_{i}^{\top}\beta)\}}{T_{m}}\right).
\end{aligned}
$$
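The two expressions for $\mu_{i}^{(T_{m})}$ can be compared numerically. The following sketch evaluates both forms on a small grid of hypothetical linear predictors and temperatures; the function names are introduced here for illustration.

```python
import math

def logistic(u):
    """Logistic function F(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def h_prob_ratio(z_gamma, x_beta, temp):
    """Tempered P(h_i = 1 | y_i = 0) written as a ratio of powered weights."""
    susceptible = (logistic(z_gamma) * (1.0 - logistic(x_beta))) ** (1.0 / temp)
    structural = (1.0 - logistic(z_gamma)) ** (1.0 / temp)
    return susceptible / (susceptible + structural)

def h_prob_closed_form(z_gamma, x_beta, temp):
    """Same probability via the logistic closed form."""
    return logistic((z_gamma - math.log1p(math.exp(x_beta))) / temp)

# Hypothetical values of z_i' gamma, x_i' beta, and T_m.
checks = [(zg, xb, T, h_prob_ratio(zg, xb, T), h_prob_closed_form(zg, xb, T))
          for zg in (-2.0, 0.3, 1.5)
          for xb in (-1.0, 0.0, 2.5)
          for T in (1.0, 1.05, 2.0)]
```

In an implementation, the closed form is usually preferable because it needs a single logistic evaluation per observation and avoids forming small powered weights.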

Step 2. Updating $\gamma$.

The tempered likelihood for $\gamma$ is

$$
\prod_{i=1}^{n}\frac{\exp\left\{(h_{i}/T_{m})z_{i}^{\top}\gamma\right\}}{\{1+\exp(z_{i}^{\top}\gamma)\}^{1/T_{m}}}.
$$

Introduce Pólya-Gamma auxiliary variables (Polson et al., 2013):

$$
w_{i}\mid\gamma,h\sim\operatorname{PG}(T_{m}^{-1},z_{i}^{\top}\gamma),\quad i=1,\ldots,n,
$$

where $\operatorname{PG}(\cdot,\cdot)$ denotes the Pólya-Gamma distribution. Specifically, if $w\sim\operatorname{PG}(b,c)$ with $b>0$ and $c\in\mathbb{R}$, then, using independent Gamma random variables $v_{k}\sim\operatorname{Ga}(b,1)$, $k=1,2,\ldots$, $w$ can be obtained as

$$
w=\frac{1}{2\pi^{2}}\sum_{k=1}^{\infty}\frac{v_{k}}{\left(k-\frac{1}{2}\right)^{2}+\frac{c^{2}}{4\pi^{2}}}.
$$
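The series representation suggests a simple approximate sampler obtained by truncating the sum. The sketch below is one such approximation, not necessarily the sampler used in the paper; the truncation level `n_terms`, the seed, and the parameter values are assumptions. It also uses the known mean of the Pólya-Gamma distribution, $\mathbb{E}[w]=\frac{b}{2c}\tanh(c/2)$, as a sanity check.

```python
import math
import random

def sample_pg_truncated(b, c, rng, n_terms=200):
    """Approximate PG(b, c) draw from the first n_terms of the Gamma series."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g = rng.gammavariate(b, 1.0)  # Gamma with shape b and scale 1
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)

def pg_mean(b, c):
    """Exact mean of PG(b, c): b / (2c) * tanh(c / 2), with limit b / 4 at c = 0."""
    return b / 4.0 if c == 0 else b / (2.0 * c) * math.tanh(c / 2.0)

rng = random.Random(0)
b, c = 1.0 / 1.05, 0.8  # b = 1 / T_m for a hypothetical temperature T_m = 1.05
draws = [sample_pg_truncated(b, c, rng) for _ in range(2000)]
empirical_mean = sum(draws) / len(draws)
```

Truncation biases the draws slightly downward because the omitted terms are positive; exact Pólya-Gamma samplers avoid this and would normally be preferred in production code.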

Let $Z=(z_{1}^{\top},\ldots,z_{n}^{\top})^{\top}$, $W=\operatorname{diag}(w_{1},\ldots,w_{n})$, and

$$
\kappa^{(\gamma)}=\left(\frac{h_{1}-1/2}{T_{m}},~\ldots,~\frac{h_{n}-1/2}{T_{m}}\right)^{\top}.
$$

Then the full conditional distribution of $\gamma$ becomes Gaussian:

$$
\begin{aligned}
\gamma\mid h,w,z&\sim\mathcal{N}\!\left(g_{1}^{(T_{m})},G_{1}^{(T_{m})}\right),\\
G_{1}^{(T_{m})}&=\left(Z^{\top}WZ+G_{0}^{-1}\right)^{-1},\\
g_{1}^{(T_{m})}&=G_{1}^{(T_{m})}\left(Z^{\top}\kappa^{(\gamma)}+G_{0}^{-1}g_{0}\right).
\end{aligned}
$$
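The Gaussian update for $\gamma$ translates directly into a few matrix operations. The sketch below assumes NumPy; the inputs are hypothetical, and in particular the weights `w` are placeholder positive numbers rather than genuine Pólya-Gamma draws. The function name `draw_gamma` is introduced here.

```python
import numpy as np

def draw_gamma(Z, h, w, temp, g0, G0, rng):
    """One draw of gamma from N(g1, G1) given h and Polya-Gamma weights w."""
    kappa = (h - 0.5) / temp                      # kappa^(gamma)
    G0_inv = np.linalg.inv(G0)
    precision = Z.T @ (w[:, None] * Z) + G0_inv   # Z' W Z + G0^{-1}
    G1 = np.linalg.inv(precision)
    g1 = G1 @ (Z.T @ kappa + G0_inv @ g0)
    G1 = 0.5 * (G1 + G1.T)                        # remove round-off asymmetry
    return rng.multivariate_normal(g1, G1), g1, G1

# Hypothetical inputs for illustration only.
rng = np.random.default_rng(0)
n, p = 50, 3
Z = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
h = rng.integers(0, 2, size=n).astype(float)
w = rng.uniform(0.1, 0.3, size=n)  # placeholder weights, not true PG draws
gamma_draw, g1, G1 = draw_gamma(Z, h, w, temp=1.05, g0=np.zeros(p),
                                G0=100.0 * np.eye(p), rng=rng)
```

For larger problems, a Cholesky factorization of the precision matrix would typically replace the explicit inverses; the direct form is kept here to mirror the formulas.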

Step 3. Updating $\beta$.

Let $I_{h}=\{i:h_{i}=1\}$, let $X_{h}$ be the submatrix of $X=(x_{1},\ldots,x_{n})^{\top}\in\mathbb{R}^{n\times d}$ with rows indexed by $I_{h}$, and let $y_{h}$ be the corresponding subvector of $y$. The tempered likelihood of $\beta$ is

$$
\prod_{i\in I_{h}}\frac{\exp\left\{(y_{i}/T_{m})x_{i}^{\top}\beta\right\}}{\{1+\exp(x_{i}^{\top}\beta)\}^{1/T_{m}}}.
$$

Introduce independent Pólya-Gamma variables

$$
\omega_{i}\mid\beta,h,y\sim\operatorname{PG}\!\left(T_{m}^{-1},x_{i}^{\top}\beta\right),\quad i\in I_{h},
$$

and define $\Omega_{h}=\operatorname{diag}(\omega_{i}:i\in I_{h})$ and

$$
\kappa^{(\beta)}=\left(\frac{y_{i}-1/2}{T_{m}}:i\in I_{h}\right)^{\top}.
$$

Then the full conditional distribution of $\beta$ is again Gaussian:

$$
\begin{aligned}
\beta\mid y_{h},X_{h},\Omega_{h},h&\sim\mathcal{N}\!\left(b_{1}^{(T_{m})},B_{1}^{(T_{m})}\right),\\
B_{1}^{(T_{m})}&=\left(X_{h}^{\top}\Omega_{h}X_{h}+B_{0}^{-1}\right)^{-1},\\
b_{1}^{(T_{m})}&=B_{1}^{(T_{m})}\left(X_{h}^{\top}\kappa^{(\beta)}+B_{0}^{-1}b_{0}\right).
\end{aligned}
$$

C.2 Replica Exchange Step

After a fixed number of Gibbs iterations, we attempt to swap the states of two neighboring replicas with adjacent temperatures $T_{m}$ and $T_{m+1}$. Let the current states be denoted by

$$
\begin{aligned}
\theta^{(m)}&=(\beta^{(m)},\gamma^{(m)},h^{(m)}),\\
\theta^{(m+1)}&=(\beta^{(m+1)},\gamma^{(m+1)},h^{(m+1)}).
\end{aligned}
$$

The acceptance probability for the swap proposal is given by

$$
\alpha=\min\left\{1,~\frac{\pi_{T_{m}}(\theta^{(m+1)}\mid y,x,z)~\pi_{T_{m+1}}(\theta^{(m)}\mid y,x,z)}{\pi_{T_{m}}(\theta^{(m)}\mid y,x,z)~\pi_{T_{m+1}}(\theta^{(m+1)}\mid y,x,z)}\right\}. \tag{9}
$$

In practice, we compute the log acceptance ratio as

$$
\log\alpha=\min\left\{0,~\left(\frac{1}{T_{m}}-\frac{1}{T_{m+1}}\right)\left[\log\tilde{L}\left(\theta^{(m+1)}\right)-\log\tilde{L}\left(\theta^{(m)}\right)\right]\right\},
$$

where

$$
\tilde{L}(\theta):=\prod_{i=1}^{n}p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma).
$$
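The log acceptance ratio depends only on the two temperatures and the two complete-data log-likelihood values, because the prior terms and normalizing constants cancel in (9). A minimal sketch follows; the log-likelihood values are hypothetical and the function names are introduced here.

```python
import math
import random

def log_swap_acceptance(temp_m, temp_next, loglik_m, loglik_next):
    """log(alpha) for swapping replicas at temperatures temp_m and temp_next.

    loglik_m and loglik_next are the complete-data log-likelihoods evaluated
    at the current states of the two replicas.
    """
    delta = (1.0 / temp_m - 1.0 / temp_next) * (loglik_next - loglik_m)
    return min(0.0, delta)

def accept_swap(log_alpha, rng):
    """Accept the swap with probability alpha = exp(log_alpha)."""
    return rng.random() < math.exp(log_alpha)

# Hypothetical case 1: the hotter replica currently holds the better state.
log_alpha_up = log_swap_acceptance(1.0, 1.05, loglik_m=-1210.0, loglik_next=-1200.0)
# Hypothetical case 2: the colder replica already holds the better state.
log_alpha_down = log_swap_acceptance(1.0, 1.05, loglik_m=-1200.0, loglik_next=-1210.0)
```

In the first case the swap is always accepted, since it moves the higher-likelihood state to the colder replica; in the second it is accepted only with probability `exp(log_alpha_down)`.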

C.3 Overall Algorithm

The full algorithm alternates between the Gibbs sampling step and the replica exchange step as follows:

  1. For each replica $m$, perform Gibbs sampling to update $h$, $\gamma$, and $\beta$ from the full conditionals above under its corresponding temperature $T_{m}$.

  2. Every fixed number of iterations, attempt to swap the states of neighboring replicas $m$ and $m+1$ according to the acceptance probability given in (9).

  3. Retain samples from the chain with $T_{1}=1$ as draws from the target posterior distribution.
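The three steps above can be sketched as a generic driver loop. This is an illustration under stated assumptions rather than the implementation used for the experiments: the per-replica update and the log-likelihood are passed in as callables, every adjacent pair is proposed in one sweep at each exchange round, and the demonstration uses a toy one-dimensional target whose tempered distributions can be sampled exactly.

```python
import math
import random

def replica_exchange(states, temps, gibbs_step, log_lik, n_iter, swap_every, rng):
    """Alternate per-replica updates with neighbor swaps; return the T_1 chain."""
    states = list(states)
    cold_chain = []
    for it in range(1, n_iter + 1):
        # Step 1: update every replica under its own temperature.
        states = [gibbs_step(s, t, rng) for s, t in zip(states, temps)]
        # Step 2: periodically propose swaps between neighboring replicas.
        if it % swap_every == 0:
            for m in range(len(temps) - 1):
                log_ratio = (1.0 / temps[m] - 1.0 / temps[m + 1]) * (
                    log_lik(states[m + 1]) - log_lik(states[m]))
                if rng.random() < math.exp(min(0.0, log_ratio)):
                    states[m], states[m + 1] = states[m + 1], states[m]
        # Step 3: keep the state of the replica with temperature T_1 = 1.
        cold_chain.append(states[0])
    return cold_chain

# Toy stand-in: likelihood exp(-theta^2 / 2) with a flat prior, so the
# tempered target at temperature t is N(0, t) and can be sampled exactly.
def toy_step(state, temp, rng):
    return rng.gauss(0.0, math.sqrt(temp))

def toy_log_lik(theta):
    return -0.5 * theta * theta

temps = [1.05 ** m for m in range(5)]
chain = replica_exchange([0.0] * 5, temps, toy_step, toy_log_lik,
                         n_iter=4000, swap_every=50, rng=random.Random(0))
```

For the zero-inflated model, `gibbs_step` would carry out Steps 1 to 3 of Appendix C.1 on a state $(\beta,\gamma,h)$, and `log_lik` would return $\log\tilde{L}(\theta)$.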

Appendix D Trace Plots and Posterior Histograms

We provide the trace plots for Scenario 1 as Figure 4, Scenario 2 as Figure 6, and Scenario 3 as Figure 8. Furthermore, we provide the posterior histograms for Scenario 1 as Figure 5, Scenario 2 as Figure 7, and Scenario 3 as Figure 9.

Figure 4: Trace plots of each parameter (Scenario 1). Panels (a) to (j): $\beta_{0}$, $\gamma_{0}$, $\beta_{1}$, $\gamma_{1}$, $\beta_{2}$, $\gamma_{2}$, $\beta_{3}$, $\gamma_{3}$, $\beta_{4}$, $\gamma_{4}$.
Figure 5: The histograms of the posterior distributions for each parameter (Scenario 1). Panels (a) to (j): $\beta_{0}$, $\gamma_{0}$, $\beta_{1}$, $\gamma_{1}$, $\beta_{2}$, $\gamma_{2}$, $\beta_{3}$, $\gamma_{3}$, $\beta_{4}$, $\gamma_{4}$.
Figure 6: Trace plots of each parameter (Scenario 2). Panels (a) to (j): $\beta_{0}$, $\gamma_{0}$, $\beta_{1}$, $\gamma_{1}$, $\beta_{2}$, $\gamma_{2}$, $\beta_{3}$, $\gamma_{3}$, $\beta_{4}$, $\gamma_{4}$.
Figure 7: The histograms of the posterior distributions for each parameter (Scenario 2). Panels (a) to (j): $\beta_{0}$, $\gamma_{0}$, $\beta_{1}$, $\gamma_{1}$, $\beta_{2}$, $\gamma_{2}$, $\beta_{3}$, $\gamma_{3}$, $\beta_{4}$, $\gamma_{4}$.
Figure 8: Trace plots of each parameter (Scenario 3). Panels (a) to (j): $\beta_{0}$, $\gamma_{0}$, $\beta_{1}$, $\gamma_{1}$, $\beta_{2}$, $\gamma_{2}$, $\beta_{3}$, $\gamma_{3}$, $\beta_{4}$, $\gamma_{4}$.
Figure 9: The histograms of the posterior distributions for each parameter (Scenario 3). Panels (a) to (j): $\beta_{0}$, $\gamma_{0}$, $\beta_{1}$, $\gamma_{1}$, $\beta_{2}$, $\gamma_{2}$, $\beta_{3}$, $\gamma_{3}$, $\beta_{4}$, $\gamma_{4}$.

References

  • A. Albert and J. A. Anderson (1984) On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71 (1), pp. 1–10.
  • D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics 8, pp. 1027–1035.
  • J. Bootkrajang and A. Kabán (2012) Label-noise robust logistic regression and its applications. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 143–158.
  • J. Bootkrajang and A. Kabán (2013) Classification of mislabelled microarrays using robust sparse logistic regression. Bioinformatics 29 (7), pp. 870–877.
  • Centers for Disease Control and Prevention (CDC) (2017) National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention.
  • A. Diop, A. Diop, and J.-F. Dupuy (2011) Maximum likelihood estimation in the logistic regression model with a cure fraction. Electronic Journal of Statistics 5, pp. 460–483.
  • S. Frühwirth-Schnatter (2006) Finite mixture and Markov switching models. Springer, New York.
  • H. Fujisawa and S. Eguchi (2008) Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis 99 (9), pp. 2053–2081.
  • D. B. Hall (2000) Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics 56 (4), pp. 1030–1039.
  • H. Hung, Z.-Y. Jou, and S.-Y. Huang (2018) Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics 74 (1), pp. 145–154.
  • O. Komori, S. Eguchi, S. Ikeda, H. Okamura, M. Ichinokawa, and S. Nakayama (2016) An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution 7 (2), pp. 249–260.
  • N. Nagelkerke and V. Fidler (2015) Estimating a logistic discrimination functions when one of the training samples is subject to misclassification: a maximum likelihood approach. PLoS One 10 (10), pp. e0140718.
  • N. G. Polson, J. G. Scott, and J. Windle (2013) Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association 108 (504), pp. 1339–1349.
  • M. J. Silvapulle (1981) On the existence of maximum likelihood estimators for the binomial response models. Journal of the Royal Statistical Society: Series B 43 (3), pp. 310–313.
  • R. H. Swendsen and J.-S. Wang (1986) Replica Monte Carlo simulation of spin-glasses. Physical Review Letters 57 (21), pp. 2607.
  • H. Teicher (1963) Identifiability of finite mixtures. The Annals of Mathematical Statistics 34 (4), pp. 1265–1269.
  • H. Wainer, E. T. Bradlow, and X. Wang (2007) Testlet response theory and its applications. Cambridge University Press.
  • S. J. Yakowitz and J. D. Spragins (1968) On the identifiability of finite mixtures. The Annals of Mathematical Statistics 39 (1), pp. 209–214.