Zero-Inflated Logistic Regression Models with Shared Design: Identifiability, Existence of Estimates, and a Relabeling Rule

Yui Tomo
Department of Epidemiology, National Institute of Infectious Diseases, Japan Institute for Health Security, 1-23-1 Toyama, Shinjuku-ku, Tokyo 162-0052, Japan
E-mail: tomo.y@jihs.go.jp

Shinto Eguchi
The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan

Daisuke Yoneoka
Department of Epidemiology, National Institute of Infectious Diseases, Japan Institute for Health Security, 1-23-1 Toyama, Shinjuku-ku, Tokyo 162-0052, Japan

Abstract

The zero-inflated logistic regression model accommodates binary responses with excess zeros, which often arise from a latent mixture of susceptible and insusceptible subpopulations or asymmetric misclassification of the response. The model has two components: regression for the binary response and a latent binary indicator for the zero-inflation state. In applied settings, it is common to use the same design matrix for both components if there is no prior knowledge. However, this shared-design specification lacks guaranteed identifiability of the regression parameters, as established in prior works. This paper investigates the theoretical properties of the zero-inflated logistic regression model under the shared-design setting and computational methods for applications. First, to motivate the use of the zero-inflated model, we prove that ignoring the zero-inflation mechanism can lead to a sign flip in the pseudo-true coefficient value relative to the true value. We then establish sufficient conditions for the existence of the maximum likelihood estimate. As a main result, we establish that the model under the shared-design setting is identifiable up to exchange symmetry of the parameters for two components and that the expected log-likelihood has a unique maximizer on the resulting quotient space. The posterior bimodality is examined using a Pólya-Gamma Gibbs sampler with replica exchange. Finally, we propose a simple relabeling rule to select a single ordered parameter pair, and evaluate its performance through simulation studies and an application to self-reported diabetes data.

Keywords: Asymmetric misclassification; Data separation; Exchange symmetry; Pólya-Gamma augmentation; Replica exchange.

1 Introduction

The logistic regression model is a standard tool for binary outcomes and remains attractive due to its simplicity and interpretability. However, in many applied settings, the observed response contains more zeros than a standard logistic model can accommodate. For example, such excess zeros may arise from a latent mixture of susceptible and insusceptible subpopulations, such as biological immunity, or one-sided outcome misclassification, such as the failure to record an event due to delayed reporting. To accommodate these situations, a natural extension is a zero-inflated logistic regression model (Hall, 2000; Diop et al., 2011). This model expresses the observed response as a mixture of a standard logistic regression and a structural zero, where the latent binary indicator for the zero-inflation state is itself modeled via a logistic regression. We refer to these two regression components as the ordinary logistic regression component and the structural-zero component, respectively. This formulation captures complex data-generating mechanisms while retaining interpretability.

Earlier work includes the three-parameter logistic model in psychometrics and related methods in ecology and epidemiology (Wainer et al., 2007; Komori et al., 2016; Nagelkerke and Fidler, 2015). Furthermore, robust estimation approaches based on label-noise modeling or robust divergence have also been studied (Bootkrajang and Kabán, 2012, 2013; Hung et al., 2018; Fujisawa and Eguchi, 2008). From the perspective of excess zeros, these settings are closely connected: positive outcomes that are systematically recorded as zeros induce a structural-zero mechanism in the observed data. From the theoretical perspective, Diop et al. (2011) established sufficient conditions for identifiability of the zero-inflated logistic regression model. One such condition requires at least one continuous covariate that appears in one component but not in the other.

Although the zero-inflated logistic model has been explored practically and theoretically, several gaps remain in the understanding of its theoretical properties. In particular, this study focuses on the practically important scenario in which no reliable prior information is available regarding which covariates should enter each component. A common empirical choice is then to use the same design matrix in both components of the model. In this case, the model faces a theoretical difficulty: once the two components share the same covariates, the likelihood becomes invariant under exchange of the two coefficient vectors. The model is therefore not identifiable as an ordered pair, and both optimization and posterior simulation may exhibit symmetric multiple modes. This symmetry is analogous to the label-switching phenomenon in mixture models (Frühwirth-Schnatter, 2006). However, such non-identifiability has yet to be characterized.

Another theoretical issue concerns the existence of the maximum likelihood estimate. For ordinary logistic regression, non-existence of the estimate under separation is well-known (Albert and Anderson, 1984; Silvapulle, 1981). For the zero-inflated logistic model, however, the analogous conditions have received less attention.

Our contributions are fourfold. First, we prove that model misspecification can lead to a sign flip in the pseudo-true parameter relative to the true coefficient. Second, we introduce a double-separation condition for the zero-inflated logistic model and derive sufficient conditions for the existence of the estimates. Third, we establish that the model under a shared design matrix is identifiable up to exchange symmetry and that, under mild regularity conditions, the expected log-likelihood has a unique maximizer on the resulting quotient space. Furthermore, we investigate the resulting multimodality numerically through a Pólya-Gamma Gibbs sampler integrated with replica exchange (Polson et al., 2013; Swendsen and Wang, 1986). Fourth, we propose a relabeling rule based on an estimate from the ordinary logistic regression model and conduct a numerical study.

The remainder of the paper is structured as follows. Section 2 introduces the model, studies the sign-flip phenomenon under misspecification, and presents sufficient conditions for the existence and non-existence of the estimates. Section 3 studies identifiability under the shared design setting, formalizes the exchange-symmetry structure, and provides a numerical illustration of the resulting posterior bimodality. Section 4 proposes a relabeling rule and conducts a simulation study. Section 5 presents an illustrative application to NHANES self-reported diabetes data. Section 6 discusses limitations and future work. Section 7 concludes the paper.

2 Zero-inflated Logistic Regression Model

In this section, we define the zero-inflated logistic regression model and study two basic aspects of the model before turning to the shared-design case. First, we examine the consequence of misspecification when zero inflation is ignored and a standard logistic regression model is fitted instead. Second, we consider the existence of maximum likelihood estimates and introduce separation-type conditions that provide sufficient criteria for non-existence and existence.

2.1 Model Definition

Let $y_{i}\in\{0,1\}$ ( $i=1,\ldots,n$ ) denote the binary responses for $n$ observations, and let $x_{i}=\left(1,x_{i,1},\ldots,x_{i,d-1}\right)^{\top}\in\mathbb{R}^{d}$ denote the covariate vector. Let $\beta=\left(\beta_{0},\ldots,\beta_{d-1}\right)^{\top}\in\mathbb{R}^{d}$ be the corresponding parameter vector. We assume that a latent variable $h_{i}\in\{0,1\}$ determines whether the response is constrained to be zero, where $h_{i}=0$ indicates the zero state. Let $z_{i}=\left(1,z_{i,1},\ldots,z_{i,p-1}\right)^{\top}\in\mathbb{R}^{p}$ denote the covariate vector for $h_{i}$ , and let $\gamma=\left(\gamma_{0},\ldots,\gamma_{p-1}\right)^{\top}\in\mathbb{R}^{p}$ denote the corresponding parameter vector. Let $F(\cdot)$ be the inverse logit function:

\displaystyle F(\mu):={\exp(\mu)}/{\left\{1+\exp(\mu)\right\}},\quad\text{for}\quad\mu\in\mathbb{R}.

Then, the zero-inflated logistic regression model is defined as

\displaystyle p(y_{i}\mid x_{i},z_{i},\beta,\gamma)=q(h_{i}=1\mid z_{i},\gamma)\cdot p(y_{i}\mid x_{i},\beta)+q(h_{i}=0\mid z_{i},\gamma)\cdot\mathrm{I}(y_{i}=0),

where $p(y_{i}=1\mid x_{i},\beta)=F(\beta^{\top}x_{i})$ and $q(h_{i}=1\mid z_{i},\gamma)=F(\gamma^{\top}z_{i})$ . We refer to the logistic regression components $p(y_{i}\mid x_{i},\beta)=F(\beta^{\top}x_{i})^{y_{i}}(1-F(\beta^{\top}x_{i}))^{1-y_{i}}$ and $q(h_{i}\mid z_{i},\gamma)=F(\gamma^{\top}z_{i})^{h_{i}}(1-F(\gamma^{\top}z_{i}))^{1-h_{i}}$ as the ordinary logistic regression component and the structural-zero component, respectively. In this context, $q(h_{i}=1\mid z_{i},\gamma)$ represents the probability that the $i$ -th observation is not a structural zero, corresponding to the probability of susceptibility to the event. Then, we have

\displaystyle p(y_{i}\mid x_{i},z_{i},\beta,\gamma)=\left\{F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})\right\}^{y_{i}}\left\{1-F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})\right\}^{1-y_{i}}.

Therefore, the log-likelihood function is defined as

\displaystyle L(\beta,\gamma)=\sum_{i=1}^{n}\left[y_{i}\log\left\{F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})\right\}+(1-y_{i})\log\left\{1-F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})\right\}\right].
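As an illustration of the definition above, the log-likelihood can be evaluated on the log scale so that the product $F(\gamma^{\top}z_{i})F(\beta^{\top}x_{i})$ is never formed directly. The following is a minimal sketch using only the Python standard library; the function names `log_sigmoid`, `dot`, and `zilr_loglik` are our own illustrative choices, and the code assumes linear predictors of moderate size (it is not the implementation used in the paper).

```python
import math

def log_sigmoid(u):
    """Numerically stable log F(u), where F is the inverse logit."""
    # log F(u) = -log(1 + exp(-u)); branch on the sign of u to avoid overflow.
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def zilr_loglik(beta, gamma, xs, zs, ys):
    """Log-likelihood L(beta, gamma) of the zero-inflated logistic model."""
    total = 0.0
    for x, z, y in zip(xs, zs, ys):
        # log P(y = 1) = log F(gamma'z) + log F(beta'x)
        log_p1 = log_sigmoid(dot(gamma, z)) + log_sigmoid(dot(beta, x))
        if y == 1:
            total += log_p1
        else:
            # log P(y = 0) = log(1 - exp(log_p1)), computed via expm1
            total += math.log(-math.expm1(log_p1))
    return total
```

Working with $\log F$ rather than $F$ is a design choice for numerical stability; for extreme linear predictors, additional safeguards would be needed.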

2.2 Sign-Flip Phenomenon Under Misspecification

To motivate the use of the zero-inflated logistic regression model, we examine the consequences of ignoring zero-inflation and fitting a standard logistic regression model to data generated from the zero-inflated model. When the standard logistic regression model is applied to data with excess zeros, the resulting estimator can be severely biased. In particular, when a covariate is positively associated with the response $y_{i}$ but negatively associated with the latent indicator $h_{i}$ (or vice versa), the fitted logistic regression model may estimate a regression coefficient whose sign is opposite to the underlying true value. In this subsection, we formalize this phenomenon under a random-design setting.

Let $x\in\mathbb{R}^{d}$ and $z\in\mathbb{R}^{p}$ denote covariate vectors in the ordinary logistic regression component and the structural-zero component, respectively, and suppose that they share a common covariate indexed by $j$ :

\displaystyle x=(1,\tilde{x}_{j},\tilde{x}_{-j}^{\top})^{\top},\qquad z=(1,\tilde{z}_{j},\tilde{z}_{-j}^{\top})^{\top},

where $\tilde{x}_{-j}\in\mathbb{R}^{d-2}$ , $\tilde{z}_{-j}\in\mathbb{R}^{p-2}$ , and the $j$ th covariate is shared between the two vectors: $\tilde{x}_{j}=\tilde{z}_{j}$ . The corresponding regression coefficients are decomposed as

\displaystyle\beta=(\beta_{0},\beta_{j},\beta_{-j}^{\top})^{\top},\qquad\gamma=(\gamma_{0},\gamma_{j},\gamma_{-j}^{\top})^{\top}.

Without loss of generality, we focus on a covariate that has a positive effect on $y$ but a negative effect on $h$ , and let

\displaystyle\beta_{j}=:a>0,\qquad\gamma_{j}=:c<0.

Under the zero-inflated logistic regression model, the conditional event occurrence probability can be written as

\displaystyle\pi_{c}(x,z):=F(\beta_{0}+a\tilde{x}_{j}+\beta_{-j}^{\top}\tilde{x}_{-j})F(\gamma_{0}+c\tilde{x}_{j}+\gamma_{-j}^{\top}\tilde{z}_{-j}).

We then consider the ordinary logistic regression model where the conditional event occurrence probability is specified as $F(\theta_{0}+t\tilde{x}_{j}+\theta_{-j}^{\top}\tilde{x}_{-j})$ . Here, $(\theta_{0},t,\theta_{-j})\in\mathbb{R}\times\mathbb{R}\times\mathbb{R}^{d-2}$ is the regression coefficient vector. Under the true zero-inflated model, the expected log-likelihood function of this misspecified logistic regression is given by

\displaystyle\mathcal{L}_{c}(\theta_{0},t,\theta_{-j}):=\mathbb{E}_{(x,z)}\left[\pi_{c}(x,z)\,\log F(\theta_{0}+t\tilde{x}_{j}+\theta_{-j}^{\top}\tilde{x}_{-j})+\{1-\pi_{c}(x,z)\}\,\log\{1-F(\theta_{0}+t\tilde{x}_{j}+\theta_{-j}^{\top}\tilde{x}_{-j})\}\right],

where the expectation is taken with respect to the joint distribution of $(x,z)$ .

We then define the profile objective function

\displaystyle g_{c}(t):=\sup_{(\theta_{0},\theta_{-j})\in\mathbb{R}\times\mathbb{R}^{d-2}}\mathcal{L}_{c}(\theta_{0},t,\theta_{-j}),

and let $t^{*}(c)\in\arg\max_{t\in\mathbb{R}}g_{c}(t)$ . The following theorem shows that, when the magnitude of the negative association in the zero-inflation component is sufficiently large, every pseudo-true value of the $j$ th coefficient is negative.

Theorem 2.1.

Suppose that Assumption A.1 in Appendix A.1 holds. Then there exists a constant $C_{0}<0$ such that

\displaystyle c\leq C_{0}\quad\Longrightarrow\quad\arg\max_{t\in\mathbb{R}}g_{c}(t)\subset(-\infty,0).

See Appendix A.1 for the proof. This theorem implies that, although the true coefficient satisfies $\beta_{j}=a>0$ , every pseudo-true value of the same coefficient under the misspecified standard logistic regression model is negative when $\gamma_{j}=c$ is negative and $|c|$ is sufficiently large. Specifically, any choice $t^{*}(c)\in\arg\max_{t\in\mathbb{R}}g_{c}(t)$ satisfies $t^{*}(c)<0$ .
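The sign flip can be illustrated numerically in a toy setting. The sketch below assumes a single shared covariate that is uniformly distributed on a five-point grid, zero intercepts, $a=1$ , and the arbitrary illustrative values $c=0$ and $c=-3$ ; none of these choices come from the paper, and the example does not verify Assumption A.1. It maximizes the expected log-likelihood of the misspecified logistic model over $(\theta_{0},t)$ by plain gradient ascent, which is sufficient here because the objective is concave.

```python
import math

def F(u):
    """Inverse logit function."""
    return 1.0 / (1.0 + math.exp(-u))

def pseudo_true_slope(a, c, xs, beta0=0.0, gamma0=0.0, step=1.0, iters=5000):
    """Gradient ascent on the expected log-likelihood of the misspecified
    logistic model, with the covariate uniform on the grid `xs`."""
    # True event probability pi_c(x) = F(beta0 + a x) F(gamma0 + c x).
    pis = [F(beta0 + a * x) * F(gamma0 + c * x) for x in xs]
    theta0, t = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        # Gradient: E[(pi_c(x) - F(theta0 + t x)) * (1, x)].
        resid = [p - F(theta0 + t * x) for p, x in zip(pis, xs)]
        theta0 += step * sum(resid) / n
        t += step * sum(r * x for r, x in zip(resid, xs)) / n
    return theta0, t

grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
_, t_flat = pseudo_true_slope(a=1.0, c=0.0, xs=grid)    # no opposing effect
_, t_flip = pseudo_true_slope(a=1.0, c=-3.0, xs=grid)   # strong opposing effect
print(t_flat, t_flip)
```

In this toy configuration the fitted slope is positive when $c=0$ and negative when $c=-3$ , even though the true coefficient $a$ is positive in both cases.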

2.3 Maximum Likelihood Estimation

The maximum likelihood estimator (MLE) $(\hat{\beta}^{\top},\hat{\gamma}^{\top})^{\top}\in\mathbb{R}^{d+p}$ is defined as any maximizer of $L(\beta,\gamma)$ . However, due to the structure of the zero-inflated logistic model, the existence of an estimate is not always guaranteed. To investigate this issue, we introduce the concept of double separation, an analogue of the separation condition for the standard logistic regression.

Definition 2.1 (Double separation).

The dataset $\{(y_{i},x_{i},z_{i})\}_{i=1}^{n}$ is said to satisfy double separation if there exist non-zero vectors ${v}\in\mathbb{R}^{d}$ and ${w}\in\mathbb{R}^{p}$ such that for every $i=1,\dots,n$ ,

\displaystyle\begin{cases}v^{\top}x_{i}\;\geq 0,\quad w^{\top}z_{i}\;\geq 0&\text{if }y_{i}=1,\\ v^{\top}x_{i}\;\leq 0,\quad w^{\top}z_{i}\;\leq 0&\text{if }y_{i}=0,\end{cases}

and at least one of the inequalities is strict for at least one observation: there exists $j\in\{1,\dots,n\}$ such that $(v^{\top}x_{j},w^{\top}z_{j})\neq(0,0)$ .

Proposition 2.2.

If the dataset satisfies double separation, then the log-likelihood $L(\beta,\gamma)$ has no maximizer in $\mathbb{R}^{d}\times\mathbb{R}^{p}$ .

See Appendix A.2 for the proof.
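To make Definition 2.1 concrete, the following sketch checks whether a given candidate pair $(v,w)$ certifies double separation on a dataset. It only verifies a supplied direction; searching for such a direction (for example, as a linear feasibility problem) is not attempted here. The function name `certifies_double_separation` and the toy data are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def certifies_double_separation(v, w, xs, zs, ys):
    """Return True if (v, w) satisfies Definition 2.1 on the given data."""
    if not any(v) or not any(w):
        return False  # both vectors must be non-zero
    strict = False
    for x, z, y in zip(xs, zs, ys):
        a, b = dot(v, x), dot(w, z)
        if y == 1 and (a < 0 or b < 0):
            return False  # an event lies on the wrong side
        if y == 0 and (a > 0 or b > 0):
            return False  # a zero lies on the wrong side
        if (a, b) != (0, 0):
            strict = True  # at least one observation is off both hyperplanes
    return strict

# Toy shared-design data: the single covariate perfectly orders the responses.
xs = [[1, -2], [1, -1], [1, 1], [1, 2]]
separated = certifies_double_separation([0, 1], [0, 1], xs, xs, [0, 0, 1, 1])
overlapping = certifies_double_separation([0, 1], [0, 1], xs, xs, [0, 1, 0, 1])
print(separated, overlapping)
```

A `False` result for one direction does not rule out double separation, since another pair $(v,w)$ might still satisfy the definition.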

Furthermore, we introduce the following condition to guarantee the existence of a maximum likelihood estimate.

Definition 2.2 ( $\varepsilon$ –Double–non-separation).

The dataset $\{(y_{i},x_{i},z_{i})\}_{i=1}^{n}$ is said to satisfy $\varepsilon$ –double–non-separation if there exists a constant $\varepsilon>0$ such that

\displaystyle\inf_{(v,w):\|v\|^{2}+\|w\|^{2}=1}\max\left\{-\min_{i\in\{i:y_{i}=1\}}\{v^{\top}x_{i}+w^{\top}z_{i}\},~\max_{i\in\{i:y_{i}=0\}}\min\{v^{\top}x_{i},~w^{\top}z_{i}\}\right\}\geq\varepsilon.

We then establish the following proposition, which provides a sufficient condition for the existence of an estimate.

Proposition 2.3.

If the dataset satisfies $\varepsilon$ –double–non-separation for some $\varepsilon>0$ , then a maximizer of $L(\beta,\gamma)$ exists.

See Appendix A.3 for the proof.

The $\varepsilon$ –double–non-separation condition implies that for any unit direction $(v,w)$ , there exists at least one observation located at a signed distance of at least $\varepsilon$ from the separating hyperplane associated with that direction. In other words, the dataset maintains a uniform margin of width $\varepsilon$ that prevents double separation.
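The quantity inside the infimum of Definition 2.2 can be evaluated for any single unit direction, and a random search over directions then gives an upper bound on the infimum. The sketch below is only a diagnostic under these assumptions: a non-positive value shows that the condition fails for every $\varepsilon>0$ , whereas a positive value is not a certificate that it holds. The names `margin`, `random_unit_direction`, and `margin_upper_bound`, the number of draws, and the seed are illustrative choices.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def margin(v, w, xs, zs, ys):
    """Inner objective of Definition 2.2 at one direction (v, w).
    Assumes the data contain at least one y = 1 and one y = 0."""
    pos = [dot(v, x) + dot(w, z) for x, z, y in zip(xs, zs, ys) if y == 1]
    neg = [min(dot(v, x), dot(w, z)) for x, z, y in zip(xs, zs, ys) if y == 0]
    return max(-min(pos), max(neg))

def random_unit_direction(d, p, rng):
    """Draw (v, w) uniformly from the sphere ||v||^2 + ||w||^2 = 1."""
    u = [rng.gauss(0.0, 1.0) for _ in range(d + p)]
    norm = math.sqrt(sum(ui * ui for ui in u))
    u = [ui / norm for ui in u]
    return u[:d], u[d:]

def margin_upper_bound(xs, zs, ys, n_draws=2000, seed=0):
    """Smallest margin found by random search (upper bound on the infimum)."""
    rng = random.Random(seed)
    d, p = len(xs[0]), len(zs[0])
    return min(margin(*random_unit_direction(d, p, rng), xs, zs, ys)
               for _ in range(n_draws))

# Toy separated data: the direction below yields a negative margin.
xs = [[1, -2], [1, -1], [1, 1], [1, 2]]
ys = [0, 0, 1, 1]
s = 1.0 / math.sqrt(2.0)
separated_margin = margin([0.0, s], [0.0, s], xs, xs, ys)
upper_bound = margin_upper_bound(xs, xs, ys)
print(separated_margin, upper_bound)
```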

Even when $\varepsilon$ –double–non-separation holds with $\varepsilon$ close to zero, the surface of the log-likelihood function may be nearly flat along certain directions, and numerical optimization may be unstable in practice. In such settings, penalized estimation approaches may be useful.

The results in this section parallel a well-known phenomenon in ordinary logistic regression, where data separation leads to the non-existence of the maximum likelihood estimate (Albert and Anderson, 1984; Silvapulle, 1981).

3 Shared-Design Model

We now focus on the case $p=d$ and $z_{i}=x_{i}$ for all $i=1,\ldots,n$ . This is the configuration that arises when the analyst has no prior information with which to distinguish covariates for the ordinary logistic regression component $x_{i}$ from covariates for the structural-zero component $z_{i}$ and therefore uses the same design matrix in both components of the model. In that case, the conditional probability of $y_{i}$ is expressed as

\displaystyle p(y_{i}\mid x_{i},\beta,\gamma)=\left\{F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})\right\}^{y_{i}}\left\{1-F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})\right\}^{1-y_{i}},

for $i=1,\ldots,n$ . We refer to this model as a shared-design model. The log-likelihood function is

\displaystyle L(\beta,\gamma)=\sum_{i=1}^{n}\left[y_{i}\log\left\{F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})\right\}+(1-y_{i})\log\left\{1-F(\gamma^{\top}x_{i})F(\beta^{\top}x_{i})\right\}\right],

which satisfies the exchange symmetry

\displaystyle L(\beta,\gamma)=L(\gamma,\beta).

Motivated by this symmetry, we define the equivalence relation

\displaystyle(\beta_{1},\gamma_{1})\sim(\beta_{2},\gamma_{2})\quad\Longleftrightarrow\quad(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})\quad\text{or}\quad(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2}),

and write $[\beta,\gamma]$ for the corresponding equivalence class.
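The exchange symmetry and the equivalence class can be checked directly on a small example. The sketch below uses made-up data and parameter values; `shared_loglik` and `canonical_class` are hypothetical helper names. The lexicographic ordering in `canonical_class` is only a bookkeeping device for comparing classes and carries no statistical meaning.

```python
import math

def F(u):
    """Inverse logit function."""
    return 1.0 / (1.0 + math.exp(-u))

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def shared_loglik(beta, gamma, xs, ys):
    """Log-likelihood of the shared-design model (z_i = x_i)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p1 = F(dot(gamma, x)) * F(dot(beta, x))  # P(y = 1 | x)
        total += math.log(p1) if y == 1 else math.log(1.0 - p1)
    return total

def canonical_class(beta, gamma):
    """Order-free representation of the equivalence class [beta, gamma]."""
    return tuple(sorted((tuple(beta), tuple(gamma))))

xs = [[1.0, -1.2], [1.0, 0.3], [1.0, 0.8], [1.0, 2.1]]
ys = [0, 1, 0, 1]
beta, gamma = [0.5, 1.0], [1.7, -1.0]
l_original = shared_loglik(beta, gamma, xs, ys)
l_swapped = shared_loglik(gamma, beta, xs, ys)
print(l_original, l_swapped)
```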

3.1 Prior Work on Identifiability

Diop et al. (2011) established sufficient conditions for identifiability of the zero-inflated logistic regression model. A key condition in their argument is the availability of a continuous covariate that appears in one component but not the other. When the same covariates are used in both components, this source of asymmetry disappears. Therefore, the shared-design setting is a boundary case in which the ordinary guarantee of identifiability fails.

3.2 Identifiability under Shared Design

We now formalize identifiability of the shared-design model. Let $x=(1,\tilde{x}_{-0}^{\top})^{\top}$ where $\tilde{x}_{-0}\in\mathbb{R}^{d-1}$ . We first consider the following support condition.

(C1) The support of $\tilde{x}_{-0}$ contains a nonempty open subset $\mathcal{U}\subset\mathbb{R}^{d-1}$ .

Under condition (C1), we obtain the following basic identifiability result.

Proposition 3.1.

Suppose that condition (C1) holds. Let $\beta_{1}=(\beta_{1,0},\beta_{1,-0}^{\top})^{\top}$ and $\gamma_{1}=(\gamma_{1,0},\gamma_{1,-0}^{\top})^{\top}$ , and suppose that $\beta_{1,-0}$ , $\gamma_{1,-0}$ , $\beta_{1,-0}+\gamma_{1,-0}$ are pairwise distinct. Suppose that for two parameter pairs $(\beta_{1},\gamma_{1})$ and $(\beta_{2},\gamma_{2})$ ,

\displaystyle p\left(y\mid x,\beta_{1},\gamma_{1}\right)=p\left(y\mid x,\beta_{2},\gamma_{2}\right),\qquad(1)

holds for all $y\in\{0,1\}$ and all $x=(1,\tilde{x}_{-0}^{\top})^{\top}$ for $\tilde{x}_{-0}\in\mathcal{U}$ . Then, we have either $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$ or $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$ .

We next extend Proposition 3.1 to a mixed support for continuous and discrete covariates. Let $\tilde{x}_{-0}=(\tilde{x}_{-0}^{(1)\top},\tilde{x}_{-0}^{(2)\top})^{\top}$ , where $\tilde{x}_{-0}^{(1)}\in\mathbb{R}^{r}$ and $\tilde{x}_{-0}^{(2)}\in\mathbb{R}^{s}$ with $r\geq 1$ , $s\geq 0$ , and $r+s=d-1$ . Here, $\tilde{x}_{-0}^{(1)}$ denotes the subvector of continuously distributed covariates, and $\tilde{x}_{-0}^{(2)}$ denotes the subvector of discretely distributed covariates. We consider the following support condition.

(C2) There exists a subset $\mathcal{S}\subset\operatorname{supp}(\tilde{x}_{-0}^{(2)})$ such that $\operatorname{aff}(\mathcal{S})=\mathbb{R}^{s}$ , and, for each $\xi\in\mathcal{S}$ , the conditional support of $\tilde{x}_{-0}^{(1)}$ given $\tilde{x}_{-0}^{(2)}=\xi$ contains a nonempty open subset $\mathcal{U}_{\xi}\subset\mathbb{R}^{r}$ .

When $s=0$ , condition (C2) reduces to condition (C1). Let $\beta_{j,-0}=(\beta_{j,-0}^{(1)\top},\beta_{j,-0}^{(2)\top})^{\top}$ and $\gamma_{j,-0}=(\gamma_{j,-0}^{(1)\top},\gamma_{j,-0}^{(2)\top})^{\top}$ for $j=1,2$ , corresponding to the decomposition of $\tilde{x}_{-0}$ into $\tilde{x}_{-0}^{(1)}$ and $\tilde{x}_{-0}^{(2)}$ . Under condition (C2), we obtain the following extended result.

Theorem 3.2.

Suppose that condition (C2) holds. Let $\beta_{1}=(\beta_{1,0},\beta_{1,-0}^{\top})^{\top}$ and $\gamma_{1}=(\gamma_{1,0},\gamma_{1,-0}^{\top})^{\top}$ , and suppose that $\beta_{1,-0}^{(1)}$ , $\gamma_{1,-0}^{(1)}$ , $\beta_{1,-0}^{(1)}+\gamma_{1,-0}^{(1)}$ are pairwise distinct. Suppose that for two parameter pairs $(\beta_{1},\gamma_{1})$ and $(\beta_{2},\gamma_{2})$ , (1) holds for all $y\in\{0,1\}$ and all $x=(1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}$ for $\xi\in\mathcal{S}$ and $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$ . Then, we have either $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$ or $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$ .

We now establish the following identifiability result.

Corollary 3.3.

Suppose that condition (C2) holds, and suppose that the shared-design model is correctly specified with true parameter pair $(\beta^{\ast},~\gamma^{\ast})$ . Let $\beta^{\ast}=(\beta_{0}^{\ast},~\beta_{-0}^{\ast\top})^{\top}$ and $\gamma^{\ast}=(\gamma_{0}^{\ast},~\gamma_{-0}^{\ast\top})^{\top}$ , where $\beta_{-0}^{\ast}=(\beta_{-0}^{\ast(1)\top},\beta_{-0}^{\ast(2)\top})^{\top}$ and $\gamma_{-0}^{\ast}=(\gamma_{-0}^{\ast(1)\top},\gamma_{-0}^{\ast(2)\top})^{\top}$ , and suppose that $\beta_{-0}^{\ast(1)}$ , $\gamma_{-0}^{\ast(1)}$ , $\beta_{-0}^{\ast(1)}+\gamma_{-0}^{\ast(1)}$ are pairwise distinct. Then the expected log-likelihood

\displaystyle\mathcal{L}(\beta,~\gamma):=\mathbb{E}_{(x,y)}\left[y\log\left\{F(x^{\top}\beta)F(x^{\top}\gamma)\right\}+(1-y)\log\left\{1-F(x^{\top}\beta)F(x^{\top}\gamma)\right\}\right],

is uniquely maximized on $(\mathbb{R}^{d}\times\mathbb{R}^{d})/{\sim}$ at the class $[\beta^{\ast},~\gamma^{\ast}]$ .

See Appendix A.4 for the proofs of Proposition 3.1, Theorem 3.2, and Corollary 3.3. Proposition 3.1 gives the basic identifiability result under a fully continuous support, while Theorem 3.2 extends it to a mixed support for continuous and discrete covariates. Corollary 3.3 clarifies the inferential target under the shared-design setting. At the population level, the equivalence class $[\beta,\gamma]$ is uniquely identified, yet the model cannot distinguish which component corresponds to the event occurrence process and which to the structural-zero process. These results establish identifiability over the support of $x$ and do not imply the uniqueness of the likelihood maximizer in finite samples. Consequently, resolving this label ambiguity requires either external information or some relabeling rule.

These results also relate the shared-design model to the identifiability of finite mixture models (Teicher, 1963; Yakowitz and Spragins, 1968). In this context, identifiability is formulated as the uniqueness of the finite-mixture representation. Therefore, under an ordered-component parameterization, this corresponds to uniqueness up to permutation of component labels. In the present model, the relevant symmetry is not a literal permutation of mixture components but the exchange symmetry of the pair of parameter vectors $(\beta,\gamma)$ for both components. In this sense, $[\beta,\gamma]$ is the natural inferential target.

The condition that $\beta_{-0}^{\ast(1)}$ , $\gamma_{-0}^{\ast(1)}$ , and $\beta_{-0}^{\ast(1)}+\gamma_{-0}^{\ast(1)}$ are pairwise distinct, which is inherited from Theorem 3.2, is a condition on the true parameter vector. It fails only when $\beta_{-0}^{\ast(1)}=\gamma_{-0}^{\ast(1)}$ , $\beta_{-0}^{\ast(1)}=0$ , or $\gamma_{-0}^{\ast(1)}=0$ , so it holds generically and is a mild requirement in practice.

3.3 Numerical Confirmation of Bimodality

To illustrate the exchange symmetry established in Theorem 3.2 and Corollary 3.3, we examined the posterior distribution of $(\beta,\gamma)$ under the shared-design setting via a Markov chain Monte Carlo (MCMC) algorithm. We set the number of non-intercept covariates to $4$ ( $d=p=5$ ), and considered three covariate designs: Scenario 1: all $4$ covariates were drawn from independent standard normal distributions; Scenario 2: all $4$ covariates were drawn from independent $\text{Bernoulli}(0.5)$ distributions; Scenario 3: the first two non-intercept covariates were drawn from independent standard normal distributions, and the remaining two were drawn from independent $\text{Bernoulli}(0.5)$ distributions. The sample size was $2,000$ . The data-generating mechanism and sampling setup are described in Appendix B. We developed a Pólya-Gamma Gibbs sampler with replica exchange (Polson et al., 2013; Swendsen and Wang, 1986), which is detailed in Appendix C.

Figure 1 displays the principal component analysis (PCA) plots of the posterior samples after $k$ -means++ clustering with $k=2$ (Arthur and Vassilvitskii, 2007). Under the continuous and mixed designs (Scenarios 1 and 3), the posterior exhibited clear bimodality, with two well-separated clusters. Under the binary design (Scenario 2), in which all non-intercept covariates take values in the bounded discrete set $\{0,1\}$ , the two clusters were not clearly distinguished in the PCA plots. This result is consistent with the failure of the binary design to satisfy condition (C2), so that identifiability is not guaranteed by Corollary 3.3. Posterior means of each cluster and further numerical details are provided in Appendices B and D.

Figure 1: PCA plots of the posterior samples. Panel (a): Scenario 1 (continuous).

4 Relabeling Rule

Section 3 shows that, under the shared-design setting, the likelihood identifies only the equivalence class $[\beta,\gamma]$ . In many applications, however, investigators still need a single ordered pair because subsequent interpretation is conducted in terms of the associations with event occurrence or with zero inflation, such as zeros due to misclassification. For that practical purpose, we introduce a simple relabeling rule.

4.1 Proposed Rule

The proposed rule proceeds as follows.

1. Fit a standard logistic regression of $y$ on $x$ and denote the resulting coefficient vector by $\hat{\beta}_{\mathrm{LR}}$ .
2. Fit the zero-inflated logistic regression model with shared design and obtain one ordered maximizer $(\hat{\beta},\hat{\gamma})$ .
3. Form the exchange-symmetric solution $(\hat{\gamma},\hat{\beta})$ .

Choose the pair whose first component vector is closer to $\hat{\beta}_{\mathrm{LR}}$ ; that is, choose

\displaystyle(\hat{\beta}^{\dagger},\hat{\gamma}^{\dagger}):=\operatorname*{arg\,min}_{(\beta,\gamma)\in\{(\hat{\beta},\hat{\gamma}),(\hat{\gamma},\hat{\beta})\}}\|\beta-\hat{\beta}_{\mathrm{LR}}\|_{2}.
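The selection step can be sketched in a few lines. The function below assumes that the two fits have already been carried out and that the three coefficient vectors are available as plain sequences; the name `relabel` and the tie-breaking choice (keep the original order when the two distances are equal) are our own assumptions, as the rule above does not specify how ties are resolved.

```python
import math

def relabel(beta_hat, gamma_hat, beta_lr):
    """Return the ordered pair whose first vector is closer (in Euclidean
    distance) to the standard logistic regression estimate beta_lr."""
    d_keep = math.dist(beta_hat, beta_lr)   # distance if the order is kept
    d_swap = math.dist(gamma_hat, beta_lr)  # distance if the pair is exchanged
    if d_swap < d_keep:
        return gamma_hat, beta_hat
    return beta_hat, gamma_hat  # ties keep the original order (assumption)

# Hypothetical estimates: the optimizer returned the exchanged solution.
beta_dagger, gamma_dagger = relabel([3.0, -1.0], [0.4, 0.9], [0.3, 0.7])
print(beta_dagger, gamma_dagger)
```

Because only two candidates are compared, the rule adds negligible cost to the two model fits.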

Note that, by Theorem 2.1, the pseudo-true value targeted by the ordinary logistic regression estimator can differ from the true coefficient, even in sign, under misspecification. Therefore, while $\hat{\beta}_{\mathrm{LR}}$ serves as a convenient reference, it should not be interpreted as a definitive benchmark for the ordinary logistic regression component of the zero-inflated model. Whenever prior knowledge is available to distinguish the two components, that information should take precedence over this rule.

4.2 Simulation Study

We next investigate the behavior of the proposed relabeling rule through a simulation study.

Simulation Design.

We considered four scenarios that differ only in the intercept of the structural-zero component, hence, in the probability of zero inflation. The coefficient vector for the ordinary logistic regression component was fixed at

\displaystyle{\beta}^{*}=(0.5,~1.0,~0.5,~0.5,~0.25)^{\top},

and the structural-zero coefficient was

\displaystyle{\gamma}^{*}=(\gamma_{0,\mathrm{int}},~-1.0,~-1.0,~0.5,~0.5)^{\top}.

We used four values of $\gamma_{0,\mathrm{int}}$ . Specifically, we considered: (i) Very Low Mislabel ( $\gamma_{0,\text{int}}=4.3$ ), yielding approximately $3.8\%$ structural zeros; (ii) Low Mislabel ( $\gamma_{0,\text{int}}=3.0$ ), yielding approximately $10.3\%$ structural zeros; (iii) Moderate Mislabel ( $\gamma_{0,\text{int}}=1.7$ ), yielding approximately $23.4\%$ structural zeros; and (iv) High Mislabel ( $\gamma_{0,\text{int}}=1.0$ ), yielding approximately $33.4\%$ structural zeros. These scenarios were chosen to span the range of zero-inflation levels commonly encountered in applications, from nearly negligible to substantial proportions of structural zeros.

For each scenario, we generated $n=1,000$ observations with a covariate vector of dimension $d=5$ , including the intercept. The first element of $x_{i}$ was $1$ , and the remaining elements were drawn independently from the standard normal distribution. Responses were generated from the zero-inflated logistic regression model as follows:

\displaystyle h_{i}\sim\operatorname{Bernoulli}\left(F({\gamma}^{*\top}x_{i})\right),\quad y_{i}^{\ast}\sim\operatorname{Bernoulli}\left(F({\beta}^{*\top}x_{i})\right),\quad y_{i}=h_{i}y_{i}^{\ast}.
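The data-generating mechanism above can be sketched as follows, using the coefficient vectors of the Moderate Mislabel scenario. This is a minimal illustration with the Python standard library, not the simulation code used for the reported results; the function name `simulate_zilr` and the seed are arbitrary choices.

```python
import math
import random

def F(u):
    """Inverse logit function."""
    return 1.0 / (1.0 + math.exp(-u))

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def simulate_zilr(n, beta, gamma, rng):
    """Draw n observations from the shared-design zero-inflated model."""
    d = len(beta)
    xs, hs, ys = [], [], []
    for _ in range(n):
        # Intercept plus d - 1 independent standard normal covariates.
        x = [1.0] + [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
        h = 1 if rng.random() < F(dot(gamma, x)) else 0       # h = 0: structural zero
        y_star = 1 if rng.random() < F(dot(beta, x)) else 0   # latent response
        xs.append(x)
        hs.append(h)
        ys.append(h * y_star)  # observed response y = h * y_star
    return xs, hs, ys

beta_true = [0.5, 1.0, 0.5, 0.5, 0.25]
gamma_true = [1.7, -1.0, -1.0, 0.5, 0.5]  # Moderate Mislabel intercept
xs, hs, ys = simulate_zilr(1000, beta_true, gamma_true, random.Random(0))
structural_zero_rate = 1.0 - sum(hs) / len(hs)
print(structural_zero_rate)
```

The latent indicator `hs` is returned only so that the proportion of structural zeros can be inspected; it is not observed in practice.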

We compared three estimation approaches: (i) the proposed approach: the zero-inflated model with the relabeling rule proposed in Section 4.1; (ii) the standard logistic regression approach: the ordinary logistic regression model ignoring zero inflation; and (iii) the naive zero-inflated model approach: the zero-inflated model without relabeling, retaining the first local maximizer returned by the optimization algorithm.

All methods used the same optimizer and settings: the L-BFGS-B algorithm with analytical gradients, a maximum of $1,000$ iterations, and random initialization from $\mathcal{N}(0,0.01I)$ . An estimate was classified as unreasonable if at least one component exceeded ten times the absolute value of its true value. Each scenario was replicated $10,000$ times.

Results.

Tables 1–3 summarize the results. Figures 2 and 3 show boxplots of the distributions of the observed bias. Several patterns emerged. First, for the parameter $\beta$ , the bias of the estimates from the proposed approach remained more concentrated around zero than that of the estimates from the naive approach, whereas the estimates from the standard logistic regression approach exhibited a systematic negative shift that became larger as the zero-inflation proportion increased. Second, the proposed approach was more stable than the naive approach. Across the low-to-moderate zero-inflation scenarios, the biases of the relabeled estimates were smaller than those of the standard logistic regression, while their standard deviations remained moderate. For the parameter $\gamma$ , the proposed approach also dominated the naive approach in most scenarios, especially for the parameters other than the intercept. A weakness appeared when structural zeros were very rare, probably because the intercept of the structural-zero component was weakly identified and highly variable. Third, the naive approach behaved as expected for a non-identifiable ordered parameterization: its estimates were drawn from both symmetric solutions. The concentration of the relabeled estimates around the true parameter values, in contrast to the spread of the naive estimates, suggests that the proposed relabeling rule was effective at selecting an appropriate representative from each equivalence class.

Table 3 shows that the proposed relabeling rule improved practical reliability. The proposed method produced reasonable estimates in more than $99\%$ of replications in all but the Very Low Mislabel scenario and consistently outperformed the naive zero-inflated model approach in terms of reasonable solutions.

Table 1: Simulation results based on $10{,}000$ replications: estimates of the parameters for the ordinary logistic regression component ( $\beta$ ). Entries are Bias (SD).

Scenario	Method	$\beta_{0}$	$\beta_{1}$	$\beta_{2}$	$\beta_{3}$	$\beta_{4}$
Very Low Mislabel	Proposed	0.102 (0.234)	-0.007 (0.231)	-0.003 (0.190)	-0.004 (0.114)	-0.002 (0.115)
	Standard LR	-0.170 (0.071)	-0.217 (0.083)	-0.166 (0.076)	0.005 (0.076)	0.032 (0.072)
	Naive ZILR	1.100 (1.449)	-0.521 (0.861)	-0.395 (0.674)	-0.004 (0.243)	0.064 (0.253)
Low Mislabel	Proposed	0.055 (0.253)	-0.019 (0.274)	-0.013 (0.218)	-0.001 (0.109)	0.001 (0.109)
	Standard LR	-0.409 (0.068)	-0.455 (0.075)	-0.346 (0.069)	0.011 (0.073)	0.064 (0.069)
	Naive ZILR	1.175 (1.384)	-0.882 (1.059)	-0.667 (0.810)	0.010 (0.184)	0.119 (0.231)
Moderate Mislabel	Proposed	-0.005 (0.289)	-0.267 (0.592)	-0.207 (0.457)	0.014 (0.121)	0.043 (0.142)
	Standard LR	-0.815 (0.068)	-0.724 (0.069)	-0.559 (0.068)	0.032 (0.072)	0.115 (0.070)
	Naive ZILR	0.751 (0.953)	-1.015 (1.127)	-0.767 (0.862)	0.010 (0.167)	0.135 (0.225)
High Mislabel	Proposed	-0.195 (0.284)	-0.648 (0.780)	-0.506 (0.597)	0.035 (0.132)	0.105 (0.162)
	Standard LR	-1.121 (0.070)	-0.860 (0.068)	-0.670 (0.069)	0.050 (0.074)	0.145 (0.072)
	Naive ZILR	0.416 (0.802)	-1.021 (1.146)	-0.771 (0.867)	0.007 (0.181)	0.133 (0.225)

Table 2: Simulation results based on $10{,}000$ replications: estimates of the parameters for the structural-zero component ( $\gamma$ ). Entries are Bias (SD).

Scenario	Method	$\gamma_{0}$	$\gamma_{1}$	$\gamma_{2}$	$\gamma_{3}$	$\gamma_{4}$
Very Low Mislabel	Proposed	1.144 (3.760)	-0.102 (1.478)	-0.140 (1.297)	0.092 (0.674)	0.086 (0.692)
Very Low Mislabel	Naive ZILR	-0.579 (3.822)	0.677 (1.581)	0.471 (1.324)	0.060 (0.553)	-0.042 (0.579)
Low Mislabel	Proposed	0.548 (1.899)	-0.118 (0.883)	-0.109 (0.733)	0.047 (0.339)	0.042 (0.348)
Low Mislabel	Naive ZILR	-0.802 (2.053)	0.852 (1.262)	0.632 (0.977)	0.028 (0.270)	-0.094 (0.309)
Moderate Mislabel	Proposed	0.345 (0.957)	0.223 (1.070)	0.160 (0.834)	0.007 (0.215)	-0.018 (0.268)
Moderate Mislabel	Naive ZILR	-0.432 (1.057)	0.981 (1.150)	0.728 (0.882)	0.011 (0.176)	-0.113 (0.228)
High Mislabel	Proposed	0.553 (0.789)	0.629 (1.350)	0.484 (1.030)	-0.021 (0.219)	-0.092 (0.278)
High Mislabel	Naive ZILR	-0.071 (0.859)	1.005 (1.163)	0.752 (0.882)	0.006 (0.181)	-0.120 (0.228)

Table 3: Convergence diagnostics across simulation scenarios based on $10{,}000$ replications. "Converged" indicates successful optimization with reasonable parameter estimates, "Unreasonable" denotes cases where at least one parameter estimate exceeded ten times its true value, and "Ratio" shows the percentage of successful convergence.

Scenario	Method	Converged	Unreasonable	Total	Ratio (%)
Very Low Mislabel	Proposed	9,156	844	10,000	91.6
	Standard LR	10,000	0	10,000	100.0
	Naive ZILR	7,249	2,751	10,000	72.5
Low Mislabel	Proposed	9,936	64	10,000	99.4
	Standard LR	10,000	0	10,000	100.0
	Naive ZILR	9,232	768	10,000	92.3
Moderate Mislabel	Proposed	9,985	15	10,000	99.9
	Standard LR	10,000	0	10,000	100.0
	Naive ZILR	9,936	64	10,000	99.4
High Mislabel	Proposed	9,985	15	10,000	99.9
	Standard LR	10,000	0	10,000	100.0
	Naive ZILR	9,955	43	10,000	99.6

5 Application to Actual Data

We illustrate the performance of the zero-inflated logistic model with shared design through an application to public data. We note that the data analysis in this section is intended only as a methodological illustration and not for clinical interpretation.

5.1 Dataset

We used the National Health and Nutrition Examination Survey (NHANES), the 2017–2018 public release (Centers for Disease Control and Prevention (CDC), 2017). The outcome $y_{i}$ for individual ID (SEQN) $i$ was self-reported diabetes status, constructed from the diabetes questionnaire as $y_{i}=1$ for respondents who reported a prior diabetes diagnosis ( $\text{DIQ010}=1$ ) and $y_{i}=0$ otherwise ( $\text{DIQ010}=2$ ). As covariates, we used insurance coverage (HIQ011), usual source of care (HUQ030), age (RIDAGEYR), body mass index: bmi (BMXBMI), and sex (RIAGENDR). To motivate a zero-inflated model, we compared self-reported status with an HbA1c-based variable. Specifically, we defined $d_{\mathrm{A1c},i}=1$ when HbA1c was at least $6.5\%$ ( $\text{LBXGH}\geq 6.5$ ) and $d_{\mathrm{A1c},i}=0$ otherwise ( $\text{LBXGH}<6.5$ ). We used samples with non-zero 2-year sample weights $({\text{WTMEC2YR}>0})$ . Table 4 summarizes the sample size of the data. Among respondents with $d_{\mathrm{A1c},i}=1$ and non-missing self-report, the proportion with $y_{i}=0$ was about $21.4\%$ . Although HbA1c is not a gold standard for diagnosis, these descriptive proportions suggest that self-reported diabetes can contain a non-negligible proportion of undiagnosed cases.

Table 4: Sample size summaries of NHANES data used to illustrate zero-inflated logistic regression. The last column reports the proportion of self-reported non-cases ( $y_{i}=0$ ) among respondents with HbA1c $\geq 6.5\%$ ( $d_{\mathrm{A1c},i}=1$ ) and observed self-reported diabetes status.

Period	Interviewed participants	sample weights $>0$	HbA1c was observed	Self report was observed	Ratio of $y_{i}=0$ given $d_{\mathrm{A1c},i}=1$
2017–2018	9,254	8,704	6,045	8,709	0.214

5.2 Model Settings

We specified a shared-design model. Specifically, the covariates of the ordinary logistic regression component and the structural-zero component were set equal, with covariates

\displaystyle(1,~\text{insured},~\text{usualcare},~\text{age},~\text{bmi},~\text{female}),

where age and bmi were standardized. The model was estimated on the complete-case subset with $711$ samples. Because the aim of this section is methodological illustration, we did not incorporate survey weights.

5.3 Results

We obtained one ordered solution using the BFGS algorithm from a random initial value and then obtained the second by exchanging the two coefficient vectors. Table 5 shows the two solutions, A and B, denoted as $(\hat{\beta}_{\mathrm{A}},\hat{\gamma}_{\mathrm{A}})$ and $(\hat{\beta}_{\mathrm{B}},\hat{\gamma}_{\mathrm{B}})$ , respectively. The resulting negative log-likelihood values were identical up to numerical precision: both evaluated to $1918.023$ . Moreover, the exchange symmetry held up to numerical precision: $\max_{j\in\{1,\ldots,6\}}|\hat{\beta}_{\mathrm{A},j}-\hat{\gamma}_{\mathrm{B},j}|=2.98\times 10^{-7}$ and $\max_{j\in\{1,\ldots,6\}}|\hat{\gamma}_{\mathrm{A},j}-\hat{\beta}_{\mathrm{B},j}|=7.40\times 10^{-7}$ , where $\hat{\beta}_{\mathrm{A},j}$ denotes the $j$ th element of $\hat{\beta}_{\mathrm{A}}$ .

We then applied the relabeling rule from Section 4. Solution B was selected because $\|\hat{\beta}_{\mathrm{B}}-\hat{\beta}_{\mathrm{LR}}\|_{2}^{2}=1.102$ , whereas $\|\hat{\beta}_{\mathrm{A}}-\hat{\beta}_{\mathrm{LR}}\|_{2}^{2}=29.771$ , where $\hat{\beta}_{\mathrm{LR}}$ denotes the estimate from the ordinary logistic regression, shown in Table 5.

Table 5: Two solutions $(\hat{\beta}_{\mathrm{A}},\hat{\gamma}_{\mathrm{A}})$ and $(\hat{\beta}_{\mathrm{B}},\hat{\gamma}_{\mathrm{B}})$ , and the estimate from the ordinary logistic regression $\hat{\beta}_{\mathrm{LR}}$ for the NHANES illustration.

Term	$\hat{\beta}_{\mathrm{A}}$	$\hat{\gamma}_{\mathrm{A}}$	$\hat{\beta}_{\mathrm{B}}$	$\hat{\gamma}_{\mathrm{B}}$	$\hat{\beta}_{\mathrm{LR}}$
Intercept	1.250	-3.192	-3.192	1.250	-3.444
insured	-0.413	0.138	0.138	-0.413	-0.111
usualcare	1.085	0.179	0.179	1.085	0.627
age	-1.251	2.300	2.300	-1.251	1.474
bmi	0.647	0.371	0.371	0.647	0.590
female	-0.500	-0.223	-0.223	-0.500	-0.431
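The selection between the two solutions can be reproduced from the rounded entries of Table 5. The sketch below assumes that the relabeling rule amounts to choosing the ordering whose first vector has the smaller squared Euclidean distance to $\hat{\beta}_{\mathrm{LR}}$, which is how the rule is used in this section; the function names and the tie-breaking convention are hypothetical. Because the inputs are rounded to three decimals, the computed distances agree with the reported values only approximately.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two coefficient vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def relabel(first, second, beta_lr):
    """Return (beta, gamma) so that beta is the vector closer to beta_lr.

    Ties are resolved in favor of the input order (an arbitrary choice).
    """
    if squared_distance(first, beta_lr) <= squared_distance(second, beta_lr):
        return first, second
    return second, first


# Rounded values copied from Table 5 (solution A; solution B is its swap).
beta_a = [1.250, -0.413, 1.085, -1.251, 0.647, -0.500]
gamma_a = [-3.192, 0.138, 0.179, 2.300, 0.371, -0.223]
beta_lr = [-3.444, -0.111, 0.627, 1.474, 0.590, -0.431]

dist_a = squared_distance(beta_a, beta_lr)   # distance for solution A
dist_b = squared_distance(gamma_a, beta_lr)  # distance for solution B
beta_hat, gamma_hat = relabel(beta_a, gamma_a, beta_lr)
```

Under these assumptions, `beta_hat` coincides with $\hat{\beta}_{\mathrm{B}}$, in line with the selection reported above.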

6 Discussion

This study investigates the zero-inflated logistic regression model with shared design, in terms of a sign-flip phenomenon under misspecification, the existence of maximum likelihood estimates, identifiability of the regression parameters, computational methods for implementation, and a practical relabeling rule. The primary theoretical message is that non-identifiability in the shared-design setting is not unstructured. Under mild regularity conditions, the non-identifiability is reduced to the exchange symmetry of the two coefficient vectors. By considering the quotient space with respect to this symmetry, the expected log-likelihood has a unique maximizer. This result is useful for understanding the inherent inferential limits of the model.

A second contribution is the analysis of the existence of maximum likelihood estimates. The concepts of double separation and $\varepsilon$ –double–non-separation introduced in this study extend the classical separation conditions for the ordinary logistic regression model. While these conditions do not provide a complete characterization, they offer tractable sufficient conditions for both the existence and non-existence of estimates. In particular, the results on non-existence explain why optimization algorithms may fail to converge even before considering the exchange symmetry of the regression parameters. Furthermore, as shown in Theorem 2.1, model misspecification can lead to a sign flip in the regression coefficients relative to their true values. This provides a formal warning against analyses that ignore zero-inflation structures.

Numerical results based on posterior sampling support the theoretical findings regarding identifiability. Bimodality was clearly observed in posterior distributions under the continuous and mixed designs. However, the binary design did not exhibit clear mode separation and showed increased numerical instability, probably because the design fails to satisfy condition (C2) and thereby lacks guaranteed identifiability. These results provide a practical guideline: analysts should exercise caution when interpreting estimates if all covariate values are restricted to a small number of support points in the covariate space. Regarding the sampling algorithm, because standard single-chain Gibbs samplers may become trapped in one of the modes, we employed the replica exchange method for efficient exploration of the parameter space.

The relabeling rule proposed in Section 4 serves as a heuristic for interpretation rather than a new source of identification. Its role is to provide a reproducible ordering rule when an ordered pair is required for subsequent interpretation or comparison with the results from the ordinary logistic regression. If external information is available to distinguish the ordinary logistic regression component from the structural-zero component, such information should take precedence over the proposed rule.

This study has several limitations. First, our theoretical results provide sufficient conditions for the existence and non-existence of the maximum likelihood estimate, rather than a complete characterization. Second, regarding the relabeling rule, it remains to be investigated whether alternative rules can improve performance when the reference ordinary logistic regression itself suffers from severe bias. Third, as the asymptotic theory for the model on the quotient space remains to be established, we do not perform formal statistical inference. These limitations suggest several directions for future research: a more refined characterization of the existence of the estimates, the formalization of asymptotic theory for parameters defined on the quotient space, and the development of relabeling rules that provide a reliable choice even when the reference logistic regression suffers from severe bias.

A further direction is to examine whether the theoretical results extend to other link functions for binary regression, such as the probit or the complementary log-log function. Theorem 3.2 exploits the specific logistic form $F(\mu)=\exp(\mu)/\{1+\exp(\mu)\}$ , which reduces the identifiability condition to an equality of sums of exponential terms with distinct exponents. While such a representation is unavailable for other general link functions, the principle that the equality $F(\beta_{1}^{\top}x)F(\gamma_{1}^{\top}x)=F(\beta_{2}^{\top}x)F(\gamma_{2}^{\top}x)$ over an open set restricts the parameter pairs to an exchange-symmetric set may hold for a broader class of functions. Specifically, for real analytic link functions, an exchange-symmetry result may be established via the identity theorem, provided that the analytic form of the product function ensures that local equality implies global equivalence. Furthermore, because the results on the existence of the maximum likelihood estimate and the sign flip phenomenon under misspecification depend primarily on the monotonicity and boundedness of $F(\cdot)$ and the structure of the log-likelihood function, these properties are expected to hold across other link functions, probably with appropriate modifications to reflect specific characteristics of link functions, such as the asymmetric behavior of the complementary log-log function.

7 Concluding Remarks

When the same covariates are used for both components of the zero-inflated logistic regression model, the identifiability results established in the existing literature do not hold. We establish that this non-identifiability has a specific structure. Namely, under mild regularity conditions, it reduces to exchange symmetry, and the expected log-likelihood has a unique maximizer on the resulting quotient space. In addition, we introduce sufficient conditions for the existence and non-existence of the maximum likelihood estimate, demonstrate posterior bimodality through numerical experiments, and propose a simple relabeling rule for applications. We also establish a sign flip phenomenon under misspecification. These theoretical and numerical results provide practical guidance for applying the zero-inflated logistic regression model.

Funding

This research was supported by AMED under Grant Number JP223fa627001 (UTOPIA AI Research Discovery Program) and JSPS KAKENHI Grant Number 26K02664.

Data Availability

The NHANES data used in the application are available from the website of the Centers for Disease Control and Prevention: https://wwwn.cdc.gov/nchs/nhanes/.

Code Availability

The Python scripts for numerical studies and the R script for actual data application in this manuscript are available from the GitHub repository: https://github.com/t-yui/zero-inflated-logistic-shared-design.

Declaration of the Use of Generative AI and AI-assisted Technologies

The authors used ChatGPT (OpenAI), Claude (Anthropic) and Gemini (Google) to assist with developing scripts for simulation and application, and editing the English language during the preparation of this manuscript. The authors checked and edited the content and take full responsibility for this manuscript.

Appendix A Proofs

A.1 Proof of Theorem 2.1

We state the following regularity conditions.

Assumption A.1.

We assume that

(A1)

$\mathbb{E}[\|x\|^{2}]<\infty$ .
(A2)

$\mathbb{P}(\tilde{x}_{j}>0)>0$ and $\mathbb{P}(\tilde{x}_{j}<0)>0$ .
(A3)

$\mathbb{E}[\tilde{x}_{j}\mid\tilde{x}_{-j},\tilde{z}_{-j}]=0$ almost surely.
(A4)

For the fixed value of $c<0$ and every $t\in\mathbb{R}$ , the map $(\theta_{0},\theta_{-j})\mapsto\mathcal{L}_{c}(\theta_{0},t,\theta_{-j})$ has a unique maximizer, and the profile objective function $g_{c}$ has at least one maximizer on $\mathbb{R}$ .

For the fixed value of $c<0$ and each $t\in\mathbb{R}$ , let $(\theta_{0}^{*}(t),\theta_{-j}^{*}(t))$ denote the maximizer in the definition of $g_{c}(t)$ , and let

	$\displaystyle\eta(t,x)$	$\displaystyle:=\theta_{0}^{*}(t)+t\tilde{x}_{j}+\theta_{-j}^{*}(t)^{\top}\tilde{x}_{-j},$
	$\displaystyle\mu(t,x)$	$\displaystyle:=F(\eta(t,x)).$

We first establish basic properties of the profile objective function $g_{c}$ .

Lemma A.1.

Suppose that Assumption A.1 holds. Then, we have

(i)

The derivative of $g_{c}$ is $g_{c}^{\prime}(t)=\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(t,x))\,\tilde{x}_{j}\right]$ .
(ii)

The function $g_{c}$ is concave with respect to $t$ .

Proof of Lemma A.1.

We begin with (i). For fixed $(x,z)$ , let

\displaystyle\ell(\eta):=\pi_{c}(x,z)\log F(\eta)+\{1-\pi_{c}(x,z)\}\log\{1-F(\eta)\}.

Using $F^{\prime}(\eta)=F(\eta)\{1-F(\eta)\}$ , we have

\displaystyle\frac{\partial}{\partial\eta}\ell(\eta)=\pi_{c}(x,z)-F(\eta).

Since $\partial\eta/\partial t=\tilde{x}_{j}$ , we obtain

\displaystyle\frac{\partial}{\partial t}\mathcal{L}_{c}(\theta_{0},t,\theta_{-j})=\mathbb{E}_{(x,z)}\left[\frac{\partial\ell(\eta)}{\partial\eta}\,\frac{\partial\eta}{\partial t}\right]=\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-F(\eta))\,\tilde{x}_{j}\right].

Moreover, since $0\leq\pi_{c}(x,z)\leq 1$ and $0\leq F(\eta)\leq 1$ , we have

\displaystyle\left|\frac{\partial}{\partial t}\ell(\eta)\right|\leq|\tilde{x}_{j}|.

By Assumption A.1(A1), we have $\mathbb{E}[|\tilde{x}_{j}|]\leq\{\mathbb{E}[\tilde{x}_{j}^{2}]\}^{1/2}<\infty$ . Therefore, by the uniqueness of the maximizer from Assumption A.1(A4), using Danskin’s theorem, we obtain

	$\displaystyle g_{c}^{\prime}(t)$	$\displaystyle=\frac{\partial}{\partial t}\mathcal{L}_{c}(\theta_{0},t,\theta_{-j})\Big|_{(\theta_{0},\theta_{-j})=(\theta^{*}_{0}(t),\theta^{*}_{-j}(t))}$
		$\displaystyle=\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(t,x))\tilde{x}_{j}\right].$

We then prove (ii). Fix $t_{1},~t_{2}\in\mathbb{R}$ and $\lambda\in[0,1]$ . Let $(\theta^{*}_{0,r},\theta^{*}_{-j,r})\in\mathbb{R}\times\mathbb{R}^{d-2}$ be a maximizer of $\mathcal{L}_{c}(\cdot,t_{r},\cdot)$ for $r=1,~2$ . Using the concavity of $\mathcal{L}_{c}$ , we have

	$\displaystyle g_{c}(\lambda t_{1}+(1-\lambda)t_{2})$
	$\displaystyle\quad=\sup_{(\theta_{0},\theta_{-j})\in\mathbb{R}\times\mathbb{R}^{d-2}}\mathcal{L}_{c}(\theta_{0},\lambda t_{1}+(1-\lambda)t_{2},\theta_{-j})$
	$\displaystyle\quad\geq\mathcal{L}_{c}(\lambda\theta^{*}_{0,1}+(1-\lambda)\theta^{*}_{0,2},~\lambda t_{1}+(1-\lambda)t_{2},~\lambda\theta^{*}_{-j,1}+(1-\lambda)\theta^{*}_{-j,2})$
	$\displaystyle\quad\geq\lambda\mathcal{L}_{c}(\theta^{*}_{0,1},~t_{1},~\theta^{*}_{-j,1})+(1-\lambda)\mathcal{L}_{c}(\theta^{*}_{0,2},~t_{2},~\theta^{*}_{-j,2})$
	$\displaystyle\quad=\lambda g_{c}(t_{1})+(1-\lambda)g_{c}(t_{2}).$

Thus $g_{c}(t)$ is concave. ∎

The next lemma simplifies the derivative at $t=0$ .

Lemma A.2.

Suppose that Assumption A.1 holds. Then, we have

\displaystyle\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(0,x))\tilde{x}_{j}\right]=\mathbb{E}_{(x,z)}\left[\pi_{c}(x,z)\tilde{x}_{j}\right].

Proof of Lemma A.2.

At $t=0$ , we have $\mu(0,x)=F(\theta^{*}_{0}(0)+\theta^{*}_{-j}(0)^{\top}\tilde{x}_{-j})=:\tilde{\mu}(\tilde{x}_{-j})$ , which depends on $x$ only through $\tilde{x}_{-j}$ . Thus, using (A3),

\displaystyle\mathbb{E}_{(x,z)}[\mu(0,x)\tilde{x}_{j}]=\mathbb{E}_{(\tilde{x}_{-j},\tilde{z}_{-j})}\left[\tilde{\mu}(\tilde{x}_{-j})\mathbb{E}[\tilde{x}_{j}\mid\tilde{x}_{-j},\tilde{z}_{-j}]\right]=0.

Therefore, we have

\displaystyle\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(0,x))\tilde{x}_{j}\right]=\mathbb{E}_{(x,z)}\left[\pi_{c}(x,z)\tilde{x}_{j}\right].

∎

Finally, we investigate the behavior of $f(c):=\mathbb{E}_{(x,z)}[\pi_{c}(x,z)\tilde{x}_{j}]$ .

Lemma A.3.

Suppose that Assumption A.1 holds. Then, we have

(i)

For every $c<0$ , $f^{\prime}(c)=\mathbb{E}_{(x,z)}[F(\beta^{\top}x)~F^{\prime}(\gamma^{\top}z)\tilde{x}_{j}^{2}]>0.$
(ii)

$\lim_{c\to-\infty}f(c)=\mathbb{E}_{(x,z)}[F(\beta^{\top}x)~\tilde{x}_{j}~\mathbf{1}\{\tilde{x}_{j}<0\}]<0.$

Proof of Lemma A.3.

We first prove (i). Since $\gamma^{\top}z=\gamma_{0}+c\tilde{x}_{j}+\gamma_{-j}^{\top}\tilde{z}_{-j}$ , we have $\partial(\gamma^{\top}z)/\partial c=\tilde{x}_{j}$ , and hence

\displaystyle\frac{\partial}{\partial c}\pi_{c}(x,z)=F(\beta^{\top}x)F^{\prime}(\gamma^{\top}z)\tilde{x}_{j}.

We then obtain

\displaystyle f^{\prime}(c)=\mathbb{E}_{(x,z)}\left[\frac{\partial}{\partial c}\{\pi_{c}(x,z)\tilde{x}_{j}\}\right]=\mathbb{E}_{(x,z)}\left[F(\beta^{\top}x)F^{\prime}(\gamma^{\top}z)\tilde{x}_{j}^{2}\right].

Since $F^{\prime}(\cdot)>0$ and $\tilde{x}_{j}^{2}\geq 0$ with $\mathbb{P}(\tilde{x}_{j}\neq 0)>0$ by (A2), the expectation is strictly positive and thus we have $f^{\prime}(c)>0$ .

We next prove (ii). As $c\to-\infty$ , we have

\displaystyle\gamma^{\top}z\to\begin{cases}+\infty,&\text{if}~\tilde{x}_{j}<0,\\ -\infty,&\text{if}~\tilde{x}_{j}>0,\\ \text{finite},&\text{if}~\tilde{x}_{j}=0,\end{cases}

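and hence $F(\gamma^{\top}z)\tilde{x}_{j}\to\tilde{x}_{j}\mathbf{1}\{\tilde{x}_{j}<0\}$ pointwise, where the case $\tilde{x}_{j}=0$ contributes zero. Since $|\pi_{c}(x,z)\tilde{x}_{j}|\leq|\tilde{x}_{j}|$ is integrable by (A1), the dominated convergence theorem gives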

\displaystyle\lim_{c\to-\infty}f(c)=\mathbb{E}_{(x,z)}\left[F(\beta^{\top}x)\tilde{x}_{j}\mathbf{1}\{\tilde{x}_{j}<0\}\right].

For the event $\{\tilde{x}_{j}<0\}$ , the integrand is strictly negative because $F(\beta^{\top}x)>0$ and $\tilde{x}_{j}<0$ . Therefore, by (A2), we obtain $\lim_{c\to-\infty}f(c)<0$ . ∎

We now prove Theorem 2.1.

Proof of Theorem 2.1.

By Lemma A.1(i) and Lemma A.2, the derivative of $g_{c}$ at $t=0$ is

\displaystyle g_{c}^{\prime}(0)=\mathbb{E}_{(x,z)}\left[(\pi_{c}(x,z)-\mu(0,x))\tilde{x}_{j}\right]=\mathbb{E}_{(x,z)}\left[\pi_{c}(x,z)\tilde{x}_{j}\right]=f(c).

By Lemma A.3(i), $f$ is strictly increasing in $c$ , and by Lemma A.3(ii), $\lim_{c\to-\infty}f(c)<0$ . Hence there exists a constant $C_{0}<0$ such that

\displaystyle c\leq C_{0}\quad\Longrightarrow\quad f(c)=g_{c}^{\prime}(0)<0.

Fix $c\leq C_{0}$ . By Lemma A.1(ii), $g_{c}$ is concave. Therefore, for every $t>0$ , we have $g_{c}(t)\leq g_{c}(0)+tg_{c}^{\prime}(0)<g_{c}(0)$ , and $t=0$ itself is not a maximizer because $g_{c}^{\prime}(0)\neq 0$ . This implies that no maximizer of $g_{c}$ can lie in $[0,\infty)$ . Since a maximizer exists by Assumption A.1(A4), we have $\arg\max_{t\in\mathbb{R}}g_{c}(t)\subset(-\infty,0)$ . ∎
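The mechanism of the proof can be illustrated numerically. The following sketch considers a single covariate $\tilde{x}_{j}$ placed uniformly on a symmetric grid (so that the centering condition holds trivially), a true slope $\beta_{j}=1>0$, and a strongly negative coefficient $c=-5$ in the structural-zero component. It then computes the pseudo-true slope of the misspecified ordinary logistic regression by Newton's method on the population objective. All numerical values, the grid, and the function names are illustrative assumptions and are not taken from the simulation settings of the paper.

```python
import math


def sigmoid(u):
    """Numerically stable logistic function F(u)."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def pseudo_true_logistic(xs, probs, n_iter=50):
    """Maximize E[p log F(eta) + (1 - p) log(1 - F(eta))], eta = a + t * x.

    Each point of xs carries equal probability mass; Newton's method is
    applied to the two-parameter concave objective.
    """
    a, t = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for x, p in zip(xs, probs):
            mu = sigmoid(a + t * x)
            w = mu * (1.0 - mu)
            s0 += (p - mu) / n          # score for the intercept
            s1 += (p - mu) * x / n      # score for the slope
            h00 += w / n                # entries of the negative Hessian
            h01 += w * x / n
            h11 += w * x * x / n
        det = h00 * h11 - h01 * h01
        a += (h11 * s0 - h01 * s1) / det
        t += (h00 * s1 - h01 * s0) / det
    return a, t


def response_prob(x, beta, gamma):
    """pi_c(x) = F(beta0 + beta1 * x) * F(gamma0 + gamma1 * x)."""
    return sigmoid(beta[0] + beta[1] * x) * sigmoid(gamma[0] + gamma[1] * x)


xs = [-3.0 + 0.1 * k for k in range(61)]   # symmetric grid around zero
beta = (0.0, 1.0)                          # true slope is positive

probs_flip = [response_prob(x, beta, (0.0, -5.0)) for x in xs]  # c = -5
probs_flat = [response_prob(x, beta, (0.0, 0.0)) for x in xs]   # c = 0

f_c = sum(p * x for p, x in zip(probs_flip, xs)) / len(xs)  # g_c'(0) = f(c)
_, slope_flip = pseudo_true_logistic(xs, probs_flip)
_, slope_flat = pseudo_true_logistic(xs, probs_flat)
```

In this toy configuration, `f_c` and `slope_flip` are negative although the true slope is positive, whereas `slope_flat` (no covariate effect in the structural-zero component) keeps the positive sign.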

A.2 Proof of Proposition 2.2

We prove Proposition 2.2 as follows.

Proof of Proposition 2.2.

Assume that the data satisfy double separation, so there exist non-zero vectors $v\in\mathbb{R}^{d}$ and $w\in\mathbb{R}^{p}$ such that

	$\displaystyle y_{i}=1$	$\displaystyle\Longrightarrow v^{\top}x_{i}\geq 0,\quad w^{\top}z_{i}\geq 0,$
	$\displaystyle y_{i}=0$	$\displaystyle\Longrightarrow v^{\top}x_{i}\leq 0,\quad w^{\top}z_{i}\leq 0,$

and at least one observation satisfies a strict inequality in the sense of the definition of double separation.

Fix $(\beta,\gamma)\in\mathbb{R}^{d}\times\mathbb{R}^{p}$ , and define $\ell_{\beta,\gamma}(t):=L(\beta+tv,\gamma+tw)$ for $t\geq 0$ . For each $i$ , let

	$\displaystyle a_{i}(t)$	$\displaystyle=F\left(\gamma^{\top}z_{i}+tw^{\top}z_{i}\right),$
	$\displaystyle b_{i}(t)$	$\displaystyle=F\left(\beta^{\top}x_{i}+tv^{\top}x_{i}\right),$
	$\displaystyle g_{i}(t)$	$\displaystyle=a_{i}(t)b_{i}(t).$

Because $F^{\prime}(u)=F(u)\{1-F(u)\}$ , we have

\displaystyle g_{i}^{\prime}(t)=g_{i}(t)\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right).

Therefore, we have

	$\displaystyle\ell_{\beta,\gamma}^{\prime}(t)$	$\displaystyle=\sum_{i=1}^{n}\left\{\frac{y_{i}}{g_{i}(t)}-\frac{1-y_{i}}{1-g_{i}(t)}\right\}g_{i}^{\prime}(t)$
		$\displaystyle=\sum_{i=1}^{n}\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right).$

If $y_{i}=1$ , then

\displaystyle\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}=\frac{1-g_{i}(t)}{1-g_{i}(t)}=1,

and both $w^{\top}z_{i}\geq 0$ and $v^{\top}x_{i}\geq 0$ . Hence, we obtain

\displaystyle\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right)\geq 0.

If $y_{i}=0$ , then

\displaystyle\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}=-\frac{g_{i}(t)}{1-g_{i}(t)}<0,

while both $w^{\top}z_{i}\leq 0$ and $v^{\top}x_{i}\leq 0$ . Hence, we again obtain

\displaystyle\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right)\geq 0.

Moreover, because at least one observation satisfies a strict inequality, we have

\displaystyle\frac{y_{i}-g_{i}(t)}{1-g_{i}(t)}\left((1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}\right)>0,

for $t\geq 0$ , for some $i$ . Indeed, if $y_{i}=1$ and either $w^{\top}z_{i}>0$ or $v^{\top}x_{i}>0$ , then

\displaystyle(1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}>0,

because $0<a_{i}(t),b_{i}(t)<1$ . Similarly, if $y_{i}=0$ and either $w^{\top}z_{i}<0$ or $v^{\top}x_{i}<0$ , then

\displaystyle(1-a_{i}(t))w^{\top}z_{i}+(1-b_{i}(t))v^{\top}x_{i}<0.

Therefore,

\displaystyle\ell_{\beta,\gamma}^{\prime}(t)>0\quad\text{for all}~t\geq 0.

Since $(\beta,\gamma)$ was arbitrary, every finite parameter point can be improved by moving a small positive amount along the direction $(v,w)$ . Hence no finite point can be a maximizer of $L(\beta,\gamma)$ ; that is, the log-likelihood has no maximizer in $\mathbb{R}^{d}\times\mathbb{R}^{p}$ . ∎
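The monotonicity established in the proof can be checked on a small example. The sketch below uses a hypothetical four-point data set with a shared design that satisfies the inequalities of double separation for $v=w=(0,1)^{\top}$, and evaluates the log-likelihood along the ray $(\beta+tv,\gamma+tw)$ from an arbitrary starting point. The data, the starting point, and the helper names are assumptions made for illustration.

```python
import math


def sigmoid(u):
    """Numerically stable logistic function F(u)."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loglik(beta, gamma, X, y):
    """Log-likelihood of the shared-design zero-inflated logistic model."""
    total = 0.0
    for x, yi in zip(X, y):
        g = sigmoid(dot(beta, x)) * sigmoid(dot(gamma, x))
        total += math.log(g) if yi == 1 else math.log(1.0 - g)
    return total


def shift(theta, direction, t):
    """Return theta + t * direction."""
    return [a + t * b for a, b in zip(theta, direction)]


# Toy data: y = 1 exactly when the covariate is positive.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
v = [0.0, 1.0]   # v'x >= 0 when y = 1 and v'x <= 0 when y = 0
w = [0.0, 1.0]   # the same direction works for the second component

beta, gamma = [0.3, -0.4], [-0.2, 0.7]   # arbitrary finite starting point
ts = [0.5 * k for k in range(11)]
path = [loglik(shift(beta, v, t), shift(gamma, w, t), X, y) for t in ts]
is_increasing = all(b > a for a, b in zip(path, path[1:]))
```

Along this ray the log-likelihood increases at every step of the grid, so an optimizer started at this point would keep moving outward rather than settle at a finite maximizer.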

A.3 Proof of Proposition 2.3

We prove Proposition 2.3 as follows.

Proof of Proposition 2.3.

Let $(\beta,\gamma)=t(v,w)$ , $t:=\sqrt{\|\beta\|^{2}+\|\gamma\|^{2}}\geq 0$ , and $\|v\|^{2}+\|w\|^{2}=1$ . Because

\displaystyle F(u)\leq e^{u}\quad\text{and}\quad 1-F(u)\leq e^{-u},

we have

\displaystyle F(u_{1})F(u_{2})\leq e^{u_{1}+u_{2}},\quad\text{and}\quad 1-F(u_{1})F(u_{2})\leq e^{-u_{1}}+e^{-u_{2}},

(2)

for all $u_{1},u_{2}\in\mathbb{R}$ .
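For completeness, the elementary bounds and the two inequalities in (2) can be derived as follows. This is a short sketch that uses only the definition $F(u)=e^{u}/(1+e^{u})$ and $0<F(u)<1$.

```latex
\begin{aligned}
F(u) &= \frac{e^{u}}{1+e^{u}} \le e^{u},
\qquad
1-F(u) = \frac{1}{1+e^{u}} = \frac{e^{-u}}{1+e^{-u}} \le e^{-u},\\
F(u_{1})F(u_{2}) &\le e^{u_{1}}e^{u_{2}} = e^{u_{1}+u_{2}},\\
1-F(u_{1})F(u_{2}) &= \{1-F(u_{1})\} + F(u_{1})\{1-F(u_{2})\}
\le e^{-u_{1}} + e^{-u_{2}}.
\end{aligned}
```

The last inequality uses $F(u_{1})\leq 1$ to drop the factor multiplying $1-F(u_{2})$.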

By $\varepsilon$ –double–non-separation, for each unit direction $(v,w)$ at least one of the following holds:

(a)

there exists $i\in\{i:y_{i}=1\}$ such that $v^{\top}x_{i}+w^{\top}z_{i}\leq-\varepsilon$ ;
(b)

there exists $i\in\{i:y_{i}=0\}$ such that $\min\{v^{\top}x_{i},w^{\top}z_{i}\}\geq\varepsilon$ .

If (a) holds, then for the corresponding index $i$ with $y_{i}=1$ ,

	$\displaystyle\log F(tv^{\top}x_{i})+\log F(tw^{\top}z_{i})$	$\displaystyle=\log\{F(tv^{\top}x_{i})F(tw^{\top}z_{i})\}$
		$\displaystyle\leq t(v^{\top}x_{i}+w^{\top}z_{i})\leq-\varepsilon t.$

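Because every summand of the log-likelihood is non-positive, $L(tv,tw)$ is bounded above by this single term, and we have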

\displaystyle L(tv,tw)\leq-\varepsilon t.

Next, if (b) holds, then for the corresponding index $i$ with $y_{i}=0$ ,

	$\displaystyle\log(1-F(tv^{\top}x_{i})F(tw^{\top}z_{i}))$	$\displaystyle\leq\log(e^{-tv^{\top}x_{i}}+e^{-tw^{\top}z_{i}})$
		$\displaystyle\leq\log 2-\varepsilon t.$

Again bounding the log-likelihood by this single non-positive term, we have

\displaystyle L(tv,tw)\leq\log 2-\varepsilon t.

Consequently, we obtain

\displaystyle\sup_{(v,w):\|v\|^{2}+\|w\|^{2}=1}L(tv,tw)\leq\log 2-\varepsilon t\to-\infty\quad\text{as}\quad t\to\infty.

(3)

Since $\log 2-\varepsilon t\to-\infty$ as $t\to\infty$ , there exists a sufficiently large radius $t_{0}>0$ such that $\log 2-\varepsilon t_{0}<L(0,0)$ . We then define the closed ball

\displaystyle\mathcal{B}_{t_{0}}=\{(\beta^{\top},\gamma^{\top})^{\top}\in\mathbb{R}^{d+p}:\|(\beta^{\top},\gamma^{\top})^{\top}\|\leq t_{0}\}.

As $\mathcal{B}_{t_{0}}$ is a compact set and $L$ is continuous, $L$ attains its maximum at some point $(\hat{\beta},\hat{\gamma})^{\top}\in\mathcal{B}_{t_{0}}$ . Furthermore, for any $(\beta^{\top},\gamma^{\top})^{\top}$ outside this ball, by (3), we have

\displaystyle L(\beta,\gamma)\leq\log 2-\varepsilon\|(\beta^{\top},\gamma^{\top})^{\top}\|<\log 2-\varepsilon t_{0}<L(0,0).

Since $L(0,0)\leq L(\hat{\beta},\hat{\gamma})$ , it follows that $L(\beta,\gamma)<L(\hat{\beta},\hat{\gamma})$ for all $(\beta^{\top},\gamma^{\top})^{\top}\notin\mathcal{B}_{t_{0}}$ . Therefore, $(\hat{\beta},\hat{\gamma})^{\top}$ is a global maximizer of $L({\beta},{\gamma})$ . ∎

A.4 Proofs of Proposition 3.1, Theorem 3.2, and Corollary 3.3

We first recall a standard linear-independence property of exponential functions.

Lemma A.4.

Let $\lambda_{1},\ldots,\lambda_{m}\in\mathbb{R}^{d-1}$ be distinct vectors, and let $\mathcal{U}\subset\mathbb{R}^{d-1}$ contain a nonempty open set. If

\displaystyle\sum_{k=1}^{m}a_{k}\exp\left(\lambda_{k}^{\top}w\right)=0,

for all $w\in\mathcal{U}$ , then $a_{1}=\cdots=a_{m}=0$ .

Proof of Lemma A.4.

Choose $w_{0}$ in the interior of $\mathcal{U}$ . Since the hyperplanes

\displaystyle\left\{v\in\mathbb{R}^{d-1}:(\lambda_{k}-\lambda_{\ell})^{\top}v=0,~k\neq\ell,~k,\ell\in\{1,\ldots,m\}\right\},

do not cover $\mathbb{R}^{d-1}$ , there exists $v\in\mathbb{R}^{d-1}$ such that $\lambda_{1}^{\top}v,~\ldots,~\lambda_{m}^{\top}v$ are pairwise distinct. For all sufficiently small $t$ , we have $w_{0}+tv\in\mathcal{U}$ , and hence

\displaystyle 0=\sum_{k=1}^{m}a_{k}\exp\left(\lambda_{k}^{\top}(w_{0}+tv)\right)=\sum_{k=1}^{m}a_{k}\exp\left(\lambda_{k}^{\top}w_{0}\right)\exp\left((\lambda_{k}^{\top}v)t\right).

Since one-dimensional exponential functions with distinct exponents are linearly independent on any open interval, we obtain

\displaystyle a_{k}\exp\left(\lambda_{k}^{\top}w_{0}\right)=0,

for $k=1,\ldots,m$ . Therefore, we have $a_{k}=0$ for $k=1,\ldots,m$ . ∎
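The one-dimensional fact invoked in the proof can be verified by a standard dominant-term argument. The following sketch introduces, for illustration only, the shorthand $\mu_{k}:=\lambda_{k}^{\top}v$ and $c_{k}:=a_{k}\exp(\lambda_{k}^{\top}w_{0})$, with the indices ordered so that $\mu_{1}<\cdots<\mu_{m}$.

```latex
\begin{aligned}
&\text{Suppose } \sum_{k=1}^{m} c_{k}e^{\mu_{k}t}=0
 \text{ for all } t \text{ in an open interval.}\\
&\text{The left-hand side is real analytic in } t,
 \text{ so the identity extends to all } t\in\mathbb{R}.\\
&\text{Dividing by } e^{\mu_{m}t}:\quad
 c_{m}+\sum_{k=1}^{m-1}c_{k}e^{(\mu_{k}-\mu_{m})t}=0
 \;\xrightarrow[t\to\infty]{}\; c_{m}=0.\\
&\text{Repeating the argument with } m-1,\,m-2,\,\ldots
 \text{ terms gives } c_{1}=\cdots=c_{m}=0.
\end{aligned}
```

Since $\exp(\lambda_{k}^{\top}w_{0})>0$, the vanishing of every $c_{k}$ is equivalent to the vanishing of every $a_{k}$.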

We now prove Proposition 3.1.

Proof of Proposition 3.1.

Because (1) holds for all $y\in\{0,1\}$ , it is equivalent to

\displaystyle F\left(x^{\top}\beta_{1}\right)F\left(x^{\top}\gamma_{1}\right)=F\left(x^{\top}\beta_{2}\right)F\left(x^{\top}\gamma_{2}\right),

for all $x=(1,\tilde{x}_{-0}^{\top})^{\top}$ with $\tilde{x}_{-0}\in\mathcal{U}$ . Using $F(\mu)=1/(1+\exp(-\mu))$ , we obtain

	$\displaystyle\left\{1+\exp(-\beta_{1,0})\exp\left(-\beta_{1,-0}^{\top}\tilde{x}_{-0}\right)\right\}\left\{1+\exp(-\gamma_{1,0})\exp\left(-\gamma_{1,-0}^{\top}\tilde{x}_{-0}\right)\right\}$
	$\displaystyle\quad=\left\{1+\exp(-\beta_{2,0})\exp\left(-\beta_{2,-0}^{\top}\tilde{x}_{-0}\right)\right\}\left\{1+\exp(-\gamma_{2,0})\exp\left(-\gamma_{2,-0}^{\top}\tilde{x}_{-0}\right)\right\},$

for all $\tilde{x}_{-0}\in\mathcal{U}$ . Therefore, we have

		$\displaystyle\exp(-\beta_{1,0})\exp(-\beta_{1,-0}^{\top}\tilde{x}_{-0})+\exp(-\gamma_{1,0})\exp(-\gamma_{1,-0}^{\top}\tilde{x}_{-0})$
		$\displaystyle\quad+\exp(-\beta_{1,0}-\gamma_{1,0})\exp(-(\beta_{1,-0}+\gamma_{1,-0})^{\top}\tilde{x}_{-0})$
		$\displaystyle=\exp(-\beta_{2,0})\exp(-\beta_{2,-0}^{\top}\tilde{x}_{-0})+\exp(-\gamma_{2,0})\exp(-\gamma_{2,-0}^{\top}\tilde{x}_{-0})$
		$\displaystyle\quad+\exp(-\beta_{2,0}-\gamma_{2,0})\exp(-(\beta_{2,-0}+\gamma_{2,-0})^{\top}\tilde{x}_{-0})$		(4)

for all $\tilde{x}_{-0}\in\mathcal{U}$ . Since $\beta_{1,-0},~\gamma_{1,-0},~\beta_{1,-0}+\gamma_{1,-0}$ are pairwise distinct, the left-hand side of (4) is a linear combination of three distinct exponential functions and can be written as

\displaystyle\sum_{j=1}^{3}\exp(c_{1,j})\exp\left(-\lambda_{1,j}^{\top}\tilde{x}_{-0}\right),

where $\{\lambda_{1,1},~\lambda_{1,2},~\lambda_{1,3}\}=\{\beta_{1,-0},~\gamma_{1,-0},~\beta_{1,-0}+\gamma_{1,-0}\}$ and $\{c_{1,1},~c_{1,2},~c_{1,3}\}=\{-\beta_{1,0},~-\gamma_{1,0},~-\beta_{1,0}-\gamma_{1,0}\}$ . The right-hand side can be written as

\displaystyle\sum_{j=1}^{m}\exp({c_{2,j}})\exp\left(-\lambda_{2,j}^{\top}\tilde{x}_{-0}\right),\quad\text{for}\quad 1\leq m\leq 3,

where $\lambda_{2,1},\ldots,\lambda_{2,m}$ are distinct. Therefore, we have

\displaystyle\sum_{j=1}^{3}\exp(c_{1,j})\exp\left(-\lambda_{1,j}^{\top}\tilde{x}_{-0}\right)-\sum_{j=1}^{m}\exp({c_{2,j}})\exp\left(-\lambda_{2,j}^{\top}\tilde{x}_{-0}\right)=0.

By Lemma A.4, applied after collecting terms with equal exponent vectors, every coefficient in the combined linear combination must vanish. If the sets $\{\lambda_{1,1},~\lambda_{1,2},~\lambda_{1,3}\}$ and $\{\lambda_{2,1},~\ldots,~\lambda_{2,m}\}$ were not identical, some exponent vector would appear on only one side, and its coefficient, which is either $\exp(c_{1,j})$ or $-\exp(c_{2,k})$ , would have to be zero. This is a contradiction because the exponential function is strictly positive. Therefore, the two sets coincide, the coefficients attached to each common exponent vector are equal, and we obtain $m=3$ and

\displaystyle\left\{\beta_{1,-0},\gamma_{1,-0},\beta_{1,-0}+\gamma_{1,-0}\right\}=\left\{\beta_{2,-0},\gamma_{2,-0},\beta_{2,-0}+\gamma_{2,-0}\right\}.

Since $\beta_{1,-0},~\gamma_{1,-0}$ , and $\beta_{1,-0}+\gamma_{1,-0}$ are pairwise distinct, $\beta_{1,-0}+\gamma_{1,-0}$ is the unique element in the set $\{\beta_{1,-0},~\gamma_{1,-0},~\beta_{1,-0}+\gamma_{1,-0}\}$ that can be expressed as the sum of the other two elements. (Footnote 1: This condition implies that both $\beta_{1,-0}$ and $\gamma_{1,-0}$ are non-zero, and ensures that neither can be expressed as the sum of the other two elements in the set.) Because the sets $\{\beta_{1,-0},~\gamma_{1,-0},~\beta_{1,-0}+\gamma_{1,-0}\}$ and $\{\beta_{2,-0},~\gamma_{2,-0},~\beta_{2,-0}+\gamma_{2,-0}\}$ are identical, their unique sum elements must be equal. Therefore, we have

\displaystyle\beta_{1,-0}+\gamma_{1,-0}=\beta_{2,-0}+\gamma_{2,-0},

which implies that the sets of the remaining elements are also identical:

\displaystyle\{\beta_{1,-0},~\gamma_{1,-0}\}=\{\beta_{2,-0},~\gamma_{2,-0}\}.

If $\beta_{1,-0}=\beta_{2,-0}$ and $\gamma_{1,-0}=\gamma_{2,-0}$ , then we have $\exp(-\beta_{1,0})=\exp(-\beta_{2,0})$ and $\exp(-\gamma_{1,0})=\exp(-\gamma_{2,0})$ , and hence $\beta_{1}=\beta_{2}$ and $\gamma_{1}=\gamma_{2}$ . Instead, if $\beta_{1,-0}=\gamma_{2,-0}$ and $\gamma_{1,-0}=\beta_{2,-0}$ , then we have $\exp(-\beta_{1,0})=\exp(-\gamma_{2,0})$ and $\exp(-\gamma_{1,0})=\exp(-\beta_{2,0})$ , and hence $\beta_{1}=\gamma_{2}$ and $\gamma_{1}=\beta_{2}$ . Therefore, we have either $(\beta_{1},~\gamma_{1})=(\beta_{2},~\gamma_{2})$ or $(\beta_{1},~\gamma_{1})=(\gamma_{2},~\beta_{2})$ . ∎
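The two facts driving this proof can be checked numerically. The following is a minimal sketch, assuming the success probability $F(x^{\top}\beta)F(x^{\top}\gamma)$ used later in the proof of Corollary 3.3; the function names and the numerical values are hypothetical and chosen only for illustration. It confirms that exchanging $\beta$ and $\gamma$ leaves the probability unchanged, and that the reciprocal of the probability minus one equals a sum of three exponential functions with exponents governed by $\beta$, $\gamma$, and $\beta+\gamma$.

```python
import math

def F(t):
    """Logistic function F(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def success_prob(x, beta, gamma):
    """P(y = 1 | x) = F(x'beta) F(x'gamma) under the shared design."""
    return F(dot(x, beta)) * F(dot(x, gamma))

def three_exponentials(x, beta, gamma):
    """exp(-x'beta) + exp(-x'gamma) + exp(-x'(beta + gamma))."""
    b, g = dot(x, beta), dot(x, gamma)
    return math.exp(-b) + math.exp(-g) + math.exp(-(b + g))

# Hypothetical values chosen only for illustration.
beta = [0.5, 1.0, 0.5]
gamma = [1.7, -1.0, -1.0]
x = [1.0, 0.3, -1.2]

p = success_prob(x, beta, gamma)
p_swapped = success_prob(x, gamma, beta)   # exchange beta and gamma
lhs = 1.0 / p - 1.0                        # 1 / P(y = 1 | x) - 1
rhs = three_exponentials(x, beta, gamma)   # sum of three exponentials
print(p, p_swapped, lhs, rhs)
```

The agreement of `lhs` and `rhs` follows from $(1+e^{-a})(1+e^{-b})=1+e^{-a}+e^{-b}+e^{-(a+b)}$, which is why three exponents appear in the argument above.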

We next prove Theorem 3.2.

Proof of Theorem 3.2.

Fix $\xi\in\mathcal{S}$ . For $j=1,2$ , define $\beta_{j,0}^{(\xi)}:=\beta_{j,0}+\beta_{j,-0}^{(2)\top}\xi$ and $\gamma_{j,0}^{(\xi)}:=\gamma_{j,0}+\gamma_{j,-0}^{(2)\top}\xi$ . Then, for all $y\in\{0,1\}$ and all $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$ , (1) can be expressed as

	$\displaystyle p\left(y\mid(1,\tilde{x}_{-0}^{(1)\top})^{\top},(\beta_{1,0}^{(\xi)},\beta_{1,-0}^{(1)\top})^{\top},(\gamma_{1,0}^{(\xi)},\gamma_{1,-0}^{(1)\top})^{\top}\right)$
	$\displaystyle\quad=p\left(y\mid(1,\tilde{x}_{-0}^{(1)\top})^{\top},(\beta_{2,0}^{(\xi)},\beta_{2,-0}^{(1)\top})^{\top},(\gamma_{2,0}^{(\xi)},\gamma_{2,-0}^{(1)\top})^{\top}\right).$

Because $\mathcal{U}_{\xi}$ is a nonempty open subset of $\mathbb{R}^{r}$ and $\beta_{1,-0}^{(1)},~\gamma_{1,-0}^{(1)},~\beta_{1,-0}^{(1)}+\gamma_{1,-0}^{(1)}$ are pairwise distinct, Proposition 3.1 yields that, for each $\xi\in\mathcal{S}$ , we have either

\displaystyle(\beta_{1,-0}^{(1)},~\gamma_{1,-0}^{(1)})=(\beta_{2,-0}^{(1)},~\gamma_{2,-0}^{(1)})\quad\text{and}\quad(\beta_{1,0}^{(\xi)},~\gamma_{1,0}^{(\xi)})=(\beta_{2,0}^{(\xi)},~\gamma_{2,0}^{(\xi)}),

(5)

or

\displaystyle(\beta_{1,-0}^{(1)},~\gamma_{1,-0}^{(1)})=(\gamma_{2,-0}^{(1)},~\beta_{2,-0}^{(1)})\quad\text{and}\quad(\beta_{1,0}^{(\xi)},~\gamma_{1,0}^{(\xi)})=(\gamma_{2,0}^{(\xi)},~\beta_{2,0}^{(\xi)}).

(6)

We claim that the same set of equations must hold for all $\xi\in\mathcal{S}$ . Indeed, if (5) holds for some $\xi\in\mathcal{S}$ and (6) holds for some $\xi^{\prime}\in\mathcal{S}$ , then $\beta_{1,-0}^{(1)}=\beta_{2,-0}^{(1)}=\gamma_{1,-0}^{(1)}$ , which contradicts $\beta_{1,-0}^{(1)}\neq\gamma_{1,-0}^{(1)}$ . Therefore, either (5) holds for all $\xi\in\mathcal{S}$ , or (6) holds for all $\xi\in\mathcal{S}$ .

First, suppose that (5) holds for all $\xi\in\mathcal{S}$ . Then, we have

	$\displaystyle\beta_{1,0}+\beta_{1,-0}^{(2)\top}\xi$	$\displaystyle=\beta_{2,0}+\beta_{2,-0}^{(2)\top}\xi,$
	$\displaystyle\quad\gamma_{1,0}+\gamma_{1,-0}^{(2)\top}\xi$	$\displaystyle=\gamma_{2,0}+\gamma_{2,-0}^{(2)\top}\xi,$

for all $\xi\in\mathcal{S}$ . Hence, both of the following functions:

	$\displaystyle\xi$	$\displaystyle\mapsto\beta_{1,0}+\beta_{1,-0}^{(2)\top}\xi-\beta_{2,0}-\beta_{2,-0}^{(2)\top}\xi,$
	$\displaystyle\xi$	$\displaystyle\mapsto\gamma_{1,0}+\gamma_{1,-0}^{(2)\top}\xi-\gamma_{2,0}-\gamma_{2,-0}^{(2)\top}\xi,$

are identically zero on $\mathcal{S}$ . Since both functions are affine in $\xi$ and $\operatorname{aff}(\mathcal{S})=\mathbb{R}^{s}$ , they are also identically zero on $\mathbb{R}^{s}$ . Therefore, we have $\beta_{1,0}=\beta_{2,0}$ , $\beta_{1,-0}^{(2)}=\beta_{2,-0}^{(2)}$ , $\gamma_{1,0}=\gamma_{2,0}$ , and $\gamma_{1,-0}^{(2)}=\gamma_{2,-0}^{(2)}$ . Combining (5) with this, we obtain $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$ .

Next, suppose that (6) holds for all $\xi\in\mathcal{S}$ . Then, by a similar argument, we obtain $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$ .

Therefore, we have either $(\beta_{1},\gamma_{1})=(\beta_{2},\gamma_{2})$ or $(\beta_{1},\gamma_{1})=(\gamma_{2},\beta_{2})$ . ∎

We finally prove Corollary 3.3.

Proof of Corollary 3.3.

Let $\pi_{\beta,\gamma}(x):=F(x^{\top}\beta)F(x^{\top}\gamma)$ and $\pi^{\ast}(x):=\pi_{\beta^{\ast},\gamma^{\ast}}(x)$ . Under correct specification, we have

\displaystyle y\mid x\sim\operatorname{Bernoulli}\!\left(\pi^{\ast}(x)\right).

Therefore, the expected log-likelihood function has the following expression:

\displaystyle\mathcal{L}(\beta,~\gamma)

\displaystyle=\mathbb{E}_{x}\left[\pi^{\ast}(x)\log\pi_{\beta,\gamma}(x)+\{1-\pi^{\ast}(x)\}\log\{1-\pi_{\beta,\gamma}(x)\}\right],

and, hence, we have

	$\displaystyle\mathcal{L}(\beta,~\gamma)-\mathcal{L}(\beta^{\ast},~\gamma^{\ast})$
	$\displaystyle\quad=\mathbb{E}_{x}\left[\pi^{\ast}(x)\log\pi_{\beta,\gamma}(x)+\{1-\pi^{\ast}(x)\}\log\{1-\pi_{\beta,\gamma}(x)\}\right]$
	$\displaystyle\qquad-\mathbb{E}_{x}\left[\pi^{\ast}(x)\log\pi^{\ast}(x)+\{1-\pi^{\ast}(x)\}\log\{1-\pi^{\ast}(x)\}\right]$
	$\displaystyle\quad=-\mathbb{E}_{x}\left[\pi^{\ast}(x)\log\frac{\pi^{\ast}(x)}{\pi_{\beta,\gamma}(x)}+\{1-\pi^{\ast}(x)\}\log\frac{1-\pi^{\ast}(x)}{1-\pi_{\beta,\gamma}(x)}\right]$
	$\displaystyle\quad=-\mathbb{E}_{x}\left[\mathrm{KL}\left(\operatorname{Bernoulli}~\left(\pi^{\ast}(x)\right)~\middle\|~\operatorname{Bernoulli}~\left(\pi_{\beta,\gamma}(x)\right)\right)\right]\leq 0,$

for any $(\beta,\gamma)\in\mathbb{R}^{d}\times\mathbb{R}^{d}$ . Here, $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence from the first probability distribution to the second. Thus $[\beta^{\ast},~\gamma^{\ast}]$ is a maximizer on $(\mathbb{R}^{d}\times\mathbb{R}^{d})/{\sim}$ .

Now suppose that $(\beta^{\dagger},\gamma^{\dagger})\in\mathbb{R}^{d}\times\mathbb{R}^{d}$ also attains the maximum. Then, it must hold that

\displaystyle\mathrm{KL}\left(\operatorname{Bernoulli}~\left(\pi^{\ast}(x)\right)~\middle\|~\operatorname{Bernoulli}~\left(\pi_{\beta^{\dagger},\gamma^{\dagger}}(x)\right)\right)=0,\quad\text{almost surely.}

Therefore, we obtain

\displaystyle\pi_{\beta^{\dagger},\gamma^{\dagger}}(x)=\pi^{\ast}(x),\quad\text{for almost every $x$.}

(7)

Let $x=(1,\tilde{x}_{-0}^{(1)\top},\tilde{x}_{-0}^{(2)\top})^{\top}$ . Fix $\xi\in\mathcal{S}$ and define

	$\displaystyle g_{\dagger,\xi}(\tilde{x}_{-0}^{(1)})$	$\displaystyle:=\pi_{\beta^{\dagger},\gamma^{\dagger}}\left((1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}\right),$
	$\displaystyle g_{\ast,\xi}(\tilde{x}_{-0}^{(1)})$	$\displaystyle:=\pi^{\ast}\left((1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}\right).$

Both functions are continuous on $\mathbb{R}^{r}$ . We claim that $g_{\dagger,\xi}(\tilde{x}_{-0}^{(1)})=g_{\ast,\xi}(\tilde{x}_{-0}^{(1)})$ holds for all $\xi\in\mathcal{S}$ and all $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$ . Suppose not. Then there exist $\xi\in\mathcal{S}$ and $w\in\mathcal{U}_{\xi}$ such that $g_{\dagger,\xi}(w)\neq g_{\ast,\xi}(w)$ . By continuity, there exists a nonempty open neighborhood $V\subset\mathcal{U}_{\xi}$ of $w$ on which the two functions remain different. Because $\xi\in\operatorname{supp}(\tilde{x}_{-0}^{(2)})$ and $w$ belongs to the conditional support of $\tilde{x}_{-0}^{(1)}$ given $\tilde{x}_{-0}^{(2)}=\xi$ , we have $\mathbb{P}(\tilde{x}_{-0}^{(2)}=\xi,~\tilde{x}_{-0}^{(1)}\in V)>0$ . This contradicts (7). Therefore, we obtain

\displaystyle\pi_{\beta^{\dagger},\gamma^{\dagger}}(x)=\pi^{\ast}(x)

for all $x=(1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}$ with $\xi\in\mathcal{S}$ and $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$ .

From the discussion above, we have

\displaystyle p(y\mid x,\beta^{\dagger},\gamma^{\dagger})=p(y\mid x,\beta^{\ast},\gamma^{\ast}),

for all $y\in\{0,1\}$ and all $x=(1,\tilde{x}_{-0}^{(1)\top},\xi^{\top})^{\top}$ with $\xi\in\mathcal{S}$ and $\tilde{x}_{-0}^{(1)}\in\mathcal{U}_{\xi}$ . By Theorem 3.2, it follows that

\displaystyle(\beta^{\dagger},~\gamma^{\dagger})\sim(\beta^{\ast},~\gamma^{\ast}).

Therefore, the expected log-likelihood $\mathcal{L}(\beta,~\gamma)$ is uniquely maximized on $(\mathbb{R}^{d}\times\mathbb{R}^{d})/{\sim}$ at the class $[\beta^{\ast},~\gamma^{\ast}]$ . ∎
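The sign of the expected log-likelihood gap can be illustrated with a small Monte Carlo computation. The sketch below is an assumption-laden illustration rather than part of the proof: the covariate distribution, the coefficient values, and the function names are hypothetical. It averages the Bernoulli Kullback-Leibler divergence over simulated covariates and shows that the gap vanishes at the exchanged pair $(\gamma^{\ast},\beta^{\ast})$ and is negative at a different parameter pair.

```python
import math
import random

def F(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def pi(x, beta, gamma):
    """pi_{beta,gamma}(x) = F(x'beta) F(x'gamma)."""
    xb = sum(a * b for a, b in zip(x, beta))
    xg = sum(a * b for a, b in zip(x, gamma))
    return F(xb) * F(xg)

def bernoulli_kl(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) ) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def loglik_gap(beta, gamma, beta_star, gamma_star, xs):
    """Monte Carlo estimate of L(beta, gamma) - L(beta*, gamma*) = -E_x[KL]."""
    total = sum(
        bernoulli_kl(pi(x, beta_star, gamma_star), pi(x, beta, gamma)) for x in xs
    )
    return -total / len(xs)

# Hypothetical design: intercept plus two standard normal covariates.
rng = random.Random(0)
xs = [[1.0, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(2000)]
beta_star, gamma_star = [0.5, 1.0, 0.5], [1.7, -1.0, -1.0]

# Gap at the exchanged pair (gamma*, beta*): zero, by the exchange symmetry.
gap_swapped = loglik_gap(gamma_star, beta_star, beta_star, gamma_star, xs)
# Gap at an arbitrary other pair: strictly negative.
gap_other = loglik_gap([0.0, 0.5, 0.5], [1.0, -1.0, 0.0], beta_star, gamma_star, xs)
print(gap_swapped, gap_other)
```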

Appendix B Details of Numerical Settings of Bimodality Confirmation

B.1 Posterior Distribution and Sampling Algorithm

We consider the posterior distribution given by

\pi(\beta,\gamma,h\mid y,x,z)\propto\left\{\prod_{i=1}^{n}p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)\right\}p_{\mathrm{prior}}(\beta,\gamma),

(8)

where $p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)$ is defined as

\displaystyle p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)

\displaystyle=\left\{F(\gamma^{\top}z_{i})\right\}^{h_{i}}\left\{1-F(\gamma^{\top}z_{i})\right\}^{1-h_{i}}\left\{F(\beta^{\top}x_{i})\right\}^{y_{i}}\left\{1-F(\beta^{\top}x_{i})\right\}^{h_{i}-y_{i}},

and $p_{\mathrm{prior}}(\beta,\gamma)$ denotes the prior distribution. The complete-data likelihood can be separated into a logistic regression for $h$ :

\displaystyle h_{i}\mid z_{i}\sim\operatorname{Bernoulli}(F(\gamma^{\top}z_{i})),

and a logistic regression for $y^{\ast}$ :

\displaystyle y_{i}^{\ast}\mid x_{i}\sim\operatorname{Bernoulli}(F(\beta^{\top}x_{i})),

conditioned on $h_{i}=1$ . Thus $h_{i}=0$ generates a structural zero, whereas $h_{i}=1$ lets the ordinary logistic regression determine the response. The observed outcome $y_{i}$ is then given by $y_{i}=h_{i}y_{i}^{\ast}$ . We use this structure to construct an MCMC algorithm. We further employ the Pólya-Gamma augmentation (Polson et al., 2013) and combine the resulting Gibbs sampling algorithm with replica exchange so that the sampler can move between the modes. The detailed algorithms are provided in Appendix C.

B.2 Data Generation and Sampling Setup

Numerical experiments were conducted under the shared-design setting. We considered three designs that differ only in the distribution of the covariates. In each scenario, we generated $n=2{,}000$ observations with $d=p=5$ covariates including an intercept. The true coefficient vectors were fixed at

	$\displaystyle\beta^{*}$	$\displaystyle=(0.5,~1.0,~0.5,~0.5,~0.25)^{\top},$
	$\displaystyle\gamma^{*}$	$\displaystyle=(1.7,~-1.0,~-1.0,~0.5,~0.5)^{\top}.$

In Scenario 1, all non-intercept covariates were drawn from independent standard normal distributions. In Scenario 2, all non-intercept covariates were drawn from independent $\operatorname{Bernoulli}(0.5)$ . In Scenario 3, the first two non-intercept covariates were drawn from independent standard normal distributions, and the remaining ones were from independent $\operatorname{Bernoulli}(0.5)$ . Given the covariates, we generated

	$\displaystyle h_{i}$	$\displaystyle\sim\operatorname{Bernoulli}\left(F(\gamma^{*\top}x_{i})\right),$
	$\displaystyle y_{i}^{\ast}$	$\displaystyle\sim\operatorname{Bernoulli}\left(F(\beta^{*\top}x_{i})\right),$

independently, and set $y_{i}=h_{i}y_{i}^{\ast}$ .
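The data-generating process described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the seed handling, and the encoding of the three scenarios are assumptions made for the sketch and are not the authors' code.

```python
import math
import random

BETA_STAR = [0.5, 1.0, 0.5, 0.5, 0.25]
GAMMA_STAR = [1.7, -1.0, -1.0, 0.5, 0.5]

def F(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def draw_covariates(scenario, rng):
    """Intercept plus four covariates, following Scenarios 1-3."""
    normal = lambda: rng.gauss(0.0, 1.0)
    binary = lambda: float(rng.random() < 0.5)   # Bernoulli(0.5)
    if scenario == 1:
        tail = [normal() for _ in range(4)]
    elif scenario == 2:
        tail = [binary() for _ in range(4)]
    elif scenario == 3:
        tail = [normal(), normal(), binary(), binary()]
    else:
        raise ValueError("scenario must be 1, 2, or 3")
    return [1.0] + tail

def simulate(n, scenario, seed=0):
    """Return a list of (x, h, y_star, y) with y = h * y_star."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = draw_covariates(scenario, rng)
        h = int(rng.random() < F(dot(x, GAMMA_STAR)))        # zero-inflation state
        y_star = int(rng.random() < F(dot(x, BETA_STAR)))    # latent binary response
        data.append((x, h, y_star, h * y_star))
    return data

data = simulate(2000, scenario=3, seed=1)
print(sum(y for _, _, _, y in data) / len(data))  # observed proportion of ones
```

Only `x` and `y` would be available to the analyst; `h` and `y_star` are returned here solely so that the construction $y_{i}=h_{i}y_{i}^{\ast}$ can be inspected.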

We placed weakly informative Gaussian priors on both coefficient vectors: $\beta\sim\mathcal{N}(0,100I_{p})$ and $\gamma\sim\mathcal{N}(0,100I_{q})$ . Posterior sampling was performed via a Gibbs sampling algorithm with replica exchange using $20$ replicas. The temperature schedule followed a geometric progression $T_{m}=r^{m}$ with $r=1.05$ , and replica exchange was attempted every $50$ iterations. The total number of MCMC iterations was set to $53{,}000$ , with the first $3{,}000$ discarded as burn-in. To facilitate efficient sampling, Gibbs sampling based on Pólya-Gamma data augmentation was used for both the ordinary logistic regression component and the structural-zero component.

B.3 Sampling Results

To explore the structure of the posterior distribution, we applied the $k$-means++ clustering algorithm (Arthur and Vassilvitskii, 2007) with $k=2$ to the posterior samples. The samples, each consisting of the $\beta$ and $\gamma$ parameters concatenated into a $10$-dimensional vector, were projected onto the first two principal components using PCA for visualization. See Figure 1 for the plots.
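The clustering and projection steps can be sketched with NumPy as below. This is an illustration under explicit assumptions: the draws are a synthetic stand-in with two modes placed at $(\beta^{\ast},\gamma^{\ast})$ and $(\gamma^{\ast},\beta^{\ast})$, not the posterior samples of the paper, and `kmeans_pp` and `pca_scores` are hypothetical helper names written for this sketch rather than the implementation used by the authors.

```python
import numpy as np

def kmeans_pp(points, k, rng, n_iter=100):
    """Lloyd's algorithm with k-means++ seeding; returns (labels, centers)."""
    n = points.shape[0]
    centers = [points[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest already-chosen center.
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels, centers

def pca_scores(points, n_components=2):
    """Project centered points onto the leading principal components via SVD."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic stand-in for posterior draws (NOT the paper's samples).
rng = np.random.default_rng(0)
beta_star = np.array([0.5, 1.0, 0.5, 0.5, 0.25])
gamma_star = np.array([1.7, -1.0, -1.0, 0.5, 0.5])
mode_a = np.concatenate([beta_star, gamma_star])
mode_b = np.concatenate([gamma_star, beta_star])
draws = np.vstack([mode_a + 0.1 * rng.standard_normal((300, 10)),
                   mode_b + 0.1 * rng.standard_normal((200, 10))])

labels, centers = kmeans_pp(draws, k=2, rng=rng)
scores = pca_scores(draws)                 # 2-D coordinates for a scatter plot
sizes = np.bincount(labels, minlength=2)   # cluster sizes
print(sizes, centers.round(2))
```

With well-separated modes, the two cluster centers are approximately exchanged copies of each other, which is the pattern the within-cluster posterior means are meant to reveal.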

Table 6 summarizes the posterior means within each cluster along with cluster sizes and proportions. In Scenarios 1 and 3, the means of the two clusters exhibited an approximately symmetric structure, reflecting the exchange symmetry of $(\beta,\gamma)$ and $(\gamma,\beta)$ . In Scenario 2, the posterior means from the smaller cluster were numerically large, especially in the structural-zero component, consistent with the failure of the binary design to satisfy the condition (C2), which results in a lack of guaranteed identifiability.

The trace plots and the histograms for the posterior distributions are provided in Appendix D.

Table 6: Posterior means of each parameter for the clusters in Scenarios 1–3.

Parameter	True	Scenario 1, Cluster 1	Scenario 1, Cluster 2	Scenario 2, Cluster 1	Scenario 2, Cluster 2	Scenario 3, Cluster 1	Scenario 3, Cluster 2
$\beta_{0}$	0.5	0.325	2.886	9.700	17.168	0.202	2.713
$\beta_{1}$	1.0	0.871	-1.349	5.136	0.847	0.760	-1.453
$\beta_{2}$	0.5	0.355	-1.572	3.180	5.330	0.234	-1.283
$\beta_{3}$	0.5	0.515	0.544	4.194	-0.704	0.532	0.737
$\beta_{4}$	0.25	0.282	0.668	8.992	-0.187	0.574	0.158
$\gamma_{0}$	1.7	2.413	0.200	0.280	0.265	2.794	0.216
$\gamma_{1}$	-1.0	-1.154	0.803	-0.214	-0.204	-1.491	0.773
$\gamma_{2}$	-1.0	-1.413	0.284	-0.622	-0.622	-1.307	0.249
$\gamma_{3}$	0.5	0.524	0.516	0.570	0.575	0.749	0.532
$\gamma_{4}$	0.5	0.625	0.292	0.470	0.485	0.154	0.575
Cluster size		13,250	36,750	8,395	41,605	25,450	24,550
Proportion		0.265	0.735	0.168	0.832	0.509	0.491

Appendix C Details of Sampling Algorithm

For each replica $m=1,\ldots,M$ , let $T_{m}>0$ denote the temperature. The tempered posterior distribution is defined as

\displaystyle\pi_{T_{m}}(\beta,\gamma,h\mid y,x,z)\propto\left\{\prod_{i=1}^{n}p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)\right\}^{1/T_{m}}p_{\mathrm{prior}}(\beta,\gamma),

where

\displaystyle p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma)=\left\{F(z_{i}^{\top}\gamma)\right\}^{h_{i}}\left\{1-F(z_{i}^{\top}\gamma)\right\}^{1-h_{i}}\left\{F(x_{i}^{\top}\beta)\right\}^{y_{i}}\left\{1-F(x_{i}^{\top}\beta)\right\}^{h_{i}-y_{i}},

for $y_{i}\leq h_{i}$ . We assume Gaussian priors: $\beta\sim\mathcal{N}(b_{0},B_{0})$ , and $\gamma\sim\mathcal{N}(g_{0},G_{0})$ .

C.1 Pólya-Gamma Gibbs Sampling Step

For each replica $m$ , we perform Gibbs sampling using the following steps.

Step 1. Updating $h_{i}$ .

If $y_{i}=1$ , then $h_{i}$ is deterministically set to $1$ . For observations with $y_{i}=0$ , we sample $h_{i}$ from

\displaystyle h_{i}\mid\beta,\gamma,y_{i}=0\sim\operatorname{Bernoulli}(\mu_{i}^{(T_{m})}),

where

	$\displaystyle\mu_{i}^{(T_{m})}$	$\displaystyle=\frac{\left[F(z_{i}^{\top}\gamma)\{1-F(x_{i}^{\top}\beta)\}\right]^{1/T_{m}}}{\left[F(z_{i}^{\top}\gamma)\{1-F(x_{i}^{\top}\beta)\}\right]^{1/T_{m}}+\left[1-F(z_{i}^{\top}\gamma)\right]^{1/T_{m}}}$
		$\displaystyle=F\!\left(\frac{z_{i}^{\top}\gamma-\log\{1+\exp(x_{i}^{\top}\beta)\}}{T_{m}}\right).$
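As a quick numerical check of Step 1, the sketch below evaluates both expressions for $\mu_{i}^{(T_{m})}$ and draws the latent indicators. The function names and the argument layout (linear predictors passed as plain lists) are assumptions made for illustration.

```python
import math
import random

def F(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def mu_ratio(z_gamma, x_beta, T):
    """First expression: ratio of tempered terms for h = 1 versus h = 0."""
    a = (F(z_gamma) * (1.0 - F(x_beta))) ** (1.0 / T)
    b = (1.0 - F(z_gamma)) ** (1.0 / T)
    return a / (a + b)

def mu_closed(z_gamma, x_beta, T):
    """Second expression: F((z'gamma - log(1 + exp(x'beta))) / T)."""
    return F((z_gamma - math.log1p(math.exp(x_beta))) / T)

def update_h(y, z_gamma, x_beta, T, rng):
    """One draw of the latent indicators h at temperature T."""
    h = []
    for yi, zg, xb in zip(y, z_gamma, x_beta):
        if yi == 1:
            h.append(1)   # y = 1 is only possible when h = 1
        else:
            h.append(int(rng.random() < mu_closed(zg, xb, T)))
    return h

rng = random.Random(0)
h_draw = update_h([1, 0, 0, 1], [0.3, -0.5, 1.2, 0.0], [0.1, 0.4, -0.7, 2.0], 1.05, rng)
print(h_draw, mu_ratio(0.7, -0.2, 1.05), mu_closed(0.7, -0.2, 1.05))
```

The two expressions agree because the odds of $h_{i}=1$ against $h_{i}=0$ given $y_{i}=0$ equal $\exp(z_{i}^{\top}\gamma)/\{1+\exp(x_{i}^{\top}\beta)\}$ before tempering.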

Step 2. Updating $\gamma$ .

The tempered likelihood for $\gamma$ is

\displaystyle\prod_{i=1}^{n}\frac{\exp\left\{(h_{i}/T_{m})z_{i}^{\top}\gamma\right\}}{\{1+\exp(z_{i}^{\top}\gamma)\}^{1/T_{m}}}.

Introduce Pólya-Gamma auxiliary variables (Polson et al., 2013):

\displaystyle w_{i}\mid\gamma,h\sim\operatorname{PG}(T_{m}^{-1},z_{i}^{\top}\gamma),\quad i=1,\ldots,n,

where $\operatorname{PG}(\cdot,\cdot)$ denotes the Pólya-Gamma distribution. Specifically, a random variable $w\sim\operatorname{PG}(b,c)$ with $b>0$ and $c\in\mathbb{R}$ can be represented, in distribution, through independent Gamma random variables $v_{k}\sim\operatorname{Ga}(b,1)$ , $k=1,2,\ldots$ , as

\displaystyle w=\cfrac{1}{2\pi^{2}}\sum_{k=1}^{\infty}\cfrac{v_{k}}{\left(k-\frac{1}{2}\right)^{2}+\frac{c^{2}}{4\pi^{2}}}.
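The series representation suggests a simple approximate sampler obtained by truncating the infinite sum. The sketch below is an illustration only: the truncation level `n_terms` is an arbitrary assumption, truncation introduces a small downward bias, and this is not claimed to be the sampler used in the paper. The helper `pg_mean` uses the known mean of the Pólya-Gamma distribution, $\mathbb{E}[w]=\{b/(2c)\}\tanh(c/2)$, as a sanity check.

```python
import math
import random

def sample_pg_truncated(b, c, rng, n_terms=200):
    """Approximate draw from PG(b, c) by truncating the infinite sum at n_terms."""
    total = 0.0
    for k in range(1, n_terms + 1):
        v_k = rng.gammavariate(b, 1.0)   # v_k ~ Ga(b, 1): shape b, scale 1
        total += v_k / ((k - 0.5) ** 2 + c ** 2 / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)

def pg_mean(b, c):
    """Mean of PG(b, c): b / (2c) * tanh(c / 2), with limit b / 4 at c = 0."""
    return b / 4.0 if c == 0 else b / (2.0 * c) * math.tanh(c / 2.0)

rng = random.Random(1)
# In the tempered sampler, b would be 1 / T_m and c the linear predictor.
example = sample_pg_truncated(1.0 / 1.05, 0.8, rng)
print(example, pg_mean(1.0 / 1.05, 0.8))
```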

Let $Z=(z_{1},\ldots,z_{n})^{\top}\in\mathbb{R}^{n\times q}$ , $W=\operatorname{diag}(w_{1},\ldots,w_{n})$ , and

\displaystyle\kappa^{(\gamma)}=\left(\frac{h_{1}-1/2}{T_{m}},~\ldots,~\frac{h_{n}-1/2}{T_{m}}\right)^{\top}.

Then the full conditional distribution of $\gamma$ becomes Gaussian:

	$\displaystyle\gamma\mid h,w,z$	$\displaystyle\sim\mathcal{N}\!\left(g_{1}^{(T_{m})},G_{1}^{(T_{m})}\right),$
	$\displaystyle G_{1}^{(T_{m})}$	$\displaystyle=\left(Z^{\top}WZ+G_{0}^{-1}\right)^{-1},$
	$\displaystyle g_{1}^{(T_{m})}$	$\displaystyle=G_{1}^{(T_{m})}\left(Z^{\top}\kappa^{(\gamma)}+G_{0}^{-1}g_{0}\right).$
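The Gaussian full conditional of $\gamma$ can be computed as in the following NumPy sketch. The function names and the toy inputs are assumptions; in particular, the weights `w` below are placeholder positive numbers rather than genuine Pólya-Gamma draws.

```python
import numpy as np

def gamma_full_conditional(Z, h, w, T, g0, G0):
    """Mean g1 and covariance G1 of the Gaussian full conditional of gamma."""
    kappa = (h - 0.5) / T                          # kappa^(gamma)
    G0_inv = np.linalg.inv(G0)
    precision = Z.T @ (w[:, None] * Z) + G0_inv    # Z' W Z + G0^{-1}
    G1 = np.linalg.inv(precision)
    g1 = G1 @ (Z.T @ kappa + G0_inv @ g0)
    return g1, G1

def draw_gamma(Z, h, w, T, g0, G0, rng):
    """One draw of gamma from N(g1, G1)."""
    g1, G1 = gamma_full_conditional(Z, h, w, T, g0, G0)
    return rng.multivariate_normal(g1, G1)

# Toy inputs (hypothetical sizes and values).
rng = np.random.default_rng(0)
n, q = 50, 3
Z = np.column_stack([np.ones(n), rng.standard_normal((n, q - 1))])
h = rng.integers(0, 2, size=n).astype(float)
w = rng.uniform(0.1, 0.3, size=n)   # placeholder weights, not true PG draws
g0, G0 = np.zeros(q), 100.0 * np.eye(q)

g1, G1 = gamma_full_conditional(Z, h, w, T=1.05, g0=g0, G0=G0)
gamma_draw = draw_gamma(Z, h, w, 1.05, g0, G0, rng)
print(g1, gamma_draw)
```

The update of $\beta$ in Step 3 has the same algebraic form, with $Z$, $h$, and the prior $(g_{0},G_{0})$ replaced by $X_{h}$, $y_{h}$, and $(b_{0},B_{0})$.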

Step 3. Updating $\beta$ .

Let $I_{h}=\{i:h_{i}=1\}$ , let $X_{h}$ be the submatrix of $X=(x_{1},\ldots,x_{n})^{\top}\in\mathbb{R}^{n\times p}$ indexed by $I_{h}$ , and let $y_{h}$ be the corresponding subvector of $y$ . The tempered likelihood of $\beta$ is:

\displaystyle\prod_{i\in I_{h}}\frac{\exp\left\{(y_{i}/T_{m})x_{i}^{\top}\beta\right\}}{\{1+\exp(x_{i}^{\top}\beta)\}^{1/T_{m}}}.

Introduce independent Pólya-Gamma variables

\displaystyle\omega_{i}\mid\beta,h,y\sim\operatorname{PG}\!\left(T_{m}^{-1},x_{i}^{\top}\beta\right),\quad i\in I_{h},

and define $\Omega_{h}=\operatorname{diag}(\omega_{i}:i\in I_{h})$ and

\displaystyle\kappa^{(\beta)}=\left(\frac{y_{i}-1/2}{T_{m}}:i\in I_{h}\right)^{\top}.

Then the full conditional distribution of $\beta$ is again Gaussian:

	$\displaystyle\beta\mid y_{h},X_{h},\Omega_{h},h$	$\displaystyle\sim\mathcal{N}\!\left(b_{1}^{(T_{m})},B_{1}^{(T_{m})}\right),$
	$\displaystyle B_{1}^{(T_{m})}$	$\displaystyle=\left(X_{h}^{\top}\Omega_{h}X_{h}+B_{0}^{-1}\right)^{-1},$
	$\displaystyle b_{1}^{(T_{m})}$	$\displaystyle=B_{1}^{(T_{m})}\left(X_{h}^{\top}\kappa^{(\beta)}+B_{0}^{-1}b_{0}\right).$

C.2 Replica Exchange Step

After a fixed number of Gibbs iterations, we attempt to swap the states of two neighboring replicas with adjacent temperatures $T_{m}$ and $T_{m+1}$ . Let the current states be denoted by

	$\displaystyle\theta^{(m)}$	$\displaystyle=(\beta^{(m)},\gamma^{(m)},h^{(m)}),$
	$\displaystyle\theta^{(m+1)}$	$\displaystyle=(\beta^{(m+1)},\gamma^{(m+1)},h^{(m+1)}).$

The acceptance probability for the swap proposal is given by

\alpha=\min\left\{1,~\frac{\pi_{T_{m}}(\theta^{(m+1)}\mid y,x,z)~\pi_{T_{m+1}}(\theta^{(m)}\mid y,x,z)}{\pi_{T_{m}}(\theta^{(m)}\mid y,x,z)~\pi_{T_{m+1}}(\theta^{(m+1)}\mid y,x,z)}\right\}.

(9)

In practice, we compute the log acceptance ratio as

\displaystyle\log\alpha=\min\left\{0,~\left(\frac{1}{T_{m}}-\frac{1}{T_{m+1}}\right)\left[\log\tilde{L}\left(\theta^{(m+1)}\right)-\log\tilde{L}\left(\theta^{(m)}\right)\right]\right\},

where

\displaystyle\tilde{L}(\theta):=\prod_{i=1}^{n}p(y_{i},h_{i}\mid x_{i},z_{i},\beta,\gamma).
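The log acceptance ratio can be evaluated directly from the complete-data log-likelihoods of the two replicas, since the prior terms cancel in (9). The sketch below is a minimal illustration with hypothetical function names. The temperature ladder is indexed from zero here so that the coldest temperature equals one; that indexing is an assumption of the sketch.

```python
import math

def log_F(t):
    """log F(t) for the logistic function F, written via log1p."""
    return -math.log1p(math.exp(-t))

def log_complete_likelihood(y, h, x_beta, z_gamma):
    """log of the product over i of p(y_i, h_i | x_i, z_i, beta, gamma)."""
    total = 0.0
    for yi, hi, xb, zg in zip(y, h, x_beta, z_gamma):
        # log 1 - F(t) equals log F(-t).
        total += hi * log_F(zg) + (1 - hi) * log_F(-zg)
        total += yi * log_F(xb) + (hi - yi) * log_F(-xb)
    return total

def log_swap_acceptance(T_m, T_next, loglik_m, loglik_next):
    """log alpha for swapping the states at temperatures T_m and T_next."""
    return min(0.0, (1.0 / T_m - 1.0 / T_next) * (loglik_next - loglik_m))

# Geometric ladder with ratio 1.05 and 20 replicas (zero-based exponent assumed).
temperatures = [1.05 ** m for m in range(20)]

# Example: the hotter replica holds a state with lower likelihood.
log_alpha = log_swap_acceptance(temperatures[0], temperatures[1], -100.0, -121.0)
print(log_alpha, math.exp(log_alpha))
```

A swap that moves a higher-likelihood state to the colder replica is always accepted, whereas the reverse move is accepted with probability $\exp(\log\alpha)<1$.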

C.3 Overall Algorithm

The full algorithm alternates between the Gibbs sampling step and the replica exchange step as follows:

1. For each replica $m$ , perform Gibbs sampling to update $h$ , $\gamma$ , and $\beta$ from the full conditionals above under its corresponding temperature $T_{m}$ .

2. Every fixed number of iterations, attempt to swap the states of neighboring replicas $m$ and $m+1$ according to the acceptance probability given in (9).

3. Retain samples from the chain with $T_{1}=1$ as draws from the target posterior distribution.
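The three steps above can be sketched as a generic loop. The code below is an illustration under stated assumptions and not the authors' implementation: the per-replica update and the log-likelihood are passed in as callables, all adjacent pairs are proposed in a single sweep (the pairing scheme is not specified in the text), and the toy target replaces the zero-inflated model so that the example runs on its own.

```python
import math
import random

def run_replica_exchange(states, temps, gibbs_step, loglik, n_iter, swap_every, rng):
    """Alternate per-replica updates with neighbor swaps; return cold-chain draws."""
    states = list(states)
    cold_draws = []
    for it in range(1, n_iter + 1):
        # 1. Update each replica under its own temperature.
        states = [gibbs_step(s, T, rng) for s, T in zip(states, temps)]
        # 2. Periodically propose swaps between adjacent temperatures.
        if it % swap_every == 0:
            for m in range(len(temps) - 1):
                diff = loglik(states[m + 1]) - loglik(states[m])
                log_alpha = min(0.0, (1.0 / temps[m] - 1.0 / temps[m + 1]) * diff)
                if rng.random() < math.exp(log_alpha):
                    states[m], states[m + 1] = states[m + 1], states[m]
        # 3. Keep only the replica at temperature 1.
        cold_draws.append(states[0])
    return cold_draws

# Toy target (assumption): log-likelihood -s^2 / 2 with a flat prior, so the
# tempered target at temperature T is N(0, T) and can be drawn exactly.
def toy_step(state, T, rng):
    return rng.gauss(0.0, math.sqrt(T))

def toy_loglik(state):
    return -0.5 * state * state

rng = random.Random(0)
temps = [1.05 ** m for m in range(20)]   # coldest temperature equals 1
draws = run_replica_exchange([0.0] * 20, temps, toy_step, toy_loglik, 2000, 50, rng)
print(sum(draws) / len(draws))
```

For the zero-inflated model, `gibbs_step` would perform Steps 1 to 3 of Appendix C.1 on a state $(\beta,\gamma,h)$, and `loglik` would return $\log\tilde{L}(\theta)$.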

Appendix D Trace Plots and Posterior Histograms

We provide the trace plots for Scenario 1 as Figure 4, Scenario 2 as Figure 6, and Scenario 3 as Figure 8. Furthermore, we provide the posterior histograms for Scenario 1 as Figure 5, Scenario 2 as Figure 7, and Scenario 3 as Figure 9.

References

A. Albert and J. A. Anderson (1984) On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71 (1), pp. 1–10.
D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics 8, pp. 1027–1035.
J. Bootkrajang and A. Kabán (2012) Label-noise robust logistic regression and its applications. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 143–158.
J. Bootkrajang and A. Kabán (2013) Classification of mislabelled microarrays using robust sparse logistic regression. Bioinformatics 29 (7), pp. 870–877.
Centers for Disease Control and Prevention (CDC) (2017) National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention.
A. Diop, A. Diop, and J.-F. Dupuy (2011) Maximum likelihood estimation in the logistic regression model with a cure fraction. Electronic Journal of Statistics 5, pp. 460–483.
S. Frühwirth-Schnatter (2006) Finite mixture and Markov switching models. Springer, New York.
H. Fujisawa and S. Eguchi (2008) Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis 99 (9), pp. 2053–2081.
D. B. Hall (2000) Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics 56 (4), pp. 1030–1039.
H. Hung, Z.-Y. Jou, and S.-Y. Huang (2018) Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics 74 (1), pp. 145–154.
O. Komori, S. Eguchi, S. Ikeda, H. Okamura, M. Ichinokawa, and S. Nakayama (2016) An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution 7 (2), pp. 249–260.
N. Nagelkerke and V. Fidler (2015) Estimating a logistic discrimination functions when one of the training samples is subject to misclassification: a maximum likelihood approach. PLoS One 10 (10), pp. e0140718.
N. G. Polson, J. G. Scott, and J. Windle (2013) Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association 108 (504), pp. 1339–1349.
M. J. Silvapulle (1981) On the existence of maximum likelihood estimators for the binomial response models. Journal of the Royal Statistical Society: Series B 43 (3), pp. 310–313.
R. H. Swendsen and J.-S. Wang (1986) Replica Monte Carlo simulation of spin-glasses. Physical Review Letters 57 (21), pp. 2607.
H. Teicher (1963) Identifiability of finite mixtures. The Annals of Mathematical Statistics 34 (4), pp. 1265–1269.
H. Wainer, E. T. Bradlow, and X. Wang (2007) Testlet response theory and its applications. Cambridge University Press.
S. J. Yakowitz and J. D. Spragins (1968) On the identifiability of finite mixtures. The Annals of Mathematical Statistics 39 (1), pp. 209–214.
