Zero-Inflated Logistic Regression Models with Shared Design: Identifiability, Existence of Estimates, and a Relabeling Rule
Abstract
The zero-inflated logistic regression model accommodates binary responses with excess zeros, which often arise from a latent mixture of susceptible and insusceptible subpopulations or asymmetric misclassification of the response. The model has two components: regression for the binary response and a latent binary indicator for the zero-inflation state. In applied settings, it is common to use the same design matrix for both components if there is no prior knowledge. However, this shared-design specification lacks guaranteed identifiability of the regression parameters, as established in prior works. This paper investigates the theoretical properties of the zero-inflated logistic regression model under the shared-design setting and computational methods for applications. First, to motivate the use of the zero-inflated model, we prove that ignoring the zero-inflation mechanism can lead to a sign flip in the pseudo-true coefficient value relative to the true value. We then establish sufficient conditions for the existence of the maximum likelihood estimate. As a main result, we establish that the model under the shared-design setting is identifiable up to exchange symmetry of the parameters for two components and that the expected log-likelihood has a unique maximizer on the resulting quotient space. The posterior bimodality is examined using a Pólya-Gamma Gibbs sampler with replica exchange. Finally, we propose a simple relabeling rule to select a single ordered parameter pair, and evaluate its performance through simulation studies and an application to self-reported diabetes data.
Keywords: Asymmetric misclassification; Data separation; Exchange symmetry; Pólya-Gamma augmentation; Replica exchange.
1 Introduction
The logistic regression model is a standard tool for binary outcomes and remains attractive due to its simplicity and interpretability. However, in many applied settings, the observed response contains more zeros than a standard logistic model can accommodate. For example, such excess zeros may arise from a latent mixture of susceptible and insusceptible subpopulations, such as biological immunity, or one-sided outcome misclassification, such as the failure to record an event due to delayed reporting. To accommodate these situations, a natural extension is a zero-inflated logistic regression model (Hall, 2000; Diop et al., 2011). This model expresses the observed response as a mixture of a standard logistic regression and a structural zero, where the latent binary indicator for the zero-inflation state is itself modeled via a logistic regression. We refer to these two regression components as the ordinary logistic regression component and the structural-zero component, respectively. This formulation captures complex data-generating mechanisms while retaining interpretability.
Earlier work includes the three-parameter logistic model in psychometrics and related methods in ecology and epidemiology (Wainer et al., 2007; Komori et al., 2016; Nagelkerke and Fidler, 2015). Furthermore, robust estimation approaches based on label-noise modeling or robust divergence have also been studied (Bootkrajang and Kabán, 2012, 2013; Hung et al., 2018; Fujisawa and Eguchi, 2008). From the perspective of excess zeros, these settings are closely connected: positive outcomes that are systematically recorded as zeros induce a structural-zero mechanism in the observed data. From the theoretical perspective, Diop et al. (2011) established sufficient conditions for identifiability of the zero-inflated logistic regression model. One such condition requires at least one continuous covariate that appears in one component but not in the other.
Although the zero-inflated logistic model has been explored practically and theoretically, several gaps remain in the understanding of its theoretical properties. In particular, this study focuses on the practically important scenario where no reliable prior information is available regarding which covariates should enter each component. A common empirical choice is then to use the same design matrix in both components of the model. In this case, the model faces a theoretical difficulty: once the two components share the same covariates, exchanging the two coefficient vectors leaves the likelihood unchanged. The model is therefore not identifiable as an ordered pair, and both optimization and posterior simulation may exhibit symmetric multiple modes. This symmetry is analogous to the label-switching phenomenon in mixture models (Frühwirth-Schnatter, 2006). However, the structure of this non-identifiability has yet to be characterized.
Another theoretical issue concerns the existence of the maximum likelihood estimate. For ordinary logistic regression, non-existence of the estimate under separation is well-known (Albert and Anderson, 1984; Silvapulle, 1981). For the zero-inflated logistic model, however, the analogous conditions have received less attention.
Our contributions are fourfold. First, we prove that model misspecification can lead to a sign flip in the pseudo-true parameter relative to the true coefficient. Second, we introduce a double-separation condition for the zero-inflated logistic model and derive sufficient conditions for the existence of the estimates. Third, we establish that the model under a shared design matrix is identifiable up to exchange symmetry and that, under mild regularity conditions, the expected log-likelihood has a unique maximizer on the resulting quotient space. Furthermore, we investigate the resulting multimodality numerically through a Pólya-Gamma Gibbs sampler integrated with replica exchange (Polson et al., 2013; Swendsen and Wang, 1986). Fourth, we propose a relabeling rule based on an estimate from the ordinary logistic regression model and conduct a numerical study.
The remainder of the paper is structured as follows. Section 2 introduces the model, studies the sign-flip phenomenon under misspecification, and presents sufficient conditions for the existence and non-existence of the estimates. Section 3 studies identifiability under the shared design setting, formalizes the exchange-symmetry structure, and provides a numerical illustration of the resulting posterior bimodality. Section 4 proposes a relabeling rule and conducts a simulation study. Section 5 presents an illustrative application to NHANES self-reported diabetes data. Section 6 discusses limitations and future work. Section 7 concludes the paper.
2 Zero-inflated Logistic Regression Model
In this section, we define the zero-inflated logistic regression model and study two basic aspects of the model before turning to the shared-design case. First, we examine the consequence of misspecification when zero inflation is ignored and a standard logistic regression model is fitted instead. Second, we consider the existence of maximum likelihood estimates and introduce separation-type conditions that provide sufficient criteria for non-existence and existence.
2.1 Model Definition
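Let $y_i \in \{0, 1\}$ ($i = 1, \dots, n$) denote the binary responses for $n$ observations, and let $x_i \in \mathbb{R}^p$ denote the covariate vector. Let $\beta \in \mathbb{R}^p$ be the corresponding parameter vector. We assume that a latent variable $z_i \in \{0, 1\}$ determines whether the response is constrained to be zero, where $z_i = 0$ indicates the zero state. Let $w_i \in \mathbb{R}^q$ denote the covariate vector for $z_i$, and let $\gamma \in \mathbb{R}^q$ denote the corresponding parameter vector. Let $\sigma$ be the inverse logit function:
$$\sigma(t) = \frac{1}{1 + e^{-t}}.$$
Then, the zero-inflated logistic regression model is defined as
$$z_i \sim \mathrm{Bernoulli}(p_i), \qquad y_i \mid z_i \sim \mathrm{Bernoulli}(z_i \pi_i),$$
where $\pi_i = \sigma(x_i^\top \beta)$ and $p_i = \sigma(w_i^\top \gamma)$. We refer to the logistic regression components $\pi_i$ and $p_i$ as the ordinary logistic regression component and the structural-zero component, respectively. In this context, $p_i$ represents the probability that the $i$-th observation is not a structural zero, corresponding to the probability of susceptibility to the event. Then, we have
$$P(y_i = 1 \mid x_i, w_i) = \pi_i p_i = \sigma(x_i^\top \beta)\, \sigma(w_i^\top \gamma).$$
Therefore, the log-likelihood function is defined as
$$\ell(\beta, \gamma) = \sum_{i=1}^{n} \left[ y_i \log(\pi_i p_i) + (1 - y_i) \log(1 - \pi_i p_i) \right].$$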
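As a minimal sketch of how this log-likelihood and its analytical gradient could be evaluated, the following NumPy code assumes the product form in which the event probability is the inverse logit of the ordinary component times the inverse logit of the structural-zero component. The names `zilr_loglik`, `zilr_grad`, `X`, `W`, `beta`, and `gamma` are illustrative choices and not taken from the paper.

```python
import numpy as np

def sigmoid(t):
    """Inverse logit, computed through logaddexp to avoid overflow."""
    return np.exp(-np.logaddexp(0.0, -np.asarray(t, dtype=float)))

def zilr_loglik(beta, gamma, X, W, y):
    """Log-likelihood assuming P(y=1) = sigmoid(X @ beta) * sigmoid(W @ gamma)."""
    a, b = X @ beta, W @ gamma
    log_p = -np.logaddexp(0.0, -a) - np.logaddexp(0.0, -b)
    # 1 - p = (e^{-a} + e^{-b} + e^{-a-b}) / ((1 + e^{-a}) (1 + e^{-b}))
    log_1mp = np.logaddexp(np.logaddexp(-a, -b), -a - b) + log_p
    return float(np.sum(y * log_p + (1 - y) * log_1mp))

def zilr_grad(beta, gamma, X, W, y):
    """Analytical gradient with respect to (beta, gamma).

    Assumes p stays numerically below 1, so that 1 - p is nonzero.
    """
    sa, sb = sigmoid(X @ beta), sigmoid(W @ gamma)
    p = sa * sb
    r = (y - p) / (1.0 - p)  # common factor of both partial derivatives
    return X.T @ ((1.0 - sa) * r), W.T @ ((1.0 - sb) * r)
```

Working on the log scale keeps the evaluation stable when linear predictors are large in magnitude, which is relevant near the separation-type configurations discussed in Section 2.3.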
2.2 Sign-Flip Phenomenon Under Misspecification
To motivate the use of the zero-inflated logistic regression model, we examine the consequences of ignoring zero-inflation and fitting a standard logistic regression model to data generated from the zero-inflated model. When the standard logistic regression model is applied to data with excess zeros, the resulting estimator can be severely biased. In particular, when a covariate is positively associated with the response but negatively associated with the latent indicator (or vice versa), the fitted logistic regression model may estimate a regression coefficient whose sign is opposite to the underlying true value. In this subsection, we formalize this phenomenon under a random-design setting.
Let and denote covariate vectors in the ordinary logistic regression component and the structural-zero component, respectively, and suppose that they share a common covariate indexed by :
where , , and the th covariate is shared between the two vectors: . The corresponding regression coefficients are decomposed as
Without loss of generality, we focus on a covariate that has a positive effect on but a negative effect on , and let
Under the zero-inflated logistic regression model, the conditional event occurrence probability can be written as
We then consider the ordinary logistic regression model where the conditional event occurrence probability is specified as . Here, is the regression coefficient vector. Under the true zero-inflated model, the expected log-likelihood function of this misspecified logistic regression is given by
where the expectation is taken with respect to the joint distribution of .
We then define the profile objective function
and let . The following theorem shows that, when the magnitude of the negative association in the zero-inflation component is sufficiently large, every pseudo-true value of the th coefficient is negative.
Theorem 2.1.
See Appendix A.1 for the proof. This theorem implies that, although the true coefficient of the shared covariate in the ordinary logistic regression component is positive, every pseudo-true value of the same coefficient under the misspecified ordinary logistic regression model is negative when the coefficient of that covariate in the structural-zero component is negative and sufficiently large in magnitude. Specifically, any choice satisfies .
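The sign flip can be illustrated numerically. The sketch below is not the construction used in the proof: it approximates the pseudo-true value by maximizing the sample average of the expected log-likelihood of a misspecified logistic regression over a large covariate sample, with one standard normal covariate. The coefficient values in `beta` and `gamma` are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def sigmoid(t):
    """Inverse logit, computed through logaddexp to avoid overflow."""
    return np.exp(-np.logaddexp(0.0, -np.asarray(t, dtype=float)))

def pseudo_true_logistic(X, p, n_iter=50):
    """Newton iterations maximizing mean(p*log(mu) + (1-p)*log(1-mu)),
    where mu = sigmoid(X @ b) and p is the true event probability."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ b)
        grad = X.T @ (p - mu)
        hess = (X * (mu * (1.0 - mu))[:, None]).T @ X
        b = b + np.linalg.solve(hess, grad)
    return b

rng = np.random.default_rng(1)
x = rng.normal(size=20000)
X = np.column_stack([np.ones_like(x), x])

beta = np.array([0.0, 0.5])    # hypothetical: positive effect on the event
gamma = np.array([1.0, -3.0])  # hypothetical: strong negative effect on susceptibility

p_true = sigmoid(X @ beta) * sigmoid(X @ gamma)  # event probability under zero inflation
b_star = pseudo_true_logistic(X, p_true)
print("true slope:", beta[1], "pseudo-true slope:", b_star[1])
```

Using the true probabilities as fractional responses removes Monte Carlo noise in the response, so the sign of the fitted slope reflects misspecification alone.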
2.3 Maximum Likelihood Estimation
The maximum likelihood estimator (MLE) is defined as any maximizer of the log-likelihood function. However, due to the structure of the zero-inflated logistic model, the existence of an estimate is not always guaranteed. To investigate this issue, we introduce the concept of double separation, an analogue of the separation condition for standard logistic regression.
Definition 2.1 (Double separation).
The dataset is said to satisfy double separation if there exist non-zero vectors and such that for every ,
and either of the inequalities is strict for at least one observation: there exists such that either with or with .
Proposition 2.2.
If the dataset satisfies double separation, then the log-likelihood has no maximizer in .
See Appendix A.2 for the proof.
Furthermore, we introduce the following condition to guarantee the existence of a maximum likelihood estimate.
Definition 2.2 (–Double–non-separation).
The dataset is said to satisfy –double–non-separation if there exists a constant such that
We then establish the following proposition, which provides a sufficient condition for the existence of an estimate.
Proposition 2.3.
If the dataset satisfies –double–non-separation for some , then a maximizer of exists.
See Appendix A.3 for the proof.
The –double–non-separation condition implies that for any unit direction , there exists at least one observation located at a signed distance of at least from the separating hyperplane associated with that direction. In other words, the dataset maintains a uniform margin of width that prevents double separation.
Even when –double–non-separation holds with close to zero, the surface of the log-likelihood function may be nearly flat along certain directions, and numerical optimization may be unstable in practice. In such settings, penalized estimation approaches may be useful.
3 Shared-Design Model
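We now focus on the case $p = q$ and $x_i = w_i$ for all $i$. This is the configuration that arises when the analyst has no prior information with which to distinguish covariates for the ordinary logistic regression component from covariates for the structural-zero component and therefore uses the same design matrix in both components of the model. In that case, the conditional probability of $y_i = 1$ is expressed as
$$P(y_i = 1 \mid x_i) = \sigma(x_i^\top \beta)\, \sigma(x_i^\top \gamma)$$
for $i = 1, \dots, n$. We refer to this model as a shared-design model. The log-likelihood function is
$$\ell(\beta, \gamma) = \sum_{i=1}^{n} \left[ y_i \log\{\sigma(x_i^\top \beta)\, \sigma(x_i^\top \gamma)\} + (1 - y_i) \log\{1 - \sigma(x_i^\top \beta)\, \sigma(x_i^\top \gamma)\} \right],$$
which satisfies the exchange symmetry
$$\ell(\beta, \gamma) = \ell(\gamma, \beta).$$
Motivated by this symmetry, we define the equivalence relation
$$(\beta, \gamma) \sim (\beta', \gamma') \iff (\beta', \gamma') \in \{(\beta, \gamma), (\gamma, \beta)\}$$
and define $[(\beta, \gamma)]$ for the corresponding equivalence class.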
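The exchange symmetry can be checked numerically. The following sketch assumes the product form of the event probability under a shared design matrix `X`. The helper `canonical_class` is a hypothetical device for representing an equivalence class by a sorted pair and is not part of the paper.

```python
import numpy as np

def shared_loglik(beta, gamma, X, y):
    """Shared-design log-likelihood, assuming
    P(y=1 | x) = sigmoid(x @ beta) * sigmoid(x @ gamma)."""
    a, b = X @ beta, X @ gamma
    log_p = -np.logaddexp(0.0, -a) - np.logaddexp(0.0, -b)
    log_1mp = np.logaddexp(np.logaddexp(-a, -b), -a - b) + log_p
    return float(np.sum(y * log_p + (1 - y) * log_1mp))

def canonical_class(beta, gamma):
    """Order-free representative of {(beta, gamma), (gamma, beta)}:
    the two vectors sorted lexicographically."""
    first, second = sorted([tuple(map(float, beta)), tuple(map(float, gamma))])
    return first, second

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.integers(0, 2, size=30)
b = np.array([0.2, 1.0, -0.5])
g = np.array([-1.0, 0.3, 0.8])

# Swapping the two coefficient vectors leaves the log-likelihood unchanged.
print(shared_loglik(b, g, X, y), shared_loglik(g, b, X, y))
```

Comparing two fits through `canonical_class` rather than as ordered pairs avoids treating the two symmetric solutions as different estimates.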
3.1 Prior Work on Identifiability
Diop et al. (2011) established sufficient conditions for identifiability of the zero-inflated logistic regression model. A key condition in their argument is the availability of a continuous covariate that appears in one component but not the other. When the same covariates are used in both components, this source of asymmetry disappears. Therefore, the shared-design setting is a boundary case in which the ordinary guarantee of identifiability fails.
3.2 Identifiability under Shared Design
We formulate identifiability of the shared design model. Let where . We first consider the following support condition.
- (C1) The support of contains a nonempty open subset .
Under condition (C1), we obtain the following basic identifiability result.
Proposition 3.1.
Suppose that condition (C1) holds. Let and , and suppose that , , are pairwise distinct. Suppose that for two parameter pairs and ,
| (1) |
holds for all and all for . Then, we have either or .
We next extend Proposition 3.1 to a mixed support for continuous and discrete covariates. Let , where and with , , and . Here, denotes the subvector of continuously distributed covariates, and denotes the subvector of discretely distributed covariates. We consider the following support condition.
- (C2) There exists a subset such that , and, for each , the conditional support of given contains a nonempty open subset .
When , condition (C2) reduces to condition (C1). Let and , corresponding to the decomposition of into for . Under condition (C2), we obtain the following extended result.
Theorem 3.2.
Suppose that condition (C2) holds. Let and , and suppose that , , are pairwise distinct. Suppose that for two parameter pairs and , (1) holds for all and all for and . Then, we have either or .
We now establish the following identifiability result.
Corollary 3.3.
Suppose that condition (C2) holds, and suppose that the shared-design model is correctly specified with true parameter pair . Let and , where and , and suppose that , , are pairwise distinct. Then the expected log-likelihood
is uniquely maximized on at the class .
See Appendix A.4 for the proofs of Proposition 3.1, Theorem 3.2, and Corollary 3.3. Proposition 3.1 gives the basic identifiability result under a fully continuous support, while Theorem 3.2 extends it to a mixed support for continuous and discrete covariates. Corollary 3.3 clarifies the inferential target under the shared-design setting. At the population level, the equivalence class is uniquely identified, yet the model cannot distinguish which component corresponds to the event occurrence process and which to the structural-zero process. These results establish identifiability over the support of and do not imply the uniqueness of the likelihood maximizer in finite samples. Consequently, resolving this label ambiguity requires either external information or some relabeling rule.
These results also relate the shared-design model to the identifiability of finite mixture models (Teicher, 1963; Yakowitz and Spragins, 1968). In this context, identifiability is formulated as the uniqueness of the finite-mixture representation. Therefore, under an ordered-component parameterization, this corresponds to uniqueness up to permutation of component labels. In the present model, the relevant symmetry is not a literal permutation of mixture components but the exchange symmetry of the pair of parameter vectors for the two components. In this sense, the equivalence class is the natural inferential target.
The condition that , , and are pairwise distinct, which is inherited from Theorem 3.2, is a condition on the true parameter vector. This is a generic condition and can be assumed to hold in practice.
3.3 Numerical Confirmation of Bimodality
To illustrate the exchange symmetry established in Theorem 3.2 and Corollary 3.3, we examined the posterior distribution of under the shared-design setting via a Markov chain Monte Carlo (MCMC) algorithm. We set the number of non-intercept covariates as (), and considered three covariate designs: Scenario 1: all covariates were drawn from independent standard normal distributions; Scenario 2: all covariates were drawn from independent distributions; Scenario 3: the first two non-intercept covariates were drawn from independent standard normal distributions, and the remaining ones were from independent . The sample size was . The data-generating mechanism and sampling setup are described in Appendix B. We developed the Pólya-Gamma Gibbs sampler with replica exchange, which is detailed in Appendix C (Polson et al., 2013; Swendsen and Wang, 1986).
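The swap step of a replica exchange scheme can be sketched as follows. This is a generic illustration and not the sampler of Appendix C: it assumes that each chain targets the prior times the likelihood raised to an inverse temperature `temp`, in which case the prior cancels from the swap ratio. The function names and the tempering scheme are assumptions.

```python
import math
import random

def swap_log_ratio(loglik_i, loglik_j, temp_i, temp_j):
    """Log Metropolis ratio for exchanging the states of two chains whose
    targets are proportional to prior * likelihood**temp (assumed form)."""
    return (temp_i - temp_j) * (loglik_j - loglik_i)

def attempt_swap(states, logliks, temps, i, j, rng=random):
    """Propose exchanging the states of chains i and j; return True if accepted."""
    log_r = swap_log_ratio(logliks[i], logliks[j], temps[i], temps[j])
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always finite
    if math.log(u) <= min(0.0, log_r):
        states[i], states[j] = states[j], states[i]
        logliks[i], logliks[j] = logliks[j], logliks[i]
        return True
    return False
```

Chains at lower inverse temperature see a flatter target and can cross between the two symmetric modes. Accepted swaps pass such crossings to the chain at inverse temperature one.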
Figure 1 displays the principal component analysis (PCA) plots of the posterior samples after $k$-means++ clustering with (Arthur and Vassilvitskii, 2007). Under continuous and mixed designs (Scenarios 1 and 3), the posterior exhibited clear bimodality, with two well-separated clusters. Under the binary design (Scenario 2), in which all non-intercept covariates take values in the bounded discrete set , the two clusters were not clearly distinguished in the PCA plots. This result is consistent with the failure of the binary design to satisfy condition (C2), so that identifiability is not guaranteed by Corollary 3.3. Posterior means of each cluster and further numerical details are provided in Appendices B and D.
4 Relabeling Rule
Section 3 shows that, under the shared-design setting, the likelihood identifies only the equivalence class . In many applications, however, investigators still need a single ordered pair because subsequent interpretation is conducted in terms of the associations with event occurrence or zero-inflation such as zeros due to misclassification. For that practical purpose, we introduce a simple relabeling rule.
4.1 Proposed Rule
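The rule proceeds as follows.
1. Fit a standard logistic regression of $y$ on $x$ and denote the resulting coefficient vector by $\hat{\beta}_{\mathrm{LR}}$.
2. Fit the zero-inflated logistic regression model with shared design and obtain one ordered maximizer $(\hat{\beta}, \hat{\gamma})$.
3. Form the exchange-symmetric solution $(\hat{\gamma}, \hat{\beta})$.
4. Choose the pair whose first component vector is closer to $\hat{\beta}_{\mathrm{LR}}$; that is, choose $(\hat{\beta}, \hat{\gamma})$ if $\|\hat{\beta} - \hat{\beta}_{\mathrm{LR}}\| \le \|\hat{\gamma} - \hat{\beta}_{\mathrm{LR}}\|$, and $(\hat{\gamma}, \hat{\beta})$ otherwise.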
Note that, by Theorem 2.1, the estimator of the ordinary logistic regression can be biased under misspecification. Therefore, while the ordinary logistic regression estimate serves as a convenient reference, it should not be interpreted as a definitive benchmark for the ordinary logistic regression component of the zero-inflated model. Whenever prior knowledge is available to distinguish the two components, that information should take precedence over this rule.
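The selection step of the rule can be sketched as below. The Euclidean norm is an assumption, since the distance used by the rule is not specified here, and the function name `relabel` is illustrative. The example vectors are the rounded estimates reported in Table 5 of Section 5.

```python
import numpy as np

def relabel(first_hat, second_hat, beta_lr):
    """Return the ordered pair whose first vector is closer to the ordinary
    logistic regression estimate beta_lr (Euclidean distance, an assumption)."""
    first_hat, second_hat, beta_lr = (
        np.asarray(v, dtype=float) for v in (first_hat, second_hat, beta_lr)
    )
    d_keep = np.linalg.norm(first_hat - beta_lr)
    d_swap = np.linalg.norm(second_hat - beta_lr)
    if d_keep <= d_swap:
        return first_hat, second_hat
    return second_hat, first_hat

# Rounded values from Table 5: the two vectors of solution A and the
# ordinary logistic regression estimate.
a_first = [1.250, -0.413, 1.085, -1.251, 0.647, -0.500]
a_second = [-3.192, 0.138, 0.179, 2.300, 0.371, -0.223]
beta_lr = [-3.444, -0.111, 0.627, 1.474, 0.590, -0.431]

chosen_first, chosen_second = relabel(a_first, a_second, beta_lr)
print(chosen_first)
```

With these inputs the function returns the exchanged pair, which corresponds to solution B in Section 5.3.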
4.2 Simulation Study
We next investigate the behavior of the proposed relabeling rule through a simulation study.
Simulation Design.
We considered four scenarios that differ only in the intercept of the structural-zero component, hence, in the probability of zero inflation. The coefficient vector for the ordinary logistic regression component was fixed at
and the structural-zero coefficient was
We used four values of . Specifically, we considered: (i) Very Low Mislabel (), yielding approximately structural zeros; (ii) Low Mislabel (), yielding approximately structural zeros; (iii) Moderate Mislabel (), yielding approximately structural zeros; and (iv) High Mislabel (), yielding approximately structural zeros. These scenarios were determined to span the range of zero-inflation levels commonly encountered in applications, from nearly negligible to substantial proportions of structural zeros.
For each scenario, we generated observations with covariates including an intercept. The first element of was , and the remaining elements were drawn independently from the standard normal distributions. Responses were generated from the zero-inflated logistic regression model as follows:
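A data-generating step of this kind can be sketched as follows, assuming the latent susceptibility indicator and the event indicator are drawn independently given the covariates. The function name, sample size, and coefficient values are hypothetical and do not reproduce the settings of this study.

```python
import numpy as np

def sigmoid(t):
    """Inverse logit, computed through logaddexp to avoid overflow."""
    return np.exp(-np.logaddexp(0.0, -np.asarray(t, dtype=float)))

def simulate_zilr(n, beta, gamma, rng):
    """Draw (X, y, susceptible) from a shared-design zero-inflated logistic
    model: an intercept plus independent standard normal covariates."""
    p = len(beta)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    susceptible = rng.random(n) < sigmoid(X @ gamma)  # not a structural zero
    event = rng.random(n) < sigmoid(X @ beta)         # ordinary logistic outcome
    y = (susceptible & event).astype(int)             # observed response
    return X, y, susceptible

rng = np.random.default_rng(0)
beta = np.array([0.5, 1.0, -1.0])   # hypothetical values
gamma = np.array([1.0, 0.5, 0.0])   # hypothetical values
X, y, susceptible = simulate_zilr(1000, beta, gamma, rng)
print("share of structural zeros:", 1.0 - susceptible.mean())
```

Returning the latent indicator alongside the data makes it possible to report the realized proportion of structural zeros in each scenario.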
We compared three estimation approaches: (i) Proposed approach: the zero-inflated model with the proposed relabeling rule in Section 4, (ii) The standard logistic regression approach: the ordinary logistic regression model ignoring zero inflation, and (iii) Naive zero-inflated model approach: the zero-inflated model without relabeling, retaining the first local maximizer returned by the optimization algorithm.
All methods used the same optimizer and settings: the L-BFGS-B algorithm with analytical gradients, a maximum of iterations, and random initialization from . An estimate was classified as unreasonable if at least one component exceeded ten times the absolute value of its true value. Each scenario was replicated 10,000 times.
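The classification of unreasonable estimates can be written as a small check. The sketch assumes that the comparison is made element-wise on absolute values. The function name and the `factor` argument are illustrative.

```python
import numpy as np

def is_unreasonable(estimate, truth, factor=10.0):
    """True if any |estimate_k| exceeds factor * |truth_k|.

    Assumes an element-wise comparison of absolute values; a true value of
    exactly zero would flag any nonzero estimate under this reading.
    """
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return bool(np.any(np.abs(estimate) > factor * np.abs(truth)))
```

For example, with a true vector `[1.0, 2.0]`, the estimate `[1.0, 25.0]` is flagged because 25 exceeds 20, while `[1.0, -15.0]` is not.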
Results.
Tables 1–3 summarize the results. Figures 2 and 3 show boxplots of the distributions of the observed bias. Several patterns were clearly observed. First, for the coefficients of the ordinary logistic regression component, the bias of the estimates from the proposed approach remained more concentrated around zero than that of the estimates from the naive approach, whereas the estimates from the standard logistic regression approach exhibited a systematic negative shift that became larger as the zero-inflation proportion increased. Second, the proposed approach was more stable than the naive approach. Across the low-to-moderate zero-inflation scenarios, the biases of the relabeled estimates were smaller than those of the standard logistic regression, while their standard deviations remained moderate. For the coefficients of the structural-zero component, the proposed approach also dominated the naive approach in most scenarios, especially for the parameters other than the intercept. A weakness appeared when structural zeros were very rare, probably because the intercept of the structural-zero component was weakly identified and highly variable. Third, the naive approach behaved as expected for a non-identifiable ordered parameterization: its estimates were drawn from both symmetric solutions. The concentration of the relabeled estimates around the true parameter values, in contrast to the spread of the naive estimates, suggests that the proposed relabeling rule was effective at selecting an appropriate representative from each equivalence class.
Table 3 shows that the proposed relabeling rule improved practical reliability. The proposed method produced reasonable estimates in at least 99.4% of replications in all but the Very Low Mislabel scenario, where the ratio was 91.6%, and consistently outperformed the naive zero-inflated model approach in terms of the proportion of reasonable solutions.

Table 1. Bias (SD) of the estimated coefficients of the ordinary logistic regression component.

| Scenario | Method | Bias (SD): element 1 (intercept) | Bias (SD): element 2 | Bias (SD): element 3 | Bias (SD): element 4 | Bias (SD): element 5 |
|---|---|---|---|---|---|---|
| Very Low Mislabel | Proposed | 0.102 (0.234) | -0.007 (0.231) | -0.003 (0.190) | -0.004 (0.114) | -0.002 (0.115) |
| | Standard LR | -0.170 (0.071) | -0.217 (0.083) | -0.166 (0.076) | 0.005 (0.076) | 0.032 (0.072) |
| | Naive ZILR | 1.100 (1.449) | -0.521 (0.861) | -0.395 (0.674) | -0.004 (0.243) | 0.064 (0.253) |
| Low Mislabel | Proposed | 0.055 (0.253) | -0.019 (0.274) | -0.013 (0.218) | -0.001 (0.109) | 0.001 (0.109) |
| | Standard LR | -0.409 (0.068) | -0.455 (0.075) | -0.346 (0.069) | 0.011 (0.073) | 0.064 (0.069) |
| | Naive ZILR | 1.175 (1.384) | -0.882 (1.059) | -0.667 (0.810) | 0.010 (0.184) | 0.119 (0.231) |
| Moderate Mislabel | Proposed | -0.005 (0.289) | -0.267 (0.592) | -0.207 (0.457) | 0.014 (0.121) | 0.043 (0.142) |
| | Standard LR | -0.815 (0.068) | -0.724 (0.069) | -0.559 (0.068) | 0.032 (0.072) | 0.115 (0.070) |
| | Naive ZILR | 0.751 (0.953) | -1.015 (1.127) | -0.767 (0.862) | 0.010 (0.167) | 0.135 (0.225) |
| High Mislabel | Proposed | -0.195 (0.284) | -0.648 (0.780) | -0.506 (0.597) | 0.035 (0.132) | 0.105 (0.162) |
| | Standard LR | -1.121 (0.070) | -0.860 (0.068) | -0.670 (0.069) | 0.050 (0.074) | 0.145 (0.072) |
| | Naive ZILR | 0.416 (0.802) | -1.021 (1.146) | -0.771 (0.867) | 0.007 (0.181) | 0.133 (0.225) |

Table 2. Bias (SD) of the estimated coefficients of the structural-zero component.

| Scenario | Method | Bias (SD): element 1 (intercept) | Bias (SD): element 2 | Bias (SD): element 3 | Bias (SD): element 4 | Bias (SD): element 5 |
|---|---|---|---|---|---|---|
| Very Low Mislabel | Proposed | 1.144 (3.760) | -0.102 (1.478) | -0.140 (1.297) | 0.092 (0.674) | 0.086 (0.692) |
| | Naive ZILR | -0.579 (3.822) | 0.677 (1.581) | 0.471 (1.324) | 0.060 (0.553) | -0.042 (0.579) |
| Low Mislabel | Proposed | 0.548 (1.899) | -0.118 (0.883) | -0.109 (0.733) | 0.047 (0.339) | 0.042 (0.348) |
| | Naive ZILR | -0.802 (2.053) | 0.852 (1.262) | 0.632 (0.977) | 0.028 (0.270) | -0.094 (0.309) |
| Moderate Mislabel | Proposed | 0.345 (0.957) | 0.223 (1.070) | 0.160 (0.834) | 0.007 (0.215) | -0.018 (0.268) |
| | Naive ZILR | -0.432 (1.057) | 0.981 (1.150) | 0.728 (0.882) | 0.011 (0.176) | -0.113 (0.228) |
| High Mislabel | Proposed | 0.553 (0.789) | 0.629 (1.350) | 0.484 (1.030) | -0.021 (0.219) | -0.092 (0.278) |
| | Naive ZILR | -0.071 (0.859) | 1.005 (1.163) | 0.752 (0.882) | 0.006 (0.181) | -0.120 (0.228) |

Table 3. Numbers of converged and unreasonable estimates over the replications.

| Scenario | Method | Converged | Unreasonable | Total | Ratio (%) |
|---|---|---|---|---|---|
| Very Low Mislabel | Proposed | 9,156 | 844 | 10,000 | 91.6 |
| | Standard LR | 10,000 | 0 | 10,000 | 100.0 |
| | Naive ZILR | 7,249 | 2,751 | 10,000 | 72.5 |
| Low Mislabel | Proposed | 9,936 | 64 | 10,000 | 99.4 |
| | Standard LR | 10,000 | 0 | 10,000 | 100.0 |
| | Naive ZILR | 9,232 | 768 | 10,000 | 92.3 |
| Moderate Mislabel | Proposed | 9,985 | 15 | 10,000 | 99.9 |
| | Standard LR | 10,000 | 0 | 10,000 | 100.0 |
| | Naive ZILR | 9,936 | 64 | 10,000 | 99.4 |
| High Mislabel | Proposed | 9,985 | 15 | 10,000 | 99.9 |
| | Standard LR | 10,000 | 0 | 10,000 | 100.0 |
| | Naive ZILR | 9,955 | 43 | 10,000 | 99.6 |

5 Application to Actual Data
We illustrate the performance of the zero-inflated logistic model with shared design through an application to public data. We note that the data analysis in this section is intended only as a methodological illustration and not for clinical interpretation.
5.1 Dataset
We used the National Health and Nutrition Examination Survey (NHANES), the 2017–2018 public release (Centers for Disease Control and Prevention (CDC), 2017). The outcome for individual ID (SEQN) was self-reported diabetes status, constructed from the diabetes questionnaire as 1 for respondents who reported a prior diabetes diagnosis () and 0 otherwise (). As covariates, we used insurance coverage (HIQ011), usual source of care (HUQ030), age (RIDAGEYR), body mass index, denoted bmi (BMXBMI), and sex (RIAGENDR). To motivate a zero-inflated model, we compared self-reported status with an HbA1c-based variable. Specifically, we defined the HbA1c-based status as 1 when HbA1c was at least () and 0 otherwise (). We used samples with non-zero 2-year sample weights. Table 4 summarizes the sample size of the data. Among respondents with HbA1c-based status 1 and non-missing self-report, the proportion with self-reported status 0 was about 21.4% (Table 4). Although HbA1c is not a gold standard for diagnosis, these descriptive proportions suggest that self-reported diabetes can contain a non-negligible proportion of undiagnosed cases.
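
Table 4. Sample sizes in the NHANES 2017–2018 data.

| Period | Interviewed participants | Non-zero sample weights | HbA1c observed | Self-report observed | Proportion with self-reported status 0 given HbA1c-based status 1 |
|---|---|---|---|---|---|
| 2017–2018 | 9,254 | 8,704 | 6,045 | 8,709 | 0.214 |
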
5.2 Model Settings
We specified a shared-design model. Specifically, the covariates of the ordinary logistic regression component and the structural-zero component were set equal, with covariates
where age and bmi were standardized. The model was estimated on the complete-case subset with samples. Because the aim of this section is methodological illustration, we did not incorporate survey weights.
5.3 Results
We obtained one ordered solution using the BFGS algorithm from a random initial value and then obtained the second by exchanging the two coefficient vectors. Table 5 shows the two solutions: solution A and B, denoted as and , respectively. The resulting negative log-likelihood values were identical up to numerical precision. Specifically, both of the evaluated values were . Moreover, the exchange symmetry was numerically exact: and , where denotes the th element of .
We applied the relabeling rule from Section 4. Then, solution B was selected because , and , where denotes the estimate from the ordinary logistic regression, shown in Table 5.
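
Table 5. Estimated coefficients of the two exchange-symmetric solutions and of the ordinary logistic regression.

| Term | Solution A: ordinary component | Solution A: structural-zero component | Solution B: ordinary component | Solution B: structural-zero component | Ordinary logistic regression |
|---|---|---|---|---|---|
| Intercept | 1.250 | -3.192 | -3.192 | 1.250 | -3.444 |
| insured | -0.413 | 0.138 | 0.138 | -0.413 | -0.111 |
| usualcare | 1.085 | 0.179 | 0.179 | 1.085 | 0.627 |
| age | -1.251 | 2.300 | 2.300 | -1.251 | 1.474 |
| bmi | 0.647 | 0.371 | 0.371 | 0.647 | 0.590 |
| female | -0.500 | -0.223 | -0.223 | -0.500 | -0.431 |
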
6 Discussion
This study investigates the zero-inflated logistic regression model with shared design, in terms of a sign-flip phenomenon under misspecification, the existence of maximum likelihood estimates, identifiability of the regression parameters, computational methods for implementation, and a practical relabeling rule. The primary theoretical message is that non-identifiability in the shared-design setting is not unstructured. Under mild regularity conditions, the non-identifiability is reduced to the exchange symmetry of the two coefficient vectors. By considering the quotient space with respect to this symmetry, the expected log-likelihood has a unique maximizer. This result is useful for understanding the inherent inferential limits of the model.
A second contribution is the analysis of the existence of maximum likelihood estimates. The concepts of double separation and –double–non-separation introduced in this study extend the classical separation conditions for the ordinary logistic regression model. While these conditions do not provide a complete characterization, they offer tractable sufficient conditions for both the existence and non-existence of estimates. In particular, the results on non-existence explain why optimization algorithms may fail to converge even before considering the exchange symmetry of the regression parameters. Furthermore, as shown in Theorem 2.1, model misspecification can lead to a sign flip in the regression coefficients relative to their true values. This provides a formal warning against analyses that ignore zero-inflation structures.
Numerical results based on posterior sampling support the theoretical findings regarding identifiability. Bimodality was clearly observed in posterior distributions under the continuous and mixed designs. However, the binary design did not exhibit clear mode separation and showed increased numerical instability, probably because the design fails to satisfy condition (C2) and thereby lacks guaranteed identifiability. These results provide a practical guideline: analysts should exercise caution when interpreting estimates if all covariate values are restricted to a small number of support points in the covariate space. Regarding the sampling algorithm, because standard single-chain Gibbs samplers may become trapped in one of the modes, we employed the replica exchange method for efficient exploration of the parameter space.
The relabeling rule proposed in Section 4 serves as a heuristic for interpretation rather than a new source of identification. Its role is to provide a reproducible ordering rule when an ordered pair is required for subsequent interpretation or comparison with the results from the ordinary logistic regression. If external information is available to distinguish the ordinary logistic regression component from the structural-zero component, such information should take precedence over the proposed rule.
This study has several limitations. First, our theoretical results provide sufficient conditions for the existence and non-existence of the maximum likelihood estimate, rather than a complete characterization. Second, regarding the relabeling rule, it remains to be investigated whether alternative rules can improve performance when the reference ordinary logistic regression itself suffers from severe bias. Third, because the asymptotic theory for the model on the quotient space remains to be established, we do not perform formal statistical inference. These limitations suggest several directions for future research: a more refined characterization of the existence of the estimates, the formalization of asymptotic theory for parameters defined on the quotient space, and the development of relabeling rules that provide a reliable choice even when the reference logistic regression suffers from severe bias.
A further direction is to examine whether the theoretical results extend to other link functions for binary regression, such as the probit or the complementary log-log function. Theorem 3.2 exploits the specific logistic form , which reduces the identifiability condition to an equality of sums of exponential terms with distinct exponents. While such a representation is unavailable for other general link functions, the principle that the equality over an open set restricts the parameter pairs to an exchange-symmetric set may hold for a broader class of functions. Specifically, for real analytic link functions, an exchange-symmetry result may be established via the identity theorem, provided that the analytic form of the product function ensures that local equality implies global equivalence. Furthermore, because the results on the existence of the maximum likelihood estimate and the sign flip phenomenon under misspecification depend primarily on the monotonicity and boundedness of and the structure of the log-likelihood function, these properties are expected to hold across other link functions, probably with appropriate modifications to reflect specific characteristics of link functions, such as the asymmetric behavior of the complementary log-log function.
7 Concluding Remarks
When the same covariates are used for both components of the zero-inflated logistic regression model, the identifiability results established in the existing literature do not hold. We establish that this non-identifiability has a specific structure. Namely, under mild regularity conditions, it reduces to exchange symmetry, and the expected log-likelihood has a unique maximizer on the resulting quotient space. In addition, we introduce sufficient conditions for the existence and non-existence of the maximum likelihood estimate, demonstrate posterior bimodality through numerical experiments, and propose a simple relabeling rule for applications. We also establish a sign flip phenomenon under misspecification. These theoretical and numerical results provide practical guidelines for applying the zero-inflated logistic regression model.
Funding
This research was supported by AMED under Grant Number JP223fa627001 (UTOPIA AI Research Discovery Program) and JSPS KAKENHI Grant Number 26K02664.
Data Availability
The NHANES data used in the application are available from the website of the Centers for Disease Control and Prevention: https://wwwn.cdc.gov/nchs/nhanes/.
Code Availability
The Python scripts for numerical studies and the R script for actual data application in this manuscript are available from the GitHub repository: https://github.com/t-yui/zero-inflated-logistic-shared-design.
Declaration of the Use of Generative AI and AI-assisted Technologies
The authors used ChatGPT (OpenAI), Claude (Anthropic) and Gemini (Google) to assist with developing scripts for simulation and application, and editing the English language during the preparation of this manuscript. The authors checked and edited the content and take full responsibility for this manuscript.
Appendix A Proofs
A.1 Proof of Theorem 2.1
We state the following regularity conditions.
Assumption A.1.
We assume that
- (A1) .
- (A2) and .
- (A3) almost surely.
- (A4) For the fixed value of and every , the map has a unique maximizer, and the profile objective function has at least one maximizer on .
For the fixed value of and each , let denote the maximizer in the definition of , and let
We first establish basic properties of the profile objective function .
Lemma A.1.
Suppose that Assumption A.1 holds. Then, we have
- (i) The derivative of is .
- (ii) The function is concave with respect to .
Proof of Lemma A.1.
We begin with (i). For fixed , let
Using , we have
Since , we obtain
Moreover, since and , we have
By Assumption A.1(A1), we have . Therefore, by the uniqueness of the maximizer from Assumption A.1(A4), using Danskin’s theorem, we obtain
We then prove (ii). Fix and . Let be a maximizer of for . Using the concavity of , we have
Thus is concave. ∎
The next lemma simplifies the derivative at .
Lemma A.2.
Suppose that Assumption A.1 holds. Then, we have
Proof of Lemma A.2.
At , we have , which depends on only through . Thus, using (A3),
Therefore, we have
∎
Finally, we investigate the behavior of .
Lemma A.3.
Suppose that Assumption A.1 holds. Then, we have
- (i) For every ,
- (ii)
Proof of Lemma A.3.
We first prove (i). Since , we have , and hence
We then obtain
Since and with by (A2), the expectation is strictly positive and thus we have .
We next prove (ii). As , we have
and hence . We then have
For the event , the integrand is strictly negative because and . Therefore, by (A2), we obtain . ∎
We now prove Theorem 2.1.
A.2 Proof of Proposition 2.2
We prove Proposition 2.2 as follows.
Proof of Proposition 2.2.
Assume that the data satisfy double separation, so there exist non-zero vectors and such that
and at least one observation satisfies a strict inequality in the sense of the definition of double separation.
Fix , and define for . For each , let
Because , we have
Therefore, we have
If , then
and both and . Hence, we obtain
If , then
while both and . Hence, we again obtain
Moreover, because at least one observation satisfies a strict inequality, we have
for , for some . Indeed, if and either or , then
because . Similarly, if and either or , then
Therefore,
Since was arbitrary, every finite parameter point can be improved by moving a small positive amount along the direction . Hence, no finite point can be a maximizer of , and the log-likelihood has no maximizer in . ∎
A.3 Proof of Proposition 2.3
We prove Proposition 2.3 as follows.
Proof of Proposition 2.3.
Let , , and . Because
we have
| (2) |
for all .
By –double–non-separation, for each unit direction at least one of the following holds:
- (a) there exists such that ;
- (b) there exists such that .
If (a) holds, then for ,
Therefore, we have
Next, if (b) holds, then for ,
Therefore, we have
Consequently, we obtain
| (3) |
Since as , there exists a sufficiently large radius such that . We then define the closed ball
As is a compact set and is continuous, attains its maximum at some point . Furthermore, for any outside this ball, by (3), we have
Since , it follows that for all . Therefore, is the global maximizer of . ∎
A.4 Proofs of Proposition 3.1, Theorem 3.2, and Corollary 3.3
We first recall a standard linear-independence property of exponential functions.
Lemma A.4.
Let be distinct vectors, and let contain a nonempty open set. If
for all , then .
Proof of Lemma A.4.
Choose in the interior of . Since the hyperplanes
do not cover , there exists such that are pairwise distinct. For all sufficiently small , we have , and hence
Since one-dimensional exponential functions with distinct exponents are linearly independent on any open interval, we obtain
for . Therefore, we have for . ∎
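The one-dimensional fact invoked in the last step of this proof can be checked by a standard Vandermonde argument. The sketch below uses generic symbols $c_j$, $\lambda_j$, and $t_0$ that are introduced here only for illustration and are not the notation of Lemma A.4.

```latex
Suppose that $\sum_{j=1}^{m} c_j e^{\lambda_j t} = 0$ for all $t$ in an open
interval $I$, where $\lambda_1, \dots, \lambda_m$ are distinct real numbers.
Fix $t_0 \in I$ and differentiate $k$ times at $t_0$ for $k = 0, \dots, m-1$:
\[
  \sum_{j=1}^{m} \lambda_j^{k} \, u_j = 0,
  \qquad u_j := c_j e^{\lambda_j t_0},
  \qquad k = 0, \dots, m-1 .
\]
This is the linear system $V u = 0$ with the Vandermonde matrix
$V = (\lambda_j^{k})_{k = 0, \dots, m-1;\; j = 1, \dots, m}$, whose determinant is
\[
  \det V = \prod_{1 \le i < j \le m} (\lambda_j - \lambda_i) \neq 0 .
\]
Hence $u = 0$, and since $e^{\lambda_j t_0} > 0$, we obtain $c_j = 0$ for all $j$.
```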
We now prove Proposition 3.1.
Proof of Proposition 3.1.
Because (1) holds for all , it is equivalent to
for all with . Using , we obtain
for all . Therefore, we have
| (4) |
for all . Since are pairwise distinct, the left-hand side of (4) is a linear combination of three distinct exponential functions and can be written as
where and . The right-hand side can be written as
where are distinct. Therefore, we have
By Lemma A.4, if the sets of vectors and were not identical, at least one coefficient in the combined linear combination, which is either or , would have to be zero. However, this is a contradiction because the exponential function is strictly positive. Therefore, we obtain and
Since , and are pairwise distinct, is the unique element in the set that can be expressed as the sum of the other two elements. (This condition implies that both and are non-zero, and ensures that neither can be expressed as the sum of the other two elements in the set.) Because the sets and are identical, their unique sum elements must be equal. Therefore, we have
which implies that the sets of the remaining elements are also identical:
If and , then we have and , and hence and . Instead, if and , then we have and , and hence and . Therefore, we have either or . ∎
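The role of the logistic form in this argument can be made explicit with a short worked identity. The sketch below is written under the assumption that the success probability is a product of two logistic functions of linear predictors; the symbols $\sigma$, $u$, $v$, $\tilde u$, $\tilde v$, and $x$ are illustrative and may differ from the paper's notation and sign conventions.

```latex
With $\sigma(t) = 1 / (1 + e^{-t})$, the reciprocal of a product of two
logistic functions expands as
\[
  \frac{1}{\sigma(u^\top x)\,\sigma(v^\top x)}
  = \bigl(1 + e^{-u^\top x}\bigr)\bigl(1 + e^{-v^\top x}\bigr)
  = 1 + e^{-u^\top x} + e^{-v^\top x} + e^{-(u+v)^\top x} .
\]
Hence $\sigma(u^\top x)\sigma(v^\top x)
      = \sigma(\tilde u^\top x)\sigma(\tilde v^\top x)$ is equivalent to
\[
  e^{-u^\top x} + e^{-v^\top x} + e^{-(u+v)^\top x}
  = e^{-\tilde u^\top x} + e^{-\tilde v^\top x}
    + e^{-(\tilde u+\tilde v)^\top x} ,
\]
an equality between sums of three exponential terms in which one exponent
vector on each side is the sum of the other two.
```

If $u$, $v$, and $u+v$ are pairwise distinct, then both $u$ and $v$ are non-zero, which matches the remark made in the proof.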
We next prove Theorem 3.2.
Proof of Theorem 3.2.
Fix . For , define and . Then, for all and all , (1) can be expressed as
Because is a nonempty open subset of and are pairwise distinct, Proposition 3.1 yields that, for each , we have either
| (5) |
or
| (6) |
We claim that the same set of equations must hold for all . Indeed, if (5) holds for some and (6) holds for some , then , which contradicts . Therefore, either (5) holds for all , or (6) holds for all .
First, suppose that (5) holds for all . Then, we have
for all . Hence, both of the following functions:
are identically zero on . Since , both functions are also identically zero on . Therefore, we have , , , and . Combining (5) with this, we obtain .
Next, suppose that (6) holds for all . Then, by a similar argument, we obtain .
Therefore, we have either or . ∎
We finally prove Corollary 3.3.
Proof of Corollary 3.3.
Let and . Under correct specification, we have
Therefore, the expected log-likelihood function has the following expression:
and, hence, we have
for any . Here, denotes the Kullback-Leibler divergence from the first probability distribution to the second. Thus is a maximizer on .
Now suppose that also attains the maximum. Then, it must hold that
Therefore, we obtain
| (7) |
Let . Fix and define
Both functions are continuous on . We claim that holds for all and all . Suppose not. Then there exist and such that . By continuity, there exists a nonempty open neighborhood of such that the two functions remain different on . Because and belongs to the conditional support of given , we have . This contradicts (7). Therefore, we obtain
for all with and .
From the discussion above, we have
for all and all with and . By Theorem 3.2, it follows that
Therefore, the expected log-likelihood is uniquely maximized on at the class . ∎
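As an informal numerical illustration of why the maximizer is unique only on the quotient space, the following sketch evaluates the observed-data log-likelihood at a parameter pair and at its swapped counterpart. It assumes that the success probability is the product of two logistic functions of the shared linear predictors, as the exchange symmetry suggests; the function names and the toy data are hypothetical and are not taken from the authors' repository.

```python
import math


def log_expit(t):
    """Stable evaluation of log(1 / (1 + exp(-t)))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def log_likelihood(X, y, beta, gamma):
    """Observed-data log-likelihood, assuming
    P(Y = 1 | x) = expit(x'beta) * expit(x'gamma)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta_b = sum(a * b for a, b in zip(x_i, beta))
        eta_g = sum(a * b for a, b in zip(x_i, gamma))
        log_p1 = log_expit(eta_b) + log_expit(eta_g)
        # log P(Y = 0 | x) = log(1 - exp(log_p1))
        total += log_p1 if y_i == 1 else math.log1p(-math.exp(log_p1))
    return total


# Toy data (hypothetical): an intercept plus one covariate.
X = [[1.0, -1.2], [1.0, 0.3], [1.0, 0.9], [1.0, 2.1]]
y = [0, 0, 1, 1]
beta = [0.5, 1.0]
gamma = [1.7, -1.0]

ll_original = log_likelihood(X, y, beta, gamma)
ll_swapped = log_likelihood(X, y, gamma, beta)  # roles of the two vectors exchanged
```

The two values coincide because the two vectors enter the likelihood only through the product of the two logistic terms, so the data cannot distinguish an ordered pair from its swap.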
Appendix B Details of Numerical Settings of Bimodality Confirmation
B.1 Posterior Distribution and Sampling Algorithm
We consider the posterior distribution given by
| (8) |
where is defined as
and denotes the prior distribution. The complete-data likelihood can be separated into a logistic regression for :
and a logistic regression for :
conditioned on . Thus generates a structural zero, whereas allows the ordinary logistic regression. The observed outcome is then given by . We use this structure to construct an MCMC algorithm. We further employ the Pólya-Gamma augmentation (Polson et al., 2013) and combine the resulting Gibbs sampling algorithm with replica exchange so that the sampler can move between the multiple modes. The detailed algorithms are provided in Appendix C.
B.2 Data Generation and Sampling Setup
Numerical experiments were conducted under the shared-design setting. We considered three designs that differ only in the distribution of the covariates. In each scenario, we generated observations with covariates including an intercept. The true coefficient vectors were fixed at
In Scenario 1, all non-intercept covariates were drawn from independent standard normal distributions. In Scenario 2, all non-intercept covariates were drawn from independent . In Scenario 3, the first two non-intercept covariates were drawn from independent standard normal distributions, and the remaining ones were from independent . Given the covariates, we generated
independently, and set .
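The data-generating scheme just described can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' script: the coefficient values are copied from the "True" column of Table 6, the assignment of each vector to a component is an assumption, the binary covariates are assumed to be Bernoulli(0.5), and the sample size passed to `simulate` is arbitrary. All function names are hypothetical.

```python
import math
import random

# "True" column of Table 6, intercept first. Which vector drives which
# component is an assumption of this sketch.
BETA_TRUE = [0.5, 1.0, 0.5, 0.5, 0.25]    # assumed: ordinary logistic component
GAMMA_TRUE = [1.7, -1.0, -1.0, 0.5, 0.5]  # assumed: structural-zero component


def expit(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    return math.exp(t) / (1.0 + math.exp(t))


def draw_row(scenario, rng, binary_prob=0.5):
    """One covariate row: an intercept plus four non-intercept covariates."""
    def normal():
        return rng.gauss(0.0, 1.0)

    def binary():  # Bernoulli(binary_prob); the value 0.5 is an assumption
        return 1.0 if rng.random() < binary_prob else 0.0

    if scenario == 1:    # continuous design
        draws = [normal() for _ in range(4)]
    elif scenario == 2:  # binary design
        draws = [binary() for _ in range(4)]
    elif scenario == 3:  # mixed design: first two continuous, the rest binary
        draws = [normal(), normal(), binary(), binary()]
    else:
        raise ValueError("scenario must be 1, 2, or 3")
    return [1.0] + draws


def simulate(n, scenario, seed=0):
    """Generate (X, y) with y = z * y_star under the shared design."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = draw_row(scenario, rng)
        eta_b = sum(a * b for a, b in zip(x, BETA_TRUE))
        eta_g = sum(a * b for a, b in zip(x, GAMMA_TRUE))
        z = 1 if rng.random() < expit(eta_g) else 0       # z = 0 gives a structural zero
        y_star = 1 if rng.random() < expit(eta_b) else 0  # ordinary logistic response
        X.append(x)
        y.append(z * y_star)
    return X, y
```

For example, `X, y = simulate(1000, scenario=3)` produces a mixed-design data set. Because of the exchange symmetry, swapping the roles of the two coefficient vectors leaves the distribution of the observed outcome unchanged.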
We placed weakly informative Gaussian priors on both coefficient vectors: and . Posterior sampling was performed via a Gibbs sampling algorithm with replica exchange using replicas. The temperature schedule followed a geometric progression with , and replica exchange was attempted every iterations. The total number of MCMC iterations was set to , with the first discarded as burn-in. To facilitate efficient sampling, Gibbs sampling based on Pólya-Gamma data augmentation was used for both the ordinary logistic regression component and the structural-zero component.
B.3 Sampling Results
To explore the structure of the posterior distribution, we applied the k-means++ clustering algorithm (Arthur and Vassilvitskii, 2007) with two clusters to the posterior samples. The samples, consisting of both and parameters concatenated as -dimensional vectors, were projected onto the first two principal components using PCA for visualization. See Figure 1 for the plots.
Table 6 summarizes the posterior means within each cluster along with cluster sizes and proportions. In Scenarios 1 and 3, the means of the two clusters exhibited an approximately symmetric structure, reflecting the exchange symmetry of and . In Scenario 2, the posterior means from the smaller cluster were numerically large, especially in the structural-zero component, consistent with the failure of the binary design to satisfy the condition (C2), which results in a lack of guaranteed identifiability.
The trace plots and the histograms for the posterior distributions are provided in Appendix D.
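The clustering and projection steps can be sketched as follows. This is a self-contained NumPy illustration, not the authors' code: it implements k-means++ seeding followed by Lloyd iterations directly instead of calling a library routine, and it is applied to synthetic draws around two swapped modes that merely stand in for posterior samples. All names are hypothetical.

```python
import numpy as np


def kmeans_pp(samples, k, rng, n_iter=100):
    """k-means with k-means++ seeding (Arthur and Vassilvitskii, 2007)."""
    n = samples.shape[0]
    centers = [samples[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([((samples - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(samples[rng.choice(n, p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):  # Lloyd iterations
        dist = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([
            samples[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels, centers


def pca_project(samples, n_components=2):
    """Project centered samples onto the leading principal components."""
    centered = samples - samples.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic stand-in for posterior draws (hypothetical): two modes that are
# mirror images of each other under exchange of the two coefficient blocks.
rng = np.random.default_rng(0)
mode_a = np.array([0.5, 1.0, 0.5, 0.5, 0.25, 1.7, -1.0, -1.0, 0.5, 0.5])
mode_b = np.concatenate([mode_a[5:], mode_a[:5]])
draws = np.vstack([
    mode_a + 0.05 * rng.standard_normal((300, 10)),
    mode_b + 0.05 * rng.standard_normal((300, 10)),
])

labels, centers = kmeans_pp(draws, k=2, rng=rng)
scores = pca_project(draws)  # coordinates for a two-dimensional scatter plot
cluster_sizes = np.bincount(labels, minlength=2)
```

Comparing the first half of one cluster center with the second half of the other gives a quick check of the approximate symmetry reported for Scenarios 1 and 3.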
| Parameter | True | Scenario 1, Cluster 1 | Scenario 1, Cluster 2 | Scenario 2, Cluster 1 | Scenario 2, Cluster 2 | Scenario 3, Cluster 1 | Scenario 3, Cluster 2 |
|---|---|---|---|---|---|---|---|
|  | 0.5 | 0.325 | 2.886 | 9.700 | 17.168 | 0.202 | 2.713 |
|  | 1.0 | 0.871 | -1.349 | 5.136 | 0.847 | 0.760 | -1.453 |
|  | 0.5 | 0.355 | -1.572 | 3.180 | 5.330 | 0.234 | -1.283 |
|  | 0.5 | 0.515 | 0.544 | 4.194 | -0.704 | 0.532 | 0.737 |
|  | 0.25 | 0.282 | 0.668 | 8.992 | -0.187 | 0.574 | 0.158 |
|  | 1.7 | 2.413 | 0.200 | 0.280 | 0.265 | 2.794 | 0.216 |
|  | -1.0 | -1.154 | 0.803 | -0.214 | -0.204 | -1.491 | 0.773 |
|  | -1.0 | -1.413 | 0.284 | -0.622 | -0.622 | -1.307 | 0.249 |
|  | 0.5 | 0.524 | 0.516 | 0.570 | 0.575 | 0.749 | 0.532 |
|  | 0.5 | 0.625 | 0.292 | 0.470 | 0.485 | 0.154 | 0.575 |
| Cluster size |  | 13,250 | 36,750 | 8,395 | 41,605 | 25,450 | 24,550 |
| Proportion |  | 0.265 | 0.735 | 0.168 | 0.832 | 0.509 | 0.491 |
Appendix C Details of Sampling Algorithm
For each replica , let denote the temperature. The tempered posterior distribution is defined as
where
for . We assume Gaussian priors: , and .
C.1 Pólya-Gamma Gibbs Sampling Step
For each replica , we perform Gibbs sampling using the following steps.
Step 1. Updating .
If , then is deterministically set to . For observations with , we sample from
where
Step 2. Updating .
The tempered likelihood for is
Introduce Pólya-Gamma auxiliary variables (Polson et al., 2013):
where denotes the Pólya-Gamma distribution. Specifically, if with and , then, using random variables that independently follow the Gamma distribution , can be obtained as
Let , , and
Then the full conditional distribution of becomes Gaussian:
Step 3. Updating .
Let , let be the submatrix of indexed by , and let be the corresponding subvector of . The tempered likelihood of is:
Introduce independent Pólya-Gamma variables
and define and
Then the full conditional distribution of is again Gaussian:
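The three steps above can be sketched in NumPy as follows. This is a minimal illustration at temperature 1 only, so the tempering of the likelihood is not reproduced. The Pólya-Gamma draw truncates the infinite sum-of-Gammas representation of Polson et al. (2013) at `n_terms` terms, which is an approximation. The names `gamma` (coefficients of the latent indicator) and `beta` (coefficients of the ordinary logistic component), the shared prior, and all function names are assumptions made for this sketch.

```python
import numpy as np


def expit(t):
    """Logistic function written via tanh to avoid overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))


def sample_pg(b, c, rng, n_terms=200):
    """Approximate PG(b, c) draws from the truncated sum-of-Gammas representation."""
    c = np.atleast_1d(np.asarray(c, dtype=float))
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=b, scale=1.0, size=(c.size, n_terms))
    denom = (k - 0.5) ** 2 + c[:, None] ** 2 / (4.0 * np.pi ** 2)
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)


def update_z(X, y, beta, gamma, rng):
    """Step 1: z = 1 whenever y = 1; otherwise draw z given y = 0."""
    p_b, p_g = expit(X @ beta), expit(X @ gamma)
    prob_z1 = p_g * (1.0 - p_b) / (1.0 - p_g * p_b)  # P(z = 1 | y = 0)
    return np.where(y == 1, 1, rng.random(y.size) < prob_z1).astype(int)


def update_coef(X, response, coef, prior_mean, prior_prec, rng):
    """Steps 2 and 3: Polya-Gamma draw, then the Gaussian full conditional."""
    omega = sample_pg(1.0, X @ coef, rng)
    kappa = response - 0.5
    prec = X.T @ (omega[:, None] * X) + prior_prec  # posterior precision
    mean = np.linalg.solve(prec, X.T @ kappa + prior_prec @ prior_mean)
    chol = np.linalg.cholesky(prec)                 # prec = chol @ chol.T
    # mean + chol^{-T} eps has covariance prec^{-1}.
    return mean + np.linalg.solve(chol.T, rng.standard_normal(coef.size))


def gibbs_sweep(X, y, beta, gamma, prior_mean, prior_prec, rng):
    """One sweep: z, then gamma on all rows, then beta on rows with z = 1."""
    z = update_z(X, y, beta, gamma, rng)
    gamma = update_coef(X, z, gamma, prior_mean, prior_prec, rng)
    keep = z == 1
    beta = update_coef(X[keep], y[keep], beta, prior_mean, prior_prec, rng)
    return beta, gamma, z
```

Working with the precision matrix and its Cholesky factor avoids forming the covariance matrix explicitly. An exact Pólya-Gamma sampler would normally replace the truncated sum in a production implementation.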
C.2 Replica Exchange Step
After a fixed number of Gibbs iterations, we attempt to swap the states of two neighboring replicas with adjacent temperatures and . Let the current states be denoted by
The acceptance probability for the swap proposal is given by
| (9) |
In practice, we compute the log acceptance ratio as
where
C.3 Overall Algorithm
The full algorithm alternates between the Gibbs sampling step and the replica exchange step as follows:
1. For each replica , perform Gibbs sampling to update , , and from the full conditionals above under its corresponding temperature .
2. Every fixed number of iterations, attempt to swap the states of neighboring replicas and according to the acceptance probability given in (9).
3. Retain samples from the chain with as draws from the target posterior distribution.
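The replica exchange step of this algorithm can be sketched as follows. The sketch assumes that only the likelihood is tempered, so that replica r targets a density proportional to the likelihood raised to the power 1/T_r times the prior; under that assumption the prior cancels in the swap ratio. If the prior were tempered as well, the log-likelihood values below would be replaced by log-posterior values. The ladder endpoints and all function names are hypothetical.

```python
import math
import random


def geometric_temperatures(n_replicas, t_max):
    """Ladder T_1 = 1 < ... < T_R = t_max with a constant ratio between neighbors."""
    if n_replicas < 2:
        return [1.0]
    ratio = t_max ** (1.0 / (n_replicas - 1))
    return [ratio ** r for r in range(n_replicas)]


def log_swap_ratio(loglik_r, loglik_next, temp_r, temp_next):
    """Log Metropolis ratio for exchanging the states of two replicas."""
    return (1.0 / temp_r - 1.0 / temp_next) * (loglik_next - loglik_r)


def attempt_swaps(states, logliks, temps, rng):
    """Propose a swap for each neighboring pair in turn; returns the number accepted.

    `states` and `logliks` are lists indexed by replica and are modified in place.
    """
    accepted = 0
    for r in range(len(temps) - 1):
        log_ratio = log_swap_ratio(logliks[r], logliks[r + 1], temps[r], temps[r + 1])
        # Accept with probability min(1, exp(log_ratio)).
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            states[r], states[r + 1] = states[r + 1], states[r]
            logliks[r], logliks[r + 1] = logliks[r + 1], logliks[r]
            accepted += 1
    return accepted
```

A swap is always accepted when the hotter replica currently holds the state with the higher log-likelihood, which is how well-fitting states found at high temperature migrate toward the chain at temperature 1. Some implementations alternate between even and odd neighbor pairs instead of sweeping through all pairs in one pass.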
Appendix D Trace Plots and Posterior Histograms
We provide the trace plots for Scenario 1 as Figure 4, Scenario 2 as Figure 6, and Scenario 3 as Figure 8. Furthermore, we provide the posterior histograms for Scenario 1 as Figure 5, Scenario 2 as Figure 7, and Scenario 3 as Figure 9.
References
- On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71 (1), pp. 1–10.
- K-means++: the advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics 8, pp. 1027–1035.
- Label-noise robust logistic regression and its applications. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 143–158.
- Classification of mislabelled microarrays using robust sparse logistic regression. Bioinformatics 29 (7), pp. 870–877.
- National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention.
- Maximum likelihood estimation in the logistic regression model with a cure fraction. Electronic Journal of Statistics 5, pp. 460–483.
- Finite mixture and Markov switching models. Springer, New York.
- Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis 99 (9), pp. 2053–2081.
- Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics 56 (4), pp. 1030–1039.
- Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics 74 (1), pp. 145–154.
- An asymmetric logistic regression model for ecological data. Methods in Ecology and Evolution 7 (2), pp. 249–260.
- Estimating a logistic discrimination functions when one of the training samples is subject to misclassification: a maximum likelihood approach. PLoS One 10 (10), pp. e0140718.
- Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association 108 (504), pp. 1339–1349.
- On the existence of maximum likelihood estimators for the binomial response models. Journal of the Royal Statistical Society: Series B 43 (3), pp. 310–313.
- Replica Monte Carlo simulation of spin-glasses. Physical Review Letters 57 (21), pp. 2607.
- Identifiability of finite mixtures. The Annals of Mathematical Statistics 34 (4), pp. 1265–1269.
- Testlet response theory and its applications. Cambridge University Press.
- On the identifiability of finite mixtures. The Annals of Mathematical Statistics 39 (1), pp. 209–214.