A novel nonparametric framework for DIF detection using kernel-smoothed item response curves

Adéla Hladká¹ Patrícia Martinková^1,2
¹ Institute of Computer Science of the Czech Academy of Sciences Prague Czech Republic
² Faculty of Education Charles University Prague Czech Republic

Abstract

This study introduces a novel nonparametric approach for detecting Differential Item Functioning (DIF) in binary items through direct comparison of Item Response Curves (IRCs). Building on prior work on nonparametric comparison of regression curves, we extend the methodology to accommodate binary response data, which is typical in psychometric applications. The proposed approach includes a new estimator of the asymptotic variance of the test statistic and derives optimal weight functions that maximise local power. Because the asymptotic distribution of the resulting test statistic is unknown, a wild bootstrap procedure is applied for inference. A Monte Carlo simulation study demonstrates that the nonparametric approach effectively controls Type I error and achieves power comparable to the traditional logistic regression method, outperforming it in cases with multiple intersections of the underlying IRCs. The impact of bandwidth and weight specification is explored. Application to a verbal aggression dataset further illustrates the method’s ability to detect subtle DIF patterns missed by parametric models. Overall, the proposed nonparametric framework provides a flexible and powerful alternative for detecting DIF, particularly in complex scenarios where traditional model-based assumptions may not be applicable.

1 Introduction

Multi-item assessments play a significant role in various areas of our everyday lives, intervening in education, psychology, healthcare, and other applied fields (brennan2006educational; haladyna2011handbook; martinkova2023computational). In the educational context, they are used to assess academic performance, certify student qualifications, measure knowledge and skill proficiency, and administer admission tests, as well as national and international large-scale student exams. In psychology, these assessments help measure intelligence, personality traits, and attitudes. In health-related fields, they are employed for measuring outcomes such as fatigue, depression, pain, quality of life, and overall well-being. Similar instruments are also further applied in employment selection and promotion processes, peer reviews of academic work, and the evaluation of grant proposals.

While decisions based on such assessments are often guided by total scores, a detailed analysis at the item level provides deeper insight. Item functioning can be described by Item Response Curve (IRC), which expresses the probability of a specific response to an item as a function of an individual’s latent trait and, potentially, additional covariates. Parametric modelling frameworks, such as Item Response Theory (IRT) (vanderlinden1997handbook; vanderlinden2018handbook) and score-based regression models (martinkova2023computational), are widely used to estimate IRCs.

However, any parametric approach risks oversimplification when the underlying model omits critical information. In cases where the true model is highly complex or lacks a clear parametric form, nonparametric methods offer a more flexible alternative. Notable examples include the monotone homogeneity model and double-monotonicity model for binary items (mokken1971theory), as well as the kernel-smoothing approach employing Nadaraya-Watson weights and rank-based ability estimates (ramsay1991kernel).

A crucial aspect of item functioning is that it may vary across different respondent groups. This phenomenon, known as Differential Item Functioning (DIF), occurs when respondents from different groups but with the same underlying trait have different probabilities of giving a certain response to an item. Parametric approaches, such as IRT models and score-based regression methods, can be extended to account for the effect of covariates (e.g., group-membership variable) and applied for DIF detection (lord1980applications; raju1988area; swaminathan1990detecting; drabinova2017detection; hladka2023combining; hladka2025estimation). Yet, to our knowledge, a nonparametric DIF detection directly based on comparing IRCs has not been systematically explored.

To address this gap, we introduce a new kernel-smoothing-based framework for the nonparametric comparison of IRCs across groups. Our method adapts and extends the general approach of srihera2010nonparametric to the binary-item setting, developing it specifically for the purpose of detecting DIF. We further propose several methodological variants designed to accommodate different DIF types, and we demonstrate their performance in a systematic comparison with the widely used logistic regression method (swaminathan1990detecting).

The remainder of this paper is structured as follows. Section 2 details the estimation procedure, the general test statistic, and several weight-function strategies, including optimal weights designed to maximise test power, along with their estimates and asymptotic properties of the resulting test statistic. Section 3 presents a simulation study evaluating the proposed framework in comparison to the traditional parametric approach based on logistic regression. Section 4 illustrates the methodology using an empirical dataset from a questionnaire about verbal aggression. Section 5 concludes with a discussion of the findings, practical recommendations, and directions for future work.

2 Methodology

DIF and its detection are closely connected to the broader problem of describing the relationship between respondents’ responses $\boldsymbol{Y_{i}}=(Y_{1i},\dots,Y_{ni})$ to item $i$ and their abilities $\boldsymbol{\theta}=(\theta_{1},\dots,\theta_{n})$ , where $n$ is a number of respondents. This relationship can generally be expressed through a regression function $m_{i}$ for item $i$ :

\displaystyle\boldsymbol{Y_{i}}=m_{i}(\boldsymbol{\theta})+\boldsymbol{\epsilon_{i}},\quad\operatorname{\textsf{E}}(\boldsymbol{\epsilon_{i}}|\boldsymbol{\theta})=\boldsymbol{0}

This work focuses on binary outcomes $\boldsymbol{Y_{i}}$ , in which case this relationship can be reformulated as

\displaystyle\operatorname{\textsf{E}}(\boldsymbol{Y_{i}}|\boldsymbol{\theta})=\operatorname{\textsf{P}}(\boldsymbol{Y_{i}}=1|\boldsymbol{\theta})=m_{i}(\boldsymbol{\theta}).

We estimate IRCs, i.e., the function $m_{i}(\cdot)$ , using the nearest-neighbor kernel-smoothing approach of srihera2010nonparametric across two groups: reference ( $g=0$ ) and focal ( $g=1$ ). Let $\mathcal{P}_{g}$ denote the set of respondents in group $g$ of size $n_{g}$ , such that $n_{0}+n_{1}=n$ is the total sample size. For a given item $i$ , let $Y_{pi}$ be the binary response of respondent $p$ and $\theta_{p}$ their ability (e.g., standardised total test score or other matching criterion). Define $\hat{F}_{ig}(x)$ as the empirical distribution function of $\theta_{p}$ in group $g$ :

\displaystyle\hat{F}_{g}(x)=\frac{1}{n_{g}}\sum_{p\in\mathcal{P}_{g}}\mathbf{1}[\theta_{p}\leq x].

The nearest neighbor estimate of the IRC $m_{ig}$ for item $i$ for group $g$ is then given by:

\displaystyle\hat{m}_{ig}(x)=\sum\limits_{p\in\mathcal{P}_{g}}Y_{pi}W_{pig}(x),

(1)

where the weights $W_{pig}(x)$ are defined as:

\displaystyle W_{pig}(x)=\frac{K\left(\frac{\hat{F}_{g}(\theta_{p})-\hat{F}_{g}(x)}{h}\right)}{\sum\limits_{k\in\mathcal{P}_{g}}K\left(\frac{\hat{F}_{g}(\theta_{k})-\hat{F}_{g}(x)}{h}\right)}.

(2)

$K(\cdot)$ is a twice continuously differentiable, symmetric, non-negative kernel function, which is non-decreasing for $u<0$ with a compact support and $\int K(u)\ \mathrm{d}u=1$ . Examples include the Epanechnikov kernel $K(u)=\frac{3}{4}(1-u^{2}),\left|{u}\right|\leq 1$ (epanechnikov1969non), and the uniform kernel $K(u)=\frac{1}{2},\left|{u}\right|\leq 1$ . The bandwidth parameter $h$ satisfies $nh^{3}\rightarrow\infty$ and $nh^{4}\rightarrow 0$ as $n\rightarrow\infty$ (see srihera2010nonparametric, p. 2042). Therefore, the parameter $h$ is assumed to take the value of $n^{-\zeta}$ , where $\zeta\in\left(\frac{1}{4},\frac{1}{3}\right)$ and $n$ has the order of $n_{0}$ and $n_{1}$ .

Kernel smoothing is advantageous here because it makes no assumptions about the functional form of $m_{ig}(x)$ , making it applicable even when IRCs have complex shapes. For an illustration, for IRCs with multiple inflection points (Figure 1(a)), the nearest neighbor estimate (1) applied to simulated binary data provides a closer match to the true curves (Figure 1(b)) than the logistic regression approach (Figure 1(c)).

Refer to caption — ((a)) Underlying IRCs

2.1 Test statistic

Differences in ability distributions between groups can make direct curve comparison problematic. Following srihera2010nonparametric, we define a common support by averaging the values of the matching criterion $\theta_{p}$ from the two groups:

\displaystyle\bar{X}_{p_{0}p_{1}}=\frac{\theta_{p_{0}}+\theta_{p_{1}}}{2},\quad p_{0}\in\mathcal{P}_{0},p_{1}\in\mathcal{P}_{1}.

The proposed general test statistic for item $i$ is then expressed as follows:

\displaystyle\widehat{T}_{i}=\frac{1}{n_{0}n_{1}}\sum_{p_{0}\in\mathcal{P}_{0}}\sum_{p_{1}\in\mathcal{P}_{1}}W_{i}\left(\bar{X}_{p_{0}p_{1}}\right)\left[\hat{m}_{i0}\left(\bar{X}_{p_{0}p_{1}}\right)-\hat{m}_{i1}\left(\bar{X}_{p_{0}p_{1}}\right)\right],

(3)

where $W_{i}(\cdot)$ is a twice continuously differentiable weight function for item $i$ . $\widehat{T}_{i}$ is the weighted average difference between the two IRCs across the common support. Under the null hypothesis $H_{0}{:}\,\,m_{i0}\equiv m_{i1}$ , i.e., no DIF, $\widehat{T}_{i}$ should be close to zero.

2.1.1 Asymptotic properties

Asymptotic variance.

The asymptotic variance of the test statistic (3) under the null hypothesis is given by

\displaystyle\sigma_{i}^{2}=(1-\lambda)\rho_{i0}^{2}+\lambda\rho_{i1}^{2},

(4)

where

\displaystyle\rho_{ig}^{2}=

\displaystyle\ \int\sigma_{ig}(x)W_{i}^{2}(x)\frac{e(x)}{f_{g}(x)}E(\mathrm{d}x)<\infty,\ \ \sigma_{ig}(x)=m_{ig}(x)(1-m_{ig}(x)).

Here, $f_{0}(x)$ , $f_{1}(x)$ , and $e(x)$ are the twice continuously differentiable density functions of the matching criterion for the reference group, focal group, and their averaged values, while $E(x)$ is their cumulative distribution function; and

\displaystyle\lambda=\lim_{n_{0},n_{1}\rightarrow\infty}\frac{n_{0}}{n_{0}+n_{1}}\in(0,1).

For more details, see srihera2010nonparametric.

Variance estimation.

To estimate asymptotic variance (4), we propose

\displaystyle\begin{split}\hat{\sigma}_{i}^{2}=\ &\frac{1}{n_{0}+n_{1}}\sum\limits_{p_{0}\in\mathcal{P}_{0}}\hat{\sigma}_{i0}(\theta_{p_{0}})\left[\sum\limits_{k\in\mathcal{P}_{0}}\sum\limits_{l\in\mathcal{P}_{1}}W_{i}\left(\bar{X}_{kl}\right)W_{p_{0}i}\left(\bar{X}_{kl}\right)\right]^{2}\\ &+\frac{1}{n_{0}+n_{1}}\sum\limits_{p_{1}\in\mathcal{P}_{1}}\hat{\sigma}_{i1}(\theta_{p_{1}})\left[\sum\limits_{k\in\mathcal{P}_{0}}\sum\limits_{l\in\mathcal{P}_{1}}W_{i}\left(\bar{X}_{kl}\right)W_{p_{1}i}\left(\bar{X}_{kl}\right)\right]^{2},\end{split}

(5)

which accounts for our binary-item setting, as it replaces squared residuals with estimated conditional variances

\displaystyle\hat{\sigma}_{ig}(\theta_{p_{g}})=

\displaystyle\ \hat{m}_{ig}(\theta_{p_{g}})\left(1-\hat{m}_{ig}(\theta_{p_{g}})\right)

in the original estimator proposed by srihera2010nonparametric:

	$\displaystyle\hat{\sigma}_{i}^{2}=\$	$\displaystyle\frac{1}{n_{0}+n_{1}}\sum\limits_{p_{0}\in\mathcal{P}_{0}}\left(Y_{p_{0}i}-\hat{m}_{i0}(\theta_{p_{0}})\right)^{2}\left[\sum\limits_{k\in\mathcal{P}_{0}}\sum\limits_{l\in\mathcal{P}_{1}}W_{i}\left(\bar{X}_{kl}\right)W_{p_{0}i}\left(\bar{X}_{kl}\right)\right]^{2}$
		$\displaystyle+\frac{1}{n_{0}+n_{1}}\sum\limits_{p_{1}\in\mathcal{P}_{1}}\left(Y_{p_{1}i}-\hat{m}_{i1}(\theta_{p_{1}})\right)^{2}\left[\sum\limits_{k\in\mathcal{P}_{0}}\sum\limits_{l\in\mathcal{P}_{1}}W_{i}\left(\bar{X}_{kl}\right)W_{p_{1}i}\left(\bar{X}_{kl}\right)\right]^{2}.$

This approach is more convenient, as it accounts for the binary nature of item responses.

Asymptotic distribution.

Under the conditions specified above and assuming the null hypothesis holds, it can be shown that the test statistic (3) normalised by $\hat{\sigma}_{i}$ specified in (5) asymptotically follows a standard normal distribution:

\displaystyle\frac{\sqrt{N}\widehat{T}_{i}}{\hat{\sigma}_{i}}\mathrel{{\mathop{\longrightarrow}\limits_{N\rightarrow\infty}^{\mathcal{D}}}}\mathcal{N}(0,1),\quad N=\frac{n_{0}n_{1}}{n_{0}+n_{1}},

for details see srihera2010nonparametric.

Support size and computation.

The original approach evaluates the test statistic (3) over all $n_{0}\cdot n_{1}$ averaged pairs, which may significantly slow down data manipulation in statistical software, making the proposed method time-consuming and memory-intensive, especially for larger sample sizes. To address this issue and improve efficiency, we propose and employ an alternative technique to calculate a common support:

(1)

The common support is initially calculated as in the original approach.
(2)

The empirical weights of unique values of the averaged points are then computed.
(3)

A fixed-sized random sample is generated from the unique values of the common support using these weights.

This reduces computational burden while preserving the representativeness of both matching criterion distributions. When using the reduced support, it is important to note that the original size of the product (i.e., $n_{0}\cdot n_{1}$ ) must be replaced by the size of the newly defined support set.

2.2 Weight function

The choice of the weight function $W_{i}(\cdot)$ in the test statistic (3) is crucial, as it can significantly influence the statistical power. In this study, we consider three strategies.

2.2.1 Fixed weights

First, we consider uniform weighting:

\displaystyle W_{i}(x)=1\quad\forall x.

(6)

This non-informative option is useful when no prior information about the nature of DIF is available.

2.2.2 Optimal weights

Second, we adapt an optimal weight function, derived in srihera2010nonparametric, to the case of binary data and for the comparison of the IRCs. This weight function is intended to maximise the local asymptotic power of the test.

Under the local alternative hypothesis $m_{i0}=m_{i1}+\frac{cs_{i}}{N},c\neq 0$ , where $s_{i}$ is a difference function, the normalised test statistic (3) converges to the normal distribution:

\displaystyle\frac{\sqrt{N}\widehat{T}_{i}}{\hat{\sigma}_{i}}\mathrel{{\mathop{\longrightarrow}\limits_{N\rightarrow\infty}^{\mathcal{D}}}}\mathcal{N}\left(\frac{\mu_{i}}{\sigma_{i}},1\right),

where $\mu_{i}=-\int W_{i}(x)\left(m_{i0}(x)-m_{i1}(x)\right)E(\mathrm{d}x)$ and $\sigma_{i}^{2}$ is given in (4). The asymptotic power is then given by

\displaystyle\mathsf{P}\left(\left|{\frac{\sqrt{N}\widehat{T}_{i}}{\hat{\sigma}_{i}}}\right|\geq q_{1-\frac{\alpha}{2}}\right)\simeq 1-\phi\left(\frac{\mu_{i}}{\sigma_{i}}+q_{1-\frac{\alpha}{2}}\right)+\phi\left(\frac{\mu_{i}}{\sigma_{i}}-q_{1-\frac{\alpha}{2}}\right),

(7)

which is an increasing function of $\left|{\frac{\mu_{i}}{\sigma_{i}}}\right|$ . Thus, the weight function that maximises the asymptotic power (7) is the one that maximises the term $\left|{\frac{\mu_{i}}{\sigma_{i}}}\right|$ . This is equivalent to maximising the term:

\displaystyle\frac{\mu_{i}^{2}}{\sigma_{i}^{2}}=\frac{\left[\int W_{i}(x)s_{i}(x)E(\mathrm{d}x)\right]^{2}}{\int\left[(1-\lambda)\sigma_{i0}(x)\frac{e(x)}{f_{0}(x)}+\lambda\sigma_{i1}(x)\frac{e(x)}{f_{1}(x)}\right]W_{i}^{2}(x)E(\mathrm{d}x)},

which yields

\displaystyle W_{i}(x)=\frac{s_{i}(x)}{(1-\lambda)\sigma_{i0}(x)\frac{e(x)}{f_{0}(x)}+\lambda\sigma_{i1}(x)\frac{e(x)}{f_{1}(x)}}.

Differences between the IRCs cannot generally be captured by a generic function, such as a polynomial. Therefore, in the context of this paper and its simulation study, we assume $s_{i}(x)=m_{i0}(x)-m_{i1}(x)$ , representing the true difference between the two IRCs. Under this definition, the optimal weight function is given by

\displaystyle W_{i}(x)=\frac{m_{i0}(x)-m_{i1}(x)}{(1-\lambda)\sigma_{i0}(x)\frac{e(x)}{f_{0}(x)}+\lambda\sigma_{i1}(x)\frac{e(x)}{f_{1}(x)}}.

(8)

Figure 2 presents examples of IRCs showing DIF caused by different parameters, along with the corresponding optimal weights. Note that these weight functions can take negative values, enabling the detection of crossing non-uniform DIF, where IRCs intersect. Although the exact weights in (8) cannot be directly used in practice, since the true curve differences are unknown, they can serve as a valuable performance benchmark.

2.2.3 Estimates of optimal weights

Third, to make (8) effective in practice, we extend the approach outlined in Section 2.2.2 by replacing the unknown quantities with their estimates. This yields a natural estimate of the optimal weights,

\displaystyle\widehat{W}_{i}(x)=\frac{\hat{m}_{i0}(x)-\hat{m}_{i1}(x)}{(1-\hat{\lambda})\hat{\sigma}_{i0}(x)\frac{\hat{e}(x)}{\hat{f}_{0}(x)}+\hat{\lambda}\hat{\sigma}_{i1}(x)\frac{\hat{e}(x)}{\hat{f}_{1}(x)}}.

(9)

Substituting $\widehat{W}_{i}(x)$ into (3), the resulting test statistic for item $i$ is given by:

\displaystyle\widehat{T}_{i}=\frac{1}{n_{0}n_{1}}\sum_{p_{0}\in\mathcal{P}_{0}}\sum_{p_{1}\in\mathcal{P}_{1}}\frac{\left[\hat{m}_{i0}\left(\bar{X}_{p_{0}p_{1}}\right)-\hat{m}_{i1}\left(\bar{X}_{p_{0}p_{1}}\right)\right]^{2}}{(1-\hat{\lambda})\hat{\sigma}_{i0}(\bar{X}_{p_{0}p_{1}})\frac{\hat{e}(\bar{X}_{p_{0}p_{1}})}{\hat{f}_{0}(\bar{X}_{p_{0}p_{1}})}+\hat{\lambda}\hat{\sigma}_{i1}(\bar{X}_{p_{0}p_{1}})\frac{\hat{e}(\bar{X}_{p_{0}p_{1}})}{\hat{f}_{1}(\bar{X}_{p_{0}p_{1}})}}.

(10)

The test statistic (10) includes the squared difference $(\hat{m}_{i0}(x)-\hat{m}_{i1}(x))^{2}$ in the numerator. In contrast to the test statistic in (3), which reflects the weighted average of raw differences, this version represents the average squared discrepancy between IRCs. It is specifically designed to maximise sensitivity to complex DIF patterns, including cases where the curves intersect.

Assessing significance.

Because substituting estimated weights invalidates the original asymptotic normality, we assess significance using a wild bootstrap (wu1986jackknife; mammen1993bootstrap). This method is particularly suitable when the data exhibits heteroskedasticity (e.g., hardle1993comparing), which aligns with the binary nature of the responses discussed in this work. This resampling scheme proceeds as follows:

(1)

Initial Step: Estimates of the IRCs are computed using (1). Then, the optimal weights are estimated with (9), and the DIF detection procedure is applied using the test statistic (10).
(2)
Bootstrap Sampling: Under the null hypothesis (i.e., no DIF), a common IRC for both groups is estimated, and the corresponding fitted values $\left\{\hat{y}_{pi}\right\}_{p=1}^{n}$ are computed.
1. (2a)
  
  For each bootstrap run $b\in\left\{1,\dots,B\right\}$ , where $B$ is the number of bootstrap samples, a bootstrap sample $y^{\ast}_{pib}$ is generated directly from Bernoulli distribution using fitted values $\hat{y}_{pi}$ , meaning:
  
  $\displaystyle y^{\ast}_{pib}\sim\text{Bernoulli}(\hat{y}_{pi}),$
  
  to account for the binary nature of the data.
2. (2b)
  
  For each bootstrap sample, the DIF detection procedure is applied as in the original sample in the initial step, resulting in a set of the test statistics $\left\{\widehat{T}_{ib}\right\}_{b=1}^{B}$ .
(3)

Final Step: The set of the test statistics $\left\{\widehat{T}_{ib}\right\}_{b=1}^{B}$ is compared to the test statistic of the original sample. A conclusion on DIF is made based on a two-sided $p$ -value:

$\displaystyle p\text{-value}=\frac{1}{B}\sum_{b=1}^{B}\mathbf{1}[\widehat{T}_{i}<\widehat{T}_{ib}],$

and the predefined level of significance.

3 Simulation study

We conducted a Monte Carlo simulation study to evaluate the statistical properties of the proposed nonparametric DIF detection method (3) under various conditions and to compare its performance with that of the well-established logistic regression approach (swaminathan1990detecting). Specifically, we examined type I error control, statistical power, and accuracy of estimated optimal weights.

3.1 Simulation design

In this part, we describe the design of the simulation study, including the data generation process, the DIF detection procedures and their implementation, as well as the evaluation of the results.

3.1.1 Data and DIF generation

Binary item responses were generated for the reference and focal groups from a logistic regression model extended with higher-order terms to allow for multiple inflexion points. For respondent $p$ and item $i$ , the probability of a correct response was defined as

\displaystyle\begin{split}\operatorname{\textsf{P}}(Y_{pi}=1|\theta_{p})=c_{i}+(d_{i}-c_{i})\frac{e^{a_{i}(\theta_{p}-b_{i}-e_{i}\theta_{p}^{2}-f_{i}\theta_{p}^{3}-g_{i}\theta_{p}^{5})}}{1+e^{a_{i}(\theta_{p}-b_{i}-e_{i}\theta_{p}^{2}-f_{i}\theta_{p}^{3}-g_{i}\theta_{p}^{5})}},\end{split}

(11)

where $a_{i}$ denotes the discrimination, $b_{i}$ the difficulty $c_{i}$ the pseudo-guessing parameter, and $d_{i}$ inattention/slip parameter. The additional item parameters $(e_{i},f_{i},g_{i})$ generate more complex IRC shapes with multiple inflexion points. Respondent abilities $\theta_{p}$ were drawn from a standard normal distribution in both groups.

For non-DIF items, responses were generated using the true 4PL IRT model (barton1981upper) (i.e., model 11 with $(e_{i},f_{i},g_{i})=(0,0,0)$ ). Other item parameters were identical across groups and drawn from normal distributions: Discrimination $a_{i}\sim\mathcal{N}(1.1,0.3)$ , difficulty $b_{i}\sim\mathcal{N}(0,1.1)$ , guessing $c_{i}\sim\mathcal{N}(0.2,0.05)$ , and inattention $d_{i}\sim\mathcal{N}(0.8,0.05)$ .

To generate differentially functioning items, we considered six different sources of DIF in total: changes in discrimination $a_{i}$ , difficulty $b_{i}$ , guessing $c_{i}$ , or inattention $d_{i}$ , and two mixture conditions. In the first setting, called mix1, parameters were selected such that the IRCs intersect exactly once, while in the second setting, called mix2, they intersected twice.

The magnitude of DIF was calibrated so that the weighted unsigned area measure between the two IRCs (siebert2013differential) equalled 0.196, corresponding to a large effect size. The IRCs and the corresponding optimal weight functions for DIF items are illustrated in Figure 2, and their parameters are summarised in Table 1. Each simulated test consisted of 20 items, including one DIF item (5% prevalence).

The standardised total test score was used as the matching criterion $\theta_{p}$ . Although discrete rather than continuous, it reflects common practice in applied DIF analyses.

The total sample sizes of $n=$ 50, 100, 200, 300, and 400 were selected, with both groups being equally sized. Each condition was replicated 1,000 times.

Table 1: Item parameters used to generate DIF items

DIF source	Reference group							Focal group
	$a$	$b$	$c$	$d$	$e$	$f$	$g$	$a$	$b$	$c$	$d$	$e$	$f$	$g$
$a$	$0.42$	$0.00$	$0.00$	$1.00$	$0.00$	$0.00$	$0.00$	$2.00$	$0.00$	$0.00$	$1.00$	$0.00$	$0.00$	$0.00$
$b$	$1.00$	$0.00$	$0.00$	$1.00$	$0.00$	$0.00$	$0.00$	$1.00$	$1.00$	$0.00$	$1.00$	$0.00$	$0.00$	$0.00$
$c$	$1.00$	$0.00$	$0.00$	$1.00$	$0.00$	$0.00$	$0.00$	$1.00$	$0.00$	$0.39$	$1.00$	$0.00$	$0.00$	$0.00$
$d$	$1.00$	$0.00$	$0.00$	$0.61$	$0.00$	$0.00$	$0.00$	$1.00$	$0.00$	$0.00$	$1.00$	$0.00$	$0.00$	$0.00$
mix1	$1.90$	$0.28$	$0.07$	$1.00$	$1.00$	$-0.70$	$0.00$	$0.35$	$-1.75$	$0.03$	$0.98$	$1.60$	$-0.90$	$0.00$
mix2	$4.20$	$0.00$	$0.10$	$0.85$	$0.00$	$-0.50$	$-0.50$	$0.18$	$-1.50$	$0.00$	$1.00$	$1.00$	$-1.20$	$-0.50$

3.1.2 DIF detection

Five approaches for DIF detection were evaluated in the simulation study: Four variations of the proposed nonparametric approach and the logistic regression method for DIF detection with the likelihood ratio test (swaminathan1990detecting). The nonparametric methods differed by the choice of the weight function: the fixed weights (6), the theoretical optimal weights (8), the estimated optimal weigh (9) without bootstrap calibration (i.e., assuming asymptotically normal distribution of the test statistic (3)), and the estimate of optimal weights using wild bootstrap. The optimal weight function was applied only to DIF items, with values set to zero for non-DIF items, which necessarily yielded rejection rates of zero in the latter case. For the bootstrap-based method, the number of samples was set to $B=500$ .

For kernel estimation of IRCs, the Epanechnikov kernel was used with three different bandwidth parameters $h=n_{0}^{-\zeta}$ , where $\zeta$ took values $0.260,\frac{7}{24}(\approx 0.292)$ , and $0.320$ , satisfying the regularity conditions. All tests are performed at a significance level of 0.05.

3.1.3 Evaluation of the results

The five different DIF detection techniques (four variations of the nonparametric approach and the logistic regression method) were compared on two key performance metrics: power and rejection rate. Power is defined as the proportion of true positives (i.e., correctly detected DIF items), while rejection rate refers to the proportion of false positives (i.e., non-DIF items incorrectly identified as DIF).

Additionally, the accuracy of the estimate of the optimal weights was assessed by computing the Root Mean Squared Error (RMSE), which quantifies the root of the mean squared difference between the optimal weights (8) and their estimates.

3.1.4 Implementation

All analyses were conducted in the statistical software R (R2019), version 4.3.2, and its associated packages. Empirical density functions were computed with the ecdf() function from the stats package (R2019). Weights of kernel functions and kernel estimates were calculated by the locCteWeightsC() and locWeightsEval() functions from the locpol package (cabrera2018locpol). Estimates of densities of standardised total scores and the common support of the test statistic were evaluated with the bkde() function from the KernSmooth package (wand2019kernsmooth). The logistic regression method with the likelihood ratio test was performed using the glm() function from the stats package. Finally, graphical representations of the results were created using the ggplot2 package (wickham2016ggplot2).

3.2 Simulation results

3.2.1 Rejection rates and power

The estimates of the optimal weights (1) without the wild bootstrap were the most powerful approach across all scenarios, with a mean power rate of 0.724. However, this gain in sensitivity came at the cost of a substantially inflated Type I error, with an average rejection rate of 0.272, exceeding the nominal significance level of 0.05. Consequently, this variant was considered unreliable and was excluded from subsequent analyses; its results are therefore not reported here.

All remaining approaches maintained appropriate control of the significance level across all bandwidth parameters $\zeta$ , DIF sources, and sample sizes, with rejection rates ranging from 0.050 to 0.069. No significant differences were observed in rejection rates among the DIF detection methods, regardless of DIF sources. A slight increase in rejection rates was noted for both the logistic regression method and the nonparametric approach with fixed weights at the smallest sample size ( $n=50$ ; Figure 3, Table A1).

All DIF detection methods exhibited lower power at smaller sample sizes, as expected. With increasing sample size, power improved for all methods, and differences between approaches became less pronounced (Figure 4, Table A2). The nonparametric approaches, using either optimal or fixed weights, outperformed the logistic regression method in scenarios where shifts in parameters $b$ and $c$ were sources of DIF, as well as in the mix1 scenario, across nearly all sample sizes. This was also the case for small samples ( $n\leq 100$ ) when DIF was caused by parameter $d$ . Furthermore, in the mix2 scenario, the nonparametric approach using optimal weights achieved higher power than logistic regression. In contrast, logistic regression gained the highest power when parameter $a$ was the source of DIF. In this case, the nonparametric approach with the fixed weights failed to detect DIF effectively and achieved only limited power, also in the mix2 scenario. Under such circumstances, the nonparametric method with estimated optimal weights and wild bootstrap provided a marked improvement, consistently yielding higher power across all sample sizes.

Generally, the differences between the nonparametric approaches using different bandwidth parameters $h$ were small. When the value of $\zeta$ was lower, meaning the bandwidth parameter $h=n_{0}^{-\zeta}$ was larger, the optimal weights and their estimates using the wild bootstrap yielded slightly higher mean power.

3.2.2 Estimates of optimal weights

In nine scenarios where the parameter $c$ was the source of DIF and $\zeta=0.292$ , the estimates of the optimal weights exceeded $10^{6}$ , indicating numerical instability or divergence. As a result, these scenarios were excluded from further analysis to ensure the reliability of findings and are not reported here.

Estimation of the optimal weight was the most precise when the parameters $b$ , $c$ , and $d$ were sources of DIF. In contrast, it was the most biased when the mix2 scenario was considered (Figure 5, Table A3; see also rows 2, 3, 4, and 6 in Figure A1). The smallest overall RMSE of 0.242 was achieved for $\zeta=0.320$ (the smallest bandwidth parameter $h$ ), while the largest overall RMSE of 0.270 occurred for $\zeta=0.260$ (the largest $h$ ).

All three choices of the $\zeta$ parameter resulted in more accurate estimates for larger sample sizes compared to smaller ones. While there were no significant differences between bandwidth parameters when the parameters $b$ , $c$ , or $d$ were sources of DIF, this was not the case when the IRCs intersected. In such scenarios, $\zeta=0.320$ (the smallest bandwidth parameter) produced the most accurate estimates, whereas $\zeta=0.260$ (the largest bandwidth parameter) consistently resulted in the highest RMSE across all sample sizes. Furthermore, the precision of the estimates did not always increase (i.e., decrease of RMSE) with increasing sample size when the discrimination parameter $a$ was a source of DIF or when the mix1 setting was considered for the underlying IRCs (Figure 5, Table A3; see also rows 1 and 5 in Figure A1).

4 Real data example

To illustrate the practical application of our methodology, we analyse data from a questionnaire on verbal aggression.

4.1 Data description

The Verbal Aggression dataset (vansteelandt2001formal, vansteelandt2001formal; available in magis2010general, magis2010general) contains responses of 316 participants (243 females and 73 males) to a 24-item questionnaire measuring tendencies toward verbal aggression. Each item describes a frustrating situation paired with a potential verbal aggression reaction. Specifically, four frustration situations were considered: (S1) A bus fails to stop for me; (S2) I miss a train because a clerk gave me faulty information; (S3) The grocery store closes just as I am about to enter; (S4) The operator disconnects me when I had used up my last 10 cents for a call. For each situation, respondents indicated whether they wanted to or actually did react with one of three possible aggressive behaviours: cursing, scolding, or shouting. For instance, item S1WantShout corresponds to the statement ”A bus fails to stop for me. I want to shout”. Items are binary-coded. A response of 1 denotes agreement with the statement, and 0 indicates disagreement.

4.2 Statistical analysis

Three approaches for DIF detection were evaluated in the real data example analysis: Two variations of the proposed nonparametric approach and the logistic regression method for DIF detection with the likelihood ratio test (swaminathan1990detecting). The nonparametric methods differed in the choice of the weight function: the fixed weights (6) and the estimated optimal weights (1) using the bootstrap. For kernel estimation of IRCs, the Epanechnikov kernel was used with three different bandwidth parameters $h=n_{0}^{-\zeta}$ , where $\zeta$ took values $0.260,\frac{7}{24}(\approx 0.292)$ , and $0.320$ . DIF was analysed with respect to the respondent gender. A standardised total score was used as the measure of observed matching ability. All tests are performed at a significance level of 0.05.

4.3 Results

In the Verbal Aggression dataset, eleven items were flagged as exhibiting DIF by at least one of the detection methods (Table 2). Three items (S2WantShout, S2DoCurse, and S2DoScold) were consistently identified across all approaches. By contrast, the item S1WantScold was detected solely by the nonparametric method with estimated optimal weights using bootstrap.

Across the nonparametric variants, the fixed-weight method and the estimated optimal weights with bandwidth $\zeta=0.32$ identified the largest number of DIF items (seven each). The versions with $\zeta=0.292$ and $\zeta=0.26$ followed, flagging six and five items, respectively. Logistic regression identified the fewest DIF items, with only four flagged.

Alignment between methods varied. The highest agreement was observed within the fixed-weight variants (tetrachoric correlations between 0.90 and 0.99) and within the estimated optimal weights using bootstrap (0.89–0.97). Agreement between these two nonparametric families was lower, with correlations ranging from 0.33 to 0.66. Logistic regression showed moderate alignment with nonparametric approaches, correlating 0.86–0.87 with the fixed-weight methods and 0.71–0.84 with the estimated optimal weight using bootstrap.

Table 2: Test statistics and p-values for DIF items from the Verbal Aggression dataset.

Item	Nonparametric												Logistic
	Fixed						Bootstrap
	$\zeta=0.260$		$\zeta=0.292$		$\zeta=0.320$		$\zeta=0.260$		$\zeta=0.292$		$\zeta=0.320$
	$T$ -value	$p$ -value	$T$ -value	$p$ -value	$T$ -value	$p$ -value	$T$ -value	$p$ -value	$T$ -value	$p$ -value	$T$ -value	$p$ -value	$\chi^{2}$ -value	$p$ -value
S1WantScold	$1.369$	$0.171$	$1.313$	$0.189$	$1.280$	$0.201$	$2.135$	$0.144$	$2.491$	$0.060$	$2.936$	$0.030$ *	$3.354$	$0.187$
S2WantCurse	$1.667$	$0.096$	$1.694$	$0.090$	$1.737$	$0.082$	$1.649$	$0.000$ *	$1.602$	$0.000$ *	$1.780$	$0.000$ *	$4.730$	$0.094$
S2WantScold	$1.914$	$0.056$	$2.114$	$0.034$ *	$1.969$	$0.049$ *	$2.471$	$0.072$	$2.558$	$0.090$	$2.740$	$0.070$	$4.140$	$0.126$
S2WantShout	$3.204$	$0.001$ *	$3.358$	$0.001$ *	$3.369$	$0.001$ *	$3.564$	$0.002$ *	$3.355$	$0.004$ *	$3.526$	$0.000$ *	$11.411$	$0.003$ *
S4WantShout	$2.098$	$0.036$ *	$2.327$	$0.020$ *	$2.216$	$0.027$ *	$2.503$	$0.058$	$2.490$	$0.082$	$2.580$	$0.106$	$3.688$	$0.158$
S1DoScold	$-2.128$	$0.033$ *	$-1.949$	$0.051$	$-1.848$	$0.065$	$2.172$	$0.136$	$2.078$	$0.000$ *	$2.110$	$0.000$ *	$4.730$	$0.094$
S2DoCurse	$-2.964$	$0.003$ *	$-2.946$	$0.003$ *	$-2.966$	$0.003$ *	$3.064$	$0.026$ *	$3.066$	$0.034$ *	$3.514$	$0.024$ *	$7.693$	$0.021$ *
S2DoScold	$-3.050$	$0.002$ *	$-2.871$	$0.004$ *	$-2.742$	$0.006$ *	$3.045$	$0.012$ *	$3.012$	$0.014$ *	$3.073$	$0.024$ *	$10.262$	$0.006$ *
S2DoShout	$-0.493$	$0.622$	$-0.130$	$0.897$	$-0.070$	$0.944$	$2.788$	$0.030$ *	$2.844$	$0.024$ *	$2.905$	$0.028$ *	$1.702$	$0.427$
S3DoCurse	$-2.686$	$0.007$ *	$-2.573$	$0.010$ *	$-2.667$	$0.008$ *	$2.596$	$0.068$	$2.618$	$0.076$	$2.650$	$0.074$	$7.238$	$0.027$ *
S3DoScold	$-2.112$	$0.035$ *	$-2.134$	$0.033$ *	$-2.141$	$0.032$ *	$2.153$	$0.100$	$2.118$	$0.188$	$2.242$	$0.158$	$5.868$	$0.053$

Across all items, the nonparametric approach with the bandwidth parameter $\zeta=0.32$ produced the most accurate estimates of the IRCs, with the exception of S2DoScold, S3DoShout, and S4DoShout, where the logistic regression model achieved slightly lower squared bias. Conversely, logistic regression yielded the least precise estimates for 14 of the 24 items, while the nonparametric method with $\zeta=0.26$ showed the lowest precision for the remaining 10 items.

To illustrate how the proposed method performs in practice, the item S2WantCurse provides a representative example. It was estimated most precisely by the nonparametric approach with the bandwidth $\zeta=0.32$ and was also identified as exhibiting DIF by this method (Figure 6(a)). In contrast, logistic regression produced a visibly poorer fit for the same item and did not classify it as DIF (Figure 6(b)). This example demonstrates the ability of the nonparametric framework to reveal subtle group differences that may be undetected under parametric modelling.

5 Discussion

In this work, we proposed a novel nonparametric approach for comparing IRCs to detect DIF in binary items. Building on the general framework for comparing regression curves introduced by srihera2010nonparametric, we adapted their methodology to address a common challenge in psychometrics and social sciences: testing differences between IRCs. Our main methodological contributions include (1) a new estimator of the asymptotic variance of the test statistic tailored to account for the binary nature of the data, (2) a derivation of optimal weight functions that maximize the local power of the test, and a procedure for estimating these weights in a realistic setting where they are unknown, and (3) a wild bootstrap procedure to approximate the unknown asymptotic distribution of the test statistic when using estimate of optimal weight, enabling robust hypothesis testing. These innovations extend existing approaches by providing a flexible, practical, and theoretically grounded framework for DIF detection that directly estimates IRCs, allowing nuanced detection of group differences that may be missed by parametric methods.

To evaluate the performance of the proposed approach, we conducted a Monte Carlo simulation study comparing it using various weighting schemes to the logistic regression method. All methods demonstrated good control of Type I error. The nonparametric approach using the optimal weights achieved power rates close to those of the logistic regression method, and it outperformed it in several scenarios, especially in scenarios with multiple intersections of the underlying IRCs. When comparing different weight functions within the nonparametric approach, the fixed weights performed similarly to the optimal weights in cases where the IRCs did not intersect, and may be recommended when it can be assumed that one group is advantaged over the other group for all levels of the matching criterion. However, when IRCs intersected, the estimate of the optimal weights using the wild bootstrap technique substantially improved performance over fixed weights.

To illustrate the proposed DIF detection method, we analysed a real-life dataset on verbal aggression. Across methods, eleven items were flagged for DIF, with three items consistently detected by all approaches. The nonparametric methods, especially those with fixed weights and estimated optimal weights with bandwidth $\zeta=0.32$ , identified the most DIF items, while logistic regression flagged the fewest. Agreement among nonparametric variants was high, but alignment between nonparametric methods and the logistic regression method was moderate. Importantly, the nonparametric method with $\zeta=0.32$ generally produced the most accurate IRC estimates and successfully detected subtle DIF undetected by logistic regression. This demonstrates the potential of the proposed nonparametric framework to uncover nuanced group differences that parametric models may overlook.

Our approach complements existing nonparametric approaches, such as the Mantel-Haenszel test (mantel1959statistical; holland1988differential), the SIBTEST method (shealy1993model), or standardisation (dorans1986demonstrating), which do not directly model IRCs. In contrast, the newly proposed method explicitly estimates and compares IRCs, allowing for flexible, data-driven weighting that enhances sensitivity to complex DIF patterns, including intersections and nonuniform differences between groups. It also complements kernel smoothing DIF detection methods, such as a kernel-smoothed SIBTEST (douglas1996kernel) or TestGraf, a graphical DIF method with a kernel smoothing for estimating the conditional probability of correct answers related to proficiency estimates (bolt2006testing; ramsay2000testgraf). Unlike these approaches, our method is grounded in a direct comparison of IRCs within a unified nonparametric framework, offering both interpretability and methodological flexibility.

In past decades, many authors have dealt with the topic of nonparametric comparison of regression curves, including dette2001nonparametric; hall1990bootstrap; neumeyer2003nonparametric and scheike2000comparison. Our work builds on the general approach of srihera2010nonparametric, which accommodates a random design and therefore allows for direct extension addressing challenges specific to DIF detection in real-life settings of binary items.

The current study has several limitations, as well as potential directions for future research. First, the simulation study was limited in terms of the number of items, the proportion of DIF items, and sample sizes, as only small to moderate sample sizes were considered to ensure computational feasibility. This limitation precluded the inclusion of extended logistic regression models such as 3PL or 4PL models (barton1981upper; birnbaum1968statistical; hladka2025estimation) in the simulation study, as they require larger sample sizes for both groups. Second, this study focused exclusively on the Epanechnikov kernel. While alternative kernel functions could be considered, previous research suggests that the accuracy of estimation is generally robust to the choice (douglas1996kernel). Nonetheless, future studies could investigate whether alternative kernels offer practical advantages for detecting DIF in various settings. Third, three levels of the bandwidth parameter $h$ were examined. The choice of bandwidth is directly related to the precision of estimating optimal weight functions: If the $h$ is too small, the resulting estimate may be under-smoothed, leading to high variance. Conversely, a large $h$ may result in over-smoothing, which can obscure important features of the data. The bandwidth values chosen for this study were intended to cover a plausible range, while no large differences in power or rejection rates were observed. Fourth, the kernel-smoothing estimate of IRCs does not require monotonicity of item responses, which is a typical assumption in logistic regression or IRT models. Parametric models typically assume that the probability of a correct response increases monotonically with the latent trait. This assumption simplifies estimation and interpretation but may not always hold in practice, particularly when items are affected by multidimensional traits, guessing effects, or complex DIF patterns. However, the monotonicity assumption ensures the interpretability and scalability of test scores (see, e.g., mokken1971theory), required, for example, in nonparametric IRT models (douglas2001asymptotic; he2024extended).

In summary, the proposed nonparametric approaches, including a novel estimate of the optimal weights with the wild bootstrap, demonstrated control of significance levels and, in most cases, matched or exceeded the performance of the logistic regression method in detecting DIF. Importantly, the flexibility of the nonparametric framework allows it to capture complex patterns in IRCs, particularly in scenarios involving multiple intersections or non-monotonic structures, where traditional parametric methods may falter. These results highlight the substantial potential of our approach as a powerful and universal tool for DIF detection, expanding methodological options for applied psychometrics and advancing the analysis of item functioning in real-world testing contexts.

Appendix A Tables

Table A1: Rejection rates by the nonparametric approach with various weight functions and by the logistic regression method with respect to the sample size

n

and the parameter

\zeta

for different sources of DIF.

[0.5ex]DIF [0.75ex] source	$n$	Nonparametric									Logistic
		Optimal			Fixed			Bootstrap
		0.260	0.292	0.320	0.260	0.292	0.320	0.260	0.292	0.320
$a$
	50	0.000	0.000	0.000	0.063	0.065	0.066	0.052	0.053	0.054	0.069
	100	0.000	0.000	0.000	0.055	0.057	0.058	0.051	0.051	0.052	0.057
	200	0.000	0.000	0.000	0.056	0.058	0.060	0.055	0.057	0.056	0.054
	300	0.000	0.000	0.000	0.051	0.051	0.051	0.053	0.052	0.052	0.050
	400	0.000	0.000	0.000	0.052	0.052	0.052	0.060	0.057	0.052	0.052
$b$
	50	0.000	0.000	0.000	0.065	0.066	0.069	0.056	0.059	0.057	0.068
	100	0.000	0.000	0.000	0.059	0.060	0.061	0.057	0.056	0.056	0.062
	200	0.000	0.000	0.000	0.054	0.056	0.057	0.057	0.056	0.057	0.054
	300	0.000	0.000	0.000	0.056	0.057	0.058	0.054	0.056	0.051	0.055
	400	0.000	0.000	0.000	0.056	0.056	0.057	0.058	0.057	0.053	0.056
$c$
	50	0.000	0.000	0.000	0.062	0.063	0.066	0.053	0.055	0.056	0.066
	100	0.000	0.000	0.000	0.056	0.057	0.059	0.053	0.055	0.054	0.056
	200	0.000	0.000	0.000	0.055	0.055	0.057	0.052	0.054	0.053	0.055
	300	0.000	0.000	0.000	0.054	0.056	0.059	0.057	0.057	0.055	0.053
	400	0.000	0.000	0.000	0.058	0.058	0.060	0.058	0.056	0.054	0.055
$d$
	50	0.000	0.000	0.000	0.067	0.067	0.069	0.055	0.054	0.056	0.069
	100	0.000	0.000	0.000	0.056	0.056	0.057	0.054	0.053	0.055	0.055
	200	0.000	0.000	0.000	0.054	0.055	0.055	0.055	0.055	0.053	0.053
	300	0.000	0.000	0.000	0.057	0.058	0.057	0.056	0.054	0.052	0.053
	400	0.000	0.000	0.000	0.056	0.056	0.058	0.059	0.056	0.052	0.056
mix1
	50	0.000	0.000	0.000	0.058	0.059	0.060	0.050	0.050	0.051	0.066
	100	0.000	0.000	0.000	0.059	0.061	0.061	0.051	0.052	0.052	0.056
	200	0.000	0.000	0.000	0.056	0.057	0.058	0.054	0.055	0.058	0.053
	300	0.000	0.000	0.000	0.055	0.054	0.056	0.054	0.057	0.054	0.052
	400	0.000	0.000	0.000	0.056	0.057	0.056	0.055	0.055	0.051	0.054
mix2
	50	0.000	0.000	0.000	0.062	0.063	0.067	0.056	0.057	0.057	0.067
	100	0.000	0.000	0.000	0.054	0.056	0.057	0.052	0.055	0.054	0.055
	200	0.000	0.000	0.000	0.054	0.055	0.055	0.053	0.055	0.055	0.054
	300	0.000	0.000	0.000	0.051	0.053	0.052	0.053	0.055	0.051	0.052
	400	0.000	0.000	0.000	0.050	0.050	0.051	0.057	0.055	0.052	0.050

Table A2: Power rates by the nonparametric approach with various weight functions and by the logistic regression method with respect to the sample size and the parameter

\zeta