
Developments in Statistical Modelling



Preface

‘Developments in Statistical Modelling’ is a collection of 40 papers to be presented at
the 38th International Workshop on Statistical Modelling (IWSM), held from
14 to 19 July 2024 in Durham City, UK. The IWSM is the annual workshop of the
Statistical Modelling Society, with the purpose of promoting important developments,
extensions, and applications in statistical modelling and of bringing together statisticians
working on related problems from various disciplines. The contributions to this volume
reflect this spirit.
The IWSM series has its roots in some rather informal meetings in the late 1980s
arising out of the GLIM community, which were formalized in subsequent years to
become a regular series of workshops. It has since travelled a few times to the UK: 1994
to Exeter, 2010 to Glasgow, 2018 to Bristol, and now in 2024 to Durham. Along the
way, the initially informal network matured, with the launch of a journal, ‘Statistical
Modelling: An International Journal’, in 2000, and the foundation of the Statistical
Modelling Society in 2003.
We are honoured and delighted to be able to host this workshop at Durham, which,
we believe, is a lovely location for a conference in summer. The cathedral and the
River Wear give the historic city centre a unique flair. All relevant locations in
the city can be reached on foot, and there is stunning countryside—with the Pennines
to the west and the coast to the east—that delegates will have the opportunity to explore.
On this occasion, some rather distinctive mathematical history of Durham may also
be mentioned: the Benedictine monk and historian Bede introduced in 725 a novel
finger-counting technique, which allows numbers up to 9999 to be counted with
ten fingers only. The tradition of the name ‘Bede’ lives on: Durham hosts the Tier
2 supercomputer Bede, a facility of the N8 Centre of Excellence in Computationally
Intensive Research (CIR).
The papers published in this volume were submitted in early 2024, for either oral
or poster presentation at the conference, and underwent a careful refereeing
process involving the 18 members of the Scientific Committee plus a few other referees
drawn from the Organizing Committee of the conference. Papers which were eligible
for this volume (and whose authors consented to publication) underwent a second round
of review. This process did not distinguish between oral and poster submissions;
29 of the 40 papers contained in this volume are presented orally at the conference, and
11 as posters. Notably, about half of the papers contained in this volume are presented
by PhD students. Papers which are presented at the conference but which do not
appear in this volume are provided in the conference's online PDF proceedings
volume, available at https://maths.dur.ac.uk/iwsm2024/.
The papers presented in this volume cover developments in a wide range of topics
including generalized linear models, mixture models, regularization techniques,
hidden Markov models, smoothing methods, censoring and imputation techniques,
Gaussian processes, spatial statistics, shape modelling, goodness-of-fit problems, and
network analysis, among others. A wide range of applications is considered within
these papers, with a notable abundance of contributions in the field of biostatistics.
The contributions are equally frequentist and Bayesian—but this is, in any case, a
categorization which this community has synergetically overcome.
We would like to draw particular attention to the keynote contribution ‘Statistical
Modelling for Big and Little Data’ by Robin Henderson, which looks at contemporary
data problems from the wider viewpoint of data science, encompassing statistics and
machine learning, highlighting that both small and large data sets come with their own
challenges, and can be equally simple—or hard!—to model and analyse.
We are looking forward to an enjoyable and stimulating conference, and we hope that
this volume will contribute to initiating and sustaining discussions about problems in
statistical modelling, potentially triggering new developments and ideas. We already
look forward to the next edition of the workshop in Limerick in 2025, where
perhaps some of these will be presented.
Acknowledgements. The Editors wish to thank the Durham Research Methods
Centre (DRMC) for their financial support of the conference.

May 2024

Jochen Einbeck
Hyeyoung Maeng
Emmanuel Ogundimu
Konstantinos Perrakis
Contents

REML for Two-Dimensional P-Splines . . . 1
Martin P. Boer

Learning Bayesian Networks from Ordinal Data - The Bayesian Way . . . 7
Marco Grzegorczyk

Latent Dirichlet Allocation and Hidden Markov Models to Identify Public Perception of Sustainability in Social Media Data . . . 14
Luigi Cao Pinna, Claire Miller, and Marian Scott

Bayesian Approaches to Model Overdispersion in Spatio-Temporal Binomial Data . . . 21
Mabel Morales-Otero and Vicente Núñez-Antón

Elicitation of Priors for Intervention Effects in Educational Trial Data . . . 28
Qing Zhang, Germaine Uwimpuhwe, Dimitris Vallis, Akansha Singh, Tahani Coolen-Maturi, and Jochen Einbeck

Addressing Covariate Lack in Unit-Level Small Area Models Using GAMLSS . . . 34
Lorenzo Mori and Maria Rosaria Ferrante

Optimism Correction of the AUC with Complex Survey Data . . . 41
Amaia Iparragirre and Irantzu Barrio

Statistical Models for Patient-Centered Outcomes in Clinical Studies . . . 48
Gillian Heller, Andrew Forbes, and Stephane Heritier

Bayesian Hidden Markov Models for Early Warning . . . 55
Daniele Tancini, Francesco Bartolucci, and Silvia Pandolfi

A Bayesian Markov-Switching for Smooth Modelling of Extreme Value Distributions . . . 62
Vincenzo Gioia, Gioia Di Credico, and Francesco Pauli

Derivatives of the Log of a Determinant . . . 68
Paul H. C. Eilers and Martin P. Boer

Monitoring Viral Infections in Severe Acute Respiratory Syndrome Patients in Brazil . . . 75
João Flávio Andrade Silva, Rafael Izbicki, Leonardo S. Bastos, and Guilherme P. Soares

A Computationally Efficient Spatio-Temporal Fusion Model for Reflectance Data . . . 81
Zhaoyuan Zou, Ruth O'Donnell, Claire Miller, Duncan Lee, and Craig Wilkie

Spatial Confounding in Gradient Boosting . . . 88
Lars Knieper, Thomas Kneib, and Elisabeth Bergherr

Adaptive Generalized Logistic Lasso and Its Application to Rankings in Sports . . . 95
Robert Bajons and Kurt Hornik

A Biclustering Approach via Mixture of Latent Trait Analyzers for the Analysis of Digital Divide in Italy . . . 102
Dalila Failli, Bruno Arpino, Maria Francesca Marino, and Francesca Martella

Shrinkage in a Bayesian Panel Data Model with Time-Varying Coefficients . . . 109
Roman Pfeiler and Helga Wagner

Integrating Single Index Effects in Generalized Additive Models . . . 116
Claudia Collarin and Matteo Fasiolo

An Underrated Prior Distribution for Proportions. The Logistic–Normal for Dynamical Football Predictions . . . 121
Rui Martins

A Comparison of Extreme Gradient and Gaussian Process Boosting for a Spatial Logistic Regression on Satellite Data . . . 128
Michael Renfrew and Bruce J. Worton

Gene Coexpression Analysis with Dirichlet Mixture Model: Accelerating Model Evaluation Through Closed-Form KL Divergence Approximation Using Variational Techniques . . . 134
Samyajoy Pal and Christian Heumann

Optimizing Variable Selection in Multi-Omics Datasets: A Focus on Exclusive Lasso . . . 142
Dayasri Ravi and Andreas Groll

Non-parametric Frailty Model for the Natural History of Prostate Cancer; Using Data from a Screening Trial . . . 148
Ilse Cuevas Andrade, Ardo van den Hout, and Nora Pashayan

Parametric and Non-parametric Bayesian Imputation for Right Censored Survival Data . . . 153
Shirin Moghaddam, John Newell, and John Hinde

An Updated Wilcoxon–Mann–Whitney Test . . . 159
Paul Wilson

Estimating a Lower Bound of the Population Size in Capture-Recapture Experiments with Right Censored Data . . . 166
Anabel Blasco-Moreno and Pere Puig

Inference for Quasi-reaction Models with Covariate-Dependent Rates . . . 172
Matteo Framba, Veronica Vinciotti, and Ernst C. Wit

Modelling of Overdispersed Count Rates . . . 179
John Hinde, Alberto Alvarez-Iglesias, John Ferguson, Clarice G. B. Demétrio, John Crown, Bryan T. Hennessy, and Vicky Donachie

Sparse Intrinsic Gaussian Processes for Prediction on Manifolds: Extending Applications to Environmental Contexts . . . 185
Yuan Liu, Mu Niu, and Claire Miller

Functional Copula Graphical Regression Model for Analysing Brain-Body Rhythm . . . 191
Rita Fici, Luigi Augugliaro, and Ernst C. Wit

State-Space Models for Clustering of Compositional Trajectories . . . 197
Andrea Panarotto, Manuela Cattelan, and Ruggero Bellio

Approximated Gaussian Random Field Under Different Parameterizations for MCMC . . . 204
Joaquin Cavieres, Cole C. Monnahan, David Bolin, and Elisabeth Bergherr

Shape Analysis of AF Segments for Rapid Assessment of Mohs Layers for BCC Presence by AF-Raman Microscopy . . . 211
Alexey A. Koloydenko, Ioan Notingher, Radu Boitor, and Jüri Lember

Additive Mixed Models for Location, Scale and Shape via Gradient Boosting Techniques . . . 218
Colin Griesbach and Elisabeth Bergherr

Regression Analysis with Missing Data Using Interval Imputation . . . 224
Tathagata Basu

Models of Network Delay . . . 231
Ronan Wallace, Xabier Garcia Andrade, Pedro Kayser, Zhao Luo, Hrishav Mukherjee, Ruan Nunes, and Marc Warrior

Estimating Dose and Time of Exposure from a Protein-Based Radiation Biomarker . . . 239
Yilun Cai, Jochen Einbeck, Stephen Barnard, and Elizabeth Ainsbury

Statistical Modelling for Big and Little Data . . . 246
Robin Henderson

Semi-Markov Multistate Model with Interval-Censored Transition Times . . . 255
Xavier Piulachs, Klaus Langohr, and Guadalupe Gómez

A Distance-Based Statistic for Goodness-of-Fit Assessment . . . 263
Darshana Jayakumari, Jochen Einbeck, John Hinde, and Rafael A. Moral

Author Index . . . 269


REML for Two-Dimensional P-Splines

Martin P. Boer(B)

Wageningen University and Research, Wageningen, The Netherlands


[email protected]

Abstract. We propose a new method based on residual (or restricted)
maximum likelihood (REML) for P-splines. Existing methods use a
transformation of P-splines to a mixed model; here it is shown that a
more direct method can be used which keeps the sparse structure of
P-splines. The method is illustrated with a two-dimensional example
using the R-package LMMsolver on CRAN. We will show that for
this example LMMsolver is several orders of magnitude faster than other
methods, which use a transformation of P-splines to mixed models where
the sparse structure of P-splines is lost.

Keywords: Mixed models · Precision matrices · Sparse matrix algebra

1 Introduction

Using B-splines for penalized regression, also known as P-splines [3], can offer
computational efficiency due to the local character of B-splines. This results in
sparse linear equations that can be solved easily. However, the primary challenge
lies in determining the optimal penalty parameter. One effective approach
to tackle this issue is through mixed models and restricted maximum likelihood
(REML) [8]. Various methods have been suggested to convert the original penalized
B-spline model into a mixed model. The drawback of most existing transformations
to mixed models lies in their inability to preserve the local character
of B-splines, consequently diminishing computational efficiency.
In [1] a new method was proposed, using a sparse transformation to mixed
models. This method is computationally more efficient than other approaches. In
this paper we will show that REML can be used directly for P-splines, without
a transformation to a mixed model.

2 Sparse Mixed Models for P-Splines

We will first briefly discuss the sparse mixed model formulation proposed by Boer
[1]. For the moment we will assume the one-dimensional case, to keep notation
simple, and extend to the two-dimensional case later on. First we introduce some
notation. Let $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$ be the response variable, depending on the
variable $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$, defined on the interval $[x_{\min}, x_{\max}]$. Let $\mathbf{B}$ be an
$n \times q$ matrix with $(i,j)$th entry $B_{j,p}(x_i)$, where the $B_{j,p}$ are $p$-th degree B-spline
functions for $j = 1, \ldots, q$. The constants $\xi_{i,j,p}$ are defined by
$$x^i = \sum_{j=1}^{q} \xi_{i,j,p}\, B_{j,p}(x), \qquad x \in [x_{\min}, x_{\max}].$$
Analytic expressions for $\xi_{i,j,p}$ are given in [6]. The matrix $\mathbf{G}$ is a $q \times k$ matrix
with $(i,j)$th entry $\xi_{i,j,p}$. The $(q - k) \times q$ matrix $\mathbf{D}$ applies $k$-th order difference
penalties. The matrix $\mathbf{X}$ is defined by $\mathbf{X} = \mathbf{B}\mathbf{G} = [\mathbf{1}\,|\,\mathbf{x}\,|\,\cdots\,|\,\mathbf{x}^{k-1}]$, where
$\mathbf{D}\mathbf{G} = \mathbf{0}$. Let $\mathbf{B}_*$ be a $k \times q$ matrix with $(i,j)$th entry $B_{j,p}(x_{*,i})$, where the $k$
reference points $x_{*,i}$ can be chosen arbitrarily, subject to the condition $|\mathbf{B}_*\mathbf{G}| \neq 0$;
see [1] for further details.
The sparse mixed model formulation for P-splines is defined by [1]
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e}, \qquad \mathbf{u} \sim N(\mathbf{0}, \boldsymbol{\Sigma}), \quad \mathbf{e} \sim N(\mathbf{0}, \mathbf{R}), \qquad (1)$$
with $\mathbf{X} = \mathbf{B}\mathbf{G}$, $\mathbf{Z} = \mathbf{B}$, $\mathbf{R}^{-1} = \theta_0 \mathbf{I}_n$ and $\boldsymbol{\Sigma}^{-1} = \theta_1(\mathbf{D}'\mathbf{D} + \mathbf{B}_*'\mathbf{B}_*)$. The
restricted log-likelihood of Eq. (1) is given by
$$2\log L = \log|\mathbf{R}^{-1}| + \log|\boldsymbol{\Sigma}^{-1}| - \log|\mathbf{C}| - \hat{\mathbf{e}}'\mathbf{R}^{-1}\hat{\mathbf{e}} - \hat{\mathbf{u}}'\boldsymbol{\Sigma}^{-1}\hat{\mathbf{u}}, \qquad (2)$$
where $\hat{\mathbf{e}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{Z}\hat{\mathbf{u}}$. The mixed model equations are given by
$$\begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \boldsymbol{\Sigma}^{-1} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \end{bmatrix}, \qquad (3)$$
with on the left-hand side the mixed model coefficient matrix $\mathbf{C}$. The matrices
$\mathbf{R}^{-1}$, $\boldsymbol{\Sigma}^{-1}$, and $\mathbf{C}$ are sparse for the P-splines mixed model, and therefore
$\log L$ and its partial derivatives with respect to the precision parameters $\theta_m$ can be
calculated in a computationally efficient way.
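To make the sparse structure tangible, here is a minimal R sketch of the one-dimensional ingredients (basis $\mathbf{B}$ and difference matrix $\mathbf{D}$); the equally spaced knot construction and all names are illustrative assumptions, not LMMsolver internals.

```r
library(splines)  # splineDesign() for the B-spline basis
library(Matrix)   # sparse matrix classes

n    <- 200
x    <- seq(0, 1, length.out = n)        # covariate
nseg <- 17; p <- 3; k <- 2               # segments, B-spline degree, penalty order
h    <- (max(x) - min(x)) / nseg
knots <- seq(min(x) - p * h, max(x) + p * h, length.out = nseg + 2 * p + 1)
B <- Matrix(splineDesign(knots, x, ord = p + 1), sparse = TRUE)  # n x q, banded
q <- ncol(B)                             # q = nseg + p
D <- Matrix(diff(diag(q), differences = k), sparse = TRUE)       # (q - k) x q
# the sparsity that the paper exploits: B has only p + 1 non-zeros per row,
# and D'D is a banded q x q penalty matrix
```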
Extension to two-dimensional P-splines is relatively straightforward. For details
and extension to higher dimensions see [1]. First we define the covariates $\mathbf{x}_1$ and
$\mathbf{x}_2$, with corresponding $n \times q_i$ matrices $\mathbf{B}_i$ $(i = 1, 2)$. The $n \times q$ matrix $\mathbf{B}$ is
defined by $\mathbf{B} = \mathbf{B}_1 \otimes_r \mathbf{B}_2$, where $\otimes_r$ denotes the row-wise Kronecker product,
and with $q = q_1 \cdot q_2$. Matrix $\mathbf{G} = \mathbf{G}_1 \otimes \mathbf{G}_2$ has dimension $q \times k$, with $k = k_1 \cdot k_2$.
The matrix $\mathbf{B}_* = \mathbf{B}_{*,1} \otimes \mathbf{B}_{*,2}$ has dimension $k \times q$. Finally, for $\boldsymbol{\Sigma}$ we have
$$\boldsymbol{\Sigma}^{-1} = \theta_1\, \mathbf{D}_1'\mathbf{D}_1 \otimes \mathbf{I}_{q_2} + \theta_2\, \mathbf{I}_{q_1} \otimes \mathbf{D}_2'\mathbf{D}_2 + (\theta_1 + \theta_2)\,\mathbf{B}_*'\mathbf{B}_*, \qquad (4)$$
where $\mathbf{D}_1$ and $\mathbf{D}_2$ are difference penalty matrices.
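The row-wise Kronecker product can be illustrated via the column-wise Khatri-Rao product available in the Matrix package; this helper is our own sketch, not necessarily how LMMsolver assembles $\mathbf{B}$.

```r
library(Matrix)

# row-wise Kronecker product: row i of the result is kronecker(B1[i, ], B2[i, ])
rowwise_kron <- function(B1, B2) {
  stopifnot(nrow(B1) == nrow(B2))
  # KhatriRao() is column-wise, so transpose before and after
  t(KhatriRao(t(B1), t(B2)))
}
# B <- rowwise_kron(B1, B2)  # n x (q1 * q2), still sparse
```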

3 Back Transformation from Mixed Models to P-Splines


In this section we will show how the restricted log-likelihood can be written
in terms of P-splines, without a transformation to mixed models. We define the
following invertible linear transformation [1]:
$$\begin{bmatrix} \boldsymbol{\eta} \\ \mathbf{a} \end{bmatrix} = \begin{bmatrix} \mathbf{0}_k & \mathbf{B}_* \\ \mathbf{G} & \mathbf{I}_q \end{bmatrix} \begin{bmatrix} \boldsymbol{\beta} \\ \mathbf{u} \end{bmatrix}, \qquad (5)$$
with determinant
$$\begin{vmatrix} \mathbf{0}_k & \mathbf{B}_* \\ \mathbf{G} & \mathbf{I}_q \end{vmatrix} = (-1)^k |\mathbf{B}_*\mathbf{G}| \neq 0. \qquad (6)$$
The vector $\mathbf{a}$ is of length $q$ and $\boldsymbol{\eta}$ is a vector of length $k$. The mixed model
coefficient matrix $\mathbf{C}$ can be decomposed as
$$\mathbf{C} = \begin{bmatrix} \mathbf{0}_k & \mathbf{G}' \\ \mathbf{B}_*' & \mathbf{I}_q \end{bmatrix} \begin{bmatrix} (\theta_1 + \theta_2)\mathbf{I}_k & \mathbf{0} \\ \mathbf{0} & \mathbf{B}'\mathbf{R}^{-1}\mathbf{B} + \mathbf{P} \end{bmatrix} \begin{bmatrix} \mathbf{0}_k & \mathbf{B}_* \\ \mathbf{G} & \mathbf{I}_q \end{bmatrix}, \qquad (7)$$
where
$$\mathbf{P} = \theta_1\, \mathbf{D}_1'\mathbf{D}_1 \otimes \mathbf{I}_{q_2} + \theta_2\, \mathbf{I}_{q_1} \otimes \mathbf{D}_2'\mathbf{D}_2. \qquad (8)$$
Using Eqs. (3), (5), and (7) it can be derived that the estimates $\hat{\boldsymbol{\eta}}$ and $\hat{\mathbf{a}}$
are given by
$$\begin{bmatrix} (\theta_1 + \theta_2)\mathbf{I}_k & \mathbf{0} \\ \mathbf{0} & \mathbf{B}'\mathbf{R}^{-1}\mathbf{B} + \mathbf{P} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\eta}} \\ \hat{\mathbf{a}} \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \mathbf{B}'\mathbf{R}^{-1}\mathbf{y} \end{bmatrix}.$$
From this it follows that $\hat{\boldsymbol{\eta}} = \mathbf{0}$ and
$$(\mathbf{B}'\mathbf{R}^{-1}\mathbf{B} + \mathbf{P})\,\hat{\mathbf{a}} = \mathbf{B}'\mathbf{R}^{-1}\mathbf{y}. \qquad (9)$$
Using Eqs. (5) and (9) we obtain
$$2\log L = \log|\mathbf{R}^{-1}| + \log|\boldsymbol{\Sigma}^{-1}| - \log|\mathbf{C}| - \hat{\mathbf{e}}'\mathbf{R}^{-1}\hat{\mathbf{e}} - \hat{\mathbf{a}}'\mathbf{P}\hat{\mathbf{a}}, \qquad (10)$$
where $\hat{\mathbf{e}} = \mathbf{y} - \mathbf{B}\hat{\mathbf{a}}$.
From Eq. (7) it follows that $\log|\mathbf{C}|$ can be decomposed as
$$\log|\mathbf{C}| = \log|\mathbf{B}'\mathbf{R}^{-1}\mathbf{B} + \mathbf{P}| + 2\log|\mathbf{B}_*\mathbf{G}| + k\log(\theta_1 + \theta_2). \qquad (11)$$
Using some linear algebra it can be shown that $\log|\boldsymbol{\Sigma}^{-1}|$ can be decomposed as
$$\log|\boldsymbol{\Sigma}^{-1}| = \log|\mathbf{P}|_+ - \log|\mathbf{G}'\mathbf{G}| + 2\log|\mathbf{B}_*\mathbf{G}| + k\log(\theta_1 + \theta_2), \qquad (12)$$
where the pseudo-determinant $|\mathbf{P}|_+$ is defined as the product of the $q - k$ non-zero
eigenvalues of $\mathbf{P}$. The pseudo-determinant $|\mathbf{P}|_+$ can be obtained in an efficient
way by using the spectral decompositions $\mathbf{D}_i'\mathbf{D}_i = \mathbf{U}_i\boldsymbol{\Lambda}_i\mathbf{U}_i^{-1}$, where $\boldsymbol{\Lambda}_i$ is a
diagonal matrix with the eigenvalues of $\mathbf{D}_i'\mathbf{D}_i$ $(i = 1, 2)$. The two-dimensional
penalty matrix $\mathbf{P}$ defined by Eq. (8) can be written as [2]
$$\begin{aligned} \mathbf{P} &= (\theta_1\mathbf{U}_1\boldsymbol{\Lambda}_1\mathbf{U}_1^{-1}) \otimes (\mathbf{U}_2\mathbf{I}_{q_2}\mathbf{U}_2^{-1}) + (\theta_2\mathbf{U}_1\mathbf{I}_{q_1}\mathbf{U}_1^{-1}) \otimes (\mathbf{U}_2\boldsymbol{\Lambda}_2\mathbf{U}_2^{-1}) \\ &= (\mathbf{U}_1 \otimes \mathbf{U}_2)\,(\theta_1\boldsymbol{\Lambda}_1 \otimes \mathbf{I}_{q_2} + \theta_2\,\mathbf{I}_{q_1} \otimes \boldsymbol{\Lambda}_2)\,(\mathbf{U}_1 \otimes \mathbf{U}_2)^{-1}. \end{aligned}$$
From this we obtain that $|\mathbf{P}|_+$ can be calculated as the product of the non-zero
entries of a diagonal matrix:
$$|\mathbf{P}|_+ = |\theta_1\boldsymbol{\Lambda}_1 \otimes \mathbf{I}_{q_2} + \theta_2\,\mathbf{I}_{q_1} \otimes \boldsymbol{\Lambda}_2|_+. \qquad (13)$$
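Eq. (13) yields a cheap recipe for $\log|\mathbf{P}|_+$; the following is a small numeric sketch, assuming $\theta_1$, $\theta_2$, $\mathbf{D}_1$, $\mathbf{D}_2$ are available and using an ad hoc tolerance of our choosing to identify the $k$ zero eigenvalues.

```r
# eigenvalues of D1'D1 and D2'D2 (dense eigen is fine: these are small q_i x q_i matrices)
lam1 <- eigen(crossprod(as.matrix(D1)), symmetric = TRUE)$values
lam2 <- eigen(crossprod(as.matrix(D2)), symmetric = TRUE)$values
d <- outer(theta1 * lam1, theta2 * lam2, `+`)  # all values theta1*lam1_i + theta2*lam2_j
log_detP_plus <- sum(log(d[d > 1e-10]))        # sum over the q - k non-zero entries
```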


[Figure 1: map of the fitted precipitation anomaly (ypred) over the USA, plotted by Longitude and Latitude.]
Fig. 1. Fitted surface for monthly precipitation anomalies in the USA for April 1948,
using LMMsolver with 40 segments in both directions. Computation time is less than
one second, 250 times faster than the SOP package, which gives the same fit.

Substituting Eqs. (11) and (12) into Eq. (10) gives the following expression
for the REML log-likelihood:
$$2\log L = \log|\mathbf{R}^{-1}| + \log|\mathbf{P}|_+ - \log|\mathbf{G}'\mathbf{G}| - \log|\mathbf{C}_*| - \hat{\mathbf{e}}'\mathbf{R}^{-1}\hat{\mathbf{e}} - \hat{\mathbf{a}}'\mathbf{P}\hat{\mathbf{a}}, \qquad (14)$$
where $\mathbf{C}_* = \mathbf{B}'\mathbf{R}^{-1}\mathbf{B} + \mathbf{P}$. An efficient way to obtain $\log|\mathbf{C}_*|$ and to solve Eq. (9)
is by using a sparse Cholesky decomposition of $\mathbf{C}_*$.
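As a rough illustration of this step (LMMsolver itself uses a supernodal C++ implementation, not this code), Eq. (9) and $\log|\mathbf{C}_*|$ can be obtained with the Matrix package; sparse $\mathbf{B}$, penalty $\mathbf{P}$, precision $\theta_0$ and response $\mathbf{y}$ are assumed to be given.

```r
library(Matrix)

Cstar <- theta0 * crossprod(B) + P           # C* = B'R^{-1}B + P, with R^{-1} = theta0 * I_n
ch    <- Cholesky(Cstar, LDL = FALSE)        # sparse Cholesky with fill-reducing ordering
a_hat <- solve(ch, theta0 * crossprod(B, y)) # solves Eq. (9)
L     <- expand(ch)$L                        # lower-triangular factor (the permutation
                                             # does not affect the determinant)
log_detC <- 2 * sum(log(diag(L)))            # log|C*| = 2 * sum(log(L_ii))
```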
The expressions for the REML log-likelihood for P-splines and for standard
mixed models have a similar structure, as can be seen from the comparison
between Eqs. (2) and (14). There are two main differences. First, while the precision
matrix $\boldsymbol{\Sigma}^{-1}$ in Eq. (2) is positive definite, $\mathbf{P}$ in Eq. (14) is singular. However,
$\log|\mathbf{P}|_+$ can be calculated using Eq. (13) in an efficient way. A second difference
is the extra constant $-\log|\mathbf{G}'\mathbf{G}|$ in the REML P-splines formulation in Eq. (14),
which can be easily calculated or simply ignored.
An important element needed to calculate Eq. (14) and the partial derivatives
with respect to the precision parameters $\theta_m$ in an efficient way is avoiding
the calculation of the inverses of the precision matrices, which are not sparse.
One way to do this is to calculate the so-called sparse inverse. In LMMsolver
[1], automated differentiation of the Cholesky algorithm [11] was implemented.
Backward differentiation was used, which calculates the partial derivatives of
the likelihood efficiently [11]. A detailed example for one-dimensional P-splines
is given by Eilers and Boer [4]. The automated differentiation was implemented
in LMMsolver using supernodal Cholesky factorization [7]. The implementation
was written in C++ using the Rcpp package.

4 An Application and Comparison of Computation Times


We will use two-dimensional P-splines for the USprecip data set from the spam R
package [5]. There are n = 5,906 observations, with longitude-latitude positions
of monitoring stations, and standardized precipitation for April 1948, see [1] and
[9] for further details.

[Figure 2: computation time in seconds (log10 scale) plotted against the number of segments (20 to 50) for LMMsolver, LMMsolver2, mgcv, and SOP.]

Fig. 2. Comparison of computation times as a function of the number of segments.
LMMsolver is the sparse mixed model in [1], LMMsolver2 is the new REML P-splines
model. The y-axis is on a log10 scale, showing that the two LMMsolver methods are
several orders of magnitude faster than SOP and mgcv.

Cubic B-splines with second-order differences were used
for both latitude and longitude. The result is shown in Fig. 1, using 40 segments
in both directions. The models defined in [1] and [9] are both equivalent to the
new formulation presented in this article.
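The fit can be reproduced along the following lines; the LMMsolve() and spl2D() calls follow the CRAN documentation of LMMsolver, while the subsetting of USprecip to the observed stations is our assumption about the data layout.

```r
library(LMMsolver)
library(spam)

data(USprecip)
dat <- as.data.frame(USprecip)
dat <- subset(dat, infill == 1)   # assumed selection of the n = 5,906 observed stations

fit <- LMMsolve(fixed  = anomaly ~ 1,
                spline = ~spl2D(x1 = lon, x2 = lat, nseg = c(40, 40)),
                data   = dat)
summary(fit)
```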
We compared computation times with other methods for different numbers of
segments. The same number of segments was used in both dimensions. All computations
were performed in R 4.4.0 (R Core Team 2024) on a 2.90 GHz Intel
Core i5-9400 CPU with 24 GB of RAM under the Windows 10 operating system. Version
1.9-1 of mgcv [12], version 1.0.1 of SOP [10], and version 1.0.7 of LMMsolver
[1] were used. For mgcv we used the bam() function, with method="fREML".
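For mgcv, a plausible form of the benchmark call is shown below, using the dat frame from the sketch above; the basis dimension k is our reconstruction from the stated segment counts (roughly segments plus degree for cubic P-splines), as the exact value is not given.

```r
library(mgcv)

# tensor-product P-splines with fast REML, mirroring 40 segments per dimension
fit_bam <- bam(anomaly ~ te(lon, lat, bs = "ps", k = c(43, 43)),
               method = "fREML", data = dat)
```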
Figure 2 compares the computation times, showing that the sparse mixed
model formulation in [1] and the new REML P-splines model are several orders
of magnitude faster than SOP and mgcv. For example, for 50 segments in both
directions, the computation times for the two LMMsolver methods are both less
than 2 s, whereas SOP takes 11 min and mgcv needs 49 min. The new REML P-splines
model is a bit faster than the original sparse mixed P-splines model in [1], but
the differences are marginal.

5 Discussion
In this article we have shown that there is a direct connection between REML
and P-splines. Therefore a transformation of P-splines to a mixed model is not
needed. The REML P-splines model and the sparse mixed model formulation in
[1] keep the sparse structure of the B-splines, which makes them fast compared
to other methods, where the sparse structure is lost in the transformation to
mixed models.
Here we showed results for two-dimensional P-splines, but the same idea can
be extended to other dimensions. For Generalized Additive Models the sparse
mixed model formulation by Boer [1] has the advantage of modelling an explicit
term for the intercept, which makes the system identifiable. The sparse mixed
model P-splines in [1] and the new REML P-splines are closely connected, and
therefore the combination of the two formulations looks promising.

References
1. Boer, M.P.: Tensor product P-splines using a sparse mixed model formulation.
Stat. Model. 23, 465–479 (2023)
2. Currie, I.D., Durban, M., Eilers, P.H.C.: Generalized linear array models with
applications to multidimensional smoothing. J. R. Stat. Soc. Ser. B Stat. Methodol.
68(2), 259–280 (2006)
3. Eilers, P.H., Marx, B.D.: Flexible smoothing with B-splines and penalties. Stat.
Sci. 11(2), 89–121 (1996)
4. Eilers, P.H.C., Boer, M.P.: Derivatives of the log of a determinant. In: Developments
in Statistical Modelling, 38th International Workshop on Statistical Modelling (2024)
5. Furrer, R., Sain, S.R.: A sparse matrix R package with emphasis on MCMC
methods for Gaussian Markov random fields. J. Stat. Softw. 36, 1–25 (2010)
6. Lyche, T., Manni, C., Speleers, H.: Foundations of spline theory: B-splines, spline
approximation, and hierarchical refinement. Lect. Notes Math. 2219, 1–76 (2018)
7. Ng, E.G., Peyton, B.W.: Block sparse Cholesky algorithms on advanced
uniprocessor computers. SIAM J. Sci. Comput. 14, 1034–1056 (1993)
8. Patterson, H.D., Thompson, R.: Recovery of inter-block information when block
sizes are unequal. Biometrika 58, 545–554 (1971)
9. Rodriguez-Alvarez, M.X., Lee, D.J., Kneib, T., Durban, M., Eilers, P.H.: Fast
smoothing parameter separation in multidimensional generalized P-splines: the
SAP algorithm. Stat. Comput. 25, 941–957 (2015)
10. Rodríguez-Alvarez, M.X., Durban, M., Lee, D.J., Eilers, P.H.: On the estimation of
variance parameters in non-standard generalised linear mixed models: application
to penalised smoothing. Stat. Comput. 29, 483–500 (2019)
11. Smith, S.P.: Differentiation of the Cholesky algorithm. J. Comput. Graph. Stat. 4,
134 (1995)
12. Wood, S.N.: Generalized Additive Models: An Introduction with R. Chapman and
Hall/CRC (2017)
Learning Bayesian Networks from Ordinal
Data - The Bayesian Way

Marco Grzegorczyk(B)

Bernoulli Institute, FSE, Groningen University, Groningen, The Netherlands


[email protected]
https://www.math.rug.nl/stat/People/Marco

Abstract. We propose a new Bayesian method for Bayesian network
structure learning from ordinal data. Our Bayesian method is similar to a
recently proposed non-Bayesian method, referred to as the ordinal
structural expectation maximization (OSEM) method. Both methods assume
that the ordinal variables originate from Gaussian variables, which can
only be observed in discretized form, and that the dependencies in the
unobserved latent Gaussian space can be described in terms of Gaussian
Bayesian networks. In our simulation studies the new Bayesian method
yields significantly higher network reconstruction accuracies than the
OSEM method.

Keywords: Bayesian networks · ordinal data · latent Gaussian space ·
Markov Chain Monte Carlo (MCMC)

1 Introduction

Bayesian networks (BNs) make use of directed acyclic graphs (DAGs) to describe
the conditional dependencies among random variables $X_1, \ldots, X_n$. The large
majority of BN models assume that the $n$ random variables either have a joint
multivariate Gaussian distribution (see, e.g., [1,2]) or that each of the $n$ variables
has a nominal (categorical) distribution (see, e.g., [3]). BNs for variables with
ordinal (categorical) distributions have scarcely been explored in the literature.
Recently, Luo et al. [4] proposed the so-called OSEM ('ordinal structural
expectation maximization') method for BN learning from ordinal data. Luo et al. assume
that there is a Gaussian Bayesian network (DAG) among continuous variables
but that the continuous variables cannot be observed directly. The continuous
variables can only be observed in discretized form; i.e. each Gaussian variable
is in one-to-one correspondence with an ordinal (categorical) variable, obtained
through discretization of the corresponding Gaussian variable. Figure 1 provides
a graphical illustration of the relationships between the unobserved latent
Gaussian variables and the observable ordinal variables. BN structure learning then
aims at learning the DAG among the non-observable latent Gaussian variables
from the observable discretized (ordinal) variables.
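As a toy sketch of this latent-variable assumption (the edge weight and thresholds are our own illustrative choices, not taken from [4]), ordinal data can be generated from a two-node Gaussian DAG as follows.

```r
set.seed(1)
n  <- 500
z1 <- rnorm(n)                       # latent Gaussian root node
z2 <- 0.8 * z1 + rnorm(n, sd = 0.6)  # latent child: DAG edge z1 -> z2
# each latent variable is only observed as ordinal categories:
x1 <- cut(z1, breaks = c(-Inf, -0.5, 0.5, Inf), labels = FALSE)  # 3 levels
x2 <- cut(z2, breaks = c(-Inf, 0, Inf), labels = FALSE)          # 2 levels
table(x1, x2)  # the dependence survives the discretization
```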
We propose a Bayesian variant of OSEM, and we refer to it as the BoB
method (‘Bayesian way of modelling ordinal data in form of latent Bayesian
