Prox-DBRO-VR: A Unified Analysis on Decentralized Byzantine-Resilient Composite Stochastic Optimization with Variance Reduction

Jinhui Hu, Guo Chen, Huaqing Li, Xiaoyu Guo, and Tingwen Huang

This work is supported in part by the National Natural Science Foundation of China under Grant 62073344, and in part by the Fundamental Research Funds for the Central Universities of Central South University under Grant 2023ZZTS0355. J. Hu is with the School of Automation, Central South University, Changsha 410083, China (e-mail: [email protected]). G. Chen is with the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW 2052, Australia (e-mail: [email protected]). J. Hu and X. Guo are with the Department of Biomedical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China (e-mail: [email protected]; [email protected]). H. Li and L. Ran are with Chongqing Key Laboratory of Nonlinear Circuits and Intelligent Information Processing, College of Electronic and Information Engineering, Southwest University, Chongqing 400715, China (e-mail: [email protected]; [email protected]). T. Huang is with Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen 518055, China (e-mail: [email protected]).

Abstract

Decentralized stochastic gradient algorithms efficiently solve large-scale finite-sum optimization problems when all agents in the network are reliable. However, most of these algorithms are not resilient to adverse conditions, for instance, Byzantine issues. This paper aims to handle a class of general composite finite-sum optimization problems over multi-agent systems (MASs) in the presence of an unknown number of Byzantine agents. Based on the proximal mapping method, variance-reduced (VR) techniques, and a norm-penalized approximation strategy, we propose a decentralized Byzantine-resilient and proximal-gradient algorithmic framework, dubbed Prox-DBRO-VR, which achieves an optimization and control goal via local updates. To remove asymptotically the variance generated by evaluating the local noisy stochastic gradients, we incorporate two localized VR techniques (SAGA and LSVRG) into Prox-DBRO-VR to design Prox-DBRO-SAGA and Prox-DBRO-LSVRG. By analyzing the contraction relationships among the gradient-learning error, resilient consensus condition, and convergence error in a unified theoretical framework, it is demonstrated that both Prox-DBRO-SAGA and Prox-DBRO-LSVRG, with a well-designed constant (resp., decaying) step-size, converge linearly (resp., sub-linearly) inside an error ball around the optimal solution to the original problem under standard assumptions. A trade-off between the convergence accuracy and Byzantine resilience in both linear and sub-linear cases is also characterized. In numerical experiments, the effectiveness and practicability of the proposed algorithms are manifested via resolving a decentralized sparse machine-learning problem over MASs under various Byzantine attacks.

Index Terms:

Decentralized stochastic optimization, system security, cyber attacks, Byzantine-resilient algorithms, composite optimization problems, variance-reduced stochastic gradients.

Decentralized optimization has been extensively studied and has achieved notable progress in the fields of machine learning [1, 2, 3], smart grids [4], cooperative control [5], and non-cooperative games [6]. Decentralized algorithms offer high efficiency on massive-scale optimization problems, good scalability over large-scale MASs, and a lower communication burden than that placed on the master/central agent in the parameter-server structure.

With the rapid advancement of MASs, there are unavoidable security issues in the process of optimization and control, such as data poisoning, software bugs, malfunctioning devices, cyber attacks, and privacy leakage. Many endeavors have been devoted to developing various privacy-preserving methods in decentralized optimization, for instance, differential privacy [7, 8]. Nevertheless, privacy issues are beyond the scope of this paper, as they do not directly cause compromised agents or disrupt the normal iteration of algorithms. Other issues, however, may lead to node-level failures during multi-agent optimization and control, which are known as Byzantine problems [9], where the malfunctioning or compromised agents are referred to as Byzantine agents. Byzantine agents colluding with each other are able to impede some notable Byzantine-free decentralized optimization algorithms [10, 11, 2, 5, 12, 13, 14, 6, 15, 4, 1, 16] from achieving the optimal solution to the optimization problem, or even cause disagreement and divergence [17]. For example, if a reliable agent is attacked and controlled by adversaries, the attacker can manipulate the agent to send falsified information to its different reliable neighboring agents at each iteration. This can significantly hinder reliable agents from achieving convergence, and even impede consensus if a Byzantine agent deliberately sends different misleading messages to its different reliable neighbors [18]. Therefore, researchers have been concentrating on designing decentralized resilient algorithms [19, 20, 21, 22, 23, 24, 25, 26, 27] to alleviate or counteract the negative impact caused by Byzantine agents. In fact, there are various approaches to guarantee decentralized Byzantine resilience. One popular line is to combine various (iterative) screening or filtration techniques with decentralized algorithms.
To name a few, a pioneering work [17] achieves Byzantine resilience via adopting the trimmed mean (TM) method, i.e., discarding a subset of the largest and smallest messages in the aggregation step. However, [17] focuses on a class of scalar-valued problems for decentralized optimization. A follow-up work ByRDiE [28] combines decentralized coordinate gradient descent step with the TM method to tackle the vector-valued problem in decentralized learning. One imperfection of ByRDiE is its expensive computational overhead and low efficiency in dealing with large-scale finite-sum optimization problems due to the implementation of one-coordinate-at-one-iteration update. Hence, BRIDGE [22] combines respectively four screening techniques including coordinate-wise trimmed-mean, coordinate-wise median, Krum function, and a combination of Krum and coordinate-wise trimmed mean, with decentralized gradient descent (DGD) [29] to devise a Byzantine-resilient algorithmic framework. However, these four screening mechanisms either suffer from a high computational complexity or introduce extra restrictions on the number of neighbors and the network topology. Literature [30] extends a centered clipping technique [31] to a self-centered clipping version for a decentralized implementation, which not only realizes Byzantine resilience, but resolves a category of decentralized non-convex optimization problems. One imperfection of [30] lies in the fact that the localized clipping radius parameter depends on global information. This may impede its decentralized running over a large-scale MAS. Another work [32] designs a two-stage technique to filter out the Byzantine attacks, which can work in the presence of an arbitrary quantity of Byzantine agents while any clairvoyant knowledge of the identities of Byzantine agents is not required. 
Recent work [25] systematically analyzes the relationship between two critical points, i.e., the doubly-stochastic weight matrix and consensus, in the development of decentralized Byzantine-resilient methods. On top of that, an iterative screening-based resilient aggregation rule, dubbed IOS, is designed in [25], which achieves Byzantine resilience and a controllable (asymptotic) convergence error under the assumptions of bounded inner (node-level noisy stochastic gradients) and outer (network-level aggregated gradients) variations.

Decentralized algorithms [28, 32, 22, 25, 24] achieve Byzantine resilience via adopting various screening or filtering techniques. Nevertheless, the screening- or filtration-based methods may not only place a restriction on the number of neighboring agents, but also incur an additional cost of $\mathcal{O}\left(mn\right)$ for each reliable agent $i$ at each iteration (see [22, TABLE II]), where $\mathcal{O}$, $m$, and $n$ denote an upper bound on the growth rate of the complexity, the total number of agents (both reliable and Byzantine) in the network, and the dimension of the decision variable, respectively. This could be prohibitively expensive when the MAS is large-scale and the decision variable is high-dimensional. RSA [27] is designed to achieve Byzantine resilience while avoiding such an additional cost; it combines an $a$-norm ($a\geq 1$) penalized approximation method with the stochastic sub-gradient descent method to realize resilient aggregation in a distributed fashion. [26] is a decentralized extension of RSA [27]. Via integrating with a noise-shuffle strategy, [33] extends [27] to distributed federated learning, which enhances users’ differential privacy. However, both RSA [27] and [26, 33] establish only sub-linear convergence rates of the proposed algorithms, which are rather slow and have huge potential to be accelerated. A recent decentralized Byzantine-resilient algorithm DECEMBER [34] accelerates the convergence rate via incorporating two VR techniques. Although DECEMBER realizes simultaneously Byzantine resilience and linear convergence, it is still confined to resolving finite-sum optimization problems with only a single smooth objective function (a smooth function is one that has Lipschitz continuous gradients).

-A Motivations

On the one hand, decentralized Byzantine-resilient methods [35, 27, 26, 31, 32, 22, 24, 25, 23, 34, 30, 33] cannot handle optimization problems with a non-smooth objective function, which is indispensable in many practical applications, such as sparse machine learning [36, 37], model predictive control [5], and signal processing [38]. On the other hand, although various decentralized algorithms [38, 39, 36, 37, 12, 40, 3] provide many insights into solving the composite finite-sum optimization problem, they all fail to consider any possible security issues over MASs. This renders the reliable agents under the algorithmic framework of [38, 39, 36, 40, 37, 12, 3] vulnerable to Byzantine attacks or failures. Therefore, to bridge this gap, this paper studies a category of composite finite-sum optimization problems in the presence of Byzantine agents, where the local objective function associated with each agent consists of both smooth and non-smooth parts. In a nutshell, the study of decentralized Byzantine-resilient composite stochastic optimization is non-trivial, which constitutes the main motivation of this paper. The integration of VR techniques to reduce the per-iteration computational cost when evaluating the batch gradients [38, 28, 17, 39, 37, 40, 22, 24, 19] and remove asymptotically the variance incurred by estimating noisy stochastic gradients [27, 31, 32, 26, 14, 30, 25, 23] serves as a side motivation. In contrast to a recent notable algorithm, (event-triggered) MW-MSR [18, 41], which proposes an alternative approach to achieving Byzantine-resilient consensus with asynchronous multi-hop communication through detection and filtration, this paper focuses on decentralized optimization problems.
In such problems, the heterogeneity in local gradient evaluations can lead to greater variance among reliable agents’ states [30], which may cause false filtration when the reliable information is mistakenly identified as Byzantine information. This observation motivates the design of a screening- and filtration-free decentralized Byzantine-resilient algorithm tailored to handle the worst-case scenario of Byzantine problems in decentralized optimization. (The worst case indicates that the number of Byzantine agents is unknown and that they can be omniscient, sending arbitrary falsified messages to their reliable neighbors.)

-B Contributions

1. This paper develops a decentralized Byzantine-resilient and proximal-gradient algorithmic framework, dubbed Prox-DBRO-VR, to resolve a class of composite (smooth + non-smooth) finite-sum optimization problems over MASs in the worst case. The challenge of studying the non-smooth objective function in the presence of Byzantine agents stems from additional error terms consisting of coupled Byzantine and reliable information, in contrast to Byzantine-free composite optimization [38, 39, 36, 37, 12, 40, 3] and Byzantine-resilient smooth optimization [35, 27, 26, 31, 32, 22, 24, 25, 23, 34, 30, 33]. To handle these error terms, we seek upper bounds on the sub-differentials related to Byzantine agents after applying the non-expansiveness of the proximal operator in the linear convergence case and the bounded-gradient condition in the sub-linear convergence case.

2. Inspired by [36, 11], we incorporate localized versions of two VR techniques, SAGA [42] and LSVRG [43], into Prox-DBRO-VR to propose two decentralized Byzantine-resilient stochastic gradient algorithms, namely Prox-DBRO-SAGA and Prox-DBRO-LSVRG, both of which trim the per-iteration computational cost in evaluating batch gradients [38, 28, 17, 39, 37, 40, 22, 24, 19] and eliminate the bounded-variance assumption required by estimating local noisy stochastic gradients [27, 31, 32, 26, 14, 30, 25, 23, 33]. The challenge of studying VR techniques in the presence of Byzantine agents lies in seeking an appropriate (Lyapunov) candidate function with respect to the gradient-learning sequences and errors. We address this challenge by exploiting the Bregman divergence, choosing a proper constant or decaying step-size, and utilizing appropriate intermediate constants (see the proofs of Theorems 2-3).

3. The proposed algorithmic framework Prox-DBRO-VR achieves Byzantine resilience without incurring any additional costs, in contrast to decentralized Byzantine-resilient methods [17, 28, 22, 24, 32, 25, 23, 18], which bring (at least) an extra computational cost of $\mathcal{O}\left(mn\right)$ for screening or filtration processes. The theoretical analysis of Prox-DBRO-SAGA and Prox-DBRO-LSVRG introduces no assumption or restriction on the number or proportion of Byzantine agents in the network, but only assumes a connected network among reliable agents, which is less restrictive than the topology condition of many related works, such as [17, 28, 22, 24, 25]. Under this assumption, theoretical results reveal an explicit trade-off between convergence accuracy and Byzantine resilience in both cases (see Theorems 2-3 for details), providing directions to optimize the performance of the proposed algorithms in practice.

-C Organization

The remainder of the paper is organized as follows. Section I presents the basic notation, problem statement, problem reformulation, and setup of its robust variant. The connection of the proposed algorithms with existing methods and the algorithm development are elaborated in Section II. Section III details the convergence results of the proposed algorithms. Section IV carries out case studies on decentralized learning problems under various Byzantine attacks to illustrate the effectiveness and performance of the proposed algorithms. Section V concludes the paper and states our future direction. Some detailed derivations are placed in the Appendix for coherence.

I Preliminaries

I-A Basic Notation

Throughout the paper, all vectors are assumed to be column vectors unless otherwise specified.

TABLE I: Basic notations.

Symbols	Definitions
${\mathbb{R}}$ , ${{\mathbb{R}}^{n}}$ , ${{\mathbb{R}}^{m\times n}}$	the set of real numbers, $n$ -dimensional column real vectors, $m\times n$ real matrices, respectively
${I_{n}}$	the $n\times n$ identity matrix
${0_{n}}$	an $n$ -dimensional column vector with all-zero elements
$1_{m}$	an $m$ -dimensional column vector with all-one elements
${\cdot^{\top}}$	transpose of any matrices or vectors
${\rm{diag}}\left\{\nu\right\}$	a diagonal matrix with all the elements of vector $\nu\in{\mathbb{R}^{n}}$ lying on its main diagonal
$X\leq Y$	each element in $Y-X$ is nonnegative, where $X$ and $Y$ are two vectors or matrices with same dimensions
$\tilde{x}\otimes\tilde{y}$	the Kronecker product of vectors $\tilde{x}$ and $\tilde{y}$
$\left|\cdot\right|$	the absolute value of a scalar or the cardinality of a set
$\left\|\nu\right\|_{a}$	either the $a$-norm of $\nu\in{\mathbb{R}^{n}}$, i.e., $\left(\sum\nolimits_{i=1}^{n}\left|\nu_{i}\right|^{a}\right)^{\frac{1}{a}}$ with $a\geq 1$, or its induced matrix norm
${\lambda_{\min}}\left(X\right)$	the minimum nonzero singular value of any matrix $X$
${\lambda_{\max}}\left(X\right)$	the maximum singular value of any matrix $X$

For any three vectors $\tilde{x},\tilde{y},\tilde{z}\in\mathbb{R}^{n}$, a positive scalar $a$, and a closed, proper, convex function $g:\mathbb{R}^{n}\to\mathbb{R}$, the proximal operator is defined as

$$\mathbf{prox}_{a,g}\left\{\tilde{x}\right\}=\arg\min_{\tilde{y}\in\mathbb{R}^{n}}\left\{g\left(\tilde{y}\right)+\frac{1}{2a}\left\|\tilde{y}-\tilde{x}\right\|_{2}^{2}\right\}.$$

Let $\partial g\left(\tilde{x}\right)$ denote the sub-differential of $g$ at $\tilde{x}$, i.e.,

$$\partial g\left(\tilde{x}\right)=\left\{\tilde{y}\;\middle|\;\forall\tilde{z}\in\mathbb{R}^{n},\;g\left(\tilde{x}\right)+\left\langle\tilde{y},\tilde{z}-\tilde{x}\right\rangle\leq g\left(\tilde{z}\right)\right\},$$

and let $\partial_{\tilde{x}}g\left(\tilde{x}\right)$ denote a sub-gradient of the non-smooth convex function $g$ at $\tilde{x}$. The remaining basic notations of this paper are summarized in Table I.
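As a concrete instance of the proximal operator defined above, consider the choice $g(\tilde{y})=\lambda\|\tilde{y}\|_{1}$. This choice is an assumption made here purely for illustration (it corresponds to the 1-norm regularizer commonly used in sparse learning). For it, the minimizer has the well-known closed form of coordinate-wise soft-thresholding with threshold $a\lambda$. The following minimal Python sketch uses the hypothetical names `prox_l1`, `prox_objective`, and `lam`, none of which come from the paper.

```python
def prox_l1(x, a, lam):
    """Proximal operator of g(y) = lam * ||y||_1 with parameter a > 0.

    Solves argmin_y { lam * ||y||_1 + (1 / (2a)) * ||y - x||_2^2 },
    which decouples per coordinate into soft-thresholding at a * lam.
    """
    t = a * lam
    out = []
    for xi in x:
        if xi > t:
            out.append(xi - t)      # shrink positive entries toward zero
        elif xi < -t:
            out.append(xi + t)      # shrink negative entries toward zero
        else:
            out.append(0.0)         # small entries are set exactly to zero
    return out


def prox_objective(y, x, a, lam):
    """Value of the function that the proximal operator minimizes over y."""
    l1 = sum(abs(v) for v in y)
    sq = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return lam * l1 + sq / (2.0 * a)


x = [3.0, -0.5, 1.0, -2.0]
p = prox_l1(x, a=0.5, lam=2.0)      # threshold a * lam = 1.0
```

Evaluating `prox_objective` at `p` and at other candidate points (for example `x` itself or the zero vector) gives a quick sanity check that `p` attains the smaller value.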

I-B Problem Statement

A group of $m$ agents communicate over an undirected network $\mathcal{G}=\left({\mathcal{R}\cup\mathcal{B},\mathcal{E}}\right)$, where $\mathcal{R}$ ($2\leq\left|\mathcal{R}\right|\leq m$) and $\mathcal{B}$ indicate the sets of reliable and Byzantine agents, respectively, and $\mathcal{E}$ represents the set of undirected communication edges among all agents. The common goal of all reliable agents is to solve the following general decentralized composite finite-sum minimization (min) problem:

$$\min_{\tilde{x}\in\mathbb{R}^{n}}\sum_{i\in\mathcal{R}}{f_{i}\left(\tilde{x}\right)+g\left(\tilde{x}\right)}, \tag{1}$$

where $\tilde{x}$ is the decision variable, and ${f_{i}}:{\mathbb{R}^{n}}\to\mathbb{R}$ and ${g}:{\mathbb{R}^{n}}\to\mathbb{R}$, $i\in\mathcal{R}$, are two different objective functions. The local objective function $f_{i}$ can be further decomposed as $f_{i}\left(\tilde{x}\right)=\sum\nolimits_{l=1}^{q_{i}}f_{i}^{l}\left(\tilde{x}\right)/q_{i}$, while the function $g$ serves as a shared non-smooth objective among all reliable agents, as in [37, 36, 40]. We assume that the optimal solution to (1) exists, denoted by $\tilde{x}^{*}$, and denote the local sample set associated with agent $i$ by $\mathcal{Q}_{i}=\left\{1,2,\ldots,q_{i}\right\}$, $\forall i\in\mathcal{R}$. This paper aims to resolve the general composite finite-sum optimization problem (1) under Byzantine attacks or failures, including but not limited to the zero-sum attack [30], Gaussian attack [25], and same-value attack [26], which will be tested in numerical experiments. In fact, failures, such as agent breakdown and possible disconnection of communication links, can be deemed a category of non-malicious Byzantine problems. We next specify the studied problem via the following standard assumptions.

Assumption 1

(Convexity and smoothness).
a) For $i\in\mathcal{R}$ , the local objective function $f_{i}$ is assumed to be $\mu_{i}$ -strongly convex, and the local component objective function $f_{i}^{l}$ is $L_{i}^{l}$ -smooth, $\forall l\in{\mathcal{Q}_{i}}$ , i.e., $\forall{\tilde{x}},{\tilde{z}}\in{\mathbb{R}^{n}}$ ,


$$\mu_{i}\left\|\tilde{x}-\tilde{z}\right\|_{2}^{2}\leq\left(\nabla f_{i}\left(\tilde{x}\right)-\nabla f_{i}\left(\tilde{z}\right)\right)^{\top}\left(\tilde{x}-\tilde{z}\right), \tag{2a}$$

$$\left\|\nabla f_{i}^{l}\left(\tilde{x}\right)-\nabla f_{i}^{l}\left(\tilde{z}\right)\right\|_{2}\leq L_{i}^{l}\left\|\tilde{x}-\tilde{z}\right\|_{2}, \tag{2b}$$

where ${\mu_{i}}$ and $L_{i}^{l}$ are the strong-convexity and smoothness parameters, respectively.
b) The objective function $g$ is convex and not necessarily smooth.

For convenience, we define $\mu:=\mathop{\min}\nolimits_{i\in\mathcal{R}}\left\{{{\mu_{i}}}\right\}$ and $L:={\max_{i\in\mathcal{R}}}\left\{{{L_{i}}}\right\}$ with ${L_{i}}:=\sum\nolimits_{l=1}^{{q_{i}}}{L_{i}^{l}}/{q_{i}}$ . It can be verified that the global objective function is $L$ -smooth and $\mu$ -strongly convex, with a condition number denoted by ${\kappa_{f}}:=L/\mu$ .
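The definitions of $L_{i}$, $L$, $\mu$, and $\kappa_{f}$ above amount to simple arithmetic on the per-sample constants. A small worked sketch follows; the function name `problem_constants` and all numerical values are hypothetical and chosen only for illustration.

```python
def problem_constants(per_sample_L, mu_local):
    """Return (list of L_i, L, mu, kappa_f) following the definitions in the text."""
    # L_i is the average of the per-sample smoothness constants L_i^l
    L_local = [sum(Ls) / len(Ls) for Ls in per_sample_L]
    L = max(L_local)      # L := max over reliable agents of L_i
    mu = min(mu_local)    # mu := min over reliable agents of mu_i
    return L_local, L, mu, L / mu


# Hypothetical constants for two reliable agents with 2 and 3 local samples
L_local, L, mu, kappa = problem_constants(
    per_sample_L=[[2.0, 4.0], [3.0, 5.0, 7.0]],
    mu_local=[1.0, 0.5],
)
```

With these illustrative values, the agent with the largest average smoothness constant determines $L$, the agent with the weakest strong convexity determines $\mu$, and their ratio is the condition number $\kappa_{f}$.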

Remark 1

Assumption 1-a) is standard in recent literature [15, 13, 12, 14, 10]. According to [44, Chapter 3], we know that $0<\mu\leq L$ , which indicates $\kappa_{f}\geq 1$ . Moreover, in view of (2b), it can be verified that the local objective functions $f_{i}$ , $i\in\mathcal{R}$ , are $L$ -smooth as well. Under Assumption 1, the optimal solution $\tilde{x}^{*}$ to (1) exists uniquely. The consideration of the possibly non-smooth term $g$ is meaningful, which finds substantial applications in various fields, such as the standard 1-norm regularization term in sparse machine learning [38, 36, 37], a non-smooth indicator function in model predictive control [5] to handle equality and set constraints, and a non-smooth indicator function in energy resource coordination [12] to handle inequality and set constraints.

Assumption 2

(Network connectivity) All reliable agents form a static network, denoted as ${{\cal G}_{\cal R}}:=\left({\mathcal{R},{\mathcal{E}_{\mathcal{R}}}}\right)$ , which is bidirectionally connected.

Remark 2

Assumption 2 is standard in recent literature [45, 33, 30, 19]. It implies that each reliable agent has at least one reliable neighbor, in addition to an arbitrary number of Byzantine neighbors, so that it can communicate with any other reliable agent in the network. Many networks satisfy Assumption 2; a straightforward instance is the fully connected network among all agents. From another perspective, if ${{\cal G}_{\cal R}}$ is disconnected, there must be two reliable agents that fail to exchange messages. In this scenario, the disconnected agent is also judged as a Byzantine agent subject to possible failures. Assumption 2 imposes fewer restrictions on the reliable network ${{\cal G}_{\cal R}}$ than the pioneering literature [17], which requires at least $\left|\mathcal{B}\right|+1$ paths between any two reliable agents when there are $\left|\mathcal{B}\right|$ Byzantine agents in the network. Consequently, some networks, for instance the Dumbbell network [31], satisfy Assumption 2 but violate the resilient network assumption made in [17].
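Assumption 2 can be checked for a given topology by restricting the graph to the reliable agents and testing connectivity. The following sketch does this with a breadth-first search; the function name `reliable_subgraph_connected` and the example edge list are hypothetical. Note that this check is for offline illustration only, since the identities of Byzantine agents are not assumed to be known to the algorithm.

```python
from collections import deque


def reliable_subgraph_connected(edges, reliable):
    """Check whether the subgraph induced by the reliable agents is connected."""
    reliable = set(reliable)
    if not reliable:
        return False
    # Keep only the edges whose two endpoints are both reliable
    adj = {v: set() for v in reliable}
    for u, v in edges:
        if u in reliable and v in reliable:
            adj[u].add(v)
            adj[v].add(u)
    # Breadth-first search from an arbitrary reliable agent
    start = next(iter(reliable))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == reliable


# Hypothetical 5-agent undirected network
edges = [(0, 1), (1, 2), (2, 3), (0, 4), (3, 4)]
ok = reliable_subgraph_connected(edges, reliable=[0, 1, 2, 3])   # agent 4 is Byzantine
bad = reliable_subgraph_connected(edges, reliable=[0, 1, 3])     # agents 2 and 4 are Byzantine
```

In the first case the reliable agents form the path 0-1-2-3, so the assumption holds. In the second case agent 3 can only reach the others through Byzantine agents, so the assumption is violated.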

I-C Problem Reformulation

To guarantee that all reliable agents reach a consensus at the optimal solution, we need to reformulate (1) into an equivalent consensus problem. To achieve this goal, a global decision vector $x=\left[x_{1}^{\top},x_{2}^{\top},\ldots,x_{\left|\mathcal{R}\right|}^{\top}\right]^{\top}\in\mathbb{R}^{\left|\mathcal{R}\right|n}$, containing $\left|\mathcal{R}\right|$ local copies of the decision variable $\tilde{x}$, is introduced, subject to (s.t.) the consensus constraint ${x_{i}}={x_{j}}$, $\left({i,j}\right)\in\mathcal{E}_{\mathcal{R}}$. Therefore, it is natural to rewrite (1) as

$$\begin{aligned}\min_{x\in\mathbb{R}^{\left|\mathcal{R}\right|n}}\;&F\left(x\right)+G\left(x\right),\\ \text{s.t.}\;\;&x_{i}=x_{j},\;\left(i,j\right)\in\mathcal{E}_{\mathcal{R}},\end{aligned} \tag{3}$$

where $F\left(x\right):=\sum\nolimits_{i\in\mathcal{R}}{f_{i}\left({{x_{i}}}\right)}$ and $G\left(x\right):=\sum\nolimits_{i\in\mathcal{R}}{g\left({{x_{i}}}\right)}$ .

I-D Resilient Consensus Problem Setup

To enhance the resilience of the consensual aggregation process, we consider a norm-penalized approximation variant, originally proposed in [35], of the consensus problem (3) as follows:

$$x^{*}:=\arg\min_{x}\sum_{i\in\mathcal{R}}\Big(f_{i}\left(x_{i}\right)+g\left(x_{i}\right)+\phi\sum_{j\in\mathcal{R}_{i}}\left\|x_{i}-x_{j}\right\|_{a}\Big), \tag{4}$$

where $a\geq 1$, $\phi$ is the penalty parameter associated with each reliable agent $i$, and ${\mathcal{R}_{i}}$ denotes the set of reliable neighbors of agent $i$, $i\in\mathcal{R}$. The norm penalty provides a resilient replacement of the consensus constraint in the form of a controllable distance between $x_{i}$ and $x_{j}$. The distance is controlled by the penalty parameter $\phi$: a larger $\phi$ yields a smaller gap between $x_{i}$ and $x_{j}$, $\left({i,j}\right)\in{\mathcal{E}_{\mathcal{R}}}$. To a certain extent, (4) can be considered as a relaxation of (1), because the former tolerates the dissimilarity among neighboring agents, for instance, the disagreement between reliable agents and their Byzantine neighbors. We call (4) a soft approximation of (3), which is friendly to the data heterogeneity commonly found in decentralized optimization tasks. The equivalence between the soft approximation problem (4) and the consensus problem (3) with respect to the original problem (1) is proved in Theorem 1.
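To make the role of the penalty term in (4) concrete, the sketch below evaluates the penalized objective on a small reliable network. Everything specific here is an assumption for illustration: the quadratic local losses $f_{i}(x)=\frac{1}{2}\|x-c_{i}\|_{2}^{2}$, the regularizer $g(x)=0.1\|x\|_{1}$, the path topology, and the names `norm_a` and `penalized_objective`.

```python
def norm_a(v, a):
    """The a-norm of a vector v, for a >= 1."""
    return sum(abs(t) ** a for t in v) ** (1.0 / a)


def penalized_objective(x, f, g, neighbors, phi, a=1.0):
    """Objective of (4): sum_i ( f_i(x_i) + g(x_i) + phi * sum_{j in R_i} ||x_i - x_j||_a )."""
    total = 0.0
    for i, xi in enumerate(x):
        total += f[i](xi) + g(xi)
        for j in neighbors[i]:
            diff = [u - w for u, w in zip(xi, x[j])]
            total += phi * norm_a(diff, a)
    return total


# Hypothetical setup: three reliable agents on a path 0 - 1 - 2
centers = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
f = [lambda x, c=c: 0.5 * sum((u - w) ** 2 for u, w in zip(x, c)) for c in centers]
g = lambda x: 0.1 * sum(abs(u) for u in x)
neighbors = {0: [1], 1: [0, 2], 2: [1]}

consensus = [[1.0, 0.0]] * 3          # all local copies agree
disagree = [list(c) for c in centers]  # each agent sits at its own local minimizer of f_i
```

At `consensus` the penalty vanishes, so the value does not depend on $\phi$. At `disagree` each edge is counted once from each endpoint, so the penalty contributes $4\phi$ in this example, which illustrates why a larger $\phi$ pushes the minimizer of (4) toward agreement.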

II Algorithm Development

II-A Connection with Existing Works

Lian et al. in [46] design a decentralized stochastic gradient descent algorithm, namely D-PSGD, to resolve efficiently the transformed problem (3), in an ideal situation. The ideal situation fails to consider the presence of any malfunctioning or malicious agents, which may not be avoided in practical applications [23, 22, 47, 25, 21, 9, 20]. We next find out the reason why D-PSGD cannot be applied directly to solving (3) when there are Byzantine agents in the network, and then seek out a feasible improvement, based on D-PSGD, to maintain Byzantine resilience. We first recap the updates of the generalized D-PSGD as follows:


$$\bar{x}_{i,k}=x_{i,k}-\alpha_{k}\nabla f_{i}\left(x_{i,k}\right), \tag{5a}$$

$$x_{i,k+1}=\sum_{j\in\mathcal{R}_{i}\cup\mathcal{B}_{i}}w_{ij}v_{ij,k}, \tag{5b}$$

where ${\mathcal{B}_{i}}$ denotes the set of Byzantine neighbors of agent $i$, $i\in\mathcal{R}$; $v_{ij,k}:=\begin{cases}\bar{x}_{j,k},&j\in\mathcal{R}_{i},\\ z_{ij,k},&j\in\mathcal{B}_{i},\end{cases}$ where ${z_{ij,k}}$ denotes the untrue or misleading information sent by Byzantine agent $j$, $j\in{\mathcal{B}}_{i}$; $\alpha_{k}$ denotes a constant or decaying step-size; $\nabla f_{i}\left(x_{i,k}\right)$ is the local batch gradient; and ${w_{ij}}$ is the $i$-th row and $j$-th column element of a doubly stochastic weight matrix with $\sum\nolimits_{j\in\mathcal{N}_{i}}w_{ij}=\sum\nolimits_{j\in\mathcal{N}_{i}}w_{ji}=1$. Note that both $\mathcal{R}_{i}$ and $\mathcal{B}_{i}$ exclude agent $i$ itself. If there is a Byzantine agent $b$ with one reliable neighbor $i$, ${{z}_{ib,k}}$ could be untrue or misleading information (depending on whether agent $b$ is out of action or manipulated by adversaries) sent to its reliable neighboring agents at the $k$-th iteration. If agent $b$ is a malicious Byzantine agent, ${x_{i,k+1}}$ could arbitrarily deviate from its true model, since any malicious Byzantine agent is assumed to be omniscient and can learn from the update rules, such that it may send an elaborately falsified message to its reliable neighbors. For instance, Byzantine agent $b$, $b\in{{\mathcal{B}}_{i}}$, can blow $x_{i,k+1}$ up to infinity by continually transmitting a vector with infinite elements to its reliable neighbor $i$. Another example is that Byzantine agent $b$ can deter all reliable agents from achieving consensus at iteration $k$ by sending various values $\tilde{x}_{ib,k}$ to its different reliable neighboring agents $i\in{\mathcal{R}_{b}}$. The main reason for the above-mentioned issues is that the aggregation step (5b) is rather vulnerable to Byzantine problems.
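The vulnerability of the weighted-average aggregation step (5b) can be seen in a few lines. In the sketch below, the weights, the messages, and the function name `aggregate` are all hypothetical; the point is only that a single Byzantine message enters the average linearly, so its magnitude carries over directly to the aggregated model.

```python
def aggregate(weights, messages):
    """Weighted average as in step (5b): sum_j w_ij * v_ij,k, coordinate by coordinate."""
    n = len(messages[0])
    return [sum(w * m[d] for w, m in zip(weights, messages)) for d in range(n)]


# Hypothetical weights for three neighbors (they sum to one)
weights = [0.5, 0.25, 0.25]
reliable_msgs = [[1.0, 2.0], [2.0, 4.0]]

# Third neighbor behaves honestly
clean = aggregate(weights, reliable_msgs + [[1.0, 2.0]])

# Third neighbor is Byzantine and sends an arbitrarily large falsified vector
attacked = aggregate(weights, reliable_msgs + [[1e6, -1e6]])
```

Comparing `clean` with `attacked` shows that the falsified message dominates the result, and scaling the attack vector further would scale the deviation accordingly.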
In fact, similar security threats also prevail in decentralized work [12, 10, 1, 13, 14, 15, 11, 2]. Therefore, the SGD family contains two important extensions, RSA [27] and [26], both of which achieve Byzantine resilience based on a resilient consensus method [35]. [26] is a decentralized extension of RSA [27]. The theoretical analysis of both RSA [27] and [26] is based on a bounded-variance assumption on the local stochastic gradient. With this assumption and the other standard assumptions (see [26] for details), the sequence ${\left\{{{x_{k}}}\right\}_{k\geq 0}}$ generated by the decentralized algorithm proposed in [26] takes a convergent form of

$$\mathbb{E}\left[\left\|x_{k+1}-1_{\left|\mathcal{R}\right|}\otimes\tilde{x}^{*}\right\|_{2}^{2}\right]\leq\left(1-\eta\alpha_{k}\right)\mathbb{E}\left[\left\|x_{k}-1_{\left|\mathcal{R}\right|}\otimes\tilde{x}^{*}\right\|_{2}^{2}\right]+\alpha_{k}^{2}\Delta_{0}+\alpha_{k}\Delta_{1}, \tag{6}$$

where $\eta$ is a positive constant satisfying $0<\eta{\alpha_{k}}<1$, $\Delta_{0}:=\sum\nolimits_{i\in\mathcal{R}}\left(32n\phi^{2}\left|\mathcal{R}_{i}\right|^{2}+4n\phi^{2}\left|\mathcal{B}_{i}\right|^{2}+2\sigma_{i}^{2}\right)$ (${\sigma_{i}}>0$ is the bounded variance yielded by the biased evaluation of the local batch gradients), and $\Delta_{1}:=\left(n\phi^{2}/\gamma\right)\sum\nolimits_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}$. Based on (6), one can establish either a sub-linear convergence rate with a smaller convergence error determined by the number of Byzantine agents, or a faster linear convergence rate with a larger convergence error determined jointly by the number of Byzantine agents and the bounded variance. In fact, this bounded variance ($\sigma_{i}^{2}$) exists commonly in recent literature, such as [31, 32, 26, 14, 30, 25, 23]. Therefore, this paper aims to remove asymptotically this bounded variance in the linear convergence case and eliminate the bounded-variance assumption as well. Inspired by the recent exploration of decentralized VR stochastic gradient algorithms diffusion-AVRG [2], S-DIGing [15], GT-SAGA/GT-SVRG [11], and GT-SARAH [1], which seek the solution to a finite-sum optimization problem under an ideal Byzantine-free situation, we introduce two popular localized variance-reduction techniques, SAGA [42] and LSVRG [43], to remove asymptotically the variance arising in the course of evaluating the local noisy stochastic gradients. These two VR techniques allow us to derive a unified theoretical result on decentralized Byzantine-resilient and proximal-gradient stochastic optimization, which will be given later.
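The two regimes implied by (6) can be illustrated numerically by iterating the bound with equality, i.e., $e_{k+1}=(1-\eta\alpha_{k})e_{k}+\alpha_{k}^{2}\Delta_{0}+\alpha_{k}\Delta_{1}$. With a constant step-size $\alpha$, the fixed point of this recursion is $(\alpha\Delta_{0}+\Delta_{1})/\eta$, reached at a linear rate; with a decaying step-size, the $\alpha_{k}^{2}\Delta_{0}$ term vanishes and the limit is $\Delta_{1}/\eta$, reached only sub-linearly. The sketch below is a worst-case illustration, not the paper's analysis; the function name `iterate_bound` and all constants are hypothetical.

```python
def iterate_bound(e0, step, eta, delta0, delta1, iters):
    """Iterate e_{k+1} = (1 - eta*a_k) * e_k + a_k^2 * delta0 + a_k * delta1."""
    e = e0
    for k in range(iters):
        a = step(k)
        e = (1.0 - eta * a) * e + a * a * delta0 + a * delta1
    return e


# Hypothetical constants chosen so that 0 < eta * alpha_k < 1 always holds
eta, delta0, delta1 = 1.0, 4.0, 0.5

# Constant step-size: fast convergence to the larger error (alpha*delta0 + delta1) / eta
const = iterate_bound(10.0, lambda k: 0.1, eta, delta0, delta1, 2000)

# Decaying step-size: slow convergence to the smaller error delta1 / eta
decay = iterate_bound(10.0, lambda k: 1.0 / (k + 2), eta, delta0, delta1, 20000)
```

In this toy setting the constant-step limit depends on both $\Delta_{0}$ (which contains the variance terms $\sigma_{i}^{2}$) and $\Delta_{1}$, whereas the decaying-step limit depends on $\Delta_{1}$ only, mirroring the trade-off described above.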

II-B A General Algorithmic Framework

Based on the above analysis, we propose a decentralized Byzantine-resilient stochastic-gradient algorithmic framework in Algorithm 1 to resolve (4) in the presence of Byzantine agents. Note that we denote temporarily the local stochastic gradient by $r_{i,k}$ , which will be specified in Step 3 of Algorithms 2-3.

Algorithm 1 Prox-DBRO-VR Framework

0: Each reliable agent $i$, $i\in\mathcal{R}$, initializes with an arbitrary starting point $x_{i,0}\in{\mathbb{R}^{n}}$, a proper constant or decaying step-size $\alpha_{k}>0$, and the proper penalty parameter ${\phi}>0$.

1: for all $k=0,1,2,\ldots$

2: Each reliable agent $i$, $i\in\mathcal{R}$, sends its current local model ${x_{i,k}}$ to its neighbors $j\in{\mathcal{N}_{i}}$ and receives the true information ${x_{j,k}}$ or untrue information ${z_{ij,k}}$ from its neighbors.

3: Each reliable agent $i$, $i\in\mathcal{R}$, evaluates the local stochastic gradient $r_{i,k}$.

4: Each reliable agent $i$, $i\in\mathcal{R}$, updates an intermediate variable according to the resilient local stochastic gradient descent step:

$\displaystyle{\bar{x}_{i,k}}={x_{i,k}}-{\alpha_{k}}r_{i,k}-{\alpha_{k}}{\phi}\sum\limits_{j\in{\mathcal{N}_{i}}}{{\partial_{{x_{i}}}}{{\left\|{{x_{i,k}}-v_{ij,k}}\right\|}_{a}}},$

where

$\displaystyle{v_{ij,k}}:=\begin{cases}{x_{j,k}},&\text{if}\;j\in{\mathcal{R}_{i}},\\ {z_{ij,k}},&\text{if}\;j\in{\mathcal{B}_{i}}.\end{cases}$

5: Each reliable agent $i$, $i\in\mathcal{R}$, updates its current local model according to the local proximal mapping step:

$\displaystyle{x_{i,k+1}}=\arg\mathop{\min}\limits_{\tilde{x}\in{\mathbb{R}^{n}}}\left\{{g\left({\tilde{x}}\right)+\frac{1}{{2{\alpha_{k}}}}\left\|{\tilde{x}-{{\bar{x}}_{i,k}}}\right\|_{2}^{2}}\right\}.$

6: end for
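To make Steps 4-5 concrete, the following is a minimal sketch of one update of a single reliable agent. It rests on two illustrative assumptions that Algorithm 1 does not fix: the penalty norm is taken with $a=1$, so that the element-wise sign is a valid sub-gradient of ${\left\|{{x_{i,k}}-v_{ij,k}}\right\|_{1}}$, and $g\left({\tilde{x}}\right)=\beta{\left\|{\tilde{x}}\right\|_{1}}$, so that the proximal mapping reduces to soft-thresholding. The function names are hypothetical.

```python
def sign(t):
    """One element of the sub-differential of |t| (0 is a valid choice at t = 0)."""
    return (t > 0) - (t < 0)


def soft_threshold(x, tau):
    """Proximal mapping of tau * ||.||_1, applied element-wise."""
    return [sign(v) * max(abs(v) - tau, 0.0) for v in x]


def prox_dbro_step(x_i, received, r_i, alpha, phi, beta):
    """One update of a reliable agent (Steps 4-5 of Algorithm 1).

    x_i      : current local model x_{i,k}
    received : messages v_{ij,k} from all neighbors (true or Byzantine)
    r_i      : local stochastic gradient r_{i,k}
    Assumes a = 1 and g = beta * ||.||_1 (illustrative choices only).
    """
    n = len(x_i)
    # Sub-gradient of sum_j ||x_i - v_ij||_1 with respect to x_i.
    penalty = [sum(sign(x_i[d] - v[d]) for v in received) for d in range(n)]
    # Step 4: resilient local stochastic gradient descent step.
    x_bar = [x_i[d] - alpha * r_i[d] - alpha * phi * penalty[d] for d in range(n)]
    # Step 5: proximal mapping of g with parameter alpha.
    return soft_threshold(x_bar, alpha * beta)


x_next = prox_dbro_step(
    x_i=[1.0, -2.0],
    received=[[0.5, -2.0], [1000.0, 1000.0]],  # the second message is an outlier
    r_i=[0.2, -0.4],
    alpha=0.1, phi=0.5, beta=1.0,
)
```

Under the $a=1$ assumption, a received message enters the update only through a sign, so its influence on each coordinate is bounded by ${\alpha_{k}}\phi$ regardless of its magnitude.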

Remark 3

The Byzantine resilience of Prox-DBRO-VR is attained by adopting the resilient consensus aggregation based on total variation, which was initially studied in [35]. The works [27, 33] extend this strategy to distributed federated learning problems, and [26] studies it in a decentralized manner. However, all these works [27, 26, 33] not only rely on a bounded-variance assumption in their theoretical analysis, but also establish only slower sub-linear convergence rates. Thus, the motivation for designing Prox-DBRO-VR is to achieve linear convergence independent of the bounded-variance assumption, which can be attained with the aid of VR techniques.

II-C Prox-DBRO-SAGA and Prox-DBRO-LSVRG

We introduce the localized versions of two popular centralized VR techniques, SAGA [42] and LSVRG [43], into Prox-DBRO-VR to develop Prox-DBRO-SAGA and Prox-DBRO-LSVRG. The detailed updates of Prox-DBRO-SAGA and Prox-DBRO-LSVRG are presented in Algorithms 2-3, respectively.

Algorithm 2 Prox-DBRO-SAGA

0: Each reliable agent $i$, $i\in\mathcal{R}$, initializes the same parameters and starting points according to Algorithm 1, and auxiliary variables $u_{i,1}^{l}=u_{i,0}^{l}={x_{i,0}},\forall l\in\mathcal{Q}_{i}$, together with gradient tables $\left\{{\nabla f_{i}^{l}\left({u_{i,0}^{l}}\right)}\right\}_{l=1}^{{q_{i}}}$.

1: for all $k=0,1,2,\ldots$

2: Each reliable agent $i$, $i\in\mathcal{R}$, exchanges information according to Step 2 in Algorithm 1.

3: Each reliable agent $i$, $i\in\mathcal{R}$, selects uniformly a random sample with index $s_{i,k}$ from the set $\mathcal{Q}_{i}$ and evaluates the local stochastic gradient

$\displaystyle{r_{i,k}^{u}}=\nabla f_{i}^{{s_{i,k}}}\left({{x_{i,k}}}\right)-\nabla f_{i}^{{s_{i,k}}}\left({u_{i,k}^{{s_{i,k}}}}\right)+\frac{1}{{{q_{i}}}}\sum\limits_{l=1}^{{q_{i}}}{\nabla f_{i}^{l}\left({u_{i,k}^{l}}\right)}.$

4: Each reliable agent $i$, $i\in\mathcal{R}$, takes $u_{i,k+1}^{s_{i,k}}={x_{i,k}}$ and stores $\nabla f_{i}^{{s_{i,k}}}\left({u_{i,k+1}^{s_{i,k}}}\right)=\nabla f_{i}^{{s_{i,k}}}\left({{x_{i,k}}}\right)$ in the corresponding position of the gradient table, while keeping $\nabla f_{i}^{l}\left({u_{i,k+1}^{l}}\right)=\nabla f_{i}^{l}\left({u_{i,k}^{l}}\right)$ for all $l\in{\mathcal{Q}_{i}}\backslash\left\{{{s_{i,k}}}\right\}$.

5: Each reliable agent $i$, $i\in\mathcal{R}$, updates its current model according to Steps 4-5 in Algorithm 1.

6: end for
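Steps 3-4 of Algorithm 2 can be sketched for one reliable agent as follows. This is an illustrative sketch rather than a reference implementation: the class name, the callable interface `grad_fns[l](x)`, and the choice to maintain a running average of the table (instead of re-summing it at every iteration) are assumptions made here.

```python
import random


class SagaEstimator:
    """Local SAGA gradient estimator of one reliable agent (hypothetical helper)."""

    def __init__(self, grad_fns, x0):
        # grad_fns[l](x) returns the gradient of the l-th local component at x.
        self.grad_fns = grad_fns
        # Gradient table evaluated at u^l_{i,0} = x_{i,0}: q_i vectors of length n.
        self.table = [g(x0) for g in grad_fns]
        q = len(grad_fns)
        self.avg = [sum(row[d] for row in self.table) / q for d in range(len(x0))]

    def step(self, x, rng=random):
        q = len(self.grad_fns)
        s = rng.randrange(q)  # uniformly sampled index s_{i,k}
        g_new = self.grad_fns[s](x)
        g_old = self.table[s]
        # Step 3: r = grad f^s(x) - grad f^s(u^s) + average of the table.
        r = [gn - go + a for gn, go, a in zip(g_new, g_old, self.avg)]
        # Step 4: overwrite the s-th table entry and keep the average in sync.
        self.avg = [a + (gn - go) / q for a, gn, go in zip(self.avg, g_new, g_old)]
        self.table[s] = g_new
        return r


# Toy components f^l(x) = 0.5 * (x - c_l)^2 with gradients x - c_l.
cs = [1.0, 2.0, 3.0]
grads = [lambda x, c=c: [x[0] - c] for c in cs]
est = SagaEstimator(grads, [0.0])
r0 = est.step([0.0], random.Random(0))
```

At the initial point the estimator coincides with the local batch gradient, since the table was built at the same point; afterwards only one component gradient is evaluated per iteration, at the price of storing the whole table.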

Algorithm 3 Prox-DBRO-LSVRG

0: Each reliable agent $i$, $i\in\mathcal{R}$, initializes the same parameters and starting points according to Algorithm 1 and an auxiliary variable ${w_{i,0}}={x_{i,0}}$.

1: for all $k=0,1,2,\ldots$

2: Each reliable agent $i$, $i\in\mathcal{R}$, exchanges information according to Step 2 in Algorithm 1.

3: Each reliable agent $i$, $i\in\mathcal{R}$, selects uniformly a random sample with index $s_{i,k}$ from the set $\mathcal{Q}_{i}$ and evaluates the local stochastic gradient

$\displaystyle{r_{i,k}^{w}}=\nabla f_{i}^{{s_{i,k}}}\left({{x_{i,k}}}\right)-\nabla f_{i}^{{s_{i,k}}}\left({{w_{i,k}}}\right)+\frac{1}{{{q_{i}}}}\sum\limits_{l=1}^{{q_{i}}}{\nabla f_{i}^{l}\left({{w_{i,k}}}\right)}.$

4: Each reliable agent $i$, $i\in\mathcal{R}$, takes $w_{i,k+1}={x_{i,k}}$ with a heterogeneous triggering probability $p_{i}$ and keeps $w_{i,k+1}={w_{i,k}}$ with the probability $1-p_{i}$.

5: Each reliable agent $i$, $i\in\mathcal{R}$, updates its current model according to Steps 4-5 of Algorithm 1.

6: end for
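Analogously, Steps 3-4 of Algorithm 3 may be sketched as below. The class name and interface are hypothetical, and caching the batch gradient at the snapshot $w_{i,k}$ between triggers is an implementation choice assumed here; it is not prescribed by Algorithm 3.

```python
import random


class LsvrgEstimator:
    """Local loopless-SVRG gradient estimator of one reliable agent (hypothetical helper)."""

    def __init__(self, grad_fns, x0, p):
        self.grad_fns = grad_fns  # grad_fns[l](x): gradient of the l-th component
        self.p = p                # triggering probability p_i
        self.w = list(x0)         # snapshot w_{i,0} = x_{i,0}
        self.full_grad_w = self._full_grad(self.w)

    def _full_grad(self, x):
        q = len(self.grad_fns)
        grads = [g(x) for g in self.grad_fns]
        return [sum(gr[d] for gr in grads) / q for d in range(len(x))]

    def step(self, x, rng=random):
        s = rng.randrange(len(self.grad_fns))  # uniformly sampled index s_{i,k}
        g_x = self.grad_fns[s](x)
        g_w = self.grad_fns[s](self.w)
        # Step 3: r = grad f^s(x) - grad f^s(w) + batch gradient at w.
        r = [a - b + c for a, b, c in zip(g_x, g_w, self.full_grad_w)]
        # Step 4: refresh the snapshot with probability p, otherwise keep it.
        if rng.random() < self.p:
            self.w = list(x)
            self.full_grad_w = self._full_grad(self.w)
        return r


# Toy components f^l(x) = 0.5 * (x - c_l)^2 with gradients x - c_l.
lsvrg_grads = [lambda x, c=c: [x[0] - c] for c in (1.0, 2.0, 3.0)]
lsvrg = LsvrgEstimator(lsvrg_grads, [0.0], p=1.0 / 3.0)
r_w0 = lsvrg.step([0.0], random.Random(0))
```

In contrast to the SAGA sketch, no per-sample table is kept; the extra work is the second component gradient at the snapshot and the occasional batch gradient.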

Remark 4

All steps in Algorithms 1-3 are executed in parallel among all reliable agents since they are honest and hence comply with these update rules. It is also worthwhile to mention that the expected cost in evaluating the local stochastic gradient under Prox-DBRO-LSVRG is at least double that of Prox-DBRO-SAGA at every iteration. However, this computational advantage of Prox-DBRO-SAGA is at the expense of an expensive storage cost of ${\cal O}\left({n{q_{i}}}\right)$ for each agent $i$ owing to the employment of the gradient table, while Prox-DBRO-LSVRG does not incur extra storage to save the local batch gradients. Therefore, adopting either Prox-DBRO-SAGA or Prox-DBRO-LSVRG in practice involves a trade-off between per-iteration computational cost and storage. Users may improve and implement Prox-DBRO-VR via incorporating other categories of VR techniques [48] based on their customized needs.
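The trade-off in Remark 4 can be illustrated with a rough per-agent cost model. The counts below are an assumption made for illustration (one component gradient per iteration for SAGA; two component gradients plus a batch gradient with probability $p_{i}$ for LSVRG) and are not taken from the analysis of this paper; the function name is hypothetical.

```python
def per_iteration_cost(q_i, n, p_i):
    """Rough cost model for one reliable agent with q_i samples and model size n."""
    return {
        # SAGA: one new component gradient per iteration.
        "saga_gradient_evals": 1.0,
        # LSVRG: two component gradients, plus q_i more when the snapshot is refreshed.
        "lsvrg_expected_gradient_evals": 2.0 + p_i * q_i,
        # SAGA keeps one length-n gradient per local sample.
        "saga_table_floats": n * q_i,
    }


# Example sizes: 60000 samples split evenly over 30 agents, n = 10 * 784 parameters.
cost = per_iteration_cost(q_i=2000, n=7840, p_i=1.0 / 2000)
```

With $p_{i}=1/q_{i}$, the expected LSVRG count is three component gradients per iteration under this model, while the SAGA table holds $n{q_{i}}$ numbers per agent.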

III Convergence Analysis

For notational simplicity, we denote by ${{\mathcal{F}_{k}}}$ the filtration of the history of the dynamical system generated by the sequence $\left\{{s_{i,k}}\right\}_{k\geq 0}^{i=1,2,\ldots,\left|\mathcal{R}\right|}$, and the conditional expectation $\mathbb{E}\left[{\cdot|{\mathcal{F}_{k}}}\right]$ is abbreviated as ${\mathbb{E}_{k}}\left[\cdot\right]$ in the subsequent analysis. Let ${x_{k}}:={\left[{x_{1,k}^{\top},\ldots,x_{\left|\mathcal{R}\right|,k}^{\top}}\right]^{\top}}\in{\mathbb{R}^{\left|\mathcal{R}\right|n}}$ and ${r_{k}}:={\left[{r_{1,k}^{\top},\ldots,r_{\left|\mathcal{R}\right|,k}^{\top}}\right]^{\top}}\in{\mathbb{R}^{\left|\mathcal{R}\right|n}}$. To facilitate the subsequent analysis, we define

	$\displaystyle\nabla F\left({{x_{k}}}\right):=$	$\displaystyle{\left[{\nabla{f_{1}}{{\left({{x_{1,k}}}\right)}^{\top}},\ldots,\nabla{f_{\left|\mathcal{R}\right|}}{{\left({{x_{\left|\mathcal{R}\right|,k}}}\right)}^{\top}}}\right]^{\top}}\in{\mathbb{R}^{\left|\mathcal{R}\right|n}},$
	$\displaystyle{\partial_{x}}G\left({{x_{k}}}\right):=$	$\displaystyle\left[{\partial_{{x_{1}}}}g{{\left({{x_{1,k}}}\right)}^{\top}},\ldots,{{\partial_{{x_{\left|\mathcal{R}\right|}}}}g{{\left({{x_{\left|\mathcal{R}\right|,k}}}\right)}^{\top}}}\right]^{\top}\in{\mathbb{R}^{\left|\mathcal{R}\right|n}},$
	$\displaystyle\chi\left({{x_{k}}}\right):=$	$\displaystyle{\phi}\sum\nolimits_{i\in\mathcal{R}}{\sum\nolimits_{j\in{\mathcal{R}_{i}}}{{{\left\|{{x_{i,k}}-{x_{j,k}}}\right\|}_{a}}}}\in{\mathbb{R}},$
	$\displaystyle\delta\left({{x_{k}}}\right):=$	$\displaystyle{\phi}\sum\nolimits_{i\in\mathcal{R}}{\sum\nolimits_{j\in{\mathcal{B}_{i}}}{{{\left\|{{x_{i,k}}-{z_{ij,k}}}\right\|}_{a}}}}\in{\mathbb{R}}.$

Based on these definitions, we next provide briefly a compact form of Prox-DBRO-VR for the subsequent convergence analysis as follows:


		$\displaystyle{\bar{x}_{k}}={x_{k}}-{\alpha_{k}}\left({r_{k}}+{\partial_{x}}% \chi\left({{x_{k}}}\right){+{\partial_{x}}\delta\left({{x_{k}}}\right)}\right),$		(7a)
		$\displaystyle{x_{k+1}}={\mathbf{prox}}{{}_{{\alpha_{k}},G}}\left\{{{{\bar{x}}_% {k}}}\right\}.$		(7b)

III-A Auxiliary Results

Inspired by the unified analysis framework for centralized stochastic gradient descent methods in [48], we introduce the following two lemmas. To begin with, we define respectively two sequences for Prox-DBRO-SAGA and Prox-DBRO-LSVRG in the following. For Prox-DBRO-SAGA, we define

t_{i,k}^{u}:=\frac{1}{{{q_{i}}}}\sum\limits_{l=1}^{{q_{i}}}{{f_{i}^{l}\left({u% _{i,k}^{l}}\right)-f_{i}^{l}\left({{{\tilde{x}}^{*}}}\right)-\nabla f_{i}^{l}{% {\left({{{\tilde{x}}^{*}}}\right)}^{\top}}\left({u_{i,k}^{l}-{{\tilde{x}}^{*}}% }\right)}}.

For Prox-DBRO-LSVRG, we define

t_{i,k}^{w}:=\frac{1}{{{q_{i}}}}\sum\limits_{l=1}^{{q_{i}}}{{f_{i}^{l}\left({{% w_{i,k}}}\right)-f_{i}^{l}\left({{{\tilde{x}}^{*}}}\right)-\nabla f_{i}^{l}{{% \left({{{\tilde{x}}^{*}}}\right)}^{\top}}\left({{w_{i,k}}-{{\tilde{x}}^{*}}}% \right)}}.

Note that both sequences ${\left\{{t_{i,k}^{u}}\right\}_{i\in\mathcal{R},k\geq 0}}$ and ${\left\{{t_{i,k}^{w}}\right\}_{i\in\mathcal{R},k\geq 0}}$ are non-negative according to the convexity of the local component function ${f_{i}^{l}}$, $l\in\mathcal{Q}_{i}$. For the subsequent analysis, we define respectively the gradient-learning quantities $t_{k}^{u}:=\sum\nolimits_{i\in\mathcal{R}}{t_{i,k}^{u}}$ and $t_{k}^{w}:=\sum\nolimits_{i\in\mathcal{R}}{t_{i,k}^{w}}$ for Prox-DBRO-SAGA and Prox-DBRO-LSVRG, the smallest and largest numbers of local samples ${q_{\min}}:={\min_{i\in\mathcal{R}}}{q_{i}}$ and ${q_{\max}}:={\max_{i\in\mathcal{R}}}{q_{i}}$, the minimum and maximum triggering probabilities ${p_{\min}}:={\min_{i\in\mathcal{R}}}{p_{i}}$ and ${p_{\max}}:={\max_{i\in\mathcal{R}}}{p_{i}}$, and the ratio ${\kappa_{q}}:={q_{\max}}/{q_{\min}}\geq 1$.

Lemma 1

(Gradient-learning sequences) Since $t_{k}^{u}$ and $t_{k}^{w}$ are non-negative according to the convexity of the local component function ${f_{i}^{l}}$ , $l\in\mathcal{Q}_{i}$ under Assumption 1, $\forall k\geq 0$ , we have for Prox-DBRO-SAGA,

{\mathbb{E}_{k}}\left[{{t_{k+1}^{u}}}\right]\leq\left({1-\frac{1}{{{q_{\max}}}% }}\right)t_{k}^{u}+\frac{{D_{F}}\left({{x_{k}},{x^{*}}}\right)}{{{q_{\min}}}},

(8)

and for Prox-DBRO-LSVRG,

\displaystyle{\mathbb{E}_{k}}\left[{t_{k+1}^{w}}\right]\leq\left({1-{p_{\min}}% }\right)t_{k}^{w}+{p_{\max}}{D_{F}}\left({{x_{k}},{x^{*}}}\right),

(9)

where ${D_{F}}\left({{x_{k}},{x^{*}}}\right):=F\left({{x_{k}}}\right)-F\left({{x^{*}}% }\right)-\nabla F{\left({{x^{*}}}\right)^{\top}}\left({{x_{k}}-{x^{*}}}\right)$ is known as the Bregman divergence with respect to the convex cost function $F$ .

Proof 1

See Appendix -A.

We next seek the upper bound of the distance between the local stochastic gradient estimator $r_{k}$ and gradient $\nabla F\left({{x^{*}}}\right)$ at the optimal solution for both Prox-DBRO-SAGA and Prox-DBRO-LSVRG.

Lemma 2

(Gradient-learning errors) Suppose that Assumptions 1-2 hold. For $k\geq 0$ , we have for Prox-DBRO-SAGA,

{\mathbb{E}_{k}}\left[{{{\left\|{{r_{k}^{u}}-\nabla F\left({{x^{*}}}\right)}% \right\|}_{2}^{2}}}\right]\leq 4Lt_{k}^{u}+2\left({2L-\mu}\right){D_{F}}\left(% {{x_{k}},{x^{*}}}\right),

(10)

and for Prox-DBRO-LSVRG,

{\mathbb{E}_{k}}\left[{{{\left\|{{r_{k}^{w}}-\nabla F\left({{x^{*}}}\right)}% \right\|}_{2}^{2}}}\right]\leq 4Lt_{k}^{w}+2\left({2L-\mu}\right){D_{F}}\left(% {{x_{k}},{x^{*}}}\right),

(11)

where $r_{k}^{u}:={\left[{{{\left({r_{1,k}^{u}}\right)}^{\top}},{{\left({r_{2,k}^{u}}% \right)}^{\top}},\ldots,{{\left({r_{\left|\mathcal{R}\right|,k}^{u}}\right)}^{% \top}}}\right]^{\top}}\in{\mathbb{R}^{\left|\mathcal{R}\right|n}}$ and $r_{k}^{w}:={\left[{{{\left({r_{1,k}^{w}}\right)}^{\top}},{{\left({r_{2,k}^{w}}% \right)}^{\top}},\ldots,{{\left({r_{\left|\mathcal{R}\right|,k}^{w}}\right)}^{% \top}}}\right]^{\top}}\in{\mathbb{R}^{\left|\mathcal{R}\right|n}}$ .

Proof 2

See Appendix -B.

The following proposition is an important result for the analysis of arbitrary norm approximation.

Proposition 1

Consider two constants $a_{1}\geq 1$ and $a_{2}>0$ , such that $1/{a_{1}}+1/{a_{2}}=1$ . For an arbitrary vector $\tilde{x}\in{\mathbb{R}^{n}}$ , we denote the sub-differential ${\partial}{\left\|{\tilde{x}}\right\|_{a_{1}}}=\left\{{\tilde{z}\in{\mathbb{R}% ^{n}}:\left\langle{\tilde{z},\tilde{x}}\right\rangle={{\left\|{\tilde{x}}% \right\|}_{a_{1}}},{{\left\|{\tilde{z}}\right\|}_{a_{2}}}\leq 1}\right\}$ .

Proof 3

We refer interested readers to the supplementary document of [27] for the proof of Proposition 1.
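As a quick numerical illustration of Proposition 1, the sketch below checks the two defining conditions, $\left\langle{\tilde{z},\tilde{x}}\right\rangle={\left\|{\tilde{x}}\right\|_{a_{1}}}$ and ${\left\|{\tilde{z}}\right\|_{a_{2}}}\leq 1$, for two familiar choices: $a_{1}=a_{2}=2$, and $a_{1}=1$ paired with the limiting exponent $a_{2}=\infty$. Treating $a_{2}=\infty$ as admissible is an assumption made here for illustration; the helper names are hypothetical.

```python
import math


def norm(x, a):
    """The a-norm of a list of floats; a may be math.inf."""
    if math.isinf(a):
        return max(abs(v) for v in x)
    return sum(abs(v) ** a for v in x) ** (1.0 / a)


def inner(z, x):
    return sum(zi * xi for zi, xi in zip(z, x))


x = [3.0, -4.0, 0.0]

# a_1 = a_2 = 2: at x != 0 the sub-gradient is x / ||x||_2.
z2 = [v / norm(x, 2) for v in x]

# a_1 = 1 with the limiting exponent a_2 = inf: sign(x) is one sub-gradient.
z1 = [float((v > 0) - (v < 0)) for v in x]

checks = {
    "a1=2": (inner(z2, x), norm(x, 2), norm(z2, 2)),
    "a1=1": (inner(z1, x), norm(x, 1), norm(z1, math.inf)),
}
```

In each tuple the first two entries should agree and the third should not exceed one.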

Proposition 2

Recalling the definition of ${\mathbf{prox}}_{\alpha,g}\left\{{x_{i}}\right\}$ , we know that ${\left[{\mathbf{prox}}_{\alpha,G}\left\{x\right\}\right]}_{i}={\mathbf{prox}}_% {\alpha,g}\left\{{x_{i}}\right\}$ , $\forall i\in\mathcal{R}$ , and

{\left\|{{\mathbf{prox}}_{\alpha,G}\left\{x\right\}-{\mathbf{prox}}_{\alpha,G}% \left\{y\right\}}\right\|_{2}}\leq{\left\|{x-y}\right\|_{2}},

(12)

where $x=\left[{x_{1}^{\top},x_{2}^{\top},\ldots,x_{\left|\mathcal{R}\right|}^{\top}}% \right]^{\top}\in{\mathbb{R}^{\left|\mathcal{R}\right|n}}$ and $y=\left[{y_{1}^{\top},y_{2}^{\top},\ldots,y_{\left|\mathcal{R}\right|}^{\top}}% \right]^{\top}\in{\mathbb{R}^{\left|\mathcal{R}\right|n}}$ .

Proof 4

See Appendix -C.
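The non-expansiveness property (12) can be probed numerically. The sketch below assumes $g=\beta{\left\|\cdot\right\|_{1}}$, for which the proximal mapping is soft-thresholding and the stacked mapping ${\mathbf{prox}}_{\alpha,G}$ acts coordinate-wise; it is a sanity check on random inputs, not a proof, and the helper names are hypothetical.

```python
import math
import random


def prox_l1(x, alpha, beta=1.0):
    """Proximal mapping of alpha * beta * ||.||_1 (soft-thresholding)."""
    tau = alpha * beta
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in x]


def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    # Stacked vectors standing in for x and y in (12).
    x = [rng.uniform(-5, 5) for _ in range(6)]
    y = [rng.uniform(-5, 5) for _ in range(6)]
    ratio = dist(prox_l1(x, 0.7), prox_l1(y, 0.7)) / dist(x, y)
    worst_ratio = max(worst_ratio, ratio)
```

The largest observed ratio stays at or below one, in agreement with (12).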

III-B Main Results

We next derive a feasible range for the penalty parameter to enable the equivalence between the decentralized consensus optimization problem (3) and norm-penalized approximation problem (4) as follows, which further guarantees the equivalence between the original optimization problem (1) and norm-penalized approximation problem (4).

Theorem 1

(Resilient consensus condition) Suppose that Assumptions 1 and 2 hold. Given $g^{\prime}\left({{{\tilde{x}}^{*}}}\right)\in{\partial_{\tilde{x}}}{g}\left({{{\tilde{x}}^{*}}}\right)$, if the penalty parameter satisfies $\phi\geq{\phi_{\min}}:={\left|\mathcal{R}\right|^{\frac{3}{2}}}\sqrt{\left|{{\mathcal{E}_{\mathcal{R}}}}\right|}\mathop{\max}\nolimits_{i\in\mathcal{R}}{\left\|{\nabla{f_{i}}\left({{{\tilde{x}}^{*}}}\right)}+{g^{\prime}\left({{{\tilde{x}}^{*}}}\right)}\right\|_{\infty}}/{\lambda_{\min}}\left(\Pi\right)$, then the optimal solution to the original optimization problem (1) is equivalent to the globally optimal solution to the norm-penalized approximation problem (4), i.e., ${x^{*}}={1_{\left|\mathcal{R}\right|}}\otimes{{\tilde{x}}^{*}}$.

Proof 5

See Appendix -D.

Remark 5

Theorem 1 demonstrates that selecting a sufficiently large penalty parameter guarantees the equivalence between the original optimization problem (1) and the norm-penalized approximation problem (4). However, the subsequent convergence results show that a larger $\phi$ causes a larger convergence error. Therefore, the notion of a sufficiently large penalty parameter is tailored for theoretical results, and one can hand-tune this parameter to obtain better algorithm performance in practice.

For simplicity, we fix the minimum and maximum heterogeneous triggering probabilities as ${p_{\min}}=1/{q_{\max}}$ and $p_{\max}=1/{q_{\min}}$, respectively. The following analysis considers ${r_{k}}:=\left\{\begin{gathered}r_{k}^{u},{\text{for {Prox-DBRO-SAGA}}}\hfill\\ r_{k}^{w},{\text{for {Prox-DBRO-LSVRG}}}\hfill\\ \end{gathered}\right.$ and ${t_{k}}:=\left\{\begin{gathered}t_{k}^{u},\text{for {Prox-DBRO-SAGA}}\hfill\\ t_{k}^{w},\text{for {Prox-DBRO-LSVRG}}\hfill\\ \end{gathered}\right.$, such that the theoretical results for both Prox-DBRO-SAGA (Algorithm 2) and Prox-DBRO-LSVRG (Algorithm 3) can be unified in a general framework. Before deriving a linear convergence rate for Algorithms 2-3, we first define the following parameters: $\gamma:=\mu L/\left({\mu+L}\right)$, $P_{1}^{c}:=16n{\phi^{2}}\sum\nolimits_{i\in{\cal R}}{{{\left|{{{\cal R}_{i}}}\right|}^{2}}}+4n{\phi^{2}}\sum\nolimits_{i\in{\cal R}}{{{\left|{{{\cal B}_{i}}}\right|}^{2}}}$, $P_{2}:=n{\phi^{2}}\sum\nolimits_{i\in{\cal R}}{{{\left|{{{\cal B}_{i}}}\right|}^{2}}}/\gamma$, and $E:=4{P_{2}}/\gamma$.

Theorem 2

(Linear convergence). Suppose that Assumptions 1-2 hold. Under the condition of Theorem 1, if the constant step-size meets $0<{\alpha_{k}}\equiv\alpha\leq 1/\left({{\kappa_{q}}\left({32{{\left({1+{\kappa_{f}}}\right)}^{2}}+q_{\min}}\right)\mu}\right)$, then the sequence ${\left\{{{x_{k}}}\right\}_{k\geq 0}}$ generated by Algorithms 2-3 converges to an error ball around the optimal solution to the original optimization problem (1) at a linear rate of ${\left({1-\mathcal{O}\left({\gamma\alpha}\right)}\right)^{k}}$, i.e.,

		$\displaystyle\mathbb{E}\left[{\left\|{{x_{k}}-{1_{\left|\mathcal{R}\right|}}\otimes{{\tilde{x}}^{*}}}\right\|_{2}^{2}}\right]$
	$\displaystyle\leq$	$\displaystyle{\left({1-\frac{\gamma}{4}\alpha}\right)^{k}}{U_{0}}+4\left({\frac{{{P_{1}^{c}}}}{\gamma}\alpha+E}\right)\left({1-{{\left({1-\frac{\gamma}{4}\alpha}\right)}^{k}}}\right),$		(13)

where ${U_{0}}=\left\|{{x_{0}}-{x^{*}}}\right\|_{2}^{2}+{{q_{\min}}}\gamma\alpha{t_{0% }}/\left({{{q_{\max}}}L}\right)$ , and the radius of the error ball is no more than $4\left({{P_{1}^{c}}\alpha/\gamma+E}\right)$ .

Proof 6

See Appendix -E.
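To see how the right-hand side of (13) behaves, one may evaluate it directly. In the sketch below the numerical values of $\gamma$, $\alpha$, $U_{0}$, $P_{1}^{c}$, and $E$ are purely illustrative placeholders and do not come from the experiments; the function names are hypothetical.

```python
def error_ball_radius(alpha, gamma, P1c, E):
    """Radius 4 * (P1c * alpha / gamma + E) from Theorem 2."""
    return 4.0 * (P1c * alpha / gamma + E)


def linear_bound(k, alpha, gamma, U0, P1c, E):
    """Right-hand side of (13) at iteration k."""
    rho = 1.0 - gamma * alpha / 4.0  # contraction factor
    radius = error_ball_radius(alpha, gamma, P1c, E)
    return rho ** k * U0 + radius * (1.0 - rho ** k)


# Illustrative placeholder constants (not taken from the paper).
params = dict(alpha=0.1, gamma=0.5, U0=10.0, P1c=2.0, E=0.05)
bounds = [linear_bound(k, **params) for k in (0, 100, 1000, 5000)]
radius = error_ball_radius(params["alpha"], params["gamma"], params["P1c"], params["E"])
```

The bound starts at $U_{0}$ for $k=0$ and approaches the radius geometrically; halving $\alpha$ shrinks the $P_{1}^{c}\alpha/\gamma$ part of the radius but also moves the contraction factor closer to one.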

We continue to establish the sub-linear convergence rate of Algorithms 2-3 with the aid of the following bounded-gradient assumption on the non-smooth objective function $g$ , which is standard in literature [49, 50, 51].

Assumption 3

(Bounded gradients). The sub-gradient $\partial_{\tilde{x}}g\left({\tilde{x}}\right)$ at any point ${\tilde{x}}\in\mathbb{R}^{n}$ is $\hat{G}$ -bounded, i.e., $\left\|{{\partial_{\tilde{x}}}g\left({\tilde{x}}\right)}\right\|_{2}^{2}\leq% \hat{G}$ , where $\hat{G}$ can be an arbitrarily large but finite positive constant.

To proceed, we define $\theta>4/\gamma$ , $P_{1}^{d}:=16{\left|\mathcal{R}\right|}\hat{G}{\rm{+}}16n{\phi^{2}}\sum% \nolimits_{i\in{\cal R}}{{{\left|{{{\cal R}_{i}}}\right|}^{2}}}{\rm{+4}}n{\phi% ^{2}}\sum\nolimits_{i\in{\cal R}}{{{\left|{{{\cal B}_{i}}}\right|}^{2}}}$ , $\Xi:=\max\left\{{\frac{{{\theta^{2}}P_{1}^{d}}}{{\gamma\theta-1}},\left({\xi-% \frac{\gamma}{4}\theta}\right){\left\|{{x_{0}}-{x^{*}}}\right\|_{2}^{2}}+\frac% {{{\theta^{2}}}}{\xi}P_{1}^{d}+\theta{P_{2}}-\xi E}\right\}$ , and $\xi:={\kappa_{q}}\left({64{{\left({1+{\kappa_{f}}}\right)}^{2}}+{q_{\min}}}% \right)\mu\theta$ .

Theorem 3

(Sub-linear Convergence). Suppose that Assumptions 1-3 hold. Under the condition of Theorem 1, if the decaying step-size is chosen as ${\alpha_{k}}=\theta/\left({k+\xi}\right)$ , then the sequence ${\left\{{{x_{k}}}\right\}_{k\geq 0}}$ generated by Algorithms 2-3 converges to an error ball around the optimal solution to the original optimization problem (1), at a sub-linear rate of $\mathcal{O}\left({1/k}\right)$ , i.e.,

\displaystyle\mathbb{E}\left[{\left\|{{x_{k}}-{1_{\left|\mathcal{R}\right|}}% \otimes{{\tilde{x}}^{*}}}\right\|_{2}^{2}}\right]\leq\frac{\Xi}{{k+\xi}}+E,% \forall k\geq 0,

(14)

where the radius of the error ball is $E$ .

Proof 7

See Appendix -F.
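The quantities in Theorem 3 can be assembled as in the following sketch. Here $\kappa_{f}$ is passed in as a given constant (it is defined outside this section), all numerical values are illustrative placeholders, and the function names are hypothetical.

```python
def xi_offset(kappa_q, kappa_f, q_min, mu, theta):
    """xi := kappa_q * (64 * (1 + kappa_f)^2 + q_min) * mu * theta."""
    return kappa_q * (64.0 * (1.0 + kappa_f) ** 2 + q_min) * mu * theta


def decaying_step(k, theta, xi):
    """alpha_k = theta / (k + xi)."""
    return theta / (k + xi)


def error_radius_E(n, phi, gamma, byz_neighbor_counts):
    """E := 4 * P2 / gamma with P2 := n * phi^2 * sum_i |B_i|^2 / gamma."""
    P2 = n * phi ** 2 * sum(b ** 2 for b in byz_neighbor_counts) / gamma
    return 4.0 * P2 / gamma


def sublinear_bound(k, Xi, xi, E):
    """Right-hand side of (14)."""
    return Xi / (k + xi) + E


# Illustrative placeholder constants (not taken from the experiments).
theta, gamma = 10.0, 0.5  # theta > 4 / gamma
xi = xi_offset(kappa_q=1.0, kappa_f=2.0, q_min=100, mu=0.01, theta=theta)
E_clean = error_radius_E(n=10, phi=0.1, gamma=gamma, byz_neighbor_counts=[0, 0, 0])
E_attacked = error_radius_E(n=10, phi=0.1, gamma=gamma, byz_neighbor_counts=[1, 0, 2])
```

With no Byzantine neighbors the radius $E$ vanishes, so the bound (14) decays to zero at the rate $\mathcal{O}\left({1/k}\right)$; any Byzantine neighbor makes $E$ strictly positive.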

Remark 6

The convergence results established in Theorems 2-3 assert that the proposed algorithms achieve linear convergence at the expense of a larger convergence error than in the sub-linear convergence case. We note that, according to Theorem 2, a smaller constant step-size in the linear convergence case leads simultaneously to a smaller convergence error and a slower convergence rate. It is also clear from Theorem 3 that the convergence error of Prox-DBRO-SAGA and Prox-DBRO-LSVRG in the sub-linear convergence case is determined by the number of Byzantine agents. That is to say, the exact sub-linear convergence of Prox-DBRO-SAGA and Prox-DBRO-LSVRG is recovered when the number of Byzantine agents equals zero. Therefore, Theorems 2-3 demonstrate a trade-off between the convergence error and the convergence rate.

Remark 7

This paper does not make any assumptions or restrictions on the number/proportion of Byzantine agents in the network and only assumes a connected network among all reliable agents (see Assumption 2). However, this does not imply that the number of Byzantine agents can be unbounded, since an increase in the number of Byzantine agents leads to larger convergence errors in both cases, and an unbounded number of Byzantine agents eventually causes divergence of the proposed algorithms according to Eqs. (13) and (14) in Theorems 2-3. According to Theorems 2-3, the resilience of Algorithms 2-3 is characterized by the consensus and controllable convergence errors of all reliable agents.

IV Numerical Experiments

In this section, we perform a case study on decentralized soft-max regression with sparsity to verify the theoretical results and show the convergence performance of the proposed algorithms, where four kinds of Byzantine attacks (zero-sum attacks, Gaussian attacks, same-value attacks, and sign-flipping attacks) are considered. The communication networks are randomly generated by the Erdős-Rényi method, where Byzantine agents are also selected in a random way. Most existing literature adopts only the testing accuracy and the consensus error to validate the convergence performance of the proposed algorithms. However, neither a higher testing accuracy nor a smaller consensus error can comprehensively reflect the convergence of the tested algorithms. This is because these two metrics fail to precisely measure the distance between the function value at the current iterate and the optimal value of the optimization problem, which is a primary goal of theoretical analysis. Therefore, there is a gap between the theoretical result and practical performance. To bridge the gap, we introduce the (averaged) optimal gap in the form of function values, i.e., $\left({1/\left|\mathcal{R}\right|}\right)\sum\nolimits_{i\in\mathcal{R}}{\left({{f_{i}}\left({{x_{i,k}}}\right)+g\left({{x_{i,k}}}\right)-\left({{f_{i}}\left({{{\tilde{x}}^{*}}}\right)+g\left({{{\tilde{x}}^{*}}}\right)}\right)}\right)}$, as the third metric, which can precisely capture the transient behaviors (convergence or divergence) of algorithms when training a machine- or deep-learning model.

A network of $m$ agents, consisting of ${\left|\mathcal{R}\right|}$ reliable agents and $\left|\mathcal{B}\right|$ Byzantine agents, minimizes a regularized soft-max regression problem for a multi-class classification task by specifying the problem formulation (1) as ${f_{i}}\left({\tilde{x}}\right):=-\left({1/{q_{i}}}\right)\sum\nolimits_{j=1}^{{q_{i}}}{\sum\nolimits_{l=0}^{\tilde{C}-1}{{\mathcal{I}_{\left({{\tilde{b}_{ij}}=l}\right)}}}}\ln\left({{e^{\left[{\tilde{x}}\right]_{l}^{\top}{\tilde{c}_{ij}}}}/\sum\nolimits_{t=0}^{\tilde{C}-1}{{e^{\left[{\tilde{x}}\right]_{t}^{\top}{\tilde{c}_{ij}}}}}}\right)+\left({{\beta_{1}}/2}\right)\left\|{\tilde{x}}\right\|_{2}^{2}$ and ${g}\left({\tilde{x}}\right):={\beta_{2}}{\left\|{\tilde{x}}\right\|_{1}}$, where $\tilde{x}\in{\mathbb{R}^{n}}$ with $n=\tilde{C}\tilde{n}$ is the model parameter, ${\tilde{C}}$ represents the number of sample classes, ${\left[{\tilde{x}}\right]_{j}}$ denotes a vector that contains the $\left(j\tilde{n}\right)$-th to $\left(\left({j+1}\right)\tilde{n}-1\right)$-th elements of $\tilde{x}$, ${{{\tilde{b}_{ij}}}}$ and ${{\tilde{c}_{ij}}}$ are the $j$-th label and image allocated to agent $i$, respectively, ${{\mathcal{I}_{\left({{{\tilde{b}_{ij}}}=l}\right)}}}$ is the indicator function with ${\mathcal{I}_{\left({{{\tilde{b}_{ij}}}=l}\right)}}=1$ if ${{{\tilde{b}_{ij}}}=l}$ and ${\mathcal{I}_{\left({{{\tilde{b}_{ij}}}=l}\right)}}=0$ otherwise, and ${\beta_{1}}$ and ${\beta_{2}}$ are positive regularization parameters for avoiding over-fitting and obtaining a sparse solution, respectively. We denote the total number of training samples by $N$, and the regularization parameters are set as ${\beta_{1}}={\beta_{2}}=1/N$ in the following numerical experiments.
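The local objective above can be evaluated as in the following sketch, which uses plain Python lists for clarity rather than speed. The function names and the tiny data set are hypothetical; the log-sum-exp shift is a standard numerical safeguard added here and does not change the value of ${f_{i}}$.

```python
import math


def softmax_loss(x, images, labels, num_classes, beta1):
    """f_i(x): averaged soft-max cross-entropy plus (beta1 / 2) * ||x||_2^2.

    x is the stacked parameter of length num_classes * n_feat, and block l of x
    (zero-indexed elements l * n_feat, ..., (l + 1) * n_feat - 1) scores class l.
    """
    n_feat = len(x) // num_classes
    blocks = [x[l * n_feat:(l + 1) * n_feat] for l in range(num_classes)]
    total = 0.0
    for c, b in zip(images, labels):
        scores = [sum(w * v for w, v in zip(blk, c)) for blk in blocks]
        m = max(scores)  # shift for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total -= scores[b] - log_norm  # minus the log-probability of the true class
    return total / len(images) + 0.5 * beta1 * sum(v * v for v in x)


def l1_regularizer(x, beta2):
    """g(x) = beta2 * ||x||_1."""
    return beta2 * sum(abs(v) for v in x)


# Tiny hypothetical data set: 2 samples, 2 features, 3 classes.
images = [[1.0, 2.0], [0.5, -1.0]]
labels = [0, 2]
x_zero = [0.0] * 6
loss_at_zero = softmax_loss(x_zero, images, labels, num_classes=3, beta1=0.5)
```

At the all-zero parameter every class receives probability $1/\tilde{C}$, so the loss equals $\ln\tilde{C}$, which gives a convenient sanity check.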

Figure 1: Random samples selected from the MNIST data set

Since the algorithmic framework BRIDGE [22] and the decentralized algorithm (denoted by Peng) [26] are only applicable to a class of smooth single-objective optimization problems, we equip them with the proximal-gradient mapping method [38, 39, 40, 37, 12] to obtain Prox-BRIDGE-T, Prox-BRIDGE-M, Prox-BRIDGE-K, and Prox-Peng for the non-smooth composite finite-sum optimization problem; the same mapping is also applied to GeoMed [52] to get Prox-GeoMed. The initial states of the decision variables of all tested algorithms are the same and generated from a standard normal distribution. Note that the parameters of all tested algorithms are tuned manually to obtain their best performance, and the parameters associated with the problem model are kept the same to ensure fairness. A total number of $Q=60000$ training samples from the MNIST [53] data set are evenly allocated to each agent (including both reliable agents and Byzantine agents in the network) to train the discriminator, while the remaining $10000$ samples are used for testing. Fig. 1 presents 100 samples randomly selected from the data set. Recalling the theoretical results regarding the decaying and constant step-sizes and the penalty parameter, the experimental setting gives the following feasible selection ranges: $\alpha\in\left({0,0.1385}\right]$, ${\alpha_{k}}\in\left({0,1/\left({k+14.2733}\right)}\right]$, and $\phi\geq 0.0003$.

TABLE II: Parameter settings and algorithm performance at 150 epochs under zero-sum attacks.

	NIDS	PMGT-LSVRG	PMGT-SAGA	Prox-BRIDGE-T	Prox-BRIDGE-M	Prox-BRIDGE-K	Prox-GeoMed	Prox-Peng	Prox-DBRO-LSVRG	Prox-DBRO-SAGA
Step-size	[0.01, 0.015]	0.001	0.01	0.35	0.3	0.35	0.35	0.5	0.05	0.005
Triggered probability	N/A	$m/Q$	N/A	N/A	N/A	N/A	N/A	N/A	$\left[{m/Q/2,m/Q}\right]$	N/A
Penalty parameter	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.1	[0.2, 0.25]	[0.2, 0.25]
Consensus error	0	1.4163e-11	1.4159e-09	1.2492e-03	1.2116e-03	1.9340e-03	3.8010e-03	92.9523	4.4680	4.2898e-02
Testing accuracy	0.098	0.6812	0.6812	0.8965	0.8918	0.8653	0.8999	0.8789	0.9137	0.9155
Optimal gap	2.0245	1.9230	1.9228	1.4940e-01	1.5751e-01	2.1701e-01	1.2435e-01	2.7077e-01	5.0790e-02	3.7756e-02
PS&PI is the abbreviation of parameter settings and performance metrics.

Zero-sum attacks: As depicted in Fig. 2(a), an $m=30$ network consists of ${\left|\mathcal{R}\right|}=25$ reliable agents (yellow nodes) and $\left|\mathcal{B}\right|=5$ Byzantine agents (red nodes), where each Byzantine agent $b$, $b\in\mathcal{B}$, sends a well-designed malicious message ${z_{ib,k}}=-{\sum\nolimits_{j\in{\mathcal{R}_{i}}}{{w_{ij}}{x_{j,k}}}}/\left|{{\mathcal{B}_{i}}}\right|/{w_{bi}}$ to its reliable neighbor $i$, $i\in{\mathcal{R}_{b}}$, to drive the state of the reliable agent to ${x_{i,k}}=0_{n}$ at each iteration. For NIDS [39], the algorithm parameter is fixed as $c=1/\left({2{{\max}_{i\in\mathcal{R}}}\left\{{{\alpha_{i}}}\right\}}\right)$ for $\tilde{W}={I_{m}}-c{D_{\alpha}}\left({{I_{m}}-W}\right)$, where $\tilde{W}$, $W$, and $D_{\alpha}$ are the modified mixing matrix, weight matrix, and uncoordinated step-size, respectively. This means that if $c$ is sufficiently small, NIDS runs without any communication happening among all agents (both reliable agents and Byzantine agents) in the network. For PMGT-SAGA/PMGT-LSVRG [36], we hand-tune the parameter associated with multi-step communications to obtain the best performance. It is clear from Fig. 2(b)-(d) and Table II that the proposed algorithms achieve a smaller optimal gap and higher testing accuracy than the other tested algorithms with the same amount of computational cost (epochs). This demonstrates that the proposed algorithms approach the optimal solution faster than the other tested algorithms. It is worthwhile to mention that zero-sum attacks launched by Byzantine agents aim to drive the states of all reliable agents to zero at each iteration. Therefore, the much smaller consensus errors of NIDS and PMGT-SAGA/PMGT-LSVRG, compared with the decentralized Byzantine-resilient algorithms, may indicate that they are less resilient or more susceptible to the zero-sum attacks. This is confirmed by their bigger optimal gaps and lower testing accuracy shown in Fig. 2(b)-(c) and Table II.
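The zero-sum message can be written out as a short sketch. The function name and the numbers are hypothetical, and the cancellation check assumes that the reliable agent $i$ weights each Byzantine message by $w_{bi}$, which is the reading of the formula adopted here.

```python
def zero_sum_message(reliable_states, reliable_weights, num_byzantine, w_bi):
    """z_{ib,k} = -(sum_{j in R_i} w_ij * x_{j,k}) / |B_i| / w_bi."""
    n = len(reliable_states[0])
    weighted = [
        sum(w * x[d] for w, x in zip(reliable_weights, reliable_states))
        for d in range(n)
    ]
    return [-v / num_byzantine / w_bi for v in weighted]


# Hypothetical neighborhood of agent i: two reliable and two Byzantine neighbors.
states = [[1.0, 2.0], [3.0, 4.0]]
weights = [0.2, 0.3]
z = zero_sum_message(states, weights, num_byzantine=2, w_bi=0.25)

# Weighted aggregate seen by agent i once both Byzantine messages are included.
aggregate = [
    sum(w * x[d] for w, x in zip(weights, states)) + 2 * 0.25 * z[d]
    for d in range(2)
]
```

The aggregate is zero in every coordinate, which is how this attack pulls a weighted-averaging scheme toward the origin.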

Gaussian attacks: It is shown in Fig. 3(a) that an $m=40$ network consists of ${\left|\mathcal{R}\right|}=32$ reliable agents (yellow nodes) and $\left|\mathcal{B}\right|=8$ Byzantine agents (red nodes), where each Byzantine agent $b$, $b\in\mathcal{B}$, sends a message subject to a Gaussian distribution with mean $\sum\nolimits_{j\in{\mathcal{R}_{i}}}{{w_{ij}}{x_{j,k}}}/\sum\nolimits_{j\in{\mathcal{R}_{i}}}{{w_{ij}}}$ and standard deviation 30, to its reliable neighbor $i$, $i\in{\mathcal{R}_{b}}$, at each iteration. This attack serves as a Gaussian noise, which can easily inflict fluctuation on the states of reliable agents and make them deviate from their true values. Even though the testing accuracy index can still fluctuate around 0.6, we can see from Fig. 3(b) and Table III that NIDS and PMGT-SAGA/PMGT-LSVRG diverge from the optimal solution under Gaussian attacks. It is shown by Figs. 3(b)-(d) that the proposed algorithms can still achieve a smaller optimal gap and higher testing accuracy within the same number of epochs, i.e., faster convergence, than the other tested algorithms. Moreover, one can clearly see from Table III that Prox-DBRO-SAGA performs best on all three performance metrics (optimal gap, testing accuracy, and consensus error) at 150 epochs, while Prox-DBRO-LSVRG ranks second on the optimal gap and testing accuracy and attains a consensus error close to that of Prox-BRIDGE-M.

TABLE III: Parameter settings and algorithm performance at 150 epochs under Gaussian attacks.

	NIDS	PMGT-LSVRG	PMGT-SAGA	Prox-BRIDGE-T	Prox-BRIDGE-M	Prox-BRIDGE-K	Prox-GeoMed	Prox-Peng	Prox-DBRO-LSVRG	Prox-DBRO-SAGA
Step-size	[0.3, 0.35]	0.3	0.3	0.4	0.3	0.3	0.4	0.05	0.0015	0.0025
Triggered probability	N/A	$m/Q$	N/A	N/A	N/A	N/A	N/A	N/A	$\left[{m/Q/4,m/Q/2}\right]$	N/A
Penalty parameter	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.001	[0.001, 0.0015]	[0.001, 0.0015]
Consensus error	3.0396e+04	3.2041e+04	3.1770e+04	1.8532e-03	1.0682e-03	1.9835e-03	1.6716e-03	8.3993e-02	1.1938e-03	5.1252e-06
Testing accuracy	0.0742	0.4985	0.5765	0.9001	0.8952	0.8522	0.9002	0.8856	0.9165	0.9197
Optimal gap	Inf	4.2709e+01	3.7046e+01	1.2136e-01	1.4870e-01	2.4074e-01	1.1507e-01	1.4480e-01	3.2053e-02	5.7316e-03
PS&PI is the abbreviation of parameter settings and performance metrics.

TABLE IV: Parameter settings and algorithm performance at 150 epochs under same-value attacks.

	NIDS	PMGT-LSVRG	PMGT-SAGA	Prox-BRIDGE-T	Prox-BRIDGE-M	Prox-BRIDGE-K	Prox-GeoMed	Prox-Peng	Prox-DBRO-LSVRG	Prox-DBRO-SAGA
Step-size	[0.5, 0.55]	0.5	0.5	0.35	0.3	0.4	0.4	$0.97/\left({k+25}\right)$	$0.91/\left({k+21}\right)$	$0.74/\left({k+35}\right)$
Triggered probability	N/A	$m/Q$	N/A	N/A	N/A	N/A	N/A	N/A	$\left[{m/Q/8,m/Q/4}\right]$	N/A
Penalty parameter	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0005	[0.0004, 0.00045]	[0.0005, 0.00055]
Consensus error	4.1710e-02	1.1216e+06	1.1216e+06	2.0447e-03	1.7397e-03	1.9091e+08	2.6619e-03	6.0639e-01	1.4196e-03	7.0507e-04
Testing accuracy	0.098	0.8227	0.8162	0.8972	0.8942	0.8649	0.9006	0.8915	0.9072	0.9067
Optimal gap	6.5336e+04	2.2783e+04	2.2783e+04	1.6548e-01	1.6724e-01	1.6336e+03	1.5303e-01	1.9619e-01	1.4623e-01	1.5938e-01
PS&PI is the abbreviation of parameter settings and performance metrics.

Same-value attacks: As depicted in Fig. 4(a), an $m=60$ network consists of ${\left|\mathcal{R}\right|}=40$ reliable agents (yellow nodes) and $\left|\mathcal{B}\right|=20$ Byzantine agents (red nodes), where each Byzantine agent $b$, $b\in\mathcal{B}_{i}$, keeps sending ${z_{ib,k}}=1000*{1_{n}}$ to its reliable neighbor $i$, $i\in{\mathcal{R}}$, at each iteration. Under this attack, the states of reliable agents can easily be blown up to very large values, which prevents the tested algorithms from converging. Figs. 4(b)-(c) show that the proposed algorithms achieve faster convergence than the other tested algorithms on the performance metrics of the optimal gap and testing accuracy, while Prox-DBRO-LSVRG is slightly faster than Prox-DBRO-SAGA in this case. It can be found in Table IV that the proposed algorithms also attain a smaller consensus error than the other tested algorithms at 150 epochs. Note that the consensus error of Prox-BRIDGE-K goes to a very large value since it adopts a vector-valued operation for screening at each reliable agent, which results in a single surviving vector coming entirely from one neighboring agent and thus easily leads to a large state variance between any two reliable agents. The performance comparison takes no account of BRIDGE-B [22] due to its strict requirement on the number of neighboring agents and high computational overhead. In a nutshell, the above numerical experiments demonstrate that the proposed algorithms fail to achieve the smallest consensus error only under zero-sum attacks and achieve the best performance in all other cases.

Sign-flipping attacks: In this case, we aim to verify the trade-off between the convergence accuracy and Byzantine resilience of the proposed algorithms. The total number of agents, including both reliable and Byzantine agents, is $m=100$, where each Byzantine agent $b$, $b\in\mathcal{B}$, sends the falsified model $z_{ib,k}=-s_{b}\sum\nolimits_{j\in\mathcal{R}_{i}\cup\left\{i\right\}}x_{j,k}/\left(\left|\mathcal{R}_{i}\right|+1\right)$ to its reliable neighbors $i$, $i\in\mathcal{R}_{b}$, where $s_{b}>0$ is the hyperparameter controlling the deviation of the attack. Fig. 5 shows that as the proportion or number of Byzantine agents increases, the convergence accuracy degrades on all three performance metrics. This verifies the trade-off established in the theoretical results (see Theorems 2-3).
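The falsified message of the sign-flipping attack can be sketched in a few lines. This is a minimal illustration rather than the experimental code: the function name `sign_flipping_message` and the list-based vector representation are assumptions made here, and only the formula for $z_{ib,k}$ comes from the text.

```python
def sign_flipping_message(reliable_neighbor_states, own_state, s_b):
    """Return z = -s_b * mean{x_j : j in R_i united with {i}}.

    reliable_neighbor_states: list of vectors x_{j,k}, j in R_i (hypothetical input)
    own_state: the vector x_{i,k} of the targeted reliable agent i
    s_b: positive deviation hyperparameter of Byzantine agent b
    """
    vectors = list(reliable_neighbor_states) + [own_state]
    n = len(own_state)
    # entry-wise average over the |R_i| + 1 reliable states
    mean = [sum(v[e] for v in vectors) / len(vectors) for e in range(n)]
    # flip the sign and scale by s_b
    return [-s_b * m for m in mean]


z = sign_flipping_message([[1.0, 2.0], [3.0, 4.0]], [2.0, 0.0], s_b=2.0)
```

A larger `s_b` pushes the falsified model further from the local average of the reliable states.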

V Conclusion

In this paper, we proposed two decentralized Byzantine-resilient and VR stochastic gradient algorithms, namely Prox-DBRO-LSVRG and Prox-DBRO-SAGA, to solve a class of non-smooth composite finite-sum optimization problems over MASs in the presence of Byzantine agents. Theoretical analysis established linear convergence under a constant step-size and sub-linear convergence under a decaying step-size. In the numerical experiments, the proposed algorithms were applied to a decentralized sparse soft-max regression task over MASs under different Byzantine attacks, which verifies the theoretical findings and demonstrates that the proposed algorithms converge better than other notable decentralized algorithms. However, a limitation of both Prox-DBRO-LSVRG and Prox-DBRO-SAGA is that they achieve exact sub-linear convergence only in the absence of Byzantine agents. Future work will further investigate privacy issues and intermittent communication, which are also prevalent in MASs.

-A Proof of Lemma 1

According to Step 4 in Algorithm 3, at iteration $k$ , $\forall k\geq 1$ , the auxiliary variables $u_{i,k+1}^{l}$ , $i\in\mathcal{R}$ , take value $u_{i,k}^{l}$ or ${x_{i,k}}$ , associated with probabilities $\left({1-1/{q_{i}}}\right)$ and $1/q_{i}$ , respectively. This observation is owing to the fact that selection of the random sample for Prox-DBRO-SAGA, at each iteration $k\geq 1$ , is uniformly and independently executed. Hence, we have

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)^{\top}\left(u_{i,k+1}^{l}-\tilde{x}^{*}\right)\right] \\
={}&\left(1-\frac{1}{q_{i}}\right)\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)^{\top}\left(u_{i,k}^{l}-\tilde{x}^{*}\right)+\frac{1}{q_{i}}\nabla f_{i}\left(\tilde{x}^{*}\right)^{\top}\left(x_{i,k}-\tilde{x}^{*}\right).
\end{aligned}
\tag{15}
$$

Similarly, it holds that

$$
\mathbb{E}_{k}\left[f_{i}^{l}\left(u_{i,k+1}^{l}\right)\right]=\left(1-\frac{1}{q_{i}}\right)f_{i}^{l}\left(u_{i,k}^{l}\right)+\frac{1}{q_{i}}f_{i}^{l}\left(x_{i,k}\right).
\tag{16}
$$

Averaging (16) over the index $l=1,\ldots,q_{i}$, we further obtain

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}f_{i}^{l}\left(u_{i,k+1}^{l}\right)\right] \\
={}&\left(1-\frac{1}{q_{i}}\right)\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}f_{i}^{l}\left(u_{i,k}^{l}\right)+\frac{1}{q_{i}}\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}f_{i}^{l}\left(x_{i,k}\right) \\
={}&\left(1-\frac{1}{q_{i}}\right)\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}f_{i}^{l}\left(u_{i,k}^{l}\right)+\frac{1}{q_{i}}f_{i}\left(x_{i,k}\right).
\end{aligned}
\tag{17}
$$
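The identity (17) can be checked exactly by enumerating all $q_{i}$ equally likely samples. The sketch below is only a sanity check under assumptions made here: scalar quadratic components $f^{l}(v)=\frac{1}{2}a_{l}(v-b_{l})^{2}$ with randomly drawn coefficients, and a SAGA-style table in which only the sampled entry is overwritten by the current iterate.

```python
import random


def expected_table_average(u, x, fs):
    """Exact E_k[(1/q) * sum_l f^l(u_{k+1}^l)] for a uniformly drawn index s,
    where only the table entry u^s is overwritten by the current iterate x."""
    q = len(u)
    total = 0.0
    for s in range(q):            # enumerate every possible sample
        u_next = list(u)
        u_next[s] = x             # SAGA-style table refresh (assumed update rule)
        total += sum(f(v) for f, v in zip(fs, u_next)) / q
    return total / q              # each sample has probability 1/q


random.seed(0)
q = 5
# hypothetical components f^l(v) = 0.5 * a_l * (v - b_l)^2
coeffs = [(random.uniform(0.5, 2.0), random.uniform(-1.0, 1.0)) for _ in range(q)]
fs = [lambda v, a=a, b=b: 0.5 * a * (v - b) ** 2 for a, b in coeffs]
u = [random.uniform(-1.0, 1.0) for _ in range(q)]
x = 0.3

lhs = expected_table_average(u, x, fs)
f_i_at_x = sum(f(x) for f in fs) / q
rhs = (1 - 1 / q) * sum(f(v) for f, v in zip(fs, u)) / q + f_i_at_x / q
```

Both sides agree up to floating-point rounding, which mirrors the two-term structure on the right-hand side of (17).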

Recalling the definition of $t_{i,k}^{u}$ and combining Eqs. (15) and (17) give

$$
\begin{aligned}
&\mathbb{E}_{k}\left[t_{i,k+1}^{u}\right] \\
={}&\mathbb{E}_{k}\left[\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\left(f_{i}^{l}\left(u_{i,k+1}^{l}\right)-\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)^{\top}\left(u_{i,k+1}^{l}-\tilde{x}^{*}\right)\right)-f_{i}\left(\tilde{x}^{*}\right)\right] \\
={}&\left(1-\frac{1}{q_{i}}\right)\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\left(f_{i}^{l}\left(u_{i,k}^{l}\right)-\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)^{\top}\left(u_{i,k}^{l}-\tilde{x}^{*}\right)\right) \\
&+\frac{1}{q_{i}}f_{i}\left(x_{i,k}\right)-f_{i}\left(\tilde{x}^{*}\right)-\frac{1}{q_{i}}\nabla f_{i}\left(\tilde{x}^{*}\right)^{\top}\left(x_{i,k}-\tilde{x}^{*}\right) \\
={}&\left(1-\frac{1}{q_{i}}\right)t_{i,k}^{u}+\frac{1}{q_{i}}\left(f_{i}\left(x_{i,k}\right)-f_{i}\left(\tilde{x}^{*}\right)\right)-\frac{1}{q_{i}}\nabla f_{i}\left(\tilde{x}^{*}\right)^{\top}\left(x_{i,k}-\tilde{x}^{*}\right).
\end{aligned}
\tag{18}
$$

Summing Eq. (18) over $i$ yields

$$
\begin{aligned}
&\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[t_{i,k+1}^{u}\right] \\
={}&\sum_{i\in\mathcal{R}}\frac{1}{q_{i}}\left(f_{i}\left(x_{i,k}\right)-f_{i}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)^{\top}\left(x_{i,k}-\tilde{x}^{*}\right)\right)+\sum_{i\in\mathcal{R}}\left(1-\frac{1}{q_{i}}\right)t_{i,k}^{u} \\
\leq{}&\frac{1}{q_{\min}}\sum_{i\in\mathcal{R}}\left(f_{i}\left(x_{i,k}\right)-f_{i}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)^{\top}\left(x_{i,k}-\tilde{x}^{*}\right)\right)+\left(1-\frac{1}{q_{\max}}\right)\sum_{i\in\mathcal{R}}t_{i,k}^{u} \\
={}&\frac{1}{q_{\min}}D_{F}\left(x_{k},x^{*}\right)+\left(1-\frac{1}{q_{\max}}\right)\sum_{i\in\mathcal{R}}t_{i,k}^{u},
\end{aligned}
\tag{19}
$$

where the inequality uses $1\leq q_{\min}\leq q_{i}\leq q_{\max}$ together with the non-negativity of the summands (by convexity of $f_{i}^{l}$), and the last equality follows from $f\left(x\right)=\sum\nolimits_{i\in\mathcal{R}}f_{i}\left(x_{i}\right)$ and the definition of $D_{F}\left(x_{k},x^{*}\right)$. Substituting the definition of $t_{k}^{u}$ yields the relation (8). In view of Step 4 in Algorithm 2, we know that at iteration $k$, $\forall k\geq 1$, the auxiliary variables $w_{i,k+1}$, $i\in\mathcal{R}$, take the value $x_{i,k}$ with probability $p_{i}$, or keep the most recent update $w_{i,k}$ with probability $1-p_{i}$. Therefore, it can be verified that

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)^{\top}\left(w_{i,k+1}-\tilde{x}^{*}\right)\right] \\
={}&\frac{p_{i}}{q_{i}}\sum_{l=1}^{q_{i}}\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)^{\top}\left(x_{i,k}-\tilde{x}^{*}\right)+\left(1-p_{i}\right)\nabla f_{i}\left(\tilde{x}^{*}\right)^{\top}\left(w_{i,k}-\tilde{x}^{*}\right).
\end{aligned}
\tag{20}
$$

Likewise, we have

$$
\mathbb{E}_{k}\left[\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}f_{i}^{l}\left(w_{i,k+1}\right)\right]=\left(1-p_{i}\right)\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}f_{i}^{l}\left(w_{i,k}\right)+p_{i}f_{i}\left(x_{i,k}\right).
\tag{21}
$$

Recalling the definition of $t_{i,k}^{w}$ and combining Eq. (20) with (21) give

$$
\begin{aligned}
&\mathbb{E}_{k}\left[t_{i,k+1}^{w}\right] \\
={}&\mathbb{E}_{k}\left[\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\left(f_{i}^{l}\left(w_{i,k+1}\right)-\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)^{\top}\left(w_{i,k+1}-\tilde{x}^{*}\right)\right)-f_{i}\left(\tilde{x}^{*}\right)\right] \\
={}&\left(1-p_{i}\right)t_{i,k}^{w}+p_{i}\left(f_{i}\left(x_{i,k}\right)-f_{i}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)^{\top}\left(x_{i,k}-\tilde{x}^{*}\right)\right),
\end{aligned}
\tag{22}
$$

where we apply ${f_{i}}\left({{{\tilde{x}}^{*}}}\right)=\left({1/{q_{i}}}\right)\sum\nolimits_% {l=1}^{{q_{i}}}{f_{i}^{l}\left({{\tilde{x}}^{*}}\right)}$ in the last equality. The relation (9) is reached through summing Eq. (22) over $i$ and substituting the definitions of $t_{k}^{w}$ and ${D_{F}}\left({{x_{k}},{x^{*}}}\right)$ .

-B Proof of Lemma 2

According to Step 3 in Algorithm 2, it holds that

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{i,k}^{u}-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
={}&\mathbb{E}_{k}\left[\left\|r_{i,k}^{u}-\nabla f_{i}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(x_{i,k}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]+\left\|\nabla f_{i}\left(x_{i,k}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2},
\end{aligned}
\tag{23}
$$

where the equality is due to the standard variance decomposition $\mathbb{E}_{k}\left[{\left\|A\right\|_{2}^{2}}\right]=\left\|{\mathbb{E}_{k}% \left[A\right]}\right\|_{2}^{2}+\mathbb{E}_{k}\left[{\left\|{A-\mathbb{E}_{k}% \left[A\right]}\right\|_{2}^{2}}\right]$ , with $A={r_{i,k}^{u}}-\nabla{f_{i}}\left({{{\tilde{x}}^{*}}}\right)$ . We continue to handle the first term in the right-hand-side of Eq. (23) as follows:

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{i,k}^{u}-\nabla f_{i}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(x_{i,k}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(x_{i,k}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&+2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)-\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\nabla f_{i}^{l}\left(u_{i,k}^{l}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(x_{i,k}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&+2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
={}&2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]+2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&-2\left\|\nabla f_{i}\left(x_{i,k}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2},
\end{aligned}
\tag{24}
$$

where the second inequality utilizes $\mathbb{E}_{k}\left[{\left\|{B-\mathbb{E}_{k}\left[B\right]}\right\|_{2}^{2}}% \right]\leq\mathbb{E}_{k}\left[{\left\|B\right\|_{2}^{2}}\right]$ , with $B=\nabla f_{i}^{{s_{i,k}}}\left({u_{i,k}^{{s_{i,k}}}}\right)-\nabla f_{i}^{{s_% {i,k}}}\left({{{\tilde{x}}^{*}}}\right)$ , and the last equality applies the standard variance decomposition again. We proceed with substituting (24) into (23) to obtain

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{i,k}^{u}-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]+2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&-\left\|\nabla f_{i}\left(x_{i,k}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}.
\end{aligned}
\tag{25}
$$

Summing Eq. (25) over $i$ generates

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{k}^{u}-\nabla f\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&2\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]+2\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&-\sum_{i\in\mathcal{R}}\left\|\nabla f_{i}\left(x_{i,k}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}.
\end{aligned}
\tag{26}
$$

Since the local component objective function $f_{i}^{l}$ , $\forall l\in\mathcal{Q}_{i}$ and $\forall i\in\mathcal{R}$ , is $L$ -smooth according to Assumption 1, we have

$$
\begin{aligned}
&\frac{1}{2L}\left\|\nabla f_{i}^{l}\left(u_{i,k}^{l}\right)-\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)\right\|_{2}^{2} \\
\leq{}&f_{i}^{l}\left(u_{i,k}^{l}\right)-f_{i}^{l}\left(\tilde{x}^{*}\right)-\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)^{\top}\left(u_{i,k}^{l}-\tilde{x}^{*}\right).
\end{aligned}
\tag{27}
$$

Averaging both sides of (27) over $l$ from $1$ to $q_{i}$ and recalling the definition of $t_{i,k}^{u}$ gives

$$
\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\left\|\nabla f_{i}^{l}\left(u_{i,k}^{l}\right)-\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\leq 2Lt_{i,k}^{u}.
\tag{28}
$$

Since the local component function $f_{i}^{s_{i,k}}$ is drawn uniformly from the set $\left\{f_{i}^{1},\ldots,f_{i}^{q_{i}}\right\}$, we obtain

$$
\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]=\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\left\|\nabla f_{i}^{l}\left(u_{i,k}^{l}\right)-\nabla f_{i}^{l}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}.
\tag{29}
$$

Combining (28) with (29) and then summing over $i$ yields

$$
\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]\leq 2Lt_{k}^{u}.
\tag{30}
$$

Summarizing (26) and (30) obtains

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{k}^{u}-\nabla f\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&4Lt_{k}^{u}+2\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]-\left\|\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\|_{2}^{2},
\end{aligned}
\tag{31}
$$

where we simplify $\sum\nolimits_{i\in\mathcal{R}}{\left\|{\nabla{f_{i}}\left({{x_{i,k}}}\right)-% \nabla{f_{i}}\left({{{\tilde{x}}^{*}}}\right)}\right\|_{2}^{2}}$ as $\left\|{\nabla F\left({{x_{k}}}\right)-\nabla F\left({{x^{*}}}\right)}\right\|% _{2}^{2}$ . Via applying the Lipschitz continuity of $\nabla f_{i}^{l}$ again, we have

$$
\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]\leq 2LD_{F}\left(x_{k},x^{*}\right),
\tag{32}
$$

where we use the fact that ${f_{i}}\left({{x_{i,k}}}\right)=\left({1/{q_{i}}}\right)\sum\nolimits_{l=1}^{{% q_{i}}}{f_{i}^{l}\left({{x_{i,k}}}\right)}$ and $f\left(x_{k}\right)=\sum\nolimits_{i\in\mathcal{R}}{{f_{i}}\left({{x_{i,k}}}% \right)}$ . Plugging (32) into (31) generates

$$
\mathbb{E}_{k}\left[\left\|r_{k}^{u}-\nabla f\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]\leq 4Lt_{k}^{u}+4LD_{F}\left(x_{k},x^{*}\right)-\left\|\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\|_{2}^{2}.
\tag{33}
$$

Considering the $\mu$ -strong convexity of the local objective function $f_{i}$ , $\forall i\in\mathcal{R}$ , we have

$$
2\mu D_{F}\left(x_{k},x^{*}\right)\leq\left\|\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\|_{2}^{2}.
\tag{34}
$$

Finally, one can obtain (10) via plugging the relation (34) into (33). For Prox-DBRO-LSVRG, we replace ${u_{i,k}^{l}}$ with ${{w_{i,k}}}$ to obtain (11), which completes the proof.
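The chain (23)-(34) can be sanity-checked for a single agent by exact enumeration over the sample index. The sketch below is an illustration under assumptions made here, not the paper's experiment: scalar quadratic components $f^{l}(v)=\frac{1}{2}a_{l}(v-b_{l})^{2}$ with $a_{l}\in[\mu,L]$, the usual SAGA estimator $r=\nabla f^{s}(x)-\nabla f^{s}(u^{s})+\frac{1}{q}\sum_{l}\nabla f^{l}(u^{l})$ suggested by the decomposition in (24), and an arbitrary reference point `x_ref` playing the role of $\tilde{x}^{*}$.

```python
import random

random.seed(1)
q, mu, L = 6, 0.5, 2.0
a = [random.uniform(mu, L) for _ in range(q)]     # curvatures in [mu, L]
b = [random.uniform(-1.0, 1.0) for _ in range(q)]


def f(l, v):                                      # component function f^l
    return 0.5 * a[l] * (v - b[l]) ** 2


def grad(l, v):                                   # its gradient
    return a[l] * (v - b[l])


x, x_ref = 0.7, -0.2                              # current iterate, reference point
u = [random.uniform(-1.0, 1.0) for _ in range(q)] # SAGA table entries

table_avg = sum(grad(l, u[l]) for l in range(q)) / q
grad_ref = sum(grad(l, x_ref) for l in range(q)) / q
grad_x = sum(grad(l, x) for l in range(q)) / q

# exact E_k ||r - grad f_i(x_ref)||^2 over the uniformly drawn index s
lhs = sum((grad(s, x) - grad(s, u[s]) + table_avg - grad_ref) ** 2
          for s in range(q)) / q


def bregman(l, v):                                # f^l(v) - f^l(ref) - grad^T (v - ref)
    return f(l, v) - f(l, x_ref) - grad(l, x_ref) * (v - x_ref)


t_u = sum(bregman(l, u[l]) for l in range(q)) / q # gradient-learning quantity
d = sum(bregman(l, x) for l in range(q)) / q      # Bregman distance of f_i

bound_33 = 4 * L * t_u + 4 * L * d - (grad_x - grad_ref) ** 2
bound_after_34 = 4 * L * t_u + 2 * (2 * L - mu) * d
```

For this instance `lhs <= bound_33 <= bound_after_34`, matching the order in which (33) and (34) are combined.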

-C Proof of Proposition 2

According to the definition of ${\mathbf{prox}}_{\alpha,G}\left\{x\right\}$ , we have

$$
\begin{aligned}
\mathbf{prox}_{\alpha,G}\left\{x\right\}={}&\arg\min_{y}\left\{G\left(y\right)+\frac{1}{2\alpha}\left\|y-x\right\|_{2}^{2}\right\} \\
={}&\arg\min_{y}\left\{\sum_{i\in\mathcal{R}}g\left(y_{i}\right)+\frac{1}{2\alpha}\sum_{i\in\mathcal{R}}\left\|y_{i}-x_{i}\right\|_{2}^{2}\right\} \\
={}&\begin{pmatrix}
\arg\min_{\tilde{y}\in\mathbb{R}^{n}}\left\{g\left(\tilde{y}\right)+\frac{1}{2\alpha}\left\|\tilde{y}-x_{1}\right\|_{2}^{2}\right\} \\
\arg\min_{\tilde{y}\in\mathbb{R}^{n}}\left\{g\left(\tilde{y}\right)+\frac{1}{2\alpha}\left\|\tilde{y}-x_{2}\right\|_{2}^{2}\right\} \\
\vdots \\
\arg\min_{\tilde{y}\in\mathbb{R}^{n}}\left\{g\left(\tilde{y}\right)+\frac{1}{2\alpha}\left\|\tilde{y}-x_{\left|\mathcal{R}\right|}\right\|_{2}^{2}\right\}
\end{pmatrix},
\end{aligned}
\tag{35}
$$

which indicates ${\left[{\mathbf{prox}}_{\alpha,G}\left\{x\right\}\right]}_{i}={\mathbf{prox}}_% {\alpha,g}\left\{x_{i}\right\}$ . Based on this equality, it is straightforward to verify (12) with the help of the non-expansiveness of the proximal operator ${{\mathbf{prox}}_{\alpha,g}}$ , which completes the proof.
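The block-wise structure in (35) can be illustrated with a concrete regularizer. The sketch below assumes $g=\lambda\left\|\cdot\right\|_{1}$, whose proximal mapping is entry-wise soft-thresholding; this choice of $g$, the helper names, and the toy vectors are assumptions made here for illustration.

```python
def soft_threshold(v, tau):
    """prox of tau * ||.||_1 for one local vector (entry-wise shrinkage)."""
    return [(abs(c) - tau) * (1.0 if c > 0 else -1.0) if abs(c) > tau else 0.0
            for c in v]


def prox_stacked(x_blocks, alpha, lam):
    """prox_{alpha,G} for G(y) = sum_i lam * ||y_i||_1: apply the local prox per block."""
    return [soft_threshold(x_i, alpha * lam) for x_i in x_blocks]


def sq_dist(a_blocks, b_blocks):
    """Squared Euclidean distance between two stacked vectors."""
    return sum((p - r) ** 2
               for a_i, b_i in zip(a_blocks, b_blocks)
               for p, r in zip(a_i, b_i))


x = [[3.0, -0.5], [-2.0, 0.2]]    # stacked states of two agents (toy values)
y = [[1.0, 0.5], [0.0, -4.0]]
px = prox_stacked(x, alpha=0.5, lam=2.0)
py = prox_stacked(y, alpha=0.5, lam=2.0)
```

Here `sq_dist(px, py)` does not exceed `sq_dist(x, y)`, which is the non-expansiveness property used to verify (12).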

-D Proof of Theorem 1

The optimal solution to (4) satisfies the optimality condition

$$
0_{n}\in\nabla f_{i}\left(x_{i}^{*}\right)+\partial_{x_{i}}g\left(x_{i}^{*}\right)+\frac{\phi}{2}\sum_{j\in\mathcal{R}_{i}}\partial\left\|x_{i}^{*}-x_{j}^{*}\right\|_{a},\quad\forall i\in\mathcal{R}.
\tag{36}
$$

According to the definition of the sub-differential $\partial\left\|x_{i}^{*}-x_{j}^{*}\right\|_{a}=\left\{y_{ij}\in\mathbb{R}^{n}\,|\,\left\langle y_{ij},x_{i}^{*}-x_{j}^{*}\right\rangle=\left\|x_{i}^{*}-x_{j}^{*}\right\|_{a},\left\|y_{ij}\right\|_{b}\leq 1\right\}$, there exist $g^{\prime}\left(x_{i}^{*}\right)\in\partial_{x_{i}}g\left(x_{i}^{*}\right)$ and $\tilde{y}_{ij}\in\partial\left\|x_{i}^{*}-x_{j}^{*}\right\|_{a}$, such that for all $i\in\mathcal{R}$

$$
\nabla f_{i}\left(x_{i}^{*}\right)+g^{\prime}\left(x_{i}^{*}\right)+\phi\left(\sum_{j\in\mathcal{R}_{i},i<j}\tilde{y}_{ij}-\sum_{j\in\mathcal{R}_{i},i>j}\tilde{y}_{ji}\right)=0_{n}.
\tag{37}
$$

Under Assumption 1, the globally optimal solution $x^{*}$ exists uniquely. We next need to prove that the optimal solution ${\tilde{x}^{*}}$ satisfies (37), such that

$$
\nabla f_{i}\left(\tilde{x}^{*}\right)+g^{\prime}\left(\tilde{x}^{*}\right)+\phi\left(\sum_{j\in\mathcal{R}_{i},i<j}\tilde{y}_{ij}-\sum_{j\in\mathcal{R}_{i},i>j}\tilde{y}_{ji}\right)=0_{n},
\tag{38}
$$

where $g^{\prime}\left(\tilde{x}^{*}\right)\in\partial_{\tilde{x}}g\left(\tilde{x}^{*}\right)$. Since (38) decomposes element-wise, the rest of the proof assumes $n=1$, i.e., the scalar case, without loss of generality. Denoting $\psi_{i}:=\nabla f_{i}\left(\tilde{x}^{*}\right)+g^{\prime}\left(\tilde{x}^{*}\right)$ and $\Psi:=\left[\psi_{1},\psi_{2},\ldots,\psi_{\left|\mathcal{R}\right|}\right]^{\top}\in\mathbb{R}^{\left|\mathcal{R}\right|}$, proving (38) reduces to finding a vector $\tilde{y}$ such that the following relation holds

\phi\Pi{\tilde{y}}+\Psi=0_{\left|\mathcal{R}\right|},

(39)

where $\tilde{y}\in\mathbb{R}^{\left|\mathcal{E}_{\mathcal{R}}\right|}$ collects $\tilde{y}_{ij}$ according to the order of edges in $\mathcal{E}_{\mathcal{R}}$. We need to find at least one solution $\tilde{y}$ whose entries satisfy $\left\|\tilde{y}_{ij}\right\|_{b}\leq 1$ with $b>1$, such that (39) holds. To proceed, we decompose the task into two parts.
Part I: We first show that (39) has at least one solution. The node-edge incidence matrix $\Pi$ has rank $\left|\mathcal{R}\right|-1$, and the null space of its columns (i.e., the set of $v$ with $v^{\top}\Pi=0$) is spanned by the all-one vector $1_{\left|\mathcal{R}\right|}$. Recalling the definition of $\psi_{i}$, the optimality condition of (1) is $\sum\nolimits_{i\in\mathcal{R}}\psi_{i}=0$. Therefore, the columns of $\Pi$ and those of $\left[\phi\Pi,\Psi\right]$ share the same null space, which implies that $\Pi$ and $\left[\phi\Pi,\Psi\right]$ have the same rank. Since the coefficient matrix and the augmented matrix have equal rank, the non-homogeneous linear system (39) admits at least one solution.
Part II: In this part, we seek a solution whose elements have $b$-norm no larger than $1$. Suppose that $y\in\mathbb{R}^{\left|\mathcal{E}_{\mathcal{R}}\right|}$ is a solution to (39), such that $\phi\Pi y+\Psi=0_{\left|\mathcal{R}\right|}$. We consider the least-squares solution $y=-\Pi^{\dagger}\Psi/\phi$, where $\Pi^{\dagger}$ is the Moore-Penrose pseudo-inverse of $\Pi$. Then, it suffices to prove that $\left\|y\right\|_{b}\leq 1$. Since $\left\|y\right\|_{b}=\left(\sum\nolimits_{e=1}^{\left|\mathcal{E}_{\mathcal{R}}\right|}\left|y_{e}\right|^{b}\right)^{1/b},\forall b>1$, we know that $\left\|y\right\|_{b}\leq\left\|y\right\|_{1}$. Therefore, we derive

$$
\begin{aligned}
\left\|y\right\|_{b}\leq{}&\frac{1}{\phi}\left\|\Pi^{\dagger}\Psi\right\|_{1} \\
\leq{}&\frac{1}{\phi}\left\|\Pi^{\dagger}\right\|_{1}\left\|\Psi\right\|_{1} \\
\leq{}&\frac{1}{\phi}\left|\mathcal{R}\right|\sqrt{\left|\mathcal{E}_{\mathcal{R}}\right|}\left\|\Pi^{\dagger}\right\|_{2}\left\|\Psi\right\|_{2},
\end{aligned}
\tag{40}
$$

where the second inequality uses the vector-matrix norm compatibility, and the last inequality applies the facts that $\left\|\Psi\right\|_{1}\leq\left|\mathcal{R}\right|\left\|\Psi\right\|_{2}$ and $\left\|\Pi^{\dagger}\right\|_{1}\leq\sqrt{\left|\mathcal{E}_{\mathcal{R}}\right|}\left\|\Pi^{\dagger}\right\|_{2}$. Let $\lambda_{\max}\left(\Pi^{\dagger}\right)$ and $\lambda_{\min}\left(\Pi\right)$ denote the maximum singular value of $\Pi^{\dagger}$ and the minimum nonzero singular value of $\Pi$, respectively, so that $\lambda_{\max}\left(\Pi^{\dagger}\right)=1/\lambda_{\min}\left(\Pi\right)$. Based on (40), we further obtain

$$
\begin{aligned}
\left\|y\right\|_{b}\leq{}&\lambda_{\max}\left(\Pi^{\dagger}\right)\frac{\left|\mathcal{R}\right|\sqrt{\left|\mathcal{E}_{\mathcal{R}}\right|}}{\phi}\left\|\Psi\right\|_{2} \\
={}&\frac{\left|\mathcal{R}\right|\sqrt{\left|\mathcal{E}_{\mathcal{R}}\right|}}{\phi\lambda_{\min}\left(\Pi\right)}\left\|\Psi\right\|_{2}.
\end{aligned}
\tag{41}
$$

Since ${\left\|\Psi\right\|_{2}}\leq\sqrt{\left|\mathcal{R}\right|}{\left\|\Psi\right% \|_{\infty}}$ , we further have

$$
\left\|y\right\|_{b}\leq\frac{\left|\mathcal{R}\right|^{\frac{3}{2}}\sqrt{\left|\mathcal{E}_{\mathcal{R}}\right|}}{\lambda_{\min}\left(\Pi\right)\phi}\max_{i\in\mathcal{R}}\left|\psi_{i}\right|.
\tag{42}
$$

If we consider $n\geq 1$, i.e., the arbitrary-dimension case, (42) becomes

$$
\left\|y\right\|_{b}\leq\frac{\left|\mathcal{R}\right|^{\frac{3}{2}}\sqrt{\left|\mathcal{E}_{\mathcal{R}}\right|}}{\lambda_{\min}\left(\Pi\right)\phi}\max_{i\in\mathcal{R}}\left\|\nabla f_{i}\left(\tilde{x}^{*}\right)+g^{\prime}\left(\tilde{x}^{*}\right)\right\|_{\infty}.
\tag{43}
$$

The proof is completed by choosing an appropriate ${\phi}$ to meet

$$
\frac{\left|\mathcal{R}\right|^{\frac{3}{2}}\sqrt{\left|\mathcal{E}_{\mathcal{R}}\right|}}{\lambda_{\min}\left(\Pi\right)\phi}\max_{i\in\mathcal{R}}\left\|\nabla f_{i}\left(\tilde{x}^{*}\right)+g^{\prime}\left(\tilde{x}^{*}\right)\right\|_{\infty}\leq 1.
$$
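The two parts of the argument can be illustrated on a small example. The sketch below assumes a path graph with edges $(i,i+1)$ oriented so that node $i$ carries $+1$ and node $i+1$ carries $-1$, for which the solution of $\phi\Pi y+\Psi=0$ is unique and given by cumulative sums; the graph, the values of $\Psi$, and the function names are hypothetical choices made here, and a general graph would require the pseudo-inverse as in the proof.

```python
def path_graph_solution(psi, phi):
    """Solve phi * Pi @ y + psi = 0 on a path graph (scalar case n = 1).

    Solvability requires sum(psi) == 0, the optimality condition in Part I.
    """
    assert abs(sum(psi)) < 1e-12, "Psi must be orthogonal to the all-one vector"
    y, running = [], 0.0
    for p in psi[:-1]:            # one unknown per edge (i, i+1)
        running -= p / phi        # y_i = y_{i-1} - psi_i / phi
        y.append(running)
    return y


def residual(psi, phi, y):
    """Entries of phi * Pi @ y + psi for the same path graph."""
    m = len(psi)
    res = []
    for i in range(m):
        out_edge = y[i] if i < m - 1 else 0.0   # edge (i, i+1), coefficient +1
        in_edge = y[i - 1] if i > 0 else 0.0    # edge (i-1, i), coefficient -1
        res.append(phi * (out_edge - in_edge) + psi[i])
    return res


psi = [1.5, -0.5, -2.0, 1.0]      # toy values with zero sum
y_small_phi = path_graph_solution(psi, phi=1.0)
y_large_phi = path_graph_solution(psi, phi=2.0)
```

With `phi=1.0` the largest entry of `y_small_phi` has magnitude 1.5, whereas `phi=2.0` brings every entry within the unit bound, which is the role played by a sufficiently large penalty parameter $\phi$.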

-E Proof of Theorem 2

Based on the optimality condition (36), we know that

$$
x^{*}=\mathbf{prox}_{\alpha,G}\left\{x^{*}-\alpha\left(\nabla F\left(x^{*}\right)+\partial_{x}\chi\left(x^{*}\right)\right)\right\}.
\tag{44}
$$

In view of the compact form (7), it holds

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right] \\
={}&\mathbb{E}_{k}\left[\left\|\mathbf{prox}_{\alpha,G}\left\{\bar{x}_{k}\right\}-\mathbf{prox}_{\alpha,G}\left\{x^{*}-\alpha\nabla F\left(x^{*}\right)-\alpha\partial_{x}\chi\left(x^{*}\right)\right\}\right\|_{2}^{2}\right] \\
\leq{}&\mathbb{E}_{k}\left[\left\|\bar{x}_{k}-\left(x^{*}-\alpha\left(\nabla F\left(x^{*}\right)+\partial_{x}\chi\left(x^{*}\right)\right)\right)\right\|_{2}^{2}\right] \\
={}&\left\|x_{k}-x^{*}\right\|_{2}^{2}-2\alpha\mathbb{E}_{k}\left[\left\langle x_{k}-x^{*},r_{k}-\nabla F\left(x^{*}\right)\right\rangle\right] \\
&-2\alpha\left\langle x_{k}-x^{*},\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)+\partial_{x}\delta\left(x_{k}\right)\right\rangle \\
&+\alpha^{2}\mathbb{E}_{k}\left[\left\|r_{k}-\nabla F\left(x^{*}\right)+\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)+\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}\right],
\end{aligned}
\tag{45}
$$

where the inequality applies (12). We continue to seek an upper bound for ${\mathbb{E}_{k}}\left[{\left\|{r_{k}-\nabla F\left({{x^{*}}}\right)+{\partial_% {x}}\chi\left({{x_{k}}}\right)-{\partial_{x}}\chi\left({{x^{*}}}\right)+{% \partial_{x}}\delta\left({{x_{k}}}\right)}\right\|_{2}^{2}}\right]$ as follows:

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{k}-\nabla F\left(x^{*}\right)+\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)+\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}\right] \\
\leq{}&4\mathbb{E}_{k}\left[\left\|r_{k}-\nabla F\left(x^{*}\right)\right\|_{2}^{2}\right]+2\left\|\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)\right\|_{2}^{2}+4\left\|\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2} \\
\leq{}&4\left(4Lt_{k}+2\left(2L-\mu\right)D_{F}\left(x_{k},x^{*}\right)\right)+4\left\|\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}+2\left\|\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)\right\|_{2}^{2},
\end{aligned}
\tag{46}
$$

where the first inequality applies $\left\|c+d\right\|_{2}^{2}\leq 2\left\|c\right\|_{2}^{2}+2\left\|d\right\|_{2}^{2}$ twice, and the second inequality employs Lemma 2. To proceed, we bound $\left\|\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}$ as follows:

$$
\left\|\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}=\sum_{i\in\mathcal{R}}\left\|\phi\sum_{j\in\mathcal{B}_{i}}\partial_{x_{i}}\left\|x_{i,k}-z_{ij,k}\right\|_{a}\right\|_{2}^{2}\leq n\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2},
\tag{47}
$$

where the inequality holds true, since the $b$ -norm ( $b\geq 1$ ) of ${\partial_{{x_{i}}}}{\left\|{{x_{i,k}}-{z_{ij,k}}}\right\|_{a}}$ , $\forall i\in\mathcal{R}$ , is no larger than $1$ owing to Proposition 1, i.e.,

$$
\left|\left[\partial_{x_{i}}\left\|x_{i,k}-z_{ij,k}\right\|_{a}\right]_{e}\right|\leq 1,\quad\forall e=1,\ldots,n.
\tag{48}
$$

Following the same technical line of (47)-(48), it is not difficult to verify

$$
\begin{aligned}
&\left\|\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)\right\|_{2}^{2} \\
={}&\sum_{i\in\mathcal{R}}\left\|\phi\sum_{j\in\mathcal{R}_{i}}\left(\partial_{x_{i}}\left\|x_{i,k}-x_{j,k}\right\|_{a}-\partial_{x_{i}}\left\|x_{i}^{*}-x_{j}^{*}\right\|_{a}\right)\right\|_{2}^{2} \\
\leq{}&4n\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{R}_{i}\right|^{2}.
\end{aligned}
\tag{49}
$$

Combining (46), (47), and (49) obtains

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{k}-\nabla F\left(x^{*}\right)+\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)+\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}\right] \\
\leq{}&4\left(2Lt_{k}+\left(2L-\mu\right)D_{F}\left(x_{k},x^{*}\right)\right)+8n\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{R}_{i}\right|^{2}+4n\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}.
\end{aligned}
\tag{50}
$$

Since the local objective function $f_{i}\left(x_{i}\right)$, $\forall i\in\mathcal{R}$, is $\mu$-strongly convex and $L$-smooth according to Assumption 1, we have

$$
\begin{aligned}
&-\mathbb{E}_{k}\left[\left\langle x_{k}-x^{*},r_{k}-\nabla F\left(x^{*}\right)\right\rangle\right] \\
={}&-\left\langle x_{k}-x^{*},\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\rangle \\
\leq{}&-\frac{\mu L}{\mu+L}\left\|x_{k}-x^{*}\right\|_{2}^{2}-\frac{1}{\mu+L}\left\|\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\|_{2}^{2}.
\end{aligned}
\tag{51}
$$

Recalling the definition of $\chi\left({{x_{k}}}\right)$ , we know that it is a convex function. Therefore, it is straightforward to obtain

$$
-\left\langle x_{k}-x^{*},\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)\right\rangle\leq 0.
\tag{52}
$$

Applying Young's inequality and relation (47) to the term $-2\left\langle x_{k}-x^{*},\partial\delta\left(x_{k}\right)\right\rangle$ yields

$$
-2\left\langle x_{k}-x^{*},\partial\delta\left(x_{k}\right)\right\rangle\leq\gamma\left\|x_{k}-x^{*}\right\|_{2}^{2}+\frac{n\phi^{2}}{\gamma}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2},
\tag{53}
$$

where the constant $\gamma>0$ . Plugging the results (50)-(53) into (45) gives

$$
\begin{aligned}
\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]
\leq{}&\left(1-\left(\frac{2\mu L}{\mu+L}-\gamma\right)\alpha\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+8n\phi^{2}\alpha^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{R}_{i}\right|^{2} \\
&+4\alpha^{2}\left(2Lt_{k}+\left(2L-\mu\right)D_{F}\left(x_{k},x^{*}\right)\right)+4n\phi^{2}\alpha^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2} \\
&+\frac{n\alpha}{\gamma}\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}.
\end{aligned}
\tag{54}
$$

Setting $\gamma=\mu L/\left(\mu+L\right)$, we can rewrite (54) as

$$
\begin{aligned}
\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]
\leq{}&\left(1-\gamma\alpha\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+4\alpha^{2}\left(2Lt_{k}+\left(2L-\mu\right)D_{F}\left(x_{k},x^{*}\right)\right) \\
&+8n\phi^{2}\alpha^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{R}_{i}\right|^{2}+4n\phi^{2}\alpha^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}+\frac{n\alpha}{\gamma}\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}.
\end{aligned}
\tag{55}
$$

According to (8), we have for any $c>0$ ,

$$
c\left(\mathbb{E}_{k}\left[t_{k+1}\right]-t_{k}\right)\leq-\frac{c}{q_{\max}}t_{k}+\frac{c}{q_{\min}}D_{F}\left(x_{k},x^{*}\right).
\tag{56}
$$

Recall the definitions of $P_{1}^{c}$ and ${P_{2}}$ . Combining (55) and (56) yields

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]+c\left(\mathbb{E}_{k}\left[t_{k+1}\right]-t_{k}\right) \\
\leq{}&\left(1-\gamma\alpha\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+4n\phi^{2}\alpha^{2}\left(2\sum_{i\in\mathcal{R}}\left|\mathcal{R}_{i}\right|^{2}+\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}\right) \\
&+\frac{n\phi^{2}}{\gamma}\alpha\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}+8L\alpha^{2}t_{k}+4\left(2L-\mu\right)\alpha^{2}D_{F}\left(x_{k},x^{*}\right) \\
&-\frac{c}{q_{\max}}t_{k}+\frac{c}{q_{\min}}D_{F}\left(x_{k},x^{*}\right) \\
\leq{}&\left(1-\left(\gamma\alpha-\frac{L}{2}\left(4\left(2L-\mu\right)\alpha^{2}+\frac{c}{q_{\min}}\right)\right)\right)\left\|x_{k}-x^{*}\right\|_{2}^{2} \\
&+P_{1}^{c}\alpha^{2}+P_{2}\alpha+\left(8L\alpha^{2}-\frac{c}{q_{\max}}\right)t_{k},
\end{aligned}
\tag{57}
$$

where the last inequality employs the $L$-smoothness of the local objective functions $f_{i}$, $\forall i\in\mathcal{R}$, i.e., $D_{F}\left(x_{k},x^{*}\right)\leq\frac{L}{2}\left\|x_{k}-x^{*}\right\|_{2}^{2}$. We proceed by choosing $0<\alpha\leq\gamma/\left(8L\left(2L-\mu\right)\right)$ and setting $c=\tilde{c}\alpha$ with $0<\tilde{c}\leq q_{\min}\gamma/L$, such that (57) becomes

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]+\frac{q_{\min}\gamma\alpha}{L}\left(\mathbb{E}_{k}\left[t_{k+1}\right]-t_{k}\right) \\
\leq{}&\left(1-\frac{\gamma}{4}\alpha\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+\left(8L\alpha^{2}-\frac{c}{q_{\max}}\right)t_{k}+P_{1}^{c}\alpha^{2}+P_{2}\alpha.
\end{aligned}
\tag{58}
$$

Fixing $\tilde{c}=q_{\min}\gamma/L$ with $0<\alpha\leq\gamma/\left(8L\left(2L-\mu\right)\right)$, we can equivalently write (58) as

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]+\frac{q_{\min}\gamma\alpha}{L}\left(\mathbb{E}_{k}\left[t_{k+1}\right]-t_{k}\right) \\
\leq{}&\left(1-\frac{\gamma}{4}\alpha\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+4n\phi^{2}\alpha^{2}\left(2\sum_{i\in\mathcal{R}}\left|\mathcal{R}_{i}\right|^{2}+\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}\right) \\
&+\frac{n\phi^{2}}{\gamma}\alpha\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2}+\left(8L\alpha-\frac{\gamma q_{\min}}{Lq_{\max}}\right)\alpha t_{k}.
\end{aligned}
\tag{59}
$$

We define $U_{k}:=\left\|x_{k}-x^{*}\right\|_{2}^{2}+q_{\min}\gamma\alpha t_{k}/L$, which is non-negative since $t_{k}\geq 0$. If the constant step-size satisfies $0<\alpha\leq 4\gamma/\left(\kappa_{q}\left(32L^{2}+q_{\min}\gamma^{2}\right)\right)$, then (59) can be converted into

$$
\begin{aligned}
\mathbb{E}_{k}\left[U_{k+1}\right]\leq{}&\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]+\frac{q_{\min}\gamma\alpha}{L}\mathbb{E}_{k}\left[t_{k+1}\right] \\
\leq{}&\left(1-\frac{\gamma}{4}\alpha\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+\left(1-\frac{\gamma}{4}\alpha\right)\frac{q_{\min}\gamma\alpha}{L}t_{k}+P_{1}^{c}\alpha^{2}+P_{2}\alpha \\
={}&\left(1-\frac{\gamma}{4}\alpha\right)U_{k}+P_{1}^{c}\alpha^{2}+P_{2}\alpha.
\end{aligned}
\tag{60}
$$

Collecting all the upper bounds on the constant step-size gives the following feasible range:

$$
0<\alpha\leq\frac{1}{\kappa_{q}\left(32\left(1+\kappa_{f}\right)^{2}+q_{\min}\right)}\frac{1}{\mu}.
\tag{61}
$$
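As a sanity check on (61), the sketch below compares the summarized range with the two individual bounds used above, $\gamma/\left(8L\left(2L-\mu\right)\right)$ and $4\gamma/\left(\kappa_{q}\left(32L^{2}+q_{\min}\gamma^{2}\right)\right)$. It assumes the usual conventions $\kappa_{f}=L/\mu$ and $\kappa_{q}=q_{\max}/q_{\min}$, which are not restated in this appendix; the function name and the parameter values are hypothetical.

```python
def step_size_bounds(mu, L, q_min, q_max):
    """Return the two step-size bounds used in the proof and the range (61).

    Assumes kappa_f = L / mu and kappa_q = q_max / q_min (assumed notation)."""
    gamma = mu * L / (mu + L)
    kappa_f = L / mu
    kappa_q = q_max / q_min
    # Bound needed to obtain the (1 - gamma * alpha / 4) contraction in (58).
    bound_contraction = gamma / (8 * L * (2 * L - mu))
    # Bound needed to absorb the t_k term into U_k in (60).
    bound_lyapunov = 4 * gamma / (kappa_q * (32 * L ** 2 + q_min * gamma ** 2))
    # Summarized range (61).
    bound_summary = 1 / (kappa_q * (32 * (1 + kappa_f) ** 2 + q_min) * mu)
    return bound_contraction, bound_lyapunov, bound_summary


# Illustrative parameter grid (values are arbitrary).
results = [
    step_size_bounds(mu, L, q_min, q_max)
    for mu in (0.1, 1.0)
    for L in (1.0, 5.0, 50.0)
    for (q_min, q_max) in ((1, 1), (5, 20), (100, 100))
]
summary_is_feasible = all(s <= min(b1, b2) for b1, b2, s in results)
```

On this grid the summarized bound never exceeds either individual bound, which is consistent with (61) being a sufficient, slightly conservative choice.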

Under condition (61), taking the full expectation on both sides of (60) yields

$$
\mathbb{E}\left[U_{k+1}\right]\leq\left(1-\frac{\gamma}{4}\alpha\right)\mathbb{E}\left[U_{k}\right]+\alpha^{2}P_{1}^{c}+\alpha P_{2}.
\tag{62}
$$

Unrolling the recursion (62) for $k\geq 0$ and noting that $\mathbb{E}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]\leq\mathbb{E}\left[U_{k+1}\right]$ yields

$$
\mathbb{E}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]\leq 4\left(\frac{P_{1}^{c}}{\gamma}\alpha+E\right)\left(1-\left(1-\frac{\gamma}{4}\alpha\right)^{k+1}\right)+\left(1-\frac{\gamma}{4}\alpha\right)^{k+1}U_{0},
\tag{63}
$$

where $U_{0}=\left\|x_{0}-x^{*}\right\|_{2}^{2}+q_{\min}\gamma\alpha t_{0}/L$. It is worth mentioning that by specifying $r_{k}$ and $t_{k}$ as $r^{u}_{k}$ and $t^{u}_{k}$ (resp., $r^{w}_{k}$ and $t^{w}_{k}$), the linear convergence rate is established for Prox-DBRO-SAGA (resp., Prox-DBRO-LSVRG).
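The step from (62) to a bound of the form (63) is a geometric sum. The sketch below iterates the recursion (62) with equality and compares it with the closed form $\rho^{k}U_{0}+\frac{4\left(\alpha P_{1}^{c}+P_{2}\right)}{\gamma}\left(1-\rho^{k}\right)$, where $\rho=1-\gamma\alpha/4$. All numerical values and function names are illustrative assumptions; in the paper, $P_{1}^{c}$ and $P_{2}$ are determined by the network and the Byzantine penalty terms.

```python
def iterate_recursion(U0, gamma, alpha, P1, P2, num_iters):
    """Iterate U_{k+1} = (1 - gamma*alpha/4) * U_k + alpha^2 * P1 + alpha * P2."""
    rho = 1.0 - gamma * alpha / 4.0
    history = [U0]
    for _ in range(num_iters):
        history.append(rho * history[-1] + alpha ** 2 * P1 + alpha * P2)
    return history


def closed_form(U0, gamma, alpha, P1, P2, k):
    """Geometric-sum solution of the recursion after k steps."""
    rho = 1.0 - gamma * alpha / 4.0
    radius = 4.0 * (alpha * P1 + P2) / gamma  # limiting error-ball radius
    return rho ** k * U0 + radius * (1.0 - rho ** k)


# Illustrative constants (arbitrary, not from the paper).
U0, gamma, alpha, P1, P2 = 10.0, 0.5, 0.1, 2.0, 0.05
history = iterate_recursion(U0, gamma, alpha, P1, P2, num_iters=3000)
radius = 4.0 * (alpha * P1 + P2) / gamma
```

The iterates contract linearly at rate $\rho$ toward the radius, which shrinks with $\alpha$ only through the $P_{1}^{c}$ term; the $P_{2}$ contribution does not vanish as $\alpha\to 0$.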

-F Proof of Theorem 3

In view of the compact form (7) associated with the proposed algorithms, we make a transformation as follows:

$$
x_{k+1}=\arg\min_{y\in\mathbb{R}^{\left|\mathcal{R}\right|n}}\left\{G\left(y\right)+\frac{1}{2\alpha_{k}}\left\|y-x_{k}+\alpha_{k}\left(r_{k}+\partial_{x}\chi\left(x_{k}\right)+\partial_{x}\delta\left(x_{k}\right)\right)\right\|_{2}^{2}\right\},
$$

which gives

$$
0_{\left|\mathcal{R}\right|n}\in x_{k+1}-x_{k}+\alpha_{k}\left(r_{k}+\partial_{x}\chi\left(x_{k}\right)+\partial_{x}\delta\left(x_{k}\right)\right)+\alpha_{k}\partial_{x}G\left(x_{k+1}\right).
$$

This implies that, since $x_{k+1}$ is the minimizer of the above subproblem, there exists a subgradient $G^{\prime}\left(x_{k+1}\right)\in\partial_{x}G\left(x_{k+1}\right)$ such that

$$
x_{k+1}-x_{k}+\alpha_{k}\left(r_{k}+G^{\prime}\left(x_{k+1}\right)+\partial_{x}\chi\left(x_{k}\right)+\partial_{x}\delta\left(x_{k}\right)\right)=0_{\left|\mathcal{R}\right|n},
$$

which can be further rearranged as

$$
x_{k+1}=x_{k}-\alpha_{k}\left(r_{k}+G^{\prime}\left(x_{k+1}\right)+\partial_{x}\chi\left(x_{k}\right)+\partial_{x}\delta\left(x_{k}\right)\right).
\tag{64}
$$
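To illustrate the transformation leading to (64), the sketch below takes the non-smooth term to be $G\left(y\right)=\lambda\left\|y\right\|_{1}$, which is an assumption made here only for concreteness. It performs one proximal step by soft-thresholding and recovers the implicit subgradient $G^{\prime}\left(x_{k+1}\right)$. The vector `v` stands for the aggregate direction $r_{k}+\partial_{x}\chi\left(x_{k}\right)+\partial_{x}\delta\left(x_{k}\right)$; its values, like all names below, are hypothetical.

```python
def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 applied coordinate-wise."""
    return [max(abs(zj) - tau, 0.0) * (1.0 if zj > 0 else -1.0) for zj in z]


def prox_step(x, v, alpha, lam):
    """One step x_next = prox_{alpha * lam * ||.||_1}(x - alpha * v).

    Also returns the implicit subgradient g_prime in the subdifferential of
    lam * ||.||_1 at x_next, so that x_next = x - alpha * (v + g_prime)."""
    z = [xj - alpha * vj for xj, vj in zip(x, v)]
    x_next = soft_threshold(z, alpha * lam)
    g_prime = [(zj - xn) / alpha for zj, xn in zip(z, x_next)]
    return x_next, g_prime


# Illustrative data (arbitrary values).
x = [1.0, -0.25, 0.05, 0.0]
v = [0.5, -1.0, 0.3, 0.0]
alpha, lam = 0.1, 1.0
x_next, g_prime = prox_step(x, v, alpha, lam)
# Reconstruct the update in the explicit form of (64).
x_rebuilt = [xj - alpha * (vj + gj) for xj, vj, gj in zip(x, v, g_prime)]
```

On nonzero coordinates of `x_next` the recovered subgradient equals $\lambda\,\mathrm{sign}\left(x_{k+1}\right)$, and on zero coordinates it lies in $\left[-\lambda,\lambda\right]$, as required of an element of the $\ell_{1}$ subdifferential.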

We next proceed with the convergence analysis based on the transformed version (64) of the compact form (7). Expanding the squared norm of $x_{k+1}-x^{*}$ via (64) gives

$$
\begin{aligned}
\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]
={}&\left\|x_{k}-x^{*}\right\|_{2}^{2}-2\alpha_{k}\mathbb{E}_{k}\left[\left\langle x_{k}-x^{*},r_{k}+\partial_{x}\chi\left(x_{k}\right)+G^{\prime}\left(x_{k+1}\right)\right\rangle\right] \\
&+\alpha_{k}^{2}\mathbb{E}_{k}\left[\left\|r_{k}+G^{\prime}\left(x_{k+1}\right)+\partial_{x}\chi\left(x_{k}\right)+\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}\right] \\
&-2\alpha_{k}\left\langle x_{k}-x^{*},\partial_{x}\delta\left(x_{k}\right)\right\rangle.
\end{aligned}
\tag{65}
$$

Considering $G^{\prime}\left(x^{*}\right)\in\partial_{x}G\left(x^{*}\right)$ and the optimality condition $\nabla F\left(x^{*}\right)+G^{\prime}\left(x^{*}\right)+\partial_{x}\chi\left(x^{*}\right)=0_{mn}$, we continue to seek an upper bound for $\mathbb{E}_{k}\left[\left\|r_{k}+G^{\prime}\left(x_{k+1}\right)+\partial_{x}\chi\left(x_{k}\right)+\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}\right]$ as follows:

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{k}+G^{\prime}\left(x_{k+1}\right)+\partial_{x}\chi\left(x_{k}\right)+\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}\right] \\
={}&\mathbb{E}_{k}\left[\left\|r_{k}-\nabla F\left(x^{*}\right)+G^{\prime}\left(x_{k+1}\right)-G^{\prime}\left(x^{*}\right)+\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)+\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2}\right] \\
\leq{}&4\mathbb{E}_{k}\left[\left\|r_{k}-\nabla F\left(x^{*}\right)\right\|_{2}^{2}\right]+4\left\|G^{\prime}\left(x_{k+1}\right)-G^{\prime}\left(x^{*}\right)\right\|_{2}^{2} \\
&+4\left\|\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)\right\|_{2}^{2}+4\left\|\partial_{x}\delta\left(x_{k}\right)\right\|_{2}^{2} \\
\leq{}&4\mathbb{E}_{k}\left[\left\|r_{k}-\nabla F\left(x^{*}\right)\right\|_{2}^{2}\right]+4\left\|G^{\prime}\left(x_{k+1}\right)-G^{\prime}\left(x^{*}\right)\right\|_{2}^{2} \\
&+16n\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{R}_{i}\right|^{2}+4n\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2} \\
\leq{}&16Lt_{k}+8\left(2L-\mu\right)D_{F}\left(x_{k},x^{*}\right)+16\left|\mathcal{R}\right|\hat{G} \\
&+16n\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{R}_{i}\right|^{2}+4n\phi^{2}\sum_{i\in\mathcal{R}}\left|\mathcal{B}_{i}\right|^{2},
\end{aligned}
\tag{66}
$$

where the second inequality uses the results (47) and (49), and the last inequality is owing to Lemma 2 and Assumption 3. To proceed, recalling the definition of ${{\partial_{x}}G\left({{x_{k}}}\right)}$ , it is not difficult to verify

$$
-\left\langle x_{k}-x^{*},G^{\prime}\left(x_{k+1}\right)-G^{\prime}\left(x^{*}\right)\right\rangle\leq 0,
\tag{67}
$$

which follows from the convexity of $g\left(\tilde{x}\right)$, $\forall i\in\mathcal{R}$. Based on the relations (52) and (67), we know that

$$
\begin{aligned}
&-2\mathbb{E}_{k}\left[\left\langle x_{k}-x^{*},r_{k}+G^{\prime}\left(x_{k+1}\right)+\partial_{x}\chi\left(x_{k}\right)\right\rangle\right] \\
={}&-2\mathbb{E}_{k}\left[\left\langle x_{k}-x^{*},r_{k}-\nabla F\left(x_{k}\right)+\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\rangle\right] \\
&-2\mathbb{E}_{k}\left[\left\langle x_{k}-x^{*},G^{\prime}\left(x_{k+1}\right)-G^{\prime}\left(x^{*}\right)+\partial_{x}\chi\left(x_{k}\right)-\partial_{x}\chi\left(x^{*}\right)\right\rangle\right] \\
\leq{}&-2\left\langle x_{k}-x^{*},\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\rangle \\
\leq{}&-\frac{2\mu L}{\mu+L}\left\|x_{k}-x^{*}\right\|_{2}^{2}-\frac{2}{\mu+L}\left\|\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\|_{2}^{2},
\end{aligned}
\tag{68}
$$

where the last inequality follows from (51). Plugging the results (53), (66), and (68) into (65) yields

$$
\begin{aligned}
\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]
\leq{}&\left(1-\gamma\alpha_{k}\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+16L\alpha_{k}^{2}t_{k}+P_{1}^{d}\alpha_{k}^{2}+P_{2}\alpha_{k} \\
&+8\left(2L-\mu\right)\alpha_{k}^{2}D_{F}\left(x_{k},x^{*}\right).
\end{aligned}
\tag{69}
$$

According to Lemma 1, we introduce an iteration-dependent parameter $c_{k}>0$, such that

$$
c_{k}\left(\mathbb{E}_{k}\left[t_{k+1}\right]-t_{k}\right)\leq-\frac{c_{k}}{q_{\max}}t_{k}+\frac{c_{k}}{q_{\min}}D_{F}\left(x_{k},x^{*}\right).
\tag{70}
$$

Combining (69) and (70) yields

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]+c_{k}\left(\mathbb{E}_{k}\left[t_{k+1}\right]-t_{k}\right) \\
\leq{}&\left(1-\gamma\alpha_{k}\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+\left(16L\alpha_{k}^{2}-\frac{c_{k}}{q_{\max}}\right)t_{k}+P_{1}^{d}\alpha_{k}^{2} \\
&+P_{2}\alpha_{k}+\left(\frac{c_{k}}{q_{\min}}+8\left(2L-\mu\right)\alpha_{k}^{2}\right)D_{F}\left(x_{k},x^{*}\right) \\
\leq{}&\left(1-\left(\gamma\alpha_{k}-\frac{L}{2}\left(8\left(2L-\mu\right)\alpha_{k}^{2}+\frac{c_{k}}{q_{\min}}\right)\right)\right)\left\|x_{k}-x^{*}\right\|_{2}^{2} \\
&+\left(16L\alpha_{k}^{2}-\frac{c_{k}}{q_{\max}}\right)t_{k}+P_{1}^{d}\alpha_{k}^{2}+P_{2}\alpha_{k},
\end{aligned}
\tag{71}
$$

where the last inequality uses the $L$-smoothness of the local objective functions $f_{i}$, $\forall i\in\mathcal{R}$. Setting $c_{k}=\tilde{c}\alpha_{k}$ with $\tilde{c}=q_{\min}\gamma/L$ and $0<\alpha_{k}\leq\gamma/\left(16L\left(2L-\mu\right)\right)$, we have

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]+\frac{\gamma q_{\min}\alpha_{k}}{L}\mathbb{E}_{k}\left[t_{k+1}\right] \\
\leq{}&\left(1-\frac{\gamma}{4}\alpha_{k}\right)\left\|x_{k}-x^{*}\right\|_{2}^{2}+\left(\left(1-\frac{1}{q_{\max}}\right)\frac{q_{\min}\gamma}{L}+16L\alpha_{k}\right)\alpha_{k}t_{k} \\
&+P_{1}^{d}\alpha_{k}^{2}+P_{2}\alpha_{k}.
\end{aligned}
\tag{72}
$$

We define $\tilde{U}_{k}:=\left\|x_{k}-x^{*}\right\|_{2}^{2}+q_{\min}\gamma\alpha_{k}t_{k}/L$, which is non-negative since $t_{k}$ is non-negative. We further set $0<\alpha_{k}\leq 4\gamma/\left(\kappa_{q}\left(64L^{2}+q_{\min}\gamma^{2}\right)\right)$ and take the full expectation on both sides of (72) to obtain

$$
\begin{aligned}
\mathbb{E}\left[\tilde{U}_{k+1}\right]\leq{}&\mathbb{E}\left[\left\|x_{k+1}-x^{*}\right\|_{2}^{2}\right]+\frac{q_{\min}\gamma\alpha_{k}}{L}\mathbb{E}\left[t_{k+1}\right] \\
\leq{}&\left(1-\frac{\gamma}{4}\alpha_{k}\right)\mathbb{E}\left[\tilde{U}_{k}\right]+P_{1}^{d}\alpha_{k}^{2}+P_{2}\alpha_{k},
\end{aligned}
\tag{73}
$$

where the first inequality holds because the step-size $\alpha_{k}$ is decaying, i.e., $\alpha_{k+1}\leq\alpha_{k}$. In view of all the required upper bounds on the decaying step-size, it suffices to consider the following feasible range:

$$
0<\alpha_{k}\leq\frac{1}{\kappa_{q}\left(64\left(1+\kappa_{f}\right)^{2}+q_{\min}\right)}\frac{1}{\mu}.
\tag{74}
$$

According to (74), we set $\alpha_{k}=\theta/\left(k+\xi\right)$, $\forall k\geq 0$, with $\theta>4/\gamma$ and $\xi=\kappa_{q}\left(64\left(1+\kappa_{f}\right)^{2}+q_{\min}\right)\mu\theta$. We next prove

$$
\mathbb{E}\left[\tilde{U}_{k}\right]\leq\frac{\Xi}{k+\xi}+\tilde{E},\quad\forall k\geq 0,
\tag{75}
$$

by induction. First, for $k=0$, we know that

$$
\tilde{U}_{1}\leq\left(1-\frac{\gamma}{4}\alpha_{0}\right)\tilde{U}_{0}+\alpha_{0}^{2}P_{1}^{d}+\alpha_{0}P_{2}.
\tag{76}
$$

Therefore, for a bounded and positive constant $\tilde{E}$, if $\Xi\geq\left(\xi-\gamma\theta/4\right)\tilde{U}_{0}+\theta^{2}P_{1}^{d}/\xi+\theta P_{2}-\xi\tilde{E}$, we have

$$
\tilde{U}_{1}\leq\left(1-\frac{\gamma}{4}\alpha_{0}\right)\tilde{U}_{0}+\alpha_{0}^{2}P_{1}^{d}+\alpha_{0}P_{2}\leq\frac{\Xi}{\xi}+\tilde{E},
\tag{77}
$$

with $\alpha_{0}=\theta/\xi$. Assume that, for $k=K$ with $K\geq 1$,

$$
\mathbb{E}\left[\tilde{U}_{K+1}\right]\leq\left(1-\frac{\gamma}{4}\alpha_{K}\right)\mathbb{E}\left[\tilde{U}_{K}\right]+\alpha_{K}^{2}P_{1}^{d}+\alpha_{K}P_{2}\leq\frac{\Xi}{K+\xi}+\tilde{E}.
\tag{78}
$$

Then, we will prove that for $k=K+1$,

$$
\mathbb{E}\left[\tilde{U}_{K+2}\right]\leq\frac{\Xi}{K+\xi+1}+\tilde{E},
\tag{79}
$$

holds true. Define $\tilde{\gamma}:=\gamma/4$ and set $\tilde{E}\geq P_{2}/\tilde{\gamma}$ and $\Xi\geq\theta^{2}P_{1}^{d}/\left(\tilde{\gamma}\theta-1\right)$ with $\theta>1/\tilde{\gamma}$. We have

$$
\begin{aligned}
\mathbb{E}\left[\tilde{U}_{K+2}\right]
\leq{}&\left(1-\tilde{\gamma}\alpha_{K+1}\right)\mathbb{E}\left[\tilde{U}_{K+1}\right]+\alpha_{K+1}^{2}P_{1}^{d}+\alpha_{K+1}P_{2} \\
\leq{}&\left(1-\frac{\tilde{\gamma}\theta}{K+\xi+1}\right)\left(\frac{\Xi}{K+\xi}+\tilde{E}\right)+\frac{\theta^{2}}{\left(K+\xi+1\right)^{2}}P_{1}^{d}+\frac{\theta}{K+\xi+1}P_{2} \\
\leq{}&\left(1-\frac{\tilde{\gamma}\theta}{K+\xi+1}\right)\frac{\Xi}{K+\xi}+\tilde{E}+\frac{\theta^{2}}{\left(K+\xi+1\right)^{2}}P_{1}^{d} \\
\leq{}&\left(1-\frac{\tilde{\gamma}\theta}{K+\xi+1}\right)\frac{\Xi}{K+\xi}+\tilde{E}+\frac{\Xi\left(\tilde{\gamma}\theta-1\right)}{\left(K+\xi+1\right)^{2}} \\
\leq{}&\left(1-\frac{\tilde{\gamma}\theta}{K+\xi+1}\right)\frac{\Xi}{K+\xi}+\frac{\Xi\left(\tilde{\gamma}\theta-1\right)}{\left(K+\xi+1\right)\left(K+\xi\right)}+\tilde{E} \\
={}&\frac{\Xi}{K+\xi+1}+\tilde{E},
\end{aligned}
\tag{80}
$$

which means that relation (75) holds true. Replacing $\tilde{E}$ with its lower bound $E=P_{2}/\tilde{\gamma}$, it is straightforward to verify

$$
\mathbb{E}\left[\left\|x_{k}-x^{*}\right\|_{2}^{2}\right]\leq\frac{\Xi}{k+\xi}+E,\quad\forall k\geq 0,
\tag{81}
$$

owing to $t_{k}\geq 0$. By specifying $r_{k}$ and $t_{k}$ as $r^{u}_{k}$ and $t^{u}_{k}$ (resp., $r^{w}_{k}$ and $t^{w}_{k}$), the sub-linear convergence rate is established for Prox-DBRO-SAGA (resp., Prox-DBRO-LSVRG).
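The induction in (76)-(80) can be checked numerically. The sketch below iterates the recursion (73) with equality under $\alpha_{k}=\theta/\left(k+\xi\right)$ and compares each iterate $\tilde{U}_{k+1}$ with the envelope $\Xi/\left(k+\xi\right)+\tilde{E}$, the indexing used in (77) and (78). The constants correspond to the illustrative choice $\mu=L=1$, $\kappa_{q}=1$, $q_{\min}=1$, so that $\gamma=0.5$ and $\xi=257\theta$; the values of $P_{1}^{d}$, $P_{2}$, and $\tilde{U}_{0}$ are arbitrary assumptions, and the function names are hypothetical.

```python
def decaying_recursion(U0, gamma, theta, xi, P1, P2, num_iters):
    """Iterate U_{k+1} = (1 - gamma*a_k/4) * U_k + a_k^2 * P1 + a_k * P2,
    with the decaying step-size a_k = theta / (k + xi)."""
    history = [U0]
    for k in range(num_iters):
        a_k = theta / (k + xi)
        history.append((1.0 - gamma * a_k / 4.0) * history[-1]
                       + a_k ** 2 * P1 + a_k * P2)
    return history


def envelope_constants(U0, gamma, theta, xi, P1, P2):
    """Smallest (Xi, E_tilde) allowed by the base case (77) and the step (80)."""
    gamma_t = gamma / 4.0
    E_tilde = P2 / gamma_t
    Xi_base = (xi - gamma_t * theta) * U0 + theta ** 2 * P1 / xi + theta * P2 - xi * E_tilde
    Xi_step = theta ** 2 * P1 / (gamma_t * theta - 1.0)
    return max(Xi_base, Xi_step), E_tilde


# Illustrative constants: gamma = 0.5 requires theta > 4 / gamma = 8.
gamma, theta = 0.5, 10.0
xi = 257.0 * theta
U0, P1, P2 = 5.0, 1.0, 0.01
history = decaying_recursion(U0, gamma, theta, xi, P1, P2, num_iters=5000)
Xi, E_tilde = envelope_constants(U0, gamma, theta, xi, P1, P2)
# Largest violation of U_{k+1} <= Xi / (k + xi) + E_tilde over all iterations.
max_violation = max(history[k + 1] - (Xi / (k + xi) + E_tilde)
                    for k in range(len(history) - 1))
```

With these constants the envelope is tight at $k=0$, where the base-case condition on $\Xi$ holds with equality, and loose afterwards; the residual level $\tilde{E}=P_{2}/\tilde{\gamma}$ does not decay with $k$.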

References

[1] R. Xin, U. A. Khan, and S. Kar, “Fast decentralized nonconvex finite-sum optimization with recursive variance reduction,” SIAM Journal on Optimization, vol. 32, no. 1, pp. 1–28, 2022.
[2] K. Yuan, B. Ying, J. Liu, and A. H. Sayed, “Variance-reduced stochastic learning under random reshuffling,” IEEE Transactions on Signal Processing, vol. 68, no. 2, pp. 1390–1408, 2020.
[3] R. Xin, S. Das, U. A. Khan, and S. Kar, “A stochastic proximal gradient framework for decentralized non-convex composite optimization: Topology-independent sample complexity and communication efficiency,” arXiv preprint arXiv:2110.01594, 2021.
[4] J. Zhai, Y. Jiang, Y. Shi, C. N. Jones, and X. P. Zhang, “Distributionally robust joint chance-constrained dispatch for integrated transmission-distribution systems via distributed optimization,” IEEE Transactions on Smart Grid, vol. 13, no. 3, pp. 2132–2147, 2022.
[5] H. Li, L. Zheng, Z. Wang, Y. Li, and L. Ji, “Asynchronous distributed model predictive control for optimal output consensus of high-order multi-agent systems,” IEEE Transactions on Signal and Information Processing over Networks, vol. 7, pp. 689–698, 2021.
[6] S. Huang, J. Lei, and Y. Hong, “A linearly convergent distributed Nash equilibrium seeking algorithm for aggregative games,” IEEE Transactions on Automatic Control, vol. 68, no. 3, pp. 1753–1759, 2022.
[7] Y. Wang and A. Nedić, “Robust constrained consensus and inequality-constrained distributed optimization with guaranteed differential privacy and accurate convergence,” IEEE Transactions on Automatic Control, vol. 69, no. 11, pp. 7463–7478, 2024.
[8] M. Xing, D. Ma, J. Zhao, and P. K. Wong, “Differentially private dynamic average consensus-based Newton method for distributed optimization over general networks,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 2, pp. 1348–1361, 2024.
[9] L. Lamport, R. Shostak, and M. Pease, “The Byzantine generals problem,” in Concurrency: the Works of Leslie Lamport, 2019, pp. 203–226.
[10] F. Saadatniaki, R. Xin, and U. A. Khan, “Decentralized optimization over time-varying directed graphs with row and column-stochastic matrices,” IEEE Transactions on Automatic Control, vol. 65, no. 11, pp. 4769–4780, 2020.
[11] R. Xin, U. A. Khan, and S. Kar, “Variance-reduced decentralized stochastic optimization with accelerated convergence,” IEEE Transactions on Signal Processing, vol. 68, pp. 6255–6271, 2020.
[12] H. Li, J. Hu, L. Ran, Z. Wang, Q. Lü, Z. Du, and T. Huang, “Decentralized dual proximal gradient algorithms for non-smooth constrained composite optimization problems,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 10, pp. 2594–2605, 2021.
[13] S. Pu, W. Shi, J. Xu, and A. Nedic, “Push-Pull gradient methods for distributed optimization in networks,” IEEE Transactions on Automatic Control, vol. 66, no. 1, pp. 1–16, 2021.
[14] M. I. Qureshi, R. Xin, S. Kar, and U. A. Khan, “S-ADDOPT: Decentralized stochastic first-order optimization over directed graphs,” IEEE Control Systems Letters, vol. 5, no. 3, pp. 953–958, 2021.
[15] H. Li, L. Zheng, Z. Wang, Y. Yan, L. Feng, and J. Guo, “S-DIGing : A stochastic gradient tracking algorithm for distributed optimization,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 1, pp. 53–65, 2022.
[16] C. Wang, S. Xu, D. Yuan, B. Zhang, and Z. Zhang, “Push-sum distributed dual averaging for convex optimization in multiagent systems with communication delays,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 53, no. 1, pp. 1420–1430, 2023.
[17] S. Sundaram and B. Gharesifard, “Distributed optimization under adversarial nodes,” IEEE Transactions on Automatic Control, vol. 64, no. 3, pp. 1063–1076, 2019.
[18] L. Yuan and H. Ishii, “Resilient average consensus with adversaries via distributed detection and recovery,” Automatica, vol. 171, p. 111908, 2025.
[19] M. Yemini, A. Nedic, A. Goldsmith, and S. Gil, “Characterizing trust and resilience in distributed consensus for cyberphysical systems,” IEEE Transactions on Robotics, vol. 38, no. 1, pp. 71–91, 2022.
[20] Z. Zuo, X. Cao, Y. Wang, and W. Zhang, “Resilient consensus of multiagent systems against denial-of-service attacks,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 4, pp. 2664–2675, 2022.
[21] E. M. El-Mhamdi, R. Guerraoui, A. Guirguis, L. N. Hoang, and S. Rouault, “Genuinely distributed Byzantine machine learning,” Distributed Computing, vol. 35, no. 4, pp. 305–331, 2022.
[22] C. Fang, Z. Yang, and W. U. Bajwa, “BRIDGE: Byzantine-resilient decentralized gradient descent,” IEEE Transactions on Signal and Information Processing over Networks, vol. 8, pp. 610–626, 2022.
[23] J. Li, W. Abbas, M. Shabbir, and X. Koutsoukos, “Byzantine resilient distributed learning in multirobot systems,” IEEE Transactions on Robotics, vol. 38, no. 6, pp. 3550–3563, 2022.
[24] R. Wang, Y. Liu, and Q. Ling, “Byzantine-resilient decentralized resource allocation,” IEEE Transactions on Signal Processing, vol. 70, pp. 4711–4726, 2022.
[25] Z. Wu, T. Chen, and Q. Ling, “Byzantine-resilient decentralized stochastic optimization with robust aggregation rules,” IEEE Transactions on Signal Processing, vol. 71, pp. 3179–3195, 2023.
[26] J. Peng, W. Li, and Q. Ling, “Byzantine-robust decentralized stochastic optimization over static and time-varying networks,” Signal Processing, vol. 183, p. 108020, 2021.
[27] L. Li, W. Xu, T. Chen, G. B. Giannakis, and Q. Ling, “RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets,” in AAAI Conference on Artificial Intelligence (AAAI), 2019, pp. 1544–1551.
[28] Z. Yang and W. U. Bajwa, “ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning,” IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 4, pp. 611–627, 2019.
[29] A. Nedic, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[30] L. He, S. P. Karimireddy, and M. Jaggi, “Byzantine-robust decentralized learning via self-centered clipping,” arXiv preprint arXiv:2202.01545, 2022.
[31] S. P. Karimireddy, L. He, and M. Jaggi, “Learning from history for Byzantine robust optimization,” in International Conference on Machine Learning (ICML), 2021, pp. 5311–5319.
[32] S. Guo, T. Zhang, H. Yu, X. Xie, T. Xiang, and Y. Liu, “Byzantine-resilient decentralized stochastic gradient descent,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 6, pp. 4096–4106, 2021.
[33] X. Ma, X. Sun, Y. Wu, Z. Liu, X. Chen, and C. Dong, “Differentially private Byzantine-robust federated learning,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 3690–3701, 2022.
[34] J. Peng, W. Li, and Q. Ling, “Variance reduction-boosted Byzantine robustness in decentralized stochastic optimization,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4283–4287.
[35] W. Ben-ameur, P. Bianchi, and J. Jakubowicz, “Robust distributed consensus using total variation,” IEEE Transactions on Automatic Control, vol. 61, no. 6, pp. 1550–1564, 2016.
[36] H. Ye, W. Xiong, and T. Zhang, “PMGT-VR: A decentralized proximal-gradient algorithmic framework with variance reduction,” arXiv preprint arXiv:2012.15010, 2020.
[37] S. A. Alghunaim, E. K. Ryu, K. Yuan, and A. H. Sayed, “Decentralized proximal gradient algorithms with linear convergence rates,” IEEE Transactions on Automatic Control, vol. 66, no. 6, pp. 2787–2794, 2021.
[38] I. Notarnicola and G. Notarstefano, “Asynchronous distributed optimization via randomized dual proximal gradient,” IEEE Transactions on Automatic Control, vol. 62, no. 5, pp. 2095–2106, 2017.
[39] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4494–4506, 2019.
[40] J. Xu, Y. Tian, Y. Sun, and G. Scutari, “Distributed algorithms for composite optimization: Unified framework and convergence analysis,” IEEE Transactions on Signal Processing, vol. 69, pp. 3555–3570, 2021.
[41] L. Yuan and H. Ishii, “Event-triggered approximate Byzantine consensus with multi-hop communication,” IEEE Transactions on Signal Processing, vol. 71, pp. 1742–1754, 2023.
[42] A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives,” in Advances in Neural Information Processing Systems (NeurIPS), 2014, pp. 1646–1654.
[43] D. Kovalev, S. Horvath, and P. Richtarik, “Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop,” in Algorithmic Learning Theory, 2020, pp. 451–467.
[44] S. Bubeck, “Convex optimization: Algorithms and complexity,” Foundations and Trends® in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
[45] H. Ye, H. Zhu, and Q. Ling, “On the tradeoff between privacy preservation and Byzantine-robustness in decentralized learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 9336–9340.
[46] X. Lian, C. Zhang, H. Zhang, C. J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5331–5341.
[47] P. Ramanan, D. Li, and N. Gebraeel, “Blockchain-based decentralized replay attack detection for large-scale power systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 8, pp. 4727–4739, 2022.
[48] E. Gorbunov, F. Hanzely, and P. Richtárik, “A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2019, pp. 680–690.
[49] C. Xi and U. A. Khan, “Distributed subgradient projection algorithm over directed graphs,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3986–3992, 2017.
[50] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
[51] X. Li, S. Chen, Z. Deng, Q. Qu, Z. Zhu, and A. M.-C. So, “Weakly convex optimization over Stiefel manifold using Riemannian subgradient-type methods,” SIAM Journal on Optimization, vol. 31, no. 3, pp. 1605–1634, 2021.
[52] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), vol. 1, no. 2, pp. 1–25, 2018.
[53] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database. [Online]. Available: http://yann. lecun.com/exdb/m,” in AT&T Labs, Florham Park, NJ, USA., 2020.

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{i,k}^{u}-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
={}&\mathbb{E}_{k}\left[\left\|r_{i,k}^{u}-\nabla f_{i}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(x_{i,k}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right]+\left\|\nabla f_{i}\left(x_{i,k}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2},
\end{aligned}
\tag{23}
$$

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{i,k}^{u}-\nabla f_{i}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(x_{i,k}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(x_{i,k}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&+2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)-\frac{1}{q_{i}}\sum_{l=1}^{q_{i}}\nabla f_{i}^{l}\left(u_{i,k}^{l}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)-\nabla f_{i}\left(x_{i,k}\right)+\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&+2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
={}&2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&+2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&-2\left\|\nabla f_{i}\left(x_{i,k}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2},
\end{aligned}
\tag{24}
$$

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{i,k}^{u}-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&+2\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&-\left\|\nabla f_{i}\left(x_{i,k}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}.
\end{aligned}
\tag{25}
$$
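The bound obtained by combining (23) and (24) can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: it assumes that $r_{i,k}^{u}$ takes the standard SAGA form $\nabla f_{i}^{s_{i,k}}(x_{i,k})-\nabla f_{i}^{s_{i,k}}(u_{i,k}^{s_{i,k}})+\frac{1}{q_{i}}\sum_{l}\nabla f_{i}^{l}(u_{i,k}^{l})$ with $s_{i,k}$ drawn uniformly, and it uses a hypothetical least-squares loss with random data. The names `grad_sample`, `grad_full`, `saga_estimator`, and the point `z` (standing in for $\tilde{x}^{*}$) are illustrative only. Expectations are computed exactly by enumerating all $q_{i}$ samples.

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = 8, 5                      # hypothetical sample count q_i and dimension
A = rng.normal(size=(q, d))      # hypothetical least-squares data
b = rng.normal(size=q)

def grad_sample(l, x):
    """Gradient of the l-th sample loss 0.5 * (a_l^T x - b_l)^2."""
    return A[l] * (A[l] @ x - b[l])

def grad_full(x):
    """Gradient of the local average f_i = (1/q) * sum_l f_i^l."""
    return A.T @ (A @ x - b) / q

x = rng.normal(size=d)           # current iterate x_{i,k}
z = rng.normal(size=d)           # reference point standing in for x-tilde-star
U = rng.normal(size=(q, d))      # auxiliary points u_{i,k}^l, one per sample

# Average of the stored gradients (the SAGA gradient table).
table_mean = np.mean([grad_sample(l, U[l]) for l in range(q)], axis=0)

def saga_estimator(s):
    """Assumed SAGA form of r_{i,k}^u for the sampled index s."""
    return grad_sample(s, x) - grad_sample(s, U[s]) + table_mean

def sq(v):
    return float(np.sum(v ** 2))

# Exact expectations over the uniform index s.
lhs = np.mean([sq(saga_estimator(s) - grad_full(z)) for s in range(q)])
variance = np.mean([sq(saga_estimator(s) - grad_full(x)) for s in range(q)])
bias = sq(grad_full(x) - grad_full(z))
term_x = np.mean([sq(grad_sample(s, x) - grad_sample(s, z)) for s in range(q)])
term_u = np.mean([sq(grad_sample(s, U[s]) - grad_sample(s, z)) for s in range(q)])
rhs = 2 * term_x + 2 * term_u - bias

print("decomposition (23) holds:", np.isclose(lhs, variance + bias))
print("combined bound holds:    ", lhs <= rhs + 1e-9)
```

Because the argument relies only on unbiasedness of the estimator, the inequality $\|a-b\|_{2}^{2}\leq 2\|a\|_{2}^{2}+2\|b\|_{2}^{2}$, and the fact that a variance never exceeds the corresponding second moment, the check should pass for any choice of `z`, not only an optimal point.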

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{k}^{u}-\nabla f\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&2\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&+2\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(u_{i,k}^{s_{i,k}}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&-\sum_{i\in\mathcal{R}}\left\|\nabla f_{i}\left(x_{i,k}\right)-\nabla f_{i}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}.
\end{aligned}
\tag{26}
$$

$$
\begin{aligned}
&\mathbb{E}_{k}\left[\left\|r_{k}^{u}-\nabla f\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
\leq{}&4Lt_{k}^{u}+2\sum_{i\in\mathcal{R}}\mathbb{E}_{k}\left[\left\|\nabla f_{i}^{s_{i,k}}\left(x_{i,k}\right)-\nabla f_{i}^{s_{i,k}}\left(\tilde{x}^{*}\right)\right\|_{2}^{2}\right] \\
&-\left\|\nabla F\left(x_{k}\right)-\nabla F\left(x^{*}\right)\right\|_{2}^{2},
\end{aligned}
\tag{31}
$$