TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism
Hongjiang Chen1, Pengfei Jiao1, Ming Du1, Xuan Guo2, Zhidong Zhao1, Di Jin2
1Hangzhou Dianzi University, Hangzhou, 310018, China
2Tianjin University, Tianjin, 300072, China
Email addresses:
hchen@[Link] (Hongjiang Chen)
pjiao@[Link] (Pengfei Jiao)
mdu@[Link] (Ming Du)
guoxuan@[Link] (Xuan Guo)
zhaozd@[Link] (Zhidong Zhao)
jindi@[Link] (Di Jin)
Corresponding Author:
Pengfei Jiao,
School of Cyberspace,
Hangzhou Dianzi University,
Hangzhou 310018,
China.
Email: pjiao@[Link]
Present address:
Xiasha Higher Education Zone,
Hangzhou,
310018,
Zhejiang Province,
China.
This preprint research paper has not been peer reviewed. Electronic copy available at: [Link]
Highlights
TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mech-
anism
Hongjiang Chen, Pengfei Jiao, Ming Du, Xuan Guo, Zhidong Zhao, Di Jin
• We introduce a perspective shift that navigates temporal graph learning toward
time series analysis.
• We formulate an auto-correlation mechanism that identifies dependencies and
aggregates information at the series level.
TGFormer: Towards Temporal Graph Transformer with
Auto-Correlation Mechanism
Hongjiang Chena , Pengfei Jiaoa,∗, Ming Dua , Xuan Guob , Zhidong Zhaoa , Di Jinb
a Hangzhou Dianzi University, Hangzhou, 310018, China
b Tianjin University, Tianjin, 300072, China
Abstract
The burgeoning fascination with Temporal Graph Neural Networks (TGNNs) is attributable
to their adeptness in modeling complex dynamics while delivering superior performance.
However, TGNNs face intrinsic constraints, including long-term temporal dependency
and periodic temporal pattern problems. At the same time, the inherent capability
of transformer architectures offers a potent solution to these predicaments.
Accordingly, we introduce TGFormer, a novel transformer tailored for temporal
graphs. The model shifts the traditional temporal graph learning paradigm toward a
trajectory aligned with time series analysis. Through this innovative perspective,
TGFormer gains the capability to extract node representations from historical interactions,
∗ Corresponding author
Email addresses: hchen@[Link] (Hongjiang Chen), pjiao@[Link] (Pengfei Jiao),
mdu@[Link] (Ming Du), guoxuan@[Link] (Xuan Guo), zhaozd@[Link] (Zhidong
Zhao), jindi@[Link] (Di Jin)
Representation Learning
1. Introduction
Temporal Graph Neural Networks (TGNNs) have been identified as robust and versatile
tools for temporal graph learning, finding efficacious applications in fields
encompassing community detection [1], protein design [2], recommendation systems [3],
and social network analysis [4], among others.
Current TGNN methodologies principally fall into two distinct categories according
to the form of their input graphs: discrete-time dynamic graphs (DTDGs) [5, 6, 7, 8] and
continuous-time dynamic graphs (CTDGs) [9, 10, 11, 12]. Recently, there has been a
growing preference for the latter models due to their proficiency in capturing fine-grained
information and managing temporal intricacies [13]. Following this trend, this
paper also focuses on learning over CTDGs.
Despite the impressive results achieved by previous CTDG approaches, they still
face challenges. Most of these approaches adopt an interaction-level learning paradigm,
limiting their applicability to nodes with few interactions, while nodes with longer
histories are also poorly served: sequential models, such as recurrent neural networks
(RNNs) [14], may struggle to learn long-term temporal dependencies, mainly due to
vanishing or exploding gradients [15].
Moreover, existing methods usually use attention mechanisms [16] as encoders to identify
relevant time steps for prediction. However, according to theoretical proofs in [17],
attention mechanisms commonly act as persistent low-pass filters, treating high-frequency
information as noise and continuously erasing it. The pure attention mechanism also
fails to recognize periodic temporal patterns, i.e., that most people repeat what they
did the day before. Consequently, capturing the periodic temporal patterns hidden behind
time series data becomes challenging as well. In conclusion, previous approaches fail
to capture either long-term temporal dependencies or periodic temporal patterns.
Transformers have emerged as a promising approach for addressing the aforementioned
challenges in natural language processing [18], computer vision [19], and time
series analysis [20]. Recently, graph transformers have seen remarkable advancements.
However, the realm of temporal graphs still presents numerous unresolved challenges.
Existing models, such as DyGFormer [21], utilize transformers and a patching technique
to benefit from longer histories, yet this integration may lack the necessary refinement.
These models do not adequately address the unique intricacies inherent in temporal
graphs, such as the periodic temporal patterns that are prevalent in real-world
scenarios [22]. Instead, they primarily amplify the transformer's input as a mechanism
for enhancing impact. Consequently, designing new transformer architectures that
effectively capture and exploit these dependencies is essential for overcoming these
challenges.
Building upon these motivations, we conceive a bespoke Transformer variant tailored
for CTDGs, fostering model competence in capturing long-term temporal dependencies
and periodic temporal patterns. TGFormer takes a sequence of interactions, encodes
them as node, link, time, and frequency inputs for the Series Transformer, and adopts
an adaptive readout for link prediction. TGFormer scrutinizes node interactions in
continuous time from a novel perspective, ushering temporal graph learning toward
time series analysis. Drawing on stochastic process theory [23, 24], TGFormer
supersedes attention mechanisms with an auto-correlation mechanism, lifting point-wise
representation aggregation to the sub-series level, fostering the learning of patterns,
and capturing periodic temporal patterns. The superiority of our proposal is
demonstrated by extensive experiments on six real-world datasets with different
characteristics.
• We introduce a perspective shift that navigates temporal graph learning toward
time series analysis. Thus, we propose TGFormer, a pioneering temporal graph
transformer implementing this idea.
• The proposed TGFormer explicitly models crucial elements like node features,
link features, temporal information, and nodes' interaction frequency, and
effectively fuses these elements.
• We formulate an auto-correlation mechanism, surpassing traditional self-attention
methods by identifying dependencies and aggregating information at the series
level. Our auto-correlation mechanism captures periodic temporal patterns,
enhancing both computational efficiency and information utilization.
• We conduct extensive experiments on six real-world temporal graph datasets, under
both transductive and inductive settings. Experimental results corroborate the
superior performance of TGFormer compared to contemporaneous state-of-the-art
methods.
2. Related Work
2.1. Temporal Graph Neural Networks
Over the past few years, the scholarly focus on TGNNs has intensified [25, 26, 27].
Typically, these models treat temporal graph data as event streams, facilitating
the direct learning of node representations from continuously occurring interactions.
Specifically, existing CTDG models commonly employ RNNs or attention mechanisms as
their sequence modules. For instance, JODIE [10] used an RNN to combine static and
dynamic embeddings for node representation, and TGAT [16] extends the graph attention
mechanism to learn time-aware representations. Building upon this, some methods
incorporate additional techniques like memory networks [28, 9, 29], ordinary
differential equations (ODEs) [30, 31, 32], and random walks [33, 34, 35] to better
learn the continuous temporal information. TGN [9], a variant of TGAT, integrates a
memory module to effectively track the evolution of node-level features. GSNOP [30]
proposes a sequential ODE aggregator, which considers sequential data dependencies
and learns the derivative of the neural process, helping draw better distributions
from limited historical information. At the same time, CAWN [33]
extracts temporal network motifs through set-based anonymous random walks, effectively
capturing the network's dynamics. NeurTW [31] enhances spatial and temporal
interdependence during anonymous random walks, intertwining continuous evolution
with transient activation processes to comprehend the foundational spatiotemporal
dynamics. To capture long-term temporal dependencies in dynamic graph interactions,
DyExplainer [36] uses a buffer-based live-updating scheme, and DyGFormer [37] uses a
Transformer-based architecture with a neighbor co-occurrence encoding scheme and a
patching technique.
In a nutshell, most existing TGNNs struggle to manage nodes with long interaction
histories and to effectively capture periodic temporal patterns, owing to the
prohibitive computational costs of complex modules and to optimization challenges
such as vanishing or exploding gradients. In this paper, we propose a novel
transformer tailored for temporal graphs that demonstrates the value of long-term
temporal dependencies and periodic dependencies, achieved by the Series Transformer
layer.
2.2. Transformers on Graphs
Transformer [38] is an innovative model that employs the self-attention mechanism
(SAM) to handle sequential data, and it has demonstrated significant success across
diverse domains, including natural language processing [39, 40, 41], computer
vision [42, 43, 44], and time series forecasting [45, 46, 47]. For instance,
Autoformer [47] replaces canonical attention with an auto-correlation block to
achieve sub-series level attention. Similarly,
Figure 1: The overview of TGFormer begins with the extract layer, which employs a direct interaction
extractor. Subsequently, the encoder layer generates temporal-aware and structure-aware sequence node
representations. These representations are then used to form the time series data placed into the query
(Q), key (K), and value (V) matrices in the series transformer layer. The series transformer layer leverages
the Transformer's ability to capture long-term temporal dependencies and the auto-correlation mechanism
to uncover inner periodic temporal patterns in the series data. Finally, the decoder layer utilizes adaptive
readout for various downstream tasks.
PE. Recently, some applications have appeared in temporal graphs. TGT [52] proposes
a transformer-based method to preserve high-order information. SimpleDyG [53]
reconceptualizes temporal graphs as sequences added to the attention mechanism.
Although some studies have explored the potential of transformers for temporal
graphs, they have only enhanced the inputs to the transformer and have not proposed
a transformer that can accommodate characteristics unique to temporal graphs, such
3. Preliminaries
3.1. Temporal Graph
A temporal graph is represented as G = (V, E), where V is the set of nodes and E
is the set of node interactions with timestamps. For two nodes u and v, there may exist a
sequence of timestamps, which are formally denoted as Eu,v = {(u, v, t1 ), (u, v, t2 ), · · · , (u, v, tn )} ⊂
E, where the timestamps are ordered as (0 < t1 < t2 < · · · < tn ), indicating that nodes
ep
u and v have interacted at least once at each of the corresponding timestamps. Two
interacting nodes are referred to as neighbors. The symbols and their definitions are
listed in Table 1. The notation will be used in the following sections.
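As a concrete illustration, the definitions above can be sketched in a few lines of Python; the node names and timestamps below are purely illustrative, not tied to any dataset:

```python
from collections import namedtuple

# Each interaction in E is a (u, v, t) triple with a real-valued timestamp.
Interaction = namedtuple("Interaction", ["u", "v", "t"])

edges = [
    Interaction("u", "b", 1.0),
    Interaction("u", "b", 2.0),
    Interaction("a", "c", 3.0),
    Interaction("u", "v", 4.0),
]

def interactions_between(edges, u, v):
    """Return E_{u,v}: the time-ordered interactions between nodes u and v."""
    return sorted((e for e in edges if {e.u, e.v} == {u, v}), key=lambda e: e.t)

nodes = {n for e in edges for n in (e.u, e.v)}  # the node set V
```

Two interacting nodes (here u and b) are neighbors, and `interactions_between` recovers their ordered timestamp sequence.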
Table 1: Symbols and their definitions

Notation                 Description
G                        A temporal graph
V, E                     The node set and link set of G
S_∗^t                    The historical interaction samples of node ∗ before time t, together with itself
X_{∗,N}^t                Node ∗'s representation of node features at time t
X_{∗,E}^t                Node ∗'s representation of link features at time t
X_{∗,T}^t                Node ∗'s representation of temporal information at time t
X_{∗,F}^t                Node ∗'s representation of node interaction frequency at time t
F_∗^t                    The number of times a node appears in S_u^t and S_v^t, respectively
X_∗^t                    The combined representation of node ∗ at time t
h_∗^t                    The representation of node ∗ at time t
d_N, d_E, d_T, d_F, d    The dimensions of X_{∗,N}^t, X_{∗,E}^t, X_{∗,T}^t, X_{∗,F}^t, and the hidden features
{X_i^t}_{i=1}^L          A time series
δ                        The time delay
k                        The number of selected series
L                        The extract length of a node's historical interactions
3.2. Temporal Link Prediction
We aim to learn a model that, given a pair of nodes and a specific timestamp t,
predicts whether the two nodes are connected at t based on all the available
historical data. Note that we are not only concerned with predicting links between
nodes seen during training; we also expect the model to predict links between nodes
that are never seen, for inductive evaluation. For a more reliable comparison, we use
the three strategies in [54] (i.e., random, historical, and inductive negative
sampling).
3.3. Transformer
where Attention(·, ·, ·) is the scaled dot-product attention. As input to the Transformer,
an element in a sequence is represented by an embedding vector. The multi-head
attention mechanism injects positional embeddings into the element embeddings so
that it is aware of element order in a sequence.
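For reference, the scaled dot-product attention mentioned above can be sketched in NumPy as follows; the single head and the absence of masking are simplifications for illustration, not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for 2-D inputs."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (L, L) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # point-wise aggregation

L, d = 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))
out = scaled_dot_product_attention(X, X, X)          # self-attention: Q = K = V
```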
4. Proposed Method
The framework of our TGFormer is shown in Fig. 1, which employs a Series Transformer
as the backbone. Given an interaction (u, v, t), we first extract the historical
interactions of source node u and destination node v before timestamp t, obtaining
two interaction sequences S_u^t and S_v^t. Next, we compute the encodings of
neighbors, links, time intervals, and interaction frequency for each sequence. Then,
we concatenate the encodings of each sequence and feed them into the Series
Transformer to capture long-term temporal dependencies and periodic dependencies.
Finally, the outputs of the Transformer pass through an adaptive readout function to
derive time-aware representations of u and v at timestamp t (i.e., h_u^t and h_v^t),
which can be applied in various downstream tasks like temporal link prediction.
In our initial endeavor of predicting the connection between nodes u and v at a
specific timestamp t, we first affix each node with a self-loop at time t. This
maneuver aims to enhance the correlation of the node with its own history. We then
extract the recent historical interactions encompassing nodes u and v, expressed as
S_u^t = {(u, u′, t′) | t′ < t} ∪ {(u, u, t)} and
S_v^t = {(v, v′, t′) | t′ < t} ∪ {(v, v, t)}, respectively. Note that
(u, u′, t′) ∈ E, (v, v′, t′) ∈ E, and |S_u^t| = |S_v^t| = L.
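A minimal sketch of this extraction step, assuming interactions stored as plain (source, destination, time) triples; the zero-time padding used to reach exactly L entries is our assumption, as the paper does not specify a padding convention:

```python
def extract_history(edges, u, t, L):
    """Return S_u^t: the L-1 most recent interactions of u before t plus a self-loop."""
    hist = sorted(
        (e for e in edges if u in (e[0], e[1]) and e[2] < t),
        key=lambda e: e[2],
    )[-(L - 1):]                      # keep the L-1 most recent interactions
    hist.append((u, u, t))            # self-loop (u, u, t) at the query time
    while len(hist) < L:              # pad short histories (assumed convention)
        hist.insert(0, (u, u, 0.0))
    return hist

edges = [("u", "b", 1.0), ("u", "b", 2.0), ("u", "v", 3.0)]
S_u = extract_history(edges, "u", 4.0, L=4)
```

`S_u` always has length exactly L, with the self-loop as its final entry.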
4.2. Encoder Layer
In this section, we detail how the extracted interactions are converted from
temporal graph learning to time series analysis. These interactions, construed as
temporally continuous events, embody four encodings: node, link, time, and frequency.
Together, these constitute the comprehensive representation X_∗^t, where ∗ refers to
either of the nodes u or v.
4.2.1. Node/Edge Encoding.
In the realm of temporal graphs, both vertices (nodes) and interactions (edges or
links) often carry associated features. To derive the embeddings of interactions, we
harvest the inherent traits of the involved nodes and edges within the sequence S_∗^t.
Aligning our methodology with established approaches in the literature [9, 21], we
adopt a schema to encode nodes and links as X_{∗,N}^t ∈ R^{L×d_N} and
X_{∗,E}^t ∈ R^{L×d_E}, wherein d_N and d_E typify the dimensions of the node and
edge embeddings, respectively. The specific schema is defined as:

X_{∗,N}^t = MLP(x_{∗,N}^t),  X_{∗,E}^t = MLP(x_{∗,E}^t),   (3)

where x_{∗,N}^t and x_{∗,E}^t represent the features of the nodes and links that
node ∗ interacts with in S_∗^t. If the original features do not exist, they are all
set to zero.
parameters ensure that the value of t_max × α^{−(i−1)/β} converges to zero as i
approaches d_T. So the time encoding can be defined as:

X_{∗,T}^t = [cos(∆t_1 ω), cos(∆t_2 ω), · · · , cos(∆t_L ω)]^T.   (4)
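A sketch of the relative time encoding in Eq. (4). The frequency vector ω is defined in text lost to a page break; following common practice in time-aware encodings [16], we assume ω_i = α^{−(i−1)/β} with illustrative α and β, so the exact frequencies here are an assumption:

```python
import numpy as np

def time_encoding(timestamps, t, d_T=8, alpha=10.0, beta=2.0):
    """Eq. (4): cosine features of time gaps Δt_l = t - t_l under frequencies ω."""
    deltas = t - np.asarray(timestamps)              # Δt_1, ..., Δt_L
    i = np.arange(1, d_T + 1)
    omega = alpha ** (-(i - 1) / beta)               # assumed decaying frequencies
    return np.cos(np.outer(deltas, omega))           # (L, d_T) encoding matrix

X_T = time_encoding([1.0, 2.0, 3.0], t=4.0)
```

The first column uses ω_1 = 1, so it is simply cos(Δt_l); later columns oscillate ever more slowly, giving multi-scale views of the same gaps.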
throughout the training phase, thereby accelerating model optimization. Furthermore,
the adoption of relative time encoding is instrumental in discerning repetitive
temporal patterns. This function inspects the temporal gaps between interactions,
enabling the construction of a similarity index for comparable timestamps and
contributing to precise temporal differentiation.
Most existing methods undervalue the potential for re-interactions with historical
nodes and overlook inherent node correlations. We address this gap via a paradigm
shift: the introduction of a node interaction frequency encoding technique. This
novel approach posits that the frequency of a node's interactions within a
historical sequence signifies its relevance. Our method goes beyond analyzing merely
historical interaction sequences and incorporates the frequency of both interaction
node occurrence and node pair interaction. This judicious approach effectively
captures correlations between two nodes' common interaction partners within their
respective historical interaction sequences.
As illustrated in Fig. 1, we calculate the frequency of each interaction in both
S_u^t and S_v^t, represented as F_u^t ∈ R^{L×2} and F_v^t ∈ R^{L×2}, respectively,
with the first column denoting the number of times the node appears in S_u^t and the
second column denoting the number of times the node appears in S_v^t. Specifically,
as shown in Fig. 1, for the interaction nodes {u, b, b} and {v, d, b} appearing in
S_u^t and S_v^t, respectively, F_u^t = [[1, 2, 2], [0, 1, 1]] and
F_v^t = [[0, 0, 2], [1, 1, 1]]. We then encode the frequency of node pair
interactions, deriving node interaction frequency features for u and v, denoted by
X_{∗,F}^t ∈ R^{L×d_F}, wherein d_F is the dimension of the frequency embedding.
This encoding manifests mathematically as:

X_{∗,F}^t = f(F_∗^t[:, 0]) + f(F_∗^t[:, 1]).   (5)
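The counting step that produces F_u^t and F_v^t (before the learnable map f(·) of Eq. (5)) can be reproduced directly from Fig. 1's running example; `frequency_features` is an illustrative helper name, and the output matches the text's matrices up to transposition (one row per history entry here):

```python
from collections import Counter

def frequency_features(partners_u, partners_v):
    """For every partner in each history, count its occurrences in both histories."""
    cu, cv = Counter(partners_u), Counter(partners_v)
    F_u = [[cu[p], cv[p]] for p in partners_u]   # counts in S_u^t and S_v^t
    F_v = [[cu[p], cv[p]] for p in partners_v]
    return F_u, F_v

# Interaction nodes {u, b, b} and {v, d, b} from the running example:
F_u, F_v = frequency_features(["u", "b", "b"], ["v", "d", "b"])
# F_u == [[1, 0], [2, 1], [2, 1]] and F_v == [[0, 1], [0, 1], [2, 1]]
```

Note how the shared partner b receives a non-zero count in both columns, which is precisely the common-neighbor correlation the encoding is designed to expose.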
they accommodate the variability of recurrence patterns apparent across different
network domains and structures.
and X_{v,⋄}^t ∈ R^{L×d}. It is imperative to note that the symbol ⋄ can represent
N, E, T, or F. These alignments can be mathematically expressed as:

X_{u,⋄}^t = X_{u,⋄}^t W_⋄ + b_⋄ ∈ R^{L×d},
X_{v,⋄}^t = X_{v,⋄}^t W_⋄ + b_⋄ ∈ R^{L×d}.   (6)

We then concatenate these encodings, which can be calculated as follows:

X_u^t = X_{u,N}^t || X_{u,E}^t || X_{u,T}^t || X_{u,F}^t ∈ R^{L×4d},
X_v^t = X_{v,N}^t || X_{v,E}^t || X_{v,T}^t || X_{v,F}^t ∈ R^{L×4d}.   (7)

Following this, we treat X_u^t and X_v^t as time series of length L, thus positioning
us for a comprehensive analysis of the time series.
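Eqs. (6) and (7) amount to four linear projections to a shared width d followed by a feature-axis concatenation; a sketch with illustrative dimensions (the specific d_N, d_E, d_T, d_F values are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4
dims = {"N": 5, "E": 3, "T": 8, "F": 2}          # illustrative d_N, d_E, d_T, d_F

encodings = {key: rng.normal(size=(L, dk)) for key, dk in dims.items()}
params = {key: (rng.normal(size=(dk, d)), np.zeros(d)) for key, dk in dims.items()}

# Eq. (6): align every encoding to R^{L x d} with its own (W, b).
aligned = {key: encodings[key] @ W + b for key, (W, b) in params.items()}
# Eq. (7): concatenate along the feature axis to get X^t in R^{L x 4d}.
X = np.concatenate([aligned[key] for key in ("N", "E", "T", "F")], axis=1)
```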
Figure 2: Attention (left) and ACM (right). We utilize the Fast Fourier Transform (FFT) to calculate the
ACM, which reflects the time-delay similarities.
4.3.1. Auto-Correlation Mechanism.
As depicted in Fig. 2, we propose an ACM that utilizes node interaction sequences
with series-wise connections to optimize information usage. The ACM uncovers
period-based dependencies by calculating the auto-correlation of the series and
amalgamating similar sub-sequences through time-delay aggregation.
Period-based Dependencies. Inspired by the theory of stochastic processes, we
observe that identical phase positions across periods naturally exhibit similar
sub-processes. Thus, we treat each interaction sequence as a discrete-time process
{X_i^t}_{i=1}^L, where X_i^t is the i-th row of X_∗^t, i.e., the feature of the i-th
sampled interaction. Its auto-correlation R_{X,X}(δ) can then be computed as follows:

R_{X,X}(δ) = lim_{L→∞} (1/L) Σ_{i=1}^{L} X_i^t X_{i−δ}^t.   (8)

In this equation, R_{X,X}(δ) mirrors the time-delay similarity between X_i^t and its
lag series X_{i−δ}^t. As shown in Fig. 3, we employ the auto-correlation R(δ) as the
unnormalized confidence of an estimated period length δ. We then select the k most
probable period lengths δ_1, · · · , δ_k. The resulting period-based dependencies,
drawn from the aforementioned estimated periods, can be weighted by the
corresponding auto-correlation.
Figure 3: Time-delay aggregation block. R(δ) reflects the time-delay similarities. The similar sub-processes
are rolled to the same index based on the selected delay δ and aggregated according to R(δ).
In response, we introduce the time-delay aggregation block (illustrated in Fig. 3),
which allows the series to roll based on the selected time delays δ_1, · · · , δ_k.
This operation aligns similar sub-series onto the same phase position of estimated
periods, in contrast with the point-wise dot-product aggregation used by the
attention family. Lastly, we complete the process by using the softmax-normalized
confidences to aggregate the sub-sequences.
We initiate our discourse with a single-head scenario involving a time series X_i^t
of length L. To analogize attention, we use the symbols Q = K = V = X_∗^t, which are
derived from the encoder layer; the ACM can thus seamlessly interchange with the
attention mechanism.

The subsequent step introduces the ACM, which incorporates several components, with
the initial process given by:

δ_1, · · · , δ_k = arg Topk_{δ ∈ {1, ··· , L}} ( R_{Q,K}(δ) ),
R̂_{Q,K}(δ_1), · · · , R̂_{Q,K}(δ_k) = SoftMax( R_{Q,K}(δ_1), · · · , R_{Q,K}(δ_k) ),   (10)

ACM(Q, K, V) = Σ_{i=1}^{k} Roll(V, δ_i) R̂_{Q,K}(δ_i),   (11)

where Roll(X^t, δ) signifies an operation on X^t with time delay δ; any elements
shifted beyond the initial position are reinstated at the end of the sequence.
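A single-head NumPy sketch of Eqs. (10) and (11). The lag confidences are averaged over feature channels as a finite-sample stand-in for Eq. (8); that averaging choice, and the function name `acm`, are our assumptions for illustration:

```python
import numpy as np

def acm(Q, K, V, k=2):
    """Top-k time-delay aggregation: roll V by the most correlated lags."""
    L = Q.shape[0]
    # Unnormalized lag confidences R_{Q,K}(delta), averaged over channels.
    R = np.array([(Q * np.roll(K, delta, axis=0)).mean() for delta in range(1, L)])
    top = np.argsort(R)[-k:]                         # indices of top-k delays
    w = np.exp(R[top] - R[top].max())
    w /= w.sum()                                     # softmax over k confidences
    out = np.zeros_like(V)
    for weight, idx in zip(w, top):
        out += weight * np.roll(V, idx + 1, axis=0)  # Roll(V, delta_i) * R_hat
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
out = acm(X, X, X)                                   # Q = K = V, as in the paper
```

`np.roll` implements exactly the wrap-around behavior described for Roll(·, δ): elements shifted past the first position re-enter at the end.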
results into:
where Oi = ACM(Qi , Ki , Vi ) from Eq. (11), WO ∈ Rd is a trainable parameter.
Efficient Computation. Because period-based dependencies are inherently sparse,
linking only to sub-processes at identical phase positions within specific periods,
prioritizing the most probable delays prevents the selection of opposite phases.
This mechanism, exhibiting computational aptness, aggregates O(log L) series, each
of length L; the complexity of Eq. (11) and Eq. (12) is therefore O(L log L),
illustrating its computational efficiency.
Auto-correlation calculation (Eq. (8)) hinges on the Fast Fourier Transform (FFT),
according to the Wiener–Khinchin theorem [55], for a time series {X_i^t}_{i=1}^L.
The resultant R_{X,X}(δ) can be obtained via the following equations:

S_{X,X}(f) = F(X_i^t) F^∗(X_i^t) = ∫_{−∞}^{∞} X_i^t e^{−2πifi} di ( ∫_{−∞}^{∞} X_i^t e^{−2πifi} di )^∗,   (13)

R_{X,X}(δ) = F^{−1}(S_{X,X}(f)) = ∫_{−∞}^{∞} S_{X,X}(f) e^{2πifδ} df,   (14)

where δ belongs to the set {1, · · · , L}, F signifies the FFT, F^{−1} its inverse,
and ∗ denotes the conjugate operation. By using the FFT, the auto-correlations of
all lags in {1, · · · , L} can be computed concurrently, enhancing computational
efficiency with complexity O(L log L).
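Eqs. (13) and (14) correspond to the classic FFT trick for circular auto-correlation, sketched here for a one-dimensional series with an artificial period of 8:

```python
import numpy as np

def autocorrelation_fft(x):
    """All circular lag auto-correlations of a length-L series in O(L log L)."""
    f = np.fft.rfft(x)                           # F(x)
    S = f * np.conj(f)                           # power spectrum S(f) = F(x) F*(x)
    return np.fft.irfft(S, n=len(x)) / len(x)    # R = F^{-1}(S), normalized by L

L = 64
t = np.arange(L)
x = np.sin(2 * np.pi * t / 8)                    # a series with period 8
R = autocorrelation_fft(x)
best_lag = int(np.argmax(R[1:])) + 1             # most confident non-zero delay
```

For this signal the confidences peak at multiples of the true period, so the selected delay is a multiple of 8; in a learned setting the same peaks drive the top-k choice in Eq. (10).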
4d. The operational functionality of the Series Transformer layer can thus be formally
expounded as follows:
Z_∗^{(j)} = FFN(LN(Z_∗^{′(j)})) + Z_∗^{′(j)},   (16)

where Q, K, and V all equal Z_∗^{(j−1)}. The input of the first layer is
Z_∗^{(0)} = X_∗^t ∈ R^{L×4d}, and the output of the J-th layer is denoted by
H_∗^t = Z_∗^{(J)} ∈ R^{L×4d}, with 1 ≤ j ≤ J indexing the transformer layers.
For the output matrix H_∗^t ∈ R^{L×4d} of a node, H_{∗,1}^t ∈ R^{1×4d} is the token
representation of the node interacting with itself, and H_{∗,l}^t ∈ R^{1×4d} is its
l-th event representation. We calculate the normalized attention coefficient for its
l-th event:

λ_l = exp((H_{∗,1}^t || H_{∗,l}^t) W_a^T) / Σ_{i=2}^{L} exp((H_{∗,1}^t || H_{∗,i}^t) W_a^T),   (17)

where W_a ∈ R^{1×8d} denotes the learnable projection and l = 2, · · · , L. The
readout function therefore takes the correlation between each event and the node
representation into account. The node representation is finally aggregated as
follows:

h_∗^t = H_{∗,1}^t + Σ_{l=2}^{L} λ_l H_{∗,l}^t.   (18)
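Eqs. (17) and (18) can be sketched as follows; the projection `W_a` is handled as a flat vector of length 8d for simplicity, and all shapes are illustrative:

```python
import numpy as np

def adaptive_readout(H, W_a):
    """H: (L, 4d) transformer output; W_a: (8d,) projection; returns h in R^{4d}."""
    pairs = np.concatenate(
        [np.broadcast_to(H[0], H[1:].shape), H[1:]], axis=1)  # (H_1 || H_l) rows
    scores = pairs @ W_a                                      # Eq. (17) logits
    lam = np.exp(scores - scores.max())
    lam /= lam.sum()                                          # normalized lambda_l
    return H[0] + lam @ H[1:]                                 # Eq. (18)

rng = np.random.default_rng(0)
L, D = 6, 8                        # D plays the role of 4d
h = adaptive_readout(rng.normal(size=(L, D)), rng.normal(size=(2 * D,)))
```

The self-loop token H_1 is kept as a residual and augmented by a softmax-weighted mixture of the event tokens, so events that correlate with the node's own token dominate the readout.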
For the link prediction loss, we adopt the binary cross-entropy loss function,
defined as:

L_p = − Σ_{i=1}^{S} ( y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ),   (19)

where y_i represents the ground-truth label of the i-th sample and ŷ_i is the
predicted value computed from the two nodes' representations.
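Eq. (19) on a toy batch of three illustrative samples; the small epsilon guarding the logarithms is a standard implementation detail not stated in the paper:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy of Eq. (19), summed over the batch."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # guard log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y = np.array([1.0, 0.0, 1.0])        # ground-truth link labels
p = np.array([0.9, 0.1, 0.8])        # predicted link probabilities
loss = bce_loss(y, p)                # ≈ 0.434
```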
and hidden features are denoted by d, and M signifies the size of temporal edges |E|.
The sampling process responsible for acquiring the L most recent neighbors achieves a
time complexity of O(1). Modules can be executed in parallel within the encoder layer,
Algorithm 1 Training pipeline for TGFormer
Input: A temporal graph G, a node pair (u, v) with a specific timestamp t, the
neighbor sample number L, a maximum training epoch of 200, and an early stopping
strategy with patience = 20.
Output: The probability of the node pair interacting at timestamp t.
1: initialize patience = 0;
2: for training epoch = 1, 2, 3, . . . do
3:   Acquire the L most recent first-hop interaction neighbors of nodes u and v from
     G prior to timestamp t as S_u^t and S_v^t;
4:   for S_u^t and S_v^t in parallel do
5:     Obtain node encoding X_{∗,N}^t and edge encoding X_{∗,E}^t from Eq. (3);
6:     Obtain time encoding X_{∗,T}^t from Eq. (4);
7:     Obtain frequency encoding X_{∗,F}^t from Eq. (5);
8:     X_∗^t ← X_{∗,N}^t || X_{∗,E}^t || X_{∗,T}^t || X_{∗,F}^t;
9:     Initialize Z_∗^{(0)} ← X_∗^t;
10:    for each Series Transformer layer j do
11:      Q, K, V ← Z_∗^{(j−1)};
12:      Z_∗^{′(j)} ← ACM(Q, K, V) + Z_∗^{(j−1)};
13:      Z_∗^{(j)} ← FFN(LN(Z_∗^{′(j)})) + Z_∗^{′(j)};
14:    end for
15:    Adaptive readout with Eq. (18);
16:  end for
17:  Conduct link prediction with Eq. (20);
18:  Compute loss L_p with Eq. (19);
19:  if the current epoch's metrics are worse than the previous epoch's then
20:    patience = patience + 1
21:  else
where the encoder complexity reaches a maximum of O(dL). Furthermore, the series
Table 2: Statistics of the datasets.
Datasets Domains #Nodes #Links #Node & Link Features Bipartite Time Granularity Duration
iew
Wikipedia Social 9,227 157,474 – & 172 True Unix timestamp 1 month
Reddit Social 10,984 672,447 – & 172 True Unix timestamp 1 month
LastFM Interaction 1,980 1,293,103 –&– True Unix timestamp 1 month
Enron Social 184 125,235 –&– False Unix timestamp 3 years
UCI Social 1,899 59,835 –&– False Unix timestamp 196 days
CollegeMsg Social 1,899 59,835 –&– False Unix timestamp 193 days
5. Experiments
5.1. Datasets
We evaluate TGFormer on six real-world datasets: Wikipedia, Reddit, LastFM, Enron,
UCI, and CollegeMsg. The statistical characteristics of these datasets are
comprehensively detailed in Table 2. Further elaboration on the datasets is provided
below.
• Reddit2 consists of a bipartite graph monitoring user posts on Reddit for one
month. Here, nodes represent users and subreddits, whereas links denote
timestamped posts.
• LastFM3 is a bipartite dataset that records user interactions with songs
over a one-month period. Nodes represent users and songs, while links denote
the listening behaviors of users.
1 [Link]
2 [Link]
3 [Link]
• Enron4 documents email communications among employees of the ENRON energy
corporation over a three-year period. This dataset does not include node
attributes or link features.
• UCI5 records online message interactions among university student communities.
• CollegeMsg6 is collected from an online social network at the University of
California, Irvine, depicting user interactions through private messages exchanged
at various timestamps. This dataset does not include node labels or edge features.
5.2. Baselines
Our model is compared with nine state-of-the-art methods on temporal graphs, based
on graph convolutions, memory networks, random walks, sequential models, and
transformer mechanisms. Here, we briefly introduce the mechanisms of these methods
for our assessment.
• JODIE [10] employs recurrent networks to update user and item states, enhancing
precision in capturing intricate temporal patterns within user-item interactions.
The incorporation of a projection operation enables accurate learning of future
representation trajectories.
the evolving structural information in dynamic graphs over time. This dual-
component framework provides a consistent and effective approach to capture
intricate temporal dynamics and evolving graph structures in dynamic networks.
4 [Link] ./enron/
5 [Link]
6 [Link]
• TGAT [16] utilizes self-attention to compute node representations, aggregating
features from temporal neighbors and employing a time encoding function for
capturing nuanced temporal patterns. This concise approach ensures precise
analysis of evolving graph structures and temporal dynamics in dynamic networks.
• TGN [9] utilizes an evolving memory for each node, updated upon node interactions
through the message function, message aggregator, and memory updater mechanisms.
Simultaneously, an embedding module is employed to generate temporal node
representations, ensuring a dynamic and comprehensive analysis of evolving graph
structures in complex networks.
• CAWN [33] initiates its process by extracting multiple causal anonymous walks
for each node, facilitating an in-depth examination of the network dynamics'
causality and the establishment of relative node identities. Following this, an
RNN is utilized to encode each individual walk. The encoded walks are subsequently
aggregated to synthesize the final node representation.
• TCL [59] initiates its methodology by generating interaction sequences for each node, which are then encoded by a transformer that jointly considers graph topology and temporal information; the encoders are optimized with a contrastive learning objective.
• GraphMixer [21] adopts a fixed time encoding function, surpassing its trainable
counterpart in performance. It incorporates this function into an MLP-Mixer-
based link encoder to learn from temporal links efficiently. The framework uti-
lizes neighbor mean-pooling in the node encoder for a concise summarization of
node features.
• DyGFormer [37] introduces a neighbor co-occurrence encoding scheme to exploit the correlations of nodes within interactions. Also, it employs patching techniques to enable the model to capture long-term temporal dependencies.
5.3. Evaluation Tasks and Metrics
We closely follow [37] in evaluating model performance on temporal link prediction, which entails forecasting the probability of a link forming between two specified nodes at a given timestamp. This evaluation is conducted under two distinct settings: the transductive setting, which predicts future links among nodes observed during the training phase, and the inductive setting, which predicts future links involving previously unseen nodes. To this end, a multi-layer perceptron is utilized, which takes the concatenated representations of the two nodes as input and outputs the likelihood of a link. The evaluation metrics are Average Precision (AP) and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). For the node classification task, AUC-ROC is adopted as the performance metric. All reported results are averaged over 10 independent runs.
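As a concrete reference for the two metrics, both can be computed from link scores and binary labels in a few lines of pure Python (an illustrative sketch with hypothetical helper names, not the paper's codebase):

```python
def average_precision(scores, labels):
    """AP: precision averaged over the ranks of the positive links."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank
    return ap / sum(labels)

def auc_roc(scores, labels):
    """AUC-ROC: probability that a random positive link outranks a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice one would use a library implementation (e.g. scikit-learn's `average_precision_score` and `roc_auc_score`); the sketch only makes the metric definitions explicit.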
Following [54], we additionally adopt three negative sampling strategies that make the evaluation more challenging. Specifically, the three distinct negative sampling strategies are defined as follows: 1) Random Negative Sampling Strategy, where negative edges are randomly chosen from virtually all conceivable node pairs within the graphs. 2) Historical Negative Sampling Strategy, entailing the sampling of negative edges from the set of edges observed in preceding timestamps but absent in the current step. 3) Inductive Negative Sampling Strategy, involving the selection of negative edges from previously unseen edges that were not encountered during the training phase. Please refer to [54] for more details. For all tasks, each dataset is chronologically split into 70%, 15%, and 15% subsets for training, validation, and testing, respectively. The hyperparameter configurations of the baseline models adhere to those outlined in their respective publications, which in turn follow [37].
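The three negative sampling strategies can be sketched as set operations over edge sets (a hypothetical helper, not the authors' implementation; `seen_edges` denotes edges observed at earlier timestamps of the evaluated split):

```python
import random

def sample_negatives(strategy, k, all_pairs, train_edges, seen_edges, current_edges):
    """Pick k negative edges under one of the three strategies.
    All arguments except `strategy` and `k` are sets of (src, dst) pairs."""
    if strategy == "random":
        # any conceivable pair that is not a positive edge right now
        candidates = all_pairs - current_edges
    elif strategy == "historical":
        # edges seen at earlier timestamps but absent in the current step
        candidates = seen_edges - current_edges
    elif strategy == "inductive":
        # previously unseen edges: observed in evaluation, never in training
        candidates = (seen_edges - train_edges) - current_edges
    else:
        raise ValueError(strategy)
    return random.sample(sorted(candidates), min(k, len(candidates)))
```

The harder strategies shrink the candidate pool to edges that look plausible, which is why they are more challenging than random sampling.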
All models are trained for up to 100 epochs with an early stopping strategy whose patience is set to 10. The model achieving the best performance on the validation set is selected for testing. The Adam optimizer is employed for all models, with a supervised binary cross-entropy loss as the objective function, a learning rate of 0.00001, and a batch size of 200. We conduct our experiments on a machine with an Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz, 256 GiB RAM, and four NVIDIA 3090 GPU cards; all models are implemented in Python 3.9 with PyTorch.
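The chronological 70%/15%/15% split and the early-stopping rule described above can be sketched as follows (illustrative helpers, assuming each event stores its timestamp in the last field):

```python
def chronological_split(events, train_frac=0.70, val_frac=0.15):
    """Split interactions by time into train/val/test (default 70%/15%/15%)."""
    events = sorted(events, key=lambda e: e[-1])  # e.g. (src, dst, timestamp)
    n = len(events)
    a = int(n * train_frac)
    b = int(n * (train_frac + val_frac))
    return events[:a], events[a:b], events[b:]

class EarlyStopper:
    """Stop training after `patience` epochs without a new validation best."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad_epochs = patience, float("-inf"), 0
    def should_stop(self, val_metric):
        if val_metric > self.best:
            self.best, self.bad_epochs = val_metric, 0  # new best; reset counter
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Splitting by time rather than at random is essential here: it prevents information from future interactions leaking into training.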
Table 3: AP(%) for transductive temporal link prediction with random, historical, and inductive negative
sampling strategies. NSS is the abbreviation of Negative Sampling Strategies.
NSS Datasets JODIE DyRep TGAT TGN CAWN EdgeBank TCL GraphMixer DyGFormer TGFormer
Wikipedia 96.50 ± 0.14 94.86 ± 0.06 96.94 ± 0.06 98.45 ± 0.06 98.76 ± 0.03 90.37 ± 0.00 96.47 ± 0.16 97.25 ± 0.03 99.03 ± 0.02 99.79 ± 0.10
Reddit 98.31 ± 0.14 98.22 ± 0.04 98.52 ± 0.02 98.63 ± 0.06 99.11 ± 0.01 94.86 ± 0.00 97.53 ± 0.02 97.31 ± 0.01 99.22 ± 0.01 99.86 ± 0.01
LastFM 70.85 ± 2.13 71.92 ± 2.21 73.42 ± 0.21 77.07 ± 3.97 86.99 ± 0.06 79.29 ± 0.00 67.27 ± 2.16 75.61 ± 0.24 93.00 ± 0.12 96.87 ± 0.34
Random Enron 84.77 ± 0.30 82.38 ± 3.36 71.12 ± 0.97 86.53 ± 1.11 89.56 ± 0.09 83.53 ± 0.00 79.70 ± 0.71 82.25 ± 0.16 92.47 ± 0.12 97.14 ± 0.72
UCI 89.43 ± 1.09 65.14 ± 2.30 79.63 ± 0.70 92.34 ± 1.04 95.18 ± 0.06 76.20 ± 0.00 89.57 ± 1.63 93.25 ± 0.57 95.79 ± 0.17 99.09 ± 0.17
CollegeMsg 75.41 ± 1.82 56.92 ± 6.03 80.27 ± 0.29 92.62 ± 0.99 95.86 ± 0.06 76.42 ± 0.00 83.64 ± 0.10 93.04 ± 0.29 95.79 ± 0.02 99.17 ± 0.22
Avg. Rank 7.17 8.5 7.17 4.5 2.83 7.83 7.83 6 2.17 1
Wikipedia 83.01 ± 0.66 79.93 ± 0.56 87.38 ± 0.22 86.86 ± 0.33 71.21 ± 1.67 73.35 ± 0.00 89.05 ± 0.39 90.90 ± 0.10 82.23 ± 2.54 92.47 ± 0.19
Reddit 80.03 ± 0.36 79.83 ± 0.31 79.55 ± 0.20 81.22 ± 0.61 80.82 ± 0.45 73.59 ± 0.00 77.14 ± 0.16 78.44 ± 0.18 81.57 ± 0.67 84.44 ± 1.49
LastFM 74.35 ± 3.81 74.92 ± 2.46 71.59 ± 0.24 76.87 ± 4.64 69.86 ± 0.43 73.03 ± 0.00 59.30 ± 2.31 72.47 ± 0.49 81.57 ± 0.48 84.52 ± 0.67
Historical Enron 69.85 ± 2.70 71.19 ± 2.76 64.07 ± 1.05 73.91 ± 1.76 64.73 ± 0.36 76.53 ± 0.00 70.66 ± 0.39 77.98 ± 0.92 75.63 ± 0.73 82.64 ± 1.39
UCI 75.24 ± 5.80 55.10 ± 3.14 68.27 ± 1.37 80.43 ± 2.12 65.30 ± 0.43 65.50 ± 0.00 80.25 ± 2.74 84.11 ± 1.35 82.17 ± 0.82 86.12 ± 1.50
CollegeMsg 64.41 ± 6.52 47.73 ± 0.97 68.18 ± 1.04 80.65 ± 1.13 84.54 ± 0.11 44.16 ± 0.00 68.53 ± 0.08 83.93 ± 0.09 80.93 ± 0.37 90.02 ± 1.19
Avg. Rank 6.33 7.17 7.17 4.17 7.17 7.67 6.67 4 3.67 1
Wikipedia 75.65 ± 0.79 70.21 ± 1.58 87.00 ± 0.16 85.62 ± 0.44 74.06 ± 2.62 80.63 ± 0.00 86.76 ± 0.72 88.59 ± 0.17 78.29 ± 5.38 93.10 ± 0.71
Reddit 86.98 ± 0.16 86.30 ± 0.26 89.59 ± 0.24 88.10 ± 0.24 91.67 ± 0.24 85.48 ± 0.00 87.45 ± 0.29 85.26 ± 0.11 91.11 ± 0.40 92.37 ± 1.64
LastFM 62.67 ± 4.49 64.41 ± 2.70 71.13 ± 0.17 65.95 ± 5.98 67.48 ± 0.77 75.49 ± 0.00 58.21 ± 0.89 68.12 ± 0.33 73.97 ± 0.50 74.01 ± 1.35
Inductive Enron 68.96 ± 0.98 67.79 ± 1.53 63.94 ± 1.36 70.89 ± 2.72 75.15 ± 0.58 73.89 ± 0.00 71.29 ± 0.32 75.01 ± 0.79 77.41 ± 0.89 84.34 ± 1.74
UCI 65.99 ± 1.40 54.79 ± 1.76 68.67 ± 0.84 70.94 ± 0.71 64.61 ± 0.48 57.43 ± 0.00 76.01 ± 1.11 80.10 ± 0.51 72.25 ± 1.71 80.70 ± 1.16
CollegeMsg 53.49 ± 0.53 54.13 ± 1.64 68.54 ± 0.85 71.90 ± 0.06 75.12 ± 0.09 43.49 ± 0.00 68.70 ± 0.02 79.17 ± 0.09 70.44 ± 0.83 80.15 ± 1.72
Avg. Rank 8 8.6 6.2 5.6 4.4 6.8 6.2 4.6 3.4 1.2
We report results for temporal link prediction under the three negative sampling strategies. We put the transductive results in Table 3 and Table 4 and the inductive results in Table 5 and Table 6, respectively. The best and second-best results are marked in bold and underlined fonts. Note that EdgeBank [54] can only evaluate transductive
Table 4: AUC-ROC(%) for transductive temporal link prediction with random, historical, and inductive negative sampling strategies. NSS is the abbreviation of Negative Sampling Strategies.
NSS Datasets JODIE DyRep TGAT TGN CAWN EdgeBank TCL GraphMixer DyGFormer TGFormer
Wikipedia 96.33 ± 0.07 94.37 ± 0.09 96.67 ± 0.07 98.37 ± 0.07 98.54 ± 0.04 90.78 ± 0.00 95.84 ± 0.18 96.92 ± 0.03 98.91 ± 0.02 99.77 ± 0.10
Reddit 98.31 ± 0.05 98.17 ± 0.05 98.47 ± 0.02 98.60 ± 0.06 99.01 ± 0.01 95.37 ± 0.00 97.42 ± 0.02 97.17 ± 0.02 99.15 ± 0.01 99.82 ± 0.01
LastFM 70.49 ± 1.66 71.16 ± 1.89 71.59 ± 0.18 78.47 ± 2.94 85.92 ± 0.10 83.77 ± 0.00 64.06 ± 1.16 73.53 ± 0.12 93.05 ± 0.10 96.85 ± 0.44
Random Enron 87.96 ± 0.52 84.89 ± 3.00 68.89 ± 1.10 88.32 ± 0.99 90.45 ± 0.14 87.05 ± 0.00 75.74 ± 0.72 84.38 ± 0.21 93.33 ± 0.13 97.13 ± 0.90
UCI 90.44 ± 0.49 68.77 ± 2.34 78.53 ± 0.74 92.03 ± 1.13 93.87 ± 0.08 77.30 ± 0.00 87.82 ± 1.36 91.81 ± 0.67 94.49 ± 0.26 98.68 ± 0.30
CollegeMsg 78.87 ± 1.71 61.72 ± 6.95 79.25 ± 0.76 92.39 ± 1.00 94.63 ± 0.11 77.32 ± 0.00 83.32 ± 0.08 91.71 ± 0.28 94.52 ± 0.03 98.88 ± 0.30
Avg. Rank 6.83 8.5 7.17 4.17 2.83 8 8 6.33 2.17 1
Wikipedia 80.77 ± 0.73 77.74 ± 0.33 82.87 ± 0.22 82.74 ± 0.32 67.84 ± 0.64 77.27 ± 0.00 85.76 ± 0.46 87.68 ± 0.17 78.80 ± 1.95 92.07 ± 0.33
Reddit 80.52 ± 0.32 80.15 ± 0.18 79.33 ± 0.16 81.11 ± 0.19 80.27 ± 0.30 78.58 ± 0.00 76.49 ± 0.16 77.80 ± 0.12 80.54 ± 0.29 83.08 ± 2.43
LastFM 75.22 ± 2.36 74.65 ± 1.98 64.27 ± 0.26 77.97 ± 3.04 67.88 ± 0.24 78.09 ± 0.00 47.24 ± 3.13 64.21 ± 0.73 78.78 ± 0.35 82.39 ± 0.67
Historical Enron 75.39 ± 2.37 74.69 ± 3.55 61.85 ± 1.43 77.09 ± 2.22 65.10 ± 0.34 79.59 ± 0.00 67.95 ± 0.88 75.27 ± 1.14 76.55 ± 0.52 80.59 ± 1.47
UCI 78.64 ± 3.50 57.91 ± 3.12 58.89 ± 1.57 77.25 ± 2.68 57.86 ± 0.15 69.56 ± 0.00 72.25 ± 3.46 77.54 ± 2.02 76.97 ± 0.24 83.80 ± 2.67
CollegeMsg 66.92 ± 5.00 49.37 ± 1.37 58.19 ± 1.11 77.65 ± 1.40 79.45 ± 0.15 34.64 ± 0.00 58.55 ± 0.18 77.50 ± 0.16 76.25 ± 0.12 88.58 ± 1.10
Avg. Rank 4.67 7.5 7.5 3.5 7.17 6.5 7.33 5.5 4.33 1
Wikipedia 70.96 ± 0.78 67.36 ± 0.96 81.93 ± 0.22 80.97 ± 0.31 70.95 ± 0.95 81.73 ± 0.00 82.19 ± 0.48 84.28 ± 0.30 75.09 ± 3.70 91.01 ± 0.76
Reddit 83.51 ± 0.15 82.90 ± 0.31 87.13 ± 0.20 84.56 ± 0.24 88.04 ± 0.29 85.93 ± 0.00 84.67 ± 0.29 82.21 ± 0.13 86.23 ± 0.51 89.42 ± 2.17
LastFM 61.32 ± 3.49 62.15 ± 2.12 63.99 ± 0.21 65.46 ± 4.27 67.92 ± 0.44 77.37 ± 0.00 46.93 ± 2.59 60.22 ± 0.32 69.25 ± 0.36 70.54 ± 0.67
Inductive Enron 70.92 ± 1.05 68.73 ± 1.34 60.45 ± 2.12 71.34 ± 2.46 75.17 ± 0.50 75.00 ± 0.00 67.64 ± 0.86 71.53 ± 0.85 74.07 ± 0.64 81.19 ± 1.21
UCI 64.14 ± 1.26 54.25 ± 2.01 60.80 ± 1.01 64.11 ± 1.04 58.06 ± 0.26 58.03 ± 0.00 70.05 ± 1.86 74.59 ± 0.74 65.96 ± 1.18 75.98 ± 2.90
CollegeMsg 53.34 ± 0.22 53.13 ± 2.16 59.84 ± 1.16 65.58 ± 0.42 69.28 ± 0.24 30.53 ± 0.00 59.67 ± 0.16 74.06 ± 0.25 64.77 ± 0.13 76.92 ± 1.31
Avg. Rank 7.2 8.6 6.4 5.6 3.8 5.6 7 5.6 4 1.2
Table 5: AP(%) for inductive temporal link prediction with random, historical, and inductive negative sampling strategies. NSS is the abbreviation of Negative Sampling Strategies.
NSS Datasets JODIE DyRep TGAT TGN CAWN TCL GraphMixer DyGFormer TGFormer
Wikipedia 94.82 ± 0.20 92.43 ± 0.37 96.22 ± 0.07 97.83 ± 0.04 98.24 ± 0.03 96.22 ± 0.17 96.65 ± 0.02 98.59 ± 0.03 99.27 ± 0.25
Reddit 96.50 ± 0.13 96.09 ± 0.11 97.09 ± 0.04 97.50 ± 0.07 98.62 ± 0.01 94.09 ± 0.07 95.26 ± 0.02 98.84 ± 0.02 99.78 ± 0.04
LastFM 81.61 ± 3.82 83.02 ± 1.48 78.63 ± 0.31 81.45 ± 4.29 89.42 ± 0.07 73.53 ± 1.66 82.11 ± 0.42 94.23 ± 0.09 97.97 ± 0.37
Random Enron 80.72 ± 1.39 74.55 ± 3.95 67.05 ± 1.51 77.94 ± 1.02 86.35 ± 0.51 76.14 ± 0.79 75.88 ± 0.48 89.76 ± 0.34 91.88 ± 1.26
UCI 79.86 ± 1.48 57.48 ± 1.87 79.54 ± 0.48 88.12 ± 2.05 92.73 ± 0.06 87.36 ± 2.03 91.19 ± 0.42 94.54 ± 0.12 99.16 ± 0.15
CollegeMsg 63.30 ± 1.26 52.72 ± 2.58 79.58 ± 0.47 88.07 ± 1.44 94.44 ± 0.05 81.18 ± 0.04 90.94 ± 0.22 94.36 ± 0.12 97.73 ± 0.22
Avg. Rank 6.5 7.67 7.25 5 2.83 7.08 5.5 2.17 1
Wikipedia 68.69 ± 0.39 62.18 ± 1.27 84.17 ± 0.22 81.76 ± 0.32 67.27 ± 1.63 82.20 ± 2.18 87.60 ± 0.30 71.42 ± 4.43 78.69 ± 1.25
Reddit 62.34 ± 0.54 61.60 ± 0.72 63.47 ± 0.36 64.85 ± 0.85 63.67 ± 0.41 60.83 ± 0.25 64.50 ± 0.26 65.37 ± 0.60 72.95 ± 1.13
LastFM 70.39 ± 4.31 71.45 ± 1.76 75.27 ± 0.25 66.65 ± 6.11 71.33 ± 0.47 65.78 ± 0.65 76.42 ± 0.22 76.35 ± 0.52 76.81 ± 0.79
Historical Enron 65.86 ± 3.71 62.08 ± 2.27 61.40 ± 1.31 62.91 ± 1.16 60.70 ± 0.36 67.11 ± 0.62 72.37 ± 1.37 67.07 ± 0.62 76.77 ± 1.01
UCI 63.11 ± 2.27 52.47 ± 2.06 70.52 ± 0.93 70.78 ± 0.78 64.54 ± 0.47 76.71 ± 1.00 81.66 ± 0.49 72.13 ± 1.87 77.24 ± 1.59
CollegeMsg 50.51 ± 0.75 54.43 ± 1.79 70.50 ± 1.18 71.60 ± 0.31 74.14 ± 0.17 69.80 ± 0.23 80.15 ± 0.18 69.59 ± 1.25 77.80 ± 1.11
Avg. Rank 7 7.5 5 5 6.17 5.5 2.5 4.33 2
Wikipedia 68.70 ± 0.39 62.19 ± 1.28 84.17 ± 0.22 81.77 ± 0.32 67.24 ± 1.63 82.20 ± 2.18 87.60 ± 0.29 71.42 ± 4.43 78.13 ± 1.56
Reddit 62.32 ± 0.54 61.58 ± 0.72 63.40 ± 0.36 64.84 ± 0.84 63.65 ± 0.41 60.81 ± 0.26 64.49 ± 0.25 65.35 ± 0.60 72.95 ± 1.13
LastFM 70.39 ± 4.31 71.45 ± 1.75 76.28 ± 0.25 69.46 ± 4.65 71.33 ± 0.47 65.78 ± 0.65 76.42 ± 0.22 76.35 ± 0.52 74.21 ± 1.02
Inductive Enron 65.86 ± 3.71 62.08 ± 2.27 61.40 ± 1.30 62.90 ± 1.16 60.72 ± 0.36 67.11 ± 0.62 72.37 ± 1.38 67.07 ± 0.62 76.77 ± 1.02
UCI 63.16 ± 2.27 52.47 ± 2.09 70.49 ± 0.93 70.73 ± 0.79 64.54 ± 0.47 76.65 ± 0.99 81.64 ± 0.49 72.13 ± 1.86 77.19 ± 1.09
CollegeMsg 50.57 ± 0.76 54.47 ± 1.81 70.50 ± 1.19 71.63 ± 0.31 74.11 ± 0.17 69.80 ± 0.24 80.13 ± 0.18 69.55 ± 1.27 77.72 ± 1.11
Avg. Rank 7.2 7.4 5.4 5.2 5.8 6 2.2 3.8 2
Table 6: AUC-ROC(%) for inductive temporal link prediction with random, historical, and inductive negative sampling strategies. NSS is the abbreviation of Negative Sampling Strategies.
NSS Datasets JODIE DyRep TGAT TGN CAWN TCL GraphMixer DyGFormer TGFormer
Wikipedia 61.86 ± 0.53 57.54 ± 1.09 78.38 ± 0.20 75.75 ± 0.29 62.04 ± 0.65 79.79 ± 0.96 82.87 ± 0.21 68.33 ± 2.82 76.75 ± 1.67
Reddit 61.69 ± 0.39 60.45 ± 0.37 64.43 ± 0.27 64.55 ± 0.50 64.94 ± 0.21 61.43 ± 0.26 64.27 ± 0.13 64.81 ± 0.25 69.26 ± 2.38
LastFM 68.44 ± 3.26 68.79 ± 1.08 69.89 ± 0.28 66.99 ± 5.62 67.69 ± 0.24 55.88 ± 1.85 70.07 ± 0.20 70.73 ± 0.37 70.33 ± 1.02
Historical Enron 65.32 ± 3.57 61.50 ± 2.50 57.84 ± 2.18 62.68 ± 1.09 62.25 ± 0.40 64.06 ± 1.02 68.20 ± 1.62 65.78 ± 0.42 68.79 ± 1.58
UCI 60.24 ± 1.94 51.25 ± 2.37 62.32 ± 1.18 62.69 ± 0.90 56.39 ± 0.10 70.46 ± 1.94 75.98 ± 0.84 65.55 ± 1.01 67.72 ± 1.97
CollegeMsg 48.57 ± 1.27 52.31 ± 1.53 61.53 ± 1.45 63.89 ± 0.71 67.77 ± 0.09 60.05 ± 0.08 74.54 ± 0.16 63.15 ± 0.44 74.18 ± 0.49
Avg. Rank 6.83 8 5.5 5.33 5.67 5.5 2.33 3.67 2.17
Wikipedia 61.87 ± 0.53 57.54 ± 1.09 78.38 ± 0.20 75.76 ± 0.29 62.02 ± 0.65 79.79 ± 0.96 82.88 ± 0.21 68.33 ± 2.82 76.04 ± 2.08
Reddit 61.69 ± 0.39 60.44 ± 0.37 64.39 ± 0.27 64.55 ± 0.50 64.91 ± 0.21 61.36 ± 0.26 64.27 ± 0.13 64.80 ± 0.25 69.25 ± 2.38
LastFM 68.44 ± 3.26 68.79 ± 1.08 69.89 ± 0.28 66.99 ± 5.61 67.68 ± 0.24 55.88 ± 1.85 70.07 ± 0.20 70.73 ± 0.37 73.21 ± 1.42
Inductive Enron 65.32 ± 3.57 61.50 ± 2.50 57.83 ± 2.18 62.68 ± 1.09 62.27 ± 0.40 64.05 ± 1.02 68.19 ± 1.63 65.79 ± 0.42 72.79 ± 1.59
UCI 60.27 ± 1.94 51.26 ± 2.40 62.29 ± 1.17 62.66 ± 0.91 56.39 ± 0.11 70.42 ± 1.93 75.97 ± 0.85 65.58 ± 1.00 72.78 ± 1.34
CollegeMsg 48.64 ± 1.26 52.36 ± 1.53 61.51 ± 1.47 63.93 ± 0.70 67.72 ± 0.09 60.05 ± 0.10 74.53 ± 0.16 63.14 ± 0.44 73.45 ± 0.51
Avg. Rank 6.6 7.8 6 5.4 5.4 6.4 2.6 3.4 1.4
temporal link prediction and therefore does not give its results in the inductive setting. From the tables above, we can see that TGFormer outperforms the existing methods in most cases, often by a clear margin over the second-best method.
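The Avg. Rank rows in the tables can be reproduced by ranking the methods per dataset (rank 1 = highest score) and averaging the ranks; a minimal sketch with illustrative data (ties here are broken arbitrarily, which may differ slightly from the reported tables):

```python
def average_ranks(results):
    """results: {dataset: {method: score}}; higher score -> better (lower) rank."""
    totals = {}
    for scores in results.values():
        ordered = sorted(scores, key=scores.get, reverse=True)  # best first
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    return {m: total / len(results) for m, total in totals.items()}
```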
We attribute the superiority of TGFormer to two aspects. First, the transformer architecture allows TGFormer to process longer histories and capture long-term temporal dependencies effectively. As illustrated in Fig. 6, the input sequence lengths handled by TGFormer are significantly longer than those managed by the baselines across most datasets, underscoring TGFormer's superior aptitude in leveraging longer sequences. Second, benefiting from capturing periodic temporal patterns, TGFormer suffers only a small performance degradation under the harder negative sampling strategies, especially on Wikipedia and Reddit, compared to the other approaches.
Table 7: AUC-ROC(%) for the transductive node classification on Wikipedia and Reddit.
Methods Wikipedia Reddit Avg. Rank
JODIE 88.99 ± 1.05 60.37 ± 2.58 5
DyRep 86.39 ± 0.98 63.72 ± 1.32 5.5
TGAT 84.09 ± 1.27 70.04 ± 1.09 4.5
TGN 86.38 ± 2.34 63.27 ± 0.90 6.5
CAWN 84.88 ± 1.33 66.34 ± 1.78 6
TCL 77.83 ± 2.13 68.87 ± 2.15 5.5
GraphMixer 86.80 ± 0.79 64.22 ± 3.32 5.5
For the node classification task, we adhere to the evaluation protocols established by DyGFormer [37]. The objective of this task is to predict the state label of the source node, given the node and a future timestamp. Specifically, we employ the model obtained from the preceding transductive link prediction as the pre-trained model for node classification. A classifier decoder, a three-layer MLP, is then trained separately for the node classification task. We evaluate this task on the two datasets with dynamic node labels, namely Wikipedia and Reddit, and exclude the other datasets due to the absence of node labels.
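As an illustration of the decoder alone (weights and sizes below are invented for the example; the real decoder consumes the frozen encoder's node representations), the forward pass of such a three-layer MLP can be written in plain Python:

```python
def linear(x, weights, bias):
    """One dense layer: y = Wx + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp3_forward(x, layers):
    """Three-layer MLP decoder: ReLU after the first two layers, raw logit out."""
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x
```

The final logit would then be passed through a sigmoid and scored with AUC-ROC, exactly as in the link prediction task.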
The comparative results of our method and the baseline methods on the node classification task are presented in Table 7. While our method does not attain the best performance, it nonetheless demonstrates competitive results. We attribute the suboptimal performance to two causes. First, our method prioritizes the interactions between nodes, such as periodic temporal dependencies, while paying less attention to individual node statuses. Second, leveraging long-term historical data may inadvertently include more noise and node anomalies [60], which we have not filtered effectively.
Figure 4: Comparison of model performance, parameter size, and training time per epoch on Wikipedia, Enron, and UCI. Each panel plots AP (%) against training time per epoch (s), with each model's trainable parameter count annotated.
The trade-off between effectiveness and efficiency is depicted in Fig. 4. It is evident that the walk-based method CAWN requires a longer training time, due to its inefficient temporal walk operations, and has a substantial number of parameters. Conversely, simpler memory-based methods such as DyRep and JODIE possess fewer parameters and exhibit faster training; however, they show a significant performance gap compared to the best-performing methods. In contrast, TGFormer outperforms the transformer-based method DyGFormer, achieving superior performance with fewer trainable parameters.
Figure 5: Ablation study in the transductive setting with the random negative sampling strategy. AP (%) values are reported.
In w/o ACM, we replace the proposed auto-correlation mechanism with the traditional attention mechanism [38]. In w/o AR, we replace the proposed adaptive readout with the traditional mean function [21]. We report the performance of the different variants on the six datasets in Fig. 5. We find that TGFormer typically performs best when all components are used, and the results are worse when any component is removed or changed. We draw the following conclusions: 1) removing the node interaction frequency encoding scheme significantly affects performance, as it directly captures the relationships between nodes and their interactions. 2) Although ACM plays different roles on different datasets, on large datasets or datasets with significant periodicity it contributes as much as the frequency encoding. This suggests that ACM can benefit from longer historical information and capture the periodic dependencies of the dataset. 3) The adaptive readout function still achieves an additional improvement.
Table 8: TGFormer performance under different choices of hyper-parameter c in the auto-correlation mechanism.
Dataset Wikipedia Reddit LastFM Enron UCI CollegeMsg
Metric AP AUC AP AUC AP AUC AP AUC AP AUC AP AUC
c=1 99.61 99.58 99.87 99.83 96.26 96.55 97.17 97.24 98.93 98.35 99.23 98.95
c=2 99.97 99.97 99.88 99.84 96.28 96.54 96.62 96.68 99.02 98.65 99.29 99.09
c=3 99.86 99.83 99.89 99.86 96.34 96.64 97.85 97.94 99.22 98.79 99.18 98.75
c=4 99.58 99.57 99.87 99.82 96.44 95.70 97.56 97.72 99.24 98.82 99.28 98.94
c=5 99.93 99.93 99.88 99.85 96.63 96.50 97.73 97.84 99.27 98.92 99.30 98.97
5.9. Parameter Sensitivity Analysis
We study the sensitivity of TGFormer to the hyper-parameter c (Eq. (9)). To achieve an optimal balance between performance and computational efficiency, we vary c from 1 to 5. As Table 8 shows, the results on most datasets fluctuate only slightly across this range, indicating that TGFormer is stable with respect to the choice of c.
Figure 6: The performance of various methods on the Reddit and LastFM datasets across different historical lengths L.
We further investigate whether the models are capable of efficiently utilizing longer histories. Our experiments are conducted on the Reddit and LastFM datasets, chosen for their potential to benefit from long-term historical data. For baseline comparisons, we augment the sampling of neighbors or increase the number of causal anonymous walks (starting from 2) to empower them with access to extended histories. The experimental results are illustrated in Fig. 6, with the x-axis depicted on a logarithmic scale with base 2. It is noteworthy that some baseline results are incomplete due to out-of-memory errors encountered at longer historical lengths; for instance, CAWN reaches an out-of-memory state when the historical length extends to 256. From Fig. 6, we deduce the following: 1) a majority of the baselines exhibit deteriorated performance with increasing historical lengths, indicative of their limitations in capturing long-term temporal dependencies; 2) the baselines generally incur substantial computational costs when processing extended histories. While memory network-based methods (e.g., TGN) manage to handle longer histories with reasonable computational costs, they do not significantly benefit from these extended histories due to issues such as vanishing or exploding gradients; 3) TGFormer consistently benefits from longer inputs, demonstrating its ability to exploit long-term historical records.
Figure 7: The performance of different attention mechanisms (left) and ACM (right) in capturing periodic temporal patterns (both panels on Wikipedia; x-axis: time in days, y-axis: interaction frequency).
We also verify the advantage of the ACM in capturing periodic temporal dependencies. We conduct experiments on Wikipedia to assess the model's effectiveness in capturing periodicity by measuring the interaction frequency of one of the nodes. For clarity, we visualize the learned periodic dependency in Fig. 7, where the x-axis represents the node interaction time divided into days, and the y-axis denotes the node interaction frequency during each period. For comparison with attention mechanisms, the top-9 time delays {δ1, · · · , δ9} of ACM are highlighted within the raw series using red lines. In contrast, the top-9 most similar data points relative to the final time step, as determined by attention mechanisms, are denoted by yellow lines with red stars. As illustrated, while attention mechanisms capture node interaction intent, they fail to capture interaction periodicity effectively. This demonstrates that our model can discover relevant information more comprehensively and accurately.
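The delay-selection behaviour visualized in Fig. 7 rests on the Wiener–Khinchin relation [55]: the autocorrelation of a series is the inverse FFT of its power spectrum, and the dominant delays are the lags with the largest autocorrelation. A hedged numpy sketch of this idea (the helper name and the choice of k are illustrative; the exact scoring and the role of c in Eq. (9) follow the paper):

```python
import numpy as np

def top_delays(series, k):
    """Return the k lags with the highest circular autocorrelation,
    computed via FFT (Wiener-Khinchin), ignoring the trivial lag 0."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    spectrum = np.fft.rfft(x)
    acf = np.fft.irfft(spectrum * np.conj(spectrum), n=len(x))  # autocorrelation
    acf[0] = -np.inf  # exclude the zero delay (a series always matches itself)
    return np.argsort(acf)[::-1][:k]
```

For a series with period p, the returned lags cluster at multiples of p, which is what the red lines in Fig. 7 mark; the FFT route makes the whole step O(L log L) rather than O(L^2).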
6. Conclusion
In this work, we introduced a perspective shift that navigates temporal graph learning toward time series analysis, encapsulated in the proposed TGFormer model. We proposed an auto-correlation mechanism that extends beyond the traditional attention mechanism, capturing long-term temporal dependencies and periodic temporal patterns while boosting computational efficiency and information utilization. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness and efficiency of the proposed TGFormer model.
Acknowledgment
This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LDT23F01012F01, in part by the National Natural Science Foundation of China under Grant 62372146, and in part by the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant GK229909299001-008.
References
[1] C. Li, L. Lin, W. Zuo, J. Tang, M.-H. Yang, Visual tracking via dynamic graph
learning, IEEE transactions on pattern analysis and machine intelligence 41 (11)
(2018) 2770–2782.
[2] M. Ren, Y. Wang, Y. Zhu, K. Zhang, Z. Sun, Multiscale dynamic graph represen-
tation for biometric recognition with occlusions, IEEE Transactions on Pattern
Analysis and Machine Intelligence 45 (12) (2023) 15120–15136.
[3] Y. Li, L. Hou, J. Li, Preference-aware graph attention networks for cross-domain
recommendations with collaborative knowledge graph, ACM Transactions on In-
formation Systems 41 (3) (2023) 1–26.
[4] X. Zhang, P. Jiao, M. Gao, T. Li, Y. Wu, H. Wu, Z. Zhao, Vggm: Variational
graph gaussian mixture model for unsupervised change point detection in dy-
namic networks, IEEE Transactions on Information Forensics and Security 19
(2024) 4272–4284.
dynamic graphs, in: Proceedings of the AAAI conference on artificial intelli-
gence, 2020.
[6] A. Sankar, Y. Wu, L. Gou, W. Zhang, H. Yang, Dysat: Deep neural representation
learning on dynamic graphs via self-attention networks, in: Proceedings of the
13th international conference on web search and data mining, 2020, pp. 519–527.
[7] J. You, T. Du, J. Leskovec, Roland: graph learning framework for dynamic
graphs, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, 2022, pp. 2358–2366.
[8] P. Jiao, H. Chen, H. Tang, Q. Bao, L. Zhang, Z. Zhao, H. Wu, Contrastive repre-
sentation learning on dynamic networks, Neural Networks 174 (2024) 106240.
[10] J. Li, Z. Han, H. Cheng, J. Su, P. Wang, J. Zhang, L. Pan, Predicting path failure
in time-evolving graphs, in: Proceedings of the 25th ACM SIGKDD international
conference on knowledge discovery & data mining, 2019, pp. 1279–1289.
[11] H. Chen, P. Jiao, H. Tang, H. Wu, Temporal graph representation learning with
adaptive augmentation contrastive, in: Joint European Conference on Machine
Learning and Knowledge Discovery in Databases, Springer, 2023, pp. 683–699.
[12] P. Jiao, T. Li, H. Wu, C.-D. Wang, D. He, W. Wang, Hb-dsbm: Modeling the dy-
namic complex networks from community level to node level, IEEE Transactions
on Neural Networks and Learning Systems 34 (11) (2023) 8310–8323.
[13] M. Qin, D.-Y. Yeung, Temporal link prediction: A unified framework, taxonomy,
and review, ACM Computing Surveys 56 (4) (2023) 1–40.
[15] B. Chang, M. Chen, E. Haber, E. H. Chi, Antisymmetricrnn: A dynamical system
view on recurrent neural networks (2019).
[16] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, K. Achan, Inductive representation
learning on temporal graphs, in: International Conference on Learning Represen-
tations, 2020.
[17] A. Souza, D. Mesquita, S. Kaski, V. Garg, Provably expressive temporal graph
networks, Advances in Neural Information Processing Systems 35 (2022) 32257–
32269.
[22] A.-L. Barabasi, The origin of bursts and heavy tails in human dynamics, Nature 435 (2005) 207–211.
[23] C. Chatfield, H. Xing, The analysis of time series: an introduction with R, CRC
press, 2019.
[24] A. Papoulis, S. Unnikrishna Pillai, Probability, random variables and stochastic processes, McGraw-Hill, 2002.
[25] J. Gao, B. Ribeiro, On the equivalence between temporal and static equivariant
graph representations, in: International Conference on Machine Learning, PMLR,
2022, pp. 7052–7076.
[26] K. Zhu, J. Chen, J. Wang, N. Z. Gong, D. Yang, X. Xie, Dyval: Graph-informed
dynamic evaluation of large language models, in: International Conference on
Learning Representations, 2024.
[27] T. Zheng, X. Wang, Z. Feng, J. Song, Y. Hao, M. Song, X. Wang, X. Wang,
C. Chen, Temporal aggregation and propagation graph neural networks for dynamic representation, IEEE Transactions on Knowledge and Data Engineering (2023).
[28] J. Su, D. Zou, C. Wu, Pres: Toward scalable memory-based dynamic graph neural
networks, in: International Conference on Learning Representations, 2024.
[29] Y. Wang, Y. Cai, Y. Liang, H. Ding, C. Wang, S. Bhatia, B. Hooi, Adaptive data
augmentation on temporal graphs, Advances in Neural Information Processing
Systems 34 (2021) 1440–1452.
[30] L. Luo, G. Haffari, S. Pan, Graph sequential neural ode process for link pre-
diction on dynamic and sparse graphs, in: Proceedings of the Sixteenth ACM
International Conference on Web Search and Data Mining, 2023, pp. 778–786.
[31] M. Jin, Y.-F. Li, S. Pan, Neural temporal walks: Motif-aware representation learn-
ing on continuous-time dynamic graphs, Advances in Neural Information Processing Systems 35 (2022).
[33] Y. Wang, Y.-Y. Chang, Y. Liu, J. Leskovec, P. Li, Inductive representation learning
in temporal networks via causal anonymous walks, in: International Conference
on Learning Representations, 2021.
[34] J.-w. Lee, J. Jung, Time-aware random walk diffusion to improve dynamic graph
learning, in: Proceedings of the AAAI Conference on Artificial Intelligence,
Vol. 37, 2023, pp. 8473–8481.
[36] T. Wang, D. Luo, W. Cheng, H. Chen, X. Zhang, Dyexplainer: Explainable dy-
namic graph neural networks, arXiv preprint arXiv:2310.16375 (2023).
[37] L. Yu, L. Sun, B. Du, W. Lv, Towards better dynamic graph learning: New archi-
tecture and unified library, Advances in Neural Information Processing Systems
(2023).
[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural infor-
mation processing systems 30 (2017).
[39] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[40] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[43] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner,
M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16
words: Transformers for image recognition at scale, in: International Conference
on Learning Representations, 2020.
[44] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer:
Hierarchical vision transformer using shifted windows, in: Proceedings of the
IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
[45] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, X. Yan, Enhancing the
locality and breaking the memory bottleneck of transformer on time series fore-
casting, Advances in neural information processing systems 32 (2019).
[46] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer:
Beyond efficient transformer for long sequence time-series forecasting, in: Pro-
ceedings of the AAAI conference on artificial intelligence, Vol. 35, 2021, pp.
11106–11115.
[48] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin, Fedformer: Frequency en-
hanced decomposed transformer for long-term series forecasting, in: International Conference on Machine Learning, PMLR, 2022.
[49] J. Zhang, H. Zhang, C. Xia, L. Sun, Graph-bert: Only attention is needed for
learning graph representations, arXiv preprint arXiv:2001.05140 (2020).
[51] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, T.-Y. Liu, Do trans-
formers really perform badly for graph representation?, Advances in Neural In-
formation Processing Systems 34 (2021) 28877–28888.
[52] Y. Zhong, C. Huang, A dynamic graph representation learning based on temporal
graph transformer, Alexandria Engineering Journal 63 (2023) 359–369.
[53] Y. Wu, Y. Fang, L. Liao, On the feasibility of simple transformer for dynamic
graph modeling (2024).
[54] F. Poursafaei, S. Huang, K. Pelrine, R. Rabbany, Towards better evaluation for dynamic link prediction, Advances in Neural Information Processing Systems 35 (2022) 32928–32941.
[55] N. Wiener, Generalized harmonic analysis, Acta mathematica 55 (1) (1930) 117–
258.
[56] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan,
L. Wang, T. Liu, On layer normalization in the transformer architecture, in: Inter-
national Conference on Machine Learning, PMLR, 2020, pp. 10524–10533.
[57] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Fevry, M. Matena, K. Malkan,
N. Fiedel, N. Shazeer, Z. Lan, et al., Do transformer modifications transfer across
implementations and applications?, in: Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing, 2021.
[59] L. Wang, X. Chang, S. Li, Y. Chu, H. Li, W. Zhang, X. He, L. Song, J. Zhou,
H. Yang, Tcl: Transformer-based dynamic graph modelling via contrastive learning, arXiv preprint arXiv:2105.07944 (2021).
[60] S. Tian, J. Dong, J. Li, W. Zhao, X. Xu, B. Wang, B. Song, C. Meng, T. Zhang,
L. Chen, Sad: Semi-supervised anomaly detection on dynamic graphs, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023.