
Title Page for Pattern Recognition

TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism

Hongjiang Chen1, Pengfei Jiao1, Ming Du1, Xuan Guo2, Zhidong Zhao1, Di Jin2
1Hangzhou Dianzi University, Hangzhou, 310018, China
2Tianjin University, Tianjin, 300072, China

Email addresses:
hchen@[Link] (Hongjiang Chen)
pjiao@[Link] (Pengfei Jiao)
mdu@[Link] (Ming Du)
guoxuan@[Link] (Xuan Guo)
zhaozd@[Link] (Zhidong Zhao)
jindi@[Link] (Di Jin)

Corresponding Author:
Pengfei Jiao,
School of Cyberspace,
Hangzhou Dianzi University,
Hangzhou 310018,
China.
Email: pjiao@[Link]

Present address:
Xiasha Higher Education Zone,
Hangzhou, 310018,
Zhejiang Province,
China.

This preprint research paper has not been peer reviewed. Electronic copy available at: [Link]
Highlights

TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism

Hongjiang Chen, Pengfei Jiao, Ming Du, Xuan Guo, Zhidong Zhao, Di Jin
• We introduce a perspective shift that navigates temporal graph learning toward time series analysis.

• We formulate an auto-correlation mechanism by identifying dependencies and aggregating information at the series level.

• TGFormer achieves state-of-the-art performance compared with the baselines on downstream tasks.

TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism

Hongjiang Chen a, Pengfei Jiao a,∗, Ming Du a, Xuan Guo b, Zhidong Zhao a, Di Jin b
a Hangzhou Dianzi University, Hangzhou, 310018, China
b Tianjin University, Tianjin, 300072, China

Abstract

The growing interest in Temporal Graph Neural Networks (TGNNs) is attributable to their adeptness in modeling complex dynamics while delivering superior performance. However, TGNNs face intrinsic constraints, including long-term temporal dependency and periodic temporal pattern problems. At the same time, the inherent capability of transformer architectures offers a potent solution to these predicaments. Accordingly, we introduce TGFormer, a novel transformer tailored for temporal graphs. The model shifts the traditional temporal graph learning paradigm toward a trajectory aligned with time series analysis. Through this innovative perspective, TGFormer gains the capability to extract node representations from historical interactions, examining node interactions across sequential time points. Further, bolstered by the foundational theory of stochastic processes, we architect an auto-correlation methodology to unearth the periodic dependencies of node interactions. This methodology enables TGFormer to undertake dependency discovery and representation aggregation at the sub-interaction level. Auto-correlation proves to be more efficient and accurate than the widely used attention mechanisms. We conduct extensive experiments on six public datasets, confirming the effectiveness and efficiency of our approach.
Keywords: Temporal Graph, Graph Transformer, Graph Neural Network, Representation Learning

∗ Corresponding author
Email addresses: hchen@[Link] (Hongjiang Chen), pjiao@[Link] (Pengfei Jiao), mdu@[Link] (Ming Du), guoxuan@[Link] (Xuan Guo), zhaozd@[Link] (Zhidong Zhao), jindi@[Link] (Di Jin)

Preprint submitted to Elsevier September 29, 2024

1. Introduction

Recently, representation learning on temporal graphs has emerged as a linchpin in contemporary research discourse. Among various methods, Temporal Graph Neural Networks (TGNNs) have been identified as robust and versatile tools for temporal graph learning, finding efficacious applications in vast fields encompassing community detection [1], protein design [2], recommendation systems [3], and social network analysis [4], among others.

Current TGNN methodologies principally fall into two distinct categories according to the form of the input graphs: discrete-time dynamic graphs (DTDG) [5, 6, 7, 8] and continuous-time dynamic graphs (CTDG) [9, 10, 11, 12]. Recently, there has been a growing preference for the latter models due to their proficiency in capturing fine-grained information and managing temporal intricacies [13]. Following this trend, this paper also focuses on learning on CTDGs.

Despite the impressive results achieved by previous CTDG approaches, they still face challenges. Most of these approaches adopt an interaction-level learning paradigm, limiting their applicability to nodes with fewer interactions. Nodes with longer histories necessitate sampling strategies to truncate interactions, enabling feasible calculations for computationally expensive modules. Subsequent use of sequence models, such as recurrent neural networks (RNN) [14], may face challenges in learning long-term temporal dependencies, mainly due to vanishing or exploding gradients [15]. Moreover, existing methods usually use attention mechanisms [16] as encoders to identify relevant time steps for prediction. However, according to theoretical proofs in [17], attention mechanisms commonly act as persistent low-pass filters, treating high-frequency information as noise and continuously erasing it. Also, the pure attention mechanism cannot recognize periodic temporal patterns, e.g., most people repeating what they did the day before. Consequently, capturing the periodic temporal patterns hidden behind time series data also becomes challenging. In conclusion, previous approaches fail to capture either long-term temporal dependencies or periodic temporal patterns.

Transformers have emerged as a promising approach for addressing the aforementioned challenges in natural language processing [18], computer vision [19], and time series analysis [20]. Recently, graph transformers have seen remarkable advancements. However, the realm of temporal graphs still presents numerous unresolved challenges. Existing models such as DyGFormer [21] utilize transformers and a patching technique to benefit from longer histories, but this integration may lack the necessary refinement. These models do not adequately address the unique intricacies inherent in temporal graphs, such as periodic temporal patterns that are prevalent in real-world scenarios [22]. Instead, they primarily enlarge the transformer's input as a mechanism for enhancing performance. Consequently, designing new transformer architectures that effectively capture and exploit these dependencies is essential for overcoming these challenges.
Building upon these motivations, we conceive a bespoke Transformer variant tailored for CTDGs, fostering model competence in capturing long-term temporal dependencies and periodic temporal patterns. TGFormer takes a sequence of interactions, encodes them as node, link, time, and frequency inputs for the Series Transformer, and adopts an adaptive readout for link prediction. TGFormer scrutinizes node interactions at continuous points in time from a novel series-level perspective, ushering temporal graph learning toward time series analysis. Drawing on stochastic process theory [23, 24], TGFormer supersedes attention mechanisms with an auto-correlation mechanism (ACM), which identifies sub-series similarities based on series regularity and amalgamates similar sub-series from latent periods. This series-wise mechanism ensures O(L log L) complexity for a length-L series, shifting representation aggregation from the point-wise to the sub-series level, fostering the learning of patterns, and capturing periodic temporal patterns. The superiority of our proposal is demonstrated by extensive experiments on six real-world datasets with different characteristics.

The contributions are summarized as follows:

• We introduce an innovative perspective shift that effectively navigates temporal graph learning toward time series analysis. Based on this idea, we propose TGFormer, a pioneering temporal graph transformer.

• The proposed TGFormer explicitly models crucial elements such as node features, link features, temporal information, and nodes' interaction frequency, and effectively fuses these elements.

• We formulate an auto-correlation mechanism that surpasses traditional self-attention methods by identifying dependencies and aggregating information at the series level. Our auto-correlation mechanism captures periodic temporal patterns, enhancing both computational efficiency and information utilization.

• Extensive experimentation is carried out on six widely used real-world temporal graph datasets, under both transductive and inductive settings. Experimental results corroborate the superior performance of TGFormer compared to contemporaneous state-of-the-art methods.
2. Related Work
2.1. Temporal Graph Neural Networks
Over the past few years, the scholarly focus on TGNNs has intensified [25, 26, 27]. Typically, these models treat temporal graph data as event streams, facilitating the direct learning of node representations from continuously occurring interactions. Specifically, existing CTDG models commonly employ an RNN or attention mechanism as their sequence module. For instance, JODIE [10] uses an RNN to combine static and dynamic embeddings for node representation. TGAT [16] extends the graph attention mechanism to learn time-aware representations. Building upon this, some methods incorporate additional techniques like memory networks [28, 9, 29], ordinary differential equations (ODE) [30, 31, 32], and random walks [33, 34, 35] to better learn continuous temporal information. TGN [9], a variant of TGAT, integrates a memory module to effectively track the evolution of node-level features. GSNOP [30] proposes a sequential ODE aggregator, which considers sequential data dependencies and learns the derivative of the neural process; this helps draw better distributions from limited historical information. Meanwhile, CAWN [33] extracts temporal network motifs through set-based anonymous random walks, effectively capturing the network's dynamics. NeurTW [31] enhances spatial and temporal interdependence during anonymous random walks, intertwining continuous evolution with transient activation processes to comprehend the underlying spatiotemporal dynamics. To capture long-term temporal dependencies in dynamic graph interactions, DyExplainer [36] uses a buffer-based live-updating scheme, and DyGFormer [37] uses a Transformer-based architecture with a neighbor co-occurrence encoding scheme and a patching technique.

In a nutshell, most existing TGNNs struggle to manage nodes with longer interaction histories and to effectively capture periodic temporal patterns, due to the prohibitive computational costs of complex modules and optimization challenges such as vanishing or exploding gradients. In this paper, we propose a novel transformer tailored for temporal graphs that captures both long-term temporal dependencies and periodic dependencies, which is achieved by the Series Transformer layer.
2.2. Transformers on Graphs

Transformer [38] is an innovative model that employs the self-attention mechanism (SAM) to handle sequential data, and it has demonstrated significant success across diverse domains, including natural language processing [39, 40, 41], computer vision [42, 43, 44], and time series forecasting [45, 46, 47]. For instance, Autoformer [47] replaces canonical attention with an auto-correlation block to achieve sub-series level attention. Similarly, FEDformer [48] employs a Discrete Fourier Transform-based attention mechanism to capture the global profile, thus achieving linear complexity.

The graph learning community has recently begun to assimilate Transformer models in diverse formulations. For instance, Graph-BERT [49] circumvents the need for message passing by synthesizing global and relative scales of positional encoding (PE). SAN [50] introduces a discerning inductive bias for graph transformers by proposing a flexible, learnable PE grounded in the graph Laplacian domain; this encoding represents a concrete advancement towards encapsulating fundamental graph structural complexities. Graphormer [51], claiming superiority over the 1-WL (Weisfeiler-Leman) test, replaces Laplacian PE with a preference for spatial and node centrality
Figure 1: The overview of TGFormer begins with the extract layer, which employs a direct interaction extractor. Subsequently, the encoder layer generates temporal-aware and structure-aware sequence node representations. These representations are then used to form the time series data placed into the query (Q), key (K), and value (V) matrices in the Series Transformer layer. The Series Transformer layer leverages the Transformer's ability to capture long-term temporal dependencies and the auto-correlation mechanism to uncover inner periodic temporal patterns in the series data. Finally, the decoder layer utilizes adaptive readout for various downstream tasks.

PE. Recently, some applications have appeared in temporal graphs. TGT [52] proposes a transformer-based method to preserve high-order information. SimpleDyG [53] reconceptualizes temporal graphs as sequences fed into the attention mechanism.

Although some studies have explored the potential of transformers for temporal graphs, they have only enhanced the inputs to the transformer and have not proposed a transformer that can accommodate characteristics unique to temporal graphs, such as periodicity. Addressing this, our work architects an auto-correlation methodology to unearth the periodic dependencies of node interactions and to undertake dependency discovery and representation aggregation at the sub-interaction level.

3. Preliminaries
3.1. Temporal Graph
A temporal graph is represented as G = (V, E), where V is the set of nodes and E is the set of node interactions with timestamps. For two nodes u and v, there may exist a sequence of timestamped interactions, formally denoted as E_{u,v} = {(u, v, t_1), (u, v, t_2), · · · , (u, v, t_n)} ⊂ E, where the timestamps are ordered as 0 < t_1 < t_2 < · · · < t_n, indicating that nodes u and v have interacted at least once at each of the corresponding timestamps. Two interacting nodes are referred to as neighbors. The symbols and their definitions are listed in Table 1; this notation will be used in the following sections.
Table 1: Symbols and their definitions

Notation              Description
G                     A temporal graph
V, E                  The node set and link set of G
S_∗^t                 The historical interaction samples of node ∗ before time t, including itself
X_{∗,N}^t             Node ∗'s representation of node features at time t
X_{∗,E}^t             Node ∗'s representation of link features at time t
X_{∗,T}^t             Node ∗'s representation of temporal information at time t
X_{∗,F}^t             Node ∗'s representation of node interaction frequency at time t
F_∗^t                 The number of times a node appears in S_u^t and S_v^t, respectively
X_∗^t                 The combined representation of node ∗ at time t
h_∗^t                 The representation of node ∗ at time t
d_N, d_E, d_T, d_F, d The dimensions of X_{∗,N}^t, X_{∗,E}^t, X_{∗,T}^t, X_{∗,F}^t, and the projected encodings
{X_i^t}_{i=1}^L       A time series
δ                     The time delay
k                     The number of selected series
L                     The extraction length of node historical interactions
3.2. Temporal Link Prediction

Given a pair of nodes and a specific timestamp t, we aim to learn a model that predicts whether the two nodes are connected at t based on all the available historical data. Note that we are not only concerned with predicting links between nodes seen during training; we also expect the model to predict links between nodes that were never seen, for inductive evaluation. For a more reliable comparison, we use the three strategies in [54] (i.e., random, historical, and inductive negative sampling strategies) to comprehensively evaluate the performance of the model on the temporal link prediction task.

3.3. Transformer

A Transformer block relies heavily on a multi-head attention mechanism to learn context-aware representations for sequences, which can be defined as:

O = MultiHead(Q, K, V) = W^O Concat(O_1, O_2, · · · , O_h),   (1)

where W^O is a trainable parameter and h is the number of heads. Each O_i is computed as:

O_i = Attention(Q_i, K_i, V_i) = SoftMax(Q_i K_i^T / √d) V_i,   (2)

where Attention(·, ·, ·) is the scaled dot-product attention. As input to the Transformer, each element in a sequence is represented by an embedding vector. The multi-head attention mechanism injects a positional embedding into the element embeddings to make them aware of element order in the sequence.
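As a concrete illustration, the scaled dot-product attention of Eq. (2) can be sketched in a few lines of NumPy. This is a generic single-head sketch with toy sizes and random inputs of our choosing, not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Eq. (2): SoftMax(Q K^T / sqrt(d)) V, for a single head.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (L, L) similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # row-wise SoftMax
    return w @ V                                         # (L, d) output

rng = np.random.default_rng(0)
L, d = 6, 4
X = rng.normal(size=(L, d))   # a toy length-L sequence of embeddings
O = scaled_dot_product_attention(X, X, X)
print(O.shape)                # (6, 4)
```

Note that every output row is a convex combination of the value rows, which is exactly the point-wise aggregation that the ACM later replaces with series-wise aggregation.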

4. Proposed Method

The framework of our TGFormer is shown in Fig. 1, which employs a Series Transformer as the backbone. Given an interaction (u, v, t), we first extract the historical interactions of source node u and destination node v before timestamp t, obtaining two interaction sequences S_u^t and S_v^t. Next, we compute the encodings of neighbors, links, time intervals, and interaction frequency for each sequence. Then, we concatenate the encodings of each sequence and feed them into the Series Transformer to capture long-term temporal dependencies and periodic dependencies. Finally, an adaptive readout function is applied to the Transformer's outputs to derive time-aware representations of u and v at timestamp t (i.e., h_u^t and h_v^t), which can be applied in various downstream tasks such as temporal link prediction.

4.1. Extract Layer

To predict the connection between nodes u and v at a specific timestamp t, we first affix each node with a self-loop at time t. This maneuver aims to enhance the correlation of the node with its inherent characteristics. Subsequently, we extract L interactions for each node, ordered by temporal proximity; a node may interact with the same neighbor at different timestamps. If a node has fewer than L historical neighbors, zero-padding is employed to reconcile the deficit. This process transmutes the intricate problem of temporal graph learning into a more comprehensible paradigm: a time series analysis problem. Mathematically, given the predictive interaction (u, v, t), for the source node u and the target node v we extract the series of interactions involving u and v, expressed as S_u^t = {(u, u′, t′) | t′ < t} ∪ {(u, u, t)} and S_v^t = {(v, v′, t′) | t′ < t} ∪ {(v, v, t)}, respectively. Note that (u, u′, t′) ∈ E, (v, v′, t′) ∈ E, and |S_u^t| = |S_v^t| = L.
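The extraction step above can be sketched as follows, under our own simplifying assumptions: the edge set is a list of (src, dst, timestamp) tuples, and placeholder self-loops at time 0 stand in for zero-padding. The helper name is hypothetical:

```python
def extract_interactions(edges, node, t, L):
    # Gather node's interactions before time t, append a self-loop at t,
    # keep the L most recent, and left-pad with placeholder self-loops at
    # time 0.0 (a stand-in for the paper's zero-padding).
    hist = [(u, v, ts) for (u, v, ts) in edges
            if (u == node or v == node) and ts < t]
    hist.sort(key=lambda e: e[2])            # chronological order
    hist.append((node, node, t))             # self-loop at prediction time
    hist = hist[-L:]                         # at most L interactions
    pad = [(node, node, 0.0)] * (L - len(hist))
    return pad + hist

edges = [('u', 'a', 1.0), ('u', 'b', 2.0), ('b', 'u', 3.0), ('u', 'c', 9.0)]
S_u = extract_interactions(edges, 'u', t=5.0, L=5)
print(len(S_u))   # 5
```

The interaction at t = 9.0 is excluded because it happens after the prediction time, and one padding entry fills the sequence up to length L.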
4.2. Encoder Layer

In this section, we detail how to convert the extracted interactions from temporal graph learning into time series analysis. These interactions, construed as temporally continuous events, embody four encodings: node, link, time, and frequency. Together, these constitute the comprehensive representation X_∗^t, where ∗ refers to either node u or v.
4.2.1. Node/Edge Encoding.

In temporal graphs, both vertices (nodes) and interactions (edges or links) often harbor concomitant features. To delineate the embeddings allied with interactions, we harvest the inherent traits of the proximate nodes and edges in the sequence S_∗^t. Aligning our methodology with established approaches in the literature [9, 21], we adopt a schema to encode nodes and links correspondingly as X_{∗,N}^t ∈ R^{L×d_N} and X_{∗,E}^t ∈ R^{L×d_E}, wherein d_N and d_E typify the dimensions of the node and edge embeddings, respectively. The specific schema is defined as:

X_{∗,N}^t = MLP(x_{∗,N}^t),   X_{∗,E}^t = MLP(x_{∗,E}^t),   (3)

where x_{∗,N}^t and x_{∗,E}^t represent the features of the nodes that node ∗ interacts with and the features of the corresponding links in S_∗^t. If the original features do not exist, they are all set to zero.

4.2.2. Time Encoding.

Our time-encoding schema deploys a cosine function, denoted as cos(∆tω), where ω = {α^{−(i−1)/β}}_{i=1}^{d_T}. This function translates static timestamps into vectors X_{∗,T}^t ∈ R^{L×d_T}. Here, d_T signifies the dimensionality of the time embeddings, and α and β function as adjustable parameters. Notably, these parameters ensure that the value of t_max × α^{−(i−1)/β} converges to zero as i approaches d_T. The time encoding can thus be defined as:

X_{∗,T}^t = [cos(∆t_1 ω), cos(∆t_2 ω), · · · , cos(∆t_L ω)]^T.   (4)

In Eq. (4), we adopt relative timestamps for encoding, in preference to absolute ones. In other words, if an interaction occurred at timestamp t_0 and we are to predict an interaction at a specific timestamp t, we compute ∆t = t − t_0. Importantly, ω is held constant throughout the training phase, thereby accelerating model optimization. Furthermore, the adoption of relative time encoding is instrumental in discerning repetitive temporal patterns: it inspects the temporal gaps between interactions, enabling the construction of a similarity index for comparable timestamps and contributing to precise temporal differentiation.
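Eq. (4) can be sketched numerically as follows; the values of α, β, and d_T here are illustrative defaults of our choosing, not the paper's settings:

```python
import numpy as np

def time_encoding(timestamps, t, d_T=8, alpha=10.0, beta=2.0):
    # Eq. (4): cos(Δt · ω) with ω_i = alpha^{-(i-1)/beta}, i = 1..d_T.
    i = np.arange(1, d_T + 1)
    omega = alpha ** (-(i - 1) / beta)             # fixed (untrained) frequencies
    dt = t - np.asarray(timestamps, dtype=float)   # relative timestamps Δt
    return np.cos(dt[:, None] * omega[None, :])    # shape (L, d_T)

X_T = time_encoding([1.0, 2.0, 3.0], t=5.0)
print(X_T.shape)   # (3, 8)
```

Because ω decays geometrically across dimensions, the first dimensions oscillate quickly (distinguishing nearby timestamps) while the later ones vary slowly (distinguishing distant ones).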

4.2.3. Frequency Encoding.

Most existing methods undervalue the potential for re-interactions with historical nodes and overlook inherent node correlations. We address this gap via a paradigm shift: the introduction of a node interaction frequency encoding technique. This novel approach posits that the frequency of a node's occurrence within a historical sequence signifies its relevance. Our method goes beyond analyzing merely historical interaction sequences and incorporates the frequency of both interaction node occurrence and node pair interaction. This judicious approach effectively captures correlations between two nodes' common interaction nodes within their respective historical interaction sequences.

As illustrated in Fig. 1, we calculate the frequency of each interaction in both S_u^t and S_v^t, represented as F_u^t ∈ R^{L×2} and F_v^t ∈ R^{L×2}, respectively, with the first column denoting the number of times the node appears in S_u^t and the second column denoting the number of times the node appears in S_v^t. Specifically, as shown in Fig. 1, for the interaction nodes {u, b, b} and {v, d, b} appearing in S_u^t and S_v^t, respectively, F_u^t = [[1, 2, 2], [0, 1, 1]] and F_v^t = [[0, 0, 2], [1, 1, 1]]. We then encode the frequency of the node pair interaction, deriving node interaction frequency features for u and v, denoted by X_{∗,F}^t ∈ R^{L×d_F}, wherein d_F is the dimension of the frequency embedding. This encoding manifests mathematically as:

X_{∗,F}^t = f(F_∗^t[:, 0]) + f(F_∗^t[:, 1]).   (5)

We implement a two-layer MLP with ReLU activation, designated f(·), to encode the nodes' frequency interaction features.

The frequency encoding fortifies our hypothesis that nodes with a high frequency of shared historical interaction nodes are likelier to interact in the future. Concurrently, it accommodates the variability of recurrence patterns apparent across different network domains and structures.
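The counting step behind Eq. (5) can be reproduced on the worked example above. This sketch only builds the count matrices F_u^t and F_v^t; the two-layer MLP f(·) is omitted, and the helper name is ours:

```python
def interaction_frequency(seq, S_u, S_v):
    # For each interacted node in `seq`, count its appearances in S_u^t
    # (first row) and in S_v^t (second row).
    return [[S_u.count(n) for n in seq],
            [S_v.count(n) for n in seq]]

S_u, S_v = ['u', 'b', 'b'], ['v', 'd', 'b']
F_u = interaction_frequency(S_u, S_u, S_v)
F_v = interaction_frequency(S_v, S_u, S_v)
print(F_u)   # [[1, 2, 2], [0, 1, 1]]
print(F_v)   # [[0, 0, 2], [1, 1, 1]]
```

The shared neighbor b contributes nonzero counts to both matrices, which is exactly the common-neighbor signal the encoding is designed to expose.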

4.2.4. Transition to Time Series Analysis.

Transitioning next to the realm of time series analysis, we employ the encodings formulated above. Each encoding is projected with a trainable weight W_⋄ ∈ R^{d_⋄×d} and bias b_⋄ ∈ R^d to yield embeddings X_{u,⋄}^t ∈ R^{L×d} and X_{v,⋄}^t ∈ R^{L×d}. It is imperative to note that the symbol ⋄ stands for N, E, T, or F. These alignments can be mathematically expressed as:

X_{u,⋄}^t = X_{u,⋄}^t W_⋄ + b_⋄ ∈ R^{L×d},
X_{v,⋄}^t = X_{v,⋄}^t W_⋄ + b_⋄ ∈ R^{L×d}.   (6)

We then concatenate these encodings, which can be calculated as follows:

X_u^t = X_{u,N}^t || X_{u,E}^t || X_{u,T}^t || X_{u,F}^t ∈ R^{L×4d},
X_v^t = X_{v,N}^t || X_{v,E}^t || X_{v,T}^t || X_{v,F}^t ∈ R^{L×4d}.   (7)

Following this, we treat X_u^t and X_v^t as time series of length L, thus positioning us for a comprehensive analysis of the time series.
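A shape-level sketch of Eqs. (6)-(7), using random stand-ins for the trainable parameters and illustrative per-encoding dimensions of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4                                # toy sizes; d is the shared dim
dims = {'N': 5, 'E': 3, 'T': 8, 'F': 2}    # illustrative d_N, d_E, d_T, d_F

parts = []
for d_k in dims.values():
    X = rng.normal(size=(L, d_k))          # one encoding, shape (L, d_k)
    W = rng.normal(size=(d_k, d))          # stand-in for trainable W_⋄
    b = rng.normal(size=d)                 # stand-in for trainable b_⋄
    parts.append(X @ W + b)                # Eq. (6): project to (L, d)

X_u = np.concatenate(parts, axis=1)        # Eq. (7): (L, 4d) time series
print(X_u.shape)                           # (6, 16)
```

Each encoding is projected to a common width d before concatenation, so the resulting series has 4d channels regardless of the raw feature dimensions.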
Figure 2: Attention (left) and ACM (right). We utilize the Fast Fourier Transform (FFT) to calculate the ACM, which reflects the time-delay similarities.

4.3. Series Transformer Layer

In this section, we describe the Series Transformer layer, including the auto-correlation mechanism and the feed-forward network.
4.3.1. Auto-Correlation Mechanism.

As depicted in Fig. 2, we propose an ACM that utilizes node interaction sequences with series-wise connections to optimize information usage. The ACM uncovers period-based dependencies by calculating the auto-correlation of series and amalgamating similar sub-sequences through time delay aggregation.

Period-based Dependencies. Inspired by the theory of stochastic processes, we observe that identical phase positions across periods naturally dictate similar sub-processes. Thus, we treat each interaction sequence as a discrete-time process {X_i^t}_{i=1}^L, where X_i^t is the i-th row of X_∗^t, i.e., the feature of the i-th sampled interaction. Its auto-correlation, R_{X,X}(δ), can be computed as follows:

R_{X,X}(δ) = lim_{L→∞} (1/L) Σ_{i=1}^{L} X_i^t X_{i−δ}^t.   (8)

In this equation, R_{X,X}(δ) mirrors the time-delay similarity between X_i^t and its lag series, X_{i−δ}^t. As shown in Fig. 3, we employ the auto-correlation R(δ) as the unnormalized confidence of the estimated period length δ. We then select the k most probable period lengths, δ_1, · · · , δ_k. The resulting period-based dependencies, drawn from the aforementioned estimated periods, can be weighted by the corresponding auto-correlation.

Figure 3: Time-delay aggregation block. R(δ) reflects the time-delay similarities. The similar sub-processes are rolled to the same index based on the selected delay δ and aggregated by R(δ).

Time Delay Aggregation. Building on these period-based dependencies, it follows that these dependencies interconnect the sub-sequences across the projected periods. In response, we introduce the time delay aggregation block (illustrated in Fig. 3), which rolls the series based on the selected time delays, δ_1, · · · , δ_k. This operation aligns similar sub-series onto the same phase position of the estimated periods, a technique that contrasts with the point-wise dot-product aggregation used by the attention family. Lastly, we aggregate the sub-sequences with the softmax-normalized confidences.

We initiate our discourse with a single-head scenario involving a time series X_i^t of length L. To analogize attention, we use the symbols Q = K = V = X_∗^t, which are derived from the encoder layer; the ACM can thus seamlessly interchange with the attention mechanism. The mechanism incorporates several components, with the initial process given by:

δ_1, · · · , δ_k = argTopk_{δ∈{1,··· ,L}} (R_{Q,K}(δ)),   (9)

where argTopk(·) retrieves the arguments of the Topk auto-correlations and k = ⌈c × log L⌉. Here, c represents a hyper-parameter. We then normalize these auto-correlations using a SoftMax function, as shown in the equation:

R̂_{Q,K}(δ_1), · · · , R̂_{Q,K}(δ_k) = SoftMax(R_{Q,K}(δ_1), · · · , R_{Q,K}(δ_k)).   (10)
ot

Finally, the ACM culminates with the equation:

k
tn

X
ACM(Q, K, V) = Roll(V, δi )b
RQ,K (δi ), (11)
i=1

where Roll(Xt , δ) signifies an operation on Xt with time delay δ. This operation rein-
states any elements shifted beyond the initial position at the end of the sequence.
rin

Advancing to the multi-head version employed in auto-correlation, we handle hidden variables with 4d channels and h heads. For the i-th head, the query, key, and value are designated as Q_i, K_i, V_i ∈ R^{L×(4d/h)}, i ∈ {1, · · · , h}. The operation then results in:

O = MultiHead(Q, K, V) = W^O Concat(O_1, O_2, · · · , O_h),        (12)


where O_i = ACM(Q_i, K_i, V_i) from Eq. (11), and W^O is a trainable projection matrix.
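The multi-head wiring of Eq. (12) can be sketched as follows; `head_mixer` is a hypothetical stand-in for the per-head ACM, and right-multiplying by `w_o` on the channel dimension is our assumption about how W^O is applied:

```python
import numpy as np

def multi_head_acm(x, h, w_o, head_mixer):
    """Eq. (12): split the D channels into h heads, mix each, concat, project.
    x: (L, D) with D divisible by h; w_o: (D, D); head_mixer: per-head ACM stand-in."""
    heads = np.split(x, h, axis=1)               # Q_i = K_i = V_i per head, (L, D/h)
    outs = [head_mixer(q, q, q) for q in heads]  # O_i = ACM(Q_i, K_i, V_i)
    return np.concatenate(outs, axis=1) @ w_o    # project Concat(O_1, ..., O_h) with W^O
```

With an identity mixer and an identity projection, the output reduces to the input, which makes the plumbing easy to verify.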

Efficient Computation. Period-based dependencies are inherently sparse: they link only to sub-processes at identical phase positions within the estimated periods, so prioritizing the most probable delays prevents the selection of opposite phases. The mechanism therefore aggregates O(log L) series, each of length L, and the complexity of Eq. (11) and Eq. (12) is O(L log L), illustrating its computational efficiency.
Auto-correlation calculation (Eq. (8)) hinges on the Fast Fourier Transform (FFT), according to the Wiener–Khinchin theorem [55], for a time series {X_t^i}_{i=1}^{L}. The resultant R_{X,X}(δ) can be computed via the following integral equations:

S_{X,X}(f) = F(X_t^i) F^*(X_t^i) = ∫_{−∞}^{∞} X_t^i e^{−2πifi} di ( ∫_{−∞}^{∞} X_t^i e^{−2πifi} di )^*,        (13)

R_{X,X}(δ) = F^{−1}( S_{X,X}(f) ) = ∫_{−∞}^{∞} S_{X,X}(f) e^{2πifδ} df,        (14)
where δ belongs to the set {1, · · · , L}, F signifies FFT, its inverse is F −1 , and ∗ denotes
the conjugate operation. By using FFT, the auto-correlation of all lags in {1, · · · , L}
can be computed concurrently, enhancing computational efficiency with complexity at
O(L log L).
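The discrete analogue of Eqs. (13)–(14) is easy to check numerically: the FFT route yields the same circular auto-correlations as a direct O(L²) computation, but for all lags at once:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
L = len(x)

# Direct circular auto-correlation: R(delta) = sum_i x[i] * x[(i + delta) mod L]
direct = np.array([np.dot(x, np.roll(x, -d)) for d in range(L)])

# Wiener-Khinchin route: S(f) = F(x) F*(x), then R = F^{-1}(S); O(L log L) overall
s = np.fft.fft(x) * np.conj(np.fft.fft(x))
via_fft = np.fft.ifft(s).real
```

The two arrays agree to floating-point precision, which is exactly why all lags in {1, · · · , L} can be scored concurrently.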

4.3.2. Feed-Forward Network.


Our Series Transformer layer emanates from the paradigm of the conventional Transformer encoder, as illustrated in [38]. A unique adaptation within our implementation is the placement of the layer normalization (LN) before, rather than after [56], the multi-head ACM and the feed-forward blocks (FFN). This strategic alteration, now prevalent amongst contemporary Transformer implementations, ensures more effective optimization, as advocated by [57].


Particularly in the context of the FFN sub-layer, we maintain a uniform dimensionality of the input, output, and inner layers, matching the dimensional structure of 4d. The operational functionality of the Series Transformer layer can thus be formally expressed as follows:

Z_*^{′(j)} = MultiHead(Q, K, V) + Z_*^{(j−1)},        (15)


Z_*^{(j)} = FFN(LN(Z_*^{′(j)})) + Z_*^{′(j)}.        (16)

where Q, K, V all equal Z_*^{(j−1)}. The input of the first layer is Z_*^{(0)} = X_*^t ∈ R^{L×4d}, and the output of the J-th layer is denoted by H_*^t = Z_*^{(J)} ∈ R^{L×4d}, where 1 ≤ j ≤ J indexes the Transformer layers.
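A minimal pre-LN layer following Eqs. (15)–(16); the `mixer` argument is a hypothetical stand-in for the multi-head ACM, and the bias-free FFN weights are illustrative:

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def ffn(z, w1, w2):
    # Input, inner, and output dimensions all kept at the model width (4d in the paper)
    return np.maximum(z @ w1, 0.0) @ w2

def series_transformer_layer(z, mixer, w1, w2):
    """One layer: Eq. (15) residual mixing, then Eq. (16) pre-LN feed-forward."""
    z_mid = mixer(z, z, z) + z                     # Z'^(j) = MultiHead(Q, K, V) + Z^(j-1)
    return ffn(layer_norm(z_mid), w1, w2) + z_mid  # Z^(j) = FFN(LN(Z'^(j))) + Z'^(j)
```

The residual paths guarantee that the layer preserves the (L, 4d) shape of its input.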

4.4. Adaptive Readout.

For the output matrix H_*^t ∈ R^{L×4d} of a node, H^t_{*,1} ∈ R^{1×4d} is the token representation of the node interacting with itself, and H^t_{*,l} ∈ R^{1×4d} is its l-th event representation. We calculate the normalized attention coefficient for its l-th event:
λ_l = exp((H^t_{*,1} || H^t_{*,l}) W_a^T) / Σ_{i=2}^{L} exp((H^t_{*,1} || H^t_{*,i}) W_a^T),        (17)
where Wa ∈ R1×8d denotes the learnable projection and l = 2, · · · , L. Therefore, the
readout function takes the correlation between each event and the node representation
into account. The node representation is finally aggregated as follows:
h^t_* = H^t_{*,1} + Σ_{l=2}^{L} λ_l H^t_{*,l}.        (18)
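A NumPy sketch of Eqs. (17)–(18); the max-subtraction inside the softmax is a standard numerical-stability trick we add, not part of the paper's formula:

```python
import numpy as np

def adaptive_readout(h, w_a):
    """h: (L, D) token matrix, row 0 = self-interaction token; w_a: (2D,) projection.
    Returns the node representation h_*^t of Eq. (18)."""
    scores = np.array([np.concatenate([h[0], h[l]]) @ w_a for l in range(1, len(h))])
    scores -= scores.max()                       # stable softmax
    lam = np.exp(scores) / np.exp(scores).sum()  # attention coefficients, Eq. (17)
    return h[0] + lam @ h[1:]                    # Eq. (18)
```

With a zero projection, the coefficients become uniform and the readout degenerates to the self token plus the mean of the event tokens.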

4.5. Loss Function

For the link prediction loss, we adopt the binary cross-entropy loss function, which is defined as:

L_p = − Σ_{i=1}^{S} ( y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ),        (19)

ŷ = Softmax(MLP(ReLU(MLP(h^t_u || h^t_v)))),        (20)

where y_i represents the ground-truth label of the i-th sample and ŷ_i is the prediction computed from the two nodes' representations.
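Eq. (19) is the standard summed binary cross-entropy; a minimal sketch, with a clipping constant added by us to avoid log(0):

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """L_p of Eq. (19): summed binary cross-entropy over S samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```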

4.6. Complexity Analysis

To analyze the time complexity of TGFormer, we first present its pseudo-code in Algorithm 1. For simplicity, the representation dimension for both input and hidden features is denoted by d, and M = |E| signifies the number of temporal edges. The sampling process responsible for acquiring the L most recent neighbors achieves a time complexity of O(1). Modules can be executed in parallel within the encoder layer,
Algorithm 1 Training pipeline for TGFormer
Input: A temporal graph G, a node pair (u, v) with a specific timestamp t, the neighbor sample number L, a maximum of 200 training epochs, and an early stopping strategy with patience = 20.
Output: The probability of the node pair interacting at timestamp t.
1: initialize patience = 0;
2: for training epoch = 1, 2, 3, . . . do
3:    Acquire the L most recent first-hop interaction neighbors of nodes u and v from G prior to timestamp t as S_u^t and S_v^t;
4:    for S_u^t and S_v^t in parallel do
5:       Obtain node encoding X^t_{*,N} and edge encoding X^t_{*,E} from Eq. (3);
6:       Obtain time encoding X^t_{*,T} from Eq. (4);
7:       Obtain frequency encoding X^t_{*,F} from Eq. (5);
8:       X_*^t ← X^t_{*,N} || X^t_{*,E} || X^t_{*,T} || X^t_{*,F};
9:       Initialize Z_*^{(0)} ← X_*^t;
10:      for Series Transformer layer j do
11:         Q, K, V ← Z_*^{(j−1)};
12:         Z_*^{′(j)} ← ACM(Q, K, V) + Z_*^{(j−1)};
13:         Z_*^{(j)} ← FFN(LN(Z_*^{′(j)})) + Z_*^{′(j)};
14:      end for
15:      Adaptive readout with Eq. (18);
16:   end for
17:   Conduct link prediction with Eq. (20);
18:   Compute loss L_p with Eq. (19);
19:   if the current epoch's metrics are worse than the previous epoch's then
20:      patience ← patience + 1;
21:   else
22:      Save the model parameters from the current epoch;
23:   end if
24:   if patience = 20 then
25:      Exit the training process;
26:   end if
27: end for

where the encoder complexity reaches a maximum of O(dL). Furthermore, the Series Transformer layer demands significant computational resources; however, we mitigate its time complexity through the use of subsequences (as mentioned in Section 4.3.1), reducing it to O(dL log L). Consequently, the overall time complexity of TGFormer is at most O(MdL log L).


Table 2: Statistics of the datasets.
Datasets Domains #Nodes #Links #Node & Link Features Bipartite Time Granularity Duration

Wikipedia Social 9,227 157,474 – & 172 True Unix timestamp 1 month
Reddit Social 10,984 672,447 – & 172 True Unix timestamp 1 month
LastFM Interaction 1,980 1,293,103 –&– True Unix timestamp 1 month
Enron Social 184 125,235 –&– False Unix timestamp 3 years
UCI Social 1,899 59,835 –&– False Unix timestamp 196 days
CollegeMsg Social 1,899 59,835 –&– False Unix timestamp 193 days

5. Experiments

5.1. Datasets

We evaluate the performance of TGFormer on temporal link prediction tasks utilizing six publicly available temporal graph datasets: Wikipedia, Reddit, LastFM, Enron, UCI, and CollegeMsg. The statistical characteristics of these datasets are comprehensively detailed in Table 2. Further elaboration on the datasets is provided below.

• Wikipedia1 comprises a bipartite interaction graph that chronicles user edits on


Wikipedia pages over a one-month period. In this dataset, nodes represent both
users and pages, while links denote editing actions, each associated with a times-
tamp and a 172-dimensional LIWC (Linguistic Inquiry and Word Count) feature.
Additionally, dynamic labels are provided to indicate users who faced temporary
bans from the platform.

• Reddit2 consists of a bipartite graph monitoring user posts on Reddit for one
month. Here, nodes represent users and subreddits, whereas links denote times-

tamped posting requests, each accompanied by a 172-dimensional LIWC feature.


This dataset also includes dynamic labels indicating whether users were banned
from posting.

• LastFM3 contains a bipartite dataset that records user interactions with songs
over a one-month period. Nodes represent users and songs, while links denote
the listening behaviors of users.

1 [Link]
2 [Link]
3 [Link]
• Enron4 documents email communications among employees of the ENRON en-

ergy corporation over a three-year period. This dataset does not include node
attributes or link features.

• UCI5 comprises an online communication network where nodes represent uni-


versity students, and links denote messages exchanged among them. This dataset
is particularly pertinent for scholarly examinations of virtual interactions within

student communities.

• CollegeMsg6 represents an online social network at the University of Califor-

nia, depicting user interactions through private messages exchanged at various
timestamps. This dataset does not include node labels or edge features.

5.2. Baselines
er
Our model is compared with nine state-of-the-art methods on temporal graphs.
They are based on graph convolutions, memory networks, random walks, sequential
models, and transformer mechanisms. Here, we briefly introduce the mechanisms of
these methods for our assessment.

• JODIE [14] utilizes two interconnected recurrent neural networks to update



user and item states, enhancing precision in capturing intricate temporal pat-
terns within user-item interactions. The incorporation of a projection operation
enables accurate learning of future representation trajectories.

• DyRep [58] introduces a recurrent architecture that systematically updates node


states after each interaction within dynamic graphs. Furthermore, it incorpo-
rates a temporal-attentive aggregation module, enabling the model to consider

the evolving structural information in dynamic graphs over time. This dual-
component framework provides a consistent and effective approach to capture
intricate temporal dynamics and evolving graph structures in dynamic networks.

4 [Link] ./enron/
5 [Link]
6 [Link]
• TGAT [16] utilizes self-attention to compute node representations, aggregat-

ing features from temporal neighbors and employing a time encoding function
for capturing nuanced temporal patterns. This concise approach ensures pre-
cise analysis of evolving graph structures and temporal dynamics in dynamic
networks.

• TGN [9] utilizes an evolving memory system for each node, updated upon node

interactions through the message function, message aggregator, and memory up-
dater mechanisms. Simultaneously, an embedding module is employed to gener-
ate temporal node representations, ensuring a dynamic and comprehensive anal-

ysis of evolving graph structures in complex networks.

• CAWN [33] initiates its process by extracting multiple causal anonymous walks
for each node, facilitating an in-depth examination of the network dynamics’
causality and the establishment of relative node identities. Following this, RNN
is utilized to encode each individual walk. The encoded walks are subsequently
aggregated to synthesize the final node representation.

• EdgeBank [54] stands as a purely memory-based approach for transductive tem-


poral link prediction, distinguished by its absence of trainable parameters. The

method operates by storing observed interactions within a dedicated memory


unit, which is then meticulously updated using diverse strategies. This consistent
reliance on memory and strategic updates forms the core of EdgeBank, ensuring

a robust and parameter-free framework for precise and transductive predictions


in temporal link prediction tasks.

• TCL [59] initiates its methodology by generating interaction sequences for each

node via a breadth-first search algorithm applied to the temporal dependency


interaction sub-graph. Subsequently, it employs a graph transformer that simul-
taneously considers graph topology and temporal information to derive node rep-

resentations. Furthermore, TCL integrates a cross-attention mechanism to model


the complex interdependencies between paired interaction nodes effectively.

• GraphMixer [21] adopts a fixed time encoding function, surpassing its trainable
counterpart in performance. It incorporates this function into an MLP-Mixer-

based link encoder to learn from temporal links efficiently. The framework uti-
lizes neighbor mean-pooling in the node encoder for a concise summarization of
node features.

• DyGFormer [37] introduces a Transformer-based architecture enhanced by a


neighbor co-occurrence coding scheme, effectively capturing node correlations

within interactions. Also, it employs patching techniques to enable the model to
capture long-term temporal dependencies.

5.3. Evaluation Tasks and Metrics

We closely follow [37] by evaluating the model performance for temporal link

prediction, which entails forecasting the probability of a link formation between two
specified nodes at a given timestamp. This evaluation is conducted under two distinct
settings: the transductive setting, which aims to predict future links among nodes ob-
served during the training phase, and the inductive setting, which endeavors to predict
future links involving previously unseen nodes. To this end, a multi-layer perceptron
is utilized, which takes the concatenated representations of the two nodes as input and
outputs the likelihood of a link. Evaluation metrics employed include Average Preci-

sion (AP) and Area Under the Receiver Operating Characteristic Curve (AUC-ROC).
For the node classification task, AUC-ROC is adopted as the performance metric. All
results presented are the aggregate of 10 independent experimental runs.

To provide more comprehensive comparisons, we adopt three negative sampling


strategies in [54] for evaluating temporal link prediction, including random, historical,
and inductive negative sampling strategies, where the latter two strategies are more

challenging. Specifically, the three distinct negative sampling strategies are precisely
defined as follows: 1) Random Negative Sampling Strategy, where negative edges are
randomly chosen from virtually all conceivable node pairs within the graphs. 2) Histor-
ical Negative Sampling Strategy, entailing the sampling of negative edges from the set

of edges observed in preceding timestamps but absent in the current step. 3) Inductive
Negative Sampling Strategy, involving the selection of negative edges from previously
unseen edges that were not encountered during the training phase. Please refer to [54]
for more details. For all tasks, each dataset is chronologically split into 70%, 15%,

and 15% for training, validation, and testing, respectively. The hyperparameter config-
urations for baseline models adhere to those meticulously outlined in their respective
publications, which followed [37].
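The chronological split and the three negative sampling strategies can be sketched as follows; the set-based reading of the strategies is our illustrative interpretation of [54], not its exact implementation:

```python
def chronological_split(edges, train_frac=0.70, val_frac=0.15):
    """70%/15%/15% chronological split; each edge is a (u, v, t) tuple."""
    edges = sorted(edges, key=lambda e: e[2])
    n = len(edges)
    a = int(n * train_frac)
    b = int(n * (train_frac + val_frac))
    return edges[:a], edges[a:b], edges[b:]

def negative_candidates(strategy, all_pairs, past_pairs, current_pairs, train_pairs):
    """Candidate negative node pairs at the current step under each strategy."""
    if strategy == "random":          # any pair not interacting now
        return all_pairs - current_pairs
    if strategy == "historical":      # seen at earlier timestamps, absent now
        return past_pairs - current_pairs
    if strategy == "inductive":       # seen during evaluation but never during training
        return (past_pairs - train_pairs) - current_pairs
    raise ValueError(strategy)
```

The historical and inductive pools are much smaller and harder than the random pool, which is why performance drops under those strategies.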

5.4. Experimental Settings

All models undergo a training regimen of up to 100 epochs, incorporating the early

stopping strategy with a patience parameter set to 10. The model achieving optimal per-
formance on the validation set is selected for subsequent testing. The Adam optimizer
is uniformly employed across all models, with the supervised binary cross-entropy loss as the objective function, a consistent learning rate of 0.00001, and a batch size of 200. We conduct our experiments on a machine with an Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz, 256 GiB RAM, and four NVIDIA 3090 GPU cards; all models are implemented in Python 3.9 with PyTorch.
Table 3: AP(%) for transductive temporal link prediction with random, historical, and inductive negative
sampling strategies. NSS is the abbreviation of Negative Sampling Strategies.
NSS Datasets JODIE DyRep TGAT TGN CAWN EdgeBank TCL GraphMixer DyGFormer TGFormer
Wikipedia 96.50 ± 0.14 94.86 ± 0.06 96.94 ± 0.06 98.45 ± 0.06 98.76 ± 0.03 90.37 ± 0.00 96.47 ± 0.16 97.25 ± 0.03 99.03 ± 0.02 99.79 ± 0.10
Reddit 98.31 ± 0.14 98.22 ± 0.04 98.52 ± 0.02 98.63 ± 0.06 99.11 ± 0.01 94.86 ± 0.00 97.53 ± 0.02 97.31 ± 0.01 99.22 ± 0.01 99.86 ± 0.01
LastFM 70.85 ± 2.13 71.92 ± 2.21 73.42 ± 0.21 77.07 ± 3.97 86.99 ± 0.06 79.29 ± 0.00 67.27 ± 2.16 75.61 ± 0.24 93.00 ± 0.12 96.87 ± 0.34
Random Enron 84.77 ± 0.30 82.38 ± 3.36 71.12 ± 0.97 86.53 ± 1.11 89.56 ± 0.09 83.53 ± 0.00 79.70 ± 0.71 82.25 ± 0.16 92.47 ± 0.12 97.14 ± 0.72
UCI 89.43 ± 1.09 65.14 ± 2.30 79.63 ± 0.70 92.34 ± 1.04 95.18 ± 0.06 76.20 ± 0.00 89.57 ± 1.63 93.25 ± 0.57 95.79 ± 0.17 99.09 ± 0.17
CollegeMsg 75.41 ± 1.82 56.92 ± 6.03 80.27 ± 0.29 92.62 ± 0.99 95.86 ± 0.06 76.42 ± 0.00 83.64 ± 0.10 93.04 ± 0.29 95.79 ± 0.02 99.17 ± 0.22
Avg. Rank 7.17 8.5 7.17 4.5 2.83 7.83 7.83 6 2.17 1

Wikipedia 83.01 ± 0.66 79.93 ± 0.56 87.38 ± 0.22 86.86 ± 0.33 71.21 ± 1.67 73.35 ± 0.00 89.05 ± 0.39 90.90 ± 0.10 82.23 ± 2.54 92.47 ± 0.19
Reddit 80.03 ± 0.36 79.83 ± 0.31 79.55 ± 0.20 81.22 ± 0.61 80.82 ± 0.45 73.59 ± 0.00 77.14 ± 0.16 78.44 ± 0.18 81.57 ± 0.67 84.44 ± 1.49
LastFM 74.35 ± 3.81 74.92 ± 2.46 71.59 ± 0.24 76.87 ± 4.64 69.86 ± 0.43 73.03 ± 0.00 59.30 ± 2.31 72.47 ± 0.49 81.57 ± 0.48 84.52 ± 0.67
Historical Enron 69.85 ± 2.70 71.19 ± 2.76 64.07 ± 1.05 73.91 ± 1.76 64.73 ± 0.36 76.53 ± 0.00 70.66 ± 0.39 77.98 ± 0.92 75.63 ± 0.73 82.64 ± 1.39
UCI 75.24 ± 5.80 55.10 ± 3.14 68.27 ± 1.37 80.43 ± 2.12 65.30 ± 0.43 65.50 ± 0.00 80.25 ± 2.74 84.11 ± 1.35 82.17 ± 0.82 86.12 ± 1.50
CollegeMsg 64.41 ± 6.52 47.73 ± 0.97 68.18 ± 1.04 80.65 ± 1.13 84.54 ± 0.11 44.16 ± 0.00 68.53 ± 0.08 83.93 ± 0.09 80.93 ± 0.37 90.02 ± 1.19
Avg. Rank 6.33 7.17 7.17 4.17 7.17 7.67 6.67 4 3.67 1
Wikipedia 75.65 ± 0.79 70.21 ± 1.58 87.00 ± 0.16 85.62 ± 0.44 74.06 ± 2.62 80.63 ± 0.00 86.76 ± 0.72 88.59 ± 0.17 78.29 ± 5.38 93.10 ± 0.71

Reddit 86.98 ± 0.16 86.30 ± 0.26 89.59 ± 0.24 88.10 ± 0.24 91.67 ± 0.24 85.48 ± 0.00 87.45 ± 0.29 85.26 ± 0.11 91.11 ± 0.40 92.37 ± 1.64
LastFM 62.67 ± 4.49 64.41 ± 2.70 71.13 ± 0.17 65.95 ± 5.98 67.48 ± 0.77 75.49 ± 0.00 58.21 ± 0.89 68.12 ± 0.33 73.97 ± 0.50 74.01 ± 1.35
Inductive Enron 68.96 ± 0.98 67.79 ± 1.53 63.94 ± 1.36 70.89 ± 2.72 75.15 ± 0.58 73.89 ± 0.00 71.29 ± 0.32 75.01 ± 0.79 77.41 ± 0.89 84.34 ± 1.74
UCI 65.99 ± 1.40 54.79 ± 1.76 68.67 ± 0.84 70.94 ± 0.71 64.61 ± 0.48 57.43 ± 0.00 76.01 ± 1.11 80.10 ± 0.51 72.25 ± 1.71 80.70 ± 1.16
CollegeMsg 53.49 ± 0.53 54.13 ± 1.64 68.54 ± 0.85 71.90 ± 0.06 75.12 ± 0.09 43.49 ± 0.00 68.70 ± 0.02 79.17 ± 0.09 70.44 ± 0.83 80.15 ± 1.72
Avg. Rank 8 8.6 6.2 5.6 4.4 6.8 6.2 4.6 3.4 1.2

5.5. Link Prediction

We compare our TGFormer with the previous state-of-the-art in both transductive


and inductive link prediction using the AP and AUC-ROC metrics. To provide a more
comprehensive study of our TGFormer, we present results for all three negative sam-

pling strategies. We put the transductive results in Table 3 and Table 4 and inductive
results in Table 5 and Table 6, respectively. The best and second-best results are marked
in bold and underlined fonts. Note that EdgeBank [54] can only evaluate transductive
Table 4: AUC-ROC(%) for transductive temporal link prediction with random, historical, and inductive

negative sampling strategies. NSS is the abbreviation of Negative Sampling Strategies.
NSS Datasets JODIE DyRep TGAT TGN CAWN EdgeBank TCL GraphMixer DyGFormer TGFormer
Wikipedia 96.33 ± 0.07 94.37 ± 0.09 96.67 ± 0.07 98.37 ± 0.07 98.54 ± 0.04 90.78 ± 0.00 95.84 ± 0.18 96.92 ± 0.03 98.91 ± 0.02 99.77 ± 0.10
Reddit 98.31 ± 0.05 98.17 ± 0.05 98.47 ± 0.02 98.60 ± 0.06 99.01 ± 0.01 95.37 ± 0.00 97.42 ± 0.02 97.17 ± 0.02 99.15 ± 0.01 99.82 ± 0.01
LastFM 70.49 ± 1.66 71.16 ± 1.89 71.59 ± 0.18 78.47 ± 2.94 85.92 ± 0.10 83.77 ± 0.00 64.06 ± 1.16 73.53 ± 0.12 93.05 ± 0.10 96.85 ± 0.44
Random Enron 87.96 ± 0.52 84.89 ± 3.00 68.89 ± 1.10 88.32 ± 0.99 90.45 ± 0.14 87.05 ± 0.00 75.74 ± 0.72 84.38 ± 0.21 93.33 ± 0.13 97.13 ± 0.90
UCI 90.44 ± 0.49 68.77 ± 2.34 78.53 ± 0.74 92.03 ± 1.13 93.87 ± 0.08 77.30 ± 0.00 87.82 ± 1.36 91.81 ± 0.67 94.49 ± 0.26 98.68 ± 0.30
CollegeMsg 78.87 ± 1.71 61.72 ± 6.95 79.25 ± 0.76 92.39 ± 1.00 94.63 ± 0.11 77.32 ± 0.00 83.32 ± 0.08 91.71 ± 0.28 94.52 ± 0.03 98.88 ± 0.30
Avg. Rank 6.83 8.5 7.17 4.17 2.83 8 8 6.33 2.17 1
Wikipedia 80.77 ± 0.73 77.74 ± 0.33 82.87 ± 0.22 82.74 ± 0.32 67.84 ± 0.64 77.27 ± 0.00 85.76 ± 0.46 87.68 ± 0.17 78.80 ± 1.95 92.07 ± 0.33
Reddit 80.52 ± 0.32 80.15 ± 0.18 79.33 ± 0.16 81.11 ± 0.19 80.27 ± 0.30 78.58 ± 0.00 76.49 ± 0.16 77.80 ± 0.12 80.54 ± 0.29 83.08 ± 2.43
LastFM 75.22 ± 2.36 74.65 ± 1.98 64.27 ± 0.26 77.97 ± 3.04 67.88 ± 0.24 78.09 ± 0.00 47.24 ± 3.13 64.21 ± 0.73 78.78 ± 0.35 82.39 ± 0.67
Historical Enron 75.39 ± 2.37 74.69 ± 3.55 61.85 ± 1.43 77.09 ± 2.22 65.10 ± 0.34 79.59 ± 0.00 67.95 ± 0.88 75.27 ± 1.14 76.55 ± 0.52 80.59 ± 1.47
UCI 78.64 ± 3.50 57.91 ± 3.12 58.89 ± 1.57 77.25 ± 2.68 57.86 ± 0.15 69.56 ± 0.00 72.25 ± 3.46 77.54 ± 2.02 76.97 ± 0.24 83.80 ± 2.67

CollegeMsg 66.92 ± 5.00 49.37 ± 1.37 58.19 ± 1.11 77.65 ± 1.40 79.45 ± 0.15 34.64 ± 0.00 58.55 ± 0.18 77.50 ± 0.16 76.25 ± 0.12 88.58 ± 1.10
Avg. Rank 4.67 7.5 7.5 3.5 7.17 6.5 7.33 5.5 4.33 1
Wikipedia 70.96 ± 0.78 67.36 ± 0.96 81.93 ± 0.22 80.97 ± 0.31 70.95 ± 0.95 81.73 ± 0.00 82.19 ± 0.48 84.28 ± 0.30 75.09 ± 3.70 91.01 ± 0.76
Reddit 83.51 ± 0.15 82.90 ± 0.31 87.13 ± 0.20 84.56 ± 0.24 88.04 ± 0.29 85.93 ± 0.00 84.67 ± 0.29 82.21 ± 0.13 86.23 ± 0.51 89.42 ± 2.17
LastFM 61.32 ± 3.49 62.15 ± 2.12 63.99 ± 0.21 65.46 ± 4.27 67.92 ± 0.44 77.37 ± 0.00 46.93 ± 2.59 60.22 ± 0.32 69.25 ± 0.36 70.54 ± 0.67
Inductive Enron 70.92 ± 1.05 68.73 ± 1.34 60.45 ± 2.12 71.34 ± 2.46 75.17 ± 0.50 75.00 ± 0.00 67.64 ± 0.86 71.53 ± 0.85 74.07 ± 0.64 81.19 ± 1.21
UCI 64.14 ± 1.26 54.25 ± 2.01 60.80 ± 1.01 64.11 ± 1.04 58.06 ± 0.26 58.03 ± 0.00 70.05 ± 1.86 74.59 ± 0.74 65.96 ± 1.18 75.98 ± 2.90
CollegeMsg 53.34 ± 0.22 53.13 ± 2.16 59.84 ± 1.16 65.58 ± 0.42 69.28 ± 0.24 30.53 ± 0.00 59.67 ± 0.16 74.06 ± 0.25 64.77 ± 0.13 76.92 ± 1.31
Avg. Rank 7.2 8.6 6.4 5.6 3.8 5.6 7 5.6 4 1.2

Table 5: AP(%) for inductive temporal link prediction with random, historical, and inductive negative sampling strategies. NSS is the abbreviation of Negative Sampling Strategies.
NSS Datasets JODIE DyRep TGAT TGN CAWN TCL GraphMixer DyGFormer TGFormer
Wikipedia 94.82 ± 0.20 92.43 ± 0.37 96.22 ± 0.07 97.83 ± 0.04 98.24 ± 0.03 96.22 ± 0.17 96.65 ± 0.02 98.59 ± 0.03 99.27 ± 0.25
Reddit 96.50 ± 0.13 96.09 ± 0.11 97.09 ± 0.04 97.50 ± 0.07 98.62 ± 0.01 94.09 ± 0.07 95.26 ± 0.02 98.84 ± 0.02 99.78 ± 0.04
LastFM 81.61 ± 3.82 83.02 ± 1.48 78.63 ± 0.31 81.45 ± 4.29 89.42 ± 0.07 73.53 ± 1.66 82.11 ± 0.42 94.23 ± 0.09 97.97 ± 0.37
Random Enron 80.72 ± 1.39 74.55 ± 3.95 67.05 ± 1.51 77.94 ± 1.02 86.35 ± 0.51 76.14 ± 0.79 75.88 ± 0.48 89.76 ± 0.34 91.88 ± 1.26
UCI 79.86 ± 1.48 57.48 ± 1.87 79.54 ± 0.48 88.12 ± 2.05 92.73 ± 0.06 87.36 ± 2.03 91.19 ± 0.42 94.54 ± 0.12 99.16 ± 0.15
CollegeMsg 63.30 ± 1.26 52.72 ± 2.58 79.58 ± 0.47 88.07 ± 1.44 94.44 ± 0.05 81.18 ± 0.04 90.94 ± 0.22 94.36 ± 0.12 97.73 ± 0.22
Avg. Rank 6.5 7.67 7.25 5 2.83 7.08 5.5 2.17 1
Wikipedia 68.69 ± 0.39 62.18 ± 1.27 84.17 ± 0.22 81.76 ± 0.32 67.27 ± 1.63 82.20 ± 2.18 87.60 ± 0.30 71.42 ± 4.43 78.69 ± 1.25
Reddit 62.34 ± 0.54 61.60 ± 0.72 63.47 ± 0.36 64.85 ± 0.85 63.67 ± 0.41 60.83 ± 0.25 64.50 ± 0.26 65.37 ± 0.60 72.95 ± 1.13
LastFM 70.39 ± 4.31 71.45 ± 1.76 75.27 ± 0.25 66.65 ± 6.11 71.33 ± 0.47 65.78 ± 0.65 76.42 ± 0.22 76.35 ± 0.52 76.81 ± 0.79
Historical Enron 65.86 ± 3.71 62.08 ± 2.27 61.40 ± 1.31 62.91 ± 1.16 60.70 ± 0.36 67.11 ± 0.62 72.37 ± 1.37 67.07 ± 0.62 76.77 ± 1.01
UCI 63.11 ± 2.27 52.47 ± 2.06 70.52 ± 0.93 70.78 ± 0.78 64.54 ± 0.47 76.71 ± 1.00 81.66 ± 0.49 72.13 ± 1.87 77.24 ± 1.59
CollegeMsg 50.51 ± 0.75 54.43 ± 1.79 70.50 ± 1.18 71.60 ± 0.31 74.14 ± 0.17 69.80 ± 0.23 80.15 ± 0.18 69.59 ± 1.25 77.80 ± 1.11
Avg. Rank 7 7.5 5 5 6.17 5.5 2.5 4.33 2
Wikipedia 68.70 ± 0.39 62.19 ± 1.28 84.17 ± 0.22 81.77 ± 0.32 67.24 ± 1.63 82.20 ± 2.18 87.60 ± 0.29 71.42 ± 4.43 78.13 ± 1.56
Reddit 62.32 ± 0.54 61.58 ± 0.72 63.40 ± 0.36 64.84 ± 0.84 63.65 ± 0.41 60.81 ± 0.26 64.49 ± 0.25 65.35 ± 0.60 72.95 ± 1.13
LastFM 70.39 ± 4.31 71.45 ± 1.75 76.28 ± 0.25 69.46 ± 4.65 71.33 ± 0.47 65.78 ± 0.65 76.42 ± 0.22 76.35 ± 0.52 74.21 ± 1.02
Inductive Enron 65.86 ± 3.71 62.08 ± 2.27 61.40 ± 1.30 62.90 ± 1.16 60.72 ± 0.36 67.11 ± 0.62 72.37 ± 1.38 67.07 ± 0.62 76.77 ± 1.02
UCI 63.16 ± 2.27 52.47 ± 2.09 70.49 ± 0.93 70.73 ± 0.79 64.54 ± 0.47 76.65 ± 0.99 81.64 ± 0.49 72.13 ± 1.86 77.19 ± 1.09
CollegeMsg 50.57 ± 0.76 54.47 ± 1.81 70.50 ± 1.19 71.63 ± 0.31 74.11 ± 0.17 69.80 ± 0.24 80.13 ± 0.18 69.55 ± 1.27 77.72 ± 1.11
Avg. Rank 7.2 7.4 5.4 5.2 5.8 6 2.2 3.8 2

Table 6: AUC-ROC(%) for inductive temporal link prediction with random, historical, and inductive negative

sampling strategies. NSS is the abbreviation of Negative Sampling Strategies.


NSS Datasets JODIE DyRep TGAT TGN CAWN TCL GraphMixer DyGFormer TGFormer
Wikipedia 94.33 ± 0.27 91.49 ± 0.45 95.90 ± 0.09 97.72 ± 0.03 98.03 ± 0.04 95.57 ± 0.20 96.30 ± 0.04 98.48 ± 0.03 99.15 ± 0.28
Reddit 96.52 ± 0.13 96.05 ± 0.12 96.98 ± 0.04 97.39 ± 0.07 98.42 ± 0.02 93.80 ± 0.07 94.97 ± 0.05 98.71 ± 0.01 99.69 ± 0.06
LastFM 81.13 ± 3.39 82.24 ± 1.51 76.99 ± 0.29 82.61 ± 3.15 87.82 ± 0.12 70.84 ± 0.85 80.37 ± 0.18 94.08 ± 0.08 97.93 ± 0.37
Random Enron 81.96 ± 1.34 76.34 ± 4.20 64.63 ± 1.74 78.83 ± 1.11 87.02 ± 0.50 72.33 ± 0.99 76.51 ± 0.71 90.69 ± 0.26 89.92 ± 1.45
UCI 78.80 ± 0.94 58.08 ± 1.81 77.64 ± 0.38 86.68 ± 2.29 90.40 ± 0.11 84.49 ± 1.82 89.30 ± 0.57 92.63 ± 0.13 98.84 ± 0.22
CollegeMsg 66.00 ± 2.05 54.09 ± 3.53 77.00 ± 0.17 86.73 ± 1.82 92.29 ± 0.06 78.76 ± 0.05 88.40 ± 0.18 92.21 ± 0.13 96.49 ± 0.31
Avg. Rank 6.5 7.67 7.17 4.5 2.83 7.5 5.67 2 1.17

Wikipedia 61.86 ± 0.53 57.54 ± 1.09 78.38 ± 0.20 75.75 ± 0.29 62.04 ± 0.65 79.79 ± 0.96 82.87 ± 0.21 68.33 ± 2.82 76.75 ± 1.67
Reddit 61.69 ± 0.39 60.45 ± 0.37 64.43 ± 0.27 64.55 ± 0.50 64.94 ± 0.21 61.43 ± 0.26 64.27 ± 0.13 64.81 ± 0.25 69.26 ± 2.38
LastFM 68.44 ± 3.26 68.79 ± 1.08 69.89 ± 0.28 66.99 ± 5.62 67.69 ± 0.24 55.88 ± 1.85 70.07 ± 0.20 70.73 ± 0.37 70.33 ± 1.02
Historical Enron 65.32 ± 3.57 61.50 ± 2.50 57.84 ± 2.18 62.68 ± 1.09 62.25 ± 0.40 64.06 ± 1.02 68.20 ± 1.62 65.78 ± 0.42 68.79 ± 1.58
UCI 60.24 ± 1.94 51.25 ± 2.37 62.32 ± 1.18 62.69 ± 0.90 56.39 ± 0.10 70.46 ± 1.94 75.98 ± 0.84 65.55 ± 1.01 67.72 ± 1.97
CollegeMsg 48.57 ± 1.27 52.31 ± 1.53 61.53 ± 1.45 63.89 ± 0.71 67.77 ± 0.09 60.05 ± 0.08 74.54 ± 0.16 63.15 ± 0.44 74.18 ± 0.49
Avg. Rank 6.83 8 5.5 5.33 5.67 5.5 2.33 3.67 2.17
Wikipedia 61.87 ± 0.53 57.54 ± 1.09 78.38 ± 0.20 75.76 ± 0.29 62.02 ± 0.65 79.79 ± 0.96 82.88 ± 0.21 68.33 ± 2.82 76.04 ± 2.08
Reddit 61.69 ± 0.39 60.44 ± 0.37 64.39 ± 0.27 64.55 ± 0.50 64.91 ± 0.21 61.36 ± 0.26 64.27 ± 0.13 64.80 ± 0.25 69.25 ± 2.38

LastFM 68.44 ± 3.26 68.79 ± 1.08 69.89 ± 0.28 66.99 ± 5.61 67.68 ± 0.24 55.88 ± 1.85 70.07 ± 0.20 70.73 ± 0.37 73.21 ± 1.42
Inductive Enron 65.32 ± 3.57 61.50 ± 2.50 57.83 ± 2.18 62.68 ± 1.09 62.27 ± 0.40 64.05 ± 1.02 68.19 ± 1.63 65.79 ± 0.42 72.79 ± 1.59
UCI 60.27 ± 1.94 51.26 ± 2.40 62.29 ± 1.17 62.66 ± 0.91 56.39 ± 0.11 70.42 ± 1.93 75.97 ± 0.85 65.58 ± 1.00 72.78 ± 1.34
CollegeMsg 48.64 ± 1.26 52.36 ± 1.53 61.51 ± 1.47 63.93 ± 0.70 67.72 ± 0.09 60.05 ± 0.10 74.53 ± 0.16 63.14 ± 0.44 73.45 ± 0.51
Avg. Rank 6.6 7.8 6 5.4 5.4 6.4 2.6 3.4 1.4
temporal link prediction and therefore does not give its results in the inductive setting.

From the above tables, we can see that TGFormer outperforms the existing methods in most cases, often by a clear margin over the second-best method.
We summarize the superiority of TGFormer in two aspects. First, the transformer architecture allows TGFormer to process longer histories and capture long-term temporal dependencies effectively. As illustrated in Fig. 6, the input sequence lengths handled by TGFormer are significantly longer than those managed by the baselines across most datasets, underscoring TGFormer's superior aptitude for leveraging longer sequences. Second, benefiting from its ability to capture periodic temporal patterns, TGFormer suffers only a small performance degradation under the harder negative sampling strategies, especially on Wikipedia and Reddit, compared to the other approaches.

Table 7: AUC-ROC(%) for the transductive node classification on Wikipedia and Reddit.
Methods Wikipedia Reddit Avg. Rank
JODIE 88.99 ± 1.05 60.37 ± 2.58 5
DyRep 86.39 ± 0.98 63.72 ± 1.32 5.5
TGAT 84.09 ± 1.27 70.04 ± 1.09 4.5
TGN 86.38 ± 2.34 63.27 ± 0.90 6.5
CAWN 84.88 ± 1.33 66.34 ± 1.78 6
TCL 77.83 ± 2.13 68.87 ± 2.15 5.5
GraphMixer 86.80 ± 0.79 64.22 ± 3.32 5.5
DyGFormer 87.44 ± 1.08 68.00 ± 1.74 2.5
TGFormer 86.40 ± 2.24 66.41 ± 2.16 4

5.6. Node Classification

For the node classification task, we adhere to the evaluation protocols established
by DyGFormer [37]. The objective of this task is to predict the state label of the

source node, given the node and future timestamps. Specifically, we employ the model
obtained from the preceding transductive link prediction as the pre-trained model for
node classification. A classifier decoder, such as a three-layer MLP, is then trained

separately for the node classification task. We evaluate this task on two datasets with
dynamic node labels, namely Wikipedia and Reddit, while excluding other datasets due
to the absence of node labels.
The comparative results of our method and the baseline methods on the node clas-

sification task are presented in Table 7. While our method did not attain the best perfor-
mance, it nonetheless demonstrated commendable and competitive results. We sum-
marize the reasons for the suboptimal performance as follows: First, our method pri-
oritizes the interaction between nodes, such as periodic temporal dependencies, while
paying less attention to the individual node statuses. Second, to leverage long-term his-

torical data, we may inadvertently include more noise and node anomalies [60], which
we have not filtered effectively.
Figure 4: Comparison of model performance, parameter size, and training time per epoch on Wikipedia, Enron, and UCI. Each panel plots AP against training time (s/epoch), with points annotated by each model's number of trainable parameters.

5.7. Efficiency Analysis

To compare the efficiency of the evaluated models, we conducted a comprehensive



analysis encompassing performance in transductive link prediction using AP metrics,


training time per epoch, and the number of trainable parameters between TGFormer
and baseline methods across the Wikipedia, Enron, and UCI datasets. The results are

depicted in Fig. 4. It is evident that the walk-based method, such as CAWN, requires
a longer training time due to its inefficient temporal walk operations and has a sub-
stantial number of parameters. Conversely, simpler and memory-based methods such

as DyRep and JODIE may possess fewer parameters and exhibit faster training times;
however, they exhibit a significant performance gap compared to the best-performing
methods. In contrast, TGFormer outperforms the transformer-based method DyG-
Former by achieving superior performance with a smaller size of trainable parameters
ep

and a moderate training time per epoch.
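The efficiency comparison above rests on two simple measurements: the number of trainable parameters and the wall-clock training time per epoch. A minimal, framework-agnostic sketch of collecting both (the parameter names and sizes below are illustrative stand-ins, not TGFormer's actual tensors):

```python
import time
import numpy as np

def count_parameters(params: dict) -> int:
    """Total number of trainable scalars across all parameter tensors."""
    return sum(p.size for p in params.values())

def time_epoch(step_fn, n_batches: int) -> float:
    """Wall-clock seconds to run one epoch of n_batches training steps."""
    start = time.perf_counter()
    for _ in range(n_batches):
        step_fn()
    return time.perf_counter() - start

# Illustrative stand-in for a model's parameter tensors (hypothetical names/sizes).
params = {
    "attention.W_q": np.zeros((172, 172)),
    "attention.W_k": np.zeros((172, 172)),
    "readout.w": np.zeros(172),
}

n_params = count_parameters(params)
print(f"{n_params / 1e6:.4f}M trainable parameters")
epoch_seconds = time_epoch(lambda: None, n_batches=100)
```

Parameter counts like the "1.0868M" annotations in Fig. 4 are produced by exactly this kind of sum over tensor sizes.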



This preprint research paper has not been peer reviewed. Electronic copy available at: [Link]
Figure 5: Ablation study in the transductive setting with the random negative sampling strategy. AP (%) values are reported. Panels: (a) Wikipedia, (b) Reddit, (c) LastFM, (d) Enron, (e) UCI, (f) CollegeMsg.

5.8. Ablation Studies

We further validate the effectiveness of the modules in TGFormer through ablation studies covering frequency encoding (FE), the auto-correlation mechanism (ACM), and adaptive readout (AR). We refer to TGFormer without each of these modules as w/o FE, w/o ACM, and w/o AR, respectively, and to TGFormer without all three as w/o All. Specifically, in w/o FE, the encoding alignment module takes only node/link encoding and time encoding as input. In w/o ACM, we replace the proposed ACM with the traditional attention mechanism [38]. In w/o AR, we replace the proposed adaptive readout with the traditional mean function [21]. We report the performance of the different variants on the six datasets in Fig. 5. TGFormer typically performs best when all components are used, and results degrade when any component is removed or replaced. We draw the following conclusions: 1) removing the node interaction frequency encoding significantly hurts performance, as this encoding directly captures the relationships between nodes and their interactions; 2) although ACM plays different roles on different datasets, on large or strongly periodic datasets its contribution is comparable to that of FE, suggesting that ACM benefits from longer historical information and captures the periodic dependencies of the dataset; 3) the adaptive readout function yields a further improvement.
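To make the w/o AR variant concrete: the mean readout collapses a node's sequence of temporal hidden states by uniform averaging, whereas an adaptive readout learns the pooling weights. The section does not spell out the adaptive readout's exact form, so the sketch below uses a simple softmax-weighted pooling with a learned scoring vector `w` as an assumed, illustrative parameterization:

```python
import numpy as np

def mean_readout(H: np.ndarray) -> np.ndarray:
    """w/o AR baseline: uniform average over the L temporal states, (L, d) -> (d,)."""
    return H.mean(axis=0)

def adaptive_readout(H: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Softmax-weighted pooling: states scored by a learned vector w (illustrative)."""
    scores = H @ w                    # (L,) one relevance score per time step
    scores = scores - scores.max()    # subtract max for numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()       # attention-style weights over time steps
    return alpha @ H                  # (d,) weighted combination of states

rng = np.random.default_rng(0)
H = rng.normal(size=(16, 8))   # 16 historical states, 8-dim embeddings
w = rng.normal(size=8)

z_mean = mean_readout(H)
z_adapt = adaptive_readout(H, w)
# With w = 0 the softmax weights are uniform and the two readouts coincide,
# so the adaptive readout strictly generalizes the mean function.
assert np.allclose(adaptive_readout(H, np.zeros(8)), z_mean)
```

This nesting explains why w/o AR can only match, never exceed, the full model: the mean is one point in the adaptive readout's hypothesis space.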
Table 8: TGFormer performance under different choices of hyper-parameter c in the auto-correlation mechanism.

Dataset   Wikipedia      Reddit         LastFM         Enron          UCI            CollegeMsg
Metric    AP     AUC     AP     AUC     AP     AUC     AP     AUC     AP     AUC     AP     AUC
c=1       99.61  99.58   99.87  99.83   96.26  96.55   97.17  97.24   98.93  98.35   99.23  98.95
c=2       99.97  99.97   99.88  99.84   96.28  96.54   96.62  96.68   99.02  98.65   99.29  99.09
c=3       99.86  99.83   99.89  99.86   96.34  96.64   97.85  97.94   99.22  98.79   99.18  98.75
c=4       99.58  99.57   99.87  99.82   96.44  95.70   97.56  97.72   99.24  98.82   99.28  98.94
c=5       99.93  99.93   99.88  99.85   96.63  96.50   97.73  97.84   99.27  98.92   99.30  98.97

5.9. Parameter Sensitivity Analysis

As demonstrated in Table 8, the model is robust with respect to the hyper-parameter c (Eq. (9)). To strike a balance between performance and computational efficiency, we restrict c to the range 1 to 5. We also observe that the results on most datasets fluctuate little and remain stable, as all of these datasets are somewhat periodic.

Figure 6: The performance of various methods on the Reddit and LastFM datasets across different historical lengths L.

5.10. Motivation Verification

We validate the advantages of our transformer architecture in effectively and efficiently utilizing longer histories. Our experiments are conducted on the Reddit and LastFM datasets, chosen for their potential to benefit from long-term historical data. For baseline comparisons, we augment the sampling of neighbors or increase the number of causal anonymous walks (starting from 2) to give the baselines access to extended histories. The experimental results are illustrated in Fig. 6, with the x-axis depicted on a logarithmic scale with base 2. Note that some baseline results are incomplete due to out-of-memory errors at longer historical lengths; for instance, CAWN runs out of memory when the historical length reaches 256. From Fig. 6, we deduce the following: 1) a majority of the baselines exhibit deteriorating performance with increasing historical length, indicative of their limitations in capturing long-term temporal dependencies; 2) the baselines generally incur substantial computational costs when processing extended histories, and while memory network-based methods (e.g., TGN) manage to handle longer histories with reasonable computational costs, they do not significantly benefit from these extended histories due to issues such as vanishing or exploding gradients; 3) TGFormer consistently demonstrates performance gains with longer historical sequences, underscoring the superiority of the transformer architecture in effectively utilizing extended historical records.

Figure 7: The performance of different attention mechanisms (left) and ACM (right) in capturing periodic temporal patterns.

We also verify the advantage of the ACM in capturing periodic temporal dependencies. We conducted experiments on Wikipedia, measuring the interaction frequency of a single node to determine the model's effectiveness in capturing periodicity. For clarity, we visualize the learned periodic dependency in Fig. 7, where the x-axis represents the node interaction time in days and the y-axis denotes the node interaction frequency during this period. For comparison with attention mechanisms, the top-9 time delays {δ1, · · · , δ9} of the ACM are highlighted within the raw series using red lines, whereas the top-9 data points most similar to the final time step, as determined by the attention mechanism, are denoted by yellow lines with red stars. As illustrated, while the attention mechanism captures node interaction intent, it fails to capture interaction periodicity effectively. This demonstrates that our model can discover relevant information more comprehensively and accurately.
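The periodic delays highlighted in Fig. 7 come from the auto-correlation view of the series. As a sketch of the general technique, following the Wiener-Khinchin theorem [55] and the Autoformer-style series-level aggregation [47] (this illustrates the mechanism, not TGFormer's exact Eq. (9); the choice k = c·⌊log2 L⌋ is an assumed convention borrowed from that line of work):

```python
import numpy as np

def autocorrelation(x: np.ndarray) -> np.ndarray:
    """Circular autocorrelation R(tau) via FFT (Wiener-Khinchin), O(L log L)."""
    f = np.fft.rfft(x)
    return np.fft.irfft(f * np.conj(f), n=len(x))

def top_k_delays(x: np.ndarray, c: int = 2) -> np.ndarray:
    """Select the k = c * floor(log2 L) delays with the highest autocorrelation."""
    L = len(x)
    k = c * int(np.log2(L))
    r = autocorrelation(x)
    r[0] = -np.inf                  # exclude the trivial zero delay
    return np.argsort(r)[::-1][:k]  # delays sorted by decreasing correlation

# A series with a 24-step period: the selected delays are multiples of 24,
# matching the kind of daily periodicity visualized in Fig. 7.
t = np.arange(240)
x = np.sin(2 * np.pi * t / 24)
delays = top_k_delays(x, c=1)
assert all(d % 24 == 0 for d in delays)
```

In the full mechanism, the series rolled by each selected delay is then aggregated with weights proportional to its autocorrelation score, so information is combined at the series level rather than point by point, which is why the red lines in Fig. 7 align with whole periods while pointwise attention does not.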

6. Conclusion

In this paper, we presented TGFormer, which champions a shift in perspective, recasting temporal graph learning as time series analysis. We proposed an auto-correlation mechanism that extends beyond the traditional attention mechanism, capturing long-term temporal dependencies and periodic temporal patterns while boosting computational efficiency and information utilization. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness and efficiency of the proposed TGFormer model.

Acknowledgment
This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LDT23F01012F01, in part by the National Natural Science Foundation of China under Grant 62372146, and in part by the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant GK229909299001-008.
References

[1] C. Li, L. Lin, W. Zuo, J. Tang, M.-H. Yang, Visual tracking via dynamic graph learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (11) (2018) 2770–2782.

[2] M. Ren, Y. Wang, Y. Zhu, K. Zhang, Z. Sun, Multiscale dynamic graph representation for biometric recognition with occlusions, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (12) (2023) 15120–15136.

[3] Y. Li, L. Hou, J. Li, Preference-aware graph attention networks for cross-domain recommendations with collaborative knowledge graph, ACM Transactions on Information Systems 41 (3) (2023) 1–26.

[4] X. Zhang, P. Jiao, M. Gao, T. Li, Y. Wu, H. Wu, Z. Zhao, Vggm: Variational graph gaussian mixture model for unsupervised change point detection in dynamic networks, IEEE Transactions on Information Forensics and Security 19 (2024) 4272–4284.

[5] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, T. Schardl, C. Leiserson, Evolvegcn: Evolving graph convolutional networks for dynamic graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

[6] A. Sankar, Y. Wu, L. Gou, W. Zhang, H. Yang, Dysat: Deep neural representation learning on dynamic graphs via self-attention networks, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 519–527.

[7] J. You, T. Du, J. Leskovec, Roland: Graph learning framework for dynamic graphs, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 2358–2366.

[8] P. Jiao, H. Chen, H. Tang, Q. Bao, L. Zhang, Z. Zhao, H. Wu, Contrastive representation learning on dynamic networks, Neural Networks 174 (2024) 106240.

[9] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, M. Bronstein, Temporal graph networks for deep learning on dynamic graphs, in: ICML 2020 Workshop on Graph Representation Learning, 2020.

[10] J. Li, Z. Han, H. Cheng, J. Su, P. Wang, J. Zhang, L. Pan, Predicting path failure in time-evolving graphs, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1279–1289.

[11] H. Chen, P. Jiao, H. Tang, H. Wu, Temporal graph representation learning with adaptive augmentation contrastive, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2023, pp. 683–699.

[12] P. Jiao, T. Li, H. Wu, C.-D. Wang, D. He, W. Wang, Hb-dsbm: Modeling the dynamic complex networks from community level to node level, IEEE Transactions on Neural Networks and Learning Systems 34 (11) (2023) 8310–8323.

[13] M. Qin, D.-Y. Yeung, Temporal link prediction: A unified framework, taxonomy, and review, ACM Computing Surveys 56 (4) (2023) 1–40.

[14] S. Kumar, X. Zhang, J. Leskovec, Predicting dynamic embedding trajectory in temporal interaction networks, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1269–1278.

[15] B. Chang, M. Chen, E. Haber, E. H. Chi, Antisymmetricrnn: A dynamical system view on recurrent neural networks (2019).

[16] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, K. Achan, Inductive representation learning on temporal graphs, in: International Conference on Learning Representations, 2020.

[17] A. Souza, D. Mesquita, S. Kaski, V. Garg, Provably expressive temporal graph networks, Advances in Neural Information Processing Systems 35 (2022) 32257–32269.

[18] T. Lin, Y. Wang, X. Liu, X. Qiu, A survey of transformers, AI Open (2022).

[19] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah, Transformers in vision: A survey, ACM Computing Surveys (CSUR) 54 (10s) (2022) 1–41.

[20] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, L. Sun, Transformers in time series: A survey, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 6778–6786.

[21] W. Cong, S. Zhang, J. Kang, B. Yuan, H. Wu, X. Zhou, H. Tong, M. Mahdavi, Do we really need complicated model architectures for temporal networks?, in: International Conference on Learning Representations, 2023.

[22] A.-L. Barabasi, The origin of bursts and heavy tails in human dynamics, Nature 435 (7039) (2005) 207–211.

[23] C. Chatfield, H. Xing, The Analysis of Time Series: An Introduction with R, CRC Press, 2019.

[24] S. Unnikrishna Pillai, Probability, Random Variables and Stochastic Processes, 2002.

[25] J. Gao, B. Ribeiro, On the equivalence between temporal and static equivariant graph representations, in: International Conference on Machine Learning, PMLR, 2022, pp. 7052–7076.

[26] K. Zhu, J. Chen, J. Wang, N. Z. Gong, D. Yang, X. Xie, Dyval: Graph-informed dynamic evaluation of large language models, in: International Conference on Learning Representations, 2024.

[27] T. Zheng, X. Wang, Z. Feng, J. Song, Y. Hao, M. Song, X. Wang, X. Wang, C. Chen, Temporal aggregation and propagation graph neural networks for dynamic representation, IEEE Transactions on Knowledge and Data Engineering 35 (10) (2023) 10151–10165.

[28] J. Su, D. Zou, C. Wu, Pres: Toward scalable memory-based dynamic graph neural networks, in: International Conference on Learning Representations, 2024.

[29] Y. Wang, Y. Cai, Y. Liang, H. Ding, C. Wang, S. Bhatia, B. Hooi, Adaptive data augmentation on temporal graphs, Advances in Neural Information Processing Systems 34 (2021) 1440–1452.

[30] L. Luo, G. Haffari, S. Pan, Graph sequential neural ode process for link prediction on dynamic and sparse graphs, in: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 2023, pp. 778–786.

[31] M. Jin, Y.-F. Li, S. Pan, Neural temporal walks: Motif-aware representation learning on continuous-time dynamic graphs, Advances in Neural Information Processing Systems 35 (2022) 19874–19886.

[32] A. Gravina, G. Lovisotto, C. Gallicchio, D. Bacciu, C. Grohnfeldt, Long range propagation on continuous-time dynamic graphs (2024).

[33] Y. Wang, Y.-Y. Chang, Y. Liu, J. Leskovec, P. Li, Inductive representation learning in temporal networks via causal anonymous walks, in: International Conference on Learning Representations, 2021.

[34] J.-w. Lee, J. Jung, Time-aware random walk diffusion to improve dynamic graph learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 8473–8481.

[35] L. Oettershagen, P. Mutzel, N. M. Kriege, Temporal walk centrality: Ranking nodes in evolving networks, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 1640–1650.

[36] T. Wang, D. Luo, W. Cheng, H. Chen, X. Zhang, Dyexplainer: Explainable dynamic graph neural networks, arXiv preprint arXiv:2310.16375 (2023).

[37] L. Yu, L. Sun, B. Du, W. Lv, Towards better dynamic graph learning: New architecture and unified library, Advances in Neural Information Processing Systems (2023).

[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

[39] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186.

[40] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).

[41] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.

[42] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.

[43] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2020.

[44] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.

[45] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, X. Yan, Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting, Advances in Neural Information Processing Systems 32 (2019).

[46] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 11106–11115.

[47] H. Wu, J. Xu, J. Wang, M. Long, Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, Advances in Neural Information Processing Systems 34 (2021) 22419–22430.

[48] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin, Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting, in: International Conference on Machine Learning, PMLR, 2022, pp. 27268–27286.

[49] J. Zhang, H. Zhang, C. Xia, L. Sun, Graph-bert: Only attention is needed for learning graph representations, arXiv preprint arXiv:2001.05140 (2020).

[50] D. Kreuzer, D. Beaini, W. Hamilton, V. Létourneau, P. Tossou, Rethinking graph transformers with spectral attention, Advances in Neural Information Processing Systems 34 (2021) 21618–21629.

[51] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, T.-Y. Liu, Do transformers really perform badly for graph representation?, Advances in Neural Information Processing Systems 34 (2021) 28877–28888.

[52] Y. Zhong, C. Huang, A dynamic graph representation learning based on temporal graph transformer, Alexandria Engineering Journal 63 (2023) 359–369.

[53] Y. Wu, Y. Fang, L. Liao, On the feasibility of simple transformer for dynamic graph modeling (2024).

[54] F. Poursafaei, S. Huang, K. Pelrine, R. Rabbany, Towards better evaluation for dynamic link prediction, Advances in Neural Information Processing Systems 35 (2022) 32928–32941.

[55] N. Wiener, Generalized harmonic analysis, Acta Mathematica 55 (1) (1930) 117–258.

[56] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, T. Liu, On layer normalization in the transformer architecture, in: International Conference on Machine Learning, PMLR, 2020, pp. 10524–10533.

[57] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Fevry, M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan, et al., Do transformer modifications transfer across implementations and applications?, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

[58] R. Trivedi, M. Farajtabar, P. Biswal, H. Zha, Dyrep: Learning representations over dynamic graphs, in: International Conference on Learning Representations, 2019.

[59] L. Wang, X. Chang, S. Li, Y. Chu, H. Li, W. Zhang, X. He, L. Song, J. Zhou, H. Yang, Tcl: Transformer-based dynamic graph modelling via contrastive learning, arXiv preprint arXiv:2105.07944 (2021).

[60] S. Tian, J. Dong, J. Li, W. Zhao, X. Xu, B. Wang, B. Song, C. Meng, T. Zhang, L. Chen, Sad: Semi-supervised anomaly detection on dynamic graphs, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 2306–2314.