Recurrent Marked Temporal Point Processes

ABSTRACT
Large volumes of event data are becoming increasingly available in a wide variety of applications, such as healthcare analytics, smart cities and social network analysis. The precise time interval or the exact distance between two events carries a great deal of information about the dynamics of the underlying systems. These characteristics make such data fundamentally different from independently and identically distributed data and from time-series data, where time and space are treated as indexes rather than random variables. Marked temporal point processes are the mathematical framework for modeling event data with covariates. However, typical point process models often make strong assumptions about the generative processes of the event data, which may or may not reflect reality, and their fixed parametric assumptions also restrict the expressive power of the respective processes. Can we obtain a more expressive model of marked temporal point processes? How can we learn such a model from massive data?
In this paper, we propose the Recurrent Marked Temporal Point Process (RMTPP) to simultaneously model the event timings and the markers. The key idea of our approach is to view the intensity function of a temporal point process as a nonlinear function of the history, and use a recurrent neural network to automatically learn a representation of influences from the event history. We develop an efficient stochastic gradient algorithm for learning the model parameters which can readily scale up to millions of events. Using both synthetic and real world datasets, we show that, in the case where the true models have parametric specifications, RMTPP can learn the dynamics of such models without the need to know the actual parametric forms; and in the case where the true models are unknown, RMTPP can still learn the dynamics and achieve better predictive performance than other parametric alternatives based on particular prior assumptions.

Figure 1: Given the trace of past locations and times (e.g., stops at 9:00 AM, 10:30 AM, and 11:05 AM), can we predict the location and time of the next stop?

Keywords
Marked temporal point process; Stochastic process; Recurrent neural network

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@[Link].
KDD '16, August 13-17, 2016, San Francisco, CA, USA
© 2016 ACM. ISBN 978-1-4503-4232-2/16/08. $15.00
DOI: [Link]

1. INTRODUCTION
Event data with marker information is produced by a wide range of activities, from social interactions to financial transactions to electronic health records, and it contains rich information about what type of event is happening between which entities, when, and where. For instance, people might visit various places at different moments of a day. Algorithmic trading systems buy and sell large volumes of stocks within short time frames. Patients regularly go to the clinic, producing longitudinal records of diagnoses for their diseases of concern.
Although the aforementioned situations come from a broad range of domains, we are interested in a commonly encountered question: based on the observed sequence of events, can we predict what kind of event will take place at what time in the future? Accurately predicting the type and the timing of the next event will enable many interesting applications. For mainstream personal assistants, as shown in Figure 1, since people tend to visit different places depending on the temporal/spatial context, successfully predicting their next destinations at the most likely times will make such services more relevant and usable. In the stock market, accurately forecasting when to sell or buy a particular stock is critical to business success. In modern healthcare, patients may have several diseases with complicated dependencies on each other, and accurately estimating when a clinical event might occur can effectively facilitate patient-specific care and prevention to reduce potential future risks.
Existing studies in the literature approach this problem mainly in two ways. First, classic varying-order Markov models [4] formulate the problem as a discrete-time sequence prediction task. Based on the observed sequence
of states, they can predict the most likely state the process will evolve into at the next step. As a result, one limitation of the family of classic Markov models is the assumption that the process proceeds in unit time-steps, so it cannot capture the heterogeneity of the time needed to predict the timing of the next event. Furthermore, when the number of states is large, Markov models usually cannot capture long dependency on the history, since the overall state-space grows exponentially in the number of time steps considered. The semi-Markov model [26] can model the continuous time-interval between two successive states to some extent by assuming the intervals have very simple distributions, but it still has the same state-space explosion issue when the order grows.
Second, marked temporal point processes and intensity functions are a more general mathematical framework for modeling such event data. For example, in seismology, marked temporal point processes have long been widely used for modeling earthquakes and aftershocks [20, 21, 31]. Each earthquake can be represented as a point in the temporal-spatial space, and seismologists have proposed different formulations to capture the randomness of these events. In finance, temporal point processes are an active research topic in econometrics, often leading to simple interpretations of the complex dynamics of modern electronic markets [2, 3].
However, typical point process models, such as the Hawkes process [20] and the autoregressive conditional duration process [12], make specific assumptions about the functional forms of the generative processes, which may or may not reflect reality, and thus their fixed, simple parametric representations may restrict the expressive power of these models. How can we obtain a more expressive model of marked temporal point processes, and learn such a model from large volumes of data? In this paper, we propose a novel marked temporal point process, referred to as the Recurrent Marked Temporal Point Process, to simultaneously model the event timings and markers. The key idea of our approach is to view the intensity function of a temporal point process as a nonlinear function of the history of the process, and to parameterize the function using a recurrent neural network. More specifically, our work makes the following contributions:
• We propose a novel marked point process to jointly model the time and the marker information by learning a general representation of the nonlinear dependency over the history based on recurrent neural networks. Using our model, the event history is embedded into a compact vector representation which can be used for predicting the next event time and marker type.
• We point out that the proposed Recurrent Marked Temporal Point Process establishes a previously unexplored connection between recurrent neural networks and point processes, which has implications beyond temporal-spatial settings by incorporating richer contextual information and features.
• We conduct large-scale experiments on both synthetic and real-world datasets across a wide range of domains to show that our model has consistently better performance for predicting both the event type and timing compared to alternative competitors.

2. PROBLEM DEFINITION
The input data is a set of sequences C = {S^1, S^2, . . .}. Each S^i = {(t^i_1, y^i_1), (t^i_2, y^i_2), . . .} is a sequence of pairs (t^i_j, y^i_j), where t^i_j is the time when an event of type (or marker) y^i_j has occurred to entity i, and t^i_j < t^i_{j+1}. Depending on the specific application, the entity and the event type can have different meanings. For example, in transportation, S^i can be a trace of time and location pairs for a taxi i, where t^i_j is the time when the taxi picks up or drops off customers in the neighborhood y^i_j. In financial transactions, S^i can be a sequence of time and action pairs for a particular stock i, where t^i_j is the time when a transaction of selling (y^i_j = 0) or buying (y^i_j = 1) has occurred. In electronic health records, S^i is a series of clinical events for patient i, where t^i_j is the time when the patient is diagnosed with the disease y^i_j. Although these applications emerge from a diverse range of domains, we want to build models which are able to:
• Predict the next event pair (t^i_{n+1}, y^i_{n+1}) given a sequence of past events for entity i;
• Evaluate the likelihood of a given sequence of events;
• Simulate a new sequence of events based on the learned parameters of the model.

3. RELATED WORK
Temporal point processes [6] are mathematical abstractions for many different phenomena across a wide range of domains. In seismology, marked temporal point processes have long been widely used for modeling earthquakes and aftershocks [20, 19, 21, 31]. In computational finance, temporal point processes are very active research topics in econometrics [2, 3]. In sociology, temporal-spatial point processes have been used to model networks of criminals [35]. In human activity modeling, the Poisson process and its variants have been used to model the inter-event durations of human activities [29, 15]. More recently, the self-exciting point process [20] has become an ongoing hot topic for modeling the latent dynamics of information diffusion [16, 9, 10, 8, 40, 14, 22], online-user engagement [13], news-feed streams [7], and context-aware recommendations [11].
A major limitation of these existing studies is that they often make various parametric assumptions about the latent dynamics governing the generation of the observed point patterns. In contrast, in this work we seek to propose a model that can learn a general and efficient representation of the underlying dynamics from the event history without assuming a fixed parametric form in advance. The advantage is that the proposed model is more flexible and can be automatically adapted to the data. We compare the proposed RMTPP with many other processes of specific parametric forms in Section 6 to demonstrate the robustness of RMTPP to model misspecification.

4. MARKED TEMPORAL POINT PROCESS
A marked temporal point process is a powerful mathematical tool to model the latent mechanisms governing the observed random event patterns along time. Since the occurrence of an event may be triggered by what happened in the past, we can essentially specify models for the timing of the next event given what we already know so far. More formally, a marked temporal point process is a random process whose realization consists of a list of discrete events localized in time, {tj, yj}, with the timing tj ∈ R+, the marker yj ∈ Y, and j ∈ Z+.
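The event-sequence format above can be made concrete with a minimal sketch; the helper names and the toy taxi trace below are hypothetical illustrations, not part of the paper:

```python
# Minimal sketch of the input format: each sequence S^i is a time-ordered
# list of (t_j, y_j) pairs, and H_t is the prefix of events up to time t.
# The helper names and the toy taxi trace are hypothetical.

def history(seq, t):
    """H_t: all (t_j, y_j) pairs with t_j at or before time t."""
    return [(tj, yj) for (tj, yj) in seq if tj <= t]

def is_valid(seq):
    """Check the ordering constraint t_j < t_{j+1}."""
    times = [tj for (tj, _) in seq]
    return all(a < b for a, b in zip(times, times[1:]))

# A toy trace for one taxi: (time of pickup/dropoff in hours, neighborhood).
S = [(9.0, "midtown"), (10.5, "soho"), (11.05, "chelsea")]
assert is_valid(S)
assert history(S, 10.6) == [(9.0, "midtown"), (10.5, "soho")]
```

The three tasks listed above (prediction, likelihood evaluation, simulation) all operate on sequences of exactly this shape.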
Let the history Ht be the list of event time and marker pairs up to time t. The length dj+1 = tj+1 − tj of the time interval between successive events tj and tj+1 is referred to as the inter-event duration.
Given the history of past events, we can explicitly specify the conditional density function that the next event will happen at time t with type y as f∗(t, y) = f(t, y | Ht), where the notation f∗(t, y) emphasizes that this density is conditional on the history. By applying the chain rule, we can derive the joint likelihood of observing a sequence as follows:

f({(tj, yj)}_{j=1}^{n}) = ∏_j f(tj, yj | Ht) = ∏_j f∗(tj, yj).    (1)

One can design many forms for f∗(tj, yj). In practice, however, people typically choose very simple factorized formulations like f(tj, yj | Ht) = f(yj) f(tj | . . . , tj−2, tj−1) due to the excessive complications caused by jointly and explicitly modeling the timing and the marker information. One can think of f(yj) as a multinomial distribution when yj can only take a finite number of values and is totally independent of the history. f(tj | . . . , tj−2, tj−1) is the conditional density of the event occurring at time tj given the timing sequence of past events. Note, however, that f∗(tj) cannot capture the influence of past markers.

4.1 Parametrizations
The temporal information in a marked point process can be well captured by a typical temporal point process. An important way to characterize temporal point processes is via the conditional intensity function — the stochastic model for the next event time given all previous events. Within a small window [t, t + dt), λ∗(t)dt is the probability for the occurrence of a new event given the history Ht:

λ∗(t)dt = P{event in [t, t + dt) | Ht}.    (2)

The ∗ notation reminds us that the function depends on the history. Given the conditional density function f∗(t), the conditional intensity function can be specified as

λ∗(t)dt = f∗(t)dt / S∗(t) = f∗(t)dt / (1 − F∗(t)),    (3)

where F∗(t) is the cumulative probability that a new event will happen before time t since the last event time tn, and S∗(t) = exp(−∫_{tn}^{t} λ∗(τ)dτ) is the respective probability that no new event has happened up to time t since tn. As a consequence, the conditional density function can alternatively be specified by

f∗(t) = λ∗(t) exp(−∫_{tn}^{t} λ∗(τ)dτ).    (4)

Particular functional forms of the conditional intensity function λ∗(t) are often designed to capture the phenomena of interest [1]. In the following, we review a few representative examples of typical point processes where the conditional intensity has a particular parametric form.
Poisson process [28]. The homogeneous Poisson process is the simplest point process. The inter-event times are independent and identically distributed random variables conforming to the exponential distribution. The conditional intensity function is assumed to be independent of the history Ht and constant over time, i.e., λ∗(t) = λ0 > 0. For the more general inhomogeneous Poisson process, the intensity is also assumed to be independent of the history Ht, but it can be a function varying over time, i.e., λ∗(t) = g(t) > 0.
Hawkes process [20]. A Hawkes process captures the mutual excitation phenomenon among events, with the conditional intensity defined as

λ∗(t) = γ0 + α Σ_{tj<t} γ(t, tj),    (5)

where γ(t, tj) > 0 is the triggering kernel capturing temporal dependencies, γ0 > 0 is a baseline intensity independent of the history, and the summation of kernel terms is history-dependent and a stochastic process by itself. The kernel function can be chosen in advance, e.g., γ(t, tj) = exp(−β(t − tj)) or γ(t, tj) = I[t > tj], or directly learned from data. A distinctive feature of the Hawkes process is that the occurrence of each historical event increases the intensity by a certain amount. Since the intensity function depends on the history up to time t, the Hawkes process is essentially a conditional Poisson process (or doubly stochastic Poisson process [27]) in the sense that, conditioned on the history Ht, the Hawkes process is a Poisson process formed by the superposition of a background homogeneous Poisson process with intensity γ0 and a set of inhomogeneous Poisson processes with intensities γ(t, tj). However, because the events in a past interval can affect the occurrence of events in later intervals, the Hawkes process is in general more expressive than a Poisson process.
Self-correcting process [25]. In contrast to the Hawkes process, the self-correcting process seeks to produce regular point patterns, with the conditional intensity function

λ∗(t) = exp(µt − Σ_{ti<t} α),    (6)

where µ > 0 and α > 0. The intuition is that while the intensity increases steadily over time, every time a new event appears it is decreased by multiplication with a constant e^{−α} < 1, so the chance of new points decreases after an event has occurred recently.
Autoregressive Conditional Duration process [12]. An alternative way of parametrizing the conditional intensity is to capture the dependency between inter-event durations. Let di = ti − ti−1. The expectation of di is given by ψi = E(di | . . . , di−2, di−1). The simplest form assumes that di = ψi εi, where the εi are independently and identically distributed exponential variables with expectation one. As a consequence, the conditional intensity has the following form:

λ∗(t) = ψ_{N(t)}^{−1},    (7)

where ψi = γ0 + Σ_{j=0}^{m} αj di−j captures the influences from the most recent m durations, and N(t) is the total number of events up to t.

4.2 Major Limitations
Curse of Model Misspecification. All these different parameterizations of the conditional intensity function seek to capture and represent certain forms of dependency on the history in different ways: the Poisson process assumes that the duration is stationary; the Hawkes process assumes that the influences from past events are linearly additive towards the current event; the self-correcting process specifies a non-linear dependency over these past events; and the autoregressive conditional duration model imposes a linear structure between successive inter-event durations.
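To make the contrast concrete, the intensity-based parameterizations reviewed in Section 4.1 can be written as plain functions of the past event times. This is an illustrative sketch with made-up parameter values, not code from the paper:

```python
import math

# Illustrative sketch of the parametric intensities in Section 4.1, written
# as plain functions of the past event times (parameter values are made up).

def poisson_intensity(t, past, lam0=1.0):
    # Homogeneous Poisson: constant rate, independent of the history.
    return lam0

def hawkes_intensity(t, past, gamma0=0.2, alpha=0.8, beta=1.0):
    # Eq. (5) with the exponential kernel gamma(t, tj) = exp(-beta (t - tj)):
    # each past event adds a decaying bump, so events excite future events.
    return gamma0 + alpha * sum(math.exp(-beta * (t - tj)) for tj in past if tj < t)

def self_correcting_intensity(t, past, mu=0.5, alpha=0.2):
    # Eq. (6): the rate grows steadily as exp(mu t) and is damped by
    # e^{-alpha} < 1 once for every event that has already occurred.
    n = sum(1 for tj in past if tj < t)
    return math.exp(mu * t - alpha * n)

past = [1.0, 2.5, 3.0]
# Past events push the Hawkes intensity above its baseline level ...
assert hawkes_intensity(3.1, past) > hawkes_intensity(3.1, [])
# ... while they pull the self-correcting intensity below its event-free level.
assert self_correcting_intensity(3.1, past) < self_correcting_intensity(3.1, [])
```

The two asserts illustrate exactly the opposite dependencies on the history that the following paragraph discusses: excitation versus correction.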
These different parameterizations encode our prior knowledge about the latent dynamics we try to model. In practice, however, the true model is never known. Thus, we have to try different specifications for λ∗(t) to tune the predictive performance, and most often we can expect to suffer from certain errors caused by model misspecification.
Marker Generation. Furthermore, it is quite common to have additional information (or covariates) associated with each event, such as the markers. For instance, the marker of a NYC taxi can be the neighborhood name of the place where it picks up (or drops off) passengers; the marker of a financial transaction can be the action of buying (or selling); and the marker of a clinical event can be the diagnosis of the major disease. Classic temporal point processes can be extended to capture the marker information mainly in two ways: first, the marker is directly incorporated into the intensity function; second, each marker can be regarded as an independent dimension of a multi-dimensional temporal point process. With the former approach, we still need to specify a proper form for the conditional intensity function. Moreover, due to the extra complexity of the function induced by the markers, people normally make the strong assumption that the marker is independent of the history [33], which greatly reduces the flexibility of the model. With the latter method, it is very common to have a large number of markers, which results in a sparsity problem in each dimension, where only very few events can happen.

5. RECURRENT MARKED TEMPORAL POINT PROCESS
Each parametric form of the conditional intensity function determines the temporal characteristics of a family of point processes. However, it is hard to correctly decide which form to use without sufficient prior knowledge, while also taking into account both the marker and the timing information. To tackle this challenge, in this section we propose a unified model capable of modeling a general nonlinear dependency over the history of both the event timing and the marker information.

5.1 Model Formulation
By carefully investigating the various forms of the conditional intensity function (5), (6), and (7), we can observe that they are inherently different representations and realizations of various kinds of dependency structures over the past events. Inspired by this critical insight, we seek to learn a general representation to approximate the unknown dependency structure over the history.
A Recurrent Neural Network (RNN) is a feedforward neural network structure where additional edges, referred to as recurrent edges, are added such that the outputs from the hidden units at the current time step are fed back as inputs at the next time step. In consequence, the same feedforward network structure is replicated at each time step, and the recurrent edges connect the hidden units of the network replicas at adjacent time steps along time; that is, the hidden units with recurrent edges receive input not only from the current data sample but also from the hidden units at the previous time step. This feedback mechanism creates an internal state of the network to memorize the influence of each past data sample. In theory, finite-sized recurrent neural networks with sigmoidal activation units can simulate a universal Turing machine [36], which is able to perform an extremely rich family of computations. In practice, RNNs have been shown to be powerful tools for general-purpose sequence modeling. For instance, in Natural Language Processing, recurrent neural networks have achieved state-of-the-art predictive performance for sequence-to-sequence translation [24], image captioning [38], and handwriting recognition [18]. They have also long been used for discrete-time series data prediction [30, 39, 34] (treating time as discrete indices).

Figure 2: Illustration of the Recurrent Marked Temporal Point Process. For each event with timing tj and marker yj, we treat the pair (tj, yj) as the input to a recurrent neural network, where the embedding hj up to time tj learns a general representation of the nonlinear dependency over both the timing and the marker information from past events. Note that the solid diamond and the circle on the timeline indicate two events of different types yj ≠ yj+1.

Our key idea is to let the RNN (or a modern variant such as the LSTM [23] or GRU [5]) model the nonlinear dependency over both the markers and the timings of past events. As shown in Figure 2, for the event occurring at time tj with type yj, the pair (tj, yj) is fed as input into a recurrent neural network unfolded up to the (j+1)-th event. The embedding hj−1 represents the memory of the influence from the timings and markers of past events. The neural network updates hj−1 to hj by taking into account the effect of the current event (tj, yj). Since hj now represents the influence of the history up to the j-th event, the conditional density for the next event timing can be naturally represented as

f∗(tj+1) = f(tj+1 | Ht) = f(tj+1 | hj) = f(dj+1 | hj),    (8)

where dj+1 = tj+1 − tj. As a consequence, we can rely on hj to make predictions of the timing t̂j+1 and the type ŷj+1 of the next event.
The advantage of this formulation is that we explicitly embed the event history into a latent vector space, and by the elegant relation (4) we are now able to capture a general form of the conditional intensity function λ∗(t) without the need to specify a fixed parametric form for the dependency structure over the history. Figure 3 presents the overall architecture of the proposed RMTPP. Given a sequence of events S = ((tj, yj))_{j=1}^{n}, we design an RNN which computes a sequence of hidden units {hj} by iterating the following components.
Input Layer. At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker yj into a latent space. We add an embedding layer with weight matrix Wem to achieve a more compact and efficient representation yj = Wem^⊤ yj + bem, where bem is the bias. We learn Wem and bem while we train the network. In addition, for the timing input tj, we can extract the associated temporal features tj (e.g., the inter-event duration dj = tj − tj−1).
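The input layer just described can be sketched with a few lines of NumPy; the dimensions, weights, and helper name below are illustrative stand-ins, not learned values from the paper:

```python
import numpy as np

# Illustrative sketch of the input layer: the one-hot marker is projected
# through an embedding (y_j = W_em^T y_j + b_em), and the timing is turned
# into a temporal feature, here the inter-event duration d_j = t_j - t_{j-1}.
# Dimensions and weights are made up, not learned values.

rng = np.random.default_rng(0)
K, D = 5, 3                       # number of markers, embedding size
W_em = rng.normal(size=(K, D))    # embedding weights (learned during training)
b_em = np.zeros(D)                # embedding bias

def input_features(t_j, t_prev, marker):
    y_onehot = np.zeros(K)
    y_onehot[marker] = 1.0
    y_emb = W_em.T @ y_onehot + b_em   # dense marker embedding
    d_j = t_j - t_prev                 # temporal feature for the recurrent layer
    return y_emb, np.array([d_j])

y_emb, t_feat = input_features(t_j=10.5, t_prev=9.0, marker=2)
assert y_emb.shape == (D,)
# Multiplying by a one-hot vector simply selects row 2 of W_em.
assert np.allclose(y_emb, W_em[2] + b_em)
assert t_feat[0] == 1.5
```

Both outputs are then concatenated into the recurrent-layer update described next.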
Figure 3: Architecture of RMTPP. For a given sequence S = ((tj, yj))_{j=1}^{n}, at the j-th event, the marker yj is first embedded into a latent space. Then, the embedded vector and the temporal features are fed into the recurrent layer. The recurrent layer learns a representation that summarizes the nonlinear dependency over the previous events. Based on the learned representation hj, it outputs the predictions for the next marker ŷj+1 and timing t̂j+1 to calculate the respective loss functions.

Hidden Layer. We update the hidden vector after receiving the current input and the memory hj−1 from the past. In an RNN, we have

hj = max{W^y yj + W^t tj + W^h hj−1 + bh, 0}.    (9)

Marker Generation. Given the learned representation hj, we model the marker generation with a multinomial distribution:

P(yj+1 = k | hj) = exp(V^y_{k,:} hj + b^y_k) / Σ_{k′=1}^{K} exp(V^y_{k′,:} hj + b^y_{k′}),    (10)

where K is the number of markers and V^y_{k,:} is the k-th row of matrix V^y.
Conditional Intensity. Based on hj, we can now formulate the conditional intensity function as

λ∗(t) = exp(v^t⊤ · hj + w^t (t − tj) + b^t),    (11)

where v^t is a column vector and w^t, b^t are scalars. More specifically:
• The first term v^t⊤ · hj represents the accumulated influence from the marker and timing information of past events. Compared to the fixed parametric formulations (5), (6), and (7) for the past influence, we now have a highly non-linear general specification of the dependency over the history.
• The second term emphasizes the influence of the current event j.
• The last term gives a base intensity level for the occurrence of the next event.
• The exponential function outside acts as a non-linear transformation and guarantees that the intensity is positive.
By invoking the elegant relation (4) between the conditional intensity function and the conditional density function, we can derive the likelihood that the next event will occur at time t given the history:

f∗(t) = λ∗(t) exp(−∫_{tj}^{t} λ∗(τ)dτ)
      = exp{v^t⊤ · hj + w^t (t − tj) + b^t + (1/w^t) exp(v^t⊤ · hj + b^t) − (1/w^t) exp(v^t⊤ · hj + w^t (t − tj) + b^t)}.    (12)

Then, we can estimate the timing of the next event using the expectation

t̂j+1 = ∫_{tj}^{∞} t · f∗(t) dt.    (13)

In general, the integration in (13) has no analytic solution, so we instead apply commonly used numerical integration techniques [32] for one-dimensional functions to compute (13).
Remark. Based on the hidden unit of the RNN, we are able to learn a unified representation of the dependency over the history. In consequence, the direct formulation (11) of the conditional intensity function λ∗(tj+1) captures the information from both past event timings and past event markers. On the other hand, since the prediction of the marker also depends nonlinearly on the past timing information, this may improve the performance of the classification task as well when these two kinds of information are correlated with each other. In fact, experiments on synthetic and real world datasets in the following experimental section do verify such a mutual boosting phenomenon.

5.2 Parameter Learning
Given a collection of sequences C = {S^i}, where S^i = ((t^i_j, y^i_j))_{j=1}^{ni}, we can learn the model by maximizing the joint log-likelihood of observing C:

ℓ({S^i}) = Σ_i Σ_j (log P(y^i_{j+1} | hj) + log f(d^i_{j+1} | hj)).    (14)

We exploit Back Propagation Through Time (BPTT) for training RMTPP. Given the BPTT size b, we unroll the model in Figure 3 by b steps. In each training iteration, we take b consecutive samples {(t^i_k, y^i_k)}_{k=j}^{j+b} from a single sequence, apply the feed-forward operation through the network, and update the parameters with respect to the loss function. After we unroll the model for b steps through time, all the parameters are shared across these copies and are updated sequentially in the back-propagation stage. In our algorithm framework^1, we need both sparse (the marker yj) and dense features at time tj. Meanwhile, the output is also a mixture of discrete markers and real-valued times, which are then fed into different loss functions, including the cross-entropy for the next predicted marker and the negative log-likelihood for the next predicted event timing. Therefore, we build an efficient and flexible platform^2 particularly optimized for training general directed acyclic structured computational graphs (DAGs). The backend is supported via CUDA and MKL for the GPU and CPU platforms, respectively. In the end, we apply stochastic gradient descent (SGD) with mini-batches and several other techniques for training neural networks [37].

^1 [Link]
^2 [Link]
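Equations (11)-(13) can be sketched end-to-end: given a hidden state h_j, evaluate the intensity, the closed-form density, and the predicted next event time by one-dimensional numerical integration, as described above. The hidden state and parameter values below are illustrative stand-ins, not learned ones:

```python
import numpy as np

# Sketch of equations (11)-(13): intensity, closed-form density, and the
# expected next event time via 1-D numerical integration. The hidden state
# and parameter values here are illustrative stand-ins, not learned ones.

rng = np.random.default_rng(1)
h_j = rng.normal(size=4)            # hidden state from the recurrent layer
v_t = 0.1 * rng.normal(size=4)      # v^t (column vector)
w_t, b_t = 1.0, 0.0                 # scalars w^t, b^t (w_t > 0 here)
t_j = 2.0                           # timing of the most recent event
base = v_t @ h_j + b_t              # v^t . h_j + b^t

def intensity(t):
    # Eq. (11): lambda*(t) = exp(v^t . h_j + w^t (t - t_j) + b^t)
    return np.exp(base + w_t * (t - t_j))

def density(t):
    # Eq. (12): f*(t) = lambda*(t) exp(-Lambda(t)), where the cumulative
    # intensity Lambda(t) has a closed form for this choice of lambda*.
    cum = (np.exp(base + w_t * (t - t_j)) - np.exp(base)) / w_t
    return intensity(t) * np.exp(-cum)

def trapezoid(y, x):
    # Simple trapezoidal rule on a fixed grid.
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Eq. (13): t_hat_{j+1} = integral of t f*(t) dt over [t_j, infinity),
# truncated to a finite grid where the density has decayed to ~0.
grid = np.linspace(t_j, t_j + 50.0, 100_000)
mass = trapezoid(density(grid), grid)
t_hat = trapezoid(grid * density(grid), grid)

assert abs(mass - 1.0) < 1e-2       # f* integrates to ~1 when w_t > 0
assert t_hat > t_j                  # the prediction lies after the last event
```

Note that with w^t > 0 the density is proper, so the truncated integral recovers essentially all the probability mass; this is the same one-dimensional numerical integration the paper refers to for (13).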
6. EXPERIMENT
We evaluate RMTPP on large-scale synthetic and real world data. We compare it to several discrete-time and continuous-time sequential models, showing that RMTPP is more robust to model misspecification than these alternatives.

6.1 Baselines
To evaluate the predictive performance of forecasting markers, we compare with the following discrete-time models:
• Majority Prediction. This is also known as the 0-order Markov Chain (MC-0), where at each time step we always predict the most popular marker regardless of the history. Most often, predicting the most popular type is a strong heuristic.
• Markov Chain. We compare with Markov models of varying orders from one to three, denoted MC-1, MC-2, and MC-3, respectively.
To show the effectiveness of predicting time, we compare with the following continuous-time models:
• ACD. We fit a second-order autoregressive conditional duration process with the intensity function given in (7).
• Homogeneous Poisson Process. The intensity function is a constant, which produces an estimate of the average inter-event gap.
• Hawkes Process. We fit a self-exciting Hawkes process with the intensity function in (5).
• Self-correcting Process. We fit a self-correcting process with the intensity function in (6).
Finally, we compare with the Continuous-Time Markov Chain (CTMC) model, which learns continuous transition rates between two states (or markers). This model predicts the next state with the earliest transition time, so it can predict both the marker and the timing of the next event jointly.

6.2 Synthetic Data
To show the robustness of RMTPP, we propose the following generative processes^3:
Autoregressive Conditional Duration. The conditional density function for the next duration dn conforms to an exponential distribution with the expectation determined by the past m durations in the following form, denoted ACD:

f(dn | Hn−1) = αn exp(−αn dn),  αn = (µ0 + γ Σ_{i=1}^{m} dn−i)^{−1},    (15)

where µ0 is the base duration to generate the first event

1. For each time tn−1, we take tn−1 modulo a period of P = 24. If the residual is greater than 12, the process is defined to be in time state rn−1 = 0; otherwise, it is in time state rn−1 = 1.
2. Based on the combination of both the time states {rn−j}_{j=1}^{m} and the marker states {yn−j}_{j=1}^{m} of the previous m events, the process generates the marker k for the next step with probability P(yn = k | {yn−j}_{j=1}^{m}, {rn−j}_{j=1}^{m}).
3. Similarly, based on the combination of {yn−j}_{j=1}^{m} and {rn−j}_{j=1}^{m} from the previous m events, the duration dn = tn − tn−1 has a Poisson distribution with the expectation determined jointly by {rn−j}_{j=1}^{m} and {yn−j}_{j=1}^{m}. Here, we use the Poisson distribution to mimic elapsed time units (e.g., hours, minutes).
In our experiments, without loss of generality, we set the total number of markers to two, m = 3, and randomly initialize the transition probabilities between states.
Experimental Results. Figure 4 presents the predictive performance of RMTPP fitted to different types of time-series data, where for each case we simulate 1,000,000 events and use 90% for training and the remaining 10% for testing. We first compare the predictive performance of RMTPP with the optimal estimator in the left column, where the optimal estimator knows the true conditional intensity function. We treat the expectation of the time interval between the current and the next event as our estimate. Grey curves are the observed inter-event durations from 100 successive events in the testing data. Blue curves are the respective expectations given by the optimal estimator. Red curves are the predictions from RMTPP. We can observe that even though RMTPP has no prior knowledge about the true functional form of each process, its predictive performance is almost consistent with the respective optimal estimator.
The middle column of Figure 4 compares the learned conditional intensity functions (red curves) with the true ones (blue curves). It clearly demonstrates that RMTPP is able to adaptively and accurately capture the unknown heterogeneous temporal dynamics of different time-series data. In particular, because the order of dependency over the history is fixed for ACD, RMTPP almost exactly learns the conditional intensity function with comparable BPTT steps. The Hawkes and self-correcting processes are more challenging in that the conditional intensity function depends on the entire history. Because the events are far from being uniformly distributed, the influence from an individual past event on the occurrence of new future events can vary widely. From this perspective, these processes essentially have random, varying-order dependency on the history compared to ACD. However, with properly chosen BPTT steps, RMTPP
can accurately capture the general shape and each single
starting from zero, d1 ∼ (µ0 )−1 exp(−d1 /µ0 ). We set m = 2,
change point of the true intensity function. In particular,
µ0 = 0.5 and γ = 0.25.
for the Hawkes case, the abruptly increased intensity from
Hawkes Process. The P conditional intensity function is
time index 60 to 100 results in 40 events in a very tiny time
given by λ(t) = λ0 + α ti <t exp − t−t i
where λ0 = 0.2,
σ interval, but still, the predictions of RMTPP can capture
α = 0.8 and σ = 1.0.
the trend of the true data.
Self-Correcting Process. The conditional intensity func-
P The right column of Figure 4 reports the overall RMSE of
tion is given by λ(t) = exp µt − ti <t α where µ = 1 and different processes between the predictions and the true test-
α = 0.2. ing data. We can observe that RMTPP has very strong com-
State-Space Continuous-Time Model. To model the petitive performance and better robustness against model
influence from both markers and time, we further propose misspecification to capture the heterogeneity of the latent
the State-Time Mixture model with the following steps: temporal dynamics of different time-series data compared
to other parametric alternatives.
3
[Link] In addition to time, the state-space continuous-time model
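For concreteness, the first two synthetic generative processes can be simulated directly from their definitions. The sketch below (our own helper names, using the parameter values stated in the text) draws ACD durations according to Eq. (15) and simulates the Hawkes process with Ogata's thinning algorithm; it is an illustration, not the authors' simulation code.

```python
import math
import random

def simulate_acd(n, m=2, mu0=0.5, gamma=0.25, seed=0):
    """Draw n durations: d_k ~ Exp with mean mu0 + gamma * (sum of last m durations),
    i.e. rate alpha_k = (mu0 + gamma * sum)^(-1) as in Eq. (15)."""
    rng = random.Random(seed)
    durations = []
    for _ in range(n):
        mean = mu0 + gamma * sum(durations[-m:])  # empty sum gives mean mu0 for d_1
        durations.append(rng.expovariate(1.0 / mean))
    return durations

def simulate_hawkes(horizon, lam0=0.2, alpha=0.8, sigma=1.0, seed=0):
    """Ogata's thinning for lambda(t) = lam0 + alpha * sum_{t_i < t} exp(-(t - t_i)/sigma)."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        # The exponential kernel only decays between events, so the intensity
        # evaluated just after t dominates lambda(s) for all s > t until the
        # next accepted event, and is a valid thinning bound.
        lam_bar = lam0 + alpha * sum(math.exp(-(t - ti) / sigma) for ti in events)
        t += rng.expovariate(lam_bar)   # candidate arrival from the bounding process
        if t >= horizon:
            return events
        lam_t = lam0 + alpha * sum(math.exp(-(t - ti) / sigma) for ti in events)
        if rng.random() * lam_bar <= lam_t:   # accept with probability lam_t / lam_bar
            events.append(t)
```

With the stated parameters the ACD process has stationary mean µ0/(1 − mγ) = 1.0, and the Hawkes branching ratio ασ = 0.8 < 1 keeps the process stable, so the expected event count on [0, T] is roughly λ0 T / (1 − ασ).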
[Figure 4 graphics omitted. Columns: Time Prediction Curve (observed inter-event times vs. the optimal estimator and RMTPP), Intensity Function (learned vs. true), and Time Prediction Error (RMSE of Poisson, Hawkes, self-correcting, ACD, the optimal estimator, and RMTPP); rows: the ACD, Hawkes, and self-correcting processes.]

[Figure 5 graphics omitted: (a) Time Prediction Curve; (b) Time Prediction Error; (c) Marker Prediction Error.]
Figure 5: Performance evaluation of predicting timings and markers on the state-space continuous-time model.
[Figure 7 graphics omitted. Columns: NYC Taxi, Financial, Stack Overflow, MIMIC II; top row: marker classification error (%) of CTMC, MC-0 through MC-3, and RMTPP; bottom row: timing RMSE of CTMC, Poisson, Hawkes, self-correcting, ACD, and RMTPP.]
Figure 7: Performance evaluation for predicting both the marker and the timing of the next event. The top row presents the classification error of predicting the event markers, and the bottom row gives the RMSE of predicting the event timings.
also includes the marker information. Figure 5 compares the error rates of the different processes in predicting both event timings and markers. Compared to the other baselines, RMTPP is again consistent with the optimal estimator, without any prior knowledge about the true underlying generative process.

Finally, since the occurrences of future events depend on both the past marker and timing information, we would like to investigate whether learning a unified representation of the joint information can further improve future predictions. Therefore, we train an RNN using only the temporal information and only the marker information, respectively. Figure 6 gives the comparison between RMTPP and the RNN: in panel (a), the RNN has an RMSE of 4.3252 while RMTPP achieves an RMSE of 2.7395, and in panel (b), the RNN reports a 39.59% classification error while RMTPP reaches 27.16%. Clearly, this verifies that jointly modeling both sources of information can boost the performance of predicting future events.

[Figure 6 graphics omitted: (a) time prediction RMSE, RNN 4.3252 vs. RMTPP 2.7395; (b) marker prediction error, RNN 39.59% vs. RMTPP 27.16%.]
Figure 6: Predictive performance comparison with an RNN which is trained for predicting the next timing only in (a), and for predicting the next marker only in (b).

6.3 Real Data

We evaluate the predictive performance of RMTPP on real-world datasets from a diverse range of domains.

New York City Taxi Dataset. The NYC taxi dataset4 contains ∼173 million trip records of individual taxis for 12 consecutive months in 2013. The location information is available in the form of latitude/longitude coordinates of the pick-up (drop-off) passengers associated with every trip. We have used the NYC Neighborhood Names GIS dataset5 to map the coordinates to neighborhood names. For those coordinates for which the location name is not directly available in the GIS dataset, we use the geodesic distance to map them to the nearest neighborhood name. With this process we obtained 299 unique locations as our markers. An event is a pickup record for a taxi. Further, we have divided each single sequence of a taxi into multiple fine-grained subsequences in which two consecutive events are within 12 hours. We obtained 670,753 sequences in total, of which 536,603 were used for training and 134,150 for testing. We predict the location and the time of the next pickup event.

Financial Transaction Dataset. We have collected raw limit order book data from NYSE of the high-frequency transactions for one stock on a single day. It contains 0.7 million transaction records, each of which records the time (in milliseconds) and the action taken (B = buy, S = sell). We treat the type of action as the marker. The input data is a single long sequence with 624,149 events for training and 69,350 events for testing. The task is to predict which action will be taken next, and at what time.

Electronic Medical Records. The MIMIC II medical dataset is a collection of de-identified clinical visit records of Intensive Care Unit patients over seven years. We have filtered out 650 patients and 204 diseases. Each event records the time when a patient visited the hospital. We have used the sequences of 585 patients for training, and the rest for testing. The goal is to predict which major disease will happen to a given patient, and at what time in the future.

Stack Overflow Dataset. Stack Overflow6 is a question-

4 [Link]
5 [Link] Neighborhood-Names-GIS/99bc-9p23
6 [Link]
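The coordinate-to-neighborhood mapping described for the taxi data can be approximated by a nearest-centroid lookup under the haversine (great-circle) distance. The sketch below is our own illustration: the neighborhood names and centroid coordinates are placeholders, not entries from the GIS dataset.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative centroids only; the real mapping would use the GIS dataset's polygons.
NEIGHBORHOODS = {
    "Midtown": (40.7549, -73.9840),
    "SoHo": (40.7233, -74.0030),
    "Harlem": (40.8116, -73.9465),
}

def nearest_neighborhood(lat, lon):
    """Map a raw coordinate to the closest neighborhood centroid."""
    return min(NEIGHBORHOODS, key=lambda n: haversine_km(lat, lon, *NEIGHBORHOODS[n]))
```

For example, a pickup near Times Square maps to "Midtown" under these placeholder centroids; a production mapping would first test polygon containment and fall back to the nearest centroid only for unmatched points, as the text describes.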
[Figure 8 graphics omitted. Columns: NYC Taxi, Financial, Stack Overflow, MIMIC II; top row: marker classification error (%) of RNN vs. RMTPP; bottom row: timing RMSE of RNN vs. RMTPP.]
answering website which exploits badges to encourage user engagement and guide behavior [17]. There are 81 types of non-topical (i.e., non-tag-affiliated) badges, which can be awarded either only once (e.g., Altruist, Inquisitive) or multiple times (e.g., Stellar Question, Guru, Great Answer) to a user. Ignoring the badges which can be awarded only once, we first select users who have earned at least 40 badges between 2012-01-01 and 2014-01-01, and then those badges which have been awarded at least 100 times to the users selected in the first step. We have removed the users who were instantaneously awarded multiple badges due to technical issues of the servers. In the end, we have ∼6 thousand users with a total of ∼480 thousand events, where each badge is treated as a marker.

Experimental Results. We compare and report the predictive performance of the different models on the testing data of each dataset in Figure 7. The hyper-parameters of RMTPP across all these datasets are tuned as follows: learning rate in {0.1, 0.01, 0.001}; hidden layer size in {64, 128, 256, 512, 1024}; momentum = 0.9 and L2 penalty = 0.001; and batch size in {16, 32, 64}. Figure 7 compares the predictive performance of forecasting the markers and timings of the next event across the four real datasets. RMTPP outperforms the other alternatives with lower errors for predicting both timings and markers. Because the MIMIC II dataset has many short sequences and is the smallest of the four datasets, increasing the order of the Markov chain decreases its classification performance.

We also compare RMTPP with an RNN trained only with the marker information and only with the timing information, separately, in Figure 8. We can observe that RMTPP, trained by incorporating both the past marker and timing information, performs consistently better than an RNN trained with either source of information alone. Finally, Figure 9 shows the empirical distribution of the inter-event times on the Stack Overflow and the financial transaction data. Even though the real datasets have quite different characteristics, RMTPP is generally more flexible in capturing such heterogeneity than the other temporal processes of fixed parametric forms.

[Figure 9 graphics omitted: histograms of the log inter-event time for (a) Stack Overflow and (b) Financial Transaction.]
Figure 9: Empirical distribution for the inter-event times. The x-axis is in log-scale.

7. DISCUSSIONS

We present the Recurrent Marked Temporal Point Process, which builds a connection between recurrent neural networks and point processes. The recurrent neural network supports different architectures, including the classic RNN and the modern LSTM, GRU, etc. Besides, in addition to the inter-event temporal features, our model can be readily generalized to incorporate other contextual information. For instance, in addition to training a global model, we can also take potential user-profile features into account for personalization. Furthermore, based on the structural information of social networks, our model can be generalized in such a way that the prediction of one user's sequence depends not only on her own history but also on the other users' histories, to capture their interactions in a networked setting.

To conclude, RMTPP inherits the advantages of both recurrent neural networks and temporal point processes to predict both the marker and the timing of future events without any prior knowledge about the hidden functional forms of the latent temporal dynamics. Experiments on both synthetic and real-world datasets demonstrate that RMTPP is robust to model misspecification and has consistently better performance compared to the other alternatives.
Acknowledgements

This project was supported in part by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983.

8. REFERENCES

[1] O. Aalen, O. Borgan, and H. Gjessing. Survival and event history analysis: a process point of view. Springer, 2008.
[2] E. Bacry, A. Iuga, M. Lasnier, and C.-A. Lehalle. Market impacts and the life cycle of investors orders. 2014.
[3] E. Bacry, T. Jaisson, and J.-F. Muzy. Estimation of slowly decreasing Hawkes kernels: Application to high frequency order book modelling. 2015.
[4] R. Begleiter, R. El-Yaniv, and G. Yona. On prediction using variable order Markov models. J. Artif. Intell. Res. (JAIR), 22:385–421, 2004.
[5] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.
[6] D. Daley and D. Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure, volume 2. Springer, 2007.
[7] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In KDD. ACM, 2015.
[8] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. In NIPS, 2013.
[9] N. Du, L. Song, A. J. Smola, and M. Yuan. Learning networks of heterogeneous influence. In NIPS, 2012.
[10] N. Du, L. Song, H. Woo, and H. Zha. Uncover topic-sensitive information diffusion networks. In AISTATS, 2013.
[11] N. Du, Y. Wang, N. He, and L. Song. Time-sensitive recommendation from recurrent user activities. In NIPS, 2015.
[12] R. F. Engle and J. R. Russell. Autoregressive conditional duration: A new model for irregularly spaced transaction data. Econometrica, 66(5):1127–1162, Sep 1998.
[13] M. Farajtabar, N. Du, M. Gomez-Rodriguez, I. Valera, H. Zha, and L. Song. Shaping social activity by incentivizing users. In NIPS, 2014.
[14] M. Farajtabar, M. Gomez-Rodriguez, N. Du, M. Zamani, H. Zha, and L. Song. Back to the past: Source identification in diffusion networks from partially observed cascades. In AISTATS, 2015.
[15] A. Ferraz Costa, Y. Yamaguchi, A. Juci Machado Traina, C. Traina, Jr., and C. Faloutsos. RSC: Mining and modeling temporal activity in social media. In KDD '15, pages 269–278, 2015.
[16] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.
[17] S. Grant and B. Betts. Encouraging user behaviour with achievements: an empirical study. In Mining Software Repositories (MSR), pages 65–68. IEEE, 2013.
[18] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 855–868, 2009.
[19] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society Series B, 33:438–443, 1971.
[20] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
[21] A. G. Hawkes and D. Oakes. A cluster process representation of a self-exciting process. Journal of Applied Probability, pages 493–503, 1974.
[22] X. He, T. Rekatsinas, J. Foulds, L. Getoor, and Y. Liu. HawkesTopic: A joint model for network inference and topic modeling from text-based cascades. In ICML, pages 871–880, 2015.
[23] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[24] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. 2014.
[25] V. Isham and M. Westcott. A self-correcting point process. Advances in Applied Probability, 37:629–646, 1979.
[26] J. Janssen and N. Limnios. Semi-Markov Models and Applications. Kluwer Academic, 1999.
[27] J. Kingman. On doubly stochastic Poisson processes. Mathematical Proceedings of the Cambridge Philosophical Society, pages 923–930, 1964.
[28] J. F. C. Kingman. Poisson processes, volume 3. Oxford University Press, 1992.
[29] R. D. Malmgren, D. B. Stouffer, A. E. Motter, and L. A. N. Amaral. A Poissonian explanation for heavy tails in e-mail communication. Proceedings of the National Academy of Sciences, 105(47):18153–18158, 2008.
[30] H. Min, X. Jiahui, X. Shiguo, and Y. Fuliang. Prediction of chaotic time series based on the recurrent predictor neural network. IEEE Transactions on Signal Processing, 52:3409–3416, 2004.
[31] Y. Ogata. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50(2):379–402, 1998.
[32] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computation. Cambridge University Press, Cambridge, UK, 1994.
[33] J. G. Rasmussen. Temporal point processes: the conditional intensity function. [Link], 2009.
[34] C. Rohitash and Z. Mengjie. Cooperative coevolution of Elman recurrent neural networks for chaotic time series prediction. Neurocomputing, 86:116–123, 2012.
[35] M. Short, G. Mohler, P. Brantingham, and G. Tita. Gang rivalry dynamics via coupled point process networks. Discrete and Continuous Dynamical Systems Series B, 19:1459–1477, 2014.
[36] H. T. Siegelmann and E. D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4:77–80, 1991.
[37] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In ICML-13, volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.
[38] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[39] C. Xindi, Z. Nian, V. Ganesh K., and W. I. Donald C. Time series prediction with recurrent neural networks trained by a hybrid PSO–EA algorithm. Neurocomputing, 70:2342–2353, 2007.
[40] Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec. SEISMIC: A self-exciting point process model for predicting tweet popularity. In KDD '15, pages 1513–1522, 2015.
RMTPP offers advantages over models such as the Poisson or Hawkes process because it handles model misspecification and captures the heterogeneity of latent temporal dynamics more robustly. It uses a recurrent neural network to learn non-parametric, nonlinear dependencies between event timings and markers, giving it more flexibility and accuracy. Its predictions of both future event times and types remain competitive on complex data patterns where fixed parametric models are limited.
Assuming that markers are independent of the history restricts the model's flexibility because it ignores any dependencies between the type of the current event and past events. This discards contextual information that can affect the occurrence of future events, reducing the model's ability to accurately capture the complex temporal patterns and dynamics present in the data.
Incorporating RNNs into marked temporal point processes lets the network learn a general representation of the nonlinear dependencies over both the timing and the marker information of past events. The RNN updates its current state as each event arrives, embedding the event history into a latent vector space and thereby capturing complex temporal dependencies without a predefined parametric specification.
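As a concrete, simplified illustration of this recurrent state update, the sketch below folds each past event's one-hot marker and inter-event gap into a hidden vector with a plain ReLU recurrence. The weight shapes, initialization, and dimensions are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H = 4, 8                          # number of marker types, hidden size
W_y = rng.normal(0, 0.1, (H, K))     # marker input weights
w_t = rng.normal(0, 0.1, (H, 1))     # temporal feature weights
W_h = rng.normal(0, 0.1, (H, H))     # recurrent weights
b = np.zeros((H, 1))                 # bias

def update(h, marker, dt):
    """One recurrent step: combine the one-hot marker, the inter-event gap,
    and the previous hidden state, then apply a ReLU nonlinearity."""
    y = np.zeros((K, 1))
    y[marker] = 1.0
    return np.maximum(W_y @ y + w_t * dt + W_h @ h + b, 0.0)

# Embed a short (marker, inter-event gap) history into the latent state h.
h = np.zeros((H, 1))
for marker, dt in [(1, 0.5), (3, 2.0), (0, 0.1)]:
    h = update(h, marker, dt)
```

The final `h` summarizes the whole history; predictions of the next marker and timing would both be read out from it.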
High-frequency financial transaction data, due to its granular and voluminous nature, plays to the strengths of RMTPP: the dense stream of events provides rich temporal patterns that a recurrent framework can exploit. RMTPP can leverage this density to discern and predict the complex dependencies and trends intrinsic to high-frequency trading, improving the precision of predicting both actions and timings.

RMTPP uses embedding layers to convert sparse input vectors, such as one-hot event markers, into dense, continuous latent representations. This yields a more compact and informative representation, reducing the dimensionality of the input while preserving important features and interactions. By embedding both temporal and marker information into a latent space, RMTPP captures intricate patterns and dependencies, which improves its prediction accuracy on complex event sequences.
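An embedding layer is just a learned matrix indexed by the marker id; the lookup is mathematically the product of that matrix with the one-hot vector, without ever materializing the sparse vector. A minimal sketch (sizes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 1000, 32                      # marker vocabulary size, embedding dimension
E = rng.normal(0, 0.01, (K, D))      # embedding matrix, learned during training

def embed(marker_id):
    """Row lookup, equivalent to E.T @ one_hot(marker_id)."""
    return E[marker_id]
```

During training, only the rows of `E` for markers that actually occur receive gradient updates, which is what makes the dense representation cheap despite a large vocabulary.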
RMTPP achieves lower prediction error in sequence modeling by employing RNNs to learn jointly from both timing and marker information within a unified framework. This joint modeling captures dependencies and temporal dynamics that traditional parametric sequence models overlook, and the non-parametric, data-driven learning process reduces model bias, yielding more accurate event predictions and lower errors such as the root mean square error (RMSE).

The conditional intensity function λ*(t) of a marked temporal point process represents the instantaneous rate of event occurrence given the past history Ht. It is defined through λ*(t)dt = P{event in [t, t + dt) | Ht}: the probability of an event occurring within the small window [t, t + dt) given the past events. It encodes the influence of past events on the likelihood of future events and fully specifies the dynamics of the process.
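The intensity determines the density of the next event time through the standard survival identity f*(t) = λ*(t) exp(−∫_{t_n}^{t} λ*(s) ds). The snippet below (our own numerical check, not code from the paper) evaluates this formula with a Riemann sum; for a constant intensity λ it should recover the exponential density λ e^{−λτ}.

```python
import math

def next_event_density(intensity, tau, dt=1e-4):
    """f*(tau) = lambda*(tau) * exp(-integral_0^tau lambda*(s) ds),
    with the integral approximated by a left Riemann sum of step dt."""
    steps = round(tau / dt)
    cum = sum(intensity(i * dt) for i in range(steps)) * dt
    return intensity(tau) * math.exp(-cum)

# For intensity(s) = 1.5 this should match the closed form 1.5 * exp(-1.5 * tau).
```

This identity is what makes the intensity a convenient modeling target: any positive λ*(t) yields a valid predictive density for the next event time.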
RMTPP handles temporal prediction in complex datasets such as NYC Taxi or MIMIC II by embedding event sequences into a high-dimensional latent space with an RNN, which can model complex temporal patterns and dependencies across events. This lets RMTPP exploit both marker and temporal information to improve predictive accuracy, whereas simpler models may struggle with the sparse and heterogeneous nature of such data. Consequently, RMTPP predicts the timing and type of future events more accurately.

Learning a general representation is crucial when modeling temporal point processes with recurrent neural networks: it captures complex dependencies and interactions within historical event sequences without specific parametric assumptions. This flexibility lets the model adapt to the diverse patterns and structures present in the data, enhancing its predictive robustness and its ability to generalize across different event types and timing scenarios.

RMTPP performs better because it jointly models the temporal and marker information within a unified recurrent framework, capturing the interactions between timing and event types. Learning this comprehensive representation of the event history reduces both the RMSE and the classification error relative to RNNs trained on either the temporal or the marker information alone, which miss these key interactions.
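The timing prediction used throughout the experiments is the expectation of the next inter-event gap under the predictive density, t̂ − t_n = ∫ τ f*(τ) dτ. The sketch below evaluates this integral numerically for an intensity of the exponential form often used in RMTPP-style models, λ*(t_n + τ) = exp(v·h + wτ + b); the constants `vh`, `w`, and `b` are made-up placeholders, not learned parameters.

```python
import math

def intensity(tau, vh=0.3, w=1.0, b=-1.0):
    """Exponential-form intensity after the last event; constants are illustrative."""
    return math.exp(vh + w * tau + b)

def expected_next_gap(dt=1e-3, horizon=20.0):
    """Numerically integrate tau * f*(tau) over [0, horizon], where
    f*(tau) = lambda*(tau) * exp(-Lambda(tau)) and Lambda is the cumulative intensity.
    Returns (expected gap, total probability mass captured)."""
    total, cum, mass, tau = 0.0, 0.0, 0.0, 0.0
    while tau < horizon:
        lam = intensity(tau)
        f = lam * math.exp(-cum)     # predictive density of the gap at tau
        total += tau * f * dt
        mass += f * dt
        cum += lam * dt              # running Riemann sum of the cumulative intensity
        tau += dt
    return total, mass
```

Because the intensity grows exponentially in τ, the survival term decays super-exponentially, so the captured probability mass should be essentially 1 well before the integration horizon.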