DeepETA - How Uber Predicts Arrival Times Using Deep Learning
DeepETA - How Uber Predicts Arrival Times Using Deep Learning
Engineering
AI, Data / ML
At Uber, magical customer experiences depend on accurate arrival time predictions (ETAs). We
use ETAs to calculate fares, estimate pickup times, match riders to drivers, plan deliveries, and
more. Traditional routing engines compute ETAs by dividing up the road network into small
road segments represented by weighted edges in a graph. They use shortest-path algorithms
to find the best path through the graph and add up the weights to derive an ETA. But as we all
know, the map is not the terrain: a road graph is just a model, and it can’t perfectly capture
conditions on the ground. Moreover, we may not know which route a particular rider and driver
will choose to their destination. By training machine learning (ML) models on top of the road
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 1/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
graph prediction using historical data in combination with real-time signals, we can refine
ETAs that better predict real-world outcomes.
For several years, Uber used gradient-boosted decision tree ensembles to refine ETA
predictions. The ETA model and its training dataset grew steadily larger with each release. To
keep pace with this growth, Uber’s Apache Spark™ team contributed upstream improvements
[1, 2] to XGBoost to allow the model to grow ever deeper, making it one of the largest and
deepest XGBoost ensembles in the world at that time. Eventually, we reached a point where
increasing the dataset and model size using XGBoost became untenable. To continue scaling
the model and improving accuracy, we decided to explore deep learning because of the
relative ease of scaling to large datasets using data-parallel SGD [3]. To justify switching to
deep learning we needed to overcome three main challenges:
Latency: The model must return an ETA within a few milliseconds at most.
Accuracy: The mean absolute error (MAE) must improve significantly over the
incumbent XGBoost model.
Generality: The model must provide ETA predictions globally across all of Uber’s lines of
business such as mobility and delivery.
To meet these challenges, Uber AI partnered with Uber’s Maps team on a project called
DeepETA to develop a low-latency deep neural network architecture for global ETA prediction.
In this blog post, we’ll be walking you through some of the learnings and design choices that
helped DeepETA become the new production ETA model at Uber. For more information, please
refer to our full paper.
Problem Statement
In the past few years, there has been increasing interest in systems that combine physical
models of the world with deep learning. We take a similar approach to ETA prediction at Uber.
Our physical model is a routing engine that uses map data and real-time traffic measurements
to predict an ETA as a sum of segment-wise traversal times along the best path between two
points. We then use machine learning to predict the residual between the routing engine ETA
and real-world observed outcomes. We call this hybrid approach ETA post-processing, and
DeepETA is an example of a post-processing model. From a practical perspective, it’s generally
easier to assimilate new data sources and accommodate fast-changing business
requirements by updating the post-processing model than it is to refactor the routing engine
itself.
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 2/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
To be able to predict the ETA residual, a post-processing ML model takes into account spatial
and temporal features, such as the origin, destination and time of the request, as well
information about real-time traffic and the nature of the request, such as whether it is a
delivery dropoff or rideshare pickup, as illustrated in Figure 1. This post-processing model is
the highest QPS (queries per second) model at Uber. This model needs to be fast so as not to
add too much latency to an ETA request, and they need to improve ETA accuracy as measured
by mean absolute error (MAE) across different segments of the data.
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 3/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
Figure 2: An overview of the DeepETA model pipeline
In a language model, each vector represents a single word token, but in the case of DeepETA,
each vector represents a single feature, such as the origin of the trip or the time of day. Self-
attention uncovers pairwise interactions among the K features in a tabular dataset by explicitly
computing a K*K attention matrix of pairwise dot products, using the softmax of these scaled
dot-products to reweight the features. When the self-attention layer processes each feature, it
looks at every other feature in the input for clues and outputs the representation of this
feature as a weighted sum of all features. This process is illustrated in Figure 3. Through this
way, we can bake the understanding of all the temporal and spatial features into the one
feature currently being processed and focus on the features that matter. In contrast to
language models, there is no positional encoding in DeepETA since the order of the features
doesn’t matter.
Taking a trip from origin A to destination B as an example, the self-attention layer scales the
importance of the features given the time of the day, origin and destination, traffic conditions
etc. A visualization of the self-attention is shown in Figure 4 (generated from Tensor2Tensor
[11]) with 8 colors corresponding to 8 attention heads and shares corresponding to random
generated attention weights.
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 4/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
Feature Encoding
The DeepETA model embeds all categorical features, and bucketizes all continuous features
before embedding them. Somewhat counterintuitively, bucketizing continuous features led to
better accuracy than using continuous features directly. While bucketizing isn’t strictly
necessary because neural networks can learn any non-linear discontinuous function, a
network with bucketized features may have an advantage because it doesn’t have to spend any
of its parameter budget learning to partition the input space. Like the Gradient Boosted
Decision Tree Neural Network paper [12], we found that using quantile buckets provided better
accuracy than equal-width buckets. We suspect that quantile buckets perform well because
they maximize entropy: for any fixed number of buckets, quantile buckets convey the most
information (in bits) about the original feature value compared to any other bucketing scheme.
Geospatial Embeddings
Post-processing models receive the origin and destination of a trip as latitudes and longitudes.
Because these start and end points are very important for predicting ETAs, DeepETA encodes
them differently from other continuous features. Location data is distributed very unevenly
over the globe and contains information at multiple spatial resolutions. We therefore quantize
locations into multiple resolution grids based on latitude and longitude. As the resolution
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 5/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
increases, the number of distinct grid cells grows exponentially and the average amount of
data in each grid cell decreases proportionally. We explored three different strategies for
mapping these grids to embeddings:
Exact indexing, which maps each grid cell to a dedicated embedding. This takes up the
most space.
Feature hashing [16], which maps each grid cell into a compact range of bins using a
hash function. The number of bins is much smaller than with exact indexing.
Multiple feature hashing, which extends feature hashing by mapping each grid cell to
multiple compact ranges of bins using independent hash functions. See Figure 5:
Our experiments showed that while feature hashing saved space compared to exact indexing,
the accuracy was the same or slightly worse depending on the grid resolution. This is likely due
to hash collisions that cause some information to be lost. Multiple feature hashing provided
the best accuracy and latency while still saving space compared to exact indexing. This implies
that the network is able to combine the information from multiple independent hash buckets
to undo the negative effects of single-bucket collisions.
Fast Transformer
While the transformer-based encoder provided the best accuracy, it was too slow to meet the
latency requirements for online real-time serving. The original self-attention model has
quadratic complexity because it computes a K K attention matrix from K inputs. There have
been multiple research works that linearize the self-attention calculation, for example linear
transformer [10], linformer [13], performer [14]. After experimentation, we chose the linear
transformer, which uses the kernel trick to avoid calculating the attention matrix.
To illustrate the time complexity, let’s use the following example: say we have K inputs with
dimension d. The original transformer’s time complexity is O(K2d), while the linear
transformer’s time complexity is O(Kd2). If we have 40 features all with 8 dimensions, i.e. K = 40,
d = 8, K2d = 12,800, while Kd2 = 2,560. It’s clear that the linear transformer is faster whenever
K>d.
First of all, the model itself is relatively shallow with just a handful of layers. The vast majority
of the parameters exist in embedding lookup tables. By discretizing the inputs and mapping
them to embeddings, we avoid evaluating any of the unused embedding table parameters.
Discretizing the inputs gives us a clear speed advantage at serving time compared to
alternative implementations. Take the geospatial embeddings pictured in Figure 5 as an
example. To map a latitude and longitude to an embedding, DeepETA simply quantizes the
coordinates and performs a hash lookup, which takes O(1) time. In comparison, storing
embeddings in a tree data structure would require O(log N) lookup time, while using fully-
connected layers to learn the same mapping would require O(N2) lookup time. Seen from this
perspective, discretizing and embedding inputs is simply an instance of the classic space vs
time tradeoff in computer science: by precomputing partial answers in the form of large
embedding tables learned during training, we reduce the amount of computation needed at
serving time.
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 7/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 8/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
Different business use cases require different types of ETA point estimates and will also have
varying proportions of outliers in their data. For example, we want to estimate a mean ETA for
fare computation but need to control for the effect of outliers. Other use cases may call for a
specific quantile of the ETA distribution. To accommodate this diversity, DeepETA uses a
parameterized loss function, asymmetric Huber loss, which is robust to outliers and can
support a range of commonly used point estimates.
Asymmetric Huber loss has two parameters, delta and omega, that control the degree of
robustness to outliers and the degree of asymmetry respectively. By varying delta (as shown in
Figure 7), squared error and absolute error can be smoothly interpolated, with the latter being
less sensitive to outliers. By varying omega, you can control the relative cost of
underprediction vs overprediction, which is useful in situations where being a minute late is
worse than being a minute early. These parameters not only make it possible to mimic other
commonly used regression loss functions, but also make it possible to tailor the point estimate
produced by the model to meet diverse business goals.
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 9/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
consumers’ requests are routed through various services to the uRoute service. The uRoute
service serves as a frontend for all routing lookups. It makes requests to the routing engine to
produce route-lines and ETAs. It uses this ETA and other model features to make requests to
the Michelangelo Online prediction service to get predictions from the DeepETA model.
Periodic auto retraining workflows are set up to retrain and validate the model.
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 11/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
DeepETA enables the ETA team to explore diverse model architectures and output multiple
ETAs that can be tailored for each consumer. More work is underway to further expand the
accuracy improvements it can deliver by looking at every aspect of the modeling process
covering Dataset, Features, Transformations, Model Architectures, Training Algorithms, Loss
Functions, and Tooling/Infrastructure Improvements. In the future, the team will explore
enhancements such as continuous, incremental training, which will enable ETAs to be trained
on fresher data.
Acknowledgements
DeepETA is a product of an Uber AI <> Maps partnership. We’d like to thank the following people
who helped make this possible:
Software Engineers: Michael Albada, Apurva Bhandari, Aditya Bhave, Tanmay Binaykiya,
Chongxiao Cao, Sashikanth Chandrasekaran, Eric Chen, Hao Jiang, Vladimir Kuzmin,
Michael Mallory, Michael Mui, Sean Po, Rich Porter, Vivek Sah, Raajay Viswanathan,
Joseph Wang, Peng Zhang
Data Scientists: Wenqi Hu, Alvira Swalin
Research Scientists: Olcay Cirit, Eric Frank, Xinyu Hu, Felipe Petroski Such
Product Managers: Wataru Ueno
Technical Program Managers: Samuel Chan, Ramit Hora
Eng Managers: Bozhena Bidyuk, Himaanshu Gupta, Sally Lee, Xu Ning
Leadership: Jaikumar Ganesh, Bobby Parikh, Smitha Shyam, Thomas Williams, Zoubin
Ghahramani, Jan Pedersen, Douglas Bemis
Apache Spark™ and Spark™ are trademarks of the Apache Software Foundation in the United
States and/or other countries. No endorsement by The Apache Software Foundation is implied
by the use of these marks.
References
1. Distributed training upstream improvement contributed by Uber’s Spark team.
2. Scalability upstream improvement contributed by Uber’s Spark team.
3. Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for
TensorFlow.
4. Multilayer perceptron
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 12/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
5. Popov, Sergei, Stanislav Morozov, and Artem Babenko. “Neural oblivious decision
ensembles for deep learning on tabular data.” arXiv preprint arXiv:1909.06312 (2019).
6. Arık, Sercan O., and Tomas Pfister. “Tabnet: Attentive interpretable tabular learning.” arXiv
(2020).
7. Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey
Hinton, and Jeff Dean. “Outrageously large neural networks: The sparsely-gated mixture-
of-experts layer.” arXiv preprint arXiv:1701.06538 (2017).
8. Ha, David, Andrew Dai, and Quoc V. Le. “Hypernetworks.” arXiv preprint arXiv:1609.09106
(2016).
9. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” In Advances in
neural information processing systems, pp. 5998-6008. 2017.
10. Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret.
“Transformers are rnns: Fast autoregressive transformers with linear attention.” In
International Conference on Machine Learning, pp. 5156-5165. PMLR, 2020.
11. Vaswani, Ashish, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez,
Stephan Gouws, Llion Jones et al. “Tensor2tensor for neural machine translation.” arXiv
preprint arXiv:1803.07416 (2018).
12. Saberian, Mohammad, Pablo Delgado, and Yves Raimond. “Gradient boosted decision
tree neural network.” arXiv preprint arXiv:1910.09340 (2019).
13. Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. “Linformer: Self-
attention with linear complexity.” arXiv preprint arXiv:2006.04768 (2020).
14. Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins et al. “Rethinking attention with performers.” arXiv preprint
arXiv:2009.14794 (2020).
15. Meet Michelangelo: Uber’s Machine Learning Platform
16. Weinberger, Kilian, et al. “Feature hashing for large scale multitask learning.” Proceedings
of the 26th annual international conference on machine learning. 2009.
Xinyu Hu
Xinyu Hu is a Senior Research Scientist at Uber AI focused on large-
scale machine learning applications in spatial-temporal problems
and causal inference. Xinyu holds a Ph.D. in Biostatistics from
Columbia University.
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 13/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
Olcay Cirit
Olcay Cirit is a Staff Research Scientist at Uber AI focused on ML
systems and large-scale deep learning problems. Prior to Uber AI, he
worked on ad targeting at Google.
Tanmay Binaykiya
Tanmay Binaykiya is a Senior Software Engineer on the Maps ETA
Prediction team, and focuses on projects at the intersection of ML
and systems.
Ramit Hora
Ramit Hora is the Technical Program Management Lead for Uber AI
& Maps where he works to drive innovation across Uber’s
technical systems and product experiences.
Related articles
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 14/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
June 13 / Global
May 2 / Global
AI
April 23 / Global
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 15/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
March 28 / Global
February 1 / Global
Most popular
Engineering, AI, Backend June 13 / Global
Personalized Marketing at Scale: Uber’s Out-of-App Recommendation System
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 16/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
Uber
Company
About us
Our offerings
Newsroom
Investors
Blog
Careers
AI
Gift cards
Products
Ride
Drive
Deliver
Eat
Uber Freight
Global citizenship
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 17/18
6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
Safety
Sustainability
Travel
Airports
Cities
English
Riyadh
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 18/18