Uber Data Analysis
(2020-21)
Submitted By:
Unnati Goyal (181500768)
Nandinee Gupta (181500414)
Saumya Gupta (181500632)
Roshni Rawat (181500594)
Supervised By:
Mr. Pawan Verma
Assistant Professor
GLA University, Mathura
Department of Computer Engineering & Applications
Department of Computer Engineering and Applications
GLA University, 17 km. Stone NH#2, Mathura-Delhi Road,
Chaumuha, Mathura – 281406 U.P (India)
Declaration
We hereby declare that the work which is being presented in the B.Tech.
Project “Uber Data Analysis”, in partial fulfillment of the requirements for
the award of the Bachelor of Technology in Computer Science and
Engineering and submitted to the Department of Computer Engineering and
Applications of GLA University, Mathura, is an authentic record of our own
work carried out under the supervision of Mr. Pawan Verma, Assistant
Professor.
The contents of this project report, in full or in parts, have not been
submitted to any other Institute or University for the award of any degree.
Department of Computer Engineering and Applications
GLA University, 17 km. Stone NH#2, Mathura-Delhi Road,
Chaumuha, Mathura – 281406 U.P (India)
Certificate
This is to certify that the above statements made by the candidates are
correct to the best of my knowledge and belief.
_______________________
Supervisor
Mr. Pawan Verma
Assistant Professor
____________________ ____________________________
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the B.Tech. Machine
Learning project undertaken during our B.Tech. third year. This project is in itself an
acknowledgement of the inspiration, drive, and technical assistance contributed to it by
many individuals. It would never have seen the light of day without the help and
guidance that we have received.

Our heartiest thanks to Dr. (Prof.) Anand Singh Jalal, Head of Department, Department
of CEA, for providing us with an encouraging platform to develop this project, which
helped us in shaping our abilities towards a constructive goal.

We owe a special debt of gratitude to Mr. Pawan Verma, Assistant Professor, and Mr.
Neeraj Gupta, Assistant Professor, for their constant support and guidance throughout
the course of our work. Their sincerity, thoroughness, and perseverance have been a
constant source of inspiration for us. They have shared their extensive experience and
insightful comments at virtually all stages of the project, and have also taught us about
the latest industry-oriented technologies.

We would also not like to miss the opportunity to acknowledge the contribution of all
faculty members of the department for their kind guidance and cooperation during the
development of our project. Last but not least, we acknowledge our friends for their
contribution to the completion of the project.
Nandinee Gupta
Unnati Goyal
Saumya Gupta
Roshni Rawat
Abstract
Uber was founded just eleven years ago, yet it is already one of the
fastest-growing companies in the world. In Boston, UberX claims to
charge 30% less than taxis, a great way to get customers' attention.
Nowadays, we see applications of Machine Learning and Artificial
Intelligence in almost all domains, so we try to use the same for Uber
cab price prediction. In this project, we experiment with a real-world
dataset and explore how machine learning algorithms can be used to find
patterns in data. We mainly discuss the prediction of the prices of
different Uber cab types as generated by a machine learning algorithm.
Our problem belongs to the regression category of supervised learning.
We use different machine learning algorithms, for example, Linear
Regression, Decision Tree, Random Forest Regressor, and Gradient
Boosting Regressor, and finally choose the one that proves best for
price prediction. We must choose the algorithm that improves accuracy
and reduces overfitting. We gained a lot of experience while preparing
the Uber dataset of Boston for the year 2018. It was also very
interesting to learn how different factors affect the pricing of Uber cabs.
CONTENTS
Declaration
Certificate
Acknowledgement
Abstract
List of Figures
List of Tables
CHAPTER 1 Introduction
1.1 Overview and Motivation
1.2 Objective
1.3 Issues and Challenges
1.4 Contribution
1.5 Organization of the Project
CHAPTER 5 Conclusion
CHAPTER 6 References
List of Figures

List of Tables
INTRODUCTION
1.1 Overview and Motivation
Uber's founders developed the Uber app to help connect riders with local drivers.
The service was initially launched in San Francisco and eventually expanded to
Chicago in April 2012, proving to be a highly convenient alternative to taxis and
poorly funded public transportation systems. Uber has since expanded into
smaller communities and has become popular throughout the world. In December
2013, USA Today named Uber its tech company of the year.
In supervised learning, we have a training set and a test set. Each set consists of
examples made up of input and output vectors, and the goal of the supervised
learning algorithm is to infer a function that maps the input vector to the output
vector with minimal error. We applied machine learning algorithms to predict the
price in the Uber dataset of Boston. Several features will be selected from the 55
columns. Predictive analysis is a procedure that incorporates the use of
computational methods to determine important and useful patterns in large data.
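As a minimal sketch of this setup (the file name rideshare_kaggle.csv for the Kaggle
Boston dataset, the 'price' label column, and the 80/20 split are assumptions used only
for illustration):

    # Load the Boston Uber/Lyft data and split it into training and test sets.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("rideshare_kaggle.csv")   # file name is illustrative

    X = df.drop(columns=["price"])   # input vectors (features)
    y = df["price"]                  # output vector (the label to predict)

    # Hold out 20% of the examples as the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_test.shape)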
1.2 Objective
The objective of this project is to predict the price of different Uber cab types. Machine
learning models were compared and analyzed on the basis of accuracy, and then the
best performing model was suggested for further predictions of the label 'Price'.
1.3 Issues and Challenges
2. Strip plot and scatter diagram: One problem with strip plots is how to
display multiple points with the same value. With the jitter option, a small
amount of random noise is added to the vertical coordinate; with the stack
option, repeated values increment the vertical coordinate, which gives the
strip plot a histogram-like appearance.
A scatter plot does not show the relationship among more than two variables.
Also, it is unable to give the exact extent of correlation.
than the number of samples. Also, it sometimes runs endlessly and never
completes execution.
1.4 Contribution
Each team member took responsibility and participated willingly in the group. The work
was shared equally among the team members. First, the project work was divided: one
member worked on the exploratory data analysis part, two members worked on feature
engineering, and the remaining work of modeling and testing was divided equally
among all four members. The second part, i.e. the written work, was done in pairs: two
members worked on the report and the other two worked on the presentation.
LITERATURE REVIEW
While researching Uber, we studied what different researchers had done. They have
worked on the Uber dataset, but with a focus on different factors. The rise of Uber as a
global alternative to taxis has attracted a lot of interest recently. Our work on
predicting Uber's pricing strategy is still relatively new. In this research, "Uber Data
Analysis", we aim to shed light on Uber's prices. We predict the price of different
types of Uber cabs based on different factors. Some of the factors that we found in
other research are:
Abel Brodeur & Kerry Nield (2018) analyse the effect of rain on Uber rides in
New York City. After Uber entered the market in May 2011, passengers and fares
decreased for other modes such as taxi rides. Also, dynamic pricing makes Uber
drivers compete for rides when demand suddenly increases, i.e., during rainy hours.
When rain increases, the number of Uber rides increases by 22%, while the number
of taxi rides per hour increases by only 5%. Since the entrance of Uber, taxis do not
respond differently to increased demand in rainy hours than in non-rainy hours.
Surge pricing is an algorithmic technique that Uber uses when there is a demand-
supply imbalance. It occurs when riders' demand rises while drivers' availability
drops. During such times of rising demand for rides, fares tend to be high. Surge
pricing is essential in that it helps match drivers' efforts with the demand from
consumers. Junfeng Jiao (2018) investigated Uber's surge multiplier in Austin, Texas
and found that during times of high usage, Uber raises its prices to reflect this demand
via a surge multiplier. According to communications released by Uber (2015), this
pricing is meant to attract more drivers into service at certain times, while also
reducing demand on the part of riders. Chen & Sheldon (2016) note that, while some
research is mixed, in general surge pricing does appear to control both supply and
demand while keeping wait times consistently under 5 minutes.
Anna Baj-Rogowska (2017) analyses users' feedback from social networking sites
such as Facebook in the period between July 2016 and July 2017. Uber is one of the
most dynamically growing companies representing the so-called sharing economy.
Such feedback is also a basis for the ongoing evaluation of brand perception by the
community and can be helpful in developing a marketing strategy and activities that
will effectively improve the current rating and reduce possible losses. So, it can be
concluded that feedback should be an important instrument for improving the market
performance of Uber today.
Anderson (2014) concluded from surveying San Francisco drivers that driver behavior
and characteristics are likely to determine the overall vehicle miles traveled (VMT).
Full-time drivers are likely to increase overall VMT, while occasional drivers are
more likely to reduce overall VMT. We also analyzed research on the driving
behavior of drivers on the road. Drivers have been categorized by age and gender,
focusing on their driving reactions such as how they brake, speed, and handle the
steering. Regarding gender differences, male drivers practice higher-risk driving,
while female drivers lack precaution around obstacles and dangerous spots. More or
less, adult drivers who regularly drive vehicles can manage the vehicle quite well
compared with young drivers with less experience. In conclusion, a driver's driving
behavior is related to their age, gender, and driving experience.
Some papers compare the iconic yellow taxi with its modern competitor, Uber.
Vsevolod Salnikov, Renaud Lambiotte, Anastasios Noulas, and Cecilia Mascolo
(2014) identify situations when UberX, the cheapest version of the Uber taxi service,
tends to be more expensive than yellow taxis for the same journey. Their observations
show that it might be financially advantageous on average for travelers to choose
either Yellow Cabs or Uber depending on the duration of their journey. However, the
specific journey they are willing to take matters.
MACHINE LEARNING
From the above chart, it is clear that the Shared trip is the cheapest of all and the
Black SUV is the most expensive. UberX and UberPool have almost the same prices,
and Lux has a moderate price. There is no bar for Taxi, which reveals that the dataset
contains no price values for Taxi.
From the above chart, it is clear that there are some outliers in cloudy weather: some
data points have an anomalously high price above 80 while the others are below 60.
In this plot, we observe that prices are highest in cloudy-day weather, while in foggy
weather prices are lowest.
The above bar chart represents the value counts of the icon column, and from the
graph it is clear that cloudy weather has the most data, which suggests that cabs may
be opted for most often in cloudy weather.
The above bar chart represents the value counts of the UV-index column, and from
the graph it is clear that a UV index of 0 has the most data, which suggests that cabs
are opted for most often when the UV index is low.
● Preparing the proper input dataset, compatible with the machine learning
algorithm requirements.
● Improving the performance of machine learning models.
“According to a survey in Forbes, data scientists spend 80% of their time on data
preparation.”
RFE works by searching for a subset of features, starting with all features in the
training dataset and successively removing features until the desired number remains.
This is achieved by fitting the machine learning algorithm used in the core of the
model, ranking features by importance, discarding the least important features, and
re-fitting the model. This process is repeated until a specified number of features remains.
On applying RFE to our dataset with a Linear Regression model, we first divide our
dataset into independent (feature) and dependent (target) variables, then split it into
train and test sets, after which we obtained different accuracies for different numbers
of features (k values), as follows:
S. No.    Number of features (k)    Accuracy
1         56                        0.8054834220
2         40                        0.8050662132
3         25                        0.8055355151
4         15                        0.8050457819
From the above table, it is clear that 25 features give the highest accuracy compared
to all other k values, which means these 25 features are the best features given by
RFE. So, we only consider these 25 features for further work and eliminate the rest.
Our dataset is thus reduced from 56 features to 25 features.
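A minimal sketch of this feature-selection step, assuming X and y are the numeric
feature matrix and the 'price' label from the prepared dataset (variable names are
illustrative; scikit-learn's default regression score, R^2, stands in for the "accuracy"
reported in the table above):

    # Recursive Feature Elimination with a Linear Regression estimator.
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    for k in [56, 40, 25, 15]:
        rfe = RFE(estimator=LinearRegression(), n_features_to_select=k)
        rfe.fit(X_train, y_train)
        print(k, rfe.score(X_test, y_test))   # R^2 on the held-out test set

    # Names of the columns kept for k = 25.
    rfe25 = RFE(estimator=LinearRegression(), n_features_to_select=25).fit(X_train, y_train)
    selected_features = X_train.columns[rfe25.support_]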
4.3.5 Binning
Many times we use a method called data smoothing to make the data proper. During
this process, we define a range, also called a bin, and any data value within that range
is made to fit into the bin. This is called binning. Binning is used to smooth the data
or to handle noisy data.
So, after dropping useless features, some features are not in the same range, so to
bring all the features into the same range we apply binning and get our final dataset,
which is further used for modeling.
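As a small illustration of binning with pandas (the column name 'distance' and the bin
edges are examples only, not the exact ranges used in the project):

    # Binning a continuous column into a small number of ranges with pandas.
    import pandas as pd

    df = pd.DataFrame({"distance": [0.4, 1.2, 2.8, 3.5, 5.9, 7.1]})

    bins = [0, 2, 4, 8]     # bin edges (illustrative)
    labels = [0, 1, 2]      # integer code for each bin
    df["distance_bin"] = pd.cut(df["distance"], bins=bins, labels=labels)

    print(df)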
4.4 Modeling
The process of modeling means training a machine-learning algorithm to predict the
labels from the features, tuning it for the business needs, and validating it on holdout
data. When an algorithm is trained with data, it becomes a model. One important
aspect of all machine learning models is determining their accuracy. To determine
accuracy, one can train the model on the training portion of the dataset, then predict
the response values for the held-out data using that model and, from these predictions,
compute the accuracy of the model.
In this project, we use Scikit-Learn to rapidly implement a few models such as Linear
Regression, Decision Tree, Random Forest, and Gradient Boosting.
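A compact sketch of this step is shown below. It assumes the X_train, X_test, y_train,
and y_test variables from the earlier split, and leaves hyperparameters at scikit-learn
defaults, which is an assumption rather than the project's exact configuration:

    # Fit the four regressors discussed in this chapter and compare their R^2 scores.
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree": DecisionTreeRegressor(random_state=42),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f"{name}: train R^2 = {model.score(X_train, y_train):.3f}, "
              f"test R^2 = {model.score(X_test, y_test):.3f}")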
A decision tree model can be thought of as asking a series of questions of the data,
each question narrowing the possible values until the model is confident enough to
make a single prediction. The order of the questions as well as their content is
determined by the model. In addition, the questions asked are all in a True/False
form. In our project, we focus on decision tree regression only. It is used for
continuous output problems, meaning the output is not discrete. It observes the
features of an object and trains a model in the structure of a tree to predict data,
producing meaningful continuous output.
A random forest builds an ensemble of decision trees, where each tree uses a
different subset of features for selecting the best split. Gradient boosting differs
slightly from random forest in that it builds one tree at a time and combines the
results along the way. It can also give better performance than random forest. The
idea of gradient boosting originated in the observation by Leo Breiman that boosting
can be interpreted as an optimization algorithm on a suitable cost function. Gradient
boosting trains many models in a gradual, additive, and sequential manner.
The modeling is done in the following steps:-
● First, we split the dataset into a training set and a testing set.
● Then we train the model on the training set.
● And at last, we test the model on the testing set and evaluate how well our
model performs.
When k-fold cross-validation is used, the error estimation is averaged over all k trials
to get the total effectiveness of our model.
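A brief sketch of such k-fold cross-validation with scikit-learn (k = 5 is an
assumption; any of the models above could be passed in):

    # 5-fold cross-validation: the score is averaged over the k folds.
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(scores)           # one R^2 score per fold
    print(scores.mean())    # averaged over all k trials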
4.5 Testing
In machine learning, the main task is to model the data and predict the output using
various algorithms. But since there are so many algorithms, it is really difficult to
choose one for predicting the final data. So we need to compare our models and
choose the one with the highest accuracy.
Machine learning applications are not 100% accurate, and probably never will be.
This is one of the reasons why testers cannot ignore learning about machine learning.
The fundamental reason is that what these applications learn is limited by the data
used to build them. For example, if 99% of emails are not spam, then classifying
every email as not spam achieves 99% accuracy purely by chance. Therefore, you
need to check your model for algorithmic correctness; hence testing is required. The
test set is a portion of the dataset, held back from training, that is used to exercise the
possible combinations of inputs and to estimate how well the model was trained.
Based on the test set results, the model is fine-tuned.
Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared
Error (RMSE) are used to evaluate the accuracy of a regression model. They can be
implemented using sklearn's mean_absolute_error and mean_squared_error methods.
MAE is the mean of all absolute errors. MAE (which ranges from 0 to infinity; lower
is better) is much like RMSE, but instead of squaring the residuals and taking the
square root of the result, it simply averages the absolute values of the residuals. This
produces positive numbers only and is less reactive to large errors. MAE takes the
average of the error over every sample in the dataset and gives the output.
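A short sketch of computing these three metrics, assuming y_test and a fitted model
(for example, the random forest above) are available:

    # MAE, MSE, and RMSE on the held-out test set.
    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)          # RMSE is the square root of MSE

    print(f"MAE  = {mae:.3f}")
    print(f"MSE  = {mse:.3f}")
    print(f"RMSE = {rmse:.3f}")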
In our project, we perform testing on two models: Linear Regression and Random
Forest.
Linear Regression Model Testing:
We draw a scatter plot between predicted and actual test values and then compute
errors such as MSE, MAE, and RMSE. After that, we also draw a distribution plot of
the difference between actual and predicted values using the seaborn library. A
distplot, or distribution plot, represents the overall distribution of a continuous data
variable.
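A minimal plotting sketch of this check, assuming y_test and y_pred as in the metrics
sketch above (seaborn's histplot is used here in place of distplot, which is deprecated
in recent seaborn versions):

    # Predicted vs. actual scatter plot, plus the distribution of the residuals.
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(10, 4))

    plt.subplot(1, 2, 1)
    plt.scatter(y_test, y_pred, s=5)
    plt.xlabel("Actual price")
    plt.ylabel("Predicted price")

    plt.subplot(1, 2, 2)
    sns.histplot(y_test - y_pred, kde=True)   # residual distribution
    plt.xlabel("Actual - predicted")

    plt.tight_layout()
    plt.show()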
The prediction function takes user inputs such as the cab name, source, surge
multiplier, and icon (weather). As the dataset is trained on continuous values and not
on categorical values, these inputs are also passed in the same manner, i.e. as integers.
We created a short manual for users which gives instructions about the input, such as
what needs to be typed for a specific field and in which sequence.
We use the random forest model in our function to predict the price. First, we search
for all the rows which contain the input cab name and extract their row numbers. We
then create an array x whose length matches the feature set of the new dataset, with
all its values initially zero. After creating this blank array, we assign the input values
of source, surge multiplier, and icon to their respective indices. Following this, we
check whether the count of the matching rows is greater than zero. If the condition is
true, we assign the value 1 to the corresponding index of the x array and return the
price using the predict function of the trained random forest model.
In a way, it works like a hypothesis space because it gives an output for any input
from the input space.
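The sketch below illustrates the idea of such a helper function. It is not the project's
exact code: the column layout of the reduced dataset, the one-hot naming of the cab
columns, the integer encodings, and all variable names are assumptions.

    # Illustrative price-prediction helper built around a trained random forest.
    # 'features' is the list of columns of the reduced dataset (assumed layout:
    # one-hot cab-name columns plus encoded 'source', 'surge_multiplier', 'icon').
    import numpy as np

    def predict_price(cab_name, source, surge_multiplier, icon, model, features):
        x = np.zeros(len(features))                 # blank input row, all zeros

        # Encoded inputs placed at their respective indices.
        x[features.index("source")] = source
        x[features.index("surge_multiplier")] = surge_multiplier
        x[features.index("icon")] = icon

        # Set the one-hot column of the requested cab type, if it exists.
        cab_col = "name_" + cab_name
        if cab_col in features:
            x[features.index(cab_col)] = 1

        return model.predict(x.reshape(1, -1))[0]

    # Example call (hypothetical encodings):
    # price = predict_price("UberX", source=3, surge_multiplier=1.0, icon=2,
    #                       model=rf_model, features=list(X_train.columns))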
CONCLUSION
Before working on the features, we first need to know the insights of the data, which
we obtain through EDA. Apart from that, we visualize the data by drawing various
plots, from which we learn that we have no data for the taxi's price, as well as how
the prices of the other cab types vary across different types of weather. Other
value-count plots show the type and amount of data the dataset has. After this, we
convert all categorical values into a continuous data type and fill the missing price
values with the median of the other values. Then comes the most important part,
feature selection, which was done with the help of recursive feature elimination. With
the help of RFE, the top 25 features were selected. Among those 25 features there are
still some that we think are not that important for predicting the price, so we drop
them and are left with 8 important columns.
We applied four different models to the remaining dataset, among which Decision
Tree, Random Forest, and Gradient Boosting Regressor proved best, with over 96%
accuracy on the training data. This means the predictive power of all three of these
algorithms on this dataset with the chosen features is very high, but in the end we go
with random forest because it is less prone to overfitting, and we design a function
with the help of the same model to predict the price.
REFERENCES
● Abel Brodeur & Kerry Nield (2018) An empirical analysis of taxi, Lyft and Uber rides: Evidence from weather shocks in NYC
● Junfeng Jiao (2018) Investigating Uber price surges during a special event in Austin, TX
● Anna Baj-Rogowska (2017) Sentiment analysis of Facebook posts: The Uber case
● Anastasios Noulas, Cecilia Mascolo, Renaud Lambiotte, and Vsevolod Salnikov (2014) OpenStreetCab: Exploiting Taxi Mobility Patterns in New York City to Reduce Commuter Costs
● https://www.singlegrain.com/blog-posts/business/10-lessons-startups-can-learn-ubers-growth/
● https://www.kaggle.com/brllrb/uber-and-lyft-dataset-boston
● https://matplotlib.org/1.3.1/users/legend_guide.html
● https://www.sciencedirect.com/science/article/abs/pii/S0167268118301598
● https://sci-hub.se/https://www.sciencedirect.com/science/article/abs/pii/S2210539517301165
● https://ieeexplore.ieee.org/abstract/document/8260068
● https://www.researchgate.net/publication/305524879_Dynamic_Pricing_in_a_Labor_Market_Surge_Pricing_and_Flexible_Work_on_the_Uber_Platform
● https://arxiv.org/abs/1503.03021
● https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781789808452/1/ch01lvl1sec19/label-encoding
● https://github.com/Ankush123456-code/house_price_prediction_end_to_end_project/blob/main/model/python.ipynb
● https://github.com/ankita1112/House-Prices-Advanced-Regression/blob/master/Housing_Prediction_full.ipynb
● https://scikit-learn.org/stable/modules/multiclass.html
● https://www.codegrepper.com/code-examples/python/confusion+matrix+python
● https://topepo.github.io/caret/recursive-feature-elimination.html
● https://arxiv.org/pdf/1503.03021.pdf
● https://www.kaggle.com/punit0811/machine-learning-project-basic-linear-regression
● https://gdcoder.com/decision-tree-regressor-explained-in-depth/
● https://medium.com/towards-artificial-intelligence/machine-learning-algorithms-for-beginners-with-python-code-examples-ml-19c6afd60daa
● https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_algorithms_performance_metrics.htm
● https://blog.paperspace.com/implementing-gradient-boosting-regression-python/
● https://www.studytonight.com/post/what-is-mean-squared-error-mean-absolute-error-root-mean-squared-error-and-r-squared
● https://statisticsbyjim.com/regression/overfitting-regression-models/
● https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/striplot.htm
● https://www.toppr.com/ask/question/what-are-the-limitations-of-a-scatter-diagram/
● https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/
● https://digitalag.osu.edu/sites/digitag/files/imce/images/ag_sensing/Figure3.png
● https://www.researchgate.net/publication/325800934/figure/fig1/AS:638132596768768@1529154075317/The-main-procedure-of-the-recursive-feature-elimination-RFE-method.png
● https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/1200px-K-fold_cross_validation_EN.svg.png
● https://static.packt-cdn.com/products/9781789345070/graphics/108f2a01-3e31-4907-a1a5-4baf441c3eed.png