Uber Rides Data Analysis using Python
Last Updated : 19 Sep, 2024
In this article, we will use Python and its different libraries to analyze the Uber Rides Data.
Importing Libraries
The analysis will be done using the following libraries:
- Pandas: This library loads the data into a 2D data frame and provides many functions for common analysis tasks.
- Numpy: Numpy arrays are very fast and can perform large computations in a very short time.
- Matplotlib / Seaborn: These libraries are used to draw visualizations.
To import all these libraries, we can use the code below:
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Importing Dataset
After importing all the libraries, download the dataset file, UberDataset.csv.
Once downloaded, you can import the dataset using the pandas library.
Python
dataset = pd.read_csv("UberDataset.csv")
dataset.head()
Output :
To find the shape of the dataset, we can use dataset.shape:
Python
dataset.shape
Output :
(1156, 7)
To understand the data better, we need the null-value counts, the datatypes, and so on. For that we can use dataset.info():
Python
dataset.info()
Output :
Data Preprocessing
The PURPOSE column contains a lot of null values, so we will fill them with the placeholder string NOT. You can try a different placeholder too.
Python
# Assign the result back instead of calling fillna(inplace=True) on a column selection
dataset['PURPOSE'] = dataset['PURPOSE'].fillna("NOT")
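As a small illustration of this step, the sketch below counts the missing values per column and then fills them. The three-row frame `toy` is made up for the example; only the column names mirror the Uber dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "MILES": [5.1, 4.8, 7.2],
    "PURPOSE": ["Meeting", np.nan, np.nan],
})

# Number of missing entries in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Fill the gaps and assign the result back to the column
toy["PURPOSE"] = toy["PURPOSE"].fillna("NOT")
print(toy["PURPOSE"].tolist())
```

Checking `isnull().sum()` before and after the fill is a quick way to confirm that the placeholder reached every missing cell.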
Next, we convert START_DATE and END_DATE to the datetime format so that they can be used in the later analysis.
Python
# errors='coerce' turns values that cannot be parsed into NaT
dataset['START_DATE'] = pd.to_datetime(dataset['START_DATE'],
                                       errors='coerce')
dataset['END_DATE'] = pd.to_datetime(dataset['END_DATE'],
                                     errors='coerce')
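To see what errors='coerce' does, here is a minimal sketch on a few hand-written strings. The sample values and the explicit `format` argument are assumptions made for the example; the real file may use other or mixed date formats.

```python
import pandas as pd

# Hypothetical sample strings; the last one is deliberately not a date
raw = pd.Series(["01-01-2016 21:11", "01-02-2016 01:25", "Totals"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%m-%d-%Y %H:%M", errors="coerce")
print(parsed)

# Count how many values could not be parsed
print(parsed.isna().sum())
```

Counting the NaT values right after the conversion shows how many rows will later be lost when rows with nulls are dropped.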
Next, we split START_DATE into a date column and a time (hour) column, and then group the hour into four categories: Morning, Afternoon, Evening, and Night.
Python
dataset['date'] = pd.DatetimeIndex(dataset['START_DATE']).date
dataset['time'] = pd.DatetimeIndex(dataset['START_DATE']).hour
# Bin the hour into parts of the day. Intervals are right-closed,
# so hour 0 falls outside (0, 10] and becomes NaN.
dataset['day-night'] = pd.cut(x=dataset['time'],
                              bins=[0, 10, 15, 19, 24],
                              labels=['Morning', 'Afternoon', 'Evening', 'Night'])
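The binning rule is easier to see on a handful of hours. The sketch below applies the same bins and labels to a hand-picked list of hours and compares the default behaviour with `include_lowest=True`, which is shown here only as an option and is not part of the original code.

```python
import pandas as pd

hours = pd.Series([0, 5, 10, 11, 15, 16, 19, 20, 23])
bins = [0, 10, 15, 19, 24]
labels = ["Morning", "Afternoon", "Evening", "Night"]

# Default intervals: (0, 10], (10, 15], (15, 19], (19, 24]
default_cut = pd.cut(hours, bins=bins, labels=labels)

# include_lowest=True also places hour 0 in the first interval
inclusive_cut = pd.cut(hours, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"hour": hours,
                    "default": default_cut,
                    "inclusive": inclusive_cut}))
```

With the default settings, rides that start between midnight and 1 am get no label, so they disappear once rows with null values are dropped.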
Once the new columns are created, we can drop the rows with null values.
Python
dataset.dropna(inplace=True)
It is also important to drop duplicate rows from the dataset. To do that, use the code below.
Python
dataset.drop_duplicates(inplace=True)
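A minimal sketch of the two cleaning steps on an invented four-row frame (the place names and distances are hypothetical) shows how many rows each step removes:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one exact duplicate and one row with a missing value
toy = pd.DataFrame({
    "START": ["Cary", "Cary", "Apex", "Cary"],
    "MILES": [5.0, 5.0, np.nan, 8.4],
})

# Drop rows containing nulls, then drop exact duplicate rows
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(cleaned))
print(cleaned)
```

Comparing the row count before and after is a cheap sanity check that the cleaning did not remove more data than expected.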
Data Visualization
In this section, we will try to understand and compare all columns.
Let's start by checking the number of unique values in each column that has the object datatype.
Python
obj = (dataset.dtypes == 'object')
object_cols = list(obj[obj].index)
unique_values = {}
for col in object_cols:
    unique_values[col] = dataset[col].unique().size
unique_values
Output :
{'CATEGORY': 2, 'START': 108, 'STOP': 112, 'PURPOSE': 7, 'date': 113}
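The same counts can be obtained without an explicit loop. The following sketch uses `nunique()` on an invented three-row frame; the column list is written out by hand here rather than derived from the dtypes.

```python
import pandas as pd

# Hypothetical toy frame with two text columns and one numeric column
toy = pd.DataFrame({
    "CATEGORY": ["Business", "Personal", "Business"],
    "PURPOSE": ["Meeting", "NOT", "Meeting"],
    "MILES": [5.1, 2.0, 7.3],
})

# nunique() counts the distinct values in each selected column
unique_counts = toy[["CATEGORY", "PURPOSE"]].nunique().to_dict()
print(unique_counts)
```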
Now we will use matplotlib and seaborn to draw count plots of the CATEGORY and PURPOSE columns.
Python
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=dataset['CATEGORY'])
plt.xticks(rotation=90)
plt.subplot(1,2,2)
sns.countplot(x=dataset['PURPOSE'])
plt.xticks(rotation=90)
Output :
Let's do the same for the time of day, using the day-night column that we created above.
Python
sns.countplot(x=dataset['day-night'])
plt.xticks(rotation=90)
Output :
Now, we will be comparing the two different categories along with the PURPOSE of the user.
Python
plt.figure(figsize=(15, 5))
sns.countplot(data=dataset, x='PURPOSE', hue='CATEGORY')
plt.xticks(rotation=90)
plt.show()
Output :
Insights from the above count plots:
- Most of the rides are booked for business purposes.
- Most people book cabs for the Meeting and Meal / Entertain purposes.
- Most of the cabs are booked in the Afternoon bin, which covers hours 11 through 15 (11 am to just before 4 pm) given the bins defined earlier.
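The numbers behind a count plot with a hue can also be read as a table. The sketch below builds such a table with `pd.crosstab` on a made-up four-row frame; the values are hypothetical and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical rides; values are invented for illustration
toy = pd.DataFrame({
    "CATEGORY": ["Business", "Business", "Personal", "Business"],
    "PURPOSE": ["Meeting", "Meal/Entertain", "NOT", "Meeting"],
})

# Rows are purposes, columns are categories, cells are ride counts
counts = pd.crosstab(toy["PURPOSE"], toy["CATEGORY"])
print(counts)
```

A table like this makes it possible to quote exact counts alongside the visual comparison.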
The CATEGORY and PURPOSE columns are clearly important, so we will now use OneHotEncoder to encode them as numeric indicator columns.
Python
from sklearn.preprocessing import OneHotEncoder
object_cols = ['CATEGORY', 'PURPOSE']
# scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`
OH_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
OH_cols = pd.DataFrame(OH_encoder.fit_transform(dataset[object_cols]))
OH_cols.index = dataset.index
OH_cols.columns = OH_encoder.get_feature_names_out()
df_final = dataset.drop(object_cols, axis=1)
dataset = pd.concat([df_final, OH_cols], axis=1)
# This code is modified by Susobhan Akhuli
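To make the effect of one-hot encoding concrete, here is a small sketch that uses pandas' `get_dummies` instead of OneHotEncoder. It is a swapped-in alternative chosen for brevity, not the method used above, and the toy frame is invented.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    "CATEGORY": ["Business", "Personal", "Business"],
    "MILES": [5.1, 2.0, 7.3],
})

# Each distinct category becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=["CATEGORY"], dtype=float)
print(encoded)
```

Each row has exactly one indicator set to 1 per encoded column group, which is why the indicators of a two-valued column mirror each other.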
After that, we can find the correlation between the columns using a heatmap.
Python
# Select only numerical columns for correlation calculation
numeric_dataset = dataset.select_dtypes(include=['number'])
sns.heatmap(numeric_dataset.corr(),
            cmap='BrBG',
            fmt='.2f',
            linewidths=2,
            annot=True)
# This code is modified by Susobhan Akhuli
Output :

heatmap
Insights from the heatmap:
- The Business and Personal categories are highly negatively correlated, which was already evident earlier, so this plot supports the conclusions above.
- There is not much correlation between the features.
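The strong negative correlation between the two category indicators follows from how they are built: when every ride is either Business or Personal, one indicator is always 1 minus the other. A minimal sketch with invented values:

```python
import pandas as pd

# Hypothetical indicator column: 1.0 marks a Business ride
business = pd.Series([1.0, 1.0, 0.0, 1.0, 0.0])

# With only two categories, Personal is the exact complement
personal = 1.0 - business

# Pearson correlation between the two complementary indicators
r = business.corr(personal)
print(round(r, 6))
```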
Now we need to visualize the data by month. This can be done in the same way as before (for hours).
Python
dataset['MONTH'] = pd.DatetimeIndex(dataset['START_DATE']).month
month_label = {1.0: 'Jan', 2.0: 'Feb', 3.0: 'Mar', 4.0: 'April',
               5.0: 'May', 6.0: 'June', 7.0: 'July', 8.0: 'Aug',
               9.0: 'Sep', 10.0: 'Oct', 11.0: 'Nov', 12.0: 'Dec'}
dataset["MONTH"] = dataset.MONTH.map(month_label)
mon = dataset.MONTH.value_counts(sort=False)
# "MONTHS" holds the ride count per month;
# "VALUE COUNT" holds the longest ride (max MILES) per month
df = pd.DataFrame({"MONTHS": mon.values,
                   "VALUE COUNT": dataset.groupby('MONTH',
                                                  sort=False)['MILES'].max()})
p = sns.lineplot(data=df)
p.set(xlabel="MONTHS", ylabel="VALUE COUNT")
Output :

lineplot
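The two series in the line plot, rides per month and the longest ride per month, can also be computed in a single grouped aggregation. This is a sketch on four invented rides; the column names `rides` and `max_miles` are chosen here for the example.

```python
import pandas as pd

# Hypothetical rides; dates and distances are invented
toy = pd.DataFrame({
    "START_DATE": pd.to_datetime(["2016-01-05", "2016-01-20",
                                  "2016-02-11", "2016-12-24"]),
    "MILES": [5.1, 12.0, 3.3, 40.2],
})
toy["MONTH"] = toy["START_DATE"].dt.month

# One row per month: number of rides and the longest ride in miles
monthly = toy.groupby("MONTH")["MILES"].agg(rides="count", max_miles="max")
print(monthly)
```

Grouping by the numeric month keeps the rows in calendar order, and computing both columns from the same groupby guarantees that they are aligned month by month.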
Insights from the above plot:
- The counts are very irregular.
- Still, it is clear that the counts are much lower during Nov, Dec, and Jan, which coincides with winter in Florida, US.
Visualization for days data.
Python
dataset['DAY'] = dataset.START_DATE.dt.weekday
day_label = {
    0: 'Mon', 1: 'Tues', 2: 'Wed', 3: 'Thurs', 4: 'Fri', 5: 'Sat', 6: 'Sun'
}
dataset['DAY'] = dataset['DAY'].map(day_label)
Python
day_label = dataset.DAY.value_counts()
sns.barplot(x=day_label.index, y=day_label);
plt.xlabel('DAY')
plt.ylabel('COUNT')
Output :

barplot
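`value_counts()` sorts by frequency, so the bars do not appear in calendar order. The sketch below is one way to count rides per weekday and keep the Monday-to-Sunday order; the four dates are invented and `day_name()` is used instead of a hand-written label dictionary.

```python
import pandas as pd

# Hypothetical ride dates; 2016-01-04 was a Monday
dates = pd.Series(pd.to_datetime(["2016-01-04", "2016-01-05",
                                  "2016-01-05", "2016-01-10"]))

order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# Count rides per weekday, then reorder; missing days get a count of 0
day_counts = dates.dt.day_name().value_counts().reindex(order, fill_value=0)
print(day_counts)
```

The resulting series can be passed to a bar plot in the same way as the counts above.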
Now, let's explore the MILES column.
We can use a boxplot to check the distribution of the column.
Python
sns.boxplot(dataset['MILES'])
Output :

boxplot(dataset[‘MILES’])
This graph is hard to read, so let's zoom in on values less than 100.
Python
sns.boxplot(dataset[dataset['MILES']<100]['MILES'])
Output :

boxplot(dataset[dataset[‘MILES’]<100][‘MILES’])
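It's a bit clearer now. To get more detail, we can plot the distribution of values less than 40. Since sns.distplot is deprecated in recent seaborn releases, histplot with a KDE overlay is used here.
Python
sns.histplot(dataset[dataset['MILES']<40]['MILES'], kde=True)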
Output :

distplot
Insights from the above plots:
- Most of the cabs are booked for a distance of 4-5 miles.
- People mostly choose cabs for distances of 0-20 miles.
- For distances of more than 20 miles, the number of cab rides is nearly negligible.
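Observations like these can be backed with numbers by bucketing the distances and computing the share of rides in each bucket. This is a sketch on eight invented distances; the bucket edges are an assumption made for the example.

```python
import pandas as pd

# Hypothetical ride distances in miles
miles = pd.Series([0.5, 4.2, 4.9, 7.7, 12.0, 19.9, 25.0, 63.7])

# Buckets: (0, 5], (5, 20], and everything above 20
buckets = pd.cut(miles, bins=[0, 5, 20, float("inf")],
                 labels=["0-5", "5-20", "20+"])

# Fraction of rides in each bucket, kept in bucket order
share = buckets.value_counts(normalize=True, sort=False)
print(share)
```

Applied to the real MILES column, the same three lines would give the fraction of rides under 20 miles directly.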