AirBnB Booking Analysis using EDA

Last Updated : 23 Jul, 2025

Exploratory Data Analysis (EDA) is a crucial step in understanding and analyzing any dataset. In this article, we will perform EDA on an AirBnB booking dataset to reveal insights and trends. The process includes cleaning the data, visualizing it, and identifying key patterns.

What is AirBnB Platform?

AirBnB is a leading online marketplace that connects people looking to rent out their homes with those seeking accommodations. Founded in 2008, the platform has grown exponentially, offering over 7 million listings in more than 220 countries and regions worldwide. AirBnB provides a wide range of lodging options, from single rooms and apartments to entire houses and unique properties like treehouses and castles. This diversity allows travelers to find accommodations that suit their preferences and budgets, while hosts can monetize their unused spaces.

Importance of Analyzing Booking Data

Analyzing booking data on AirBnB is crucial for several reasons:

  1. Understanding Market Trends: By examining booking patterns, one can identify peak seasons, popular destinations, and emerging travel trends. This information is invaluable for hosts to optimize their listings and pricing strategies.
  2. Enhancing Guest Experience: Insights from booking data help identify what guests value most, enabling hosts to tailor their offerings to meet guest expectations. For instance, understanding the demand for certain amenities or property types can lead to more targeted and effective listings.
  3. Improving Host Performance: Analysis of booking data can reveal key factors that contribute to higher occupancy rates and better reviews. Hosts can use this information to improve their property management practices and increase their revenue.
  4. Strategic Decision-Making: For AirBnB as a platform, booking data analysis is essential for strategic planning. It helps in understanding the competitive landscape, assessing the effectiveness of marketing campaigns, and making informed decisions about platform enhancements.
  5. Enhancing Safety and Compliance: By analyzing booking data, AirBnB can detect unusual patterns that may indicate fraudulent activity or violations of local regulations. This proactive approach ensures a safer and more reliable platform for both hosts and guests.

Exploring the AirBnB Dataset in Python

Dataset Link - AirBnB

Step 1: Importing Necessary Libraries and Loading the AirBnB Dataset

This script imports essential data manipulation and visualization libraries, loads an Airbnb dataset from a CSV file, and displays the first few rows of the dataset. This initial step helps in understanding the structure and content of the data before performing further analysis or visualization.

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "/content/Airbnb_Open_Data.csv"
df = pd.read_csv(file_path)
print(df.head())

Output:

        id                                              NAME  ...                                        house_rules license
0  1001254                Clean & quiet apt home by the park  ...  Clean up and treat the home the way you'd like...     NaN
1  1002102                             Skylit Midtown Castle  ...  Pet friendly but please confirm with me if the...     NaN
2  1002403               THE VILLAGE OF HARLEM....NEW YORK !  ...  I encourage you to use my kitchen, cooking and...     NaN
3  1002755                                               NaN  ...                                                NaN     NaN
4  1003689  Entire Apt: Spacious Studio/Loft by central park  ...  Please no smoking in the house, porch or on th...     NaN

[5 rows x 26 columns]

Step 2: Check the column names in the Dataset

Python
df.columns

Output:

Index(['id', 'NAME', 'host id', 'host_identity_verified', 'host name',
'neighbourhood group', 'neighbourhood', 'lat', 'long', 'country',
'country code', 'instant_bookable', 'cancellation_policy', 'room type',
'Construction year', 'price', 'service fee', 'minimum nights',
'number of reviews', 'last review', 'reviews per month',
'review rate number', 'calculated host listings count',
'availability 365', 'house_rules', 'license'],
dtype='object')
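The column names mix upper case, spaces, and underscores ('NAME', 'host id', 'house_rules'). One optional way to make them uniform is sketched below on an empty toy frame. The helper name normalize_columns is hypothetical, and the rest of this article keeps the original column names, so apply this only if you also update the later steps.

```python
import pandas as pd

# Toy frame that reuses a few of the column names listed above (no real data).
sample = pd.DataFrame(columns=['NAME', 'host id', 'Construction year', 'house_rules'])


def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case, underscore-separated column names.

    Hypothetical helper for illustration; not part of the original walkthrough.
    """
    renamed = frame.copy()
    renamed.columns = (
        renamed.columns.str.strip()
        .str.lower()
        .str.replace(' ', '_', regex=False)
    )
    return renamed


normalized = normalize_columns(sample)
print(list(normalized.columns))
```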

Step 3: Check for Missing Values

Python
print(df.isnull().sum())

Output:

id                                     0
NAME                                 250
host id                                0
host_identity_verified               289
host name                            406
neighbourhood group                   29
neighbourhood                         16
lat                                    8
long                                   8
country                              532
country code                         131
instant_bookable                     105
cancellation_policy                   76
room type                              0
Construction year                    214
price                                247
service fee                          273
minimum nights                       409
number of reviews                    183
last review                        15893
reviews per month                  15879
review rate number                   326
calculated host listings count       319
availability 365                     448
house_rules                        52131
license                           102597
dtype: int64
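Raw counts are easier to judge as a share of all rows. The sketch below computes the percentage of missing values per column on a small made-up frame; the values in toy are invented for illustration, and the same expression could be applied to df.

```python
import pandas as pd

# Made-up rows; only the pattern of missing values matters here.
toy = pd.DataFrame({
    'price': ['$966', None, '$620', '$368'],
    'license': [None, None, None, 'example-license'],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_share = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_share)
```

Columns that are almost entirely empty, such as 'license' in the output above, are candidates for being dropped rather than filled.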

Step 4: Handle Missing Values

This code ensures that the 'last review' column is properly formatted as datetime, missing values in key columns are appropriately handled, and incomplete records are removed, preparing the dataset for further analysis or visualization.

Python
# Convert 'last review' to datetime and handle errors
df['last review'] = pd.to_datetime(df['last review'], errors='coerce')

# Fill missing values
df.fillna({'reviews per month': 0, 'last review': df['last review'].min()}, inplace=True)

# Drop records with missing 'NAME' or 'host name'
df.dropna(subset=['NAME', 'host name'], inplace=True)
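Filling missing review dates with the earliest date makes those rows indistinguishable from listings that really were last reviewed on that date. A minimal sketch of one way to keep track of them, assuming a flag column named had_review (a hypothetical name, not part of the dataset), is shown on a toy frame:

```python
import pandas as pd

# Made-up rows: one valid date, one unparseable string, one missing value.
toy = pd.DataFrame({
    'last review': ['10/19/2021', 'not a date', None],
    'reviews per month': [0.21, None, None],
})

# errors='coerce' turns anything that cannot be parsed into NaT.
toy['last review'] = pd.to_datetime(toy['last review'], errors='coerce')

# Record which rows had a real date before the placeholder is filled in.
toy['had_review'] = toy['last review'].notna()

toy = toy.fillna({
    'reviews per month': 0,
    'last review': toy['last review'].min(),
})
print(toy)
```

Later steps can then filter on had_review when the placeholder dates would distort a result.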

Step 5: Correct Data Types

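Ensure that columns have the correct data types. Here, the 'price' and 'service fee' columns are stored as currency-formatted strings and are converted to floats.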

Python
# Remove dollar signs and thousands separators, then convert to float
# (raw strings avoid Python's invalid-escape warning for '\$')
df['price'] = df['price'].replace(r'[\$,]', '', regex=True).astype(float)
df['service fee'] = df['service fee'].replace(r'[\$,]', '', regex=True).astype(float)
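To see what the conversion does, here is a small sketch on made-up price strings. It uses pd.to_numeric with errors='coerce' instead of astype(float), which is a swapped-in variant that turns unparseable entries into NaN rather than raising an error.

```python
import pandas as pd

# Made-up currency strings, including a missing value.
raw_prices = pd.Series(['$966', '$1,060', None, '$142'])

# Strip '$' and ',' and then parse; missing values stay missing.
cleaned = pd.to_numeric(
    raw_prices.str.replace(r'[\$,]', '', regex=True),
    errors='coerce',
)
print(cleaned.tolist())
```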

Step 6: Remove Duplicates

Check for and remove any duplicate records.

Python
df.drop_duplicates(inplace=True)
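It can be useful to count the duplicates before removing them, so you know how many rows the step discards. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up rows; the last row repeats the second one exactly.
toy = pd.DataFrame({
    'id': [1001254, 1002102, 1002102],
    'NAME': ['Listing A', 'Listing B', 'Listing B'],
})

# duplicated() marks every repeat of an earlier, fully identical row.
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(f"{n_duplicates} duplicate row(s) removed, {len(deduplicated)} rows remain")
```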

Step 7: Confirm Data Cleaning

Verify that the data cleaning steps were successful.

Python
print(df.info())

Output:

Index: 101410 entries, 0 to 102057
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   id                              101410 non-null  int64
 1   NAME                            101410 non-null  object
 2   host id                         101410 non-null  int64
 3   host_identity_verified          101134 non-null  object
 4   host name                       101410 non-null  object
 5   neighbourhood group             101384 non-null  object
 6   neighbourhood                   101394 non-null  object
 7   lat                             101402 non-null  float64
 8   long                            101402 non-null  float64
 9   country                         100884 non-null  object
 10  country code                    101288 non-null  object
 11  instant_bookable                101314 non-null  object
 12  cancellation_policy             101340 non-null  object
 13  room type                       101410 non-null  object
 14  Construction year               101210 non-null  float64
 15  price                           101171 non-null  float64
 16  service fee                     101142 non-null  float64
 17  minimum nights                  101016 non-null  float64
 18  number of reviews               101228 non-null  float64
 19  last review                     101410 non-null  datetime64[ns]
 20  reviews per month               101410 non-null  float64
 21  review rate number              101103 non-null  float64
 22  calculated host listings count  101092 non-null  float64
 23  availability 365                100990 non-null  float64
 24  house_rules                     49831 non-null   object
 25  license                         2 non-null       object
dtypes: datetime64[ns](1), float64(11), int64(2), object(12)
memory usage: 20.9+ MB
None

Now, we can explore the dataset to uncover patterns and insights.

Step 8: Descriptive Statistics

The df.describe() method in pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. It is useful for understanding the basic statistical properties of the data.

Python
print(df.describe())

Output:

                 id       host id            lat  ...  review rate number  calculated host listings count  availability 365
count  1.014100e+05  1.014100e+05  101402.000000  ...       101103.000000                   101092.000000     100990.000000
mean   2.920959e+07  4.926155e+10      40.728082  ...            3.278558                        7.948463        141.164660
min    1.001254e+06  1.236005e+08      40.499790  ...            1.000000                        1.000000        -10.000000
25%    1.507574e+07  2.459183e+10      40.688730  ...            2.000000                        1.000000          3.000000
50%    2.922911e+07  4.912069e+10      40.722300  ...            3.000000                        1.000000         96.000000
75%    4.328308e+07  7.399747e+10      40.762750  ...            4.000000                        2.000000        269.000000
max    5.736742e+07  9.876313e+10      40.916970  ...            5.000000                      332.000000       3677.000000
std    1.626820e+07  2.853703e+10       0.055850  ...            1.285369                       32.328974        135.419199

[8 rows x 14 columns]
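The summary shows 'availability 365' with a minimum of -10 and a maximum of 3677. Assuming this column counts available days in a year (an interpretation based on its name, not confirmed by the dataset), values outside 0 to 365 are suspect. The sketch below flags such values on a made-up frame; whether to drop, clip, or keep them is a judgment call.

```python
import pandas as pd

# Made-up values that include the extremes seen in the summary above.
toy = pd.DataFrame({'availability 365': [-10, 96, 269, 3677, None]})

# between() is inclusive on both ends and returns False for missing values.
in_range = toy['availability 365'].between(0, 365)

# Count values that are present but fall outside the assumed valid range.
n_out_of_range = int((~in_range & toy['availability 365'].notna()).sum())
filtered = toy[in_range]

print(f"{n_out_of_range} out-of-range value(s)")
print(filtered)
```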

Step 9: Visualization

Distribution of Prices

Plot the distribution of listing prices.

Python
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], bins=50, kde=True)
plt.title('Distribution of Listing Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.show()

Output:

Figure: Distribution of Prices

The histogram shows a fairly even distribution of listing prices across different price ranges, indicating no particular concentration of listings in any specific price range. The KDE line helps visualize this even spread more clearly, confirming that the dataset contains listings with a wide variety of prices.

Room Type Analysis

Analyze the distribution of different room types.

Python
plt.figure(figsize=(8, 5))
sns.countplot(x='room type', data=df, color='hotpink')
plt.title('Room Type Distribution')
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()

Output:

Figure: Distribution of Room Type

The count plot shows a clear distribution of the different room types available in the Airbnb dataset. The majority of listings are for 'Entire home/apt' and 'Private room', with 'Shared room' and 'Hotel room' being much less common. This insight can be useful for understanding the availability and popularity of different types of accommodations on Airbnb.
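The same comparison can be expressed as percentages, which are easier to quote than bar heights. A minimal sketch on made-up rows (the real shares will differ):

```python
import pandas as pd

# Made-up room types for illustration only.
toy = pd.DataFrame({
    'room type': ['Entire home/apt', 'Private room', 'Entire home/apt', 'Shared room'],
})

# normalize=True returns fractions instead of counts.
shares = toy['room type'].value_counts(normalize=True) * 100
print(shares.round(1))
```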

Neighborhood Analysis

Examine how listings are distributed across different neighborhoods.

Python
plt.figure(figsize=(12, 8))
sns.countplot(y='neighbourhood group', data=df, color='lightgreen',
              order=df['neighbourhood group'].value_counts().index)
plt.title('Number of Listings by Neighborhood Group')
plt.xlabel('Count')
plt.ylabel('Neighborhood Group')
plt.show()

Output:

Figure: Neighborhood Analysis

The count plot shows a clear distribution of the number of listings across different neighborhood groups. Manhattan and Brooklyn dominate the listings, suggesting they are prime locations for Airbnb. Queens, Bronx, and Staten Island have fewer listings, indicating less availability or popularity.

Note: Any misspelled neighbourhood group labels appear in the plot as separate, very small categories. Such typographical errors should be corrected during data cleaning so that each neighbourhood group is counted once. With clean labels, the plot gives a reliable picture of how Airbnb listings are distributed across neighbourhood groups.
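One way to fold misspelled labels into their correct category is a replacement mapping. The misspellings in the sketch below are illustrative assumptions; inspect df['neighbourhood group'].value_counts() first and build the mapping from what you actually find.

```python
import pandas as pd

# Made-up labels; the misspelled entries are assumed examples.
labels = pd.Series(
    ['Brooklyn', 'brookln', 'Manhattan', 'manhatan', 'Queens'],
    name='neighbourhood group',
)

# Hypothetical mapping from misspelled label to corrected label.
corrections = {'brookln': 'Brooklyn', 'manhatan': 'Manhattan'}

# replace() with a dict swaps exact matches and leaves other values alone.
cleaned = labels.replace(corrections)
counts = cleaned.value_counts()
print(counts)
```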

Price vs. Room Type

Visualize the relationship between price and room type.

Python
plt.figure(figsize=(10, 6))
sns.boxplot(x='room type', y='price', hue='room type', data=df, palette='Set2')
plt.title('Price vs. Room Type')
plt.xlabel('Room Type')
plt.ylabel('Price ($)')
plt.legend(title='Room Type')
plt.show()

Output:

Figure: Price vs. Room Type

The box plot provides a detailed view of how prices vary across different room types in the Airbnb dataset. It shows that while 'Shared room' tends to have lower prices, 'Private room', 'Entire home/apt', and 'Hotel room' have higher and more varied price ranges. This visualization helps in understanding the pricing dynamics for different types of accommodations on Airbnb.
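A box plot is easier to interpret alongside the numbers behind it. The sketch below computes per-room-type price statistics on made-up rows; the prices are invented and do not reflect the dataset.

```python
import pandas as pd

# Made-up prices for illustration only.
toy = pd.DataFrame({
    'room type': ['Private room', 'Private room', 'Shared room',
                  'Entire home/apt', 'Entire home/apt'],
    'price': [100.0, 300.0, 80.0, 250.0, 450.0],
})

# One row per room type with count, median, and range of prices.
summary = toy.groupby('room type')['price'].agg(['count', 'median', 'min', 'max'])
print(summary)
```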

Reviews Over Time

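Plot review activity over time. The dataset records only each listing's most recent review date, so the plot counts listings by the month of their last review.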

Python
df['last review'] = pd.to_datetime(df['last review'])
reviews_over_time = df.groupby(df['last review'].dt.to_period('M')).size()

plt.figure(figsize=(12, 6))
reviews_over_time.plot(kind='line', color='red')
plt.title('Number of Reviews Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Reviews')
plt.show()

Output:

Figure: Number of Reviews Over Time

The line plot shows how many listings received their most recent review in each month. It helps identify periods of high or low review activity, which can be useful for understanding user engagement and the popularity of Airbnb listings over time. Note that Step 4 filled missing review dates with the earliest date in the column, so the first month on the plot is inflated by those placeholder values. Other significant spikes and drops might be worth further investigation to understand the underlying causes, such as changes in Airbnb policies, market conditions, or external events.
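Because missing review dates were filled with the earliest date in Step 4, the earliest month absorbs every listing that had no review date. A minimal sketch of the monthly count that leaves such rows out, shown on made-up dates where the missing value is still NaT:

```python
import pandas as pd

# Made-up dates; the last listing has no recorded review.
toy = pd.DataFrame({
    'last review': pd.to_datetime(['2019-06-03', '2019-06-21', '2019-07-05', None]),
})

# Keep only listings with a real review date, then count per calendar month.
dated = toy.dropna(subset=['last review'])
monthly = dated.groupby(dated['last review'].dt.to_period('M')).size()
print(monthly)
```

On the full dataset, this requires filtering before the fill in Step 4 or keeping a flag for rows whose date was filled in.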

Key Insights From Exploratory Data Analysis of AirBnB Dataset

The key insights derived from the exploratory data analysis are discussed below:

  1. Pricing Distribution:
    • Listing prices are spread fairly evenly across the observed price range, with no strong concentration in any one band.
    • The dataset therefore contains listings at a wide variety of price levels.
  2. Room Type Distribution:
    • The majority of listings are either entire homes/apartments or private rooms.
    • Shared rooms and hotel rooms constitute a very small portion of the listings.
  3. Geographical Distribution:
    • Listings are predominantly concentrated in popular areas like Brooklyn and Manhattan.
    • Other boroughs such as Queens, Bronx, and Staten Island have fewer listings.
  4. Price Comparison by Room Type:
    • Entire homes/apartments generally cost more than private rooms.
    • Shared rooms tend to have the lowest prices among the room types.
  5. Seasonal Trends in Reviews:
    • There are observable seasonal trends in the number of reviews.
    • Certain months experience higher review activity.

Conclusion

EDA helps us understand the main trends and patterns in the AirBnB dataset. We found that listing prices span a wide range and that popular areas have the highest concentration of listings. Entire homes/apartments are typically more expensive than private rooms. Additionally, review patterns show seasonal variations. These insights can guide both hosts and guests in making informed decisions.
