AirBnB Booking Analysis using EDA

Exploratory Data Analysis (EDA) is a crucial step in understanding any dataset. In this article, we perform EDA on an AirBnB booking dataset to reveal insights and trends. The process includes cleaning the data, visualizing it, and identifying key patterns.

What is the AirBnB Platform?

AirBnB is a leading online marketplace that connects people looking to rent out their homes with those seeking accommodations. Founded in 2008, the platform has grown exponentially, offering over 7 million listings in more than 220 countries and regions worldwide. AirBnB provides a wide range of lodging options, from single rooms and apartments to entire houses and unique properties like treehouses and castles. This diversity allows travelers to find accommodations that suit their preferences and budgets, while hosts can monetize their unused spaces.

Importance of Analyzing Booking Data

Analyzing booking data on AirBnB is crucial for several reasons:

- Understanding Market Trends: By examining booking patterns, one can identify peak seasons, popular destinations, and emerging travel trends. This information is invaluable for hosts to optimize their listings and pricing strategies.
- Enhancing Guest Experience: Insights from booking data help identify what guests value most, enabling hosts to tailor their offerings to meet guest expectations. For instance, understanding the demand for certain amenities or property types can lead to more targeted and effective listings.
- Improving Host Performance: Analysis of booking data can reveal key factors that contribute to higher occupancy rates and better reviews. Hosts can use this information to improve their property management practices and increase their revenue.
- Strategic Decision-Making: For AirBnB as a platform, booking data analysis is essential for strategic planning. It helps in understanding the competitive landscape, assessing the effectiveness of marketing campaigns, and making informed decisions about platform enhancements.
- Enhancing Safety and Compliance: By analyzing booking data, AirBnB can detect unusual patterns that may indicate fraudulent activity or violations of local regulations. This proactive approach ensures a safer and more reliable platform for both hosts and guests.

Exploring the AirBnB Dataset in Python

Dataset Link - AirBnB

Step 1: Importing Necessary Libraries and Loading the AirBnB Dataset

This script imports essential data manipulation and visualization libraries, loads an Airbnb dataset from a CSV file, and displays the first few rows of the dataset. This initial step helps in understanding the structure and content of the data before performing further analysis or visualization.

Python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "/content/Airbnb_Open_Data.csv"
df = pd.read_csv(file_path)
print(df.head())

Output:

        id                                              NAME  ...                                        house_rules license
0  1001254                Clean & quiet apt home by the park  ...  Clean up and treat the home the way you'd like...     NaN
1  1002102                             Skylit Midtown Castle  ...  Pet friendly but please confirm with me if the...     NaN
2  1002403               THE VILLAGE OF HARLEM....NEW YORK !  ...  I encourage you to use my kitchen, cooking and...     NaN
3  1002755                                               NaN  ...                                                NaN     NaN
4  1003689  Entire Apt: Spacious Studio/Loft by central park  ...  Please no smoking in the house, porch or on th...     NaN

[5 rows x 26 columns]

Step 2: Check the column names in the Dataset

Python

df.columns

Output:

Index(['id', 'NAME', 'host id', 'host_identity_verified', 'host name',
       'neighbourhood group', 'neighbourhood', 'lat', 'long', 'country',
       'country code', 'instant_bookable', 'cancellation_policy', 'room type',
       'Construction year', 'price', 'service fee', 'minimum nights',
       'number of reviews', 'last review', 'reviews per month',
       'review rate number', 'calculated host listings count',
       'availability 365', 'house_rules', 'license'],
      dtype='object')

Step 3: Check for Missing Values

Python

print(df.isnull().sum())

Output:

id                                     0
NAME                                 250
host id                                0
host_identity_verified               289
host name                            406
neighbourhood group                   29
neighbourhood                         16
lat                                    8
long                                   8
country                              532
country code                         131
instant_bookable                     105
cancellation_policy                   76
room type                              0
Construction year                    214
price                                247
service fee                          273
minimum nights                       409
number of reviews                    183
last review                        15893
reviews per month                  15879
review rate number                   326
calculated host listings count       319
availability 365                     448
house_rules                        52131
license                           102597
dtype: int64
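
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a summary; the helper name missing_summary and the small sample frame are illustrative assumptions, not part of the original analysis. In the article's workflow you would pass df instead.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Tiny stand-in frame with made-up values (not taken from the dataset)
sample = pd.DataFrame({
    "NAME": ["Loft", None, "Studio", "Room"],
    "price": ["$100", "$250", None, None],
    "license": [None, None, None, None],
})

summary = missing_summary(sample)
print(summary)
```

Columns that are almost entirely empty, such as 'license' in the output above, are usually candidates for being ignored rather than imputed.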

Step 4: Handle Missing Values

This code converts the 'last review' column to datetime (values that cannot be parsed become NaT), fills missing 'reviews per month' with 0 and missing 'last review' with the earliest date in the column, and drops records that lack a listing name or host name. Missing values in the other columns are left as they are.

Python

# Convert 'last review' to datetime and handle errors
df['last review'] = pd.to_datetime(df['last review'], errors='coerce')

# Fill missing values
df.fillna({'reviews per month': 0, 'last review': df['last review'].min()}, inplace=True)

# Drop records with missing 'NAME' or 'host name'
df.dropna(subset=['NAME', 'host name'], inplace=True)

Step 5: Correct Data Types

Convert the 'price' and 'service fee' columns from currency-formatted strings to floats so they can be used in numeric analysis.

Python

# Remove dollar signs and thousands separators, then convert to float
# (raw strings avoid Python's invalid-escape warning for '\$')
df['price'] = df['price'].replace(r'[\$,]', '', regex=True).astype(float)
df['service fee'] = df['service fee'].replace(r'[\$,]', '', regex=True).astype(float)
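
If a price column contains stray text, astype(float) raises an error. A more tolerant variant, sketched here under the assumption that unparseable values should simply become NaN, uses pd.to_numeric with errors='coerce'. The helper name parse_currency and the sample values are hypothetical.

```python
import pandas as pd

def parse_currency(series: pd.Series) -> pd.Series:
    """Strip '$', ',' and whitespace, then convert to numbers; bad values become NaN."""
    if pd.api.types.is_numeric_dtype(series):
        # Already numeric (for example, the cell was run twice): nothing to strip
        return series.astype(float)
    cleaned = series.str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up examples of the kinds of values a price column might hold
raw_prices = pd.Series(["$1,200 ", "$85", None, "n/a"])
result = parse_currency(raw_prices)
print(result)
```

The trade-off is that coercion hides bad input, so it is worth counting the NaN values before and after the conversion.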

Step 6: Remove Duplicates

Check for and remove any duplicate records.

Python

df.drop_duplicates(inplace=True)
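
Before dropping duplicates, it can help to see how many rows are affected. This small sketch uses a made-up frame to show the difference between exact duplicate rows and rows that merely share an 'id'. Treating 'id' as a listing identifier is an assumption on my part.

```python
import pandas as pd

# Illustrative rows only; in the article this would be df
sample = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "NAME": ["Loft", "Studio", "Studio", "Room", "Room"],
    "price": [100.0, 250.0, 250.0, 80.0, 95.0],
})

# Rows identical in every column to an earlier row
n_exact = int(sample.duplicated().sum())

# Rows whose 'id' already appeared, even if other columns differ
n_same_id = int(sample.duplicated(subset=["id"]).sum())

deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_exact, n_same_id, len(deduped))
```

drop_duplicates() with no arguments removes only the exact duplicates, so rows that repeat an 'id' with different values remain and may need a separate decision.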

Step 7: Confirm Data Cleaning

Verify that the data cleaning steps were successful.

Python

print(df.info())

Output:

Index: 101410 entries, 0 to 102057
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   id                              101410 non-null  int64
 1   NAME                            101410 non-null  object
 2   host id                         101410 non-null  int64
 3   host_identity_verified          101134 non-null  object
 4   host name                       101410 non-null  object
 5   neighbourhood group             101384 non-null  object
 6   neighbourhood                   101394 non-null  object
 7   lat                             101402 non-null  float64
 8   long                            101402 non-null  float64
 9   country                         100884 non-null  object
 10  country code                    101288 non-null  object
 11  instant_bookable                101314 non-null  object
 12  cancellation_policy             101340 non-null  object
 13  room type                       101410 non-null  object
 14  Construction year               101210 non-null  float64
 15  price                           101171 non-null  float64
 16  service fee                     101142 non-null  float64
 17  minimum nights                  101016 non-null  float64
 18  number of reviews               101228 non-null  float64
 19  last review                     101410 non-null  datetime64[ns]
 20  reviews per month               101410 non-null  float64
 21  review rate number              101103 non-null  float64
 22  calculated host listings count  101092 non-null  float64
 23  availability 365                100990 non-null  float64
 24  house_rules                     49831 non-null   object
 25  license                         2 non-null       object
dtypes: datetime64[ns](1), float64(11), int64(2), object(12)
memory usage: 20.9+ MB
None

Now, we can explore the dataset to uncover patterns and insights.

Step 8: Descriptive Statistics

The df.describe() function in pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. This function is useful for understanding the basic statistical properties of the data.

Python

print(df.describe())

Output:

                 id       host id            lat  ...  review rate number  calculated host listings count  availability 365
count  1.014100e+05  1.014100e+05  101402.000000  ...       101103.000000                   101092.000000     100990.000000
mean   2.920959e+07  4.926155e+10      40.728082  ...            3.278558                        7.948463        141.164660
min    1.001254e+06  1.236005e+08      40.499790  ...            1.000000                        1.000000        -10.000000
25%    1.507574e+07  2.459183e+10      40.688730  ...            2.000000                        1.000000          3.000000
50%    2.922911e+07  4.912069e+10      40.722300  ...            3.000000                        1.000000         96.000000
75%    4.328308e+07  7.399747e+10      40.762750  ...            4.000000                        2.000000        269.000000
max    5.736742e+07  9.876313e+10      40.916970  ...            5.000000                      332.000000       3677.000000
std    1.626820e+07  2.853703e+10       0.055850  ...            1.285369                       32.328974        135.419199

[8 rows x 14 columns]
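
The summary above reports a minimum of -10 and a maximum of 3677 for 'availability 365', which cannot both be valid if the column counts available days in a year. The sketch below flags such rows. The 0 to 365 range is my assumption about the column's meaning, and the sample values are illustrative.

```python
import pandas as pd

# Stand-in column including the extreme values seen in describe()
sample = pd.DataFrame({
    "availability 365": [-10.0, 3.0, 96.0, 269.0, 3677.0, None],
})

col = sample["availability 365"]

# True only for non-missing values outside the assumed valid range
out_of_range = col.notna() & ~col.between(0, 365)
print("Out-of-range rows:", int(out_of_range.sum()))

# Keep valid and missing rows; drop only the implausible ones
valid = sample[~out_of_range]
```

Whether to drop, clip, or keep these rows depends on the analysis. The sketch only makes them visible.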

Step 9: Visualization

Distribution of Prices

Plot the distribution of listing prices.

Python

plt.figure(figsize=(10, 6))
sns.histplot(df['price'], bins=50, kde=True)
plt.title('Distribution of Listing Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.show()

Output:

Figure: Distribution of Prices

The histogram shows a fairly even distribution of listing prices across different price ranges, indicating no particular concentration of listings in any specific price range. The KDE line helps visualize this even spread more clearly, confirming that the dataset contains listings with a wide variety of prices.
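
A visual impression of an even spread can be cross-checked numerically. The following sketch, using made-up prices in place of df['price'], prints a few quantiles and the number of listings in equal-width price bands. The choice of three bands is arbitrary.

```python
import pandas as pd

# Illustrative prices only; in the article this would be df['price']
prices = pd.Series(
    [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], dtype=float
)

# Evenly spaced quantiles are one sign of a roughly uniform distribution
quantiles = prices.quantile([0.1, 0.25, 0.5, 0.75, 0.9])
print(quantiles)

# Listings per equal-width price band; similar counts suggest an even spread
band_counts = pd.cut(prices, bins=3).value_counts().sort_index()
print(band_counts)
```

If the quantiles bunch together at the low end instead, the distribution is skewed and the histogram reading should be revisited.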

Room Type Analysis

Analyze the distribution of different room types.

Python

plt.figure(figsize=(8, 5))
sns.countplot(x='room type', data=df , color='hotpink')
plt.title('Room Type Distribution')
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()

Output:

Figure: Distribution of Room Type

The count plot shows a clear distribution of the different room types available in the Airbnb dataset. The majority of listings are for 'Entire home/apt' and 'Private room', with 'Shared room' and 'Hotel room' being much less common. This insight can be useful for understanding the availability and popularity of different types of accommodations on Airbnb.
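
The bar heights can also be expressed as percentages, which are easier to quote than raw counts. This is a minimal sketch with made-up room types. In the article's workflow the series would be df['room type'].

```python
import pandas as pd

# Illustrative values only; not the dataset's real proportions
room_types = pd.Series(
    ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"],
    name="room type",
)

# normalize=True returns fractions; multiply by 100 for percentages
shares = room_types.value_counts(normalize=True).mul(100).round(1)
print(shares)
```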

Neighborhood Analysis

Examine how listings are distributed across different neighborhoods.

Python

plt.figure(figsize=(12, 8))
sns.countplot(y='neighbourhood group', data=df,color="lightgreen" , order=df['neighbourhood group'].value_counts().index)
plt.title('Number of Listings by Neighborhood Group')
plt.xlabel('Count')
plt.ylabel('Neighborhood Group')
plt.show()

Output:

Figure: Neighborhood Analysis

The count plot shows a clear distribution of the number of listings across different neighborhood groups. Manhattan and Brooklyn dominate the listings, suggesting they are prime locations for Airbnb. Queens, Bronx, and Staten Island have fewer listings, indicating less availability or popularity.

Note: Any misspelled or inconsistently written borough names in 'neighbourhood group' are counted as separate categories and appear as their own bars, so the labels should be standardized before the counts are interpreted. Once the labels are consistent, the plot gives a reliable picture of how listings are distributed across neighborhood groups.
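
One way to standardize such labels is an explicit mapping from each misspelling to its intended name. The sketch below assumes two hypothetical misspellings, 'brookln' and 'manhatan'. Inspect df['neighbourhood group'].unique() to find the variants that actually occur.

```python
import pandas as pd

# Illustrative labels; the misspellings here are hypothetical examples
groups = pd.Series(
    ["Brooklyn", "brookln", "Manhattan", "manhatan", "Queens", None]
)

# Map each known variant to its canonical spelling (assumed mapping)
corrections = {"brookln": "Brooklyn", "manhatan": "Manhattan"}
cleaned = groups.replace(corrections)

counts = cleaned.value_counts()
print(counts)
```

An explicit dictionary is preferable to fuzzy matching here because the set of valid values is small and every correction stays visible in the code.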

Price vs. Room Type

Visualize the relationship between price and room type.

Python

plt.figure(figsize=(10, 6))
sns.boxplot(x='room type', y='price', hue='room type', data=df, palette='Set2')
plt.title('Price vs. Room Type')
plt.xlabel('Room Type')
plt.ylabel('Price ($)')
plt.legend(title='Room Type')
plt.show()

Output:

Figure: Price vs. Room Type

The box plot provides a detailed view of how prices vary across different room types in the Airbnb dataset. It shows that while 'Shared room' tends to have lower prices, 'Private room', 'Entire home/apt', and 'Hotel room' have higher and more varied price ranges. This visualization helps in understanding the pricing dynamics for different types of accommodations on Airbnb.
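
The box plot can be backed by a numeric summary per room type. This is a small sketch with made-up prices. In the article's workflow the frame would be df with its cleaned 'price' column.

```python
import pandas as pd

# Illustrative rows only; not real listing prices
sample = pd.DataFrame({
    "room type": [
        "Entire home/apt", "Entire home/apt", "Entire home/apt",
        "Private room", "Private room", "Shared room",
    ],
    "price": [200.0, 300.0, 400.0, 100.0, 150.0, 60.0],
})

# Count, median, and range of price within each room type
summary = sample.groupby("room type")["price"].agg(
    ["count", "median", "min", "max"]
)
print(summary)
```

The median is used instead of the mean because it is less sensitive to a handful of very expensive listings.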

Reviews Over Time

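Plot review activity over time. Because the dataset stores only the date of each listing's most recent review, the plot counts listings by the month of their last review and serves as a proxy for review volume.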

Python

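# 'last review' was already converted in Step 4; repeating it is a harmless safeguard
df['last review'] = pd.to_datetime(df['last review'])
# Count listings by the month of their most recent review
reviews_over_time = df.groupby(df['last review'].dt.to_period('M')).size()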

plt.figure(figsize=(12, 6))
reviews_over_time.plot(kind='line',color='red')
plt.title('Number of Reviews Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Reviews')
plt.show()

Output:

The line plot provides a clear visualization of the number of reviews over time. It helps identify trends and patterns in review activity, such as periods of high or low activity. This information can be useful for understanding the dynamics of user engagement and the popularity of Airbnb listings over time. The significant spikes and drops in reviews might be worth further investigation to understand the underlying causes, such as changes in Airbnb policies, market conditions, or external events.
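
One caveat: Step 4 filled missing 'last review' values with the earliest date in the column, so every listing without a review is counted in the earliest month. The sketch below shows one way to avoid that by counting only rows that had a real review date. The sample dates and the month/day/year format are assumptions.

```python
import pandas as pd

# Illustrative raw values, as they might look before any filling
raw = pd.Series(["10/19/2021", None, "11/19/2022", "10/05/2021", "not a date"])

# Assumed format; unparseable or missing entries become NaT
last_review = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Remember which rows had a genuine date, before any placeholder is filled in
has_review = last_review.notna()

# Monthly counts based only on genuine review dates
monthly = (
    last_review[has_review].dt.to_period("M").value_counts().sort_index()
)
print(monthly)
```

Computing the has_review mask before calling fillna keeps the option of excluding placeholder dates from any time-based plot.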

Key Insights From Exploratory Data Analysis of AirBnB Dataset

The key insights derived from the exploratory data analysis are discussed below:

Pricing Distribution:
- Listing prices are spread fairly evenly across the observed price range rather than clustering in one band.
- As a result, the dataset contains listings at a wide variety of price points.
Room Type Distribution:
- The majority of listings are either entire homes/apartments or private rooms.
- Shared rooms and hotel rooms constitute a very small portion of the listings.
Geographical Distribution:
- Listings are predominantly concentrated in popular areas like Brooklyn and Manhattan.
- Other boroughs such as Queens, Bronx, and Staten Island have fewer listings.
Price Comparison by Room Type:
- Entire homes/apartments generally cost more than private rooms.
- Shared rooms tend to have the lowest prices among the room types.
Seasonal Trends in Reviews:
- There are observable seasonal trends in the number of reviews.
- Certain months experience higher review activity.

Conclusion

EDA helps us understand the main trends and patterns in the AirBnB dataset. We found that listing prices are spread across a wide range, with popular areas having the highest concentration of listings. Entire homes/apartments are typically more expensive than private rooms. Additionally, review patterns show seasonal variations. These insights can guide both hosts and guests in making informed decisions.
