As I write these lines on a sunny day in a little town on the island of Mallorca, news about lockdowns, prevention measures, social distancing, and economic recession has become commonplace. Most hotels are closing before the summer season ends, and airports are experiencing an uncanny, unprecedented calm. In this gloomy landscape, the travel industry’s future looks far from promising. According to data from the Statistical Institute of the Balearic Islands (IBESTAT), the number of tourists visiting the islands has fallen 92 % year over year, and a recent report by the World Tourism Organization predicts a 60–80 % fall in international tourist numbers in 2020. It feels like the industry is walking through a long tunnel, and no one knows if there is light at the end.
Yet people still want to travel.
In one of the most recent surveys about travel confidence, 44 % of respondents said they would travel by plane for their main trip in 2021. The travel industry is a major player in the global economy: its direct contribution worldwide in 2019 was 2.9 trillion USD. It will recover sooner or later. When that happens, managers, stakeholders, and newcomers must be prepared for unprecedented challenges in this new landscape, where the smart use of data will become an essential part of the business.
Fortunately for the analytics world, some real hotel booking data has been made public. An effort by Nuno Antonio, Ana de Almeida, and Luis Nunes has led to the publication of anonymized booking data from two hotels in Portugal, one resort and one city hotel, covering 2015 to 2017. This offers an extraordinary opportunity for research and analytics on hotel booking demand. We can, for instance, analyze booking patterns and gain insights into the local hotel market, or answer more advanced business questions with potential applications in the travel industry.
Taking a cleaned version of this dataset from Kaggle, I’ll analyze two basic concepts in pricing strategy, seasonality and groups, and then a more advanced one: predicting booking cancellations. We’ll look at how well we can predict whether a booking is going to be canceled using only the information available at the moment of the reservation.
How strong is the seasonality?
Bookings, prices, and therefore revenue are for the most part determined by seasonality, at least in resort hotels. Recognizing seasonality patterns is a basic step in setting a pricing policy. To what extent are the hotels in the dataset subject to these patterns? Is seasonality stronger in the resort hotel?
Let’s look at the daily time series for the average daily rate (ADR), room nights, and revenue. To reduce noise, it helps to look at a 7-day rolling average instead of the raw daily data.
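For readers who want to reproduce the smoothing step, here is a minimal sketch of a trailing 7-day rolling average in plain Python. The daily values are made up for illustration, and the function name `rolling_mean` is my own. The original analysis may well have used a library routine or a centered window instead.

```python
from collections import deque

def rolling_mean(values, window=7):
    """Trailing rolling mean; positions with fewer than `window` points yield None."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is evicted by the append below
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out

# Hypothetical daily ADR values, for illustration only
daily_adr = [80, 82, 85, 90, 120, 125, 95, 81, 83, 86]
smoothed = rolling_mean(daily_adr, window=7)
```

The first six positions are `None` because a full week of data is not yet available. The same function can be applied to room nights or revenue.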

As we can see, prices and occupancy are strongly influenced by the season in these hotels, especially in the resort hotel, where there is a clear increase in revenue from July to October. Room nights change abruptly in both cases. There is also a small peak at the beginning of the year.
These patterns become clearer if we average the KPIs by month, and present the data relative to the yearly average.
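One way to express a KPI relative to the yearly average is as a percentage deviation of each monthly mean from the overall mean. The sketch below assumes the yearly average is the mean of the monthly means, which is my own choice and not necessarily the exact definition behind the figures. The sample records are invented.

```python
from collections import defaultdict
from statistics import mean

def monthly_index(records):
    """records: iterable of (month, value) pairs.

    Returns {month: percentage deviation of the monthly mean
    from the mean of all monthly means}.
    """
    by_month = defaultdict(list)
    for month, value in records:
        by_month[month].append(value)
    monthly = {m: mean(v) for m, v in by_month.items()}
    yearly = mean(list(monthly.values()))
    return {m: 100 * (avg / yearly - 1) for m, avg in monthly.items()}

# Hypothetical (month, ADR) observations, for illustration only
sample = [(1, 50), (1, 70), (8, 180), (8, 180)]
index = monthly_index(sample)
```

With this definition, a value of 100 means the month sits at twice the yearly average, and a value of -50 means it sits at half.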

Revenue and ADR peak in August in the resort hotel. Revenue climbs 100 % above the average, while ADR goes as far as 200 % above it. The city hotel presents a similar seasonality pattern, but nowhere near as intense as the resort hotel. The exception is room nights, where the variations for the two hotels have similar magnitudes. The summer season seems to last longer in the city hotel.
As we just saw, there is a huge variation in prices and rooms sold throughout the year. Adapting your pricing strategy to seasonality patterns is a must, especially in resort hotels.
To what extent do ADR, length of stay, and lead time for group reservations differ from individual/transient ones?
Groups, to be honest, are sometimes a pain in the neck for revenue managers: more often than not they book very early, when prices are often lower than the closing ADR. They can also have a shorter length of stay than the usual transient traveler. So a revenue manager could end up in the awkward position of having 90 % of the hotel rooms sold to a large group, at a derisory price, for one night in the middle of the week. In this situation, filling the rooms on the adjacent days can become an arduous task (7-day stays within a calendar week are common in resorts).
Let’s look at the group behavior in these hotels and try to get some interesting insights. Does their behavior differ from that of the transient traveler? Are groups getting too good a deal? Should hotels increase the rates to improve ADR and revenue? Are rooms on the dates around a group stay really that hard to sell?
Group bookings represent 9 % of total revenue (13,386 room nights) in the resort hotel and 8 % (11,502 room nights) in the city hotel. That is a decent share of total revenue, so it’s important to plan a pricing strategy for this segment.
The size of the groups varies considerably, around an average of 10 % of maximum occupancy. Large groups are more common in the resort hotel.

If we examine the length of stay, transient bookings and group bookings are pretty similar, though the latter have a larger proportion of short stays (between 1 and 3 nights). The median length of stay in all cases is three days.

Regarding lead time, groups book significantly earlier. In the city hotel, the average lead time is 122 days for groups and 76 days for transient bookings. In the resort hotel, the figures are 162 days for groups and 69 days for transient bookings. The explanation for this big difference is that a high percentage of transient bookings are made very shortly before arrival: in the resort hotel, 43 % of transient travelers book three weeks in advance or less, while only 16 % of group reservations fall inside this window. For the city hotel, the figures are 35 % and 18 %, respectively.
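The two lead-time statistics quoted here, the mean lead time and the share of bookings made three weeks in advance or less, can be computed per segment as in the sketch below. The segment labels and sample bookings are hypothetical, and `lead_time_summary` is an illustrative helper rather than code from the original analysis.

```python
from statistics import mean

def lead_time_summary(bookings, window_days=21):
    """bookings: iterable of (segment, lead_time_days) pairs."""
    by_segment = {}
    for segment, lead in bookings:
        by_segment.setdefault(segment, []).append(lead)
    return {
        seg: {
            "mean_lead_time": mean(leads),
            # fraction of bookings made `window_days` or fewer before arrival
            "share_within_window": sum(l <= window_days for l in leads) / len(leads),
        }
        for seg, leads in by_segment.items()
    }

# Hypothetical bookings, for illustration only
sample = [
    ("Transient", 3), ("Transient", 14), ("Transient", 60), ("Transient", 200),
    ("Group", 10), ("Group", 150), ("Group", 180), ("Group", 220),
]
summary = lead_time_summary(sample)
```

In this toy sample, half of the transient bookings fall inside the three-week window, against a quarter of the group bookings.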

As a corollary, it’s worth saying that these differences are also explained by the apparent lack of a pricing strategy: with an established pricing strategy, the curve for transient bookings in the figure above would look quite different, with room nights increasing at a steady, constant pace.
Groups also pay lower rates on average: for the city hotel, the mean ADR is $ 84 for groups and $ 108 for transient bookings. For the resort hotel, the numbers are $ 68 and $ 94, respectively. This difference, though, is partially explained by the stay date: groups tend to stay in months with low occupancy, especially in the resort hotel.

If we control for this and other factors that can influence ADR, such as meal, room type, and number of adults, group rates are still, on average, $ 10 lower than transient rates for the city hotel and $ 35 lower for the resort hotel.
Finally, a very interesting question is whether the rooms on the dates adjacent to a large group stay are harder to sell. To analyze this, we’ve compared the ADR on dates adjacent to large group stays in one year with the ADR on the same dates the year before, provided that there were no group stays near those dates in the comparison year. The idea is that if rooms are harder to sell around a group stay, they will end up being sold at a lower rate.
We consider a group to be large if it occupies at least 30 % of the maximum occupancy registered during the period analyzed. The periods compared are July 2016 to July 2017 against July 2015 to July 2016. Instead of comparing ADR directly, I’ve normalized the KPI by dividing it by the mean ADR of the corresponding period, to account for possible differences in ADR from one period to the other.
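The two definitions above, the large-group threshold and the normalized (relative) ADR, can be sketched as follows. This is only an illustration under my own assumptions: the period mean is taken as the plain mean of the daily ADR values, and the dates and numbers are invented.

```python
from statistics import mean

def relative_adr(adr_by_date):
    """Divide each daily ADR by the mean ADR of its own period."""
    period_mean = mean(list(adr_by_date.values()))
    return {day: adr / period_mean for day, adr in adr_by_date.items()}

def is_large_group(group_rooms, max_occupancy, threshold=0.30):
    """A group is 'large' if it takes at least `threshold` of the maximum occupancy."""
    return group_rooms >= threshold * max_occupancy

# Hypothetical daily ADR for one period, for illustration only
period = {"07-10": 90.0, "07-11": 110.0, "07-12": 100.0}
rel = relative_adr(period)

large = is_large_group(group_rooms=60, max_occupancy=150)
small = is_large_group(group_rooms=20, max_occupancy=150)
```

A relative ADR of 0.9 means the rate on that date was 90 % of the period average, which makes dates from two different periods comparable even if overall price levels changed.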
There are no significant differences in the resort hotel, where 44 dates met the requirements. Mean ADR relative to the yearly average was 83 % for dates without groups nearby and 85 % for dates with groups nearby.
By contrast, there is a significant shift in relative ADR for the city hotel, where 38 dates met the requirements. Relative ADR for dates with groups nearby was 83 % of the yearly average, whereas for dates without groups nearby it was 95 %. The distribution of the relative ADR on these dates for both the city and the resort hotel is displayed below.

Having a good pricing strategy specific to groups is important. As we’ve just seen, their behavior patterns in lead time and length of stay differ from those of transient travelers, and filling the rooms on the adjacent dates can be a difficult task in some cases.
Can we predict a cancellation using only the information available when the reservation is made?
There are important potential gains in avoiding cancellations. 42 % of the bookings in the city hotel and 28 % of bookings in the resort were canceled. Marketing initiatives targeted towards the reservations that are more prone to be canceled could result in important gains. In this section, I’ll try to predict whether a reservation will be canceled using information available at the moment of the reservation.
Believe it or not, a lot of useful information about the likelihood of cancellation can be extracted at the moment of the reservation. Next, we’ll visualize the variables that have the greatest influence on the probability of cancellation.
Let’s begin by taking a look at cancellation frequency relative to lead time (measured in weeks prior to check-in).
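The cancellation frequency per lead-time week can be obtained by bucketing each booking's lead time into whole weeks and averaging the cancellation flag within each bucket. The sketch below uses invented bookings. The `(lead_time_days, is_canceled)` layout is an assumption for illustration, not the exact structure used in the original analysis.

```python
from collections import defaultdict

def cancellation_rate_by_week(bookings):
    """bookings: iterable of (lead_time_days, is_canceled) pairs.

    Returns {weeks_before_check_in: share of bookings canceled}.
    """
    counts = defaultdict(lambda: [0, 0])  # week -> [canceled, total]
    for lead_days, canceled in bookings:
        week = lead_days // 7  # whole weeks prior to check-in
        counts[week][0] += int(canceled)
        counts[week][1] += 1
    return {week: c / n for week, (c, n) in sorted(counts.items())}

# Hypothetical bookings, for illustration only
sample = [(2, 0), (5, 0), (6, 1), (30, 1), (33, 0), (365, 1), (366, 1)]
rates = cancellation_rate_by_week(sample)
```

The same grouping idea applies to the other variables discussed in this section, such as country of origin, deposit type, or number of special requests: replace the week bucket with the category of interest.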

The relationship between the two is almost linear (except for “last minute” reservations, where cancellations drop rapidly) and stronger for the city hotel: while cancellations are rare when booking only a few weeks in advance, approximately 80 % of the reservations made one year in advance are canceled in the city hotel, and nearly 50 % in the resort.
Country of origin also plays a big role in determining the probability of a cancellation: the clients who cancel most frequently are Portuguese, that is, the local market. The figure below shows the percentage of cancellations by country of origin.

Among the main markets, the cancellation rate for Portuguese bookings is almost double that of the others, in both the city and the resort hotel. We are talking about an important segment here: Portuguese bookings account for 40 % of total bookings.
We have to take these results with caution, though, since hoteliers cannot be 100 % sure of the client’s nationality until check-in.
Odd results come from the deposit type. Surprisingly, a large share of non-refundable reservations is canceled for both hotels.

Non-refundable bookings amount to only 12 % of all bookings if we take the two hotels as a whole, but the result is surprising nonetheless. Refundable bookings are also canceled in high proportions in the resort hotel, but the number of these reservations is negligible (0.1 %).
Another important factor in determining whether or not a reservation will be canceled is the number of special requests the customer has made when booking the room.

It looks like the more special requests a booking has, the lower the likelihood of cancellation.
Finally, knowing whether the client has canceled before is also important when it comes to predicting cancellations. The proportion of clients with previous cancellations is fairly low, 7 % for the city hotel and 3 % for the resort hotel, but their behavior is heavily skewed: in the city hotel, 93 % of clients who had previously canceled a trip also canceled their latest one. In the resort hotel, the figure is 84 %.
With all this information, plus some other variables, one can build a model to predict whether a specific reservation is going to be canceled. A simple random forest predicted correctly whether a booking would end up being canceled 85 % of the time. Among the bookings the model predicted as cancellations, it was correct in 81 % of cases. And among the bookings that were actually canceled, the model was correct in 77 % of cases.
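The three figures quoted above correspond to what are usually called accuracy (share of all bookings classified correctly), precision (share of predicted cancellations that were real), and recall (share of real cancellations that were caught). The sketch below shows how they are derived from true and predicted labels. The labels are invented and do not reproduce the random forest or its results.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = canceled)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # canceled, predicted canceled
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # kept, predicted canceled
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # canceled, predicted kept
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # kept, predicted kept
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Looking at precision and recall alongside accuracy matters here because cancellations are a minority of bookings in the resort hotel, so accuracy alone can look good even when many cancellations are missed.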
With some basic exploration and a single model, trained on a limited dataset covering a short time span, one can achieve decent results when predicting booking cancellations. With the full battery of resources and data available in a modern hotel management environment, it is realistic to expect better results.
We’ve seen how, with just a relatively simple dataset of hotel bookings, multiple, diverse, and useful analyses can be performed, from understanding seasonality patterns to predicting booking cancellations. Nowadays hoteliers have a lot of data at their disposal, and it is relatively cheap to perform advanced analytics. It’s time to start using data smartly.
The repository with the analysis can be accessed here.