BERT4Cache: a bidirectional encoder representations for data prefetching in cache

PeerJ Computer Science

Introduction

In the past few years, Internet services have undergone exponential growth in user numbers, applications, and data traffic, and their scope has expanded from multimedia streaming to online gaming (Liu et al., 2018). Because the existing network architecture struggles to handle massive data transfers under tight latency requirements (Shafi et al., 2017), caching, a technique that improves system response rates and reduces redundant network traffic, plays an increasingly vital role in today's network architecture.

Caching is a technique for temporarily storing data. It keeps copies of previously retrieved or computed data in high-speed storage media so that subsequent requests for those objects can be served directly, without waiting for them to be fetched again from slower storage. This reduces accesses to the slower medium and improves system response speed and availability. Caching is used in many scenarios. In computer systems, CPUs use multi-level high-speed caches to store instructions and data and thereby accelerate processing. Server-side caching stores previously retrieved network resources, reducing access latency to remote servers. Web services, content delivery networks (CDNs), and systems such as RL-Cache (Kirilin et al., 2019) use caching to improve loading and response speeds. In file systems, caching stores recently read or written file data to minimize disk accesses, as in the heterogeneous memory cache framework FLAC (Zhang et al., 2020).

The architecture of a cache varies with the application scenario. For instance, server caching may adopt centralized, distributed, or multi-level cache architectures. In many large-scale Internet storage systems, in-memory caching systems such as Memcached (opensource, 2024) and Redis (opensource, 2024), or local caching libraries such as Ehcache (M LGSSKGT, 2023), are deployed to meet service-level objectives (SLOs).

Despite their diverse applications, all of these uses of caching exploit the same idea: storing copies of data that may be accessed in the future in high-speed media. The effectiveness and efficiency of caching therefore depend on whether the cached data will actually be accessed. In other words, the efficacy of a cache is determined by its hit rate, and the hit rate is bounded by how accurately the caching algorithm predicts which content will be popular in the imminent future. Consequently, modeling the popularity of content in the imminent future, predicting it from historical sequence data, and designing cache algorithms on top of such predictions have become key research directions in building efficient caching systems.
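For reference, the hit rate discussed throughout this article is the standard ratio of requests served from the cache to all requests:

```latex
\text{hit rate} = \frac{\text{number of requests served from the cache}}{\text{total number of requests}}
```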

Given the vast number of objects that may be requested in the future, one of the primary challenges faced by caching algorithms is deciding which objects should be stored in the cache. Based on how cache decisions are made, caching algorithms can be categorized into four main types: rule-based caching, data mining-based caching, classical machine learning-based caching, and deep learning-based caching. Rule-based caching has low computational overhead and fast response times but cannot learn user interests and preferences. Data mining-based and classical machine learning-based caching can learn user access patterns to some extent, but are limited by the algorithms themselves in finding optimal solutions and may not mine patterns deeply. Deep learning-based caching, on the other hand, achieves a deeper understanding of user behavior and patterns through deep learning techniques. It has stronger learning capability and adaptability, allowing more accurate predictions of the objects that might be requested in the future, and it excels at handling large-scale data, automatically extracting and comprehending complex patterns. Despite its higher computational complexity, with the widespread increase in server computing power and the gradual adoption of edge computing architectures (Cao et al., 2020), deep learning-based caching shows broad application prospects and has attracted researchers' attention for its ability to handle evolving user demands and access patterns.

According to the scope of information on which decisions rely, caching algorithms can be broadly categorized into two types: reactive caching and proactive caching. Reactive caching makes decisions based on real-time information and the request context, usually at the moment a request arrives. When a new request arrives, a reactive caching algorithm considers the current cache state and the requested content to determine whether to place the object in the cache and whether to evict existing objects. Rule-based policies such as Least Recently Used (LRU), Least Frequently Used (LFU), and their variants are representative of reactive caching; they make cache decisions solely based on recently observed object access patterns. However, a key drawback of reactive caching is that it can perceive the popularity of an object only after it has been requested, making it difficult to capture dynamic changes in popularity in a timely manner. Therefore, recent research has gradually shifted focus towards proactive caching.
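As a concrete illustration of a reactive policy, the minimal LRU sketch below (in Python; the class and method names are our own, not taken from any cited system) makes its caching decision only at the moment a request arrives:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal reactive cache: all decisions happen when a request arrives."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()          # object_id -> object, ordered by recency

    def get(self, key):
        if key not in self.store:
            return None                     # cache miss
        self.store.move_to_end(key)         # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used object
```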

Proactive caching, based on historical request information, predicts data objects and pre-stores them in the cache before the corresponding requests arrive. It analyzes historical access patterns, content popularity trends, and other information to determine which content should be preloaded into the cache. The proactive approach typically builds global knowledge of object access patterns and, at inference time, predicts the objects likely to be requested in the imminent future from recent access sequences. Traditional proactive caching methods commonly employ data mining and shallower machine learning techniques, such as dependency graph-based methods, Markov model-based methods, recurrent neural network (RNN) methods, and their variants. In recent years, emerging deep learning-based proactive caching methods, including methods based on attention mechanisms, have drawn researchers' attention.
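As a rough structural sketch (the function and parameter names are ours and are not drawn from any cited method), proactive prefetching can be organized as a loop that ranks candidate objects by predicted probability and preloads the top-k before the next request arrives:

```python
def proactive_prefetch(cache, predict_next, fetch, history, top_k=5):
    """Rank candidate objects by predicted request probability and preload
    the top-k that are not yet cached. `predict_next` maps a recent access
    history to a dict {object_id: probability}; `fetch` retrieves an object
    from the slower backing store."""
    scores = predict_next(history)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    for obj_id in ranked:
        if cache.get(obj_id) is None:         # object not cached yet
            cache.put(obj_id, fetch(obj_id))  # prefetch ahead of the request
```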

The effectiveness of proactive prefetching largely depends on how accurately the popularity of content in the imminent future can be predicted. Existing methods model the entire cache access history as a single sequence, without partitioning it into independent single-user sub-sequences. This prevents the model from distinguishing between access patterns that are common across users and those that are unique to individual users, and the resulting lack of differentiation may lead to suboptimal learning of user behavior, since the model is forced to generalize across all users without considering the specific nuances of each user's access patterns. Moreover, because user requests do not arrive at uniform time intervals, interleaving multiple users' request sequences confounds the pattern information of user requests, which makes content popularity unstable from the perspective of the entire system. Additionally, users' historical interactions in real-world applications may not follow a rigid order assumption (Wang et al., 2018). Proactive caching methods built on autoregressive architectures such as RNNs and LSTMs are unidirectional, which limits the hidden representation capacity of items in the historical sequence and is insufficient for learning optimal representations of user behavior sequences (Hu et al., 2017). In contrast, a bidirectional Transformer architecture such as BERT can simultaneously consider the tokens preceding and following each position during encoding, which helps the model better comprehend the patterns within the access sequence and make more accurate predictions. It is therefore necessary to partition access sequences into user sessions and to introduce bidirectional context representations into proactive caching algorithms.

To tackle these problems, we introduce BERT4Cache, which utilizes a bidirectional Transformer network with attention mechanisms to learn user access patterns. It predicts the most likely objects to be accessed in the imminent future based on the user’s historical access and guides cache prefetching to improve cache hit rates. The contributions of our article are as follows:

  1) Proposed BERT4Cache, an end-to-end bidirectional Transformer model with attention mechanisms, to explore users' access patterns, predict objects users are likely to request in the near future, and prefetch them into the cache.

  2) Introduced forward consistency to enhance user access sequences during data modeling. Combined with a mask-based sequence prediction task, the model is trained to learn users' data access patterns.

  3) Built a simulated caching system and demonstrated, using real-world data, that BERT4Cache achieves advanced performance compared to generic reactive and advanced proactive caching strategies.

Related works

Classical caching updating

A significant amount of research has been conducted on caching algorithms, including reactive algorithms such as LRU, LFU, and their variants, which are commonly used in processor caches and web caches. Other studies have proposed proactive caching algorithms based on rules and data mining, such as graph-based caching, Markov model-based caching, and cost function-based caching. Some researchers employed a dependency graph to predict the popularity of requested objects; however, the dependency graph only examines dependencies between pairs of objects, leading to lower prediction accuracy. To address the low modeling capacity of low-order Markov models and the low coverage of high-order Markov models, the full k-order Markov model was introduced, but it exacerbates the complexity problem because the states of all constituent Markov models are combined. Khan et al. (2023) proposed SHADE, which dynamically updates the importance scores of all samples during training to make informed caching decisions, thereby improving the cache hit rate for deep learning training tasks. Einziger, Himelbrand & Waisbard (2023) introduced an adaptive access time-aware caching strategy that uses changes in access time as an additional signal for the caching algorithm, improving the average access time of web caches. Li et al. (2022) mined frequent block access patterns and compared them with the requests in the current time window to guide prefetching into SSD caches. However, cost function-based and data mining-based algorithms introduce too many manually observed patterns, exhibit weak robustness across different contexts, and struggle to learn user interests and preferences effectively.
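To make the Markov-based family concrete, the sketch below shows an illustrative first-order Markov predictor (our own simplification, not a cited implementation); higher-order variants condition on the last k requests, which is exactly what inflates the state space:

```python
from collections import defaultdict

class FirstOrderMarkovPredictor:
    """Estimates P(next object | current object) from observed transitions."""

    def __init__(self):
        self.transitions = defaultdict(lambda: defaultdict(int))

    def fit(self, sequence):
        # Count transitions between consecutive requests in the history.
        for prev, nxt in zip(sequence, sequence[1:]):
            self.transitions[prev][nxt] += 1

    def predict_next(self, current, top_k=3):
        counts = self.transitions.get(current, {})
        total = sum(counts.values()) or 1
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        return [(obj, cnt / total) for obj, cnt in ranked]
```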

Machine learning-based caching updating

As a powerful tool, machine learning algorithms can learn the popularity of data objects, the drift of user preferences, and the content characteristics of data objects, and consequently cache the most popular files while evicting those least likely to be accessed, thereby enhancing hit rates. The machine learning algorithm used in this process can be an autoregressive method, a reinforcement learning-based approach, and so on. Ale et al. (2019) proposed an online proactive caching scheme based on a bidirectional recurrent neural network (BRNN) model to predict temporal content requests and update the cache accordingly. Narayanan et al. (2018) improved cache hit rates with DEEPCACHE, a content caching framework based on a deep LSTM encoder-decoder, which predicts object popularity and proactively caches the items most likely to be needed in the future. Wei et al. (2018) used reinforcement learning to learn the optimal caching strategy through interaction with the network environment. Rodriguez et al. (2021) modeled dynamically changing workloads as a composition of four workload primitive types and employed reinforcement learning to select among multiple caching strategies such as ARC and LIRS, adjusting their weights to achieve adaptive cache updates. Yang et al. (2023) clustered similar objects into groups and performed cache prefetching and replacement based on the features of these groups. Navarro et al. (2017) ran a set of applications on a processor simulator, obtained their instruction features, and trained a model on these features to predict the optimal cache reconfiguration for any given application. Huang et al. (2016) introduced a “one-time-access criteria” for the cache space, proposed a policy to exclude unnecessary SSD cache writes, and designed a prediction-based classifier to support the policy. Zarif et al. (2020) employed an RNN-based LFU and proposed a proactive caching framework independent of the database system, which predicts the time cost and frequency of queries to guide the caching process. However, these methods do not distinguish between users, and complex interleaved access sequences may confound the pattern information of user requests, affecting cache hit rates. Constrained by their inherent characteristics, such algorithms struggle to find optimal cache update solutions and may not mine patterns deeply enough. Additionally, these methods suffer from low inference efficiency because they are difficult to parallelize.
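For illustration of the autoregressive family used by several of these works, the PyTorch sketch below (hyperparameters and names are placeholders, not taken from any cited system) encodes a request history with a unidirectional LSTM and scores every cacheable object as the candidate next request:

```python
import torch
import torch.nn as nn

class LSTMNextObjectModel(nn.Module):
    """Unidirectional (autoregressive) baseline: embed the request sequence,
    encode it with an LSTM, and score all objects as the possible next request."""

    def __init__(self, num_objects: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_objects, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_objects)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, seq_len) integer object ids
        emb = self.embed(seq)               # (batch, seq_len, embed_dim)
        out, _ = self.lstm(emb)             # (batch, seq_len, hidden_dim)
        return self.head(out[:, -1, :])     # logits over all objects for the next request
```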

Attention mechanism and proactive prefetching

The attention mechanism dynamically adjusts the weights assigned to different parts of input data based on their distinct features. This enables selective focus during data processing, making the model more effective in exploring long-range dependencies and complex patterns within sequences. With models like Transformer (Vaswani et al., 2017), BERT (Devlin et al., 2019), etc., built on multi-head self-attention achieving leading results in sequence data modeling such as machine translation and text classification, some researchers are exploring the use of attention mechanisms to enhance the hit rate and interpretability of recommendation systems. Li et al. (2017) proposed the SASRec model, based on a Transformer decoder, to learn user browsing behavior, achieving state-of-the-art results across multiple public datasets. Sun et al. (2019), through a cloze task, modeled user behavior sequences using a bidirectional self-attention network, demonstrating significant potential in sequence recommendation. Both sequence recommendation and proactive prefetching require modeling user data access patterns to predict future user behavior before actual requests. Therefore, the combination of attention mechanisms and proactive prefetching has become a popular topic in recent years.
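The core operation behind these models is scaled dot-product attention (Vaswani et al., 2017); a minimal sketch is shown below, in which each position attends to all others with weights given by query-key similarity:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)                # attention weights
    return weights @ v, weights
```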

The proposed BERT4Cache method

In this article, our objective is to leverage sequentialized data to characterize user request sequences through BERT with integrated forward consistency. We then employ this characterization to construct a predictor over user request sequences, thereby guiding the cache prefetching process.

To this end, the proposed methodology comprises two primary stages: (1) Data modeling, involving the transformation of continuous, unsegmented raw log data into forward consistency pairs (FCP) and sequence-prefetch sets (SPS). The detailed procedure will be elucidated in “Data Modeling”; (2) Training, which entails the utilization of masked data sequence (MDS) and forward consistency (FC) tasks to train a BERT network that is particularly attuned to recent request history. The specific steps will be expounded upon in “Training BERT4Cache”.

Problem formulation

In the context of sequentialized data $L$, let $U = \{u_1, u_2, \ldots, u_{|U|}\}$ denote the set of users, and $O = \{o_1, o_2, \ldots, o_{|O|}\}$ the set of objects that may be prefetched. For a given user $u \in U$, the request sequence set $S^u = \{s_1, \ldots, s_N\}$ is obtained by slicing the user's requests with a maximum time interval $g$, where $N$ is the number of sequence slices. Each slice $s \in S^u$ is $s = [o_1, \ldots, o_t]$, where $o_t$ is the last object requested in the slice in time order. Given the data $L$, cache prefetching aims to capture the latest sequence slice $s_{now}^{u_i}$ for each user $u_i$ and to model all possible objects for that user's next request, as shown in Eq. (1).

$p\left(o_{t+1}^{u_i} = o \mid s_{now}^{u_i}\right). \quad (1)$
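A minimal sketch of this slicing step (function and variable names are ours) splits one user's time-ordered requests into sequence slices whenever the gap between consecutive requests exceeds the maximum interval $g$:

```python
def slice_by_gap(requests, g):
    """Split a user's time-ordered requests, given as (object_id, timestamp)
    pairs, into sequence slices S^u = {s_1, ..., s_N} whenever the time gap
    between consecutive requests exceeds g."""
    slices, current, last_ts = [], [], None
    for obj, ts in requests:
        if last_ts is not None and ts - last_ts > g:
            slices.append(current)
            current = []
        current.append(obj)
        last_ts = ts
    if current:
        slices.append(current)
    return slices
```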

BERT4Cache architecture

In this section, we present an innovative framework, Bidirectional Encoder Representations from Transformers for cache prefetching (BERT4Cache).

As depicted in Fig. 1, BERT4Cache comprises five essential components: (1) the Sequentialized Data Processor, responsible for transforming sequentialized data into request object sequences organized by user sessions; (2) the Forward Consistency Pair Generator, which uses a sliding window to construct FCPs from the sequences obtained in step (1); (3) the Sequence-Prefetch Set Generator, which appends a special token [mask] to the end of each request object sequence and replaces a random p% of tokens in the sequence with the special token [mask]; (4) the BERT4Cache Network, trained on the data generated in steps (2) and (3) using the Forward Consistency task and the Cloze task so that the model learns complex patterns in users' object requests. Specifically, the main network of BERT4Cache consists of multiple layers of bidirectional Transformer blocks, each built from multi-head self-attention layers that capture implicit relationships among the encoded token representations. In this study, we employed 12 Transformer blocks, each with 12 attention heads; the dimensionality of each input was set to D=768, with a total of L=512 input tokens; and (5) the Decoding Network, responsible for decoding the context-encoded results from the BERT4Cache Network into the actual categories of objects that can be prefetched in the cache system. In this study, we employ a two-layer feed-forward network with GELU activation in between to produce a |O|-dimensional output distribution.
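To illustrate component (3), the sketch below shows one plausible realization (an assumption on our part; the exact masking scheme of BERT4Cache may differ in detail) of appending the trailing [mask] token as the next-request slot and replacing a random p% of in-sequence tokens with [mask] for the Cloze task:

```python
import random

MASK = "[mask]"

def build_sequence_prefetch_sample(seq, p=0.15, rng=random):
    """Mask a request sequence for training: replace a random fraction p of
    tokens with [mask] (Cloze task) and append a trailing [mask] that stands
    for the imminent next request."""
    masked, labels = [], []
    for tok in seq:
        if rng.random() < p:
            masked.append(MASK)
            labels.append(tok)          # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)         # position not used as a prediction target
    masked.append(MASK)                 # slot for the next-object prediction
    labels.append(None)                 # its label comes from the actual next request
    return masked, labels
```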