GAP-Net: Calibrating User Intent via Gated Adaptive Progressive Learning for CTR Prediction

Shenqiang Ke (Meituan, Beijing, China), Jianxiong Wei (Meituan, Beijing, China), and Qingsong Hua (Meituan, Beijing, China)
Abstract.

Sequential user behavior modeling is pivotal for Click-Through Rate (CTR) prediction yet is hindered by three intrinsic bottlenecks: (1) the ”Attention Sink” phenomenon, where standard Softmax compels the model to allocate probability mass to noisy behaviors; (2) the Static Query Assumption, which overlooks dynamic shifts in user intent driven by real-time contexts; and (3) Rigid View Aggregation, which fails to adaptively weight heterogeneous temporal signals according to the decision context. To bridge these gaps, we propose GAP-Net (Gated Adaptive Progressive Network), a unified framework establishing a ”Triple Gating” architecture to progressively refine information from micro-level features to macro-level views. GAP-Net operates through three integrated mechanisms: (1) Adaptive Sparse-Gated Attention (ASGA) employs micro-level gating to enforce sparsity, effectively suppressing massive noise activations; (2) Gated Cascading Query Calibration (GCQC) dynamically aligns user intent by bridging real-time triggers and long-term memories via a meso-level cascading channel; and (3) Context-Gated Denoising Fusion (CGDF) performs macro-level modulation to orchestrate the aggregation of multi-view sequences. Extensive experiments on industrial datasets demonstrate that GAP-Net achieves substantial improvements over state-of-the-art baselines, exhibiting superior robustness against interaction noise and intent drift.

Keywords: Sequential Recommendation, Click-Through Rate Prediction, Gating Mechanism, User Intent Calibration

1. Introduction

Click-Through Rate (CTR) prediction stands as the cornerstone of modern online advertising (Guo et al., 2017; Ma et al., 2018) and recommendation systems (Covington et al., 2016; Kang and McAuley, 2018; Xu et al., 2025), playing a pivotal role in optimizing traffic allocation and maximizing platform revenue. In this domain, the dominant paradigm has transitioned from static user profiling to Sequential User Behavior Modeling (Kang and McAuley, 2018; Li et al., 2023). By encoding variable-length interaction histories into expressive representations, sequential models capture complex, non-linear dependencies between historical behaviors and current candidates, providing a robust foundation for identifying fine-grained user preferences.

To effectively encode these behavioral signals, the research landscape has evolved from foundational attention mechanisms to high-capacity architectures. Early works like DIN (Zhou et al., 2018) and DIEN (Zhou et al., 2019) pioneered the use of Target Attention, enabling models to track the dynamic evolution of user interests relative to candidate items. Building on this foundation, efforts to enhance model expressiveness have expanded along two principal dimensions. The first dimension involves Extending Sequence Length, where retrieval-based frameworks (e.g., SIM (Pi et al., 2020), ETA (Chen et al., 2021)) filter relevant actions from massive historical data, thereby broadening the receptive field from short-term sessions to lifelong horizons. The second dimension focuses on Broadening Information Width to enrich context representation. Beyond traditional item attributes (e.g., DIF-SR (Xie et al., 2022a)), a burgeoning trend leverages the semantic capabilities of Large Language Models (Yang et al., 2022a; Hu et al., 2024) to integrate heterogeneous multi-modal side information, such as deep textual semantics and visual cues (Zhang et al., 2025), into the recommendation loop. Fundamentally, both streams share a unified objective: augmenting the model’s input space with longer historical sequences and wider information channels.

Figure 1. Illustration of intrinsic blind spots. (a) Micro-Level: Inductive Bias Flaw. The standard Softmax enforces a strict sum-to-one constraint, forcing the model to assign spurious weights to irrelevant noise. (b) Meso-Level: Representation Gap. Static target embeddings fail to capture dynamic intent shifts driven by real-time context cues (e.g., shifting from ”daily meal” to ”social dining”), leading to intent misalignment. (c) Macro-Level: Rigid Fusion. Static aggregation lacks adaptivity, failing to dynamically modulate the trade-off between short-term impulses and long-term habits based on the specific decision scenario.

However, this pursuit of ”Longer and Wider” sequences masks intrinsic flaws in the Micro-level Interaction Mechanism. Most state-of-the-art models still rely on standard Softmax Attention as the atomic operation, which suffers from three critical blind spots:

(1) Lack of ”Rejection” Capability: The standard Softmax function enforces a strict sum-to-one normalization, compelling the model to distribute its entire probability mass across the history sequence. This structural rigidity leads to the ”Attention Sink” phenomenon, yet with a critical distinction in manifestation between NLP and RecSys. In LLMs, residual attention scores—when no meaningful token is found—are typically absorbed by specific ”sink tokens” (e.g., the initial token or delimiters) (Qiu et al., 2025). In contrast, user behavior sequences generally lack such dedicated canonical sinks. Consequently, the ”sink” in sequential recommendation inevitably manifests as noisy behaviors (e.g., accidental clicks or irrelevant history). The model is effectively forced to assign spurious relevance to these noise artifacts, amplifying noise accumulation across long sequences. This highlights an urgent need for sparsity-inducing mechanisms that enable true ”zero-attention” to irrelevant features (Ma et al., 2019; Zaheer et al., 2021).

(2) Static Interaction Paradigm: The conventional assumption that the Target Item (Query) serves as a static anchor overlooks the fluidity of user intent, which is inherently driven by real-time context. This creates a semantic gap, as the user’s intention toward the same item can vary drastically depending on situational cues. For instance, consider the target item ”Baby Cabbage”. On a weekday, it typically signifies a routine ”daily meal” intent; however, on a weekend, if it appears alongside items like ”Hot Pot Soup Base” or ”Shrimp Paste”, the intent shifts significantly to a ”social dining” (hot pot) context. Existing models that treat the target embedding as immutable fail to capture this nuance, often retrieving historically relevant but contextually mismatched behaviors. While approaches like DIEN (Zhou et al., 2019) model interest evolution, they typically fix the query vector during the retrieval process.

(3) Rigid View Aggregation: Existing multi-view frameworks typically resort to static concatenation or summation to merge heterogeneous temporal signals (e.g., real-time triggers vs. long-term memories). This context-agnostic strategy fails to dynamically modulate the importance of each view, allowing macro-level noise from irrelevant time windows to dilute valid signals—especially when user intent shifts rapidly between habitual and impulsive modes. Although methods like DIF-SR (Xie et al., 2022b) and MIND (Li et al., 2019) decouple short- and long-term interests, their fusion stages lack adaptivity. While recent advances propose context-gated fusion mechanisms (Chen et al., 2025; Li et al., 2025) or reinforcement learning-based selection (Ji et al., 2025), a unified adaptive framework remains elusive.

To remedy these fundamental defects, we draw inspiration from parallel advancements in Large Language Models (LLMs). Noting the success of Gated Attention in stabilizing massive activations, we rethink the systemic potential of gating in sequential recommendation. While gating units (e.g., GRU gates (Guo et al., 2022; Chang et al., 2021), MoE routers (Bian et al., 2023; Xu et al., 2024)) have been sporadically applied in RecSys, they are predominantly utilized as isolated tools for sub-tasks like structure pruning or static expert routing. A critical gap remains: current approaches overlook gating as a comprehensive defense mechanism against noise. This motivates our core inquiry: Can we establish a systematic gating philosophy that orchestrates sparsity-based denoising and context-aware calibration across all granularities?

Addressing these challenges, we propose the Gated Adaptive Progressive Network (GAP-Net), a unified framework based on a ”Triple Gating” philosophy. The core objective of GAP-Net is to systematically eliminate interaction noise and calibrate user intent through gating mechanisms orchestrated at the micro, meso, and macro levels. Specifically, the framework consists of three integrated modules: (1) Adaptive Sparse-Gated Attention (ASGA) operates at the micro-feature level, introducing a learnable sparse gating mechanism that relaxes the strict sum-to-one constraint to suppress feature-level noise and mitigate the ”Attention Sink” problem. (2) Gated Cascading Query Calibration (GCQC) operates at the meso-intent level, abandoning the static query assumption to construct a cascaded channel that progressively refines the query vector using real-time contextual triggers, ensuring the retrieval process aligns with dynamic intent. (3) Context-Gated Denoising Fusion (CGDF) operates at the macro-view level, utilizing a purified decision context to adaptively modulate the contribution of heterogeneous temporal views (e.g., real-time vs. long-term sequences), preventing noise propagation from irrelevant time windows.

The main contributions of this paper are summarized as follows:

  • We systematically identify three intrinsic bottlenecks in existing CTR models (the “Attention Sink” phenomenon, the “Static Intent” assumption, and “Rigid View Aggregation”) and introduce a unified “Triple Gating” philosophy to resolve them.

  • We propose GAP-Net, a novel architecture that orchestrates micro-level sparse attention (ASGA), meso-level intent calibration (GCQC), and macro-level dynamic fusion (CGDF) to ensure comprehensive noise resilience.

  • Extensive experiments on industrial datasets demonstrate that GAP-Net achieves state-of-the-art performance, exhibiting superior robustness against interaction noise and intent drift.

2. Related Work

In this section, we review the evolution of Target-Attention-based models and gating mechanisms, highlighting the critical limitations in existing literature that motivate the design of GAP-Net.

Sequential Recommendation. Sequential modeling serves as the backbone of CTR prediction, having evolved from early pooling methods to sophisticated attention-based paradigms. DIN (Zhou et al., 2018) pioneered the Target Attention mechanism to capture diverse user interests relative to candidate items, while successors like DIEN (Zhou et al., 2019) and DSIN (Feng et al., 2019) further incorporated interest evolution and session-level dynamics. However, these foundational approaches predominantly rely on the standard Softmax function, which enforces a strict sum-to-one constraint. This design inherently assumes the presence of relevant items within any history window, compelling the model to allocate probability mass even to purely noisy or irrelevant behaviors. Consequently, these models lack the capability for ”soft rejection,” leading to the accumulation of spurious signals—a phenomenon typically referred to as the ”Attention Sink.”

To address scalability and expressiveness, recent research has bifurcated into two streams. Regarding sequence length, search-based models like SIM (Pi et al., 2020) and ETA (Chen et al., 2021) introduced two-stage retrieval (GSU/ESU) to handle life-long sequences. Nevertheless, these methods rely on rigid hard-retrieval metrics (e.g., category matching) that act as static filters. They fail to account for the fluidity of user intent, often retrieving historically relevant but mismatched items (e.g., recommending ”daily supplies” during a ”gift-giving” context). While recent works such as TWIN (Chang et al., 2023a; Si et al., 2024) and LONGER (Chai et al., 2025) push boundaries to ultra-long horizons, they prioritize sequence length over interaction quality, leaving the underlying sensitivity to noise unresolved. Regarding sequence width, methods such as DIF-SR (Xie et al., 2022a), ASIF (Wang et al., 2024), and CAIN (Guo et al., 2025) focus on fusing heterogeneous side information. However, they typically employ context-agnostic aggregation strategies (e.g., concatenation), which lack the adaptivity to down-weight irrelevant views when the dominant signal shifts between real-time triggers and long-term habits. In contrast, GAP-Net re-imagines the core interaction mechanism, introducing a unified multi-level gating philosophy to systematically calibrate user intent and dynamically fuse heterogeneous views.

Gated Mechanism in Recommendation System. Gating mechanisms have transitioned from foundational RNN components into versatile tools for feature selection and dynamic routing. DCN V2 (Wang et al., 2021) and AdaSparse (Yang et al., 2022b) integrated gating to induce adaptive structural sparsity, while PEPNet (Chang et al., 2023b) leverages GateNet for personalized embedding pruning. In multi-task settings, frameworks like PLE (Tang et al., 2020) and M2M (Zhang et al., 2022) employ gating to route samples to specific experts. Despite their success, these approaches typically deploy gating as isolated modules for structural pruning or static routing. They treat gating primarily as a static feature selector rather than an interaction refinement mechanism, failing to orchestrate gating dynamically across the temporal dimension. Consequently, they cannot effectively prevent noise propagation during the sequential evolution of user intent. GAP-Net fills this void by establishing a unified ”Triple Gating” architecture that progressively filters noise and refines intent from micro-level features to macro-level views.

3. Method

In this section, we present the proposed Gated Adaptive Progressive Network (GAP-Net), a unified framework designed to calibrate user intent via a multi-level gating philosophy. As illustrated in Figure 2, GAP-Net establishes a ”Triple Gating” architecture that filters noise across three granularities. The framework comprises three core modules: (1) Adaptive Sparse-Gated Attention for micro-level feature denoising; (2) Gated Cascading Query Calibration for meso-level intent evolution; and (3) Context-Gated Denoising Fusion for macro-level view modulation. We detail the design and implementation of each component in the following subsections.

Figure 2. An Overview of the proposed GAP-Net. The framework employs a ”Triple Gating” architecture for progressive denoising and calibration: (a) Micro-Level (ASGA): Replaces Softmax with learnable sparse gating, eliminating the strict sum-to-one constraint. (b) Meso-Level (GCQC): Evolves static target embeddings by fusing real-time context triggers. (c) Macro-Level (CGDF): A context-aware network that dynamically modulates fusion weights for heterogeneous views.

3.1. Problem Definition

The objective of Click-Through Rate (CTR) prediction is to estimate the probability that a target user will engage with a candidate item within a specific context. Let $\mathcal{U}$ and $\mathcal{I}$ denote the sets of users and items, respectively. For a given instance comprising a user $u\in\mathcal{U}$, a candidate item $v^{+}\in\mathcal{I}$, and a serving context $c$, the model input consists of static features and multi-granularity behavior sequences: (1) Static Features: The user profile $\mathbf{x}_{u}$, candidate item attributes $\mathbf{x}_{v^{+}}$, and context features $\mathbf{x}_{c}$; (2) Multi-Granularity Behavior Sequences: The user’s historical interactions are partitioned into three temporal views: Long-term behaviors $\mathbf{S}^{\mathrm{lt}}=[v_{1},\ldots,v_{T_{\mathrm{lt}}}]$, Short-term behaviors $\mathbf{S}^{\mathrm{st}}=[v_{T_{\mathrm{lt}}+1},\ldots,v_{T_{\mathrm{lt}}+T_{\mathrm{st}}}]$, and Real-time behaviors $\mathbf{S}^{\mathrm{rt}}=[v_{T_{\mathrm{lt}}+T_{\mathrm{st}}+1},\ldots,v_{T}]$.

Here, $T=T_{\mathrm{lt}}+T_{\mathrm{st}}+T_{\mathrm{rt}}$ denotes the total sequence length. The typical magnitudes for these sequences are $T_{\mathrm{lt}}\sim 10^{3}\text{-}10^{4}$ (lifelong history), $T_{\mathrm{st}}\sim 10^{2}\text{-}10^{3}$ (recent interests), and $T_{\mathrm{rt}}\sim 10^{0}\text{-}10^{2}$ (current session triggers).
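As a rough illustration of this partitioning, the sketch below splits a chronologically ordered history (oldest first) into the three views. The function name `partition_views` and the choice to cut by fixed counts `t_st` and `t_rt` are assumptions made for illustration; the text does not say whether the boundaries are set by counts or by time windows.

```python
def partition_views(history, t_st, t_rt):
    """Split an oldest-first history into (long-term, short-term, real-time).

    Assumption: the views are cut by item counts, with the newest t_rt items
    forming the real-time view and the t_st items before them the short-term view.
    """
    if t_st < 0 or t_rt < 0:
        raise ValueError("view lengths must be non-negative")
    total = len(history)
    rt_start = max(total - t_rt, 0)      # first index of the real-time view
    st_start = max(rt_start - t_st, 0)   # first index of the short-term view
    return history[:st_start], history[st_start:rt_start], history[rt_start:]


# Toy history of 10 item ids, oldest first.
s_lt, s_st, s_rt = partition_views(list(range(1, 11)), t_st=3, t_rt=2)
```

With this convention the three views are contiguous and their lengths always sum to the total history length, matching $T=T_{\mathrm{lt}}+T_{\mathrm{st}}+T_{\mathrm{rt}}$.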

All categorical and numerical features are mapped into dense embedding vectors $\mathbf{e}_{u},\mathbf{e}_{c},\mathbf{e}^{+}\in\mathbb{R}^{d}$. Similarly, each historical interaction item $v_{t}$ is embedded as $\mathbf{e}_{v_{t}}\in\mathbb{R}^{d}$, yielding the comprehensive behavior embedding matrices $\mathbf{E}_{\mathrm{lt}}$, $\mathbf{E}_{\mathrm{st}}$, and $\mathbf{E}_{\mathrm{rt}}$. The complete input representation for the CTR model is formalized as:

(1) $\mathcal{X}=\left(\mathbf{e}_{u},\mathbf{e}_{c},\mathbf{e}^{+},\mathbf{E}_{\mathrm{lt}},\mathbf{E}_{\mathrm{st}},\mathbf{E}_{\mathrm{rt}}\right).$

The model learns a mapping function $f:\mathcal{X}\mapsto\hat{y}\in[0,1]$, where $\hat{y}$ represents the predicted probability of a positive interaction (e.g., click or purchase). Given a training dataset $\mathcal{D}=\{(\mathcal{X}^{(i)},y^{(i)})\}_{i=1}^{N}$ with binary labels $y^{(i)}\in\{0,1\}$, the parameters of $f(\cdot)$ are optimized by minimizing the binary cross-entropy loss:

(2) $\mathcal{L}=-\frac{1}{N}\sum_{i=1}^{N}\left[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right],$

where $\hat{y}^{(i)}=f(\mathcal{X}^{(i)})$.
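For concreteness, the binary cross-entropy objective in Eq. (2) can be sketched in a few lines. The function name `bce_loss` and the clipping constant `eps` are assumptions added for numerical safety; they are not part of the formulation above.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over N samples, following Eq. (2).

    eps clips predictions away from 0 and 1 so that log() stays finite
    (an implementation convenience, not part of the stated loss).
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# An uninformative predictor (always 0.5) yields a loss of log(2).
uninformed = bce_loss([1, 0], [0.5, 0.5])
```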

Our work primarily focuses on designing the sequential encoding architecture within $f(\cdot)$. Specifically, we aim to model $\mathbf{S}^{\mathrm{lt}}$, $\mathbf{S}^{\mathrm{st}}$, and $\mathbf{S}^{\mathrm{rt}}$ in a disentangled, multi-granular, and context-aware manner, while seamlessly integrating them with static features for end-to-end CTR prediction.

3.2. Adaptive Sparse-Gated Attention (ASGA)

Conventional Target Attention paradigms (e.g., DIN) typically employ linear projections for Query, Key, and Value, followed by Softmax normalization. However, this architecture faces two intrinsic limitations in modeling behaviors: (1) Representation Bottleneck: Simple linear mappings lack the non-linearity required to capture high-order feature interactions between the target item and history; and (2) Noise Propagation: Softmax normalization enforces a strict sum-to-one constraint, compelling the model to assign probability mass to noisy or irrelevant behaviors even when the current intent is semantically disconnected from past actions. This issue is analogous to the ”Attention Sink” phenomenon observed in LLMs (Xiao et al., 2023).
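The sum-to-one constraint can be seen in a small numeric sketch: even when every behavior receives the same very low relevance score, Softmax still distributes the full probability mass across them. The scores below are made-up values chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Four behaviors that are all (hypothetically) irrelevant to the target.
irrelevant_scores = [-9.0, -9.0, -9.0, -9.0]
weights = softmax(irrelevant_scores)
```

Each behavior still receives a weight of 0.25 because Softmax only sees relative differences between scores, so it has no way to express "none of these matter".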

To address these challenges, we propose Adaptive Sparse-Gated Attention (ASGA). This module integrates Pre-Attention Feature Sifting (PAFS) to refine input embeddings and a Query-Guided Adaptive Output Gating (QAOG) mechanism to dynamically modulate the attention output based on the decision context.

3.2.1. Pre-Attention Feature Sifting

Prior to computing interaction scores, it is essential to filter feature noise and enhance the representational capacity of the embeddings. Departing from standard linear projections, we implement a Gated Feed-Forward Network (SwiGLU-FFN) as a learnable feature sifter.

Let $\mathbf{x}\in\mathbb{R}^{d}$ denote an input embedding (corresponding to either the target item $\mathbf{e}^{+}$ or a sequence item $\mathbf{e}_{v_{t}}$). We first project $\mathbf{x}$ into a higher-dimensional latent space of dimension $d^{\prime}$ (defined as the next power of 2 relative to the hidden size) through two parallel layers: a gating path and an information path. We formally define the operation as follows:

(3) $\mathbf{h}_{\text{gate}}=\text{Swish}(\mathbf{x}\mathbf{W}_{g}+\mathbf{b}_{g}),$
$\mathbf{h}_{\text{up}}=\mathbf{x}\mathbf{W}_{u}+\mathbf{b}_{u},$

where $\mathbf{W}_{g},\mathbf{W}_{u}\in\mathbb{R}^{d\times d^{\prime}}$ are the gate and up-projection matrices, respectively, and $\text{Swish}(z)=z\cdot\sigma(z)$ denotes the non-linear activation function. The feature filtering is performed via element-wise interaction, followed by a down-projection to restore the original embedding dimension:

(4) $\text{PAFS}(\mathbf{x})=(\mathbf{h}_{\text{gate}}\odot\mathbf{h}_{\text{up}})\mathbf{W}_{d}+\mathbf{b}_{d}$

where $\mathbf{W}_{d}\in\mathbb{R}^{d^{\prime}\times d}$ is the down-projection matrix. Applying this transformation to the target embedding (written $\mathbf{e}_{t}$ from here on, i.e., $\mathbf{e}^{+}$) and to a behavior embedding matrix $\mathbf{E}_{s}$ yields:

(5) $\tilde{\mathbf{e}}_{t}=\text{PAFS}(\mathbf{e}_{t}),\quad\tilde{\mathbf{E}}_{s}=\text{PAFS}(\mathbf{E}_{s})$

This expansion-compression architecture functions as a learnable information bottleneck. The Swish gate ($\mathbf{h}_{\text{gate}}$) dynamically suppresses noisy signals within the expanded feature space, while the down-projection ($\mathbf{W}_{d}$) synthesizes the filtered features back into a compact representation, ensuring that only high-quality signals propagate to the subsequent attention mechanism.
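A minimal pure-Python sketch of Eqs. (3) and (4) for a single embedding vector follows. The helper names (`sigmoid`, `swish`, `linear`, `pafs`) and the toy weights are assumptions for illustration; a real implementation would use a tensor library and learned parameters.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(z):
    """Swish(z) = z * sigmoid(z)."""
    return z * sigmoid(z)


def linear(x, w, b):
    """Row vector x (length n) times matrix w (n x m), plus bias b (length m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def pafs(x, w_g, b_g, w_u, b_u, w_d, b_d):
    """SwiGLU-style sifter: gate path times information path, then project down."""
    h_gate = [swish(z) for z in linear(x, w_g, b_g)]   # Eq. (3), gating path
    h_up = linear(x, w_u, b_u)                          # Eq. (3), information path
    gated = [g * u for g, u in zip(h_gate, h_up)]       # element-wise product
    return linear(gated, w_d, b_d)                      # Eq. (4), back to size d


# Toy case with d = 1 and d' = 1 so the result can be checked by hand.
out = pafs([2.0], [[1.0]], [0.0], [[3.0]], [1.0], [[1.0]], [0.0])
```

In the toy call, the information path evaluates to 7 and the gate to Swish(2), so the output is their product. When the gate pre-activation is zero, Swish returns zero and only the down-projection bias passes through.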

3.2.2. Query-Guided Adaptive Output Gating

Inspired by the Gated Attention mechanism in Qwen (Qiu et al., 2025), we adapt this concept for target attention. Unlike standard multi-head attention, we introduce a Query-Guided gating mechanism. The fundamental premise is that the validity of the retrieved history is intrinsically dependent on the current target item (Query). If a user engages with an item contextually isolated from their history (e.g., accidental click), the model should autonomously suppress the historical context.

Specifically, we expand the Query projection to jointly learn the query representation and a relevance gate. For the $h$-th head:

(6) $[\mathbf{Q}^{(h)},\mathbf{G}_{\text{logit}}^{(h)}]=\text{Split}(\text{PAFS}(\mathbf{e}_{t})\mathbf{W}_{Q}^{(h)})$

where $\mathbf{W}_{Q}^{(h)}\in\mathbb{R}^{d\times 2d_{k}}$. The projection is split along the last dimension to yield the query vector $\mathbf{Q}^{(h)}\in\mathbb{R}^{d_{k}}$ and the gating logit $\mathbf{G}_{\text{logit}}^{(h)}\in\mathbb{R}^{d_{k}}$. Keys ($\mathbf{K}$) and Values ($\mathbf{V}$) are projected normally from the sequence inputs:

(7) $\mathbf{K}=\tilde{\mathbf{E}}_{s}\mathbf{W}_{K},\quad\mathbf{V}=\tilde{\mathbf{E}}_{s}\mathbf{W}_{V}$

We then modulate the standard Scaled Dot-Product Attention (SDPA) output $\mathbf{H}_{\text{att}}^{(h)}$ using the learned query gate:

(8) $\mathbf{H}_{\text{att}}^{(h)}=\text{Softmax}\left(\frac{\mathbf{Q}^{(h)}(\mathbf{K}^{(h)})^{\top}}{\sqrt{d_{k}}}\right)\mathbf{V}^{(h)},$
$\mathbf{H}_{\text{final}}^{(h)}=\mathbf{H}_{\text{att}}^{(h)}\odot\sigma(\mathbf{G}_{\text{logit}}^{(h)})$

where $\sigma(\cdot)$ is the Sigmoid function. While Softmax determines the relative importance distribution, the Sigmoid gate assesses the absolute confidence of the query’s need for context. If $\sigma(\mathbf{G}_{\text{logit}}^{(h)})$ approaches zero, the history is effectively ignored, eliminating the strict sum-to-one constraint and preventing noise propagation.
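The single-head sketch below follows Eq. (8): ordinary scaled dot-product attention, then an element-wise Sigmoid gate driven by the query side. It assumes the query `q` and gate logits `g_logit` have already been obtained from the split in Eq. (6), and that values share the key dimension $d_{k}$; the function name `gated_target_attention` and the toy inputs are illustrative only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_target_attention(q, g_logit, keys, values):
    """One head of Eq. (8): softmax attention output scaled by sigmoid(g_logit)."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    attn = softmax(scores)                      # relative importance, sums to 1
    h_att = [sum(w * v[j] for w, v in zip(attn, values)) for j in range(d_k)]
    return [h * sigmoid(g) for h, g in zip(h_att, g_logit)]  # absolute gate


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
q = [0.0, 0.0]                                  # uniform attention for clarity
open_gate = gated_target_attention(q, [50.0, 50.0], keys, values)
closed_gate = gated_target_attention(q, [-50.0, -50.0], keys, values)
```

With strongly positive logits the output matches plain attention, while strongly negative logits drive it toward zero regardless of the Softmax weights, which is the rejection behavior described above.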

3.3. Gated Cascading Query Calibration (GCQC)

Conventional sequential models typically operate under a Static Query Assumption, treating the target embedding $\mathbf{e}^{+}$ as an immutable anchor. This creates a semantic gap: user intent toward the same target item is fluid and contingent on real-time context. Directly querying long-term history with a static target often retrieves contextually irrelevant behaviors.

To bridge this gap, we propose Gated Cascading Query Calibration (GCQC). Unlike static retrieval, GCQC adopts a Gated Hierarchy strategy, employing a chain of Calibration Gating Units (CGU) to dynamically evolve the query vector.

3.3.1. Hierarchical Sequence Partitioning

Consistent with the Problem Definition, we stratify historical sequences (processed via PAFS) into three views:

  • Real-Time View ($\tilde{\mathbf{E}}_{\mathrm{rt}}$): The most recent $T_{\mathrm{rt}}$ interactions, representing immediate triggers and impulse intent.

  • Short-Term View ($\tilde{\mathbf{E}}_{\mathrm{st}}$): Recent window interactions, capturing temporary interests.

  • Long-Term View ($\tilde{\mathbf{E}}_{\mathrm{lt}}$): Extensive history of stable preferences and habits.

3.3.2. Gated Query Evolution

Let $\text{ASGA}(\mathbf{Q},\mathbf{S})$ denote the attention operation defined in Section 3.2. The query evolves as follows:

Stage 1: Real-Time Context Injection

The initial query is derived from the target item: $\mathbf{Q}_{0}=\tilde{\mathbf{e}}_{t}$. We first query the Real-Time View $\tilde{\mathbf{E}}_{\mathrm{rt}}$. To fuse immediate context while preventing noisy drift, we introduce the first CGU:

(9) $\mathbf{H}_{rt}=\text{ASGA}(\mathbf{Q}_{0},\tilde{\mathbf{E}}_{\mathrm{rt}}),$
$\mathbf{z}_{1}=\sigma([\mathbf{Q}_{0};\mathbf{H}_{rt}]\mathbf{W}_{z1}+\mathbf{b}_{z1}),$
$\mathbf{Q}_{rt}=(1-\mathbf{z}_{1})\odot\mathbf{Q}_{0}+\mathbf{z}_{1}\odot\mathbf{H}_{rt}$

Here, $\mathbf{z}_{1}$ serves as an “Intent Update Gate”. If real-time behaviors are relevant, $\mathbf{z}_{1}$ activates to inject context; otherwise, the gate closes to preserve original target semantics.
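The gate update in Eq. (9) can be sketched as follows. The shape of `w_z` (a $2d\times d$ matrix acting on the concatenation) is an assumption inferred from the element-wise products in the equation, and the name `calibration_gate` is hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibration_gate(q0, h_rt, w_z, b_z):
    """z = sigmoid([q0; h_rt] W_z + b_z); returns (1 - z) * q0 + z * h_rt."""
    d = len(q0)
    concat = q0 + h_rt                           # list concatenation, length 2d
    z = [sigmoid(sum(concat[i] * w_z[i][j] for i in range(2 * d)) + b_z[j])
         for j in range(d)]
    return [(1.0 - zj) * a + zj * b for zj, a, b in zip(z, q0, h_rt)]


q0, h_rt = [1.0, 0.0], [0.0, 1.0]
zero_w = [[0.0, 0.0] for _ in range(4)]          # gate decided by the bias only
kept = calibration_gate(q0, h_rt, zero_w, [-50.0, -50.0])    # gate closed
updated = calibration_gate(q0, h_rt, zero_w, [50.0, 50.0])   # gate open
```

A closed gate returns the original target query almost unchanged, and an open gate replaces it with the real-time context, so the calibrated query is always a per-dimension interpolation between the two.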

Stage 2: Short-Term Intent Rectification

Crucially, we use the Real-Time Calibrated Query $\mathbf{Q}_{rt}$, rather than the static target, to query the short-term history. This ensures retrieval is guided by the user’s current intent. A second CGU refines this representation:

(10) $\mathbf{H}_{st}=\text{ASGA}(\mathbf{Q}_{rt},\tilde{\mathbf{E}}_{\mathrm{st}})$

In this stage, $\mathbf{Q}_{rt}$ acts as a filter: only short-term behaviors that resonate with the real-time context are aggregated into $\mathbf{H}_{st}$.

Stage 3: Context-Aware Long-Term Retrieval

As in the short-term stage, $\mathbf{Q}_{rt}$ is also used to retrieve relevant memories from the extensive Long-Term View $\tilde{\mathbf{E}}_{\mathrm{lt}}$. Using a precise, context-aware query is vital for accurate retrieval from noisy long sequences:

(11) $\mathbf{H}_{lt}=\text{ASGA}(\mathbf{Q}_{rt},\tilde{\mathbf{E}}_{\mathrm{lt}})$

Through this gated cascade, GCQC effectively transforms the retrieval probability from the traditional $P(\text{History}|\text{Target})$ to a context-aware formulation $P(\text{History}|\text{Target},\text{RealTime})$, ensuring query evolution is driven by high-confidence signals.
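The control flow of the three stages (Eqs. 9 to 11) can be sketched with simple stand-ins. Here `softmax_pool` replaces ASGA with plain dot-product Softmax pooling and `half_blend` replaces the CGU with a gate fixed at 0.5; both are simplifications assumed purely to keep the example short, and `gcqc_cascade` is a hypothetical name.

```python
import math


def softmax_pool(query, seq):
    """Stand-in for ASGA: dot-product softmax pooling without any gating."""
    scores = [sum(q * k for q, k in zip(query, item)) for item in seq]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * item[j] for e, item in zip(exps, seq))
            for j in range(len(query))]


def half_blend(q0, h):
    """Stand-in for the CGU with the update gate fixed at 0.5."""
    return [0.5 * a + 0.5 * b for a, b in zip(q0, h)]


def gcqc_cascade(target, e_rt, e_st, e_lt, attend=softmax_pool, gate=half_blend):
    """Stage 1 calibrates the query; stages 2 and 3 reuse the calibrated query."""
    q0 = target
    h_rt = attend(q0, e_rt)        # Eq. (9): query the real-time view
    q_rt = gate(q0, h_rt)          # Eq. (9): calibrated query
    h_st = attend(q_rt, e_st)      # Eq. (10)
    h_lt = attend(q_rt, e_lt)      # Eq. (11)
    return q_rt, h_rt, h_st, h_lt


target = [1.0, 0.0]
e_rt = [[0.0, 1.0]]
e_st = [[2.0, 0.0], [0.0, 2.0]]
e_lt = [[4.0, 0.0]]
q_rt, h_rt, h_st, h_lt = gcqc_cascade(target, e_rt, e_st, e_lt)
static_h_st = softmax_pool(target, e_st)   # what a static query would retrieve
```

In this toy case the static target attends mostly to the first short-term item, whereas the calibrated query weighs both items equally, showing how the real-time view changes what is retrieved downstream.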

3.4. Context-Gated Denoising Fusion (CGDF)

Existing frameworks typically adopt ”hard concatenation” to merge temporal views, assuming all views are equally reliable. This inductive bias is flawed: view relevance is highly context-dependent (e.g., Long-Term dominates ”repurchase”, Real-Time dominates ”impulse”). To address this, we propose Context-Gated Denoising Fusion (CGDF), which employs Gated Context Purification and Gated View Modulation.

3.4.1. Gated Context Purification

Let $\mathcal{V}=\{\mathbf{H}_{rt},\mathbf{H}_{st},\mathbf{H}_{lt}\}$ denote the calibrated representations from GCQC. To determine fusion weights, we construct a raw decision anchor $\mathbf{z}_{raw}$ by concatenating the refined target, context features, and view outputs:

(12) $\mathbf{z}_{raw}=\text{Concat}(\tilde{\mathbf{e}}_{t},\mathbf{e}_{c},\mathbf{H}_{rt},\mathbf{H}_{st},\mathbf{H}_{lt})$

To filter noise (e.g., spurious correlations), we subject the anchor to a non-linear filtration using the SwiGLU-FFN architecture (identical to PAFS in Sec. 3.2):

(13) $\mathbf{z}_{denoised}=\text{SwiGLU-FFN}(\mathbf{z}_{raw})$

This acts as a learnable filter, focusing the subsequent gating network on high-order interaction signals.

3.4.2. Gated View Modulation

The purified anchor $\mathbf{z}_{denoised}$ is passed through an MLP to project the context into a view-weighting space:

(14) $\mathbf{h}_{gate}=\text{MLP}(\mathbf{z}_{denoised})$

Subsequently, we compute the adaptive fusion weights via a linear projection followed by Softmax normalization along the view dimension:

(15) $\boldsymbol{\alpha}=\text{Softmax}(\mathbf{h}_{gate}\mathbf{W}_{logit})$

where $\boldsymbol{\alpha}=[\alpha_{rt},\alpha_{st},\alpha_{lt}]\in\mathbb{R}^{3}$ represents the learned importance distribution over the three temporal views, satisfying $\sum_{k}\alpha_{k}=1$. The final fused representation $\mathbf{v}_{final}$ is obtained by the weighted aggregation (concatenation) of the expert views:

(16) $\mathbf{v}_{final}=\text{Concat}(\alpha_{rt}\mathbf{H}_{rt},\alpha_{st}\mathbf{H}_{st},\alpha_{lt}\mathbf{H}_{lt})$

Here, scalar weights $\alpha_{k}$ are broadcast across embedding dimensions. This “Soft-Selection” mechanism empowers GAP-Net to dynamically suppress irrelevant temporal windows (e.g., $\alpha_{lt}\to 0$ during intent drift) while amplifying pertinent signals.
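The weighting and concatenation in Eqs. (15) and (16) can be sketched as below. The three `logits` stand in for the product $\mathbf{h}_{gate}\mathbf{W}_{logit}$; the upstream SwiGLU-FFN and MLP are omitted, and the function name `fuse_views` and the toy vectors are assumptions for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(logits, h_rt, h_st, h_lt):
    """Softmax over three view logits, then concatenate the scaled views."""
    alpha = softmax(logits)                   # [alpha_rt, alpha_st, alpha_lt]
    fused = []
    for a, view in zip(alpha, (h_rt, h_st, h_lt)):
        fused.extend(a * x for x in view)     # scalar weight broadcast over dims
    return alpha, fused


h_rt, h_st, h_lt = [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]
alpha_even, fused_even = fuse_views([0.0, 0.0, 0.0], h_rt, h_st, h_lt)
alpha_drift, fused_drift = fuse_views([0.0, 0.0, -50.0], h_rt, h_st, h_lt)
```

Because the views are concatenated rather than summed, the fused vector keeps length $3d$; a very low logit for the long-term view simply scales its slice toward zero.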

4. Experiments

In this section, we conduct extensive offline and online experiments to validate the effectiveness of GAP-Net on CTR tasks. Our analysis of the experimental results addresses the following research questions:

  • RQ1: How does GAP-Net perform when compared with other state-of-the-art (SOTA) CTR models?

  • RQ2: What is the influence on the performance of the core components in GAP-Net?

  • RQ3: How do different components contribute to the effectiveness of GAP-Net?

  • RQ4: How does GAP-Net perform in real industrial systems?

4.1. Experimental Setting

4.1.1. Dataset

To validate the effectiveness of our method in recommendation systems, we construct a dataset, XMart, from real user interactions in our own scenario. Additionally, we evaluate our model on the public recommendation dataset KuaiVideo (https://huggingface.co/datasets/reczoo/KuaiVideo_XLong).

XMart. The XMart dataset, which contains the properties of users and items as well as users’ historical behaviors (including click, add-to-cart, and purchase), is generated from user logs collected from November 9th to 16th, 2025. Logs from the first 7 days of this period are used as training data, while the last day is reserved for validation and testing. In the training set, products that were impressed but not purchased are treated as negative samples, and purchased products as positive samples.

KuaiVideo. Derived from the Kuaishou Challenge presented at the China MM 2018 conference, this dataset is tailored for micro-video Click-Through Rate (CTR) prediction. It encapsulates a rich set of user-video interaction dynamics, recording specific behavior types including “click”, “like”, and “follow”, alongside passive negative feedback (“not click” after thumbnail impression). For our experiments, we constructed a dense subset comprising 3,239,534 sequential interaction records sampled from 10,000 users. While absolute timestamps are anonymized, the relative temporal order of behaviors is preserved to facilitate sequential analysis. In this setup, positive samples are defined by explicit engagement behaviors (e.g., clicks), whereas impressions without subsequent clicks are treated as negative samples.

4.1.2. Compared Methods

To verify the effectiveness of the proposed method, we compare it with the following methods:

  • DIN (Zhou et al., 2018): It utilizes attention mechanism to activate relevant users’ behaviors with respect to corresponding targets and learns an adaptive representation vector for users’ interests.

  • ETA (Chen et al., 2021): It proposes an end-to-end target attention framework using Locality-Sensitive Hashing (SimHash). By retrieving top-k relevant behaviors via efficient Hamming distance calculation, it captures long-term user interests while satisfying strict inference time constraints.

  • SDIM (Cao et al., 2022): It introduces a hash sampling-based strategy to approximate the target attention distribution. By directly gathering behavior items that share the same hash signatures with the candidate item, it models long-term user interests with linear time complexity.
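To make the hashing-based retrieval used by ETA more concrete, the following is a minimal sketch of SimHash signatures with top-k selection by Hamming distance. It assumes the common random-hyperplane form of SimHash; the helper names (`make_planes`, `simhash`, `top_k_by_hamming`) and all sizes are hypothetical, and the code is not the implementation from the cited papers.

```python
import random


def make_planes(num_bits, dim, seed=0):
    """Draw `num_bits` random hyperplanes in `dim` dimensions (assumed Gaussian)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vec, planes):
    """One bit per hyperplane: 1 if the projection is non-negative, else 0."""
    return tuple(
        1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def top_k_by_hamming(target, behaviors, planes, k):
    """Indices of the k behaviors whose signatures are closest to the target's."""
    t_sig = simhash(target, planes)
    order = sorted(
        range(len(behaviors)),
        key=lambda i: hamming(t_sig, simhash(behaviors[i], planes)),
    )
    return order[:k]
```

In a serving system the behavior signatures would typically be precomputed, so only the candidate item is hashed at request time. SDIM, as described above, instead gathers the behaviors whose signatures match the candidate's exactly, rather than ranking them by distance.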

4.1.3. Metrics

To provide a comprehensive assessment of ranking efficacy, we employ three widely adopted metrics standard in industrial recommendation scenarios: AUC (Area Under ROC Curve) to measure the model’s fundamental discriminative power, along with NDCG@K (Normalized Discounted Cumulative Gain) and MAP (Mean Average Precision) to evaluate the quality of the top-K ranking list.

Table 1. Dataset Statistics
| Dataset | Users | Items | Interactions | Avg. interactions per user |
| XMart | 8,678,328 | 25,033 | 1,463,105,174 | 168.59 |
| KuaiVideo | 10,001 | 3,239,535 | 13,661,383 | 1366.14 |

Consistent with real-world serving systems, our evaluation is conducted on a per-request basis. Specifically, for each user request $u$, the model scores and ranks a candidate set $\mathcal{I}_{u}$ composed of positive interactions $\mathcal{I}_{u}^{+}$ (clicked) and negative samples $\mathcal{I}_{u}^{-}$ (unclicked). The metrics are formalized as follows:

$$
\begin{aligned}
\text{AUC} &= \frac{1}{|\mathcal{D}|}\sum_{(i,j)\in\mathcal{D}}\mathbb{I}(\hat{y}_{i}>\hat{y}_{j}) \\
\text{NDCG}@K &= \frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}\left(\frac{\text{DCG}_{u}@K}{\text{IDCG}_{u}@K}\right) \\
\text{MAP} &= \frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}\left(\frac{1}{|\mathcal{I}_{u}^{+}|}\sum_{k=1}^{|\mathcal{I}_{u}|}P(k)\cdot r_{k}\right)
\end{aligned}
\tag{17}
$$

where the terms are defined as follows:

  • $\mathcal{D}=\{(i,j)\mid y_{i}=1,\,y_{j}=0\}$ denotes the set of all comparable positive-negative pairs across the dataset, and $\mathbb{I}(\cdot)$ is the indicator function.

  • For top-K ranking, $r_{k}\in\{0,1\}$ represents the ground-truth relevance at rank $k$, and $P(k)$ denotes the precision at cut-off $k$.

  • $\text{DCG}_{u}@K=\sum_{k=1}^{K}\frac{2^{r_{k}}-1}{\log_{2}(k+1)}$ accumulates the graded relevance with logarithmic decay, while $\text{IDCG}_{u}@K$ represents the score of the ideal ordering.
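The definitions in Eq. (17) can be illustrated with a small pure-Python sketch for a single request. The function names are hypothetical; the only assumptions beyond the formulas are that score ties are broken by input order and that a request without positives contributes 0 to NDCG and MAP. The dataset-level NDCG@K and MAP are the means of the per-request values over $\mathcal{U}$, and the AUC helper can be applied to the labels and scores of all requests pooled together.

```python
import math


def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores strictly higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # no comparable pairs
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))


def ranked_relevance(labels, scores):
    """Labels reordered by descending score, i.e. r_1, r_2, ... in Eq. (17)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [labels[i] for i in order]


def dcg_at_k(relevance, k):
    """Sum over ranks 1..K of (2^r_k - 1) / log2(k + 1)."""
    return sum((2 ** r - 1) / math.log2(rank + 1)
               for rank, r in enumerate(relevance[:k], start=1))


def ndcg_at_k(labels, scores, k):
    """DCG@K of the model ranking divided by DCG@K of the ideal ranking."""
    idcg = dcg_at_k(sorted(labels, reverse=True), k)
    if idcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevance(labels, scores), k) / idcg


def average_precision(labels, scores):
    """(1 / |I_u^+|) * sum_k P(k) * r_k for one request."""
    relevance = ranked_relevance(labels, scores)
    num_pos = sum(relevance)
    if num_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            total += hits / k  # precision at cut-off k
    return total / num_pos


# One request with four candidates, two of them clicked.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.4, 0.1]
example_auc = auc(example_labels, example_scores)
example_ndcg = ndcg_at_k(example_labels, example_scores, k=2)
example_ap = average_precision(example_labels, example_scores)
```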

4.1.4. Implementation Details

All models are implemented in Industrial-TensorFlow 1.15 with Python 2.7 and trained on NVIDIA A100 GPU hardware. We initialize all model parameters with Xavier initialization (Glorot and Bengio, 2010) and optimize with the Adam optimizer (Kingma, 2014), using a learning rate of 0.001 and a batch size of 512.

Table 2. Comprehensive performance comparison on XMart (Click/Purchase) and KuaiVideo (Click).
| Model | XMart Click AUC | XMart Click NDCG | XMart Click MAP | XMart Purchase AUC | XMart Purchase NDCG | XMart Purchase MAP | KuaiVideo Click AUC | KuaiVideo Click NDCG | KuaiVideo Click MAP |
| DIN | 0.6992 | 0.5401 | 0.3786 | 0.7587 | 0.5579 | 0.4142 | 0.6721 | 0.6999 | 0.3767 |
| DIN w/ GAP | 0.7062 | 0.5453 | 0.3848 | 0.7661 | 0.5638 | 0.4213 | 0.6759 | 0.7057 | 0.3809 |
| ETA | 0.6936 | 0.5362 | 0.3739 | 0.7547 | 0.5548 | 0.4104 | 0.6730 | 0.7021 | 0.3785 |
| ETA w/ GAP | 0.7053 | 0.5446 | 0.3841 | 0.7647 | 0.5626 | 0.4198 | 0.6763 | 0.7062 | 0.3818 |
| SDIM | 0.7034 | 0.5427 | 0.3816 | 0.7614 | 0.5599 | 0.4164 | 0.6738 | 0.7026 | 0.3792 |
| SDIM w/ GAP | 0.7056 | 0.5450 | 0.3844 | 0.7645 | 0.5630 | 0.4202 | 0.6792 | 0.7078 | 0.3834 |

4.2. Overall Performance (RQ1)

The comprehensive performance comparison on the XMart and KuaiVideo datasets is reported in Table 2. From the experimental results, we draw three key observations regarding the effectiveness of GAP-Net:

Universal Compatibility across Diverse Architectures. A primary observation is that incorporating GAP-Net yields consistent improvements across all baselines, regardless of their underlying modeling paradigms. Specifically, for DIN, which represents foundational Target-Attention models processing relatively short sequences, GAP-Net achieves a relative AUC lift of +1.00% and +0.97% on the XMart Click and Purchase tasks, respectively. Search-based models such as ETA and SDIM, which are designed to handle ultra-long sequences, also benefit: their relative gains in XMart Click AUC range from +0.31% (SDIM) to +1.69% (ETA, reaching 0.7053). This universality suggests that the “Triple Gating” philosophy addresses fundamental bottlenecks common to both architectures: it not only fixes the “Inductive Bias Flaw” in standard attention (benefiting DIN) but also provides a dynamic calibration mechanism for long-sequence retrieval (benefiting ETA/SDIM), making GAP-Net a robust, plug-and-play solution for sequential modeling.

Enhanced Intent Resolution in High-Value Conversion Tasks. Beyond basic Click-Through Rate (CTR) prediction, GAP-Net remains effective on “high-value” conversion tasks such as the Purchase task in XMart, which inherently requires a deeper understanding of user intent than simple clicks. As shown in Table 2, every baseline improves on Purchase prediction. For instance, DIN's Purchase AUC rises from 0.7587 to 0.7661 (+0.97%), on par with its +1.00% lift in Click AUC, and SDIM gains more on Purchase (+0.41%) than on Click (+0.31%). We attribute this to the Meso-Level Gated Cascading Query Calibration (GCQC). Conversion behaviors are often sparse and driven by highly specific, context-dependent intent. By progressively calibrating the query with real-time triggers, GAP-Net can filter out shallow “browsing noise” and lock onto the strong purchase signals buried in the history, so that its gains carry over to harder, intent-heavy tasks.

Superior Ranking Stability via Noise Suppression. In addition to the binary classification metric (AUC), the GAP-Net-enhanced models show consistent gains in list-wise ranking metrics, specifically NDCG and MAP, across both datasets. For example, on the XMart Click task, DIN + GAP boosts NDCG from 0.5401 to 0.5453 (+0.96%) and MAP from 0.3786 to 0.3848. Standard Softmax-based models suffer from the “Attention Sink” effect, where the model is forced to allocate probability mass to irrelevant noisy items, which can push them toward the top of the recommendation list as “false positives.” By applying Adaptive Sparse-Gated Attention (ASGA) at the micro-level, GAP-Net performs “soft rejection” on these low-confidence signals, driving their contribution toward zero. This cleans the decision boundary and helps ensure that the items ranked at the top-K positions are genuinely relevant to the user’s calibrated intent, improving the quality and stability of the final recommendation list.
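The relative lifts quoted in this section follow from Table 2 as (new − base) / base. The short sketch below recomputes them for XMart AUC; the dictionary layout and function name are illustrative choices, while the numbers are copied from Table 2.

```python
# (baseline AUC, baseline + GAP AUC) on XMart, copied from Table 2.
XMART_AUC = {
    "DIN": {"click": (0.6992, 0.7062), "purchase": (0.7587, 0.7661)},
    "ETA": {"click": (0.6936, 0.7053), "purchase": (0.7547, 0.7647)},
    "SDIM": {"click": (0.7034, 0.7056), "purchase": (0.7614, 0.7645)},
}


def relative_lift(base, new):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


def lifts_by_model(table):
    """Relative AUC lift (in percent) for every model and task in `table`."""
    return {
        model: {task: relative_lift(base, new) for task, (base, new) in tasks.items()}
        for model, tasks in table.items()
    }


if __name__ == "__main__":
    for model, tasks in lifts_by_model(XMART_AUC).items():
        for task, lift in tasks.items():
            print(f"{model:5s} {task:9s} {lift:+.2f}%")
```

Reporting relative rather than absolute differences makes gains comparable across tasks whose baseline AUC differs, such as Click and Purchase.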

4.3. In-depth Analysis (RQ2 & RQ3)

4.3.1. Ablation Study

To verify the effectiveness of GAP-Net and quantify component contributions, we conducted an ablation study on the XMart dataset, summarized in Table 3. The baseline row matches DIN on the XMart Purchase task in Table 2. First, replacing Softmax with ASGA confers a +0.35% AUC uplift and improves NDCG to 0.5597, supporting its efficacy in mitigating the “Attention Sink” effect via micro-level denoising. Second, incorporating GCQC brings a +0.28% gain, indicating that real-time query calibration captures intent drift more effectively than static retrieval. Notably, CGDF contributes the largest individual increase of +0.44%, showing the advantage of dynamic view re-weighting over rigid hard concatenation. Ultimately, the full GAP-Net surpasses the baseline by +0.97% in AUC and +1.05% in NDCG. The full model outperforms every single-gate variant, which indicates that the three gates are largely complementary and that addressing noise at the micro, meso, and macro granularities together is beneficial for robust modeling.

Table 3. Ablation study of GAP-Net on the XMart dataset.
| Model Variant | AUC | NDCG | MAP |
| Baseline (No Gates) | 0.7587 | 0.5579 | 0.4142 |
| + ASGA (Micro) | 0.7614 (+0.35%) | 0.5597 | 0.4161 |
| + GCQC (Meso) | 0.7609 (+0.28%) | 0.5593 | 0.4157 |
| + CGDF (Macro) | 0.7621 (+0.44%) | 0.5605 | 0.4173 |
| GAP-Net (Full) | 0.7661 (+0.97%) | 0.5638 | 0.4213 |

4.3.2. Impact of Gating Strategy in ASGA

To further elucidate the efficacy of the proposed micro-level denoising mechanism, we conduct a detailed comparative analysis in Table 4. First, we observe a counter-intuitive phenomenon: Naive Sigmoid (0.7563 AUC) actually underperforms the Standard Softmax baseline (0.7587). This reveals that merely removing the sum-to-one constraint is insufficient; without a competitive mechanism, unconstrained activation introduces optimization instability and fails to effectively distinguish signal from noise. In contrast, ASGA (0.7614) improves over both by implementing a controlled “soft rejection.” The ablation results further support our structural design: removing Pre-Attention Feature Sifting (w/o PAFS) or Query-Guided Gating (w/o QGG) lowers AUC from 0.7614 to 0.7601 and 0.7597, respectively. This indicates that combining feature-level sifting with intent-level gating is needed to suppress the “Attention Sink” while maintaining representation stability, a balance that naive methods fail to achieve.

Table 4. Performance comparison of different attention activation strategies within the ASGA module.
| Attention Mechanism | AUC | NDCG | MAP |
| Standard Softmax (Baseline) | 0.7587 | 0.5579 | 0.4142 |
| Naive Sigmoid (Direct Replacement) | 0.7563 | 0.5555 | 0.4112 |
| ASGA w/o PAFS | 0.7601 | 0.5593 | 0.4154 |
| ASGA w/o QGG | 0.7597 | 0.5584 | 0.4147 |
| ASGA (Ours) | 0.7614 | 0.5597 | 0.4161 |
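The contrast between the activation strategies in Table 4 can be illustrated with a toy sketch. It is not the ASGA formulation, which is defined in the method section together with PAFS and QGG. It only assumes a generic gate of the form "Softmax weight times a sigmoid gate", with hypothetical function names and made-up scores, to show why Softmax cannot reject an all-noise sequence while a gated variant can.

```python
import math


def softmax(scores):
    """Standard Softmax: weights are positive and always sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def naive_sigmoid_attention(scores):
    """Direct replacement: independent weights, no competition between items."""
    return [sigmoid(s) for s in scores]


def gated_softmax_attention(scores, gate_logits):
    """Softmax keeps the competition; a sigmoid gate can shrink every weight.

    `gate_logits` is a hypothetical per-item gate input (for example a
    query-dependent relevance logit); it is an assumption of this sketch.
    """
    return [w * sigmoid(g) for w, g in zip(softmax(scores), gate_logits)]


# A history in which every behavior is irrelevant to the target (low scores).
noisy_scores = [-4.0, -4.2, -3.9]
softmax_mass = sum(softmax(noisy_scores))
gated_mass = sum(gated_softmax_attention(noisy_scores, noisy_scores))
```

With Softmax the total attention mass stays at one even though no behavior is relevant, whereas the gated variant lets the total mass shrink toward zero, which is the "soft rejection" behavior discussed above.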

4.3.3. Impact of Gating Strategy in CGDF

To investigate the optimal gating strategy in CGDF, we evaluate three context input variants against a static baseline: (1) Minimalist Context, utilizing only target and sequence embeddings; (2) Full Context, which naively concatenates all user profiles and context features; and (3) Purified Context, our proposed method applying a denoising gate. The results in Figure 3 reveal that Minimalist Context yields only a marginal AUC uplift (0.7587 → 0.7598), indicating that interaction embeddings alone lack the global perspective required for accurate routing. In contrast, Full Context achieves a substantial jump to 0.7618, showing the value of incorporating rich side information; however, its potential is capped by the noise inherent in raw concatenation. Purified Context surpasses all variants, reaching 0.7621 AUC and 0.5605 NDCG. Although its margin over Full Context is small (+0.0003 AUC), it suggests that simply expanding feature width is insufficient; the Gated Context Purification helps filter low-level semantic noise, so that the dynamic fusion mechanism is driven by high-fidelity signals rather than spurious correlations.

Figure 3. Impact of Gating Strategy in CGDF
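The idea of routing views with a purified context can be sketched as below. This is a minimal illustration under explicit assumptions, not the CGDF module itself: the weight matrices `W_gate` and `W_route`, the sigmoid routing, and the weighted-sum fusion are all hypothetical choices made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def purify_context(context, W_gate):
    """Denoising gate (assumed form): scale each context feature by a value in (0, 1)."""
    gate = [sigmoid(z) for z in matvec(W_gate, context)]
    return [g * c for g, c in zip(gate, context)]


def fuse_views(views, context, W_gate, W_route):
    """Weight each view by a context-dependent gate and sum the weighted views.

    views:   list of view vectors, all with the same dimension
    W_route: one row per view, mapping the purified context to a routing logit
    """
    purified = purify_context(context, W_gate)
    weights = [sigmoid(z) for z in matvec(W_route, purified)]
    dim = len(views[0])
    fused = [sum(w * view[d] for w, view in zip(weights, views)) for d in range(dim)]
    return fused, weights


# Two toy views (for example short-term and long-term) and a 3-dim context.
toy_views = [[1.0, 0.0], [0.0, 1.0]]
toy_context = [1.0, -2.0, 0.5]
toy_W_gate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_W_route = [[2.0, 0.0, 0.0], [0.0, 0.0, -2.0]]
toy_fused, toy_weights = fuse_views(toy_views, toy_context, toy_W_gate, toy_W_route)
```

In contrast to static concatenation, where every view enters the downstream network with a fixed role, the weights here change with the request context, which is the behavior the three context variants above are compared on.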

4.4. Online A/B Test (RQ4)

To rigorously evaluate the practical business value of our proposed model, we conducted a strictly controlled online A/B test on the Category List Page of Meituan Xiaoxiang Supermarket, a leading on-demand retail platform. The experiment spanned a 7-day window, involving live user traffic randomly bucketed into treatment and control groups. Compared to the highly optimized online baseline, our model achieved remarkable and consistent improvements across all core commercial metrics:

  • Gross Merchandise Value (GMV): We observed a robust +0.73% lift in GMV. This substantial revenue gain indicates that our model not only promotes interactions but effectively identifies high-value potential needs, encouraging users to purchase items with higher unit prices or larger basket sizes.

  • Conversion Rate (CVR): The model delivered a +0.57% increase in CVR. This improvement in click-to-purchase efficiency validates that the items recommended by our model are genuinely aligned with users’ immediate purchase intent, minimizing the gap between browsing and buying.

  • Visit-to-Purchase Rate (V2P): Most notably, the Visit-to-Purchase Rate (defined as the ratio of paying users to total visiting users) saw a +0.33% improvement. This metric serves as a direct proxy for the platform’s overall conversion efficiency, demonstrating that our intent calibration mechanism successfully helps more hesitant browsers transition into paying customers.
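For readers unfamiliar with how such a test is typically set up, the sketch below shows one common way to bucket users deterministically and to express a metric difference as a relative lift. The paper does not describe its bucketing mechanism, so the hashing scheme, the salt, and the function names are assumptions, and the rates in the usage note are made-up illustrative numbers rather than experimental results.

```python
import hashlib


def assign_bucket(user_id, experiment_salt, treatment_share=0.5):
    """Deterministically map a user to 'treatment' or 'control'.

    Hashing the salted user id gives a stable pseudo-random fraction in [0, 1),
    so the same user always lands in the same group for a given experiment.
    """
    key = f"{experiment_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    fraction = int(digest[:8], 16) / 0x100000000
    return "treatment" if fraction < treatment_share else "control"


def relative_lift(control_rate, treatment_rate):
    """Relative change of the treatment metric over control, in percent."""
    return (treatment_rate - control_rate) / control_rate * 100.0
```

As a usage example with invented numbers, a control CVR of 0.1000 and a treatment CVR of 0.10057 would give `relative_lift(0.1000, 0.10057)` of roughly +0.57%, which is how a lift of the size reported above is expressed.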

5. Conclusion

In this paper, we propose GAP-Net, a unified Gated Adaptive Progressive Network that addresses the critical challenges of noise amplification and static intent assumptions in sequential recommendation. By establishing a systematic multi-level gating philosophy, GAP-Net integrates three core modules: Adaptive Sparse-Gated Attention (ASGA) to filter micro-level feature noise and mitigate the “attention sink” phenomenon; Gated Cascading Query Calibration (GCQC) to dynamically evolve user intent from real-time triggers to long-term memories; and Context-Gated Denoising Fusion (CGDF) to adaptively modulate heterogeneous temporal views based on decision context. This hierarchical architecture effectively bridges the semantic gap between static target items and dynamic user contexts. Extensive experimental results demonstrate that GAP-Net achieves state-of-the-art performance, exhibiting remarkable robustness against interaction noise and intent drift. These results validate the effectiveness of gating mechanisms in distilling complex user behaviors, providing a scalable and noise-resilient solution for next-generation recommendation systems.

References

  • S. Bian, X. Pan, W. X. Zhao, J. Wang, C. Wang, and J. Wen (2023) Multi-modal mixture of experts representation learning for sequential recommendation. In Proceedings of the 32nd ACM international conference on information and knowledge management, pp. 110–119.
  • Y. Cao, X. Zhou, J. Feng, P. Huang, Y. Xiao, D. Chen, and S. Chen (2022) Sampling is all you need on modeling long-term user behaviors for ctr prediction. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 2974–2983.
  • Z. Chai, Q. Ren, X. Xiao, H. Yang, B. Han, S. Zhang, D. Chen, H. Lu, W. Zhao, L. Yu, et al. (2025) Longer: scaling up long sequence modeling in industrial recommenders. In Proceedings of the Nineteenth ACM Conference on Recommender Systems, pp. 247–256.
  • J. Chang, C. Gao, Y. Zheng, Y. Hui, Y. Niu, Y. Song, D. Jin, and Y. Li (2021) Sequential recommendation with graph neural networks. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pp. 378–387.
  • J. Chang, C. Zhang, Z. Fu, X. Zang, L. Guan, J. Lu, Y. Hui, D. Leng, Y. Niu, Y. Song, et al. (2023a) TWIN: two-stage interest network for lifelong user behavior modeling in ctr prediction at kuaishou. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3785–3794.
  • J. Chang, C. Zhang, Y. Hui, D. Leng, Y. Niu, Y. Song, and K. Gai (2023b) Pepnet: parameter and embedding personalized network for infusing with personalized prior information. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3795–3804.
  • Q. Chen, C. Pei, S. Lv, C. Li, J. Ge, and W. Ou (2021) End-to-end user behavior retrieval in click-through rate prediction model. arXiv preprint arXiv:2108.04468.
  • Z. Chen, C. Lu, and Y. Wang (2025) CIEG-net: context information enhanced gated network for multimodal sentiment analysis. Pattern Recognition 168, pp. 111785. ISSN 0031-3203.
  • P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198.
  • Y. Feng, F. Lv, W. Shen, M. Wang, F. Sun, Y. Zhu, and K. Yang (2019) Deep session interest network for click-through rate prediction. arXiv preprint arXiv:1905.06482.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247.
  • J. Guo, P. Zhang, C. Li, X. Xie, Y. Zhang, and S. Kim (2022) Evolutionary preference learning via graph nested gru ode for session-based recommendation. In Proceedings of the 31st ACM international conference on information & knowledge management, pp. 624–634.
  • T. Guo, Z. Yang, Q. Zeng, and M. Chen (2025) Context-aware lifelong sequential modeling for online click-through rate prediction. arXiv preprint arXiv:2502.12634.
  • S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, et al. (2024) Minicpm: unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395.
  • L. Ji, G. Liu, M. Yin, H. Yang, and J. Zhou (2025) Hierarchical reinforcement learning for temporal abstraction of listwise recommendation. arXiv preprint arXiv:2409.07416.
  • W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM), pp. 197–206.
  • D. P. Kingma (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • C. Li, Z. Liu, M. Wu, Y. Xu, P. Huang, H. Zhao, G. Kang, Q. Chen, W. Li, and D. L. Lee (2019) Multi-interest network with dynamic routing for recommendation at tmall. arXiv preprint arXiv:1904.08030.
  • J. Li, M. Wang, J. Li, J. Fu, X. Shen, J. Shang, and J. McAuley (2023) Text is all you need: learning language representations for sequential recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1258–1267.
  • J. Li, S. Ding, L. Guo, and X. Li (2025) Multi-modal anchor gated transformer with knowledge distillation for emotion recognition in conversation. arXiv preprint arXiv:2506.18716.
  • C. Ma, P. Kang, and X. Liu (2019) Hierarchical gating networks for sequential recommendation. arXiv preprint arXiv:1906.09217.
  • X. Ma, L. Zhao, G. Huang, Z. Wang, Z. Hu, X. Zhu, and K. Gai (2018) Entire space multi-task model: an effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1137–1140.
  • Q. Pi, G. Zhou, Y. Zhang, Z. Wang, L. Ren, Y. Fan, X. Zhu, and K. Gai (2020) Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2685–2692.
  • Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huang, et al. (2025) Gated attention for large language models: non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708.
  • Z. Si, L. Guan, Z. Sun, X. Zang, J. Lu, Y. Hui, X. Cao, Z. Yang, Y. Zheng, D. Leng, et al. (2024) Twin v2: scaling ultra-long user behavior sequence modeling for enhanced ctr prediction at kuaishou. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 4890–4897.
  • H. Tang, J. Liu, M. Zhao, and X. Gong (2020) Progressive layered extraction (ple): a novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM conference on recommender systems, pp. 269–278.
  • R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, L. Hong, and E. Chi (2021) Dcn v2: improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the web conference 2021, pp. 1785–1797.
  • S. Wang, B. Shen, X. Min, Y. He, X. Zhang, L. Zhang, J. Zhou, and L. Mo (2024) Aligned side information fusion method for sequential recommendation. In Companion Proceedings of the ACM Web Conference 2024, pp. 112–120.
  • G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2023) Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
  • Y. Xie, P. Zhou, and S. Kim (2022a) Decoupled side information fusion for sequential recommendation. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, pp. 1611–1621.
  • Y. Xie, P. Zhou, and S. Kim (2022b) Decoupled side information fusion for sequential recommendation. arXiv preprint arXiv:2204.11046.
  • J. Xu, L. Sun, and D. Zhao (2024) MoME: mixture-of-masked-experts for efficient multi-task recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2527–2531.
  • S. Xu, S. Wang, D. Guo, X. Guo, Q. Xiao, B. Huang, G. Wu, and C. Luo (2025) Climber: toward efficient scaling laws for large recommendation models. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 6193–6200.
  • A. Yang, J. Pan, J. Lin, R. Men, Y. Zhang, J. Zhou, and C. Zhou (2022a) Chinese clip: contrastive vision-language pretraining in chinese. arXiv preprint arXiv:2211.01335.
  • X. Yang, X. Peng, P. Wei, S. Liu, L. Wang, and B. Zheng (2022b) Adasparse: learning adaptively sparse structures for multi-domain click-through rate prediction. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 4635–4639.
  • M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed (2021) Big bird: transformers for longer sequences. arXiv preprint arXiv:2007.14062.
  • D. Zhang, Z. Nie, J. Liu, C. Fu, W. Guan, Y. Gao, J. Song, P. Wang, J. Xu, and B. Zheng (2025) MOON: generative mllm-based multimodal representation learning for e-commerce product understanding. arXiv preprint arXiv:2508.11999.
  • Q. Zhang, X. Liao, Q. Liu, J. Xu, and B. Zheng (2022) Leaving no one behind: a multi-scenario multi-task meta learning approach for advertiser modeling. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 1368–1376.
  • G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai (2019) Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, pp. 5941–5948.
  • G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai (2018) Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1059–1068.