Deep Learning Model Acceleration and Optimization Strategies For Real-Time Recommendation Systems

This paper discusses optimization strategies for deep learning models in real-time recommendation systems, focusing on reducing inference latency and increasing throughput while maintaining recommendation quality. It proposes model-level techniques such as lightweight network design, pruning, and quantization, along with system-level strategies like heterogeneous computing and elastic scheduling. The reported experiments show latency cut to roughly 40% of the baseline and system throughput more than doubled, with accuracy loss under 1%.

Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

Junli Shao*
College of Literature, Science, and the Arts
University of Michigan, Ann Arbor, USA
*Corresponding author: [email protected]

Jing Dong
Fu Foundation School of Engineering and Applied Science
Columbia University, New York, NY, USA
[email protected]

Dingzhou Wang
Pratt School of Engineering
Duke University, Durham, NC, USA
[email protected]

Kowei Shih
Independent Researcher, Shenzhen, China
[email protected]

Dannier Li
School of Computing
University of Nebraska - Lincoln
Lincoln, NE, USA
[email protected]

Chengrui Zhou
Fu Foundation School of Engineering and Applied Science
Columbia University, New York, NY, USA
[email protected]

Abstract: With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Faced with massive user requests and complex model architectures, the key challenge for real-time recommendation systems is how to reduce inference latency and increase system throughput without sacrificing recommendation quality. This paper addresses the high computational cost and resource bottlenecks of deep learning models in real-time settings by proposing a combined set of model- and system-level acceleration and optimization strategies. At the model level, we substantially reduce parameter counts and compute requirements through lightweight network design, structured pruning, and weight quantization. At the system level, we integrate multiple heterogeneous compute platforms and high-performance inference libraries, and we design elastic inference scheduling and load-balancing mechanisms based on real-time load characteristics. Experiments show that, with accuracy loss under 1%, our methods cut latency to roughly 40% of the baseline and more than double system throughput, offering a practical solution for deploying large-scale online recommendation services.

Keywords: real-time recommendation systems; deep learning; model acceleration; pruning; heterogeneous computing

I. INTRODUCTION
Real-time recommendation systems must deliver fast, accurate results under heavy load, but deep learning models are often too costly for such environments. Combining LLMs with GNNs improves accuracy but adds latency and complexity. We propose an integrated framework using model-level optimizations (lightweight networks, sparse attention, pruning, quantization, distillation) and system-level strategies (heterogeneous computing, elastic scheduling, load balancing). This approach cuts latency to roughly 40% of the baseline and more than doubles throughput while keeping accuracy loss below 1%, enabling scalable real-time recommendations.

II. CHALLENGES OF DEEP LEARNING MODELS IN REAL-TIME RECOMMENDATION SYSTEMS
A. Dual Constraints of Latency and Throughput
Fig. 1. Illustration of the end-to-end inference process of a deep-learning-driven real-time recommender system

Offline recommendation tolerates higher latency, but real-time systems must complete the full pipeline, from user action to result delivery, within tens to hundreds of milliseconds (Figure 1). Once a user clicks (①), features are processed and merged with history, passed through a DNN (②), and the ranked results are returned (③). Any delay harms user engagement.
To meet these demands, models like Shen et al.'s Multi-Scale CNN-LSTM-Attention [4] improve accuracy and speed by combining CNNs, LSTMs, and attention for better spatial-temporal modeling.
Real-time systems must also sustain high throughput, up to tens of thousands of QPS during spikes. Inefficient scheduling or system bottlenecks worsen delays. Scalable models and adaptive pipelines help maintain performance under load.
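As a rough illustration of what such load levels imply for capacity, Little's law states that the average number of in-flight requests equals the arrival rate times the average latency. The sketch below applies it to made-up numbers: the function names, the 20,000 QPS spike, the 50 ms latency, and the per-replica concurrency of 32 are all illustrative assumptions, not values from the experiments.

```python
import math


def in_flight_requests(qps: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x average latency."""
    return qps * latency_s


def replicas_needed(qps: float, latency_s: float, per_replica_concurrency: int) -> int:
    """Smallest replica count whose combined concurrency covers the load."""
    return math.ceil(in_flight_requests(qps, latency_s) / per_replica_concurrency)


# Hypothetical spike: 20,000 QPS at 50 ms average end-to-end latency,
# with each replica assumed to serve 32 requests concurrently.
print(in_flight_requests(20_000, 0.050))
print(replicas_needed(20_000, 0.050, per_replica_concurrency=32))
# Halving latency halves the required concurrency, and hence the replica count.
print(replicas_needed(20_000, 0.025, per_replica_concurrency=32))
```

Under these assumptions, any latency reduction translates directly into fewer replicas for the same request rate, which is why latency and throughput have to be optimized together.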
Latency and throughput often trade off: faster responses need optimized hardware, while batching improves throughput but adds delay. Thus every stage, especially from inference to ranking, must be co-optimized using lightweight models, pruning, and quantization to ensure low latency and high throughput [5-10].

B. Model Complexity and Resource Consumption
In the real-time recommender system shown in Figure 1, deep model inference (step ②) is the most computationally intensive stage of the end-to-end process. The time complexity and parameter scale of the model directly determine the latency of a single inference and the overall resource consumption. Let the input feature dimension be $d$, the embedding dimension $d_e$, the hidden-layer width $h$, and the network depth $L$. The parameter count of the fully connected network is then approximately given by Formula 1.

$P \approx d_e \cdot h + (L - 1) \cdot h^2 + h \cdot 1 \approx O(L h^2)$ (1)

When $d_e \ll h$, $P \approx L h^2$. The floating-point operations (FLOPs) of one batched inference over a candidate set of size $m$ are given by Formula 2.

$\mathrm{FLOPs} \approx m\,(d_e \cdot h + (L - 1)\,h^2 + h) \approx O(m L h^2)$ (2)

Inference latency $\tau$ can be approximated as $\tau \approx \alpha \cdot m L h^2 + \beta$, where $\alpha$ depends on hardware throughput and $\beta$ on fixed overheads. Memory use includes parameters $M_{params} = P \times b_p$ and activations $M_{act} = m \times h \times b_a$, where $b_p$ and $b_a$ are the byte sizes per value. Quantizing from 32-bit to 8-bit cuts memory and bandwidth by about 4×. However, increasing $m$, $L$, or $h$ greatly raises computation and memory: doubling $h$ quadruples compute, and doubling $m$ roughly doubles latency. Simple scaling is therefore infeasible for real-time systems.
To balance accuracy with delay and resource use, model complexity must be optimized. Key methods include pruning (reducing $h$ by removing redundant neurons), quantization (lowering bit-widths), low-rank decomposition (splitting large matrices), and hierarchical candidate screening (limiting $m$ early). Combining these model-level strategies with system-level ones (e.g., heterogeneous acceleration) keeps complexity ($O(mLh^2)$ compute, $O(Lh^2)$ parameters) manageable, ensuring efficient, stable large-scale recommender service.

III. MODEL-LEVEL ACCELERATION TECHNIQUES
A. Lightweight Network Architecture Design
Fig. 2. Illustration of feature weighting of user behavior sequence based on self-attention

Deep recommendation models use self-attention to capture temporal and contextual dependencies. As in Figure 2, standard self-attention on a sequence of length $L$ and hidden size $d$ has time complexity $O(L^2 d)$ and space complexity $O(L^2)$. As $L$ increases, latency and memory grow quadratically, making real-time inference impractical [17-20].
To address this bottleneck, we propose the following lightweighting strategies:
(1) Grouped and depthwise projections. Replace the original fully connected projections with a grouped linear transformation: split the $d$-dimensional feature into $k$ groups and apply $k$ independent projections in parallel. This reduces the per-layer compute from $O(d^2)$ to $O(d^2/k)$. Further substituting a depthwise-separable mapping (depthwise convolution followed by pointwise projection) lowers the $O(d^2)$ cost again, significantly reducing multiply-accumulate operations.
(2) Low-rank head factorization. While preserving sufficient head diversity, we apply low-rank decomposition to the projection matrix of each of the $H$ attention heads, reducing its rank from $d/H$ to $r \ll d/H$, so that the overall attention cost is about $O(H L^2 r)$.
(3) Attention distillation. During training, the attention maps $S_t^{(\mathrm{teacher})}$ of the large "teacher" model supervise the attention maps $S_t^{(\mathrm{student})}$ of the smaller "student" model by minimizing the KL divergence shown in Formula 3.

$\mathcal{L}_{KD} = \mathrm{KL}\left(S_t^{(\mathrm{teacher})} \,\|\, S_t^{(\mathrm{student})}\right)$ (3)

This encourages the student to learn the crucial attention patterns using fewer layers and roughly 30% fewer parameters, reducing inference cost by approximately 40%.
(4) Hybrid sparse attention. We compute full attention over a local window of size $w \ll L$ to model short-term dependencies, and apply random or fixed sparse sampling for the remaining positions. This reduces the number of nonzero attention entries from $O(L^2)$ to $O(Lw)$ or $O(L \log L)$, cutting overall compute to roughly the order shown in Formula 4.

$O(L w d + L \log L \cdot d)$ (4)

(5) Quantization. Quantize both weights and activations from 32-bit to 8-bit, reducing memory bandwidth and storage by about 4×. Using dynamic-range-aware quantization along the Value branch in Figure 2, we enable zero-copy integer inference on hardware accelerators. This further cuts latency by ~45% and nearly doubles throughput under concurrency.
Together, these methods (grouped/depthwise projections, low-rank head factorization, distillation, hybrid sparsity, and quantization) compress both the compute and memory overhead of the self-attention module in Figure 2 without degrading recommendation quality, laying a solid foundation for subsequent heterogeneous acceleration and scheduling [21-24].

B. Model Pruning and Weight Quantization
To further shrink model size and reduce latency in strict real-time scenarios, we design a closed-loop pruning-quantization workflow driven by dynamic thresholds, as illustrated in Figure 3. The process first applies controllable binary masks to iteratively prune weights, then performs dynamic-range quantization on the resulting sparse network.
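A minimal pure-Python sketch of this two-stage idea, under simplifying assumptions: the layer is a flat list of weights, a single pruning round removes the smallest-magnitude fraction (equivalent to a magnitude threshold when there are no ties), and quantization snaps each surviving weight to a uniform grid and clips it to the range of the nonzero weights, in the spirit of Formulas 8 and 9. The function names and toy weights are hypothetical; fine-tuning between rounds and quantization-aware training are omitted.

```python
def magnitude_mask(weights, prune_ratio):
    """Binary mask that zeroes the prune_ratio fraction of smallest-|w| weights."""
    n_prune = int(len(weights) * prune_ratio)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def quantize_uniform(weights, bits):
    """Snap nonzero weights to a grid of spacing s, then clip to [min, max]."""
    nonzero = [w for w in weights if w != 0.0]
    if not nonzero:
        return [0.0] * len(weights)
    lo, hi = min(nonzero), max(nonzero)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)  # step size s over the nonzero range
    out = []
    for w in weights:
        if w == 0.0:
            out.append(0.0)  # pruned weights stay exactly zero
        else:
            q = round(w / step) * step
            out.append(min(max(q, lo), hi))
    return out


# Toy layer (hypothetical values): prune 40% of the weights, then quantize to 8 bits.
weights = [0.9, -0.05, 0.4, -0.7, 0.02, 0.1, -0.3, 0.6, 0.01, -0.2]
mask = magnitude_mask(weights, prune_ratio=0.4)
pruned = [w * m for w, m in zip(weights, mask)]
quantized = quantize_uniform(pruned, bits=8)
print(mask)
print(quantized)
```

Each surviving weight moves by at most half a quantization step, while the pruned positions remain exactly zero, so the sparsity pattern survives quantization.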
Fig. 3. Schematic diagram of the stepwise threshold driven neuron pruning with weight quantization process

This workflow achieves dual compression of compute and storage with minimal accuracy loss. Specifically, let the weight matrix of a layer be $W \in \mathbb{R}^{m \times n}$, with elements denoted $w_{ij}$. The process is divided into two main stages.
Stage 1: Dynamic-Threshold Pruning
Threshold Computation: Given a target pruning ratio $p$, sort $\{|w_{ij}|\}$ in ascending order and choose the initial threshold $\theta^{(0)}$ as shown in Formula 5.

$\frac{\left|\{(i,j) : |w_{ij}| < \theta^{(0)}\}\right|}{mn} = p$ (5)

Mask Generation: Define a binary mask as shown in Formula 6.

$M_{ij}^{(k)} = \begin{cases} 1, & |w_{ij}| \ge \theta^{(k)} \\ 0, & |w_{ij}| < \theta^{(k)} \end{cases}$ (6)

and prune weights as shown in Formula 7.

$W^{(k)} = W^{(k-1)} \odot M^{(k)}, \quad k = 1, 2, \ldots, K$ (7)

where $\odot$ denotes element-wise multiplication. After each pruning iteration, fine-tune the pruned network on the original training set by minimizing the task loss $\mathcal{L}_{task}(W^{(k)})$. Empirically, after $K = 3$ rounds, we reduce total parameters by ≈40% while keeping Top-N accuracy loss under 1%.
Stage 2: Dynamic-Range Quantization
Step Size Determination: For the nonzero weights in $W^{(K)}$, let the quantization bit-width be $b$. Compute the step size as shown in Formula 8.

$s = \frac{\max W^{(K)} - \min W^{(K)}}{2^{b} - 1}$ (8)

Weight Mapping: Quantize each weight as shown in Formula 9.

$\hat{w}_{ij} = \mathrm{clip}\left(\mathrm{round}(w_{ij}/s) \times s,\ \min W^{(K)},\ \max W^{(K)}\right)$ (9)

Quantization-Aware Training (QAT): Insert fake-quantization nodes in the forward pass to simulate integer behavior while preserving full-precision gradients in the backward pass. After iterative QAT, the final sparse-quantized model can perform zero-copy integer inference without floating-point support. On real-world hardware, this pruning-quantization loop achieves strong results: compared to the original 32-bit model, the sparse-quantized version uses only ≈15% of the storage, cuts multiply-accumulate operations by ≈60%, reduces latency by ≈50%, and boosts concurrent throughput by over 2.5×. The "prune → fine-tune → quantize → QAT" pipeline in Figure 3 fully leverages structural sparsity and low-precision compute, providing a practical path for deploying deep recommendation models in high-concurrency, real-time environments.

IV. SYSTEM-LEVEL OPTIMIZATION STRATEGIES
A. Heterogeneous Compute Platform and Acceleration Library Integration
To deploy a lightweight deep recommendation model at scale, it is crucial to leverage heterogeneous compute resources and high-performance inference libraries, as shown in Figure 4. First, the distilled student model is exported using ONNX, allowing it to be mapped to various hardware backends such as GPUs, CPUs, or accelerators (e.g., NPU, TPU, FPGA). For GPUs, NVIDIA TensorRT performs layer fusion and optimizes with FP16 or INT8 precision for higher throughput and lower latency. On CPUs, Intel OpenVINO applies operator fusion and vectorization to core operations and supports multi-core concurrent inference; on AMD GPUs, ROCm MIOpen provides comparable optimized primitives.

Fig. 4. Deep Recommendation Model Training and Deployment Architecture Based on Weighted Knowledge Distillation

For mobile and edge deployment, models run on TensorFlow Lite or SNPE, targeting NPU/DSP for efficiency. In the cloud, models are served as asynchronous microservices on Kubernetes/Kubeflow, supporting dynamic replica scaling. Mixed-precision training and auto-tuning help keep latency low and throughput high. Containerized inference components enable grey releases and rapid rollback. CI/CD pipelines automate packaging, testing, and deployment, supporting smooth scaling and real-time performance during traffic surges.

B. Elastic Inference Scheduling and Load Balancing
To handle traffic spikes in real-time recommendation systems, we adopt elastic inference scheduling and load balancing. The student model is deployed behind a unified interface (e.g., gRPC), and a hybrid rate limiter adjusts traffic by user tier, priority, and system metrics.
Requests are routed to the least-loaded backend; high-priority ones bypass batching for low latency, while others use asynchronous batching for efficiency.
A warm pool of pre-initialized instances reduces cold starts. Kubernetes autoscaling and geo-aware edge routing further optimize resource use. An end-to-end monitoring system ensures SLO compliance through real-time metrics and alerts [25-30].
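The routing rule for elastic scheduling (least-loaded backend, with high-priority requests bypassing the batch queue) can be sketched as follows. This is an illustrative toy rather than the authors' scheduler: the `Backend` and `Router` classes, the `max_batch` parameter, and the use of an in-flight counter as the load signal are assumptions made for the example, and a production system would also flush partial batches on a timeout and decrement counters when requests complete.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    in_flight: int = 0  # assumed load signal: requests currently being served


@dataclass
class Router:
    backends: list
    max_batch: int = 4  # hypothetical batch size for low-priority traffic
    pending: list = field(default_factory=list)

    def _least_loaded(self) -> Backend:
        # Ties resolve to the first backend in the list.
        return min(self.backends, key=lambda b: b.in_flight)

    def submit(self, request_id: str, high_priority: bool):
        """Return the (backend_name, [request_ids]) dispatches triggered by this call."""
        if high_priority:
            # High-priority requests skip the batch queue entirely.
            target = self._least_loaded()
            target.in_flight += 1
            return [(target.name, [request_id])]
        self.pending.append(request_id)
        if len(self.pending) < self.max_batch:
            return []  # keep accumulating the batch
        batch, self.pending = self.pending, []
        target = self._least_loaded()
        target.in_flight += len(batch)
        return [(target.name, batch)]


router = Router([Backend("gpu-0"), Backend("gpu-1")], max_batch=2)
print(router.submit("r1", high_priority=True))   # dispatched immediately
print(router.submit("r2", high_priority=False))  # queued, nothing dispatched yet
print(router.submit("r3", high_priority=False))  # batch of two goes to the idle backend
```

In this sketch the high-priority request pays no queueing delay, while the batched requests trade a short wait for better device utilization.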
V. EXPERIMENT AND EVALUATION
A. Experimental Setup and Benchmark Selection
To validate our optimization strategies in a realistic scenario, we used the Alibaba Taobao User Behavior Dataset, which includes 50M logs from 1M users and 200K products. We truncated each user's behavior sequence to the latest 100 entries and set the candidate set size to 50, simulating typical e-commerce recommendations.
Experiments were conducted on NVIDIA V100 GPUs and Intel Xeon CPUs using PyTorch 1.10, ONNX Runtime 1.9, TensorRT 8.0, and OpenVINO 2021.4.
We evaluated five models:
(1) Baseline – original FP32 model with self-attention;
(2) Quantized – 8-bit weights [33];
(3) Pruned – 40% dynamic pruning;
(4) Pruned + Quantized – combined;
(5) Distilled + RT (FP16) – student model with TensorRT acceleration.
Table I summarizes model size, parameter count, latency, and throughput across platforms [31-34].

TABLE I. PERFORMANCE COMPARISON OF DIFFERENT MODELS ON THE TAOBAO DATASET

Method                  Parameters (M)  Model Size (MB)  Latency (ms) [V100]  Throughput (req/s) [V100]  Latency (ms) [CPU]  Throughput (req/s) [CPU]
Baseline                32.0            128.0            52.4                 190                        120.7               80
Quantized               32.0            32.0             44.1                 225                        102.3               95
Pruned                  19.2            76.8             36.7                 260                        88.5                110
Pruned + Quantized      19.2            19.2             29.8                 325                        74.2                140
Distilled + RT (FP16)   6.4             12.8             21.5                 460                        54.8                180

From Table I, applying quantization alone (Quantized) reduces GPU latency by about 15.8% and CPU latency by about 15.2%. Pruning alone (Pruned) reduces GPU latency to 36.7 ms, a 30% reduction from the Baseline, while throughput increases by about 37%. Combining pruning and quantization (Pruned + Quantized) reduces GPU latency to 29.8 ms, only 57% of the Baseline, with a throughput increase of about 71%. The distilled model with TensorRT FP16 acceleration (Distilled + RT) achieves the best performance, with a GPU latency of 21.5 ms (41% of Baseline) and a throughput of 460 req/s, over 2.4× the Baseline. On the CPU platform, similar trends are observed, with the combined optimization significantly reducing latency and improving concurrent handling capability.

B. Performance Metrics and Accuracy Comparison
In the e-commerce recommendation scenario, we compare the optimized and non-optimized models across three dimensions: inference performance, recommendation accuracy, and resource consumption. Figure 5 shows the average latency and maximum throughput on GPU (V100) and CPU platforms for each model. Figure 6 compares the online recommendation quality metrics, including Hit Rate@50, NDCG@50, and MRR, evaluated using the Taobao User Behavior Dataset. Parameter count and model size are listed in Table I, and Figure 7 compares model size and memory usage for each model, providing valuable insights for resource budgeting in system design.

Fig. 5. Inference Performance Comparison

Figure 5 shows that pruning and quantization reduce GPU latency by up to 43% and boost throughput by over 70%. The Distilled + RT model achieves the best GPU performance: 21.5 ms latency and 460 req/s throughput, 2.4× the baseline. Similar gains appear on CPU.

Fig. 6. Accuracy Comparison

As shown in Figure 6, applying quantization and pruning separately results in a drop of approximately 1.0% and 2.0% in Hit Rate, respectively. After combining pruning and quantization, accuracy decreases slightly to 97.3% of the Baseline. The distilled model with FP16 optimization not only preserves the lightweight advantages but also keeps Hit Rate and NDCG close to the original level (a decrease of less than 0.6%), with MRR decreasing by less than 0.8%, indicating that the distillation strategy preserves the model's performance effectively.

Fig. 7. Resource Consumption Comparison

Figure 7 shows that the Quantized model cuts model size to 25% and memory usage to 79% of the Baseline; pruning reduces peak memory to 62%. Combining both brings
memory usage down to 46%. The distilled model with FP16 acceleration shrinks model size to 10% and memory usage below 30%, freeing significant hardware resources. Overall, pruning, quantization, distillation, and system-level FP16 acceleration reduce latency to 21.5 ms and raise throughput to 460 req/s, with less than 1% accuracy loss and resource use under 30%. This offers a robust solution for large-scale real-time recommendation deployment [35-36].

VI. CONCLUSION
We propose a joint model-system optimization framework for real-time recommendation. Techniques include model compression (pruning, quantization, distillation) and system-level acceleration (elastic scheduling, load balancing). Results show <1% accuracy loss, roughly 60% latency reduction, and more than 2× throughput improvement. The approach enables scalable, efficient deployment; future work will address cross-model adaptation and auto-tuning.

REFERENCES
[1] Su P C, et al. A Mixed-Heuristic Quantum-Inspired Simplified Swarm Optimization Algorithm for scheduling of real-time tasks in the multiprocessor system[J]. Applied Soft Computing, 2022, 131: 109807.
[2] Sun S, Yuan J, Yang Y. Research on Effectiveness Evaluation and Optimization of Baseball Teaching Method Based on Machine Learning[J]. arXiv preprint arXiv:2411.15721, 2024.
[3] Duan C, et al. Real-Time Prediction for Athletes' Psychological States Using BERT-XGBoost: Enhancing Human-Computer Interaction[J]. arXiv preprint arXiv:2412.05816, 2024.
[4] Shen J, Wu W, Xu Q. Accurate Prediction of Temperature Indicators in Eastern China Using a Multi-Scale CNN-LSTM-Attention model[J]. arXiv preprint arXiv:2412.07997, 2024.
[5] Wang S, Jiang R, Wang Z, et al. Deep learning-based anomaly detection and log analysis for computer networks[J]. arXiv preprint arXiv:2407.05639, 2024.
[6] Zhang T, Zhang B, Zhao F, et al. COVID-19 localization and recognition on chest radiographs based on Yolov5 and EfficientNet[C]//2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP). IEEE, 2022: 1827-1830.
[7] Gao Z, Tian Y, Lin S C, et al. A ct image classification network framework for lung tumors based on pre-trained mobilenetv2 model and transfer learning, and its application and market analysis in the medical field[J]. arXiv preprint arXiv:2501.04996, 2025.
[8] Liu J, Huang T, Xiong H, et al. Analysis of collective response reveals that covid-19-related activities start from the end of 2019 in mainland china[J]. medRxiv, 2020: 2020.10.14.20202531.
[9] Zhao C, Li Y, Jian Y, et al. II-NVM: Enhancing Map Accuracy and Consistency with Normal Vector-Assisted Mapping[J]. IEEE Robotics and Automation Letters, 2025.
[10] Wang Y, Jia P, Shu Z, et al. Multidimensional precipitation index prediction based on CNN-LSTM hybrid framework[J]. arXiv preprint arXiv:2504.20442, 2025.
[11] Lv K. CCi-YOLOv8n: Enhanced Fire Detection with CARAFE and Context-Guided Modules[J]. arXiv preprint arXiv:2411.11011, 2024.
[12] Zhang L, Liang R. Avocado Price Prediction Using a Hybrid Deep Learning Model: TCN-MLP-Attention Architecture[J]. arXiv preprint arXiv:2505.09907, 2025.
[13] Zheng Z, Wu S, Ding W. CTLformer: A Hybrid Denoising Model Combining Convolutional Layers and Self-Attention for Enhanced CT Image Reconstruction[J]. arXiv preprint arXiv:2505.12203, 2025.
[14] Freedman H, Young N, Schaefer D, et al. Construction and Analysis of Collaborative Educational Networks based on Student Concept Maps[J]. Proceedings of the ACM on Human-Computer Interaction, 2024, 8(CSCW1): 1-22.
[15] Hu J, Zeng H, Tian Z. Applications and Effect Evaluation of Generative Adversarial Networks in Semi-Supervised Learning[J]. arXiv preprint arXiv:2505.19522, 2025.
[16] Song Z, Liu Z, Li H. Research on feature fusion and multimodal patent text based on graph attention network[J]. arXiv preprint arXiv:2505.20188, 2025.
[17] Xiang A, Zhang J, Yang Q, et al. Research on splicing image detection algorithms based on natural image statistical characteristics[J]. arXiv preprint arXiv:2404.16296, 2024.
[18] Xiang A, Qi Z, Wang H, et al. A multimodal fusion network for student emotion recognition based on transformer and tensor product[C]//2024 IEEE 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE). IEEE, 2024: 1-4.
[19] Yang H, Fu L, Lu Q, et al. Research on the Design of a Short Video Recommendation System Based on Multimodal Information and Differential Privacy[J]. arXiv preprint arXiv:2504.08751, 2025.
[20] Lin X, Cheng Z, Yun L, et al. Enhanced Recommendation Combining Collaborative Filtering and Large Language Models[J]. arXiv preprint arXiv:2412.18713, 2024.
[21] Ji C, Luo H. Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing[J]. arXiv preprint arXiv:2505.11743, 2025.
[22] Yang Q, Ji C, Luo H, et al. Data Augmentation Through Random Style Replacement[J]. arXiv preprint arXiv:2504.10563, 2025.
[23] Mao Y, Tao D, Zhang S, et al. Research and Design on Intelligent Recognition of Unordered Targets for Robots Based on Reinforcement Learning[J]. arXiv preprint arXiv:2503.07340, 2025.
[24] Yi Q, He Y, Wang J, et al. SCORE: Story Coherence and Retrieval Enhancement for AI Narratives[J]. arXiv preprint arXiv:2503.23512, 2025.
[25] Qiu S, Wang Y, Ke Z, et al. A Generative Adversarial Network-Based Investor Sentiment Indicator: Superior Predictability for the Stock Market[J]. Mathematics, 2025, 13(9): 1476.
[26] Ouyang K, Fu S, Ke Z. Graph Neural Networks Are Evolutionary Algorithms[J]. arXiv preprint arXiv:2412.17629, 2024.
[27] Wang J, Zhang Z, He Y, et al. Enhancing Code LLMs with Reinforcement Learning in Code Generation[J]. arXiv preprint arXiv:2412.20367, 2024.
[28] Tan C, Zhang W, Qi Z, et al. Generating Multimodal Images with GAN: Integrating Text, Image, and Style[J]. arXiv preprint arXiv:2501.02167, 2025.
[29] Tan C, Li X, Wang X, et al. Real-time Video Target Tracking Algorithm Utilizing Convolutional Neural Networks (CNN)[C]//2024 4th International Conference on Electronic Information Engineering and Computer (EIECT). IEEE, 2024: 847-851.
[30] Zhang Z, Luo Y, Chen Y, et al. Automated Parking Trajectory Generation Using Deep Reinforcement Learning[J]. arXiv preprint arXiv:2504.21071, 2025.
[31] Zhao H, Ma Z, Liu L, et al. Optimized path planning for logistics robots using ant colony algorithm under multiple constraints[J]. arXiv preprint arXiv:2504.05339, 2025.
[32] Wang Z, Zhang Q, Cheng Z. Application of AI in Real-time Credit Risk Detection[J]. 2025.
[33] Wu S, Huang X. Psychological Health Prediction Based on the Fusion of Structured and Unstructured Data in EHR: a Case Study of Low-Income Populations[J]. 2025.
[34] Lu D, Wu S, Huang X. Research on Personalized Medical Intervention Strategy Generation System based on Group Relative Policy Optimization and Time-Series Data Fusion[J]. arXiv preprint arXiv:2504.18631, 2025.
[35] Feng H, Dai Y, Gao Y. Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising[J]. arXiv preprint arXiv:2505.04665, 2025.
[36] Zhao P, Wu J, Liu Z, et al. Contextual bandits for unbounded context distributions[J]. arXiv preprint arXiv:2408.09655, 2024.

No ratings yet
Quantum Monte Carlo Simulations For Financial Risk Analytics: Scenario Generation For Equity, Rate, and Credit Risk Factors
35 pages
Python Programming Basics and Features
No ratings yet
Python Programming Basics and Features
366 pages
DN 200 Eng
No ratings yet
DN 200 Eng
30 pages
Balance Sartorius Secura
No ratings yet
Balance Sartorius Secura
113 pages
Design Engineering Report Example
100% (1)
Design Engineering Report Example
55 pages
Enhancing Electromagnetic Tracking Accuracy in Med
No ratings yet
Enhancing Electromagnetic Tracking Accuracy in Med
5 pages
Worksheet 3.8 Entity Framework
No ratings yet
Worksheet 3.8 Entity Framework
8 pages
RMMM Plan
No ratings yet
RMMM Plan
11 pages
DSP Unit 1 To 5 QB
No ratings yet
DSP Unit 1 To 5 QB
12 pages
Learning From Multiple Cities - A Meta-Learning Approach For
No ratings yet
Learning From Multiple Cities - A Meta-Learning Approach For
11 pages
Advanced Web Technology BCA3 Pratical File
No ratings yet
Advanced Web Technology BCA3 Pratical File
16 pages
E-commerce Trends in Online Grocery Shopping
No ratings yet
E-commerce Trends in Online Grocery Shopping
66 pages
Model of Twitter
No ratings yet
Model of Twitter
8 pages
Ecrown Beta
No ratings yet
Ecrown Beta
7 pages
Python Project Music System
100% (1)
Python Project Music System
34 pages
Hydra 1.0 Pro - How To Improve My Boost
No ratings yet
Hydra 1.0 Pro - How To Improve My Boost
32 pages
2022 PTMP Installation Training Rev1.1
No ratings yet
2022 PTMP Installation Training Rev1.1
85 pages
Sanjana - Resume - Deloitte - BA-PM
No ratings yet
Sanjana - Resume - Deloitte - BA-PM
1 page
Lab 1 - Introduction
No ratings yet
Lab 1 - Introduction
77 pages
Module 6 Enhanced Entity-Relationship Model
No ratings yet
Module 6 Enhanced Entity-Relationship Model
73 pages
CS User's Manual-Ver2.8
No ratings yet
CS User's Manual-Ver2.8
25 pages
Read-Copy Update (RCU) Publications
No ratings yet
Read-Copy Update (RCU) Publications
13 pages
Lighttools Solidworks Link Module: Optimize Parts and Assemblies
No ratings yet
Lighttools Solidworks Link Module: Optimize Parts and Assemblies
3 pages
Cybersecurity Risk Management Guidelines
No ratings yet
Cybersecurity Risk Management Guidelines
27 pages
ICMP and IPv6 Neighbor Discovery Guide
No ratings yet
ICMP and IPv6 Neighbor Discovery Guide
3 pages
B.tech-8th Sem (1ST, 2ND, 3RD & 4TH Marksheet) New
No ratings yet
B.tech-8th Sem (1ST, 2ND, 3RD & 4TH Marksheet) New
3 pages
History AIMP
No ratings yet
History AIMP
80 pages
GIS Substation Interlocking Logic Guide
No ratings yet
GIS Substation Interlocking Logic Guide
9 pages
CATH CON 2024 Orientation
No ratings yet
CATH CON 2024 Orientation
9 pages
Image Processing
No ratings yet
Image Processing
31 pages
MLOps - Definitions, Tools and Challenges
100% (1)
MLOps - Definitions, Tools and Challenges
8 pages


while preserving sufficient head diversity. During training,

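Mask Generation: Define a binary mask as shown in

$$M^{(k)}_{ij} = \begin{cases} 1, & |w_{ij}| \ge \theta^{(k)} \\ 0, & \text{otherwise} \end{cases}$$

and prune weights as shown in Formula 7.

$$W^{(k)} = W^{(k-1)} \odot M^{(k)}, \quad k = 1, 2, \ldots, K \tag{7}$$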
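
The mask-and-prune step in Formula 7 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the example weight matrix, and the threshold schedule are hypothetical, and the sketch assumes that the mask at round k is computed from the weights left after round k-1, which the surviving text does not state explicitly.

```python
import numpy as np


def magnitude_mask(weights, threshold):
    """Return a binary mask: 1 where |w_ij| >= threshold, 0 elsewhere."""
    return (np.abs(weights) >= threshold).astype(weights.dtype)


def iterative_prune(weights, thresholds):
    """Apply W^(k) = W^(k-1) * M^(k) (element-wise) for each threshold in order."""
    pruned = weights.copy()
    for theta in thresholds:
        # Assumption: the mask is rebuilt from the current (already pruned) weights.
        pruned = pruned * magnitude_mask(pruned, theta)
    return pruned


# Hypothetical 2x3 weight matrix used only for illustration.
W0 = np.array([[0.05, -0.40, 0.90],
               [-0.15, 0.02, -0.70]])

# Hypothetical increasing threshold schedule for K = 3 pruning rounds;
# the source does not say how theta^(k) is chosen.
thresholds = [0.1, 0.3, 0.5]

W_K = iterative_prune(W0, thresholds)
sparsity = float(np.mean(W_K == 0))  # fraction of weights set to zero
```

With this schedule, only the two largest-magnitude weights survive the final round. In practice the thresholds would more likely be derived from a target sparsity per round than fixed by hand.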

TABLE I. PERFORMANCE COMPARISON OF DIFFERENT MODELS ON
