J. Inst. Eng. India Ser. B (October 2024) 105(5):1121–1128
https://doi.org/10.1007/s40031-024-01051-w
ORIGINAL CONTRIBUTION
A Deep Learning‑Based Object Representation Algorithm
for Smart Retail Management
Bin Liu1
Received: 20 August 2023 / Accepted: 20 March 2024 / Published online: 6 April 2024
© The Institution of Engineers (India) 2024
* Bin Liu
[email protected]
1 Haojing College of Shaanxi University of Science and Technology, Xi’an 710000, Shaanxi, China

Abstract This study underscores the vital role of object representation and detection in smart retail management systems for optimizing customer experiences and operational efficiency. The literature review reveals a preference for deep learning techniques, citing their superior accuracy compared to traditional methods. While acknowledging the challenges of achieving high accuracy and low computation costs simultaneously in deep learning-based object representation, the paper proposes a solution using the YOLOv7 framework. In order to navigate the ever-changing landscape of smart retail technologies, the study clarifies the potential scalability and flexibility of deep learning approaches. The method employs a custom dataset, and experimental results demonstrate the model’s efficacy, showcasing accurate results and enhanced performance in various experiments and analyses.

Keywords Object representation · Smart retail management · Deep learning · YOLOv7 · Computer vision

Introduction

Smart retail management systems have emerged as a transformative approach that leverages cutting-edge technologies to enhance customer experiences [1], optimize operational efficiency, and boost sales [2, 3]. These systems integrate various technologies, including artificial intelligence, computer vision, and data analytics, to create an intelligent and data-driven retail environment.

Among the key components of smart retail management systems, video surveillance is crucial in monitoring and analyzing customer behavior, inventory management, and ensuring store security [4, 5]. Object representation and detection in video surveillance are of paramount importance, as they enable the system to accurately identify and track objects such as customers, products, and potential security threats in real time [6]. By effectively capturing and understanding the interactions and movements within the retail environment, these systems can provide valuable insights for retailers to make informed decisions and improve overall store performance.

Several algorithms have been employed to address object representation and detection challenges in video surveillance for smart retail management systems [7, 8]. Conventional computer vision (CV) methods have been widely used in the past [9–11]. However, with increased computing power and data availability, deep learning-based techniques have gained a lot of interest. Deep learning methods have demonstrated exceptional capacity to learn intricate representations straight from unprocessed data, resulting in object identification systems that are more reliable and accurate.

Deep learning (DL)-based approaches have been investigated by many researchers due to their exceptional performance in object representation and detection tasks [12, 13]. In particular, CNN-based models have demonstrated effective results in different CV tasks [14, 15]. This success stems from the ability of deep learning models to automatically learn hierarchical and discriminative features. Deep learning-based techniques have therefore emerged as the preferred method used by academics studying smart retail management systems.

Deep learning-based techniques have been remarkably successful, but they are not without difficulties, particularly
when used in smart retail management systems. The demanding requirements for high accuracy and low computation complexity pose significant hurdles. Achieving high accuracy in object detection without compromising real-time performance remains a major challenge. Additionally, the resource-intensive nature of deep learning models often leads to high computational requirements, making it essential to explore novel techniques to strike a balance between accuracy and efficiency.

The paper addresses the challenges of object representation and recognition in smart retail management systems through the introduction of a deep learning-based approach employing video analysis. The model uses cutting-edge deep learning techniques to overcome the shortcomings of conventional approaches and achieve accurate and efficient object recognition. To validate the relevance and applicability of the proposed technique, a new dataset tailored specifically for item detection in retail contexts is created, thus reinforcing the findings.

The research contributions are as follows:

1. Custom dataset generation: The first contribution is building a custom dataset specifically suited to the difficulties in object recognition and representation in video surveillance for intelligent retail management systems.
2. Efficient deep learning method: The second contribution of this study is the proposal of an efficient deep learning method that addresses the unique requirements of object detection in smart retail environments. This method aims to achieve high accuracy while considering computation complexity, making it suitable for real-time implementation.
3. Comprehensive performance evaluation: The third contribution entails a series of comprehensive experiments and performance assessments aimed at verifying the efficacy of the deep learning approach. The objective is to prove the superiority of the methodology over current methods through extensive testing.

The “Related Works” section of this study reviews relevant works. The third section discusses the proposed methodology. Results from the experiments are presented in section four, and the study is concluded in section five.

Related Works

Tonioni et al. [16] demonstrated a deep learning pipeline created especially for product recognition on retail shelves. The proposed pipeline utilizes advanced deep learning techniques to achieve accurate and efficient recognition of products placed on store shelves. The method demonstrates impressive results in terms of recognition accuracy and processing speed, making it a powerful tool for automating inventory management and optimizing retail operations. By leveraging deep learning, the pipeline provides a reliable solution for retailers to efficiently monitor their stock, improve shelf replenishment, and enhance customer satisfaction through streamlined shopping experiences.

Aslam and Curry [17] provide a survey on the use of deep learning and event-based middleware for object recognition in the Internet of Multimedia Things (IoMT). The method involves exploring various approaches and challenges related to object detection in the IoMT context, with a particular focus on leveraging deep learning techniques and event-based middleware for efficient and accurate detection. The survey provides an overview of the state-of-the-art methods and discusses the advantages of deep learning in IoMT applications. However, it also highlights the limitations and challenges faced in implementing object detection in this context, such as computational complexity and data privacy concerns. The paper offers valuable insights into the current landscape of object detection in IoMT and identifies potential research directions for further advancements.

Chen et al. [18] focus on out-of-store object detection based on deep learning techniques. The method involves developing a deep learning model to accurately detect objects located outside retail stores, which is crucial for security and monitoring purposes. The study explores various deep learning architectures and optimization techniques to achieve robust object detection performance. However, the study notes significant drawbacks, such as difficulties with occlusion, varying lighting, and the requirement for a sizable and varied dataset for efficient training. Notwithstanding these drawbacks, the suggested deep learning-based method exhibits encouraging outcomes in the precise identification of objects outside of stores, rendering it a useful instrument for augmenting security and monitoring in retail settings.

A deep learning system for the identification and recognition of supermarket products was created by Selvam and Koilraj [19]. The proposed method leverages advanced deep learning techniques to accurately detect and identify grocery products in images. The framework’s usefulness in identifying different supermarket products is demonstrated by its excellent accuracy and recall outcomes. By utilizing deep learning, the model achieves superior object detection and recognition performance, making it a valuable tool for automating inventory management and enhancing the shopping experience in grocery retail settings.

Proposed Method

This study proposes the development of a deep learning-based model utilizing YOLOv7 [20], a highly effective
object detection algorithm, for object detection within smart retail management systems. The model is trained on a custom dataset comprising diverse retail objects, and data augmentation techniques are applied to create variations in the images, covering different lighting and appearance scenarios. By leveraging YOLOv7’s real-time performance and high accuracy as shown in Fig. 1 [21], the aim is to enable efficient and accurate object detection, providing retailers with valuable insights for optimizing store operations, customer experiences, and security.

The custom dataset and YOLOv7-based model promise to enhance surveillance and monitoring capabilities in smart retail environments. The dataset’s diversity ensures robustness, and the real-time capabilities of YOLOv7 enable rapid decision-making. The study focuses on addressing the unique challenges faced in retail settings, and successful implementation of the proposed model is expected to contribute significantly to the advancement of smart retail technology. Retailers can benefit from accurate and real-time object detection, leading to better inventory management, customer insights, and overall store performance.

Figure 1 illustrates the architectural design of the YOLOv7 model, comprising two primary components: the backbone and head layers, each serving distinct functions. The backbone, composed of ELAN and MP1 layers linked to CBS blocks, is tasked with feature extraction from the input image; these blocks likely represent diverse operations for processing and transforming input data. This component generates three feature maps of varying scales, subsequently forwarded to the head layers. The head layers, on the other hand, play a pivotal role in prediction based on the extracted features from the backbone. Consisting of interconnected SPPCSPC, CBS, UPSample, Cat, Bags, REP, MP2, and CBM blocks, they further process the features and facilitate object detection predictions. Employing a grid-based methodology, the head layers divide the image into cells and assign bounding boxes and class probabilities to each cell, ultimately yielding a collection of bounding boxes with confidence scores and corresponding class labels for each identified object. Utilizing YOLOv7 for object detection involves passing an image through its backbone and head layers, resulting in the output of bounding boxes with confidence scores and class labels for all detected objects. The algorithm demonstrates real-time capability in detecting multiple objects of varying sizes and classes.

Dataset Annotation and Augmentation

For training the YOLOv7-based object detection model, a custom dataset of images captured within a smart retail environment was created. The dataset comprises various objects commonly found in retail settings, such as customers, products, and obstacles. Bounding boxes are used to annotate each picture in the dataset with the position and class label of all items present in the scene. The objects in the dataset are drawn from earrings and ringing objects. Earrings are decorative jewelry items designed to be worn on the earlobes, typically attached by a piercing or a clip. They come in various styles, materials, and designs, ranging from simple studs to elaborate dangling pieces. Earrings have been a popular form of personal adornment across cultures and time periods, serving both aesthetic and cultural
Fig. 1 The architecture of the YOLOv7 model
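The grid-based head described for Fig. 1 produces many candidate boxes with confidence scores and class labels. A minimal sketch of one common way such candidates are reduced to final detections is given below: low-confidence boxes are discarded and overlapping boxes of the same class are merged by greedy non-maximum suppression. The paper does not state which post-processing was used, so the `Detection` tuple layout, the function names, and the thresholds `conf_thres` and `iou_thres` are assumptions chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical detection layout for this sketch (not the YOLOv7 tensor format):
# (x1, y1, x2, y2, confidence, class_id) in pixel coordinates.
Detection = Tuple[float, float, float, float, float, int]


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(
    dets: List[Detection],
    conf_thres: float = 0.25,  # assumed value, not taken from the paper
    iou_thres: float = 0.45,   # assumed value, not taken from the paper
) -> List[Detection]:
    """Drop low-confidence boxes, then apply greedy per-class NMS."""
    candidates = sorted(
        (d for d in dets if d[4] >= conf_thres),
        key=lambda d: d[4],
        reverse=True,
    )
    kept: List[Detection] = []
    for det in candidates:
        # Keep a box unless it overlaps a higher-scoring box of the same class.
        if all(det[5] != k[5] or iou(det, k) <= iou_thres for k in kept):
            kept.append(det)
    return kept
```

Calling `filter_detections` on the raw candidates of one frame would return at most one box per object and class, which is the form of output (boxes, confidence scores, class labels) described above.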
purposes. Ringing can refer to the sound produced by bells or similar resonant objects. In the context of jewelry, ringing might describe the tinkling or jingling noise made by earrings when they move, particularly if they have dangling components. Figures 2 and 3 show the distribution of annotated and augmented data images in the dataset.

To broaden the diversity and expand the size of the dataset, data augmentation techniques were applied. This process entailed implementing various transformations to simulate a spectrum of scenarios and appearance variations encountered in real-world retail environments. Transformations included rotation, translation, scaling, and flipping, as well as adjustments to brightness and contrast that mimic variations in lighting conditions, among others. By augmenting the dataset, the model becomes more resilient to diverse situations, thereby contributing to enhanced object detection performance.

Data augmentation is employed to increase the diversity and size of the training dataset, thereby enhancing the robustness and generalization capability of machine learning models. By augmenting the dataset, variations in the input data such as different orientations, translations, scales, or other transformations are introduced, simulating real-world scenarios and increasing the model’s ability to recognize and adapt to different conditions. This technique is essential because it helps minimize overfitting and enhances the model’s performance on unseen data, especially when the original dataset is small or lacks variety.

As shown in Figs. 2 and 3, the image labels in a dataset refer to annotations or tags assigned to each image,
Fig. 2 The dataset’s image labels
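Two of the augmentations named in this section, flipping and brightness/contrast adjustment, can be sketched as follows. The sketch assumes grayscale images stored as nested lists and labels stored as normalized `(class_id, x_center, y_center, width, height)` tuples, as is common for YOLO-style annotation files. The paper does not specify its label format or augmentation parameters, so these types, the function names, and the `alpha` and `beta` values are illustrative assumptions.

```python
from typing import List, Tuple

# Assumed label layout: class_id, x_center, y_center, width, height,
# with all box values normalized to the range 0..1.
Label = Tuple[int, float, float, float, float]
# Assumed image layout: rows of grayscale pixel values in 0..255.
Image = List[List[int]]


def hflip(image: Image, labels: List[Label]) -> Tuple[Image, List[Label]]:
    """Mirror the image left to right and move each box centre accordingly."""
    flipped = [row[::-1] for row in image]
    # Only x_center changes; y_center, width, and height are unaffected.
    new_labels = [(c, 1.0 - x, y, w, h) for c, x, y, w, h in labels]
    return flipped, new_labels


def adjust_brightness_contrast(
    image: Image, alpha: float = 1.0, beta: float = 0.0
) -> Image:
    """Apply pixel' = alpha * pixel + beta, clipped to 0..255.

    alpha scales contrast and beta shifts brightness; the bounding boxes
    are unchanged because the geometry of the image is not altered.
    """
    return [
        [min(255, max(0, round(alpha * p + beta))) for p in row]
        for row in image
    ]
```

Geometric transformations such as the flip must update the bounding boxes together with the pixels, whereas photometric changes such as brightness and contrast leave the annotations untouched.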
Fig. 3 labels_correlogram of data in the dataset
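A correlogram of this kind summarizes pairwise relationships between label attributes. The sketch below computes Pearson correlations between the box attributes `x`, `y`, `width`, and `height`. It assumes that the figure was produced from these four normalized attributes, as YOLO training tools commonly do; the paper does not confirm this, and the function names and data layout are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple

# Assumed box layout: (x_center, y_center, width, height), normalized to 0..1.
Box = Tuple[float, float, float, float]
NAMES = ("x", "y", "width", "height")


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def label_correlations(boxes: List[Box]) -> Dict[Tuple[str, str], float]:
    """Correlation of every pair of box attributes across the dataset."""
    cols = list(zip(*boxes))  # one column per attribute
    return {
        (a, b): pearson(cols[i], cols[j])
        for i, a in enumerate(NAMES)
        for j, b in enumerate(NAMES)
    }
```

A strong correlation between `width` and `height`, for example, would indicate that most annotated objects share a similar aspect ratio.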
providing information about the objects, features, or categories present within the image. These labels serve as a crucial component for training machine learning models, enabling them to recognize and understand the content of images. On the other hand, “labels_correlogram” in the dataset suggests a correlation matrix or a visual representation showcasing the relationships and correlations between different labels or features within the dataset. Such correlograms are valuable for data analysis, helping researchers and practitioners gain insights into how various elements within the dataset are interconnected, ultimately informing the development and optimization of machine learning algorithms.

Dataset Split and Training, Validation, and Testing Modules

For the dataset split, 70% of the annotated and augmented dataset is assigned to the training set, which is used to train the YOLOv7 model through backpropagation and gradient descent. The model learns to detect objects within the images and refines its weights based on the ground-truth annotations.

The training set, comprising 70% of the annotated and augmented dataset, is essential to teaching the YOLOv7 model to correctly identify items in the photographs. In order to reduce detection errors during training, the model iteratively modifies its internal parameters (weights) via gradient descent and backpropagation. By learning from the ground-truth annotations provided in the training set, the model gains the ability to detect and distinguish various objects commonly found in retail environments. This learning process equips the model to make precise predictions when confronted with new, unseen images during real-time object detection in smart retail management systems.

The validation set, constituting 20% of the dataset, is held out from the training process. The model’s hyperparameters, namely the learning rate and the number of epochs, are set to 0.001 and 50, respectively. This fine-tuning process ensures that the model is optimized for the task at hand and generalizes well to unseen data. By continuously monitoring the
model’s progress in the validation set, researchers can make informed decisions to improve its overall performance, stability, and efficiency.

The testing set, which is entirely hidden throughout training and validation, is the last 10% of the dataset. Researchers can verify the efficacy and dependability of the model by evaluating its accuracy and generalization capabilities on the testing set. If the model performs well on the testing set, it demonstrates its readiness for practical implementation in smart retail environments. The testing module thus serves as a critical benchmark to verify that the proposed deep learning-based YOLOv7 model can effectively contribute to smart retail management systems by providing precise and efficient object detection capabilities.

Experimental Outcomes

The specifics of the experiment and an assessment of the created models’ performance gleaned from the comprehensive experimental findings are presented in this part. Three commonly used metrics are used to evaluate the object representation performance of the YOLOv7 model: precision, F1-score, and recall. The precision curve of the model is shown in Fig. 4.

The recall metric measures the model’s capacity to recognize every pertinent object in the pictures, as shown in Fig. 5. It is the ratio of the total number of true positive predictions in the dataset to the actual positive occurrences (ground-truth objects). A higher recall score ensures comprehensive object representation since it indicates that the model can detect most objects.

The F1-confidence results are displayed in Fig. 6. The F1 score serves as a helpful gauge of the model’s overall performance by striking a balance between precision and recall. In imbalanced datasets, that is, where there are comparatively more positive than negative instances, the F1 score proves to be very useful. A model that performs well and strikes a good balance between precise detections and thorough coverage is indicated by a high F1 score. The model in this study also has high precision and high recall.

The evaluation of the YOLOv7 model using precision, recall, and F1-score provides a comprehensive understanding of its performance in object representation. High precision indicates that the model is making fewer false positive predictions, ensuring that the detected objects are indeed present in the images. High recall indicates that most relevant items are successfully identified by the model, reducing the likelihood of missed detections. As the results indicate, the YOLOv7 model performs well in accurately representing objects in the images. Such performance ensures that the model can effectively detect and represent objects in a wide range of scenarios, making it a reliable and valuable tool for various applications, including smart retail management systems.

Conclusion

This paper underscores the crucial role of object representation and detection in smart retail management systems, emphasizing their impact on improving customer experiences and operational efficiency. A comprehensive literature review investigates various methods for object representation, with deep learning techniques standing
Fig. 4 Precision curve of the model
Fig. 5 Recall curve of the model
Fig. 6 F1-confidence curve of the model
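The precision, recall, and F1-confidence curves in Figs. 4 to 6 plot each metric against the confidence threshold applied to the detections. A minimal sketch of how such a curve can be computed is shown below, using the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R). It assumes every detection has already been matched against the ground truth and flagged as a true or false positive; the matching rule, the function names, and the example values are illustrative assumptions and not results from the paper.

```python
from typing import List, Sequence, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


def f1_confidence_curve(
    scored: Sequence[Tuple[float, bool]],  # (confidence, is_true_positive)
    num_gt: int,                           # number of ground-truth objects
    thresholds: Sequence[float],
) -> List[Tuple[float, float]]:
    """F1 score obtained when keeping only detections at or above each threshold."""
    curve = []
    for t in thresholds:
        tp = sum(1 for conf, is_tp in scored if conf >= t and is_tp)
        fp = sum(1 for conf, is_tp in scored if conf >= t and not is_tp)
        fn = num_gt - tp  # ground-truth objects left undetected at this threshold
        _, _, f1 = precision_recall_f1(tp, fp, fn)
        curve.append((t, f1))
    return curve
```

Raising the threshold typically removes false positives (higher precision) at the cost of missed objects (lower recall); the peak of the F1-confidence curve marks the threshold that best balances the two.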
out for their superior accuracy compared to traditional approaches, as evidenced by a numerical result of almost 0.99 for the F1-score. The outstanding performance of deep learning-based approaches in object recognition and representation is the study’s justification for their broad implementation. However, the major research problem in deep learning-based object representation is finding a compromise between obtaining high accuracy rates and keeping computation costs low. In response, this study proposes a YOLOv7-based deep learning model, which has been carefully trained, validated, and tested on the dataset. The experimental results confirm the efficacy of the approach, showcasing almost 0.99 for the F1-score and superior performance through extensive analysis. Future work could explore novel techniques to further optimize this delicate balance in deep learning-based object representation for smart retail management systems. Additionally, investigating the integration of advanced sensor technologies and multi-modal data fusion techniques holds promise for enhancing the capabilities of these systems in comprehensive object detection and representation.

Funding No funding.

Declarations
Conflict of interest The authors declare that there is no conflict of interest.

References

1. X. Fan, N. Ning, N. Deng, The impact of the quality of intelligent experience on smart retail engagement. Mark. Intell. Plan. 38, 877–891 (2020)
2. S. Shah, Y. Patel, K. Panchal, P. Gandhi, P. Patel, A. Desai, Python and MySQL based smart digital retail management system, in 2021 6th International Conference for Convergence in Technology (I2CT) (IEEE, 2021), pp. 1–6
3. S. Adapa, S.M. Fazal-e-Hasan, S.B. Makam, M.M. Azeem, G. Mortimer, Examining the antecedents and consequences of perceived shopping value through smart retail technology. J. Retail. Consum. Serv. 52, 101901 (2020)
4. G. Sreenu, S. Durai, Intelligent video surveillance: a review through deep learning techniques for crowd analysis. J. Big Data 6, 1–27 (2019)
5. T. Erlina, M. Fikri, A YOLO algorithm-based visitor detection system for small retail stores using single board computer. J. Appl. Eng. Technol. Sci. 4, 908–920 (2023)
6. W. Xu, Y. Zhai, A Yolo-based object monitoring approach for smart shops surveillance system. J. Opt. 1–8 (2023)
7. J.M. Eyu, Application development for product recognition on-shelf with deep learning (UTAR, 2022)
8. R. Schrijvers, S. Puttemans, T. Callemein, T. Goedemé, Real-time embedded person detection and tracking for shopping behavior analysis, in Advanced Concepts for Intelligent Vision Systems: 20th International Conference, ACIVS 2020, Auckland, Proceedings 20 (Springer, 2020), pp. 541–553
9. A. Aghamohammadi, M.C. Ang, E.A. Sundararajan, N.K. Weng, M. Mogharrebi, S.Y. Banihashem, A parallel spatiotemporal saliency and discriminative online learning method for visual target tracking in aerial videos. PLoS ONE 13, e0192246 (2018)
10. N. Rane, YOLO and faster R-CNN object detection for smart Industry 4.0 and Industry 5.0: applications, challenges, and opportunities. Available at SSRN 4624206 (2023)
11. M. Saqlain, S. Rubab, M.M. Khan, N. Ali, S. Ali, Hybrid approach for shelf monitoring and planogram compliance (hyb-smpc) in retails using deep learning and computer vision. Math. Probl. Eng. 2022, 1–18 (2022)
12. R. Ranjbarzadeh, S. Jafarzadeh Ghoushchi, M. Bendechache, A. Amirabadi, M.N. Ab Rahman, S. Baseri Saadi, A. Aghamohammadi, M. Kooshki Forooshani, Lung infection segmentation for COVID-19 pneumonia based on a cascade convolutional network from CT images. BioMed Res. Int. 2021, 1–16 (2021)
13. X. Wu, D. Sahoo, S.C. Hoi, Recent advances in deep learning for object detection. Neurocomputing 396, 39–64 (2020)
14. S.S.A. Zaidi, M.S. Ansari, A. Aslam, N. Kanwal, M. Asghar, B. Lee, A survey of modern deep learning based object detection models. Digital Signal Process. 126, 103514 (2022)
15. A. Aghamohammadi, S.A. Beheshti Shirazi, S.Y. Banihashem, S. Shishechi, R. Ranjbarzadeh, S. Jafarzadeh Ghoushchi, M. Bendechache, A deep learning model for ergonomics risk assessment and sports and health monitoring in self-occluded images. Signal Image Video Process. 1–13 (2023)
16. A. Tonioni, E. Serra, L. Di Stefano, A deep learning pipeline for product recognition on store shelves, in 2018 IEEE International Conference on Image Processing, Applications, and Systems (IPAS) (IEEE, 2018), pp. 25–31
17. A. Aslam, E. Curry, A survey on object detection for the internet of multimedia things (IoMT) using deep learning and event-based middleware: approaches, challenges, and future directions. Image Vis. Comput. 106, 104095 (2021)
18. J. Chen, Z. Wang, K.-h. Cheng, H.-b. Zheng, A.-t. Pan, Out-of-store object detection based on deep learning, in Proceedings of the 2019 11th International Conference on Machine Learning and Computing (2019), pp. 423–428
19. P. Selvam, J.A.S. Koilraj, A deep learning framework for grocery product detection and recognition. Food Anal. Methods 15, 3498–3522 (2022)
20. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 779–788
21. H. Du, W. Zhu, K. Peng, W. Li, Improved high-speed flame detection method based on YOLOv7. Open J. Appl. Sci. 12, 2004–2018 (2022)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.