Federated learning-driven collaborative recommendation system for multi-modal art analysis and enhanced recommendations

PeerJ Computer Science

Introduction

Background

With the rapid development of artificial intelligence technology, deep learning-based recommendation systems have been widely applied in various fields. From e-commerce product recommendations to music and video content recommendations, recommendation systems have become important tools to enhance user experience. In the art field, especially in the similarity search and recommendation systems for artworks, there are unique challenges (Wu et al., 2024; Mintie, 2023). Artworks often have high originality and commercial value, making data privacy and copyright protection issues particularly important (Nishioka, Hauke & Scherp, 2020; Xiong & Zhang, 2023; Ajmal et al., 2023).

Freelance artists and art galleries are the main creators and collectors of artworks. They want to increase the exposure and sales of their works through advanced recommendation systems. However, these institutions are reluctant to share their original artwork data directly due to concerns about data breaches and copyright infringement (White & Matulionyte, 2020; Peplow, 2021). Additionally, wealthy art collectors and museums are also primary consumers of artworks. These entities play a significant role in the art market, seeking advanced recommendation systems to enhance their collections while ensuring privacy and copyright protection.

The extensive use of AI painting tools has also raised concerns about copyright infringement of unpublished works. In recent years, data privacy and copyright protection issues have drawn widespread attention in the art field. Getty Images accused the AI company Stability AI of using millions of unauthorized image data to train its image generation model, which harmed the commercial value of its images (BakerHostetler, 2024). In the field of AI-generated artworks, a platform used more than 100,000 artworks without authorization to train its generation model. These works came from thousands of artists, harming the interests of original artists (Appel, Neelbauer & Schweidel, 2023). These cases highlight the severe challenges of data privacy and copyright protection in the art field. These issues make it urgent to achieve efficient artwork recommendation while protecting data privacy and copyrights.

This work aims to address these challenges by proposing a federated learning framework that enables multiple institutions to collaboratively train recommendation models without sharing their raw data. By leveraging federated learning, we can enhance the quality of similarity search and recommendation for artworks while ensuring that the data privacy and copyright protection concerns of freelance artists and art galleries are adequately addressed.

Related work

Numerous studies have explored the application of deep learning and federated learning in recommendation systems, emphasizing the importance of data privacy and copyright protection. For instance, Chen et al. (2023) reviewed the latest developments in deep reinforcement learning for recommendation systems but noted insufficient consideration of data privacy issues. Dong et al. (2022) investigated trust-aware recommendation systems, focusing on robustness and interpretability, yet their exploration of federated learning was limited. Lee & Kim (2022) proposed deep learning recommendation systems using cross-convolution filters to capture complex user-item interactions, though they did not adequately address cross-institution collaboration.

Recent advancements have highlighted the potential of federated learning to enhance recommendation systems while maintaining data privacy. Dong et al. (2023) introduced the FedSR framework for Point-of-Interest recommendation systems, addressing data sparsity and Non-Independent and Identically Distributed (Non-IID) issues through Contrastive Learning. van Berlo, Saeed & Ozcelebi (2020) developed a federated unsupervised representation learning architecture, demonstrating how federated learning can pre-train deep neural networks with unlabeled data to protect user privacy. Elayan, Aloqaily & Guizani (2021) proposed a Deep Federated Learning framework for decentralized healthcare systems, illustrating the benefits of federated learning in maintaining user privacy and improving model performance in sensitive data environments. To address these issues, a growing number of researchers are seeking innovative solutions. However, previous studies still have notable shortcomings, as summarized in Table 1.

Table 1:
Summary of related studies.
Author | Application scenario | Research content | Possible shortcomings
Chen et al. (2023) | Recommendation systems | Review of the application and latest developments of deep reinforcement learning in recommendation systems | Insufficient consideration of data privacy and copyright protection issues
Dong et al. (2022) | Trust-aware recommendation systems | Trust-aware recommendation systems from the perspective of deep learning, including social trust, robustness, and interpretability | Limited exploration of applications in multimodal data fusion and federated learning
Lee & Kim (2022) | Recommendation systems | Deep learning recommendation systems based on cross-convolution filters, capturing complex interactions between user and item features | Insufficient protection of data privacy and cross-institution collaboration
Park & Lee (2023) | Deep sleep recommendation systems | Personalized deep sleep recommendation using hybrid deep learning methods, combining user and collaborative filtering methods | No consideration of data protection and copyright issues
Jeong & Kim (2022) | Context-aware recommendation systems | Deep learning recommendation systems based on context features, combining neural networks and autoencoders for feature extraction and score prediction | Few applications for data privacy protection and cross-institution data collaboration
Vu & Le (2023) | Multi-criteria recommendation systems | Context-aware multi-criteria recommendation systems based on deep learning, using deep neural networks to predict ratings and learn aggregation functions | Insufficient consideration of privacy protection and copyright protection
Tegene et al. (2023) | Collaborative recommendation systems | Latent factor models based on deep learning and embedding to solve data sparsity issues and extract nonlinear features | Limited research on applications in federated learning and multimodal data fusion
Wu, Sun & Shang (2023) | Deep learning recommendation systems | Proposed DE-Opt framework to optimize hyperparameters of deep learning recommendation systems, improving recommendation accuracy and computational efficiency | Lack of exploration of multimodal data fusion and cross-institution data protection
Torkashvand, Jameii & Reza (2023) | Collaborative filtering recommendation systems | Systematic review of deep learning collaborative filtering recommendation systems, categorizing and analyzing existing methods and their advantages and disadvantages | Insufficient research on data privacy and cross-institution collaboration protection
Arthur et al. (2022) | Cross-domain recommendation systems | Proposed a discriminative geometric deep learning model to solve cold start and data sparsity issues in cross-domain recommendations | Less exploration of applications in multimodal data fusion and privacy protection
Dong et al. (2023) | POI recommendation systems | Proposed FedSR framework for POI-RS using sequential information and Contrastive Learning to address data sparsity and Non-IID issues in FL | Geographic and cultural differences affecting model training effectiveness
van Berlo, Saeed & Ozcelebi (2020) | Federated learning | Introduced federated unsupervised representation learning for pre-training deep neural networks with unlabeled data in a federated setting | Limited to scenarios where labeled data can be generated from user interaction
Elayan, Aloqaily & Guizani (2021) | Healthcare systems | Proposed Deep Federated Learning framework for decentralized healthcare systems to maintain user privacy and improve model performance | Model conversion time affecting quality of service to users
DOI: 10.7717/peerj-cs.2405/table-1

Our contribution

This study proposes a cross-institutional artwork similarity search and recommendation system—AICRS (AI-based Collaborative Recommendation System) framework, which combines multimodal data fusion and federated learning to address data privacy and copyright protection issues, as shown in Fig. 1. The main contributions of this article are as follows:

Figure 1: AICRS framework (Image source: Author’s own).

Figure 1 illustrates the AICRS framework, consisting of multiple components working together to provide a secure and efficient recommendation system. The framework involves various entities such as art galleries, artists, trading platforms, and self-employed individuals who contribute their data while maintaining control over their local models. The data, which remains with the participants, includes diverse artwork types (advertising, comics, publicity, art collections, etc.), represented by different colors for different types. The local models are trained on participant-owned data, and only the necessary parameters are exchanged between participants and the central trusted server. This server aggregates the parameters without exposing the original artwork, ensuring privacy and copyright protection while making customer recommendations based on their needs. The integration of these components ensures a collaborative yet secure environment for recommending artwork across institutions, leveraging the strengths of federated learning and multimodal data fusion.

  • We propose a Local Multi-Modal Feature Extraction and Aggregation algorithm (L-MFEA). It combines pre-trained convolutional neural networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT) models. This extracts rich features from image and text data, generating multi-modal feature vectors and improving the accuracy of the recommendation system.

  • We propose a Federated Model Parameter Aggregation and Optimization algorithm (F-MPAO). This algorithm trains models locally at each participating institution and aggregates parameters to optimize the global model. It effectively addresses privacy and security issues caused by centralized data storage.

  • Combining multimodal data fusion and federated learning strategies, we propose a cross-institutional artwork similarity search and recommendation system (AICRS framework). This improves the overall performance and recommendation effect of the recommendation system while ensuring data privacy and copyright protection.

The remainder of this article is organized as follows: “Research Issues” discusses the key research issues related to recommendation systems, federated learning, and model aggregation strategies. “Experiments and Results” presents the experimental setup and results, including a detailed analysis of the system’s performance. Finally, “Conclusion” concludes the article and highlights potential directions for future research.

Research issues

Protecting user data privacy and copyright is crucial in recommendation systems. Federated learning allows collaborative model training without sharing raw data, ensuring privacy and security. In art recommendation systems, this protects sensitive artwork data and copyrights. Combining federated learning with deep learning, we propose the AICRS (AI-based Collaborative Recommendation System) for efficient and secure artwork recommendations.

Privacy protection in art recommendation systems

The dataset of artworks used in this research is represented as $D=\{D_1, D_2, \ldots, D_N\}$. Each $D_i$ represents a local dataset of a participant (freelance artist or art gallery) containing $n_i$ artworks, i.e., $D_i=\{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$. Here, $x_{i,j}$ represents the feature of the $j$-th artwork of the $i$-th participant, and $y_{i,j}$ represents the corresponding label. Each participant trains their model locally and updates parameters by minimizing the following loss function:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta \nabla_{\theta} L_i(\theta_i^{(t)})$$
where $L_i(\theta) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\!\left(f(x_{i,j}; \theta), y_{i,j}\right)$ is the loss of the $i$-th participant, $\ell(\cdot,\cdot)$ is the loss function (e.g., mean squared error or cross-entropy), and $\eta$ is the learning rate. This update rule, commonly used in stochastic gradient descent, is discussed in various works, including Jeon et al. (2023). The global model is updated by aggregating the local model parameters of all participants:

$$\theta^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N} \theta_i^{(t+1)}$$
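As a minimal PyTorch-style sketch of one such round (not the authors' released code), the snippet below performs a local gradient update on a copy of the global model and then averages the client parameters without weighting; the model class, learning rate, and client data loaders are hypothetical placeholders.

```python
import copy
import torch

def local_update(global_model, loader, loss_fn, lr=0.001, epochs=1):
    """One client's local step: theta_i <- theta_i - eta * grad L_i(theta_i)."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def average_parameters(client_states):
    """Unweighted average of client parameters, matching the equation above."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    return avg
```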

In the recommendation system, we use deep learning models to extract features of artworks for similarity search and recommendation. Suppose the deep learning model is a convolutional neural network (CNN); its structure can be represented as:

$$h = \mathrm{CNN}(x; \theta)$$
where $h$ is the extracted high-dimensional feature vector, $x$ is the input artwork, and $\theta$ denotes the model parameters. For similarity search, we use cosine similarity to measure the similarity between two artwork feature vectors:

$$\mathrm{sim}(h_i, h_j) = \frac{h_i \cdot h_j}{\|h_i\|\,\|h_j\|}$$
where $h_i$ and $h_j$ represent the feature vectors of the $i$-th and $j$-th artworks, respectively, and $\|\cdot\|$ denotes the L2 norm of a vector. Cosine similarity is a widely used method for comparing feature vectors in recommendation systems, as detailed in Sankararaman et al. (2020). Our optimization objective is to maximize the performance of the recommendation system while protecting data privacy and copyrights. Specifically, the optimization objective can be expressed as:
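For illustration only, the cosine-similarity ranking step can be sketched with PyTorch's built-in cosine similarity; the query and gallery tensors are assumed to hold the feature vectors produced by the CNN above.

```python
import torch
import torch.nn.functional as F

def top_k_similar(query_feat, gallery_feats, k=5):
    """Rank gallery artworks by cosine similarity to a query feature vector."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)
    scores, indices = sims.topk(k)
    return scores, indices
```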

$$\min_{\theta} \frac{1}{N}\sum_{i=1}^{N} L_i(\theta) + \lambda R(\theta)$$
where $R(\theta)$ is the regularization term and $\lambda$ is the regularization coefficient. This formulation is a standard approach to incorporating regularization in machine learning, discussed extensively in Tian, Zhang & Zhang (2023).

Problem 1 Our core research question is how to achieve an efficient artwork recommendation system using federated learning and deep learning technologies while protecting the privacy and copyright of artwork data. The specific mathematical model is as follows:

$$\begin{aligned}
\min_{\theta} \quad & \frac{1}{N}\sum_{i=1}^{N} L_i(\theta) + \lambda R(\theta) \\
\text{subject to} \quad & \theta_i^{(t+1)} = \theta_i^{(t)} - \eta \nabla_{\theta} L_i(\theta_i^{(t)}) \\
& \theta^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N} \theta_i^{(t+1)} \\
& h = \mathrm{CNN}(x; \theta) \\
& \mathrm{sim}(h_i, h_j) = \frac{h_i \cdot h_j}{\|h_i\|\,\|h_j\|}
\end{aligned}$$

Local feature extraction and aggregation based on multimodal data fusion

Multimodal data fusion: local feature extraction and aggregation based on CNN and BERT models

  • Traditional unimodal algorithms, like CNN or LSTM, can only handle single-modal data (image or text). They cannot fully utilize the multimodal features of artworks (Daneshvar & Ravanmehr, 2022). They also cannot effectively capture details in artworks, leading to large prediction errors and high loss values, resulting in poor recommendation effects (Fachrela et al., 2023). Traditional methods need to store and process large amounts of data centrally, posing data privacy and copyright protection issues (Nithya, Geetha & Kumar, 2024).

  • The AICRS framework combines multimodal data fusion techniques (image and text features). It uses pre-trained CNN and BERT models to extract rich features from image and text data and then fuses them to generate multimodal feature vectors. This can better capture the details of artworks, significantly reduce prediction errors and loss values, and improve the accuracy of the recommendation system. By using federated learning strategies, model parameters are shared among multiple participating institutions instead of directly sharing data. This protects data privacy and copyright while improving the generalization ability of the model.

Multimodal data fusion

Suppose there are $N$ artworks. Each artwork $x_i$ contains image data $x_i^{\mathrm{img}}$ and text data $x_i^{\mathrm{text}}$. We first extract features from the image and text separately. For image feature extraction, we use a pre-trained CNN, such as a Residual Neural Network (ResNet), to extract the image feature vector:

$$h_i^{\mathrm{img}} = \sigma\!\left(\sum_{k=1}^{K}\sum_{u=1}^{U} w_k^{(u)} * x_i^{\mathrm{img}} + b_k^{(u)}\right)$$
where $h_i^{\mathrm{img}}$ represents the image feature vector of the $i$-th artwork, $w_k^{(u)}$ and $b_k^{(u)}$ are the weights and biases of the $k$-th convolution kernel in the $u$-th layer, $*$ denotes the convolution operation, and $\sigma$ denotes the activation function (such as the Rectified Linear Unit (ReLU)). Specifically, suppose the CNN contains $L$ convolution and pooling layers. The output of the $l$-th convolution and pooling layer can be represented as:

$$h_i^{(l)} = \sigma\!\left(\mathrm{Pool}\!\left(\sum_{m=1}^{M}\sum_{v=1}^{V} w_m^{(v,l)} * h_i^{(l-1)} + b_m^{(v,l)}\right)\right)$$
where $h_i^{(l-1)}$ is the output of the $(l-1)$-th layer, $\mathrm{Pool}$ denotes the pooling operation, and $w_m^{(v,l)}$ and $b_m^{(v,l)}$ are the weights and biases of the $m$-th convolution kernel in the $v$-th layer. The final image feature vector $h_i^{\mathrm{img}}$ is given by the output of the last layer of the CNN:

$$h_i^{\mathrm{img}} = \sigma\!\left(\sum_{n=1}^{N}\sum_{z=1}^{Z} w_n^{(z,L)} * h_i^{(L-1)} + b_n^{(z,L)}\right)$$
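A minimal sketch of this image branch, assuming a torchvision ResNet-50 backbone with its classification head replaced by an identity layer so that the pooled 2,048-dimensional output serves as $h_i^{\mathrm{img}}$; the preprocessing values are the standard ImageNet ones, not taken from the article.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ResNet-50 whose final classification layer is removed,
# so the network returns the pooled convolutional feature vector.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def image_features(pil_image):
    """Return a 2,048-dimensional image feature vector h_i^img."""
    x = preprocess(pil_image).unsqueeze(0)   # shape (1, 3, 224, 224)
    return resnet(x).squeeze(0)              # shape (2048,)
```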

For text feature extraction, we use a pre-trained BERT model to extract the text feature vector:

$$h_i^{\mathrm{text}} = \mathrm{BERT}(x_i^{\mathrm{text}}; \theta^{\mathrm{text}}) = \sum_{j=1}^{J}\sum_{q=1}^{Q} \alpha_j^{(q)} h_j^{\mathrm{embed}} + b_j^{(q)}$$
where $h_i^{\mathrm{text}}$ represents the text feature vector of the $i$-th artwork, $\alpha_j^{(q)}$ and $b_j^{(q)}$ are the attention weights and biases of the $j$-th word in the $q$-th layer, and $h_j^{\mathrm{embed}}$ represents the embedding vector of the $j$-th word. Specifically, BERT uses a multi-layer Transformer structure to extract text features. The Transformer architecture in BERT consists of multiple layers of self-attention and feed-forward neural networks, which enables it to capture complex dependencies and contextual information from the text data. For the $l$-th Transformer layer, the output can be represented as:

$$h_i^{\mathrm{text}(l)} = \mathrm{LayerNorm}\!\left(h_i^{\mathrm{text}(l-1)} + \mathrm{MultiHeadAttention}\!\left(h_i^{\mathrm{text}(l-1)}; \theta_{\mathrm{attn}}^{(l)}\right)\right)$$
where $h_i^{\mathrm{text}(l-1)}$ is the output of the $(l-1)$-th layer, $\mathrm{MultiHeadAttention}$ denotes the multi-head attention mechanism, and $\theta_{\mathrm{attn}}^{(l)}$ are the parameters of the $l$-th layer. The final text feature vector $h_i^{\mathrm{text}}$ is given by the output of the last layer of BERT:

$$h_i^{\mathrm{text}} = \mathrm{LayerNorm}\!\left(h_i^{\mathrm{text}(L-1)} + \mathrm{MultiHeadAttention}\!\left(h_i^{\mathrm{text}(L-1)}; \theta_{\mathrm{attn}}^{(L)}\right)\right)$$
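A minimal sketch of the text branch, assuming the Hugging Face bert-base-uncased checkpoint and using the [CLS] token of the last hidden layer as $h_i^{\mathrm{text}}$ (a common convention; the article does not specify the pooling strategy).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

@torch.no_grad()
def text_features(description, max_length=128):
    """Return a 768-dimensional text feature vector h_i^text from an artwork description."""
    tokens = tokenizer(description, truncation=True, max_length=max_length,
                       return_tensors="pt")
    outputs = bert(**tokens)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] embedding, shape (768,)
```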

To generate a comprehensive multimodal feature vector, we concatenate the image and text features extracted from CNN and BERT, then process them through a fully connected layer. The feature fusion process can be represented as:

$$h_i = \sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)$$
where $h_i$ represents the multimodal feature vector of the $i$-th artwork, $[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}]$ represents the concatenation of the image and text features, $W_{\mathrm{fusion}}$ is the weight matrix, $b_{\mathrm{fusion}}$ is the bias vector, and $\sigma$ is the activation function. Suppose the fully connected layer contains $M$ neurons. The computation can be represented as:

$$h_i^{(f)} = \sigma\!\left(\sum_{k=1}^{K}\sum_{r=1}^{R} W_k^{(r)}\left[h_{i,k}^{\mathrm{img}}; h_{i,k}^{\mathrm{text}}\right] + b_k^{(r)}\right)$$
where $h_i^{(f)}$ is the output of the fully connected layer, $W_k^{(r)}$ and $b_k^{(r)}$ are the weights and biases of the $k$-th neuron in the $r$-th layer, and $\sigma$ is the activation function (such as ReLU). The final multimodal feature vector $h_i$ can be represented as:

$$h_i = \sigma\!\left(\sum_{k=1}^{K}\sum_{r=1}^{R} W_k^{(r)}\left[h_{i,k}^{\mathrm{img}}; h_{i,k}^{\mathrm{text}}\right] + b_k^{(r)}\right)$$

We further normalize the feature vector $h_i$ to ensure balance between the different modalities:

$$\hat{h}_i = \frac{h_i}{\|h_i\|_2} = \frac{h_i}{\sqrt{\sum_{j=1}^{J} h_{i,j}^2} + \varepsilon}$$
where $\|h_i\|_2$ is the L2 norm of $h_i$ and $\varepsilon$ is a small constant to avoid division by zero. This normalization ensures that all feature vectors are on a uniform scale, improving the accuracy of subsequent processing and recommendation. The output $\{\hat{h}_i\}_{i=1}^{N}$ of this algorithm is used as input to the subsequent recommendation model.
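The fusion and normalization steps can be sketched as a small PyTorch module. The 2,048- and 768-dimensional inputs match the ResNet-50 and BERT-base outputs above, and the two hidden sizes (512, 256) follow the parameters reported later in Table 3; this is an illustrative reading of the L-MFEA aggregation step, not the authors' released code.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Concatenate image and text features, fuse through FC layers, then L2-normalize."""
    def __init__(self, img_dim=2048, text_dim=768, hidden_dims=(512, 256), eps=1e-8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(img_dim + text_dim, hidden_dims[0]),
            nn.ReLU(),
            nn.Linear(hidden_dims[0], hidden_dims[1]),
            nn.ReLU(),
        )
        self.eps = eps

    def forward(self, h_img, h_text):
        h = self.fc(torch.cat([h_img, h_text], dim=-1))              # fused vector h_i
        return h / (h.norm(p=2, dim=-1, keepdim=True) + self.eps)    # normalized \hat{h}_i
```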

Federated learning-based cross-institutional model parameter aggregation and optimization

Cross-institutional collaborative model: parameter aggregation and optimization based on federated learning framework

  • In cross-institutional data sharing, traditional methods cannot effectively protect the data ownership and copyright of each participating institution. This makes institutions reluctant to share data, affecting the overall performance of the model. Traditional model training methods require huge computational and storage resources when handling large-scale data, which can lead to single points of failure and poor system robustness (Wang & Kawagoe, 2018; Musto et al., 2010; Kim, Kang & Lee, 2019). Traditional algorithms also struggle to achieve efficient model parameter updates and optimization in cross-institutional cooperation, leading to slow convergence of the global model and limited performance improvement (Messina et al., 2019; Wang et al., 2019).

  • This algorithm aggregates the local model parameters of each participating institution through methods such as weighted averaging. It ensures the data ownership and copyright of each institution, enhances their willingness to cooperate, and improves the overall performance of the model. It also reduces reliance on a central server, improves system robustness, and avoids single points of failure. With improved optimization algorithms and effective parameter aggregation strategies, it significantly speeds up the convergence of the global model and improves performance.

Cross-institutional collaborative model

Suppose there are $N$ participating institutions. Each institution $i$ has a local dataset $D_i=\{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$, where $x_{i,j}$ is the $j$-th sample of the $i$-th institution and $y_{i,j}$ is its corresponding label. On the local server of each participating institution, the L-MFEA algorithm is used to extract multimodal features $h_{i,j}$:

$$h_{i,j} = \text{L-MFEA}(x_{i,j}; \theta_i)$$
where $\theta_i$ are the local model parameters of the $i$-th institution. Next, the model is trained on the local dataset to minimize the local loss function:

$$L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) + \lambda \|\theta_i\|_2^2$$
where $\ell(\cdot,\cdot)$ is the loss function (such as cross-entropy loss), $f$ is the model output, and $\lambda \|\theta_i\|_2^2$ is the regularization term.

Local model parameters are updated using the gradient descent algorithm:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\!\left(\nabla_{\theta_i} L_i(\theta_i^{(t)}) + \gamma \nabla_{\theta_i} R(\theta_i^{(t)})\right)$$
where $\eta$ is the learning rate, $R(\theta_i)$ is the regularization term, and $\gamma$ is the weight of the regularization term. In each round of federated learning, the central server aggregates the local model parameters of each participating institution. The aggregation method, weighted averaging, is defined as:

$$\theta^{(t+1)} = \frac{1}{\sum_{i=1}^{N} n_i} \sum_{i=1}^{N} n_i\, \theta_i^{(t+1)}$$
where $\theta^{(t+1)}$ are the global model parameters and $n = \sum_{i=1}^{N} n_i$ is the total amount of data from all participating institutions. The central server distributes the aggregated global model parameters $\theta^{(t+1)}$ back to each participating institution to update the local model parameters:

$$\theta_i^{(t+1)} = \theta^{(t+1)} \quad \text{for all } i = 1, 2, \ldots, N$$

Each participating institution, after receiving the updated global model parameters, continues to train on the local dataset to minimize the local loss function. Repeat local training and parameter aggregation until the global model converges.
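A minimal sketch of one F-MPAO aggregation round, assuming each client reports its parameter state dict and sample count $n_i$; the weighting follows the equation above, and the broadcast step simply copies the aggregated parameters back into each client model. This is an illustrative sketch, not the authors' implementation.

```python
import copy
import torch

def weighted_aggregate(client_states, client_sizes):
    """Weighted average of client parameters: theta = sum(n_i * theta_i) / sum(n_i)."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

def broadcast(global_state, client_models):
    """Distribute the aggregated global parameters back to every institution."""
    for model in client_models:
        model.load_state_dict(global_state)
```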

Theorem 1. Given a reasonable learning rate $\eta$ and sufficient iterations, the loss function $L(\theta)$ of the global model parameters $\theta$ will converge, assuming the local loss functions $L_i(\theta_i)$ of all participating institutions converge. Specifically, assume each local model satisfies the following condition during the iteration process:

$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right) + \frac{\eta^2}{2}(L_L + L_R)$$
where $L_L$ and $L_R$ are the Lipschitz constants of the loss function $L$ and the regularization term $R$, respectively, and $L^*$ is the global optimal loss value.

Corollary 1. Based on the above theorem, if the multimodal feature vector $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ and then passing the result through the fully connected layer, the final model parameter update can be represented as:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\left(\sum_{v=1}^{V} \nabla_{\theta_i}\, \ell\!\left(f\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{(\mathrm{img})} h_{i,p}^{\mathrm{img}} + W_{p,q}^{(\mathrm{text})} h_{i,q}^{\mathrm{text}} + b_{p,q}\right), y_{i,v}\right) + \lambda \theta_i\right)$$
where $\eta$ represents the learning rate, $W_{p,q}^{(\mathrm{img})}$ and $W_{p,q}^{(\mathrm{text})}$ represent the weight matrices of the image and text features, respectively, $b_{p,q}$ represents the bias vector, and $\lambda$ represents the regularization parameter.

AICRS framework: algorithm pseudocode and complexity analysis

The AICRS framework is presented as a comprehensive solution that integrates multiple components and processes to achieve efficient and secure artwork recommendations (Algorithm 1). While the overall structure is defined as a framework, the detailed implementation of its components and processes is expressed in the form of an algorithm. This approach allows us to provide a clear and precise description of the operational steps involved in the framework, ensuring reproducibility and facilitating practical application. By presenting the pseudocode, we aim to illustrate the exact sequence of operations, including data handling, model training, and federated learning procedures, thus bridging the conceptual framework with its practical execution.

Algorithm 1 :
AICRS framework.
Input: Local dataset $D_i=\{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$
Output: Global model parameters $\theta$
1 for each participating institution $i = 1$ to $N$ do
2   for each sample $j = 1$ to $n_i$ do
3    Extract multimodal features $h_{i,j}$ using Eq. (17);
4    Extract image features $h_i^{\mathrm{img}}$ using Eq. (7);
5    Extract text features $h_i^{\mathrm{text}}$ using Eq. (10);
6    Concatenate image and text features to generate the multimodal feature vector $h_i$ using Eq. (13);
7    Normalize the feature vector $h_i$ to obtain $\hat{h}_i$ using Eq. (41);
8  Train the model on the local dataset, minimizing the local loss function $L_i(\theta_i)$ using Eq. (55);
9  Update local model parameters $\theta_i^{(t+1)}$ using Eq. (65);
10 for each round of federated learning $t = 1$ to $T$ do
11  The central server aggregates the local model parameters of each participating institution using Eq. (48);
12  Distribute the aggregated global model parameters $\theta^{(t+1)}$ back to each participating institution using Eq. (21);
13  for each participating institution $i = 1$ to $N$ do
14    Continue training on the local dataset, minimizing the local loss function $L_i(\theta_i)$;
15    Update local model parameters $\theta_i^{(t+1)}$;
16 while the global model has not converged do
17  for each participating institution $i = 1$ to $N$ do
18   Continue training on the local dataset, minimizing the local loss function $L_i(\theta_i)$ using Eq. (55);
19   Update local model parameters $\theta_i^{(t+1)}$ using Eq. (65);
20  The central server aggregates the local model parameters of each participating institution using Eq. (48);
21  Distribute the aggregated global model parameters back to each participating institution using Eq. (21);
22 return Global model parameters $\theta$
DOI: 10.7717/peerj-cs.2405/table-5
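Read operationally, Algorithm 1 amounts to repeated rounds of local training followed by weighted aggregation and broadcast. The sketch below strings together the illustrative helpers defined earlier (local_update, weighted_aggregate, broadcast); the client data loaders, sample counts, and model class are hypothetical placeholders, and the fixed round count stands in for the convergence test.

```python
import copy
import torch.nn as nn

def aicrs_train(global_model, client_loaders, client_sizes,
                rounds=50, local_epochs=10, lr=0.001):
    """Illustrative AICRS-style federated loop: local training, aggregation, broadcast."""
    loss_fn = nn.CrossEntropyLoss()
    client_models = [copy.deepcopy(global_model) for _ in client_loaders]
    for _ in range(rounds):                      # global training rounds T
        client_states = [
            local_update(model, loader, loss_fn, lr=lr, epochs=local_epochs)
            for model, loader in zip(client_models, client_loaders)
        ]
        global_state = weighted_aggregate(client_states, client_sizes)
        global_model.load_state_dict(global_state)
        broadcast(global_state, client_models)   # send theta^{(t+1)} back to clients
    return global_model
```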

Suppose each participating institution has $n_i$ samples. The constant times for image feature extraction and text feature extraction are $C_{\mathrm{img}}$ and $C_{\mathrm{text}}$, respectively. The time complexity of each training iteration is $C_{\mathrm{train}}$, the number of federated learning rounds is $T$, and the number of model parameters is $P$. Considering these factors, the time complexity is $O(T N^2 n_i I)$. For space complexity, each participating institution stores its local dataset $D_i$ and model parameters $\theta_i$, and the central server stores the global model parameters $\theta$ and the model parameters of each participating institution. Suppose the dimension of each sample is $D$. The space complexity is then $O(N n_i D + N^2 P)$.

To further analyze the performance of the proposed framework, we compared the AICRS framework with other state-of-the-art models, as shown in Table 2. The time complexity of AICRS is $O(T N^2 n_i I)$, which has better scalability compared to Karayel's (2023) $O(\varepsilon^{-2}\ln(\delta^{-1}) + \ln n)$ and Liu & Yu's (2022) $O(2^n/\log n)$; in particular, AICRS is more efficient when handling large-scale data and multiple institutions. Additionally, the space complexity of AICRS is $O(N n_i D + N^2 P)$, which is much lower than Liu & Yu's (2022) $O(2^n)$ and Karayel's (2023) $O(\varepsilon^{-2}\ln(\delta^{-1}) + \ln n)$. This significantly reduces storage requirements while maintaining performance. Overall, the optimization in time and space complexity of the AICRS algorithm makes it more advantageous in large-scale distributed federated learning systems.

Table 2:
Comparison of time and space complexities.
Related study | Time complexity | Space complexity
Karayel (2023) | $O(\varepsilon^{-2}\ln(\delta^{-1}) + \ln n)$ | $O(\varepsilon^{-2}\ln(\delta^{-1}) + \ln n)$
Liu & Yu (2022) | $O(2^n/\log n)$ | $O(2^n)$
Malandrino et al. (2021) | $O(N^3)$ | Variable
Zhou et al. (2023) | $O(P \log P)$ | $O(NP)$
AICRS | $O(T N^2 n_i I)$ | $O(N n_i D + N^2 P)$
DOI: 10.7717/peerj-cs.2405/table-2

Experiments and results

This section describes the dataset and experimental parameters used in our study. It provides an overview of the SemArt dataset, details the data split for training, validation, and testing, and outlines the experimental parameters set for the training and evaluation of the recommendation models. Additionally, it presents the results of the AICRS framework application, highlighting its performance compared to traditional CNN and LSTM models.

Dataset and experimental parameters

The SemArt dataset is specifically designed for the analysis and recommendation of artworks. It contains 21,384 images of artworks and their related textual descriptions. Each image comes with detailed text descriptions, including the title, artist, creation year, art style, and descriptive text (https://doi.org/10.17036/researchdata.aston.ac.uk.00000380).

To train and test the recommendation system, we split the dataset by art style. Each art style acts as a client for federated learning training. The training set includes a portion of the artworks from each style for model training. The test set includes the remaining artworks for evaluating the model’s recommendation accuracy. The accuracy of the recommendations is evaluated by comparing whether the recommended artworks match the actual artist’s style and type. The specific data content is shown in Fig. 2.

Figure 2: SemArt dataset samples, showing the multiple types of art images in the dataset and their corresponding comments and attributes (Image source: SemArt Dataset (https://doi.org/10.17036/researchdata.aston.ac.uk.00000380), licensed under CC BY-NC 4.0).

The dataset is split into training, validation, and test sets with the following percentages: 70% for training, 15% for validation, and 15% for testing. This split ensures that the model is adequately trained, validated, and tested to achieve reliable performance metrics. Our experiments were conducted using high-performance hardware to ensure efficient computation and model training.
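As an illustration of this setup, the snippet below partitions a metadata table by art style (each style acting as a federated client) and then splits each client's records 70/15/15; the column name "art_style" is hypothetical, since the exact SemArt field names are not reproduced here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_by_style(metadata: pd.DataFrame, seed=42):
    """Partition records by art style (one client per style), then 70/15/15 per client."""
    clients = {}
    for style, group in metadata.groupby("art_style"):   # hypothetical column name
        train, rest = train_test_split(group, test_size=0.30, random_state=seed)
        val, test = train_test_split(rest, test_size=0.50, random_state=seed)
        clients[style] = {"train": train, "val": val, "test": test}
    return clients
```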

Our experimental parameters are set as shown in Table 3. After setting these detailed experimental parameters, we trained and tested these algorithms on the SemArt dataset. The SemArt dataset includes images of artworks and their detailed textual descriptions from various art categories, covering multiple art styles from the 11th to the 20th century. By conducting experiments on this dataset, we can evaluate the effectiveness and performance of the AICRS framework in processing and recommending artworks.

Table 3:
Detailed experimental parameters.
Parameter name | Parameter value | Parameter name | Parameter value
Dataset | SemArt | Number of images | 21,384
Image resolution | 224 × 224 | Text description | Each image includes title, artist, creation year, art style, descriptive text
Categories | Painting, sculpture, photography, etc. | Time span | 11th to 20th century
Art styles | Baroque, Renaissance, Impressionism, modern art, etc. | Model structure (image feature extraction) | ResNet-50
Input size (image feature extraction) | 224 × 224 × 3 | Number of convolution layers (image feature extraction) | 50
Activation function (image feature extraction) | ReLU | Model structure (text feature extraction) | BERT-base
Number of hidden layers (text feature extraction) | 12 | Number of hidden units (text feature extraction) | 768
Number of attention heads (text feature extraction) | 12 | Input length (text feature extraction) | 128
Number of layers (feature fusion) | 2 | Number of neurons (feature fusion) | 512, 256
Activation function (feature fusion) | ReLU | Local training epochs | 10
Global training epochs | 50 | Learning rate | 0.001
Optimization algorithm | Adam | Batch size | 32
DOI: 10.7717/peerj-cs.2405/table-3

AICRS framework application results

This study focuses on developing a cross-institutional artwork recommendation system based on federated learning. It uses multimodal data fusion (image and text features) to enhance the performance of the recommendation system. The model accuracy measures the ability of the recommendation system to recommend the correct artworks, while the loss value reflects the prediction error during training.

Figure 3 shows the accuracy performance of the four models over 50 training rounds. The results indicate that the AICRS framework has significant advantages in processing and recommending artworks, surpassing traditional CNN and LSTM models, as well as the Federated Averaging (FED-AVG) model. The CNN model’s final accuracy after 50 training rounds is 81.52%. Accuracy grows rapidly in the early stages but stabilizes, maintaining above 80% from round 25 and reaching 81.52% at the end. The LSTM model’s final accuracy after 50 training rounds is 83.44%. The LSTM model also shows rapid accuracy growth initially, reaching about 82.63% around round 22, then slightly improving to 83.44%. The FED-AVG model’s final accuracy after 50 training rounds is 84.44%. The FED-AVG model demonstrates strong performance, surpassing the LSTM model in the later stages, with accuracy steadily increasing and reaching 84.44% at the end. The AICRS framework’s final accuracy after 50 training rounds is 92.02%. The AICRS framework performs significantly better than CNN, LSTM, and FED-AVG from the start, exceeding 90% accuracy before round 30 and reaching 92.02% at the end.

Figure 3: Comparison of the models' accuracy values over 50 training rounds.

Figure 4 shows the loss values of the four models (CNN, LSTM, AICRS framework, and FED-AVG) over 50 training rounds. The results show that the AICRS framework has significantly lower loss values in processing and recommending artworks compared to traditional CNN and LSTM models, as well as the FED-AVG model. Specifically, the CNN model's final loss value after 50 training rounds is 0.248. Its loss decreases rapidly in the early stages but stabilizes, staying below 0.26 from round 30 and reaching 0.248 at the end. The LSTM model's final loss value after 50 training rounds is 0.188. The LSTM model also shows a rapid decrease in loss value initially, reaching about 0.22 around round 25, then slightly decreasing to 0.188. The FED-AVG model's final loss value after 50 training rounds is 0.168. The FED-AVG model performs better than the LSTM model, showing a steady decrease in loss value and reaching 0.168 at the end. The AICRS framework's final loss value after 50 training rounds is 0.1284. The AICRS framework performs significantly better than CNN, LSTM, and FED-AVG from the start, dropping below a loss value of 0.20 before round 20 and reaching 0.1284 at the end. The AICRS framework thus shows significant advantages in both accuracy and loss, improving recommendation performance and reducing prediction errors in the artwork recommendation system.

Figure 4: Comparison of the models' loss values over 50 training rounds.

Table 4 shows the performance comparison of the three models, including accuracy, loss value and training time. It is evident that the AICRS framework outperforms the CNN and LSTM models in accuracy and loss value. Although its training time is slightly longer, the performance improvement is significant. The training time for each model was measured by recording the duration from the start to the completion of the training process on the same hardware configuration. Specifically, we used a server equipped with dual NVIDIA Tesla V100 GPUs, each with 32 GB of memory, and 256 GB of system RAM. The server is powered by dual Intel Xeon Gold 6258R CPUs, each with 28 cores, providing ample processing power for both training and inference phases. Each model was trained using the same dataset, batch size, and number of epochs to ensure a fair comparison. The reported training times are the average of three runs to account for any variability in the training process. The slight increase in training time for the AICRS framework is justified by its superior performance metrics, indicating a worthwhile trade-off for the significant gains in accuracy and reduced loss value.

Table 4:
Model performance comparison.
Model | Accuracy (%) | Loss value | Training time (ms)
CNN | 81.52 | 0.248 | 3,962
LSTM | 83.44 | 0.188 | 4,185
AICRS | 92.02 | 0.1284 | 4,577
DOI: 10.7717/peerj-cs.2405/table-4

Discussion

The rapid development of artificial intelligence has led to the widespread use of recommendation systems in various fields. In the art sector, implementing similarity search and recommendation systems poses unique challenges due to the high originality and commercial value of artworks. This necessitates robust data privacy and copyright protection. Freelance artists and art galleries aim to increase the exposure and sales of their works through advanced recommendation systems but are often hesitant to share their original artwork data due to concerns about data breaches and copyright infringement.

This study proposes a cross-institutional artwork similarity search and recommendation system (AICRS framework) that combines multimodal data fusion and federated learning to address data privacy and copyright protection issues. The experimental results show that the AICRS framework has significant advantages in processing and recommending artworks, surpassing traditional CNN and LSTM models.

The main reasons for the performance differences among the models are as follows:

  • The AICRS framework combines multimodal data fusion (image and text features), extracting richer features of artworks compared to single-modal CNN and LSTM models. This improves the accuracy of the recommendation system. The CNN and LSTM models’ final accuracy after 50 training rounds are 81.52% and 83.44%, respectively. The AICRS framework’s final accuracy reaches 92.02%. This shows that multimodal data fusion has significant advantages in capturing the details and features of artworks.

  • The AICRS framework uses pre-trained ResNet-50 and BERT models, which have better feature extraction and representation capabilities than traditional CNN and LSTM models. These models capture the details of artworks more effectively. In terms of loss value, the AICRS framework also performs better than other models. The CNN and LSTM models’ final loss values after 50 training rounds are 0.248 and 0.188, respectively. The AICRS framework’s final loss value is 0.1284. The more complex model structure enables the AICRS framework to handle complex features and reduce errors more effectively.

  • The AICRS framework adopts federated learning strategies, sharing model parameters among multiple participating institutions instead of directly sharing data. This improves the model’s generalization ability and privacy protection. This strategy not only protects the data privacy of participating institutions but also enhances the overall performance of the model. The training time of the AICRS framework is slightly longer than other models (4,577 ms), but the performance improvement is significant.

Conclusion

This article proposes a cross-institutional artwork similarity search and recommendation system (AICRS framework) that combines multimodal data fusion and federated learning to address data privacy and copyright protection issues. The system uses pre-trained convolutional neural network (CNN) and BERT models to extract rich features from image and text data. It trains models locally at each participating institution and aggregates parameters through a federated learning framework to optimize the global model. The experimental results demonstrate that the AICRS framework has significant advantages in processing and recommending artworks, improving recommendation performance and reducing prediction errors. It enables collaboration among art institutions, offering accurate recommendations while complying with data protection regulations. The AICRS framework still has room for improvement in its reliance on high-quality multimodal data. Future research will explore enhancing the system's robustness in cases of incomplete or low-quality data, as well as expanding the framework to support real-time recommendations across more diverse types of art media.

Appendix: mathematical theorems and corollary proofs

Theorem 2. Let $h_i^{\mathrm{img}}$ and $h_i^{\mathrm{text}}$ be the image and text feature vectors, respectively, let $[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}]$ be their concatenation, and let $h_i$ be the multimodal feature vector. The normalized multimodal feature vector $\hat{h}_i$ satisfies:

$$\hat{h}_i = \frac{\sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)}{\sqrt{\sum_{k=1}^{K}\left(\sum_{j=1}^{J} \sigma\!\left(\sum_{m=1}^{M} W_{m,j}^{(\mathrm{img})} h_{i,m}^{\mathrm{img}} + \sum_{n=1}^{N} W_{n,j}^{(\mathrm{text})} h_{i,n}^{\mathrm{text}} + b_{j,\mathrm{fusion}}\right)\right)_k^2} + \varepsilon}$$

where $\sigma$ is the activation function, $W_{\mathrm{fusion}}$ is the weight matrix of the fully connected layer, $b_{\mathrm{fusion}}$ is the bias vector, and $\varepsilon$ is a small constant to avoid division by zero.

Proof 1. First, define the optimization problem as follows:

$$\min_{\{h_i\}_{i=1}^{N}} \sum_{i=1}^{N} \|h_i - \hat{h}_i\|_2^2 + \lambda \sum_{i=1}^{N} \|h_i\|_2^2$$

where $\lambda > 0$ is the regularization parameter.

Expand the objective function as:

$$J\!\left(\{h_i\}_{i=1}^{N}\right) = \sum_{i=1}^{N}\left(\|h_i - \hat{h}_i\|_2^2 + \lambda \|h_i\|_2^2\right)$$

We can consider each term individually, i.e., optimize for each $h_i$. For each $h_i$, the optimization problem is:

$$\min_{h_i} \|h_i - \hat{h}_i\|_2^2 + \lambda \|h_i\|_2^2$$

Expanding the above expression and taking the derivative:

$$\nabla_{h_i}\left(\|h_i - \hat{h}_i\|_2^2 + \lambda \|h_i\|_2^2\right) = 2(h_i - \hat{h}_i) + 2\lambda h_i$$

Setting the derivative to zero, we get:

$$2(h_i - \hat{h}_i) + 2\lambda h_i = 0 \quad \Longrightarrow \quad h_i(1+\lambda) = \hat{h}_i$$

Solving for $h_i$ gives:

$$h_i = \frac{\hat{h}_i}{1+\lambda}$$

To obtain the normalized multimodal feature vector $\hat{h}_i$, we normalize the feature vector $h_i$:

$$\hat{h}_i = \frac{h_i}{\|h_i\|_2} = \frac{h_i}{\sqrt{\sum_{j=1}^{J} h_{i,j}^2} + \varepsilon}$$

where $\|h_i\|_2$ is the L2 norm of $h_i$ and $\varepsilon$ is a small constant to avoid division by zero.

By definition, we have:

$$h_i = \sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)$$

Substituting $\frac{\hat{h}_i}{1+\lambda}$ into the normalization equation and combining it with the fusion equation, we obtain the final normalized multimodal feature vector:

$$\hat{h}_i = \frac{\sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)}{\sqrt{\sum_{k=1}^{K}\left(\sum_{j=1}^{J} \sigma\!\left(\sum_{m=1}^{M} W_{m,j}^{(\mathrm{img})} h_{i,m}^{\mathrm{img}} + \sum_{n=1}^{N} W_{n,j}^{(\mathrm{text})} h_{i,n}^{\mathrm{text}} + b_{j,\mathrm{fusion}}\right)\right)_k^2} + \varepsilon}$$

In summary, the optimization problem has a unique solution that satisfies the equation, and the theorem is proved.

Corollary 2. Based on the above theorem, if the multimodal feature vector $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ and then computing through a fully connected layer, the normalized multimodal feature vector $\hat{h}_i$ can be expressed as follows:

$$\hat{h}_i = \frac{\sigma\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{\mathrm{fusion}}\left[h_{i,p}^{\mathrm{img}}; h_{i,q}^{\mathrm{text}}\right] + b_{p,q}^{\mathrm{fusion}}\right)}{\sqrt{\sum_{r=1}^{R}\left(\sum_{s=1}^{S}\sigma\!\left(\sum_{t=1}^{T} W_{t,s}^{(\mathrm{img})} h_{i,t}^{\mathrm{img}} + \sum_{u=1}^{U} W_{u,s}^{(\mathrm{text})} h_{i,u}^{\mathrm{text}} + b_{s,\mathrm{fusion}}\right)\right)_r^2} + \varepsilon}$$

where $\sigma$ is the activation function, $W_{p,q}^{\mathrm{fusion}}$ is the weight matrix of the fully connected layer, $b_{p,q}^{\mathrm{fusion}}$ is the bias vector, and $\varepsilon$ is a small constant to avoid division by zero.

Proof 2. First, consider the calculation process of the multimodal feature vector $h_i$. Assume $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$, and then computing through a fully connected layer:

$$h_i = \sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)$$

where $W_{\mathrm{fusion}}$ is the weight matrix of the fully connected layer, $b_{\mathrm{fusion}}$ is the bias vector, and $\sigma$ is the activation function.

We further refine the calculation of the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$. Assume the image feature vector $h_i^{\mathrm{img}}$ is obtained through a convolutional neural network:

$$h_i^{\mathrm{img}} = \sigma\!\left(\sum_{m=1}^{M} W_m^{(\mathrm{img})} * x_i^{\mathrm{img}} + b_m^{(\mathrm{img})}\right)$$

where $W_m^{(\mathrm{img})}$ and $b_m^{(\mathrm{img})}$ are the weights and biases of the convolution kernels, and $x_i^{\mathrm{img}}$ is the input image data.

Similarly, assume the text feature vector $h_i^{\mathrm{text}}$ is obtained through a pre-trained BERT model:

$$h_i^{\mathrm{text}} = \mathrm{BERT}(x_i^{\mathrm{text}}; \theta^{\mathrm{text}}) = \sigma\!\left(\sum_{n=1}^{N} W_n^{(\mathrm{text})} x_i^{\mathrm{text}} + b_n^{(\mathrm{text})}\right)$$

where $W_n^{(\mathrm{text})}$ and $b_n^{(\mathrm{text})}$ are the weights and biases of the BERT model, $x_i^{\mathrm{text}}$ is the input text data, and $\theta^{\mathrm{text}}$ denotes the parameters of the BERT model.

Next, we concatenate the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ to obtain the multimodal feature vector $h_i$:

$$h_i = \left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right]$$

Substituting Eqs. (36) and (37) into Eq. (38), we get:

$$h_i = \left[\sigma\!\left(\sum_{m=1}^{M} W_m^{(\mathrm{img})} * x_i^{\mathrm{img}} + b_m^{(\mathrm{img})}\right);\; \sigma\!\left(\sum_{n=1}^{N} W_n^{(\mathrm{text})} x_i^{\mathrm{text}} + b_n^{(\mathrm{text})}\right)\right]$$

Then, computing through the fully connected layer, we obtain the fused multimodal feature vector $h_i$:

$$h_i = \sigma\!\left(W_{\mathrm{fusion}}\left[\sigma\!\left(\sum_{m=1}^{M} W_m^{(\mathrm{img})} * x_i^{\mathrm{img}} + b_m^{(\mathrm{img})}\right);\; \sigma\!\left(\sum_{n=1}^{N} W_n^{(\mathrm{text})} x_i^{\mathrm{text}} + b_n^{(\mathrm{text})}\right)\right] + b_{\mathrm{fusion}}\right)$$

To ensure the balance of the feature vectors, we normalize them:

$$\hat{h}_i = \frac{h_i}{\|h_i\|_2} = \frac{h_i}{\sqrt{\sum_{r=1}^{R} h_{i,r}^2} + \varepsilon}$$

where $\|h_i\|_2$ is the L2 norm of $h_i$ and $\varepsilon$ is a small constant to avoid division by zero.

Substituting Eq. (40) into Eq. (41), we get the normalized multimodal feature vector:

$$\hat{h}_i = \frac{\sigma\!\left(W_{\mathrm{fusion}}\left[\sigma\!\left(\sum_{m=1}^{M} W_m^{(\mathrm{img})} * x_i^{\mathrm{img}} + b_m^{(\mathrm{img})}\right);\; \sigma\!\left(\sum_{n=1}^{N} W_n^{(\mathrm{text})} x_i^{\mathrm{text}} + b_n^{(\mathrm{text})}\right)\right] + b_{\mathrm{fusion}}\right)}{\sqrt{\sum_{r=1}^{R} (h_{i,r})^2} + \varepsilon}$$

Further expanding the expression of $h_{i,r}$:

$$h_{i,r} = \sum_{s=1}^{S}\sigma\!\left(\sum_{t=1}^{T} W_{t,s}^{(\mathrm{img})} h_{i,t}^{\mathrm{img}} + \sum_{u=1}^{U} W_{u,s}^{(\mathrm{text})} h_{i,u}^{\mathrm{text}} + b_{s,\mathrm{fusion}}\right)$$

Substituting Eq. (43) into Eq. (44), we get the final normalized multimodal feature vector:

$$\hat{h}_i = \frac{\sigma\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{\mathrm{fusion}}\left[h_{i,p}^{\mathrm{img}}; h_{i,q}^{\mathrm{text}}\right] + b_{p,q}^{\mathrm{fusion}}\right)}{\sqrt{\sum_{r=1}^{R}\left(\sum_{s=1}^{S}\sigma\!\left(\sum_{t=1}^{T} W_{t,s}^{(\mathrm{img})} h_{i,t}^{\mathrm{img}} + \sum_{u=1}^{U} W_{u,s}^{(\mathrm{text})} h_{i,u}^{\mathrm{text}} + b_{s,\mathrm{fusion}}\right)\right)_r^2} + \varepsilon}$$

In summary, the corollary is proved.

Theorem 3. Given a reasonable learning rate $\eta$ and sufficient iterations, if the local loss functions $L_i(\theta_i)$ of all participating institutions converge, the loss function $L(\theta)$ of the global model parameters $\theta$ will also converge. Specifically, suppose each local model satisfies the following condition during the iteration process:

$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right) + \frac{\eta^2}{2}(L_L + L_R)$$

where $L_L$ and $L_R$ are the Lipschitz constants of the loss function $L$ and the regularization term $R$, respectively, and $L^*$ is the global optimal loss value.

Proof 3. First, we consider the local loss function $L_i(\theta_i)$ of each participating institution $i$:

$$L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) + \lambda \|\theta_i\|_2^2$$

The local model parameters are updated using the gradient descent algorithm:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\!\left(\nabla_{\theta_i} L_i(\theta_i^{(t)}) + \gamma \nabla_{\theta_i} R(\theta_i^{(t)})\right)$$

where $\eta$ is the learning rate, $R(\theta_i)$ is the regularization term, and $\gamma$ is the weight of the regularization term.

Next, we consider the aggregation process of the global model parameters. The local model parameters of each participating institution are aggregated by weighted averaging:

$$\theta^{(t+1)} = \frac{1}{\sum_{i=1}^{N} n_i} \sum_{i=1}^{N} n_i\, \theta_i^{(t+1)}$$

where $\theta^{(t+1)}$ are the global model parameters and $n = \sum_{i=1}^{N} n_i$ is the total amount of data from all participating institutions.

Then, we analyze the changes in the global loss function $L(\theta)$. The global loss function is defined as:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{N} n_i\, L_i(\theta_i)$$

where $L_i(\theta_i)$ is the local loss function of the $i$-th participating institution.

Because the local loss functions of each participating institution are optimized based on the same global model parameters, we have:

$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right)$$

where the term $\eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right)$ represents the decrease in the loss function during gradient descent.

Considering the Lipschitz continuity of the loss function and the regularization term, we get

$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right) + \frac{\eta^2}{2}(L_L + L_R)$$

where $L_L$ and $L_R$ are the Lipschitz constants of the loss function $L$ and the regularization term $R$, respectively.

Due to the convergence of the local loss functions $L_i(\theta_i)$, we have:

$$\lim_{t \to \infty} L_i(\theta_i^{(t)}) = L_i^*$$

where $L_i^*$ is the local optimal loss value of the $i$-th participating institution.

Therefore, the global loss function $L(\theta)$ will also converge to its optimal value:

$$\lim_{t \to \infty} L(\theta^{(t)}) = L^*$$

where $L^*$ is the global optimal loss value.

In summary, given a reasonable learning rate $\eta$ and sufficient iterations, if the local loss functions $L_i(\theta_i)$ of all participating institutions converge, the loss function $L(\theta)$ of the global model parameters $\theta$ will also converge. The theorem is proved.

Corollary 1. Based on the above theorem, if the multimodal feature vector $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ and then computing through a fully connected layer, the final model parameter update can be expressed as:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\left(\sum_{v=1}^{V} \nabla_{\theta_i}\, \ell\!\left(f\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{(\mathrm{img})} h_{i,p}^{\mathrm{img}} + W_{p,q}^{(\mathrm{text})} h_{i,q}^{\mathrm{text}} + b_{p,q}\right), y_{i,v}\right) + \lambda \theta_i\right)$$

where $\eta$ is the learning rate, $W_{p,q}^{(\mathrm{img})}$ and $W_{p,q}^{(\mathrm{text})}$ are the weight matrices for the image and text features, respectively, $b_{p,q}$ is the bias vector, and $\lambda$ is the regularization parameter.

Proof 4. First, we consider the local loss function $L_i(\theta_i)$ for each participating institution $i$:

$$L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) + \lambda \|\theta_i\|_2^2$$

where $\ell(\cdot,\cdot)$ is the loss function and $f$ is the model output.

Assume the multimodal feature vector $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ and then computing through a fully connected layer. That is:

$$h_{i,j} = \left[h_{i,j}^{\mathrm{img}}; h_{i,j}^{\mathrm{text}}\right]$$

Computed through a fully connected layer:

$$h_{i,j} = \sigma\!\left(W\left[h_{i,j}^{\mathrm{img}}; h_{i,j}^{\mathrm{text}}\right] + b\right)$$

where $W$ is the weight matrix, $b$ is the bias vector, and $\sigma$ is the activation function.

Assume $W$ can be decomposed into the weight matrices for the image and text features, $W^{(\mathrm{img})}$ and $W^{(\mathrm{text})}$:

$$W = \left[W^{(\mathrm{img})},\ W^{(\mathrm{text})}\right]$$

Thus, the output of the fully connected layer can be expressed as:

$$h_{i,j} = \sigma\!\left(\sum_{p=1}^{P} W_p^{(\mathrm{img})} h_{i,j}^{\mathrm{img}} + \sum_{q=1}^{Q} W_q^{(\mathrm{text})} h_{i,j}^{\mathrm{text}} + b\right)$$

The model output $f$ can be expressed as:

$$f(h_{i,j}; \theta_i) = \sum_{v=1}^{V} \theta_i^{(v)}\, \sigma\!\left(\sum_{p=1}^{P} W_p^{(\mathrm{img})} h_{i,j}^{\mathrm{img}} + \sum_{q=1}^{Q} W_q^{(\mathrm{text})} h_{i,j}^{\mathrm{text}} + b\right)^{(v)}$$

where $\theta_i$ are the model parameters.

For the local loss function, we take the gradient with respect to the model parameters $\theta_i$:

$$\nabla_{\theta_i} L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \nabla_{\theta_i}\, \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) + \lambda \nabla_{\theta_i} \|\theta_i\|_2^2$$

According to the chain rule, we can expand the gradient:

$$\nabla_{\theta_i}\, \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) = \frac{\partial \ell}{\partial f}\, \nabla_{\theta_i} f(h_{i,j}; \theta_i)$$

Expanding the gradient of the model output $f$:

$$\nabla_{\theta_i} f(h_{i,j}; \theta_i) = \sum_{v=1}^{V} \nabla_{\theta_i}\!\left(\theta_i^{(v)}\, \sigma\!\left(\sum_{p=1}^{P} W_p^{(\mathrm{img})} h_{i,j}^{\mathrm{img}} + \sum_{q=1}^{Q} W_q^{(\mathrm{text})} h_{i,j}^{\mathrm{text}} + b\right)^{(v)}\right)$$

Note that the gradient of the regularization term is:

$$\nabla_{\theta_i} \|\theta_i\|_2^2 = 2\theta_i$$

Substituting the gradients into the local model parameter update formula:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\left(\frac{1}{n_i}\sum_{j=1}^{n_i}\left(\frac{\partial \ell}{\partial f}\sum_{v=1}^{V} \nabla_{\theta_i}\!\left(\theta_i^{(v)}\, \sigma\!\left(\sum_{p=1}^{P} W_p^{(\mathrm{img})} h_{i,j}^{\mathrm{img}} + \sum_{q=1}^{Q} W_q^{(\mathrm{text})} h_{i,j}^{\mathrm{text}} + b\right)^{(v)}\right)\right) + 2\lambda\theta_i\right)$$

For the case of the multimodal feature vector $h_i$, this can be further simplified to:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\left(\sum_{v=1}^{V} \nabla_{\theta_i}\, \ell\!\left(f\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{(\mathrm{img})} h_{i,p}^{\mathrm{img}} + W_{p,q}^{(\mathrm{text})} h_{i,q}^{\mathrm{text}} + b_{p,q}\right), y_{i,v}\right) + \lambda \theta_i\right)$$

where $W_{p,q}^{(\mathrm{img})}$ and $W_{p,q}^{(\mathrm{text})}$ are the weight matrices for the image and text features, and $b_{p,q}$ is the bias vector.

In conclusion, the corollary is proved.
