Federated learning-driven collaborative recommendation system for multi-modal art analysis and enhanced recommendations
- Academic Editor
- Giovanni Angiulli
- Subject Areas
- Artificial Intelligence, Data Mining and Machine Learning, Data Science, Visual Analytics, Neural Networks
- Keywords
- Artificial intelligence, Art similarity search, Data privacy, Federated learning, Multimodal data fusion
- Copyright
- © 2024 Gong et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
- Cite this article
- Gong et al. 2024. Federated learning-driven collaborative recommendation system for multi-modal art analysis and enhanced recommendations. PeerJ Computer Science 10:e2405 https://doi.org/10.7717/peerj-cs.2405
Abstract
With the rapid development of artificial intelligence technology, recommendation systems have been widely applied in various fields. However, in the art field, art similarity search and recommendation systems face unique challenges, namely data privacy and copyright protection issues. To address these problems, this article proposes a cross-institutional artwork similarity search and recommendation system (AI-based Collaborative Recommendation System (AICRS) framework) that combines multimodal data fusion and federated learning. This system uses pre-trained convolutional neural networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT) models to extract features from image and text data. It then uses a federated learning framework to train models locally at each participating institution and aggregate parameters to optimize the global model. Experimental results show that the AICRS framework achieves a final accuracy of 92.02% on the SemArt dataset, compared to 81.52% and 83.44% for traditional CNN and Long Short-Term Memory (LSTM) models, respectively. The final loss value of the AICRS framework is 0.1284, which is better than the 0.248 and 0.188 of CNN and LSTM models. The research results of this article not only provide an effective technical solution but also offer strong support for the recommendation and protection of artworks in practice.
Introduction
Background
With the rapid development of artificial intelligence technology, deep learning-based recommendation systems have been widely applied in various fields. From e-commerce product recommendations to music and video content recommendations, recommendation systems have become important tools to enhance user experience. In the art field, especially in the similarity search and recommendation systems for artworks, there are unique challenges (Wu et al., 2024; Mintie, 2023). Artworks often have high originality and commercial value, making data privacy and copyright protection issues particularly important (Nishioka, Hauke & Scherp, 2020; Xiong & Zhang, 2023; Ajmal et al., 2023).
Freelance artists and art galleries are the main creators and collectors of artworks. They want to increase the exposure and sales of their works through advanced recommendation systems. However, these institutions are reluctant to share their original artwork data directly due to concerns about data breaches and copyright infringement (White & Matulionyte, 2020; Peplow, 2021). Additionally, wealthy art collectors and museums are also primary consumers of artworks. These entities play a significant role in the art market, seeking advanced recommendation systems to enhance their collections while ensuring privacy and copyright protection.
The extensive use of AI painting tools has also raised concerns about copyright infringement of unpublished works. In recent years, data privacy and copyright protection issues have drawn widespread attention in the art field. Getty Images accused the AI company Stability AI of using millions of unauthorized image data to train its image generation model, which harmed the commercial value of its images (BakerHostetler, 2024). In the field of AI-generated artworks, a platform used more than 100,000 artworks without authorization to train its generation model. These works came from thousands of artists, harming the interests of original artists (Appel, Neelbauer & Schweidel, 2023). These cases highlight the severe challenges of data privacy and copyright protection in the art field. These issues make it urgent to achieve efficient artwork recommendation while protecting data privacy and copyrights.
This work aims to address these challenges by proposing a federated learning framework that enables multiple institutions to collaboratively train recommendation models without sharing their raw data. By leveraging federated learning, we can enhance the quality of similarity search and recommendation for artworks while ensuring that the data privacy and copyright protection concerns of freelance artists and art galleries are adequately addressed.
Related work
Numerous studies have explored the application of deep learning and federated learning in recommendation systems, emphasizing the importance of data privacy and copyright protection. For instance, Chen et al. (2023) reviewed the latest developments in deep reinforcement learning for recommendation systems but noted insufficient consideration of data privacy issues. Dong et al. (2022) investigated trust-aware recommendation systems, focusing on robustness and interpretability, yet their exploration of federated learning was limited. Lee & Kim (2022) proposed deep learning recommendation systems using cross-convolution filters to capture complex user-item interactions, though they did not adequately address cross-institution collaboration.
Recent advancements have highlighted the potential of federated learning to enhance recommendation systems while maintaining data privacy. Dong et al. (2023) introduced the FedSR framework for Point-of-Interest recommendation systems, addressing data sparsity and Non-Independent and Identically Distributed (Non-IID) issues through Contrastive Learning. van Berlo, Saeed & Ozcelebi (2020) developed a federated unsupervised representation learning architecture, demonstrating how federated learning can pre-train deep neural networks with unlabeled data to protect user privacy. Elayan, Aloqaily & Guizani (2021) proposed a Deep Federated Learning framework for decentralized healthcare systems, illustrating the benefits of federated learning in maintaining user privacy and improving model performance in sensitive data environments. To address these issues, a growing number of researchers are seeking innovative solutions. However, previous studies still have notable shortcomings, as summarized in Table 1.
Author | Application scenario | Research content | Possible shortcomings |
---|---|---|---|
Chen et al. (2023) | Recommendation systems | Review of the application and latest developments of deep reinforcement learning in recommendation systems | Insufficient consideration of data privacy and copyright protection issues |
Dong et al. (2022) | Trust-aware recommendation systems | Trust-aware recommendation systems from the perspective of deep learning, including social trust, robustness, and interpretability | Limited exploration of applications in multimodal data fusion and federated learning |
Lee & Kim (2022) | Recommendation systems | Deep learning recommendation systems based on cross-convolution filters, capturing complex interactions between user and item features | Insufficient protection of data privacy and cross-institution collaboration |
Park & Lee (2023) | Deep sleep recommendation systems | Personalized deep sleep recommendation using hybrid deep learning methods, combining user and collaborative filtering methods | No consideration of data protection and copyright issues |
Jeong & Kim (2022) | Context-aware recommendation systems | Deep learning recommendation systems based on context features, combining neural networks and autoencoders for feature extraction and score prediction | Few applications for data privacy protection and cross-institution data collaboration |
Vu & Le (2023) | Multi-criteria recommendation systems | Context-aware multi-criteria recommendation systems based on deep learning, using deep neural networks to predict ratings and learn aggregation functions | Insufficient consideration of privacy protection and copyright protection |
Tegene et al. (2023) | Collaborative recommendation systems | Latent factor models based on deep learning and embedding to solve data sparsity issues and extract nonlinear features | Limited research on applications in federated learning and multimodal data fusion |
Wu, Sun & Shang (2023) | Deep learning recommendation systems | Proposed DE-Opt framework to optimize hyperparameters of deep learning recommendation systems, improving recommendation accuracy and computational efficiency | Lack of exploration of multimodal data fusion and cross-institution data protection |
Torkashvand, Jameii & Reza (2023) | Collaborative filtering recommendation systems | Systematic review of deep learning collaborative filtering recommendation systems, categorizing and analyzing existing methods and their advantages and disadvantages | Insufficient research on data privacy and cross-institution collaboration protection |
Arthur et al. (2022) | Cross-domain recommendation systems | Proposed a discriminative geometric deep learning model to solve cold start and data sparsity issues in cross-domain recommendations | Less exploration of applications in multimodal data fusion and privacy protection |
Dong et al. (2023) | POI recommendation systems | Proposed FedSR framework for POI-RS using sequential information and Contrastive Learning to address data sparsity and Non-IID issues in FL | Geographic and cultural differences affecting model training effectiveness |
van Berlo, Saeed & Ozcelebi (2020) | Federated learning | Introduced federated unsupervised representation learning for pre-training deep neural networks with unlabeled data in a federated setting | Limited to scenarios where labeled data can be generated from user interaction |
Elayan, Aloqaily & Guizani (2021) | Healthcare systems | Proposed Deep Federated Learning framework for decentralized healthcare systems to maintain user privacy and improve model performance | Model conversion time affecting quality of service to users |
Our contribution
This study proposes a cross-institutional artwork similarity search and recommendation system—AICRS (AI-based Collaborative Recommendation System) framework, which combines multimodal data fusion and federated learning to address data privacy and copyright protection issues, as shown in Fig. 1. The main contributions of this article are as follows:
Figure 1: AICRS framework (Image source: Author’s own).
Figure 1 illustrates the AICRS framework, consisting of multiple components working together to provide a secure and efficient recommendation system. The framework involves various entities such as art galleries, artists, trading platforms, and self-employed individuals who contribute their data while maintaining control over their local models. The data, which remains with the participants, includes diverse artwork types (advertising, comics, publicity, art collections, etc.), represented by different colors for different types. The local models are trained on participant-owned data, and only the necessary parameters are exchanged between participants and the central trusted server. This server aggregates the parameters without exposing the original artwork, ensuring privacy and copyright protection while making customer recommendations based on their needs. The integration of these components ensures a collaborative yet secure environment for recommending artwork across institutions, leveraging the strengths of federated learning and multimodal data fusion.
- We propose a Local Multi-Modal Feature Extraction and Aggregation algorithm (L-MFEA). It combines pre-trained convolutional neural networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT) models. This extracts rich features from image and text data, generating multi-modal feature vectors and improving the accuracy of the recommendation system.
- We propose a Federated Model Parameter Aggregation and Optimization algorithm (F-MPAO). This algorithm trains models locally at each participating institution and aggregates parameters to optimize the global model. It effectively addresses privacy and security issues caused by centralized data storage.
- Combining multimodal data fusion and federated learning strategies, we propose a cross-institutional artwork similarity search and recommendation system (AICRS framework). This improves the overall performance and recommendation effect of the recommendation system while ensuring data privacy and copyright protection.
The remainder of this article is organized as follows: “Research Issues” discusses the key research issues related to recommendation systems, federated learning, and model aggregation strategies. “Experiments and Results” presents the experimental setup and results, including a detailed analysis of the system’s performance, and “Discussion” analyzes the reasons behind the observed performance differences. Finally, “Conclusion” concludes the article and highlights potential directions for future research.
Research issues
Protecting user data privacy and copyright is crucial in recommendation systems. Federated learning allows collaborative model training without sharing raw data, ensuring privacy and security. In art recommendation systems, this protects sensitive artwork data and copyrights. Combining federated learning with deep learning, we propose the AICRS (AI-based Collaborative Recommendation System) for efficient and secure artwork recommendations.
Privacy protection in art recommendation systems
The dataset of artworks used in this research is represented as $D = \{D_1, D_2, \ldots, D_N\}$. Each $D_i$ represents a local dataset of a participant (freelance artist or art gallery) containing $n_i$ artworks, i.e., $D_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$. Here, $x_{i,j}$ represents the feature of the $j$-th artwork of the $i$-th participant, and $y_{i,j}$ represents the corresponding label. Each participant trains their model locally and updates parameters by minimizing the following loss function:
(1) $\theta_i \leftarrow \theta_i - \eta \nabla_{\theta_i} L_i(\theta_i)$, with $L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\big(f(x_{i,j}; \theta_i),\, y_{i,j}\big)$, where $L_i$ is the loss function of the $i$-th participant, $\ell$ is the loss function (e.g., mean squared error or cross-entropy), and $\eta$ is the learning rate. This update rule, commonly used in stochastic gradient descent, is discussed in various works including Jeon et al. (2023). The global model is updated by aggregating the local model parameters of all participants:
(2) $\theta = \sum_{i=1}^{N} \frac{n_i}{n}\, \theta_i$, where $n = \sum_{i=1}^{N} n_i$.
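To make the local update in Eq. (1) concrete, the sketch below shows one participant's local training step in PyTorch. It is a minimal illustration rather than the paper's exact implementation: the model, data loader, loss choice, and learning rate are placeholders.

```python
import torch
import torch.nn as nn

def local_update(model, data_loader, lr=0.01, epochs=1):
    """One participant's local training (Eq. 1): minimize the local loss
    with stochastic gradient descent on privately held data."""
    criterion = nn.CrossEntropyLoss()             # example loss choice
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in data_loader:                  # raw data never leaves the participant
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    # Only the updated parameters are shared with the server (Eq. 2).
    return {k: v.detach().clone() for k, v in model.state_dict().items()}
```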
In the recommendation system, we use deep learning models to extract features of artworks for similarity search and recommendation. Suppose the deep learning model is a convolutional neural network (CNN); its structure can be represented as:
(3) $h = f(x; \theta)$, where $h$ is the extracted high-dimensional feature vector, $x$ is the input artwork, and $\theta$ is the model parameter. For similarity search, we use cosine similarity to measure the similarity between two artwork feature vectors:
(4) $\mathrm{sim}(h_i, h_j) = \dfrac{h_i \cdot h_j}{\|h_i\|\,\|h_j\|}$, where $h_i$ and $h_j$ represent the feature vectors of the $i$-th and $j$-th artworks, respectively, and $\|\cdot\|$ denotes the norm of a vector. Cosine similarity is a widely used method for comparing feature vectors in recommendation systems, as detailed in Sankararaman et al. (2020). Our optimization objective is to maximize the performance of the recommendation system while protecting data privacy and copyrights. Specifically, the optimization objective can be expressed as:
(5) $\min_{\theta}\ \sum_{i=1}^{N} L_i(\theta) + \lambda R(\theta)$, where $R(\theta)$ is the regularization term and $\lambda$ is the regularization coefficient. This formulation is a standard approach to incorporating regularization in machine learning, discussed extensively in Tian, Zhang & Zhang (2023).
Problem 1 Our core research question is how to achieve an efficient artwork recommendation system using federated learning and deep learning technologies while protecting the privacy and copyright of artwork data. The specific mathematical model is as follows:
(6) $\min_{\theta}\ \sum_{i=1}^{N} \frac{n_i}{n} L_i(\theta) + \lambda R(\theta)$, subject to the raw data $D_i$ remaining local to the $i$-th participant.
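As a small illustration of the similarity search in Eq. (4), the sketch below ranks candidate artworks by cosine similarity between feature vectors; the `feature_matrix` and `query_vec` arrays are assumed to come from the feature-extraction steps described later, and the function name is ours.

```python
import numpy as np

def top_k_similar(query_vec, feature_matrix, k=5):
    """Rank artworks by cosine similarity (Eq. 4) to a query feature vector.
    query_vec: (d,) array; feature_matrix: (num_artworks, d) array."""
    eps = 1e-8                                            # avoid division by zero
    q = query_vec / (np.linalg.norm(query_vec) + eps)
    m = feature_matrix / (np.linalg.norm(feature_matrix, axis=1, keepdims=True) + eps)
    sims = m @ q                                          # cosine similarity scores
    top_idx = np.argsort(-sims)[:k]                       # indices of the k most similar artworks
    return top_idx, sims[top_idx]
```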
Local feature extraction and aggregation based on multimodal data fusion
Multimodal data fusion: local feature extraction and aggregation based on CNN and BERT models
Traditional unimodal algorithms, such as CNN or LSTM, handle only single-modal data (image or text) and cannot fully exploit the multimodal characteristics of artworks (Daneshvar & Ravanmehr, 2022). They also fail to capture fine-grained details in artworks, leading to large prediction errors, high loss values, and poor recommendation quality (Fachrela et al., 2023). Moreover, traditional methods must store and process large amounts of data centrally, raising data privacy and copyright protection issues (Nithya, Geetha & Kumar, 2024).
The AICRS framework combines multimodal data fusion techniques (image and text features). It uses pre-trained CNN and BERT models to extract rich features from image and text data and then fuses them to generate multimodal feature vectors. This can better capture the details of artworks, significantly reduce prediction errors and loss values, and improve the accuracy of the recommendation system. By using federated learning strategies, model parameters are shared among multiple participating institutions instead of directly sharing data. This protects data privacy and copyright while improving the generalization ability of the model.
Multimodal data fusion
Suppose there are $N$ artworks. Each artwork contains image data $x_i^{\mathrm{img}}$ and text data $x_i^{\mathrm{txt}}$. We first extract features from the image and text separately. For image feature extraction, we use a pre-trained CNN, such as Residual Neural Network (ResNet), to extract the image feature vector:
(7) $v_i = \mathrm{CNN}\big(x_i^{\mathrm{img}}; \{W_k^{(l)}, b_k^{(l)}\}\big)$, where $v_i$ represents the image feature vector of the $i$-th artwork, $W_k^{(l)}$ and $b_k^{(l)}$ are the weights and biases of the $k$-th convolution kernel in the $l$-th layer, $*$ denotes the convolution operation, and $\sigma$ denotes the activation function (such as Rectified Linear Unit (ReLU)). Specifically, suppose the CNN contains $L$ layers of convolution and pooling layers. The output of the $l$-th convolution and pooling layer can be represented as:
(8) $h^{(l)} = \mathrm{Pool}\big(\sigma(W_k^{(l)} * h^{(l-1)} + b_k^{(l)})\big)$, where $h^{(l)}$ is the output of the $l$-th layer, $\mathrm{Pool}(\cdot)$ denotes the pooling operation, and $W_k^{(l)}$ and $b_k^{(l)}$ are the weights and biases of the $k$-th convolution kernel in the $l$-th layer. The final image feature vector is represented by the output of the last layer of the CNN:
(9) $v_i = h^{(L)}$.
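A minimal sketch of the image-feature step in Eqs. (7)-(9), using a pre-trained ResNet-50 from torchvision with its classification head removed; the exact backbone weights and preprocessing used in the paper may differ.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Pre-trained ResNet-50 with the final classification layer dropped,
# so the output is the pooled convolutional feature vector (Eq. 9).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_features(path: str) -> torch.Tensor:
    """Extract a 2048-dimensional image feature vector for one artwork."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img).flatten(1).squeeze(0)   # shape: (2048,)
```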
For text feature extraction, we use a pre-trained BERT model to extract the text feature vector:
(10) $t_i = \mathrm{BERT}\big(x_i^{\mathrm{txt}}; \{A_k^{(l)}, b_k^{(l)}\}\big)$, where $t_i$ represents the text feature vector of the $i$-th artwork, $A_k^{(l)}$ and $b_k^{(l)}$ are the attention weights and biases of the $k$-th word in the $l$-th layer, and $e_k$ represents the embedding vector of the $k$-th word. Specifically, BERT uses a multi-layer Transformer structure to extract text features. The Transformer architecture in BERT consists of multiple layers of self-attention and feed-forward neural networks, which enables it to capture complex dependencies and contextual information from the text data. For the $l$-th layer of the Transformer, the output can be represented as:
(11) $h^{(l)} = \mathrm{MultiHead}\big(h^{(l-1)}; W^{(l)}, b^{(l)}\big)$, where $h^{(l)}$ is the output of the $l$-th layer, $\mathrm{MultiHead}(\cdot)$ denotes the multi-head attention mechanism, and $W^{(l)}$ and $b^{(l)}$ are the parameters of the $l$-th layer. The final text feature vector is represented by the output of the last layer of BERT:
(12) $t_i = h^{(L)}$.
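Analogously, a hedged sketch of the text-feature step in Eqs. (10)-(12) using the Hugging Face bert-base-uncased checkpoint; the specific BERT variant and the pooling choice (the final-layer [CLS] token) are assumptions on our part.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def text_features(description: str) -> torch.Tensor:
    """Extract a 768-dimensional text feature vector from an artwork description."""
    inputs = tokenizer(description, truncation=True, max_length=128,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Use the final-layer [CLS] embedding as the text feature (Eq. 12).
    return outputs.last_hidden_state[:, 0, :].squeeze(0)   # shape: (768,)
```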
To generate a comprehensive multimodal feature vector, we concatenate the image and text features extracted by the CNN and BERT, then process them through a fully connected layer. The feature fusion process can be represented as:
(13) $h_i = \sigma\big(W\,[v_i; t_i] + b\big)$, where $h_i$ represents the multimodal feature vector of the $i$-th artwork, $[v_i; t_i]$ represents the concatenation of image and text features, $W$ is the weight matrix, $b$ is the bias vector, and $\sigma$ represents the activation function. Suppose the fully connected layer contains $M$ neurons. The computation process can be represented as:
(14) $z_m^{(l)} = \sigma\big(w_m^{(l)\top} z^{(l-1)} + b_m^{(l)}\big)$, where $z^{(l)}$ represents the output of the fully connected layer, $w_m^{(l)}$ and $b_m^{(l)}$ are the weights and biases of the $m$-th neuron in the $l$-th layer, and $\sigma$ represents the activation function (such as ReLU). The final multimodal feature vector can be represented as:
(15) $h_i = z^{(L)}$.
We further normalize the feature vector to ensure its balance between different modalities:
(16) $\hat{h}_i = \dfrac{h_i}{\|h_i\| + \epsilon}$, where $\|h_i\|$ represents the norm of $h_i$, and $\epsilon$ is a small constant to avoid division by zero. This normalization ensures that the length of each feature vector is on a uniform scale, improving the accuracy of subsequent processing and recommendation. The output of this algorithm is used as input to the subsequent recommendation system model.
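The fusion and normalization steps in Eqs. (13)-(16) can be sketched as a small fully connected module; the layer widths follow the feature-fusion configuration reported later in Table 3 (512 and 256 neurons), while the epsilon value and class name are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalFusion(nn.Module):
    """Concatenate image and text features, pass them through fully connected
    layers (Eqs. 13-15), and normalize the result (Eq. 16)."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, out_dim=256, eps=1e-8):
        super().__init__()
        self.fc1 = nn.Linear(img_dim + txt_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
        self.eps = eps

    def forward(self, img_feat, txt_feat):
        h = torch.cat([img_feat, txt_feat], dim=-1)       # concatenation [v_i; t_i]
        h = F.relu(self.fc1(h))
        h = F.relu(self.fc2(h))
        return h / (h.norm(dim=-1, keepdim=True) + self.eps)   # normalized feature
```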
Federated learning-based cross-institutional model parameter aggregation and optimization
Cross-institutional collaborative model: parameter aggregation and optimization based on federated learning framework
In cross-institutional data sharing, traditional methods cannot effectively protect the data ownership and copyright of each participating institution. This makes institutions reluctant to share data, affecting the overall performance of the model. Traditional model training methods require huge computational and storage resources when handling large-scale data. This can lead to single points of failure and poor system robustness (Wang & Kawagoe, 2018; Musto et al., 2010; Kim, Kang & Lee, 2019). Traditional algorithms also struggle to achieve efficient model parameter updates and optimization in cross-institutional cooperation, leading to slow convergence of the global model and limited performance improvement (Messina et al., 2019; Wang et al., 2019).
This algorithm aggregates local model parameters of each participating institution through methods such as weighted averaging. It preserves each institution's data ownership and copyright, enhances their willingness to cooperate, and improves the overall performance of the model. This algorithm reduces reliance on a central server, improves system robustness, and avoids single points of failure. With improved optimization algorithms and effective parameter aggregation strategies, it significantly speeds up the convergence of the global model and improves performance.
Cross-institutional collaborative model
Suppose there are $N$ participating institutions. Each institution $i$ has a local dataset $D_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$, where $x_{i,j}$ is the $j$-th sample of the $i$-th institution and $y_{i,j}$ is its corresponding label. On the local server of each participating institution, the L-MFEA algorithm is used to extract multimodal features $h_{i,j}$:
(17) $h_{i,j} = f_{\mathrm{L\text{-}MFEA}}(x_{i,j}; \theta_i)$, where $\theta_i$ is the local model parameter of the $i$-th institution. Next, the model is trained on the local dataset to minimize the local loss function:
(18) $L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\big(f(h_{i,j}; \theta_i),\, y_{i,j}\big) + \lambda R(\theta_i)$, where $\ell$ is the loss function (such as cross-entropy loss), $f(h_{i,j}; \theta_i)$ is the model output, and $R(\theta_i)$ is the regularization term.
Local model parameters are updated using the gradient descent algorithm:
(19) $\theta_i \leftarrow \theta_i - \eta\big(\nabla_{\theta_i} \ell + \lambda \nabla_{\theta_i} R(\theta_i)\big)$, where $\eta$ is the learning rate, $R(\theta_i)$ is the regularization term, and $\lambda$ is the weight of the regularization term. In each round of federated learning, the central server aggregates the local model parameters of each participating institution. The aggregation method, such as weighted averaging, is defined as:
(20) $\theta = \sum_{i=1}^{N} \frac{n_i}{n}\, \theta_i$, where $\theta$ is the global model parameter and $n = \sum_{i=1}^{N} n_i$ is the total amount of data from all participating institutions. The central server distributes the aggregated global model parameters back to each participating institution to update the local model parameters:
(21) $\theta_i \leftarrow \theta$.
Each participating institution, after receiving the updated global model parameters, continues to train on the local dataset to minimize the local loss function. Local training and parameter aggregation are repeated until the global model converges.
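A compact sketch of the server-side step in Eqs. (20)-(21), weighting each institution's parameters by its local data size; representing parameters as PyTorch state dicts is an implementation assumption.

```python
def aggregate(local_states, local_sizes):
    """Weighted averaging of local model parameters (Eq. 20).
    local_states: list of state_dicts (name -> torch.Tensor) from institutions.
    local_sizes: list of local dataset sizes n_i."""
    total = float(sum(local_sizes))
    global_state = {}
    for key in local_states[0]:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(local_states, local_sizes)
        )
    return global_state  # broadcast back to every institution (Eq. 21)
```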
Theorem 1 Given a reasonable learning rate and sufficient iterations, the loss function of the global model parameter will converge, assuming the local loss functions of all participating institutions converge. Specifically, assuming each local model satisfies the following condition during the iteration process:
(22) where $L_\ell$ and $L_R$ are the Lipschitz constants of the loss function $\ell$ and the regularization term $R$, respectively, and $L^{*}$ is the global optimal loss value.
Corollary 1 Based on the above theorem, if the multimodal feature vector is obtained by concatenating the image feature vector and the text feature vector and then computing through the fully connected layer, the final model parameter update can be represented as:
(23) where $\eta$ represents the learning rate, $W_v$ and $W_t$ represent the weight matrices of the image and text features, respectively, $b$ represents the bias vector, and $\lambda$ represents the regularization parameter.
AICRS framework: algorithm pseudocode and complexity analysis
The AICRS framework is presented as a comprehensive solution that integrates multiple components and processes to achieve efficient and secure artwork recommendations (Algorithm 1). While the overall structure is defined as a framework, the detailed implementation of its components and processes is expressed in the form of an algorithm. This approach allows us to provide a clear and precise description of the operational steps involved in the framework, ensuring reproducibility and facilitating practical application. By presenting the pseudocode, we aim to illustrate the exact sequence of operations, including data handling, model training, and federated learning procedures, thus bridging the conceptual framework with its practical execution.
Input: Local datasets $D_i$ of the $N$ participating institutions
Output: Global model parameters θ
1 for each participating institution $i = 1$ to $N$ do
2   for each sample $j = 1$ to $n_i$ do
3     Extract multimodal features $h_{i,j}$ using Eq. (17);
4     Extract image features using Eq. (7);
5     Extract text features using Eq. (10);
6     Concatenate image and text features to generate the multimodal feature vector $h_i$ using Eq. (13);
7     Normalize the feature vector $h_i$ to obtain $\hat{h}_i$ using Eq. (41);
8   Train the model on the local dataset, minimizing the local loss function using Eq. (55);
9   Update local model parameters using Eq. (65);
10 for each round of federated learning $t = 1$ to $T$ do
11   The central server aggregates the local model parameters of each participating institution using Eq. (48);
12   Distribute the aggregated global model parameters back to each participating institution using Eq. (21);
13   for each participating institution $i = 1$ to $N$ do
14     Continue training on the local dataset, minimizing the local loss function;
15     Update local model parameters;
16 while the global model has not converged do
17   for each participating institution $i = 1$ to $N$ do
18     Continue training on the local dataset, minimizing the local loss function using Eq. (55);
19     Update local model parameters using Eq. (65);
20   The central server aggregates the local model parameters of each participating institution using Eq. (48);
21   Distribute the aggregated global model parameters back to each participating institution using Eq. (21);
22 return Global model parameters θ
Suppose each participating institution has $n_i$ samples. The constant times for image feature extraction and text feature extraction are $C_v$ and $C_t$, respectively. The time complexity of each training iteration is $C_{\mathrm{train}}$, the number of federated learning rounds is $T$, and the number of model parameters is $P$. Considering these factors, the time complexity is $O\big(T \sum_{i=1}^{N} n_i (C_v + C_t + C_{\mathrm{train}}) + T N P\big)$. For space complexity, each participating institution stores the local dataset $D_i$ and model parameters $\theta_i$, and the central server stores the global model parameters $\theta$ and the model parameters of each participating institution. Suppose the dimension of each sample is $D$. The space complexity is $O\big(\sum_{i=1}^{N} n_i D + N P\big)$.
To further analyze the performance of the proposed framework, we compared the AICRS framework with other state-of-the-art models, as shown in Table 2. The time complexity of AICRS is , which has better scalability compared to Karayel’s (2023) and Liu & Yu’s (2022) . Especially when handling large-scale data and multiple institutions, AICRS is more efficient. Additionally, the space complexity of AICRS is , which is much lower than Liu & Yu’s (2022) and Karayel’s (2023) . This significantly reduces storage requirements while maintaining performance. Overall, the optimization in time and space complexity of the AICRS algorithm makes it more advantageous in large-scale distributed federated learning systems.
Related study | Time complexity | Space complexity |
---|---|---|
Karayel (2023) | ||
Liu & Yu (2022) | ||
Malandrino et al. (2021) | Variable | |
Zhou et al. (2023) | ||
AICRS |
Experiments and results
This section describes the dataset and experimental parameters used in our study. It provides an overview of the SemArt dataset, details the data split for training, validation, and testing, and outlines the experimental parameters set for the training and evaluation of the recommendation models. Additionally, it presents the results of the AICRS framework application, highlighting its performance compared to traditional CNN and LSTM models.
Dataset and experimental parameters
The SemArt dataset is specifically designed for the analysis and recommendation of artworks. It contains 21,384 images of artworks and their related textual descriptions. Each image comes with detailed text descriptions, including the title, artist, creation year, art style, and descriptive text (https://doi.org/10.17036/researchdata.aston.ac.uk.00000380).
To train and test the recommendation system, we split the dataset by art style. Each art style acts as a client for federated learning training. The training set includes a portion of the artworks from each style for model training. The test set includes the remaining artworks for evaluating the model’s recommendation accuracy. The accuracy of the recommendations is evaluated by comparing whether the recommended artworks match the actual artist’s style and type. The specific data content is shown in Fig. 2.
Figure 2: SemArt dataset samples: The presence of multiple types of art images in the dataset and the corresponding comments and attributes (Image source: SemArt Dataset (https://doi.org/10.17036/researchdata.aston.ac.uk.00000380), licensed under CC BY-NC 4.0.).
The dataset is split into training, validation, and test sets with the following percentages: 70% for training, 15% for validation, and 15% for testing. This split ensures that the model is adequately trained, validated, and tested to achieve reliable performance metrics. Our experiments were conducted using high-performance hardware to ensure efficient computation and model training.
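The 70/15/15 split can be reproduced with a sketch along the following lines; the metadata file name and column names are placeholders, since the SemArt release ships its own annotation files.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical metadata table with one row per artwork
# (columns "image", "description", "style" are placeholders).
df = pd.read_csv("semart_metadata.csv")

# 70% train, 15% validation, 15% test, stratified by art style so that
# each style (i.e., each federated client) appears in every split.
train_df, rest_df = train_test_split(df, test_size=0.30,
                                     stratify=df["style"], random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.50,
                                   stratify=rest_df["style"], random_state=42)
```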
Our experimental parameters are set as shown in Table 3. After setting these detailed experimental parameters, we trained and tested these algorithms on the SemArt dataset. The SemArt dataset includes images of artworks and their detailed textual descriptions from various art categories, covering multiple art styles from the 11th to the 20th century. By conducting experiments on this dataset, we can evaluate the effectiveness and performance of the AICRS framework in processing and recommending artworks.
Parameter name | Parameter value | Parameter name | Parameter value |
---|---|---|---|
Dataset | SemArt | Number of images | 21,384 |
Image resolution | 224 × 224 | Text description | Each image includes title, artist, creation year, art style, descriptive text |
Categories | Painting, sculpture, photography, etc. | Time span | 11th to 20th Century |
Art styles | Baroque, renaissance, impressionism, modern art, etc. | Model structure (Image Feature Extraction) | ResNet-50 |
Input size (Image feature extraction) | 224 × 224 × 3 | Number of convolution layers (Image feature extraction) | 50 |
Activation function (Image feature extraction) | ReLU | Model structure (Text feature extraction) | BERT-base |
Number of hidden layers (Text feature extraction) | 12 | Number of hidden units (Text feature extraction) | 768 |
Number of attention heads (Text feature extraction) | 12 | Input length (Text feature extraction) | 128 |
Number of layers (Feature Fusion) | 2 | Number of neurons (Feature fusion) | 512, 256 |
Activation function (Feature fusion) | ReLU | Local training epochs | 10 |
Global training epochs | 50 | Learning rate | 0.001 |
Optimization algorithm | Adam | Batch size | 32 |
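For reference, the optimizer settings in Table 3 translate into a short configuration sketch; the `make_optimizer` helper and the model object it expects are illustrative, not part of the released code.

```python
import torch

LOCAL_EPOCHS = 10      # local training epochs per round (Table 3)
GLOBAL_ROUNDS = 50     # global federated training rounds (Table 3)
BATCH_SIZE = 32        # batch size (Table 3)
LEARNING_RATE = 1e-3   # learning rate (Table 3)

def make_optimizer(model):
    """Adam optimizer with the learning rate reported in Table 3."""
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```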
AICRS framework application results
This study focuses on developing a cross-institutional artwork recommendation system based on federated learning. It uses multimodal data fusion (image and text features) to enhance the performance of the recommendation system. The model accuracy measures the ability of the recommendation system to recommend the correct artworks, while the loss value reflects the prediction error during training.
Figure 3 shows the accuracy performance of the four models over 50 training rounds. The results indicate that the AICRS framework has significant advantages in processing and recommending artworks, surpassing traditional CNN and LSTM models, as well as the Federated Averaging (FED-AVG) model. The CNN model’s final accuracy after 50 training rounds is 81.52%. Accuracy grows rapidly in the early stages but stabilizes, maintaining above 80% from round 25 and reaching 81.52% at the end. The LSTM model’s final accuracy after 50 training rounds is 83.44%. The LSTM model also shows rapid accuracy growth initially, reaching about 82.63% around round 22, then slightly improving to 83.44%. The FED-AVG model’s final accuracy after 50 training rounds is 84.44%. The FED-AVG model demonstrates strong performance, surpassing the LSTM model in the later stages, with accuracy steadily increasing and reaching 84.44% at the end. The AICRS framework’s final accuracy after 50 training rounds is 92.02%. The AICRS framework performs significantly better than CNN, LSTM, and FED-AVG from the start, exceeding 90% accuracy before round 30 and reaching 92.02% at the end.
Figure 3: Comparison of model accuracy values over 50 training rounds.
Figure 4 shows the loss values of the four models (CNN, LSTM, AICRS framework, and FED-AVG) over 50 training rounds. The results show that the AICRS framework has significantly lower loss values in processing and recommending artworks compared to traditional CNN and LSTM models, as well as the FED-AVG model. Specifically, the CNN model’s final loss value after 50 training rounds is 0.248. The loss value decreases rapidly in the early stages but stabilizes, staying below 0.26 from round 30 and reaching 0.248 at the end. The LSTM model’s final loss value after 50 training rounds is 0.188. The LSTM model also shows a rapid decrease in loss value initially, reaching about 0.22 around round 25, then slightly decreasing to 0.188. The FED-AVG model’s final loss value after 50 training rounds is 0.168. The FED-AVG model performs better than the LSTM model, showing a steady decrease in loss value and reaching 0.168 at the end. The AICRS framework’s final loss value after 50 training rounds is 0.1284. The AICRS framework performs significantly better than CNN, LSTM, and FED-AVG from the start, dropping below a loss value of 0.20 before round 20 and reaching 0.1284 at the end. The AICRS framework shows significant advantages in accuracy and loss values, improving recommendation performance and reducing prediction errors in the artwork recommendation system.
Figure 4: Comparison of model loss values over 50 training rounds.
Table 4 shows the performance comparison of the three models, including accuracy, loss value and training time. It is evident that the AICRS framework outperforms the CNN and LSTM models in accuracy and loss value. Although its training time is slightly longer, the performance improvement is significant. The training time for each model was measured by recording the duration from the start to the completion of the training process on the same hardware configuration. Specifically, we used a server equipped with dual NVIDIA Tesla V100 GPUs, each with 32 GB of memory, and 256 GB of system RAM. The server is powered by dual Intel Xeon Gold 6258R CPUs, each with 28 cores, providing ample processing power for both training and inference phases. Each model was trained using the same dataset, batch size, and number of epochs to ensure a fair comparison. The reported training times are the average of three runs to account for any variability in the training process. The slight increase in training time for the AICRS framework is justified by its superior performance metrics, indicating a worthwhile trade-off for the significant gains in accuracy and reduced loss value.
Model | Accuracy (%) | Loss value | Training time (ms) |
---|---|---|---|
CNN | 81.52 | 0.248 | 3,962 |
LSTM | 83.44 | 0.188 | 4,185 |
AICRS | 92.02 | 0.1284 | 4,577 |
Discussion
The rapid development of artificial intelligence has led to the widespread use of recommendation systems in various fields. In the art sector, implementing similarity search and recommendation systems poses unique challenges due to the high originality and commercial value of artworks. This necessitates robust data privacy and copyright protection. Freelance artists and art galleries aim to increase the exposure and sales of their works through advanced recommendation systems but are often hesitant to share their original artwork data due to concerns about data breaches and copyright infringement.
This study proposes a cross-institutional artwork similarity search and recommendation system (AICRS framework) that combines multimodal data fusion and federated learning to address data privacy and copyright protection issues. The experimental results show that the AICRS framework has significant advantages in processing and recommending artworks, surpassing traditional CNN and LSTM models.
The main reasons for the performance differences among the models are as follows:
The AICRS framework combines multimodal data fusion (image and text features), extracting richer features of artworks compared to single-modal CNN and LSTM models. This improves the accuracy of the recommendation system. The CNN and LSTM models’ final accuracy after 50 training rounds are 81.52% and 83.44%, respectively. The AICRS framework’s final accuracy reaches 92.02%. This shows that multimodal data fusion has significant advantages in capturing the details and features of artworks.
The AICRS framework uses pre-trained ResNet-50 and BERT models, which have better feature extraction and representation capabilities than traditional CNN and LSTM models. These models capture the details of artworks more effectively. In terms of loss value, the AICRS framework also performs better than other models. The CNN and LSTM models’ final loss values after 50 training rounds are 0.248 and 0.188, respectively. The AICRS framework’s final loss value is 0.1284. The more complex model structure enables the AICRS framework to handle complex features and reduce errors more effectively.
The AICRS framework adopts federated learning strategies, sharing model parameters among multiple participating institutions instead of directly sharing data. This improves the model’s generalization ability and privacy protection. This strategy not only protects the data privacy of participating institutions but also enhances the overall performance of the model. The training time of the AICRS framework is slightly longer than other models (4,577 ms), but the performance improvement is significant.
Conclusion
This article proposes a cross-institutional artwork similarity search and recommendation system (AICRS framework) that combines multimodal data fusion and federated learning to address data privacy and copyright protection issues. This system uses pre-trained convolutional neural networks (CNN) and BERT models to extract rich features from image and text data. It trains models locally at each participating institution and aggregates parameters through a federated learning framework to optimize the global model. The experimental results demonstrate that the AICRS framework has significant advantages in processing and recommending artworks, improving recommendation performance and reducing prediction errors. It enables collaboration among art institutions, offering accurate recommendations and complying with data protection regulations. The AICRS framework still relies on high-quality multimodal data, which leaves room for improvement. Future research will explore enhancing the system’s robustness in cases of incomplete or low-quality data, as well as expanding the framework to support real-time recommendations across more diverse types of art media.
Appendix: mathematical theorems and corollary proofs
Theorem 2 Let $v$ and $t$ be the image and text feature vectors, respectively, $[v; t]$ be their concatenation, and $h$ be the multimodal feature vector. The normalized multimodal feature vector $\hat{h}$ satisfies:
(24) $\hat{h} = \dfrac{\sigma(W[v; t] + b)}{\|\sigma(W[v; t] + b)\| + \epsilon}$
where $\sigma$ is the activation function, $W$ is the weight matrix of the fully connected layer, $b$ is the bias vector, and $\epsilon$ is a small constant to avoid division by zero.
Proof 1 First, define the optimization problem as follows:
(25)
where $\lambda$ is the regularization parameter.
Expand the objective function as:
(26)
We can consider each term individually, i.e., optimize for each . For each , the optimization problem is:
(27)
Expand the above expression and take the derivative:
(28)
Setting the derivative to zero, we get:
(29)
Solve to get:
(30)
To obtain the normalized multimodal feature vector , we normalize the feature vector :
(31)
where $\|h\|$ is the norm of $h$, and $\epsilon$ is a small constant to avoid division by zero.
By definition, we have:
(32)
Substituting into the normalization equation and combining with the fusion equation, we get the final normalized multimodal feature vector:
(33)
In summary, the optimization problem has a unique solution that satisfies the equation, and the theorem is proved.
Corollary 2 Based on the above theorem, if the multimodal feature vector is obtained by concatenating the image feature vector and the text feature vector and then computing through a fully connected layer, the normalized multimodal feature vector can be expressed as follows:
(34)
where $\sigma$ is the activation function, $W$ is the weight matrix of the fully connected layer, $b$ is the bias vector, and $\epsilon$ is a small constant to avoid division by zero.
Proof 2 First, consider the calculation process of the multimodal feature vector . Assume is obtained by concatenating the image feature vector and the text feature vector , and then computing through a fully connected layer:
(35)
where $W$ is the weight matrix of the fully connected layer, $b$ is the bias vector, and $\sigma$ is the activation function.
We further refine the calculation process of the image feature vector and the text feature vector . Assume the image feature vector is obtained through a convolutional neural network:
(36)
where and are the weights and biases of the convolution kernels, and is the input image data.
Similarly, assume the text feature vector is obtained through a pre-trained BERT model:
(37)
where and are the weights and biases of the BERT model, is the input text data, and is the parameter of the BERT model.
Next, we concatenate the image feature vector and the text feature vector to obtain the multimodal feature vector :
(38)
Substituting Eqs. (36) and (37) into Eq. (38), we get:
(39)
Then, compute through the fully connected layer to obtain the fused multimodal feature vector :
(40)
To ensure the balance of feature vectors, we normalize them:
(41)
where $\|h\|$ is the norm of $h$, and $\epsilon$ is a small constant to avoid division by zero.
Substituting Eq. (40) into Eq. (41), we get the normalized multimodal feature vector:
(42)
Further expand the expression of :
(43)
Substituting Eq. (43) into Eq. (42), we get the final normalized multimodal feature vector:
(44)
In summary, the corollary is proved.
Theorem 3 Given a reasonable learning rate and sufficient iterations, if the local loss functions of all participating institutions converge, the loss function of the global model parameters will also converge. Specifically, suppose each local model satisfies the following condition during the iteration process:
(45)
where $L_\ell$ and $L_R$ are the Lipschitz constants of the loss function $\ell$ and the regularization term $R$, respectively, and $L^{*}$ is the global optimal loss value.
Proof 3 First, we consider the local loss function of each participating institution
(46)
The local model parameters are updated using the gradient descent algorithm
(47)
where $\eta$ is the learning rate, $R(\theta_i)$ is the regularization term, and $\lambda$ is the weight of the regularization term.
Next, we consider the aggregation process of the global model parameters. The local model parameters of each participating institution are aggregated by weighted averaging
(48)
where $\theta$ are the global model parameters and $n$ is the total amount of data from all participating institutions.
Then, we analyze the changes in the global loss function . The global loss function is defined as
(49)
where $L_i(\theta)$ is the local loss function of the $i$-th participating institution.
Because the local loss functions of each participating institution are optimized based on the same global model parameters, we have
(50)
where represents the decrease in the loss function during gradient descent.
Considering the Lipschitz continuity of the loss function and the regularization term, we get
(51)
where $L_\ell$ and $L_R$ are the Lipschitz constants of the loss function $\ell$ and the regularization term $R$, respectively.
Due to the convergence of the local loss functions , we have
(52)
where $L_i^{*}$ is the local optimal loss value of the $i$-th participating institution.
Therefore, the global loss function will also converge to its optimal value
(53)
where $L^{*}$ is the global optimal loss value.
In summary, given a reasonable learning rate and sufficient iterations, if the local loss functions of all participating institutions converge, the loss function of the global model parameters will also converge. The theorem is proved.
Corollary 1 Based on the above theorem, if the multimodal feature vector is obtained by concatenating the image feature vector and the text feature vector and then computing through a fully connected layer, the final model parameter update can be expressed as:
(54)
where $\eta$ is the learning rate, $W_v$ and $W_t$ are the weight matrices for image and text features, $b$ is the bias vector, and $\lambda$ is the regularization parameter.
Proof 4 First, we consider the local loss function for each participating institution :
(55)
where $\ell$ is the loss function and $f(h_i; \theta_i)$ is the model output.
Assume the multimodal feature vector is obtained by concatenating the image feature vector and the text feature vector and then computing through a fully connected layer. That is:
(56)
Computed through a fully connected layer:
(57)
where $W$ is the weight matrix, $b$ is the bias vector, and $\sigma$ is the activation function.
Assume $W$ can be decomposed into the weight matrices for image and text features, $W_v$ and $W_t$:
(58)
Thus, the output of the fully connected layer can be expressed as:
(59)
The model output can be expressed as:
(60)
where $\theta_i$ are the model parameters.
For the local loss function, we take the gradient with respect to the model parameters :
(61)
According to the chain rule, we can expand the gradient:
(62)
Expanding the gradient of the model output :
(63)
Note that the gradient of the regularization term is:
(64)
Substitute the gradients into the local model parameter update formula:
(65)
For the case of the multimodal feature vector , it can be further simplified as:
(66)
where $W_v$ and $W_t$ are the weight matrices for image and text features, and $b$ is the bias vector.
In conclusion, the corollary is proved.