Federated learning-driven collaborative recommendation system for multi-modal art analysis and enhanced recommendations

PeerJ Computer Science

Introduction

Background

With the rapid development of artificial intelligence technology, deep learning-based recommendation systems have been widely applied in various fields. From e-commerce product recommendations to music and video content recommendations, recommendation systems have become important tools to enhance user experience. In the art field, especially in the similarity search and recommendation systems for artworks, there are unique challenges (Wu et al., 2024; Mintie, 2023). Artworks often have high originality and commercial value, making data privacy and copyright protection issues particularly important (Nishioka, Hauke & Scherp, 2020; Xiong & Zhang, 2023; Ajmal et al., 2023).

Freelance artists and art galleries are the main creators and collectors of artworks. They want to increase the exposure and sales of their works through advanced recommendation systems. However, these institutions are reluctant to share their original artwork data directly due to concerns about data breaches and copyright infringement (White & Matulionyte, 2020; Peplow, 2021). Additionally, wealthy art collectors and museums are also primary consumers of artworks. These entities play a significant role in the art market, seeking advanced recommendation systems to enhance their collections while ensuring privacy and copyright protection.

The extensive use of AI painting tools has also raised concerns about copyright infringement of unpublished works. In recent years, data privacy and copyright protection issues have drawn widespread attention in the art field. Getty Images accused the AI company Stability AI of using millions of unauthorized image data to train its image generation model, which harmed the commercial value of its images (BakerHostetler, 2024). In the field of AI-generated artworks, a platform used more than 100,000 artworks without authorization to train its generation model. These works came from thousands of artists, harming the interests of original artists (Appel, Neelbauer & Schweidel, 2023). These cases highlight the severe challenges of data privacy and copyright protection in the art field. These issues make it urgent to achieve efficient artwork recommendation while protecting data privacy and copyrights.

This work aims to address these challenges by proposing a federated learning framework that enables multiple institutions to collaboratively train recommendation models without sharing their raw data. By leveraging federated learning, we can enhance the quality of similarity search and recommendation for artworks while ensuring that the data privacy and copyright protection concerns of freelance artists and art galleries are adequately addressed.

Related work

Numerous studies have explored the application of deep learning and federated learning in recommendation systems, emphasizing the importance of data privacy and copyright protection. For instance, Chen et al. (2023) reviewed the latest developments in deep reinforcement learning for recommendation systems but noted insufficient consideration of data privacy issues. Dong et al. (2022) investigated trust-aware recommendation systems, focusing on robustness and interpretability, yet their exploration of federated learning was limited. Lee & Kim (2022) proposed deep learning recommendation systems using cross-convolution filters to capture complex user-item interactions, though they did not adequately address cross-institution collaboration.

Recent advancements have highlighted the potential of federated learning to enhance recommendation systems while maintaining data privacy. Dong et al. (2023) introduced the FedSR framework for Point-of-Interest recommendation systems, addressing data sparsity and Non-Independent and Identically Distributed (Non-IID) issues through Contrastive Learning. van Berlo, Saeed & Ozcelebi (2020) developed a federated unsupervised representation learning architecture, demonstrating how federated learning can pre-train deep neural networks with unlabeled data to protect user privacy. Elayan, Aloqaily & Guizani (2021) proposed a Deep Federated Learning framework for decentralized healthcare systems, illustrating the benefits of federated learning in maintaining user privacy and improving model performance in sensitive data environments. To address these issues, a growing number of researchers are seeking innovative solutions. However, previous studies still have notable shortcomings, as summarized in Table 1.

Table 1:
Summary of related studies.
Author | Application scenario | Research content | Possible shortcomings
Chen et al. (2023) | Recommendation systems | Review of the application and latest developments of deep reinforcement learning in recommendation systems | Insufficient consideration of data privacy and copyright protection issues
Dong et al. (2022) | Trust-aware recommendation systems | Trust-aware recommendation systems from the perspective of deep learning, including social trust, robustness, and interpretability | Limited exploration of applications in multimodal data fusion and federated learning
Lee & Kim (2022) | Recommendation systems | Deep learning recommendation systems based on cross-convolution filters, capturing complex interactions between user and item features | Insufficient protection of data privacy and cross-institution collaboration
Park & Lee (2023) | Deep sleep recommendation systems | Personalized deep sleep recommendation using hybrid deep learning methods, combining user and collaborative filtering methods | No consideration of data protection and copyright issues
Jeong & Kim (2022) | Context-aware recommendation systems | Deep learning recommendation systems based on context features, combining neural networks and autoencoders for feature extraction and score prediction | Few applications for data privacy protection and cross-institution data collaboration
Vu & Le (2023) | Multi-criteria recommendation systems | Context-aware multi-criteria recommendation systems based on deep learning, using deep neural networks to predict ratings and learn aggregation functions | Insufficient consideration of privacy protection and copyright protection
Tegene et al. (2023) | Collaborative recommendation systems | Latent factor models based on deep learning and embedding to solve data sparsity issues and extract nonlinear features | Limited research on applications in federated learning and multimodal data fusion
Wu, Sun & Shang (2023) | Deep learning recommendation systems | Proposed DE-Opt framework to optimize hyperparameters of deep learning recommendation systems, improving recommendation accuracy and computational efficiency | Lack of exploration of multimodal data fusion and cross-institution data protection
Torkashvand, Jameii & Reza (2023) | Collaborative filtering recommendation systems | Systematic review of deep learning collaborative filtering recommendation systems, categorizing and analyzing existing methods and their advantages and disadvantages | Insufficient research on data privacy and cross-institution collaboration protection
Arthur et al. (2022) | Cross-domain recommendation systems | Proposed a discriminative geometric deep learning model to solve cold start and data sparsity issues in cross-domain recommendations | Less exploration of applications in multimodal data fusion and privacy protection
Dong et al. (2023) | POI recommendation systems | Proposed FedSR framework for POI-RS using sequential information and Contrastive Learning to address data sparsity and Non-IID issues in FL | Geographic and cultural differences affecting model training effectiveness
van Berlo, Saeed & Ozcelebi (2020) | Federated learning | Introduced federated unsupervised representation learning for pre-training deep neural networks with unlabeled data in a federated setting | Limited to scenarios where labeled data can be generated from user interaction
Elayan, Aloqaily & Guizani (2021) | Healthcare systems | Proposed Deep Federated Learning framework for decentralized healthcare systems to maintain user privacy and improve model performance | Model conversion time affecting quality of service to users
DOI: 10.7717/peerj-cs.2405/table-1

Our contribution

This study proposes a cross-institutional artwork similarity search and recommendation system—AICRS (AI-based Collaborative Recommendation System) framework, which combines multimodal data fusion and federated learning to address data privacy and copyright protection issues, as shown in Fig. 1. The main contributions of this article are as follows:

Figure 1: AICRS framework (Image source: Author’s own).

Figure 1 illustrates the AICRS framework, consisting of multiple components working together to provide a secure and efficient recommendation system. The framework involves various entities such as art galleries, artists, trading platforms, and self-employed individuals who contribute their data while maintaining control over their local models. The data, which remains with the participants, includes diverse artwork types (advertising, comics, publicity, art collections, etc.), represented by different colors for different types. The local models are trained on participant-owned data, and only the necessary parameters are exchanged between participants and the central trusted server. This server aggregates the parameters without exposing the original artwork, ensuring privacy and copyright protection while making customer recommendations based on their needs. The integration of these components ensures a collaborative yet secure environment for recommending artwork across institutions, leveraging the strengths of federated learning and multimodal data fusion.

  • We propose a Local Multi-Modal Feature Extraction and Aggregation algorithm (L-MFEA). It combines pre-trained convolutional neural networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT) models. This extracts rich features from image and text data, generating multi-modal feature vectors and improving the accuracy of the recommendation system.

  • We propose a Federated Model Parameter Aggregation and Optimization algorithm (F-MPAO). This algorithm trains models locally at each participating institution and aggregates parameters to optimize the global model. It effectively addresses privacy and security issues caused by centralized data storage.

  • Combining multimodal data fusion and federated learning strategies, we propose a cross-institutional artwork similarity search and recommendation system (AICRS framework). This improves the overall performance and recommendation effect of the recommendation system while ensuring data privacy and copyright protection.

The remainder of this article is organized as follows: “Research Issues” discusses the key research issues related to recommendation systems, federated learning, and model aggregation strategies. “Experiments and Results” presents the experimental setup and results, including a detailed analysis of the system’s performance. Finally, “Conclusion” concludes the article and highlights potential directions for future research.

Research issues

Protecting user data privacy and copyright is crucial in recommendation systems. Federated learning allows collaborative model training without sharing raw data, ensuring privacy and security. In art recommendation systems, this protects sensitive artwork data and copyrights. Combining federated learning with deep learning, we propose the AICRS (AI-based Collaborative Recommendation System) for efficient and secure artwork recommendations.

Privacy protection in art recommendation systems

The dataset of artworks used in this research is represented as $D=\{D_1, D_2, \ldots, D_N\}$. Each $D_i$ represents a local dataset of a participant (freelance artist or art gallery) containing $n_i$ artworks, i.e., $D_i=\{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$. Here, $x_{i,j}$ represents the feature of the $j$-th artwork of the $i$-th participant, and $y_{i,j}$ represents the corresponding label. Each participant trains their model locally and updates parameters by minimizing the following loss function:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta \nabla_{\theta} L_i(\theta_i^{(t)})$$
where $L_i(\theta) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\!\left(f(x_{i,j}; \theta), y_{i,j}\right)$ is the loss of the $i$-th participant, $\ell(\cdot,\cdot)$ is the loss function (e.g., mean squared error or cross-entropy), and $\eta$ is the learning rate. This update rule, commonly used in stochastic gradient descent, is discussed in various works, including Jeon et al. (2023). The global model is updated by aggregating the local model parameters of all participants:

$$\theta^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N} \theta_i^{(t+1)}$$
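As a minimal PyTorch-style sketch of one such round (not the authors' released code), the snippet below performs a local gradient update on a copy of the global model and then averages the client parameters without weighting; the model class, learning rate, and client data loaders are hypothetical placeholders.

```python
import copy
import torch

def local_update(global_model, loader, loss_fn, lr=0.001, epochs=1):
    """One client's local step: theta_i <- theta_i - eta * grad L_i(theta_i)."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def average_parameters(client_states):
    """Unweighted average of client parameters, matching the equation above."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    return avg
```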

In the recommendation system, we use deep learning models to extract features of artworks for similarity search and recommendation. Suppose the deep learning model is a convolutional neural network (CNN); its structure can be represented as:

$$h = \mathrm{CNN}(x; \theta)$$
where $h$ is the extracted high-dimensional feature vector, $x$ is the input artwork, and $\theta$ denotes the model parameters. For similarity search, we use cosine similarity to measure the similarity between two artwork feature vectors:

$$\mathrm{sim}(h_i, h_j) = \frac{h_i \cdot h_j}{\|h_i\|\,\|h_j\|}$$
where $h_i$ and $h_j$ represent the feature vectors of the $i$-th and $j$-th artworks, respectively, and $\|\cdot\|$ denotes the L2 norm of a vector. Cosine similarity is a widely used method for comparing feature vectors in recommendation systems, as detailed in Sankararaman et al. (2020). Our optimization objective is to maximize the performance of the recommendation system while protecting data privacy and copyrights. Specifically, the optimization objective can be expressed as:
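For illustration only, the cosine-similarity ranking step can be sketched with PyTorch's built-in cosine similarity; the query and gallery tensors are assumed to hold the feature vectors produced by the CNN above.

```python
import torch
import torch.nn.functional as F

def top_k_similar(query_feat, gallery_feats, k=5):
    """Rank gallery artworks by cosine similarity to a query feature vector."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)
    scores, indices = sims.topk(k)
    return scores, indices
```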

$$\min_{\theta} \frac{1}{N}\sum_{i=1}^{N} L_i(\theta) + \lambda R(\theta)$$
where $R(\theta)$ is the regularization term and $\lambda$ is the regularization coefficient. This formulation is a standard approach to incorporating regularization in machine learning, discussed extensively in Tian, Zhang & Zhang (2023).

Problem 1 Our core research question is how to achieve an efficient artwork recommendation system using federated learning and deep learning technologies while protecting the privacy and copyright of artwork data. The specific mathematical model is as follows:

$$\begin{aligned}
\min_{\theta} \quad & \frac{1}{N}\sum_{i=1}^{N} L_i(\theta) + \lambda R(\theta) \\
\text{subject to} \quad & \theta_i^{(t+1)} = \theta_i^{(t)} - \eta \nabla_{\theta} L_i(\theta_i^{(t)}) \\
& \theta^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N} \theta_i^{(t+1)} \\
& h = \mathrm{CNN}(x; \theta) \\
& \mathrm{sim}(h_i, h_j) = \frac{h_i \cdot h_j}{\|h_i\|\,\|h_j\|}
\end{aligned}$$

Local feature extraction and aggregation based on multimodal data fusion

Multimodal data fusion: local feature extraction and aggregation based on CNN and BERT models

  • Traditional unimodal algorithms, like CNN or LSTM, can only handle single-modal data (image or text). They cannot fully utilize the multimodal features of artworks (Daneshvar & Ravanmehr, 2022). They also cannot effectively capture details in artworks, leading to large prediction errors and high loss values, resulting in poor recommendation effects (Fachrela et al., 2023). Traditional methods need to store and process large amounts of data centrally, posing data privacy and copyright protection issues (Nithya, Geetha & Kumar, 2024).

  • The AICRS framework combines multimodal data fusion techniques (image and text features). It uses pre-trained CNN and BERT models to extract rich features from image and text data and then fuses them to generate multimodal feature vectors. This can better capture the details of artworks, significantly reduce prediction errors and loss values, and improve the accuracy of the recommendation system. By using federated learning strategies, model parameters are shared among multiple participating institutions instead of directly sharing data. This protects data privacy and copyright while improving the generalization ability of the model.

Multimodal data fusion

Suppose there are $N$ artworks. Each artwork $x_i$ contains image data $x_i^{\mathrm{img}}$ and text data $x_i^{\mathrm{text}}$. We first extract features from the image and text separately. For image feature extraction, we use a pre-trained CNN, such as a Residual Neural Network (ResNet), to extract the image feature vector:

$$h_i^{\mathrm{img}} = \sigma\!\left(\sum_{k=1}^{K}\sum_{u=1}^{U} w_k^{(u)} * x_i^{\mathrm{img}} + b_k^{(u)}\right)$$
where $h_i^{\mathrm{img}}$ represents the image feature vector of the $i$-th artwork, $w_k^{(u)}$ and $b_k^{(u)}$ are the weights and biases of the $k$-th convolution kernel in the $u$-th layer, $*$ denotes the convolution operation, and $\sigma$ denotes the activation function (such as the Rectified Linear Unit (ReLU)). Specifically, suppose the CNN contains $L$ convolution and pooling layers. The output of the $l$-th convolution and pooling layer can be represented as:

$$h_i^{(l)} = \sigma\!\left(\mathrm{Pool}\!\left(\sum_{m=1}^{M}\sum_{v=1}^{V} w_m^{(v,l)} * h_i^{(l-1)} + b_m^{(v,l)}\right)\right)$$
where $h_i^{(l-1)}$ is the output of the $(l-1)$-th layer, $\mathrm{Pool}$ denotes the pooling operation, and $w_m^{(v,l)}$ and $b_m^{(v,l)}$ are the weights and biases of the $m$-th convolution kernel in the $v$-th layer. The final image feature vector $h_i^{\mathrm{img}}$ is given by the output of the last layer of the CNN:

$$h_i^{\mathrm{img}} = \sigma\!\left(\sum_{n=1}^{N}\sum_{z=1}^{Z} w_n^{(z,L)} * h_i^{(L-1)} + b_n^{(z,L)}\right)$$
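A minimal sketch of this image branch, assuming a torchvision ResNet-50 backbone with its classification head replaced by an identity layer so that the pooled 2,048-dimensional output serves as $h_i^{\mathrm{img}}$; the preprocessing values are the standard ImageNet ones, not taken from the article.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ResNet-50 whose final classification layer is removed,
# so the network returns the pooled convolutional feature vector.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def image_features(pil_image):
    """Return a 2,048-dimensional image feature vector h_i^img."""
    x = preprocess(pil_image).unsqueeze(0)   # shape (1, 3, 224, 224)
    return resnet(x).squeeze(0)              # shape (2048,)
```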

For text feature extraction, we use a pre-trained BERT model to extract the text feature vector:

$$h_i^{\mathrm{text}} = \mathrm{BERT}(x_i^{\mathrm{text}}; \theta^{\mathrm{text}}) = \sum_{j=1}^{J}\sum_{q=1}^{Q} \alpha_j^{(q)} h_j^{\mathrm{embed}} + b_j^{(q)}$$
where $h_i^{\mathrm{text}}$ represents the text feature vector of the $i$-th artwork, $\alpha_j^{(q)}$ and $b_j^{(q)}$ are the attention weights and biases of the $j$-th word in the $q$-th layer, and $h_j^{\mathrm{embed}}$ represents the embedding vector of the $j$-th word. Specifically, BERT uses a multi-layer Transformer structure to extract text features. The Transformer architecture in BERT consists of multiple layers of self-attention and feed-forward neural networks, which enables it to capture complex dependencies and contextual information from the text data. For the $l$-th Transformer layer, the output can be represented as:

$$h_i^{\mathrm{text}(l)} = \mathrm{LayerNorm}\!\left(h_i^{\mathrm{text}(l-1)} + \mathrm{MultiHeadAttention}\!\left(h_i^{\mathrm{text}(l-1)}; \theta_{\mathrm{attn}}^{(l)}\right)\right)$$
where $h_i^{\mathrm{text}(l-1)}$ is the output of the $(l-1)$-th layer, $\mathrm{MultiHeadAttention}$ denotes the multi-head attention mechanism, and $\theta_{\mathrm{attn}}^{(l)}$ are the parameters of the $l$-th layer. The final text feature vector $h_i^{\mathrm{text}}$ is given by the output of the last layer of BERT:

$$h_i^{\mathrm{text}} = \mathrm{LayerNorm}\!\left(h_i^{\mathrm{text}(L-1)} + \mathrm{MultiHeadAttention}\!\left(h_i^{\mathrm{text}(L-1)}; \theta_{\mathrm{attn}}^{(L)}\right)\right)$$
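A minimal sketch of the text branch, assuming the Hugging Face bert-base-uncased checkpoint and using the [CLS] token of the last hidden layer as $h_i^{\mathrm{text}}$ (a common convention; the article does not specify the pooling strategy).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

@torch.no_grad()
def text_features(description, max_length=128):
    """Return a 768-dimensional text feature vector h_i^text from an artwork description."""
    tokens = tokenizer(description, truncation=True, max_length=max_length,
                       return_tensors="pt")
    outputs = bert(**tokens)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] embedding, shape (768,)
```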

To generate a comprehensive multimodal feature vector, we concatenate the image and text features extracted from CNN and BERT, then process them through a fully connected layer. The feature fusion process can be represented as:

$$h_i = \sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)$$
where $h_i$ represents the multimodal feature vector of the $i$-th artwork, $[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}]$ represents the concatenation of the image and text features, $W_{\mathrm{fusion}}$ is the weight matrix, $b_{\mathrm{fusion}}$ is the bias vector, and $\sigma$ is the activation function. Suppose the fully connected layer contains $M$ neurons. The computation can be represented as:

$$h_i^{(f)} = \sigma\!\left(\sum_{k=1}^{K}\sum_{r=1}^{R} W_k^{(r)}\left[h_{i,k}^{\mathrm{img}}; h_{i,k}^{\mathrm{text}}\right] + b_k^{(r)}\right)$$
where $h_i^{(f)}$ is the output of the fully connected layer, $W_k^{(r)}$ and $b_k^{(r)}$ are the weights and biases of the $k$-th neuron in the $r$-th layer, and $\sigma$ is the activation function (such as ReLU). The final multimodal feature vector $h_i$ can be represented as:

$$h_i = \sigma\!\left(\sum_{k=1}^{K}\sum_{r=1}^{R} W_k^{(r)}\left[h_{i,k}^{\mathrm{img}}; h_{i,k}^{\mathrm{text}}\right] + b_k^{(r)}\right)$$

We further normalize the feature vector $h_i$ to ensure balance between the different modalities:

$$\hat{h}_i = \frac{h_i}{\|h_i\|_2} = \frac{h_i}{\sqrt{\sum_{j=1}^{J} h_{i,j}^2} + \varepsilon}$$
where $\|h_i\|_2$ is the L2 norm of $h_i$ and $\varepsilon$ is a small constant to avoid division by zero. This normalization ensures that all feature vectors are on a uniform scale, improving the accuracy of subsequent processing and recommendation. The output $\{\hat{h}_i\}_{i=1}^{N}$ of this algorithm is used as input to the subsequent recommendation model.
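The fusion and normalization steps can be sketched as a small PyTorch module. The 2,048- and 768-dimensional inputs match the ResNet-50 and BERT-base outputs above, and the two hidden sizes (512, 256) follow the parameters reported later in Table 3; this is an illustrative reading of the L-MFEA aggregation step, not the authors' released code.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Concatenate image and text features, fuse through FC layers, then L2-normalize."""
    def __init__(self, img_dim=2048, text_dim=768, hidden_dims=(512, 256), eps=1e-8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(img_dim + text_dim, hidden_dims[0]),
            nn.ReLU(),
            nn.Linear(hidden_dims[0], hidden_dims[1]),
            nn.ReLU(),
        )
        self.eps = eps

    def forward(self, h_img, h_text):
        h = self.fc(torch.cat([h_img, h_text], dim=-1))              # fused vector h_i
        return h / (h.norm(p=2, dim=-1, keepdim=True) + self.eps)    # normalized \hat{h}_i
```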

Federated learning-based cross-institutional model parameter aggregation and optimization

Cross-institutional collaborative model: parameter aggregation and optimization based on federated learning framework

  • In cross-institutional data sharing, traditional methods cannot effectively protect the data ownership and copyright of each participating institution. This makes institutions reluctant to share data, affecting the overall performance of the model. Traditional model training methods require huge computational and storage resources when handling large-scale data, which can lead to single points of failure and poor system robustness (Wang & Kawagoe, 2018; Musto et al., 2010; Kim, Kang & Lee, 2019). Traditional algorithms also struggle to achieve efficient model parameter updates and optimization in cross-institutional cooperation, leading to slow convergence of the global model and limited performance improvement (Messina et al., 2019; Wang et al., 2019).

  • This algorithm aggregates the local model parameters of each participating institution through methods such as weighted averaging. It ensures the data ownership and copyright of each institution, enhances their willingness to cooperate, and improves the overall performance of the model. It also reduces reliance on a central server, improves system robustness, and avoids single points of failure. With improved optimization algorithms and effective parameter aggregation strategies, it significantly speeds up the convergence of the global model and improves performance.

Cross-institutional collaborative model

Suppose there are $N$ participating institutions. Each institution $i$ has a local dataset $D_i=\{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$, where $x_{i,j}$ is the $j$-th sample of the $i$-th institution and $y_{i,j}$ is its corresponding label. On the local server of each participating institution, the L-MFEA algorithm is used to extract multimodal features $h_{i,j}$:

$$h_{i,j} = \text{L-MFEA}(x_{i,j}; \theta_i)$$
where $\theta_i$ are the local model parameters of the $i$-th institution. Next, the model is trained on the local dataset to minimize the local loss function:

$$L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) + \lambda \|\theta_i\|_2^2$$
where $\ell(\cdot,\cdot)$ is the loss function (such as cross-entropy loss), $f$ is the model output, and $\lambda \|\theta_i\|_2^2$ is the regularization term.

Local model parameters are updated using the gradient descent algorithm:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\!\left(\nabla_{\theta_i} L_i(\theta_i^{(t)}) + \gamma \nabla_{\theta_i} R(\theta_i^{(t)})\right)$$
where $\eta$ is the learning rate, $R(\theta_i)$ is the regularization term, and $\gamma$ is the weight of the regularization term. In each round of federated learning, the central server aggregates the local model parameters of each participating institution. The aggregation method, weighted averaging, is defined as:

$$\theta^{(t+1)} = \frac{1}{\sum_{i=1}^{N} n_i} \sum_{i=1}^{N} n_i\, \theta_i^{(t+1)}$$
where $\theta^{(t+1)}$ are the global model parameters and $n = \sum_{i=1}^{N} n_i$ is the total amount of data from all participating institutions. The central server distributes the aggregated global model parameters $\theta^{(t+1)}$ back to each participating institution to update the local model parameters:

$$\theta_i^{(t+1)} = \theta^{(t+1)} \quad \text{for all } i = 1, 2, \ldots, N$$

Each participating institution, after receiving the updated global model parameters, continues to train on the local dataset to minimize the local loss function. Repeat local training and parameter aggregation until the global model converges.
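A minimal sketch of one F-MPAO aggregation round, assuming each client reports its parameter state dict and sample count $n_i$; the weighting follows the equation above, and the broadcast step simply copies the aggregated parameters back into each client model. This is an illustrative sketch, not the authors' implementation.

```python
import copy
import torch

def weighted_aggregate(client_states, client_sizes):
    """Weighted average of client parameters: theta = sum(n_i * theta_i) / sum(n_i)."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

def broadcast(global_state, client_models):
    """Distribute the aggregated global parameters back to every institution."""
    for model in client_models:
        model.load_state_dict(global_state)
```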

Theorem 1. Given a reasonable learning rate $\eta$ and sufficient iterations, the loss function $L(\theta)$ of the global model parameters $\theta$ will converge, assuming the local loss functions $L_i(\theta_i)$ of all participating institutions converge. Specifically, assume each local model satisfies the following condition during the iteration process:

$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right) + \frac{\eta^2}{2}(L_L + L_R)$$
where $L_L$ and $L_R$ are the Lipschitz constants of the loss function $L$ and the regularization term $R$, respectively, and $L^*$ is the global optimal loss value.

Corollary 1. Based on the above theorem, if the multimodal feature vector $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ and then passing the result through the fully connected layer, the final model parameter update can be represented as:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\left(\sum_{v=1}^{V} \nabla_{\theta_i}\, \ell\!\left(f\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{(\mathrm{img})} h_{i,p}^{\mathrm{img}} + W_{p,q}^{(\mathrm{text})} h_{i,q}^{\mathrm{text}} + b_{p,q}\right), y_{i,v}\right) + \lambda \theta_i\right)$$
where $\eta$ represents the learning rate, $W_{p,q}^{(\mathrm{img})}$ and $W_{p,q}^{(\mathrm{text})}$ represent the weight matrices of the image and text features, respectively, $b_{p,q}$ represents the bias vector, and $\lambda$ represents the regularization parameter.

AICRS framework: algorithm pseudocode and complexity analysis

The AICRS framework is presented as a comprehensive solution that integrates multiple components and processes to achieve efficient and secure artwork recommendations (Algorithm 1). While the overall structure is defined as a framework, the detailed implementation of its components and processes is expressed in the form of an algorithm. This approach allows us to provide a clear and precise description of the operational steps involved in the framework, ensuring reproducibility and facilitating practical application. By presenting the pseudocode, we aim to illustrate the exact sequence of operations, including data handling, model training, and federated learning procedures, thus bridging the conceptual framework with its practical execution.

Algorithm 1 :
AICRS framework.
Input: Local dataset $D_i=\{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$
Output: Global model parameters $\theta$
1 for each participating institution $i = 1$ to $N$ do
2   for each sample $j = 1$ to $n_i$ do
3    Extract multimodal features $h_{i,j}$ using Eq. (17);
4    Extract image features $h_i^{\mathrm{img}}$ using Eq. (7);
5    Extract text features $h_i^{\mathrm{text}}$ using Eq. (10);
6    Concatenate image and text features to generate the multimodal feature vector $h_i$ using Eq. (13);
7    Normalize the feature vector $h_i$ to obtain $\hat{h}_i$ using Eq. (41);
8  Train the model on the local dataset, minimizing the local loss function $L_i(\theta_i)$ using Eq. (55);
9  Update local model parameters $\theta_i^{(t+1)}$ using Eq. (65);
10 for each round of federated learning $t = 1$ to $T$ do
11  The central server aggregates the local model parameters of each participating institution using Eq. (48);
12  Distribute the aggregated global model parameters $\theta^{(t+1)}$ back to each participating institution using Eq. (21);
13  for each participating institution $i = 1$ to $N$ do
14    Continue training on the local dataset, minimizing the local loss function $L_i(\theta_i)$;
15    Update local model parameters $\theta_i^{(t+1)}$;
16 while the global model has not converged do
17  for each participating institution $i = 1$ to $N$ do
18   Continue training on the local dataset, minimizing the local loss function $L_i(\theta_i)$ using Eq. (55);
19   Update local model parameters $\theta_i^{(t+1)}$ using Eq. (65);
20  The central server aggregates the local model parameters of each participating institution using Eq. (48);
21  Distribute the aggregated global model parameters back to each participating institution using Eq. (21);
22 return Global model parameters $\theta$
DOI: 10.7717/peerj-cs.2405/table-5
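Read operationally, Algorithm 1 amounts to repeated rounds of local training followed by weighted aggregation and broadcast. The sketch below strings together the illustrative helpers defined earlier (local_update, weighted_aggregate, broadcast); the client data loaders, sample counts, and model class are hypothetical placeholders, and the fixed round count stands in for the convergence test.

```python
import copy
import torch.nn as nn

def aicrs_train(global_model, client_loaders, client_sizes,
                rounds=50, local_epochs=10, lr=0.001):
    """Illustrative AICRS-style federated loop: local training, aggregation, broadcast."""
    loss_fn = nn.CrossEntropyLoss()
    client_models = [copy.deepcopy(global_model) for _ in client_loaders]
    for _ in range(rounds):                      # global training rounds T
        client_states = [
            local_update(model, loader, loss_fn, lr=lr, epochs=local_epochs)
            for model, loader in zip(client_models, client_loaders)
        ]
        global_state = weighted_aggregate(client_states, client_sizes)
        global_model.load_state_dict(global_state)
        broadcast(global_state, client_models)   # send theta^{(t+1)} back to clients
    return global_model
```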

Suppose each participating institution has $n_i$ samples. The constant times for image feature extraction and text feature extraction are $C_{\mathrm{img}}$ and $C_{\mathrm{text}}$, respectively. The time complexity of each training iteration is $C_{\mathrm{train}}$, the number of federated learning rounds is $T$, and the number of model parameters is $P$. Considering these factors, the time complexity is $O(T N^2 n_i I)$. For space complexity, each participating institution stores its local dataset $D_i$ and model parameters $\theta_i$, and the central server stores the global model parameters $\theta$ and the model parameters of each participating institution. Suppose the dimension of each sample is $D$. The space complexity is then $O(N n_i D + N^2 P)$.

To further analyze the performance of the proposed framework, we compared the AICRS framework with other state-of-the-art models, as shown in Table 2. The time complexity of AICRS is $O(T N^2 n_i I)$, which has better scalability compared to Karayel's (2023) $O(\varepsilon^{-2}\ln(\delta^{-1}) + \ln n)$ and Liu & Yu's (2022) $O(2^n/\log n)$; in particular, AICRS is more efficient when handling large-scale data and multiple institutions. Additionally, the space complexity of AICRS is $O(N n_i D + N^2 P)$, which is much lower than Liu & Yu's (2022) $O(2^n)$ and Karayel's (2023) $O(\varepsilon^{-2}\ln(\delta^{-1}) + \ln n)$. This significantly reduces storage requirements while maintaining performance. Overall, the optimization in time and space complexity of the AICRS algorithm makes it more advantageous in large-scale distributed federated learning systems.

Table 2:
Comparison of time and space complexities.
Related study | Time complexity | Space complexity
Karayel (2023) | $O(\varepsilon^{-2}\ln(\delta^{-1}) + \ln n)$ | $O(\varepsilon^{-2}\ln(\delta^{-1}) + \ln n)$
Liu & Yu (2022) | $O(2^n/\log n)$ | $O(2^n)$
Malandrino et al. (2021) | $O(N^3)$ | Variable
Zhou et al. (2023) | $O(P \log P)$ | $O(NP)$
AICRS | $O(T N^2 n_i I)$ | $O(N n_i D + N^2 P)$
DOI: 10.7717/peerj-cs.2405/table-2

Experiments and results

This section describes the dataset and experimental parameters used in our study. It provides an overview of the SemArt dataset, details the data split for training, validation, and testing, and outlines the experimental parameters set for the training and evaluation of the recommendation models. Additionally, it presents the results of the AICRS framework application, highlighting its performance compared to traditional CNN and LSTM models.

Dataset and experimental parameters

The SemArt dataset is specifically designed for the analysis and recommendation of artworks. It contains 21,384 images of artworks and their related textual descriptions. Each image comes with detailed text descriptions, including the title, artist, creation year, art style, and descriptive text (https://doi.org/10.17036/researchdata.aston.ac.uk.00000380).

To train and test the recommendation system, we split the dataset by art style. Each art style acts as a client for federated learning training. The training set includes a portion of the artworks from each style for model training. The test set includes the remaining artworks for evaluating the model’s recommendation accuracy. The accuracy of the recommendations is evaluated by comparing whether the recommended artworks match the actual artist’s style and type. The specific data content is shown in Fig. 2.

Figure 2: SemArt dataset samples, showing the multiple types of art images in the dataset and their corresponding comments and attributes (Image source: SemArt Dataset (https://doi.org/10.17036/researchdata.aston.ac.uk.00000380), licensed under CC BY-NC 4.0).

The dataset is split into training, validation, and test sets with the following percentages: 70% for training, 15% for validation, and 15% for testing. This split ensures that the model is adequately trained, validated, and tested to achieve reliable performance metrics. Our experiments were conducted using high-performance hardware to ensure efficient computation and model training.
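As an illustration of this setup, the snippet below partitions a metadata table by art style (each style acting as a federated client) and then splits each client's records 70/15/15; the column name "art_style" is hypothetical, since the exact SemArt field names are not reproduced here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_by_style(metadata: pd.DataFrame, seed=42):
    """Partition records by art style (one client per style), then 70/15/15 per client."""
    clients = {}
    for style, group in metadata.groupby("art_style"):   # hypothetical column name
        train, rest = train_test_split(group, test_size=0.30, random_state=seed)
        val, test = train_test_split(rest, test_size=0.50, random_state=seed)
        clients[style] = {"train": train, "val": val, "test": test}
    return clients
```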

Our experimental parameters are set as shown in Table 3. After setting these detailed experimental parameters, we trained and tested these algorithms on the SemArt dataset. The SemArt dataset includes images of artworks and their detailed textual descriptions from various art categories, covering multiple art styles from the 11th to the 20th century. By conducting experiments on this dataset, we can evaluate the effectiveness and performance of the AICRS framework in processing and recommending artworks.

Table 3:
Detailed experimental parameters.
Parameter name | Parameter value | Parameter name | Parameter value
Dataset | SemArt | Number of images | 21,384
Image resolution | 224 × 224 | Text description | Each image includes title, artist, creation year, art style, descriptive text
Categories | Painting, sculpture, photography, etc. | Time span | 11th to 20th century
Art styles | Baroque, Renaissance, Impressionism, modern art, etc. | Model structure (image feature extraction) | ResNet-50
Input size (image feature extraction) | 224 × 224 × 3 | Number of convolution layers (image feature extraction) | 50
Activation function (image feature extraction) | ReLU | Model structure (text feature extraction) | BERT-base
Number of hidden layers (text feature extraction) | 12 | Number of hidden units (text feature extraction) | 768
Number of attention heads (text feature extraction) | 12 | Input length (text feature extraction) | 128
Number of layers (feature fusion) | 2 | Number of neurons (feature fusion) | 512, 256
Activation function (feature fusion) | ReLU | Local training epochs | 10
Global training epochs | 50 | Learning rate | 0.001
Optimization algorithm | Adam | Batch size | 32
DOI: 10.7717/peerj-cs.2405/table-3

AICRS framework application results

This study focuses on developing a cross-institutional artwork recommendation system based on federated learning. It uses multimodal data fusion (image and text features) to enhance the performance of the recommendation system. The model accuracy measures the ability of the recommendation system to recommend the correct artworks, while the loss value reflects the prediction error during training.

Figure 3 shows the accuracy performance of the four models over 50 training rounds. The results indicate that the AICRS framework has significant advantages in processing and recommending artworks, surpassing traditional CNN and LSTM models, as well as the Federated Averaging (FED-AVG) model. The CNN model’s final accuracy after 50 training rounds is 81.52%. Accuracy grows rapidly in the early stages but stabilizes, maintaining above 80% from round 25 and reaching 81.52% at the end. The LSTM model’s final accuracy after 50 training rounds is 83.44%. The LSTM model also shows rapid accuracy growth initially, reaching about 82.63% around round 22, then slightly improving to 83.44%. The FED-AVG model’s final accuracy after 50 training rounds is 84.44%. The FED-AVG model demonstrates strong performance, surpassing the LSTM model in the later stages, with accuracy steadily increasing and reaching 84.44% at the end. The AICRS framework’s final accuracy after 50 training rounds is 92.02%. The AICRS framework performs significantly better than CNN, LSTM, and FED-AVG from the start, exceeding 90% accuracy before round 30 and reaching 92.02% at the end.

Figure 3: Comparison of the models' accuracy values over 50 training rounds.

Figure 4 shows the loss values of the four models (CNN, LSTM, AICRS framework, and FED-AVG) over 50 training rounds. The results show that the AICRS framework has significantly lower loss values in processing and recommending artworks compared to traditional CNN and LSTM models, as well as the FED-AVG model. Specifically, the CNN model's final loss value after 50 training rounds is 0.248. Its loss decreases rapidly in the early stages but stabilizes, staying below 0.26 from round 30 and reaching 0.248 at the end. The LSTM model's final loss value after 50 training rounds is 0.188. The LSTM model also shows a rapid decrease in loss value initially, reaching about 0.22 around round 25, then slightly decreasing to 0.188. The FED-AVG model's final loss value after 50 training rounds is 0.168. The FED-AVG model performs better than the LSTM model, showing a steady decrease in loss value and reaching 0.168 at the end. The AICRS framework's final loss value after 50 training rounds is 0.1284. The AICRS framework performs significantly better than CNN, LSTM, and FED-AVG from the start, dropping below a loss value of 0.20 before round 20 and reaching 0.1284 at the end. The AICRS framework thus shows significant advantages in both accuracy and loss, improving recommendation performance and reducing prediction errors in the artwork recommendation system.

Figure 4: Comparison of the models' loss values over 50 training rounds.

Table 4 shows the performance comparison of the three models, including accuracy, loss value and training time. It is evident that the AICRS framework outperforms the CNN and LSTM models in accuracy and loss value. Although its training time is slightly longer, the performance improvement is significant. The training time for each model was measured by recording the duration from the start to the completion of the training process on the same hardware configuration. Specifically, we used a server equipped with dual NVIDIA Tesla V100 GPUs, each with 32 GB of memory, and 256 GB of system RAM. The server is powered by dual Intel Xeon Gold 6258R CPUs, each with 28 cores, providing ample processing power for both training and inference phases. Each model was trained using the same dataset, batch size, and number of epochs to ensure a fair comparison. The reported training times are the average of three runs to account for any variability in the training process. The slight increase in training time for the AICRS framework is justified by its superior performance metrics, indicating a worthwhile trade-off for the significant gains in accuracy and reduced loss value.

Table 4:
Model performance comparison.
Model | Accuracy (%) | Loss value | Training time (ms)
CNN | 81.52 | 0.248 | 3,962
LSTM | 83.44 | 0.188 | 4,185
AICRS | 92.02 | 0.1284 | 4,577
DOI: 10.7717/peerj-cs.2405/table-4

Discussion

The rapid development of artificial intelligence has led to the widespread use of recommendation systems in various fields. In the art sector, implementing similarity search and recommendation systems poses unique challenges due to the high originality and commercial value of artworks. This necessitates robust data privacy and copyright protection. Freelance artists and art galleries aim to increase the exposure and sales of their works through advanced recommendation systems but are often hesitant to share their original artwork data due to concerns about data breaches and copyright infringement.

This study proposes a cross-institutional artwork similarity search and recommendation system (AICRS framework) that combines multimodal data fusion and federated learning to address data privacy and copyright protection issues. The experimental results show that the AICRS framework has significant advantages in processing and recommending artworks, surpassing traditional CNN and LSTM models.

The main reasons for the performance differences among the models are as follows:

  • The AICRS framework combines multimodal data fusion (image and text features), extracting richer features of artworks compared to single-modal CNN and LSTM models. This improves the accuracy of the recommendation system. The CNN and LSTM models’ final accuracy after 50 training rounds are 81.52% and 83.44%, respectively. The AICRS framework’s final accuracy reaches 92.02%. This shows that multimodal data fusion has significant advantages in capturing the details and features of artworks.

  • The AICRS framework uses pre-trained ResNet-50 and BERT models, which have better feature extraction and representation capabilities than traditional CNN and LSTM models. These models capture the details of artworks more effectively. In terms of loss value, the AICRS framework also performs better than other models. The CNN and LSTM models’ final loss values after 50 training rounds are 0.248 and 0.188, respectively. The AICRS framework’s final loss value is 0.1284. The more complex model structure enables the AICRS framework to handle complex features and reduce errors more effectively.

  • The AICRS framework adopts federated learning strategies, sharing model parameters among multiple participating institutions instead of directly sharing data. This improves the model’s generalization ability and privacy protection. This strategy not only protects the data privacy of participating institutions but also enhances the overall performance of the model. The training time of the AICRS framework is slightly longer than other models (4,577 ms), but the performance improvement is significant.

Conclusion

This article proposes a cross-institutional artwork similarity search and recommendation system (AICRS framework) that combines multimodal data fusion and federated learning to address data privacy and copyright protection issues. The system uses pre-trained convolutional neural network (CNN) and BERT models to extract rich features from image and text data. It trains models locally at each participating institution and aggregates parameters through a federated learning framework to optimize the global model. The experimental results demonstrate that the AICRS framework has significant advantages in processing and recommending artworks, improving recommendation performance and reducing prediction errors. It enables collaboration among art institutions, offering accurate recommendations while complying with data protection regulations. The AICRS framework still has room for improvement in its reliance on high-quality multimodal data. Future research will explore enhancing the system's robustness in cases of incomplete or low-quality data, as well as expanding the framework to support real-time recommendations across more diverse types of art media.

Appendix: mathematical theorems and corollary proofs

Theorem 2. Let $h_i^{\mathrm{img}}$ and $h_i^{\mathrm{text}}$ be the image and text feature vectors, respectively, let $[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}]$ be their concatenation, and let $h_i$ be the multimodal feature vector. The normalized multimodal feature vector $\hat{h}_i$ satisfies:

$$\hat{h}_i = \frac{\sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)}{\sqrt{\sum_{k=1}^{K}\left(\sum_{j=1}^{J} \sigma\!\left(\sum_{m=1}^{M} W_{m,j}^{(\mathrm{img})} h_{i,m}^{\mathrm{img}} + \sum_{n=1}^{N} W_{n,j}^{(\mathrm{text})} h_{i,n}^{\mathrm{text}} + b_{j,\mathrm{fusion}}\right)\right)_k^2} + \varepsilon}$$

where $\sigma$ is the activation function, $W_{\mathrm{fusion}}$ is the weight matrix of the fully connected layer, $b_{\mathrm{fusion}}$ is the bias vector, and $\varepsilon$ is a small constant to avoid division by zero.

Proof 1. First, define the optimization problem as follows:

$$\min_{\{h_i\}_{i=1}^{N}} \sum_{i=1}^{N} \|h_i - \hat{h}_i\|_2^2 + \lambda \sum_{i=1}^{N} \|h_i\|_2^2$$

where $\lambda > 0$ is the regularization parameter.

Expand the objective function as:

$$J\!\left(\{h_i\}_{i=1}^{N}\right) = \sum_{i=1}^{N}\left(\|h_i - \hat{h}_i\|_2^2 + \lambda \|h_i\|_2^2\right)$$

We can consider each term individually, i.e., optimize for each $h_i$. For each $h_i$, the optimization problem is:

$$\min_{h_i} \|h_i - \hat{h}_i\|_2^2 + \lambda \|h_i\|_2^2$$

Expanding the above expression and taking the derivative:

$$\nabla_{h_i}\left(\|h_i - \hat{h}_i\|_2^2 + \lambda \|h_i\|_2^2\right) = 2(h_i - \hat{h}_i) + 2\lambda h_i$$

Setting the derivative to zero, we get:

$$2(h_i - \hat{h}_i) + 2\lambda h_i = 0 \quad \Longrightarrow \quad h_i(1+\lambda) = \hat{h}_i$$

Solving for $h_i$ gives:

$$h_i = \frac{\hat{h}_i}{1+\lambda}$$

To obtain the normalized multimodal feature vector $\hat{h}_i$, we normalize the feature vector $h_i$:

$$\hat{h}_i = \frac{h_i}{\|h_i\|_2} = \frac{h_i}{\sqrt{\sum_{j=1}^{J} h_{i,j}^2} + \varepsilon}$$

where $\|h_i\|_2$ is the L2 norm of $h_i$ and $\varepsilon$ is a small constant to avoid division by zero.

By definition, we have:

$$h_i = \sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)$$

Substituting $\frac{\hat{h}_i}{1+\lambda}$ into the normalization equation and combining it with the fusion equation, we obtain the final normalized multimodal feature vector:

$$\hat{h}_i = \frac{\sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)}{\sqrt{\sum_{k=1}^{K}\left(\sum_{j=1}^{J} \sigma\!\left(\sum_{m=1}^{M} W_{m,j}^{(\mathrm{img})} h_{i,m}^{\mathrm{img}} + \sum_{n=1}^{N} W_{n,j}^{(\mathrm{text})} h_{i,n}^{\mathrm{text}} + b_{j,\mathrm{fusion}}\right)\right)_k^2} + \varepsilon}$$

In summary, the optimization problem has a unique solution that satisfies the equation, and the theorem is proved.

Corollary 2. Based on the above theorem, if the multimodal feature vector $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ and then computing through a fully connected layer, the normalized multimodal feature vector $\hat{h}_i$ can be expressed as follows:

$$\hat{h}_i = \frac{\sigma\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{\mathrm{fusion}}\left[h_{i,p}^{\mathrm{img}}; h_{i,q}^{\mathrm{text}}\right] + b_{p,q}^{\mathrm{fusion}}\right)}{\sqrt{\sum_{r=1}^{R}\left(\sum_{s=1}^{S}\sigma\!\left(\sum_{t=1}^{T} W_{t,s}^{(\mathrm{img})} h_{i,t}^{\mathrm{img}} + \sum_{u=1}^{U} W_{u,s}^{(\mathrm{text})} h_{i,u}^{\mathrm{text}} + b_{s,\mathrm{fusion}}\right)\right)_r^2} + \varepsilon}$$

where $\sigma$ is the activation function, $W_{p,q}^{\mathrm{fusion}}$ is the weight matrix of the fully connected layer, $b_{p,q}^{\mathrm{fusion}}$ is the bias vector, and $\varepsilon$ is a small constant to avoid division by zero.

Proof 2. First, consider the calculation process of the multimodal feature vector $h_i$. Assume $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$, and then computing through a fully connected layer:

$$h_i = \sigma\!\left(W_{\mathrm{fusion}}\left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right] + b_{\mathrm{fusion}}\right)$$

where $W_{\mathrm{fusion}}$ is the weight matrix of the fully connected layer, $b_{\mathrm{fusion}}$ is the bias vector, and $\sigma$ is the activation function.

We further refine the calculation of the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$. Assume the image feature vector $h_i^{\mathrm{img}}$ is obtained through a convolutional neural network:

$$h_i^{\mathrm{img}} = \sigma\!\left(\sum_{m=1}^{M} W_m^{(\mathrm{img})} * x_i^{\mathrm{img}} + b_m^{(\mathrm{img})}\right)$$

where $W_m^{(\mathrm{img})}$ and $b_m^{(\mathrm{img})}$ are the weights and biases of the convolution kernels, and $x_i^{\mathrm{img}}$ is the input image data.

Similarly, assume the text feature vector $h_i^{\mathrm{text}}$ is obtained through a pre-trained BERT model:

$$h_i^{\mathrm{text}} = \mathrm{BERT}(x_i^{\mathrm{text}}; \theta^{\mathrm{text}}) = \sigma\!\left(\sum_{n=1}^{N} W_n^{(\mathrm{text})} x_i^{\mathrm{text}} + b_n^{(\mathrm{text})}\right)$$

where $W_n^{(\mathrm{text})}$ and $b_n^{(\mathrm{text})}$ are the weights and biases of the BERT model, $x_i^{\mathrm{text}}$ is the input text data, and $\theta^{\mathrm{text}}$ denotes the parameters of the BERT model.

Next, we concatenate the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ to obtain the multimodal feature vector $h_i$:

$$h_i = \left[h_i^{\mathrm{img}}; h_i^{\mathrm{text}}\right]$$

Substituting Eqs. (36) and (37) into Eq. (38), we get:

$$h_i = \left[\sigma\!\left(\sum_{m=1}^{M} W_m^{(\mathrm{img})} * x_i^{\mathrm{img}} + b_m^{(\mathrm{img})}\right);\; \sigma\!\left(\sum_{n=1}^{N} W_n^{(\mathrm{text})} x_i^{\mathrm{text}} + b_n^{(\mathrm{text})}\right)\right]$$

Then, computing through the fully connected layer, we obtain the fused multimodal feature vector $h_i$:

$$h_i = \sigma\!\left(W_{\mathrm{fusion}}\left[\sigma\!\left(\sum_{m=1}^{M} W_m^{(\mathrm{img})} * x_i^{\mathrm{img}} + b_m^{(\mathrm{img})}\right);\; \sigma\!\left(\sum_{n=1}^{N} W_n^{(\mathrm{text})} x_i^{\mathrm{text}} + b_n^{(\mathrm{text})}\right)\right] + b_{\mathrm{fusion}}\right)$$

To ensure the balance of the feature vectors, we normalize them:

$$\hat{h}_i = \frac{h_i}{\|h_i\|_2} = \frac{h_i}{\sqrt{\sum_{r=1}^{R} h_{i,r}^2} + \varepsilon}$$

where $\|h_i\|_2$ is the L2 norm of $h_i$ and $\varepsilon$ is a small constant to avoid division by zero.

Substituting Eq. (40) into Eq. (41), we get the normalized multimodal feature vector:

$$\hat{h}_i = \frac{\sigma\!\left(W_{\mathrm{fusion}}\left[\sigma\!\left(\sum_{m=1}^{M} W_m^{(\mathrm{img})} * x_i^{\mathrm{img}} + b_m^{(\mathrm{img})}\right);\; \sigma\!\left(\sum_{n=1}^{N} W_n^{(\mathrm{text})} x_i^{\mathrm{text}} + b_n^{(\mathrm{text})}\right)\right] + b_{\mathrm{fusion}}\right)}{\sqrt{\sum_{r=1}^{R} (h_{i,r})^2} + \varepsilon}$$

Further expanding the expression of $h_{i,r}$:

$$h_{i,r} = \sum_{s=1}^{S}\sigma\!\left(\sum_{t=1}^{T} W_{t,s}^{(\mathrm{img})} h_{i,t}^{\mathrm{img}} + \sum_{u=1}^{U} W_{u,s}^{(\mathrm{text})} h_{i,u}^{\mathrm{text}} + b_{s,\mathrm{fusion}}\right)$$

Substituting Eq. (43) into Eq. (44), we get the final normalized multimodal feature vector:

$$\hat{h}_i = \frac{\sigma\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{\mathrm{fusion}}\left[h_{i,p}^{\mathrm{img}}; h_{i,q}^{\mathrm{text}}\right] + b_{p,q}^{\mathrm{fusion}}\right)}{\sqrt{\sum_{r=1}^{R}\left(\sum_{s=1}^{S}\sigma\!\left(\sum_{t=1}^{T} W_{t,s}^{(\mathrm{img})} h_{i,t}^{\mathrm{img}} + \sum_{u=1}^{U} W_{u,s}^{(\mathrm{text})} h_{i,u}^{\mathrm{text}} + b_{s,\mathrm{fusion}}\right)\right)_r^2} + \varepsilon}$$

In summary, the corollary is proved.

Theorem 3. Given a reasonable learning rate $\eta$ and sufficient iterations, if the local loss functions $L_i(\theta_i)$ of all participating institutions converge, the loss function $L(\theta)$ of the global model parameters $\theta$ will also converge. Specifically, suppose each local model satisfies the following condition during the iteration process:

$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right) + \frac{\eta^2}{2}(L_L + L_R)$$

where $L_L$ and $L_R$ are the Lipschitz constants of the loss function $L$ and the regularization term $R$, respectively, and $L^*$ is the global optimal loss value.

Proof 3. First, we consider the local loss function $L_i(\theta_i)$ of each participating institution $i$:

$$L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) + \lambda \|\theta_i\|_2^2$$

The local model parameters are updated using the gradient descent algorithm:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\!\left(\nabla_{\theta_i} L_i(\theta_i^{(t)}) + \gamma \nabla_{\theta_i} R(\theta_i^{(t)})\right)$$

where $\eta$ is the learning rate, $R(\theta_i)$ is the regularization term, and $\gamma$ is the weight of the regularization term.

Next, we consider the aggregation process of the global model parameters. The local model parameters of each participating institution are aggregated by weighted averaging:

$$\theta^{(t+1)} = \frac{1}{\sum_{i=1}^{N} n_i} \sum_{i=1}^{N} n_i\, \theta_i^{(t+1)}$$

where $\theta^{(t+1)}$ are the global model parameters and $n = \sum_{i=1}^{N} n_i$ is the total amount of data from all participating institutions.

Then, we analyze the changes in the global loss function $L(\theta)$. The global loss function is defined as:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{N} n_i\, L_i(\theta_i)$$

where $L_i(\theta_i)$ is the local loss function of the $i$-th participating institution.

Because the local loss functions of each participating institution are optimized based on the same global model parameters, we have:

$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right)$$

where the term $\eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right)$ represents the decrease in the loss function during gradient descent.

Considering the Lipschitz continuity of the loss function and the regularization term, we get

$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta\left(\frac{1}{N}\sum_{i=1}^{N}\left\|\nabla_{\theta_i} L_i(\theta_i^{(t)})\right\|^2 + \gamma\sum_{i=1}^{N}\left\|\nabla_{\theta_i} R(\theta_i^{(t)})\right\|^2\right) + \frac{\eta^2}{2}(L_L + L_R)$$

where $L_L$ and $L_R$ are the Lipschitz constants of the loss function $L$ and the regularization term $R$, respectively.

Due to the convergence of the local loss functions $L_i(\theta_i)$, we have:

$$\lim_{t \to \infty} L_i(\theta_i^{(t)}) = L_i^*$$

where $L_i^*$ is the local optimal loss value of the $i$-th participating institution.

Therefore, the global loss function $L(\theta)$ will also converge to its optimal value:

$$\lim_{t \to \infty} L(\theta^{(t)}) = L^*$$

where $L^*$ is the global optimal loss value.

In summary, given a reasonable learning rate $\eta$ and sufficient iterations, if the local loss functions $L_i(\theta_i)$ of all participating institutions converge, the loss function $L(\theta)$ of the global model parameters $\theta$ will also converge. The theorem is proved.

Corollary 1. Based on the above theorem, if the multimodal feature vector $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ and then computing through a fully connected layer, the final model parameter update can be expressed as:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\left(\sum_{v=1}^{V} \nabla_{\theta_i}\, \ell\!\left(f\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{(\mathrm{img})} h_{i,p}^{\mathrm{img}} + W_{p,q}^{(\mathrm{text})} h_{i,q}^{\mathrm{text}} + b_{p,q}\right), y_{i,v}\right) + \lambda \theta_i\right)$$

where $\eta$ is the learning rate, $W_{p,q}^{(\mathrm{img})}$ and $W_{p,q}^{(\mathrm{text})}$ are the weight matrices for the image and text features, respectively, $b_{p,q}$ is the bias vector, and $\lambda$ is the regularization parameter.

Proof 4. First, we consider the local loss function $L_i(\theta_i)$ for each participating institution $i$:

$$L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) + \lambda \|\theta_i\|_2^2$$

where $\ell(\cdot,\cdot)$ is the loss function and $f$ is the model output.

Assume the multimodal feature vector $h_i$ is obtained by concatenating the image feature vector $h_i^{\mathrm{img}}$ and the text feature vector $h_i^{\mathrm{text}}$ and then computing through a fully connected layer. That is:

$$h_{i,j} = \left[h_{i,j}^{\mathrm{img}}; h_{i,j}^{\mathrm{text}}\right]$$

Computed through a fully connected layer:

$$h_{i,j} = \sigma\!\left(W\left[h_{i,j}^{\mathrm{img}}; h_{i,j}^{\mathrm{text}}\right] + b\right)$$

where $W$ is the weight matrix, $b$ is the bias vector, and $\sigma$ is the activation function.

Assume $W$ can be decomposed into the weight matrices for the image and text features, $W^{(\mathrm{img})}$ and $W^{(\mathrm{text})}$:

$$W = \left[W^{(\mathrm{img})},\ W^{(\mathrm{text})}\right]$$

Thus, the output of the fully connected layer can be expressed as:

$$h_{i,j} = \sigma\!\left(\sum_{p=1}^{P} W_p^{(\mathrm{img})} h_{i,j}^{\mathrm{img}} + \sum_{q=1}^{Q} W_q^{(\mathrm{text})} h_{i,j}^{\mathrm{text}} + b\right)$$

The model output $f$ can be expressed as:

$$f(h_{i,j}; \theta_i) = \sum_{v=1}^{V} \theta_i^{(v)}\, \sigma\!\left(\sum_{p=1}^{P} W_p^{(\mathrm{img})} h_{i,j}^{\mathrm{img}} + \sum_{q=1}^{Q} W_q^{(\mathrm{text})} h_{i,j}^{\mathrm{text}} + b\right)^{(v)}$$

where $\theta_i$ are the model parameters.

For the local loss function, we take the gradient with respect to the model parameters $\theta_i$:

$$\nabla_{\theta_i} L_i(\theta_i) = \frac{1}{n_i}\sum_{j=1}^{n_i} \nabla_{\theta_i}\, \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) + \lambda \nabla_{\theta_i} \|\theta_i\|_2^2$$

According to the chain rule, we can expand the gradient:

$$\nabla_{\theta_i}\, \ell\!\left(f(h_{i,j}; \theta_i), y_{i,j}\right) = \frac{\partial \ell}{\partial f}\, \nabla_{\theta_i} f(h_{i,j}; \theta_i)$$

Expanding the gradient of the model output $f$:

$$\nabla_{\theta_i} f(h_{i,j}; \theta_i) = \sum_{v=1}^{V} \nabla_{\theta_i}\!\left(\theta_i^{(v)}\, \sigma\!\left(\sum_{p=1}^{P} W_p^{(\mathrm{img})} h_{i,j}^{\mathrm{img}} + \sum_{q=1}^{Q} W_q^{(\mathrm{text})} h_{i,j}^{\mathrm{text}} + b\right)^{(v)}\right)$$

Note that the gradient of the regularization term is:

$$\nabla_{\theta_i} \|\theta_i\|_2^2 = 2\theta_i$$

Substituting the gradients into the local model parameter update formula:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\left(\frac{1}{n_i}\sum_{j=1}^{n_i}\left(\frac{\partial \ell}{\partial f}\sum_{v=1}^{V} \nabla_{\theta_i}\!\left(\theta_i^{(v)}\, \sigma\!\left(\sum_{p=1}^{P} W_p^{(\mathrm{img})} h_{i,j}^{\mathrm{img}} + \sum_{q=1}^{Q} W_q^{(\mathrm{text})} h_{i,j}^{\mathrm{text}} + b\right)^{(v)}\right)\right) + 2\lambda\theta_i\right)$$

For the case of the multimodal feature vector $h_i$, this can be further simplified to:

$$\theta_i^{(t+1)} = \theta_i^{(t)} - \eta\left(\sum_{v=1}^{V} \nabla_{\theta_i}\, \ell\!\left(f\!\left(\sum_{p=1}^{P}\sum_{q=1}^{Q} W_{p,q}^{(\mathrm{img})} h_{i,p}^{\mathrm{img}} + W_{p,q}^{(\mathrm{text})} h_{i,q}^{\mathrm{text}} + b_{p,q}\right), y_{i,v}\right) + \lambda \theta_i\right)$$

where $W_{p,q}^{(\mathrm{img})}$ and $W_{p,q}^{(\mathrm{text})}$ are the weight matrices for the image and text features, and $b_{p,q}$ is the bias vector.

In conclusion, the corollary is proved.
