Deep Learning Based Face Recognition Method using Siamese Network

Enoch Solomon, Department of Computer Science, Virginia State University, Richmond, Virginia — esolomon@[Link]
Abraham Woubie, Silo AI, Helsinki, Finland — [Link]@[Link]
Eyael Solomon Emiru, Department of Information Engineering and Computer Science, University of Trento, Italy — [Link]@[Link]

arXiv:2312.14001v2 [[Link]] 9 Feb 2024

Abstract—Achieving state-of-the-art results in face verification systems typically hinges on the availability of labeled face training data, a resource that often proves challenging to acquire in substantial quantities. In this work, we propose employing Siamese networks for face recognition, eliminating the need for labeled face images. We achieve this by strategically leveraging negative samples alongside nearest-neighbor counterparts, thereby establishing positive and negative pairs through an unsupervised methodology. The architecture adopts a VGG encoder, trained as a double-branch siamese network. Our primary aim is to circumvent the need for labeled face image data, so we generate training pairs in an entirely unsupervised manner. Positive training data are selected within a dataset based on their highest cosine similarity scores with a designated anchor, while negative training data are selected in a parallel fashion, though drawn from a different dataset. During training, the proposed siamese network conducts binary classification via cross-entropy loss. Subsequently, during the testing phase, we directly extract face verification scores from the network's output layer. Experimental results reveal that the proposed unsupervised system delivers performance on par with a similar but fully supervised baseline.

Index Terms—biometrics, deep learning, face recognition, siamese network, unsupervised method

I. INTRODUCTION

In today's increasingly digital world, safeguarding personal information and ensuring security has become a top priority [1]–[10]. One of the most cutting-edge technologies addressing this concern is face verification, a powerful tool that leverages facial recognition to authenticate individuals [11]. This technology offers immense potential across various industries, from financial services to law enforcement, and even everyday consumer applications.

Deep learning methods are successful in image applications, specifically in face recognition [12]–[16]. These approaches excel in learning deep features and bottleneck features (BNF) [17], which are used within a conventional GMM-UBM framework [18]. However, these deep learning approaches typically rely on labeled training data.

The authors in [19] reported that PLDA is an efficient back-end for image recognition [19], [20]. Previous works reported that PLDA provides superior performance to cosine scoring, but this improvement comes at the cost of labeled training data. However, previous works show that it is not usually easy to have access to large amounts of labeled training data. Thus, the lack of labeled training data results in a significant performance gap between cosine and PLDA scoring techniques [21], [22] in face recognition. Although the authors in [23] proposed automatic labeling techniques, they could not appropriately estimate the true labels. Although these approaches perform reasonably well, their results still lag behind PLDA that uses actual labels [24]. The authors in [25] proposed an autoencoder-based unsupervised method to improve the performance of face recognition systems. Previous studies show that these approaches mainly aim at increasing the discriminative power of face image embeddings; thus, they can be applied as a back-end in the face verification task.

In our proposed work, the main goal is to reduce the reliance on labeled training data for a face verification system. We aim to obtain end-to-end face verification scores without using face image labels. We propose a siamese network [26] with a double-branch architecture, where each branch is a CNN encoder. These encoders are inspired by the VGG architecture and adapted for face verification [27]. Traditionally, Siamese networks are trained using pairs of training samples, such as anchor-positive and anchor-negative pairs. However, to avoid using face image labels, we generate these training sample pairs in an unsupervised manner. Positive training data are chosen within one dataset based on their highest cosine scores with the anchor sample. In contrast, negative training data are selected in a similar manner from another dataset, ensuring that the two datasets do not include face images from the same identity.

In our study, we introduce a two-branch Siamese network. The network processes positive and negative samples, individually paired with anchor samples, and computes the binary cross-entropy loss based on their binary labels. Following the training process, decision scores for face verification trials are derived from the last layer of the network, resulting in an end-to-end face verification system. Our evaluation was performed on the Labeled Faces in the Wild (LFW) dataset [28]. Despite being unsupervised, our results show significant promise, approaching the performance of a fully supervised baseline.

The remainder of this paper is organized as follows: Section 2 explains the proposed method, Section 3 provides details on the experimental setup and data, Section 4 discusses the results, and we conclude in Section 5.
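The cosine score that drives this pair selection can be sketched in a few lines (a minimal NumPy illustration; the random 300-dimensional vectors here are stand-ins for real encoder embeddings, not outputs of the actual network):

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 300-dimensional embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
anchor = rng.standard_normal(300)
candidate = rng.standard_normal(300)
score = cosine_score(anchor, candidate)
```

Scores close to 1 indicate candidates pointing in nearly the same direction as the anchor embedding, which is why the highest-scoring neighbors are treated as likely positives.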

II. PROPOSED METHOD

The training of a DNN in an end-to-end fashion typically requires access to labeled training data, which can be challenging to obtain in real-world scenarios. To address this, especially when labeled data is unavailable, we propose the use of siamese networks [26], characterized by a double-branch architecture. Each branch features a CNN encoder motivated by the VGG network [29], which has recently shown promise in face verification [30]. Given our objective of avoiding reliance on face image labels, we introduce a novel approach to generate pairs of training data in an unsupervised manner [31]–[36]. Specifically, positive training data are chosen within one dataset based on their similarity score with the anchor. Similarly, negative training data are selected using the same criterion, but from a separate dataset, ensuring that the two datasets do not contain face images from the same identity. This training methodology circumvents the need for labeled data, making the process more accessible and adaptable to real-world scenarios.

The proposed siamese network is trained by minimizing binary cross-entropy loss. The network takes negative-anchor and positive-anchor pairs of face images and processes them to produce binary labels of 1 (positive match) or 0 (negative match) at the output. These labels are used to compute the loss during training. Once the network is trained, we can obtain decision scores for face recognition evaluations directly from the network's output. In summary, our proposed siamese network constitutes an end-to-end face recognition system, capable of making binary classification decisions with respect to facial similarity.

Fig. 1: The architecture of the proposed siamese network. FC denotes a fully connected layer. In the evaluation phase, decision scores for face recognition are obtained directly from the last layer of the network. This layer computes the final representation of the features extracted from the input face images, which is then used to make binary classification decisions regarding facial similarity.

A. Training Pairs Selection

The procedure for selecting negative and positive images involves two datasets, denoted as X and Y. Let C_X and C_Y represent the sets of individuals in datasets X and Y, respectively. It is assumed that individuals in dataset X are distinct from those in dataset Y, i.e., C_X ∩ C_Y = ∅. Algorithm 1 provides an overview of how the selection of negative and positive face image vectors is conducted in an unsupervised manner.

To begin, we extract the face embedding vectors for all face images in datasets X and Y. Next, we compute similarity scores among all face embedding vectors within dataset X using the cosine metric. For each face embedding vector in X, we select a fixed number k of nearest-neighbor face embeddings as potential positive face embeddings. Subsequently, we apply a threshold to select the k potential positive face images.

Algorithm 1: Selection of positive and negative face images for each face image in the training dataset.
  Require: training face images x_i ∈ X and y_i ∈ Y, 1 ≤ i ≤ n, and a threshold
  Ensure: positive and negative face images P_{i,j} and N_{i,j}, 1 ≤ i ≤ n and 1 ≤ j ≤ k
  for each training face image x_i do
      for each training face image x_p, 1 ≤ p ≤ n, p ≠ i do
          compute PositiveScore(i, p) = cosine(x_i, x_p)
      end for
      from PositiveScore(i, ·), select the k highest-scoring face images
      if PositiveScore(i, p) ≥ threshold then P_{i,j} = x_p
      for each training face image y_q, 1 ≤ q ≤ n do
          compute NegativeScore(i, q) = cosine(x_i, y_q)
      end for
      from NegativeScore(i, ·), select the k highest-scoring face images
      if NegativeScore(i, q) ≥ threshold then N_{i,j} = y_q
  end for
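Algorithm 1 can be sketched as follows (a hypothetical NumPy implementation of our reading of the selection step; the function and variable names are ours, and we interpret the thresholds as lower bounds, keeping confident positives and the hardest, i.e. highest-scoring, negatives):

```python
import numpy as np

def select_pairs(emb_x, emb_y, k, pos_thresh, neg_thresh):
    """For each anchor in X: pick the k nearest positives from X and the k
    hardest negatives from Y, both filtered by a cosine-score threshold."""
    def normalize(e):
        return e / np.linalg.norm(e, axis=1, keepdims=True)
    x, y = normalize(np.asarray(emb_x, float)), normalize(np.asarray(emb_y, float))
    positives, negatives = [], []
    for i in range(len(x)):
        pos_scores = x @ x[i]
        pos_scores[i] = -np.inf                      # never pair an anchor with itself
        top_pos = np.argsort(pos_scores)[::-1][:k]   # k highest cosine scores within X
        positives.append([int(j) for j in top_pos if pos_scores[j] >= pos_thresh])
        neg_scores = y @ x[i]
        top_neg = np.argsort(neg_scores)[::-1][:k]   # k highest (hardest) scores from Y
        negatives.append([int(j) for j in top_neg if neg_scores[j] >= neg_thresh])
    return positives, negatives

# Toy example: 20 and 15 face embeddings of dimension 8 (stand-ins for encoder outputs).
rng = np.random.default_rng(0)
pos, neg = select_pairs(rng.standard_normal((20, 8)),
                        rng.standard_normal((15, 8)),
                        k=3, pos_thresh=-1.0, neg_thresh=-1.0)
```

With real data, the paper's thresholds (0.3 for positives, 0.1 for negatives) would replace the permissive -1.0 used in this toy call.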

To select the negative face image embeddings, we score all the face embeddings in X against those in Y using cosine similarity. For each face embedding in X, we select k face embeddings from Y according to their cosine scores. Given that the persons in dataset X do not appear in dataset Y, the k selected face embeddings are considered potential negative face images. Subsequently, we apply a threshold to their corresponding cosine scores to identify the most challenging negative face images. This process ensures that each face embedding in X is paired with k positive and k negative face image embeddings.

Assuming there are n face images in each of the datasets X and Y, we obtain a total of (n + n) × k training pairs. For example, with 100 training images in each dataset and k = 10, we obtain (100 + 100) × 10 = 2000 training pairs. In this way, we intrinsically increase the size of the training dataset. In our experiments, training pairs consist of two samples, namely anchor-positive and anchor-negative.

B. Proposed Siamese Network

The training of a DNN in an end-to-end manner usually requires labeled face images as training data, which can be challenging to obtain in practical scenarios. When such labels are unavailable, we propose the use of siamese networks. These networks operate in a fully unsupervised manner, distinguishing them from typical DNN classifiers. Traditionally, a siamese network is trained using pairs of samples consisting of an anchor, a positive, and a negative. To bypass the need for face image labels, we propose generating these training pairs in an unsupervised manner; they are then fed into the network. Specifically, we advocate a siamese network utilizing binary cross-entropy loss. Following the training process, we extract decision scores from the output of the proposed network for evaluation. This approach allows us to perform face verification in an unsupervised manner.

C. End-to-End Face Verification System

Figure 1 depicts the architecture of the proposed siamese network. This network comprises two identical branches, both acting as CNN encoders. During training, a pair consisting of an anchor face image along with either a positive or a negative face image is fed into the two branches of the network, respectively. These two branches share the same parameters. Following the CNN encoders, the encoded vectors output by the two branches are concatenated. The concatenated vector is then fed into the FC layers. The last layer is a binary classification layer: its target is one for an anchor-positive pair and zero for an anchor-negative pair. Throughout the training process, the network minimizes binary cross-entropy loss.

After training with positive and negative samples selected in an unsupervised manner, evaluation is conducted in an end-to-end way. During evaluation, pairs of test face images are fed into the network, and decision scores are directly obtained from the last layer. This establishes an unsupervised end-to-end face verification system using the proposed Siamese network.

D. DNN Encoder

The DNN encoder block draws inspiration from the VGG architecture [37], which has been recently tailored for face verification in [27]. This encoder is structured with three primary blocks, each comprising two convolutional layers and one max-pooling layer. Following these blocks, there are two fully connected (FC) layers, having 1024 and 300 neurons, respectively. The function of the DNN encoder is to encode the input face image, reducing it to a 300-dimensional vector.

III. EXPERIMENTAL SETUP AND DATASET

A. Experimental Setup

The training phase utilized the CelebA dataset [38], while the evaluation was conducted on the Labeled Faces in the Wild (LFW) dataset [28]. The performance metrics assessed were EER and accuracy.

During the selection of positives and negatives, we divided the validation partition of the CelebA dataset into two equal parts, creating two datasets, denoted A and B, which play the roles of the two disjoint datasets described previously. The CNN encoder was identical for both branches. The threshold values for the selection of positive and negative samples were 0.3 and 0.1, respectively. The activation function used for both the DNN and fully connected layers was ReLU, while the final layer of the network employed a sigmoid activation function. The network is trained for 300 epochs, or until the error no longer decreases, with a batch size of 64 images. Training uses stochastic gradient descent with a momentum of 0.91, a weight decay of 0.00001, and a logarithmically decaying learning rate from 10^-2 to 10^-8.

The baseline system underwent training utilizing the entire validation partition of the CelebA dataset. Its architecture entails a DNN encoder followed by a classification layer. To ensure an equitable comparison, the architecture of the CNN encoder was precisely replicated from that of the siamese networks. The training process involved minimizing cross-entropy loss, employing the Adam optimizer with identical parameters as those in the siamese networks.
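The logarithmically decaying learning rate mentioned above can be realized as follows (a sketch under the assumption that the exponent is interpolated linearly from -2 to -8 over the 300 epochs; the paper does not spell out the exact schedule):

```python
import numpy as np

def log_decay_lr(epoch, n_epochs=300, lr_start=1e-2, lr_end=1e-8):
    """Learning rate whose base-10 logarithm decays linearly from
    log10(lr_start) to log10(lr_end) across the training run."""
    t = epoch / (n_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    exponent = np.log10(lr_start) + t * (np.log10(lr_end) - np.log10(lr_start))
    return float(10.0 ** exponent)

schedule = [log_decay_lr(e) for e in range(300)]
```

Each epoch multiplies the rate by a constant factor (here 10^(-6/299)), which is what distinguishes a logarithmic decay from a linear one.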

During evaluation, face image embeddings were extracted from the CNN encoder of the baseline network.

TABLE I: Comparison of the proposed method, for different values of k, against the baseline. The evaluation metric is the Equal Error Rate (EER, %). The results demonstrate the performance of the proposed method across various k values.

    Model      k    EER
    Baseline   -    7.93
    Proposed   2    7.91
    Proposed   5    7.83
    Proposed   10   6.90
    Proposed   12   7.01
    Proposed   15   8.02

TABLE II: Comparison of the proposed method with state-of-the-art methods on the LFW testing dataset.

    Model                 Training size   Labeled/Unlabeled   Testing size   Accuracy (%)
    UniformFace [39]      6.1M            Labeled             6K             99.80
    ArcFace [40]          5.8M            Labeled             6K             99.82
    GroupFace [41]        5.8M            Labeled             6K             99.85
    CosFace [22]          5M              Labeled             6K             99.73
    Marginal Loss [42]    4M              Labeled             6K             99.48
    CurricularFace [43]   3.8M            Labeled             6K             99.80
    RegularFace [44]      3.1M            Labeled             6K             99.61
    MDCNN [45]            1M              Labeled             6K             99.38
    PSO AlexNet TL [46]   14M             Labeled             6K             99.57
    Ben Face [47]         0.5M            Labeled             6K             99.20
    F2C [48]              5.8M            Labeled             6K             99.83
    PCCycleGAN [49]       0.5M            Unlabeled           6K             99.52
    CAPG GAN [50]         1M              Unlabeled           6K             99.37
    UFace [31]            200K            Unlabeled           6K             99.40
    Proposed              200K            Unlabeled           6K             99.67
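Table I reports the Equal Error Rate, the operating point where the false-acceptance and false-rejection rates coincide. As a reference, it can be approximated from lists of genuine and impostor scores (a hypothetical helper, not the authors' evaluation code):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep a decision threshold over all observed scores and return the
    error rate at the threshold where FAR and FRR are closest."""
    genuine, impostor = np.asarray(genuine, float), np.asarray(impostor, float)
    eer, best_gap = 1.0, np.inf
    for t in np.sort(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine < t)    # genuine pairs wrongly rejected
        far = np.mean(impostor >= t)  # impostor pairs wrongly accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return float(eer)

# Toy, well-separated score distributions standing in for verification outputs.
rng = np.random.default_rng(1)
eer = equal_error_rate(rng.normal(0.8, 0.1, 1000), rng.normal(0.2, 0.1, 1000))
```

On the LFW protocol described below, such error rates are averaged over the ten predefined test subsets.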

B. Datasets
The CelebA dataset [38], initially introduced by Liu et al., comprises a vast collection of over 200,000 face images showcasing 10,177 celebrities. This dataset incorporates various elements like pose variations and background clutter, offering a comprehensive representation of real-world scenarios. To facilitate model training and evaluation, the dataset is divided into distinct sets for training, validation, and testing purposes. In this work, we utilized the training, validation, and test segments of the CelebA dataset as our unlabeled dataset for training the proposed network.

The Labeled Faces in the Wild (LFW) dataset [28] is a compilation of 13,233 images featuring 5,749 distinct individuals. For testing, the dataset is randomly and uniformly divided into ten subsets. Each subset consists of 300 matched pairs (depicting the same person) and 300 mismatched pairs (depicting different individuals), resulting in a total of 3,000 matched pairs and 3,000 mismatched pairs used for testing [28].

IV. RESULTS

A. Results on LFW

In Table I, we present a comprehensive comparison between our proposed systems and the baseline, evaluating them based on the Equal Error Rate (EER) expressed as a percentage. The table encompasses results for different values of k. Throughout the experimental trials, face recognition scores are extracted from the output of the proposed network. The findings reveal that increasing the value of k leads to enhanced system performance. The optimal EER (6.90%) was achieved when k was set to 10. While higher values of k were explored, no significant improvement was observed, suggesting a balance between performance and computational cost. This underscores the promising potential of the proposed approach, particularly in unsupervised scenarios, demonstrating its effectiveness in face verification tasks without reliance on labeled training data.

Table II provides a comprehensive comparison between our proposed method and state-of-the-art approaches. It is important to note that our method is compared with both supervised and unsupervised training techniques. Notably, our proposed training method demonstrates competitive performance without the need for an extensive labeled dataset. While some methods, such as ArcFace, GroupFace, CurricularFace, and CosFace, exhibit slightly higher accuracy, it is worth highlighting that our approach is trained on a significantly smaller dataset (approximately 200K images), whereas most state-of-the-art methods rely on millions of training images. This showcases the efficiency and effectiveness of our proposed method, especially in scenarios where labeled data is limited.

V. CONCLUSION

In this work, we introduced a siamese network for face verification that operates without reliance on labeled data. This network features two branches, each functioning as a CNN encoder, and was trained as a binary classifier. Given our objective of avoiding the need for face image labels, the training sample pairs were generated using an unsupervised approach. Positive training images were selected within a single dataset based on the highest cosine scores with a designated anchor, while negative training images followed the same selection criteria but were drawn from a different dataset.

Following the training process, decision scores were obtained using the proposed network, and the evaluation was conducted on the LFW dataset. Notably, our experimental results demonstrated that our proposed system, despite being unsupervised, achieved results that closely paralleled those of fully supervised baselines. This highlights the effectiveness of our approach in face verification, even in the absence of labeled data.

REFERENCES

[1] Kumar, J. & Patel, D. A survey on internet of things: Security and privacy issues. International Journal of Computer Applications. 90 (2014)
[2] Solomon, E. & Cios, K. FASS: Face Anti-Spoofing System Using Image Quality Features and Deep Learning. Electronics. 12, 2199 (2023)
[3] Solomon, E. & Cios, K. HDLHC: Hybrid Face Anti-Spoofing Method Concatenating Deep Learning and Hand-Crafted Features. 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT). pp. 470-474 (2023)
[4] Chekole, E., Chattopadhyay, S., Ochoa, M., Guo, H. & Cheramangalath, U. CIMA: Compiler-Enforced Resilience Against Memory Safety Attacks in Cyber-Physical Systems. Computers & Security. 94 pp. 101832 (2020)
[5] Chekole, E., Ochoa, M. & Chattopadhyay, S. SCOPE: Secure Compiling of PLCs in Cyber-Physical Systems. International Journal of Critical Infrastructure Protection. 33 pp. 100431 (2021)
[6] Chekole, E. & Huaqun, G. ICS-SEA: Formally Modeling the Conflicting Design Constraints in ICS. Proceedings of the Fifth Annual Industrial Control System Security (ICSS) Workshop. pp. 60-69 (2019)
[7] Chekole, E., Castellanos, J., Ochoa, M. & Yau, D. Enforcing Memory Safety in Cyber-Physical Systems. Computer Security. pp. 127-144 (2018)
[8] Chekole, E., Chattopadhyay, S., Ochoa, M. & Huaqun, G. Enforcing Full-Stack Memory-Safety in Cyber-Physical Systems. Engineering Secure Software and Systems. pp. 9-26 (2018)
[9] Chekole, E., Thulasiraman, R. & Zhou, J. EARIC: Exploiting ADC Registers in IoT and Control Systems. Applied Cryptography and Network Security Workshops. pp. 245-265 (2023)
[10] Wong, A., Chekole, E., Ochoa, M. & Zhou, J. On the Security of Containers: Threat Modeling, Attack Analysis, and Mitigation Strategies. Computers & Security. 128 pp. 103140 (2023)
[11] Kumar, N., Berg, A., Belhumeur, P. & Nayar, S. Attribute and simile classifiers for face verification. 2009 IEEE 12th International Conference on Computer Vision. pp. 365-372 (2009)
[12] Serengil, S. & Ozpinar, A. Lightface: A hybrid deep face recognition framework. 2020 Innovations in Intelligent Systems and Applications Conference (ASYU). pp. 1-5 (2020)
[13] Woubie, A. & Bäckström, T. Federated learning for privacy-preserving speaker recognition. IEEE Access. pp. 149477-149485 (2021)
[14] Woubie, A. & Bäckström, T. Voice-quality Features for Deep Neural Network Based Speaker Verification Systems. 29th European Signal Processing Conference (EUSIPCO). pp. 176-180 (2021)
[15] Woubie, A. & Bäckström, T. Federated Learning for Privacy Preserving On-Device Speaker Recognition. ISCA Symposium on Security and Privacy in Speech Communication. pp. 1-5 (2021)
[16] Woubie, A., Zarazaga, P. & Bäckström, T. The Use of Audio Fingerprints for Authentication of Speakers on Speech Operated Interfaces. ISCA Symposium on Security and Privacy in Speech Communication. pp. 6-9 (2021)
[17] Prasad, P., Pathak, R., Gunjan, V. & Ramana Rao, H. Deep learning based representation for face recognition. ICCCE 2019: Proceedings of the 2nd International Conference on Communications and Cyber Physical Engineering. pp. 419-424 (2020)
[18] Mallouh, A., Qawaqneh, Z. & Barkana, B. New transformed features generated by deep bottleneck extractor and a GMM-UBM classifier for speaker age and gender classification. Neural Computing and Applications. 30 pp. 2581-2593 (2018)
[19] Ioffe, S. Probabilistic linear discriminant analysis. Computer Vision - ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, Proceedings, Part IV. pp. 531-542 (2006)
[20] El Shafey, L., McCool, C., Wallace, R. & Marcel, S. A scalable formulation of probabilistic linear discriminant analysis: Applied to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 35, 1788-1794 (2013)
[21] Hafed, Z. & Levine, M. Face recognition using the discrete cosine transform. International Journal of Computer Vision. 43 pp. 167-188 (2001)
[22] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z. & Liu, W. Cosface: Large margin cosine loss for deep face recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5265-5274 (2018)
[23] Gao, Y., Ma, J. & Yuille, A. Semi-supervised sparse representation based classification for face recognition with insufficient labeled samples. IEEE Transactions on Image Processing. 26, 2545-2560 (2017)
[24] Muslihah, I. & Muqorobin, M. Texture characteristic of local binary pattern on face recognition with probabilistic linear discriminant analysis. International Journal of Computer and Information System (IJCIS). 1, 22-26 (2020)
[25] Hammouche, R., Attia, A., Akhrouf, S. & Akhtar, Z. Gabor filter bank with deep autoencoder based face recognition system. Expert Systems with Applications. 197 pp. 116743 (2022)
[26] Chen, X. & He, K. Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15750-15758 (2021)
[27] Cheng, S. & Zhou, G. Facial expression recognition method based on improved VGG convolutional neural network. International Journal of Pattern Recognition and Artificial Intelligence. 34, 2056003 (2020)
[28] Huang, G., Mattar, M., Berg, T. & Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition. (2008)
[29] Sengupta, A., Ye, Y., Wang, R., Liu, C. & Roy, K. Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in Neuroscience. 13 pp. 95 (2019)
[30] Gwyn, T., Roy, K. & Atay, M. Face recognition using popular deep net architectures: A brief comparative study. Future Internet. 13, 164 (2021)
[31] Solomon, E., Woubie, A. & Cios, K. UFace: An Unsupervised Deep Learning Face Verification System. Electronics. 11, 3909 (2022)
[32] Solomon, E. Face Anti-Spoofing and Deep Learning Based Unsupervised Image Recognition Systems. (2023)
[33] Solomon, E., Woubie, A. & Emiru, E. Unsupervised Deep Learning Image Verification Method. arXiv preprint arXiv:2312.14395. (2023)
[34] Solomon, E., Woubie, A. & Emiru, E. Autoencoder Based Face Verification System. arXiv preprint arXiv:2312.14301. (2023)
[35] Woubie, A., Solomon, E. & Emiru, E. Image Clustering using Restricted Boltzmann Machine. arXiv preprint arXiv:2312.13845. (2023)
[36] Solomon, E. Face Anti-Spoofing and Deep Learning Based Unsupervised Image Recognition Systems. VCU Theses and Dissertations. (2023)
[37] Chang, X., Wu, J., Yang, T. & Feng, G. Deepfake face image detection based on improved VGG convolutional neural network. 2020 39th Chinese Control Conference (CCC). pp. 7252-7256 (2020)
[38] Liu, Z., Luo, P., Wang, X. & Tang, X. Large-scale CelebFaces Attributes (CelebA) dataset. (2018)
[39] Duan, Y., Lu, J. & Zhou, J. UniformFace: Learning deep equidistributed representation for face recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3415-3424 (2019)
[40] Deng, J., Guo, J., Xue, N. & Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4690-4699 (2019)
[41] Kim, Y., Park, W., Roh, M. & Shin, J. GroupFace: Learning latent groups and constructing group-based representations for face recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5621-5630 (2020)
[42] Deng, J., Zhou, Y. & Zafeiriou, S. Marginal loss for deep face recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 60-68 (2017)
[43] Huang, Y., Wang, Y., Tai, Y., Liu, X., Shen, P., Li, S., Li, J. & Huang, F. CurricularFace: Adaptive curriculum learning loss for deep face recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5901-5910 (2020)
[44] Zhao, K., Xu, J. & Cheng, M. RegularFace: Deep face recognition via exclusive regularization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1136-1144 (2019)
[45] Huang, X., Zeng, X., Wu, Q., Lu, Y., Huang, X. & Zheng, H. Face Verification Based on Deep Learning for Person Tracking in Hazardous Goods Factories. Processes. 10, 380 (2022)
[46] Elaggoune, H., Belahcene, M. & Bourennane, S. Hybrid descriptor and optimized CNN with transfer learning for face recognition. Multimedia Tools and Applications. 81, 9403-9427 (2022)
[47] Ben Fredj, H., Bouguezzi, S. & Souani, C. Face recognition in unconstrained environment with CNN. The Visual Computer. 37, 217-226 (2021)
[48] Wang, K., Wang, S., Zhang, P., Zhou, Z., Zhu, Z., Wang, X., Peng, X., Sun, B., Li, H. & You, Y. An efficient training approach for very large scale face recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4083-4092 (2022)
[49] Liu, Y. & Chen, J. Unsupervised face frontalization for pose-invariant face recognition. Image and Vision Computing. 106, 104093 (2021)
[50] Hu, Y., Wu, X., Yu, B., He, R. & Sun, Z. Pose-guided photorealistic face rotation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8398-8406 (2018)

Common questions

Powered by AI

The proposed method differs from traditional face verification approaches by avoiding the reliance on labeled data entirely. Traditional methods typically require labeled datasets for training, involving anchor-positive and anchor-negative pairs derived directly from annotated data. This method, however, generates these pairs in an unsupervised manner by utilizing cosine similarity to form pairs and a threshold for selecting challenging negatives, thus circumventing the label dependency and making it suitable for instances where labeled data is scarce .

The use of VGG-inspired CNN encoders enhances the effectiveness of the siamese network by providing robust feature extraction capabilities crucial for forming discriminative face embeddings. The VGG architecture is known for its deep, yet relatively simple, convolutional design which allows for learning intricate features through layered feature hierarchies, making it apt for capturing fine details in face images. This detail orientation aids in the precise calculation of cosine similarity scores necessary for forming reliable training pairs, ultimately improving the face verification accuracy of the siamese network .

In the siamese network architecture, positive and negative training pairs are crucial for training the network to distinguish between similar and dissimilar face images. Positive training pairs consist of face images from the same dataset with high cosine similarity to an anchor sample. Negative training pairs consist of images from a different dataset to the anchor, ensuring no shared identities. The network learns to optimize the binary cross-entropy loss based on these pairs, ultimately allowing it to make accurate binary classification decisions about facial similarity .

Cosine similarity is used because it measures the cosine of the angle between two non-zero vectors, providing a measure of similarity that is insensitive to vector magnitude, an important property when dealing with face embeddings. This allows the method to identify the most similar or dissimilar pairs based on orientation rather than absolute position in the embedding space, offering robustness to variations in image conditions while remaining computationally simple and fast.
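
The magnitude invariance can be checked directly: rescaling an embedding changes its Euclidean length but leaves its cosine similarity to any other vector unchanged. The vectors below are arbitrary illustrative values.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

emb = [0.6, 0.8]
other = [1.0, 0.0]
scaled = [10 * x for x in emb]  # same direction, 10x the magnitude

# Scaling changes the vector's length but not the cosine similarity.
assert abs(cosine_similarity(emb, other) -
           cosine_similarity(scaled, other)) < 1e-9
```

This is why embeddings whose norms drift (e.g. under different lighting or image quality) can still be compared reliably by angle alone.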

Generating training pairs without supervision enhances the adaptability of the face verification system by allowing it to function effectively without expensive, time-consuming labeled datasets. By leveraging cosine similarity scores within and across datasets to form training pairs, the system can be applied rapidly to new environments or datasets, beyond the constraints of pre-existing labeled data. This not only speeds up deployment but also integrates smoothly into dynamic operational settings where new or unpredictable data may arise.

The threshold in the proposed algorithm acts as a filter that selects the most discriminative samples. For positive candidates, only those with a cosine similarity score above the threshold are kept, ensuring that the pairs are truly similar and strengthening the network's positive-recognition capability. Conversely, for negative candidates, only pairs scoring below the threshold are selected, which helps the network better distinguish dissimilar faces. This threshold mechanism balances accuracy and generalization by focusing training on the most informative examples.
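
The two-sided filtering rule above can be written compactly. The input format and the default threshold value are assumptions made for illustration.

```python
def select_pairs(scored_pairs, threshold=0.5):
    """Filter candidate pairs by the similarity threshold.

    scored_pairs: iterable of (cosine_score, is_positive_candidate) tuples.
    Positive candidates are kept only above the threshold; negative
    candidates only below it, mirroring the selection rule described in
    the text.  The threshold value here is illustrative.
    """
    positives = [s for s, is_pos in scored_pairs if is_pos and s > threshold]
    negatives = [s for s, is_pos in scored_pairs if not is_pos and s < threshold]
    return positives, negatives
```

Candidates falling on the wrong side of the threshold (low-similarity positives, high-similarity negatives) are simply discarded rather than mislabeled.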

The proposed method addresses the scarcity of labeled data by training the Siamese network in an unsupervised manner. It introduces an approach that generates training pairs without labels by selecting positive samples from the anchor's own dataset with high cosine similarity to the anchor, and negative samples from a different dataset with no common identities. In this way, it circumvents the need for labeled data, which is often hard to obtain in practice, making the method adaptable to real-world applications.

Binary cross-entropy loss serves as the cost function guiding the training of the proposed network by quantifying the difference between predicted scores and the true binary labels (1 for positive pairs, 0 for negatives). This loss function is critical because it penalizes wrong predictions, encouraging the network to classify face pairs correctly as similar or dissimilar. By updating the network weights to minimize this loss, training enhances the model's discriminative power and, consequently, the accuracy of its face verification scores.
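
For a single pair, the loss is the standard expression −(y·log p + (1−y)·log(1−p)); a small sketch (the clamping constant is an implementation detail, not from the paper):

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """BCE for one pair: y is 1 (same) or 0 (different), p is the
    predicted probability.  Clamping p with eps avoids log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
```

A confident correct prediction (p = 0.9 with y = 1) incurs a small loss, while the same prediction against y = 0 incurs a much larger one, which is exactly the penalty structure described above.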

The Labeled Faces in the Wild (LFW) dataset is significant for evaluating the proposed method because it is a well-established benchmark in the face recognition community. It consists of facial images captured in unconstrained environments, representing a realistic validation scenario. Successful evaluation on this dataset demonstrates that the method generalizes and operates effectively without labeled data, showcasing its robustness and potential applicability in real-life situations.

The main components of the proposed Siamese network are a two-branch architecture, each branch containing a CNN encoder inspired by the VGG network. These components work together by encoding input face images into embedding vectors, which are then compared using cosine similarity to create positive and negative training pairs. The network is trained with binary cross-entropy loss to differentiate similar from dissimilar pairs based on their decision scores. Together, these components form an end-to-end face verification system that requires no face image labels.
