

Learning Visual Body-shape-Aware Embeddings for Fashion Compatibility

Kaicheng Pang, Xingxing Zou, Waikeung Wong*


{kaicpang.pang, aemika.zou}@connect.polyu.hk, [email protected]
School of Fashion and Textiles, The Hong Kong Polytechnic University
Laboratory for Artificial Intelligence in Design
Hong Kong SAR

* Corresponding author.

Abstract

Body shape is a crucial factor in outfit recommendation. Previous studies that directly used body measurement data to investigate the relationship between body shape and outfit have achieved limited performance due to oversimplified body shape representations. This paper proposes a Visual Body-shape-Aware Network (ViBA-Net) to improve the fashion compatibility model's awareness of human body shape through visual-level information. Specifically, ViBA-Net consists of three modules: a body-shape embedding module, which extracts visual and anthropometric features of body shape from a newly introduced large-scale body shape dataset; an outfit embedding module, which learns the outfit representation based on visual features extracted from a try-on image and textual features extracted from fashion attributes; and a joint embedding module, which jointly models the relationship between the representations of body shape and outfit. ViBA-Net is designed to generate attribute-level explanations for the evaluation results based on the computed attention weights. The effectiveness of ViBA-Net is evaluated on two mainstream datasets through qualitative and quantitative analysis. Data and code are released at https://github.com/BenjaminPang/ViBA-Net.

Figure 1. An example of the body-shape-aware fashion compatibility task, showing (a) an outfit composition and (b) five body shapes (bottom hourglass, inverted triangle, spoon, top hourglass, triangle). The outfit is compatible with the inverted triangle and top hourglass body shapes, but does not fit the other body shapes.

1. Introduction

Fashion Recommendation Systems (FRSs) [2, 15] are not a new topic, but they still hold great potential for economic benefits. Previous works have mainly focused on fashion compatibility learning (FCL) [6, 16, 17], which only considers the compatibility among fashion items. However, beyond the outfit itself, consumers are more concerned about how it looks when worn. Figure 1 demonstrates how fashion compatibility can vary across body shapes. For instance, individuals with an inverted triangle body shape may find the outfit in Figure 1 (a) suitable, while those with a triangle body shape may not. Previous studies [12–14, 26] represent the body shape merely by relying on body measurement data while overlooking the valuable visual features of body shape, which limits their ability to provide precise recommendations. To effectively incorporate accurate body shape information into FRSs, leveraging valuable information from body images is essential. Moreover, accurately representing outfits is also critical, as the scaling and spatial relationships between clothing items can impact how they fit and flatter different body shapes. Therefore, conventional outfit representation methods used in FCL, such as item-wise correlations [4, 31, 32] or graph neural networks [5, 28], are insufficient for modeling the relationships between body shape and an outfit. Lastly, providing a reasonable explanation for the evaluation is essential for personalized FRSs. However, previous studies [12, 21, 22] have not achieved this.

To this end, this paper proposes a Visual Body-shape-Aware Network (ViBA-Net) to model the relationships between body shape and outfit. ViBA-Net consists of three modules: a Body-shape Embedding Module (BEM), an Outfit Embedding Module (OEM), and a Joint Embedding Module (JEM). The BEM combines visual and anthropometric features to obtain a general representation of the body shape. However, obtaining accurate visual features from body images requires a diverse dataset with explicit body shape annotations, which is currently unavailable.
Thus, we create a new dataset covering seven common body shapes; each shape contains 4,000 3D body models with varying but similar shapes. Every model within the dataset is accompanied by corresponding anthropometric data and a frontal-view image, which offers the visual features of the respective body shape. The OEM learns the outfit embedding by incorporating visual and textual features of the outfit. We propose to represent an outfit by its try-on appearance instead of separate item images, because the try-on image contains the scaling and spatial relationships among individual items. For the textual aspect, we exploit fashion attribute information to enhance the outfit representation, where the attribute values are encoded into word embeddings. Finally, the JEM integrates the representations of the body shape and outfit to compute the body-shape-aware embedding, which is then transformed by a linear function to obtain the final compatibility score. The core of the OEM and JEM is a cross-modal attention layer, allowing them to merge features from different modalities. The hierarchical design of ViBA-Net facilitates the propagation of cross-modal interactions between fashion attributes and body shapes through the computed attention maps, as visualized in Figure 6. We leverage these attention maps to generate attribute-level explanations for the prediction results. All experiments are conducted on two mainstream fashion compatibility datasets, i.e., Outfit for You (O4U) [22] and the Body-Diverse (BD) dataset [14], both of which include body shape annotations. Both qualitative and quantitative results show the advancement of ViBA-Net. We summarize our main contributions as follows:

• We propose ViBA-Net to obtain better body-shape-aware embeddings for fashion compatibility. We enhance the body-shape embedding by introducing visual features extracted from body images and represent the outfit using its try-on appearance.
• We introduce a new dataset with 28,000 body samples covering seven common body shapes, each with a 3D body model, anthropometric data, and a frontal-view image. This dataset can also be useful for tasks such as virtual try-on and clothed human generation.
• We conduct experiments on the O4U and BD datasets, demonstrating the superiority of ViBA-Net over other state-of-the-art approaches.

2. Related Work

Body-shape-Aware Fashion Compatibility. With the development of FCL, researchers are increasingly aware of the importance of body shape to practical applications [12–14, 22, 26]. Hidayati et al. [13] represented the body shapes of female celebrities using their body measurements. Sun et al. [29] proposed to use 3D features to represent female upper body shapes. Hsiao et al. [14] extracted body shape features from body measurements and SMPL [18] parameters through multiple MLPs. These approaches all neglect the visual features of human bodies. In this work, we propose to encode the body into a more comprehensive embedding incorporating anthropometric and visual features, the latter extracted from body images.

Body Shape Classification. Most body-aware methods classify body shapes using clustering approaches, such as k-means in [14] and affinity propagation in [12, 13]. In [26], the authors separate body shapes into two groups according to users' sizes. However, the classification of body shapes has been extensively investigated over the past two decades. Notably, Simmons [27] developed a well-known body shape classification system, the Female Figure Identification Technique (FFIT), which uses anthropometric data from 3D body scans for body shape classification. Subsequent research [7, 23, 33] improved the FFIT, which has become a widely accepted standard for body shape classification. Therefore, in this work, we introduce a body shape dataset that classifies body shapes into seven well-known types using FFIT instead of clustering methods.

Fashion Outfit Representation. How to represent the outfit plays a crucial role in fashion recommendation. Early works addressing the fashion compatibility learning (FCL) problem [4, 31, 32] represented an outfit as pairwise relationships between fashion items and mapped fashion item embeddings into a unified space using category information. Beyond pairwise distance, some studies attempted to model high-order interactions among items [19, 22, 28]. These approaches have two limitations: 1. they omit the scaling and spatial relationships between individual clothing items when encoding the outfit; 2. using only item category information is inadequate because more specific fashion attribute information is useful. To this end, we propose to use try-on appearance images to represent outfits and to exploit fashion attributes to enhance the model performance.

3. Body Shape Dataset

Previous studies [13, 23] have introduced a few body shape datasets. However, their number of body models is insufficient to represent body shapes. For example, Parker et al. [23] analyzed 1,679 3D body scans, but only 10 and 62 human bodies are categorized as triangle and top hourglass body shapes, respectively. Although Hidayati et al. [13] introduced a dataset consisting of 3,150 individual celebrities with their body measurements, no body shape labels are annotated. In light of this, we present a new dataset for body shape representation. It features a diverse array of 28,000 individual models, spanning seven prevalent body shapes: bottom hourglass, inverted triangle, spoon, top hourglass, triangle, hourglass, and rectangle. The construction process involves five steps: 1. Randomly generating 200,000 3D body models using the SMPL method [18];
2. Measuring anthropometric data, including bust, waist, high hip, and hip circumferences, from these models; 3. Removing unrealistic models and generating 100,000 more realistic bodies based on refined shape parameters; 4. Classifying body shapes using the FFIT algorithm [33]; and 5. Capturing frontal-view images for each model using an orthographic camera. The details of constructing this dataset are presented in Section 1 of the Supplementary Material.
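As an illustration of step 4, the sketch below classifies a body from its four circumferences with simple FFIT-style rules. The reduced rule set and the thresholds are assumptions made for this sketch only; the dataset itself is labeled with the full FFIT algorithm [33].

```python
# Illustrative FFIT-style body shape classification from circumferences (in cm).
# The thresholds below are assumptions for this sketch, not the published FFIT values.
def classify_body_shape(bust: float, waist: float, high_hip: float, hip: float) -> str:
    bust_hip = bust - hip          # positive: larger upper body
    bust_waist = bust - waist      # waist definition relative to bust
    hip_waist = hip - waist        # waist definition relative to hips

    if abs(bust_hip) < 2.5 and bust_waist >= 15 and hip_waist >= 15:
        return "hourglass"
    if bust_hip <= -2.5 and hip_waist >= 15:
        # Larger hips with a defined waist; a pronounced high-hip shelf suggests "spoon".
        return "spoon" if (high_hip - waist) >= 5 else "bottom hourglass"
    if bust_hip >= 2.5 and bust_waist >= 15:
        return "top hourglass"
    if bust_hip >= 2.5:
        return "inverted triangle"
    if bust_hip <= -2.5:
        return "triangle"
    return "rectangle"

# Example: hips clearly larger than the bust, a defined waist, and a high-hip shelf.
print(classify_body_shape(bust=90.0, waist=70.0, high_hip=88.0, hip=100.0))  # prints "spoon"
```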
4. Methodology

In this section, we elaborate on the details of the proposed ViBA-Net: 1. clarify the task formulation; 2. present the representations of body shape, try-on image, and fashion attributes; 3. describe the architecture of ViBA-Net.

Figure 2. The proposed ViBA-Net consists of three modules. The Body-shape Embedding Module represents the body shape using both body image features and anthropometric features. The Outfit Embedding Module extracts outfit visual features from the try-on image synthesized by M-VTON. Finally, both body shape features and outfit features are sent to the Joint Embedding Module to learn body-shape-aware embeddings. The cross-modal attention mechanism employed in ViBA-Net computes attention weights to generate attribute-level explanations.

4.1. Task Formulation

Following [22], we formulate this task as a multi-label classification task. Given a training set T = {O^j, Y^j}_{j=1}^N containing N outfits, we denote O^j = {X^j, G^j} as the j-th outfit, containing several individual clothing images X^j and structured fashion attributes G^j. Y^j = {y_k^j | k = 1, ..., K} refers to a set of ground-truth labels for the j-th outfit conditioned on K body shapes, where y_k^j = 1 indicates that outfit O^j is incompatible with the k-th body shape. Our goal is to devise a learning function F to predict the compatibility score ŷ_k^j between a query outfit O^j and the k-th body shape:

\hat{y}^j_k = \mathcal{F}(\{\mathbf{X}^j, \mathbf{G}^j, \bm{\omega}^k, \mathbf{I}^k\} \,|\, \mathbf{\Theta})   (1)

where ω^k and I^k are the anthropometric data and frontal-view image of the k-th body shape, and Θ denotes the training parameters.

4.2. Body-shape Representation

We devise a Body-shape Embedding Module (BEM) to compute the embedding for the body shape by exploiting both visual and anthropometric features extracted from a representative body model, as illustrated in the top-left corner of Figure 2. To obtain the representative model for the k-th body shape, we first average the shape parameters of all body models belonging to the set U^k, and then use the SMPL model [18] to generate the representative model according to the averaged parameters:

\bar{\mathbf{T}}^k = \mathcal{F}_\mathrm{SMPL}(\bar{\bm{\beta}}^k) = \mathcal{F}_\mathrm{SMPL}\Big(\frac{1}{|\mathbf{U}^k|}\sum_{\mathbf{T}_i \in \mathbf{U}^k} \bm{\beta}_i\Big)   (2)

where T̄^k is the representative 3D model of the k-th body shape, and β̄^k ∈ R^{1×10} is the averaged shape parameter vector. |U^k| denotes the size of the set U^k. Then, we use an orthographic camera to capture the corresponding frontal-view image, denoted as Ī^k = F_ortho(T̄^k). We extract the visual features of the k-th body shape from Ī^k by employing a ResNet-18 [11] model, which is trained on the body images of the proposed body shape dataset with a split ratio of 80%, 10%, and 10% for training, validation, and test. v̄^k ∈ R^{1×512} is the visual feature, and F_body refers to the forward function of the ResNet with the last linear layer discarded. The visual feature extraction process can be written as v̄^k = F_body(Ī^k).

We measure the representative model to acquire the anthropometric data, denoted as ω̄^k = F_measure(T̄^k) ∈ R^{1×20}, where F_measure refers to the measuring process. Since the body shape parameters contain information for characterizing the body shape, we concatenate β̄^k and ω̄^k and send the result to a linear layer consisting of a linear transformation and a Rectified Linear Unit (ReLU) activation function. The resulting output is concatenated with v̄^k to produce the body-shape embedding, denoted as Ū^k ∈ R^{1×1024}. Formally, Ū^k is calculated as:

\bar{\mathbf{U}}^k = \mathrm{Concat}(\mathrm{ReLU}(\mathrm{Concat}(\bar{\bm{\beta}}^k, \bar{\bm{\omega}}^k)\mathbf{W}_B + \mathbf{b}_B), \bar{\mathbf{v}}^k)   (3)

where W_B ∈ R^{30×512} and b_B ∈ R^{1×512} are the fully connected layer's weight matrix and bias vector, respectively. The resulting body shape features are sent to the joint embedding module.
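The following PyTorch sketch shows how Eq. (3) can be realized: the 10-D shape parameters and 20-D measurements are concatenated, projected to 512 dimensions with a ReLU-activated linear layer, and concatenated with the 512-D ResNet-18 visual feature to form the 1024-D body-shape embedding. It is a minimal re-implementation for illustration, not the authors' released code, and the layer names are our own.

```python
import torch
import torch.nn as nn

class BodyShapeEmbedding(nn.Module):
    """Minimal sketch of the BEM in Eq. (3): Concat(ReLU(Concat(beta, omega) W_B + b_B), v)."""
    def __init__(self, beta_dim=10, measure_dim=20, hidden_dim=512):
        super().__init__()
        # W_B in R^{30x512}, b_B in R^{1x512}
        self.anthropometric_proj = nn.Linear(beta_dim + measure_dim, hidden_dim)
        self.relu = nn.ReLU()

    def forward(self, beta, omega, visual_feat):
        # beta: (B, 10) averaged SMPL shape parameters
        # omega: (B, 20) anthropometric measurements
        # visual_feat: (B, 512) ResNet-18 feature of the frontal-view image
        anthro = self.relu(self.anthropometric_proj(torch.cat([beta, omega], dim=1)))
        return torch.cat([anthro, visual_feat], dim=1)  # (B, 1024) body-shape embedding

# Usage with random tensors standing in for one body shape:
bem = BodyShapeEmbedding()
u = bem(torch.randn(1, 10), torch.randn(1, 20), torch.randn(1, 512))
print(u.shape)  # torch.Size([1, 1024])
```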
4.3. Try-on Image Representation

We leverage try-on images instead of individual clothing images to represent outfits. However, try-on images are not typically included in mainstream datasets for the FCL task, such as Polyvore [10], Style4BodyShape [13], and O4U [22], to name a few. To address this, a Multi-layer Virtual Try-On Network (M-VTON) system is utilized to synthesize try-on images from the separate item images while preserving clothing details as much as possible. Details can be found in Section 2 of the Supplementary Material. After obtaining the try-on image, we utilize a pre-trained ResNet model with its last pooling layer and linear layer discarded to extract its visual features. The motivation behind encoding it into multiple region-level features is that they can provide more accurate representations than a single global feature. Formally, the feature extraction process can be expressed as:

\mathbf{S} = \mathcal{F}_{\mathrm{outfit}}(\tilde{\mathbf{X}}) = \{\mathbf{x}_1, \cdots, \mathbf{x}_n\}   (4)

where S is the representation of the try-on image, containing 128 spatial features x_i ∈ R^{512}, and F_outfit refers to the forward function of the modified ResNet.
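A minimal sketch of this region-level feature extraction is shown below: truncating a torchvision ResNet before its average-pooling layer yields a spatial grid of 512-D local features, which is flattened into the set S of Eq. (4). The choice of ResNet-18 and the 256x512 input size are assumptions made for this sketch; the paper only specifies a pre-trained ResNet with the last pooling and linear layers discarded and 128 region features as output.

```python
import torch
import torch.nn as nn
from torchvision import models

class TryOnEncoder(nn.Module):
    """Sketch of F_outfit in Eq. (4): region-level features from a truncated ResNet."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # pre-trained weights would be loaded in practice
        # Keep everything up to the last conv stage; drop avgpool and the fc classifier.
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, try_on_image):
        # try_on_image: (B, 3, 256, 512) -> feature map (B, 512, 8, 16)
        fmap = self.features(try_on_image)
        b, c, h, w = fmap.shape
        # Flatten the spatial grid into n = h * w = 128 region features of dimension 512.
        return fmap.view(b, c, h * w).permute(0, 2, 1)  # (B, 128, 512)

encoder = TryOnEncoder()
s = encoder(torch.randn(1, 3, 256, 512))
print(s.shape)  # torch.Size([1, 128, 512])
```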
4.4. Fashion Attributes Representation

The clothing items are associated with a set of fashion attributes manually recognized from various attribute dimensions. For the sake of explanation, we show three fashion attributes in the bottom-left part of Figure 2.
We utilize the union of all attributes associated with each item in an outfit to represent the fashion attributes of the entire outfit. For a fashion attribute value, we use a pre-trained GloVe [24] model to encode its text into a word embedding, denoted as e ∈ R^{d_text}, where d_text = 300 is the dimensionality of the word embedding. For a fashion attribute dimension, we encode it into a one-hot vector, denoted as c ∈ R^{N_A}, where N_A = 15 is the number of fashion attribute dimensions used in this work. We then concatenate c and e to represent one fashion attribute and apply a linear transformation to the concatenated vector. Suppose the j-th outfit possesses L^j fashion attributes; this outfit's attribute representation A^j ∈ R^{L^j×512} is computed by:

\mathbf{A}^j = \{\mathrm{ReLU}(\mathrm{Concat}(\mathbf{c}_l, \mathbf{e}_l)\mathbf{W}_A + \mathbf{b}_A)\}_{l=1}^{L^j}   (5)

where W_A ∈ R^{315×512} and b_A ∈ R^{512} are the weight matrix and bias vector of the linear transformation.
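The sketch below mirrors Eq. (5): each attribute is represented by its 15-D one-hot dimension vector concatenated with a 300-D word embedding of its value, then projected to 512 dimensions. The GloVe lookup is replaced here by a randomly initialized embedding table, and the index values in the usage example are hypothetical, purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeEncoder(nn.Module):
    """Sketch of Eq. (5): one fashion attribute -> Concat(one-hot dimension, GloVe value) -> 512-D."""
    def __init__(self, num_dims=15, word_dim=300, out_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for the pre-trained GloVe table; in practice the 300-D GloVe vectors are loaded.
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.proj = nn.Linear(num_dims + word_dim, out_dim)  # W_A in R^{315x512}, b_A in R^{512}
        self.relu = nn.ReLU()
        self.num_dims = num_dims

    def forward(self, dim_ids, value_ids):
        # dim_ids, value_ids: (L,) indices of the attribute dimension and attribute value word
        c = F.one_hot(dim_ids, self.num_dims).float()   # (L, 15)
        e = self.word_emb(value_ids)                    # (L, 300)
        return self.relu(self.proj(torch.cat([c, e], dim=1)))  # (L, 512) = A^j

enc = AttributeEncoder()
# e.g. three attributes such as top length = Long, top silhouette = V, bottom type = Skirt
a = enc(torch.tensor([0, 1, 7]), torch.tensor([12, 53, 301]))
print(a.shape)  # torch.Size([3, 512])
```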

4.5. Body-type-Aware Network Architecture

We employ the cross-modal attention block [20] in both the Outfit Embedding Module (OEM) and the Joint Embedding Module (JEM) of ViBA-Net to merge data representations from different modalities. This mechanism improves conventional attention mechanisms by introducing a learnable weight matrix into the score function, through which two modalities are connected by computing their compatibility scores. Specifically, it takes two inputs, a query Q ∈ R^{N_q×d_q} and a value V ∈ R^{N_v×d_v}, and the attention weights α ∈ R^{N_q×N_v} are calculated as:

\bm{\alpha} = \mathrm{softmax}(\mathbf{Q}\mathbf{W}\mathbf{V}^T)   (6)

where W ∈ R^{d_q×d_v} is the learnable weight matrix, and the softmax operation is applied along the second dimension. According to the obtained attention distribution and the value V, the output of this block is computed by V̂ = αV, where V̂ ∈ R^{N_q×d_v} is the fused feature matrix. The OEM acquires the outfit representation, denoted as H^j ∈ R^{L^j×512}, by integrating the features of the try-on image and the fashion attributes using the cross-modal attention block:

\mathbf{H}^j = \bm{\alpha}_o \cdot \mathbf{S}^j = \mathrm{softmax}(\mathbf{A}^j \mathbf{W}_o {\mathbf{S}^j}^T) \cdot \mathbf{S}^j   (7)

where W_o ∈ R^{512×512} is the learnable weight matrix and α_o ∈ R^{L^j×128} is the attention map computed in the OEM. The JEM then learns the relationship between the k-th body-shape features Ū^k and the j-th outfit representation and outputs the compatibility vector between the two:

\hat{\mathbf{H}}^j_k = \bm{\alpha}_b \cdot \mathbf{H}^j = \mathrm{softmax}(\bar{\mathbf{U}}^k \mathbf{W}_b {\mathbf{H}^j}^T) \cdot \mathbf{H}^j   (8)

where Ĥ^j_k ∈ R^{1×512} is the body-shape-aware embedding, and W_b ∈ R^{1024×512} is the learnable weight matrix in the JEM. α_b ∈ R^{1×L^j} is the attention map computed in the JEM. Note that the second dimension of α_b equals the number of fashion attributes associated with the j-th outfit. Based on this characteristic of ViBA-Net, we can obtain corresponding explanations from the influence distribution of fashion attributes reflected in the attention maps computed in the JEM. We visualize α_b in Figure 6 to demonstrate the explainability of ViBA-Net. Lastly, we compute the compatibility score by applying a linear transformation to Ĥ^j_k:

\hat{y}^j_k = \hat{\mathbf{H}}^j_k \cdot \mathbf{W}_s + b_s   (9)

where W_s ∈ R^{512×1} and b_s ∈ R are the linear transformation's weights and bias, respectively. Since the task is formulated as a multi-label classification task, we use the binary cross-entropy loss to measure the difference between the predicted scores ŷ^j_k and the target scores y^j_k.
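The sketch below puts Eqs. (6)-(9) together: a shared cross-modal attention block fuses the attribute matrix A^j with the try-on features S^j in the OEM, fuses the body-shape embedding Ū^k with H^j in the JEM, and a final linear layer produces the score trained with binary cross-entropy. The module and variable names are ours; this is an illustrative re-implementation, not the released code.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Eq. (6): alpha = softmax(Q W V^T); output = alpha V."""
    def __init__(self, q_dim, v_dim):
        super().__init__()
        self.W = nn.Parameter(torch.empty(q_dim, v_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, query, value):
        # query: (Nq, q_dim), value: (Nv, v_dim)
        scores = query @ self.W @ value.t()   # (Nq, Nv)
        alpha = torch.softmax(scores, dim=1)  # softmax over the value axis
        return alpha @ value, alpha           # fused features: (Nq, v_dim)

class ViBAHead(nn.Module):
    """Sketch of OEM + JEM + scoring (Eqs. 7-9)."""
    def __init__(self):
        super().__init__()
        self.oem_attn = CrossModalAttention(512, 512)   # W_o
        self.jem_attn = CrossModalAttention(1024, 512)  # W_b
        self.score = nn.Linear(512, 1)                  # W_s, b_s

    def forward(self, attr_feats, tryon_feats, body_emb):
        # attr_feats A^j: (L, 512); tryon_feats S^j: (128, 512); body_emb U^k: (1, 1024)
        h, _ = self.oem_attn(attr_feats, tryon_feats)   # H^j: (L, 512)
        h_hat, alpha_b = self.jem_attn(body_emb, h)     # (1, 512), explanation weights (1, L)
        return self.score(h_hat), alpha_b               # compatibility logit, attention over attributes

head = ViBAHead()
logit, alpha_b = head(torch.randn(6, 512), torch.randn(128, 512), torch.randn(1, 1024))
loss = nn.BCEWithLogitsLoss()(logit.view(1), torch.tensor([1.0]))  # binary cross-entropy per body shape
print(logit.shape, alpha_b.shape, float(loss))
```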
5. Experiments

We conduct experiments on two fashion compatibility datasets to showcase the benefits of the proposed ViBA-Net model by addressing the following research questions:

• RQ1: Is ViBA-Net superior to the current state-of-the-art methods?
• RQ2: To what extent do the individual components of ViBA-Net influence the model's performance?
• RQ3: What can ViBA-Net generate as explanations?
• RQ4: How does ViBA-Net perform in the perceptual study?

5.1. Experimental Settings

Datasets. We evaluate the proposed network on two public datasets: Outfit for You (O4U) [22] and Body-Diverse (BD) [14]. O4U contains 15,748 compatible outfits and 82,017 clothing items. Each item is associated with a product image and several fashion attributes. On average, a top item contains 6.64 fashion attributes, while a bottom item contains 3.77 attributes. We use the public training, validation, and testing data split provided by O4U to ensure a fair comparison. The BD dataset comprises 889 dresses and 971 tops, spanning 57 individual fashion models. We classify these body models into three types (Bottom hourglass, Hourglass, and Rectangle) by aligning their body measurements with the models in our body shape dataset. We consider two scenarios for the dataset division: 1) the "easier" case, in which models from the test split are seen during the training process, denoted as the Joint version; 2) the more "difficult" case, in which models from the test split are not included in the training process, termed the Disjoint version. Please refer to Section 3 of the supplementary material for statistical details of the BD dataset.

Evaluation Metrics. For experiments on the O4U dataset, we employ a set of seven evaluation metrics to compare the performance of different models. This practice aligns with prior works such as [9], [22], and [21], which tackle multi-label classification problems. The metrics encompass mean Average Precision (mAP), average per-class precision (CP), recall (CR), and F1 score (CF1), as well as average overall precision (OP), recall (OR), and F1 score (OF1). Notably, mAP, CF1, and OF1 hold greater significance due to their ability to provide a holistic evaluation of model performance. For experiments on the BD dataset, we evaluate performance using the Area Under the Curve (AUC) metric.
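For reference, the snippet below computes the per-class (CP, CR, CF1) and overall (OP, OR, OF1) metrics from binary predictions, and mAP with scikit-learn, following the standard multi-label recognition definitions assumed here; it is a generic sketch rather than the exact evaluation script used in the paper.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def multilabel_metrics(y_true, y_score, threshold=0.5):
    """y_true, y_score: (num_outfits, num_body_shapes) arrays of labels and predicted scores."""
    y_pred = (y_score >= threshold).astype(int)
    eps = 1e-12

    # Per-class precision/recall averaged over body shapes (CP, CR), combined into CF1.
    tp = (y_pred * y_true).sum(axis=0)
    cp = (tp / (y_pred.sum(axis=0) + eps)).mean()
    cr = (tp / (y_true.sum(axis=0) + eps)).mean()
    cf1 = 2 * cp * cr / (cp + cr + eps)

    # Overall precision/recall pooled over all outfit-shape pairs (OP, OR, OF1).
    op = (y_pred * y_true).sum() / (y_pred.sum() + eps)
    orr = (y_pred * y_true).sum() / (y_true.sum() + eps)
    of1 = 2 * op * orr / (op + orr + eps)

    # mAP: mean of the per-class average precision over body shapes.
    mAP = np.mean([average_precision_score(y_true[:, k], y_score[:, k])
                   for k in range(y_true.shape[1])])
    return {"mAP": mAP, "CP": cp, "CR": cr, "CF1": cf1, "OP": op, "OR": orr, "OF1": of1}

# Toy example with 4 outfits and 3 body shapes:
labels = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
scores = np.random.rand(4, 3)
print(multilabel_metrics(labels, scores))
```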
Implementation Details. We adopt the SGD optimizer [25] with a momentum factor of 0.9 and a weight decay of 5e-4. We gradually decrease the learning rate according to:

\mathrm{lr} = \mathrm{base\_lr} \times (1 - \mathrm{step\_num} / \mathrm{max\_step})^{0.9}   (10)

where the base learning rate is 0.1. The maximum number of steps and the training batch size are set to 1,260 and 10, respectively. During training, we save the checkpoint corresponding to the highest mAP achieved on the validation set and evaluate the saved model on the test set. For all experiments, we report the average results of five repeated runs.
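Eq. (10) is the familiar polynomial ("poly") decay schedule; a minimal sketch with PyTorch's LambdaLR is given below, using the hyperparameters stated above (base learning rate 0.1, 1,260 maximum steps, power 0.9). The dummy model is only there to make the snippet self-contained.

```python
import torch
import torch.nn as nn

base_lr, max_step, power = 0.1, 1260, 0.9

model = nn.Linear(512, 1)  # dummy parameters to optimize
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=5e-4)
# lr = base_lr * (1 - step_num / max_step) ** 0.9, applied once per training step
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: (1 - min(step, max_step - 1) / max_step) ** power)

for step in range(5):      # training loop sketch (forward/backward omitted)
    optimizer.step()
    scheduler.step()
    print(step, optimizer.param_groups[0]["lr"])
```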
5.2. Comparative Results (RQ1)

Baselines. We compare ViBA-Net with eight baseline methods: (1) StyleMe [12], which extends AuxStyles [13] by using bidirectional symmetrical deep neural networks to learn a joint representation of outfits and body shapes. (2) TDRG [34], an effective multi-object recognition model that explores structural and semantic relations through a Graph Convolutional Network; we use it to learn the joint relations of the try-on image. (3) M3TR [35], a multi-modal multi-label recognition model that incorporates global visual context and linguistic information through ternary relationship learning; we embed the body shape labels into word embeddings as the linguistic information and use try-on appearances as the input images. (4) CSRA [36], which captures spatial regions of objects from different categories by combining a simple spatial attention score with class-specific and class-agnostic features; we train CSRA using try-on images as input. (5) FCN [22], which employs a convolutional layer to embed the outfit based on fashion attribute features and utilizes a GCN to learn multi-label classifiers based on word embeddings of body shapes; the compatibility scores are obtained by applying the learned classifiers to the outfit embedding. (6) Mo et al. [21], which learns the correlation between fashion images, fashion attributes, and physical attributes with two transformer encoders. (7) ViBE [14], which applies several MLPs to learn fashion clothing's affinity with body measurements. (8) Body-aware CF [1], a collaborative-filtering-based method utilizing fashion item and body measurement features.
Table 1. Evaluation results on the O4U dataset.

Methods            mAP    CP     CR     CF1    OP     OR     OF1
Random             45.01  44.27  23.04  30.31  44.91  21.93  29.47
StyleMe [12]       49.08  37.50  56.05  44.94  62.81  77.70  69.47
TDRG [34]          54.66  50.80  63.60  56.48  65.42  78.85  71.51
M3TR [35]          61.37  55.92  61.19  58.44  69.37  79.65  74.15
CSRA [36]          61.38  56.63  61.18  58.82  71.82  76.79  74.22
FCN [22]           62.34  56.96  62.41  59.55  71.42  78.14  74.62
Mo et al. [21]     62.38  55.24  62.10  58.47  67.17  79.34  72.75
ViBE [14]          62.18  55.63  64.43  59.71  70.79  79.25  74.78
ViBA-Net (Ours)    63.14  57.30  64.85  60.84  72.02  80.73  76.13

Quantitative Results on O4U. We present the quantitative results on the O4U dataset in Table 1. All baseline methods are trained on the training set of O4U. The Random method means all predictions are made randomly. We observe that the proposed ViBA-Net achieves the best performance across all metrics. Specifically, it surpasses StyleMe by a clear margin (+14.06 mAP). This may be because the bidirectional symmetrical deep neural networks utilized in StyleMe are limited in their ability to learn cross-modal relationships. Compared with the TDRG, M3TR, and CSRA methods, ViBA-Net brings consistent gains of +1.78∼8.5 mAP, +2.02∼4.36 CF1, and +1.9∼4.6 OF1 over them. The reason may be that ViBA-Net takes advantage of multi-modal features. ViBA-Net also outperforms the FCN, Mo et al. [21], and ViBE methods on all metrics. This may be attributed to the fact that these methods fail to learn body shape embeddings using visual features.

Figure 3. Evaluation results on the Body-Diverse dataset (AUC for the Dress and Top categories under (a) the Joint and (b) the Disjoint scenarios, comparing Body-aware CF, ViBE, and Ours). In the joint scenario, test models are seen during the training process; in the disjoint version, the training and test sets of models are completely separate. Our method notably outperforms the other approaches across all scenarios and fashion categories, securing the highest AUC performance by a substantial margin.

Quantitative Results on the Body-Diverse Dataset. We report results on the Body-Diverse dataset in Figure 3. We compare ViBA-Net to the Body-aware CF and ViBE methods; the latter two rely solely on SMPL parameters and body measurements for representing body shape. Remarkably, our method consistently outperforms the others across all scenarios and fashion categories. Specifically, Figure 3 (a) shows the results on the "easier" test, where our method brings +2.6 and +1.8 AUC gains over the CF and ViBE methods on the dress set, respectively. A substantial AUC improvement of +5.1 over ViBE is also observed on the top set. Figure 3 (b) shows the results on the disjoint test set. The AUC performance of ViBA-Net is +0.9 and +1.2 higher than that of the ViBE method on the dress and top test sets, respectively. These consistent improvements can be attributed to the incorporation of visual body features.

Notably, it can be observed in Figure 3 that all methods perform better on the joint dataset than on the disjoint dataset, aligning with our expectations. Another observation is that the evaluation results achieved in the top category are superior to those in the dress category. This discrepancy may be because tops are predominantly associated with upper-body parameters such as bust and waist rather than hip size, and this specificity impacts model performance. In contrast, as full-body garments, dresses leverage data from all body dimensions.

Figure 4. Qualitative comparison among different methods (panels (a)-(d); each panel shows the ground truth and the predictions of StyleMe, TDRG, M3TR, CSRA, FCN, ViBE, and ViBA-Net). The tick symbol indicates a match between the outfit and the body shape, while the cross symbol indicates a mismatch.

Qualitative Results. The qualitative results are presented in Figure 4. It is evident that, among all baselines, ViBA-Net consistently performs well across various outfit compositions. In Figure 4 (a), for example, the outfit consists of corset straps with hot pants, which might not be compatible with people having lower-body-segment obesity due to the tight pants. However, the hot pants are short and expose the legs, which can alleviate the feeling of envelopment; thus, the outfit can still be compatible with body shapes such as bottom hourglass, spoon, and triangle. On the other hand, corset straps are heavy for people with inverted triangle or top hourglass body shapes, who also have a larger bust, so matching hot pants with a similarly large exposure of skin is unsuitable. In contrast, as shown in Figure 4 (b), when the clothing is changed to a tank top and an A-line long skirt, both problems are resolved. Similarly, in Figure 4 (c), the off-shoulder blouse is unsuitable for people with broad shoulders, and the tight jeans are not compatible with those with lower-body-segment obesity. Furthermore, for outfits with special silhouettes, such as the peplum top with an H-line short skirt in Figure 4 (d), ViBA-Net can still accurately assess the compatibility between the body shape and the outfit composition.
Table 2. Ablation results on representation learning. backbone: using the backbone (ResNet-18) as a multi-label classifier; w/o-body: encoding the body shape as a one-hot vector; w/o-try-on: encoding the outfit using visual features from separate items; w/o-attr: removing the fashion attribute data.

Methods      mAP    CP     CR     CF1    OP     OR     OF1
backbone     57.71  54.47  57.54  55.96  67.53  76.39  71.68
w/o-body     60.57  55.68  60.71  57.97  67.46  73.84  70.46
w/o-anth.    62.73  56.85  64.25  60.32  71.75  79.91  75.61
w/o-visual   62.61  56.92  64.85  60.63  71.83  80.33  75.84
w/o-try-on   61.72  56.29  62.77  59.35  71.43  78.99  75.02
w/o-attr     61.45  55.83  63.32  59.34  70.61  79.50  74.79
Full model   63.14  57.30  64.85  60.84  72.02  80.73  76.13

Table 3. Body shape classification accuracy compared with available classifiers.

Lee et al. [33]  Francis [8]  Collings [3]  Hidayati et al. [12]  Ours
28.63%           31.84%       37.87%        76.83%                97.60%

5.3. Ablation Study (RQ2)

We examine the effectiveness of the components of ViBA-Net by conducting several ablation studies.

Ablation Study on Representation Learning. We first demonstrate the effectiveness of the body shape and outfit representations applied in ViBA-Net, as shown in Table 2. Firstly, we investigate the overall contribution of ViBA-Net to the multi-label classification performance by comparing it with ViBA-Net's backbone model (ResNet-18). Our full network brings +5.43 mAP, +4.88 CF1, and +4.45 OF1 performance improvements. Furthermore, we evaluate the efficacy of our body-shape embedding approach by removing specific components: anthropometric features (w/o-anth.), visual features (w/o-visual), and a combination of both (w/o-body). Consistent performance deterioration is observed across all three cases, which substantiates that anthropometric and visual features are both pivotal in accurately representing body shapes. We also compare our try-on embedding method with a separate item embedding method (w/o-try-on). The result shows that ViBA-Net using the try-on embedding achieves higher scores (+1.42 mAP, +1.49 CF1, and +1.11 OF1) than the model using separate items, suggesting that our try-on embedding captures more information from the try-on image than discrete item features do. Lastly, we investigate the impact of utilizing fashion attributes in our model. The results of w/o-attr demonstrate that using fashion attribute data improves the model's overall performance, with the full model achieving increases of +1.69 mAP, +1.50 CF1, and +1.34 OF1, which suggests that fashion attributes provide valuable cues for personalized fashion recommendations. More ablation studies on the network structure and outfit encoding are discussed in Section 4 of the supplementary material.

Comparing Visual and Anthropometric Features. Table 3 compares the performance of body shape classification methods. These results show that our visual-based classification approach (Ours) clearly outperforms the other baselines. This could be because the other baselines merely use anthropometric data to classify body shapes.

Figure 5. Visualization of the visual (left) and anthropometric (right) body features using t-SNE.

To further illustrate the difference between the visual and anthropometric features of the body shape, we visualize them in Figure 5 using t-SNE [30]. The visual features are extracted from the frontal-view images, and the anthropometric features are measured from 3D models belonging to the testing set of the body shape dataset. We can observe that the five body shapes are separated more clearly from each other in the left part of Figure 5 than with the anthropometric features in the right part. This suggests that the visual features contain more valuable information for characterizing the body shape. We also observe that the Euclidean distance between similar body shapes is smaller. For instance, the distance between inverted triangle (orange star symbol) and top hourglass (red diamond symbol) is shorter than the distance between inverted triangle and triangle (purple triangle symbol). The main reason is that both the inverted triangle and top hourglass body shapes have a wider upper body and a narrower lower body, whereas the triangle body shape typically has larger hips. These results support the proposal that incorporating visual body features into the process of learning body-shape-aware embeddings is effective.
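A visualization of this kind can be reproduced with scikit-learn's t-SNE, as sketched below; the random feature array and labels are placeholders standing in for the 512-D visual features (or 20-D measurements) of the test-set bodies.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder features: e.g. 512-D visual features of test-set bodies and their shape labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 512))
labels = rng.integers(0, 5, size=500)  # 0..4 for the five body shapes shown in Figure 5
shape_names = ["bottom hourglass", "inverted triangle", "spoon", "top hourglass", "triangle"]

# Project to 2-D with t-SNE and plot one scatter per body shape.
embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
for k, name in enumerate(shape_names):
    mask = labels == k
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=8, label=name)
plt.legend()
plt.title("t-SNE of body features")
plt.savefig("tsne_body_features.png", dpi=150)
```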
5.4. Explainability Analysis (RQ3)

Figure 6. Visualization of the attention maps computed in the JEM for three query outfits (a)-(c). The vertical axis represents all the fashion attributes possessed by the query outfit. The horizontal axis represents five body shapes, namely, from left to right, bottom hourglass, inverted triangle, spoon, top hourglass, and triangle.

We visualize the attention maps for three query outfits in Figure 6 to show which fashion attributes ViBA-Net focuses on when predicting compatibility. Each row of the attention map represents the attention weights α_b generated in the JEM, which indicate the significance of the fashion attributes with respect to the corresponding body shapes. In the first two examples (Figures 6 (a) and (b)), we present two outfits where the first query does not match the bottom hourglass, spoon, and triangle body shapes, while the second query is compatible with them. The attention maps indicate that ViBA-Net attends mostly to the bottom silhouette attribute dimension (last row), i.e., Slim and A-line, respectively. This may be due to the fact that these three body shapes all possess a larger hip measurement, which is congruent with an A-line dress but not with a slim one. Additionally, Figure 6 (c) shows an outfit that is incompatible with the inverted triangle body shape. ViBA-Net suggests that the main reason for this mismatch is that the top item contains a cold-shoulder design. From a fashion perspective, this inference is reasonable because tops with cold-shoulder designs often fail to provide adequate support for the chest and upper body, which can be a concern for individuals with a larger bust, resulting in an unflattering and uncomfortable fit.

Interestingly, ViBA-Net focuses on different fashion attributes of the bottom and top items for different body shapes. The network concentrates mainly on the bottom attributes for body shapes such as bottom hourglass, spoon, and triangle. Conversely, it pays more attention to the top attributes for the inverted triangle and top hourglass. This could be because the bottom attributes play a more critical role in determining compatibility for body shapes that tend to have a larger hip and thigh area. On the other hand, for body shapes that have broader shoulders and a smaller waist, the network focuses more on the top attributes to ensure a balanced overall look that accentuates the waistline.

5.5. Perceptual Study (RQ4)

Finally, we conduct a perceptual study to show the potential of ViBA-Net in practical applications. Specifically, we invite ten experts working in the fashion industry to assess the results of all the compatibility models from the following two aspects: (1) Body-shape-Aware Compatibility score (OCs): whether the outfits are compatible with the body shape; (2) Explanation Confidence score (ECs): whether the explanation is reasonable. The score range is [0, 1] in steps of 0.1, and the final score is the weighted average of all the scores given by the experts. The perceptual results are summarized in Table 4. It can be seen that ViBA-Net achieves the highest performance on body-shape-aware fashion compatibility while offering a unique advantage in explainability.

Table 4. Perceptual results of the compatibility models.

Methods  StyleMe [12]  TDRG [34]  M3TR [35]  CSRA [36]  FCN [22]  ViBA-Net (Ours)
OCs      49%           52%        51%        53%        59%       61%
ECs      -             -          -          -          -         67%

Figure 7. The pipeline of a prototype for applying ViBA-Net in a real application. Step (a): inputting the personal information; step (b): generating a 3D SMPL model according to the input measurement data; step (c): adjusting and confirming the body shape; step (d): browsing the fashion items; step (e): selecting a favorite clothing item with corresponding outfit recommendations that consider the body shape; step (f): visualizing the outfit composition on the selected body shape; step (g): translating the SMPL model into a human image via a generative model, e.g., Midjourney.

In addition to the perceptual study, we also build a prototype for applying ViBA-Net in a real application to show the practicality of the proposed method. Figure 7 presents the main steps of the prototype. It can be seen that, with awareness of the body shape, customers can more easily and directly accept the recommended outfits. Moreover, connecting with current cutting-edge techniques can generate more user-friendly and engaging results with large economic potential, e.g., translating the SMPL model into a human image via generative models such as Midjourney, or calling the API of a large language model such as ChatGPT to make the explanation read like a natural conversation.

6. Conclusion

Body shape is an essential consideration when recommending outfits to consumers in real-life applications. To this end, we propose ViBA-Net to learn better body-shape-aware embeddings for fashion compatibility, along with a new dataset containing varied information about body shape. Meanwhile, we also propose representing the outfit using its try-on appearance, which captures the scaling and spatial relationships between fashion items on the body. We conduct experiments on both the O4U and BD datasets to demonstrate the superiority of ViBA-Net compared to other state-of-the-art approaches.

Acknowledgements

This work is supported by the Laboratory for Artificial Intelligence in Design (Project Code: RP 3-2) under the InnoHK Research Clusters, Hong Kong SAR Government.
References

[1] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. SVDFeature: A toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research, 13(1):3619–3622, 2012.
[2] Wen Chen, Pipei Huang, Jiaming Xu, Xin Guo, Cheng Guo, Fei Sun, Chao Li, Andreas Pfadler, Huan Zhao, and Binqiang Zhao. POG: Personalized outfit generation for fashion recommendation at Alibaba iFashion. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2662–2670, 2019.
[3] Kat Collings. The foolproof way to find out your real body type.
[4] Guillem Cucurull, Perouz Taslakian, and David Vazquez. Context-aware visual compatibility prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12617–12626, 2019.
[5] Zeyu Cui, Zekun Li, Shu Wu, Xiao-Yu Zhang, and Liang Wang. Dressing as a whole: Outfit compatibility learning based on node-wise graph neural networks. In The World Wide Web Conference, pages 307–317, 2019.
[6] Lavinia De Divitiis, Federico Becattini, Claudio Baecchi, and Alberto Del Bimbo. Disentangling features for fashion recommendation. ACM Transactions on Multimedia Computing, Communications and Applications, 19(1s):1–21, 2023.
[7] Priya Devarajan and Cynthia L. Istook. Validation of female figure identification technique (FFIT) for apparel software. Journal of Textile and Apparel, Technology and Management, 4(1):1–23, 2004.
[8] Cherene Francis. Body shape calculator.
[9] Bin-Bin Gao and Hong-Yu Zhou. Learning to discover multi-class attentional regions for multi-label image recognition. IEEE Transactions on Image Processing, 30:5920–5932, 2021.
[10] Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S. Davis. Learning fashion compatibility with bidirectional LSTMs. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1078–1086, 2017.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Shintami Chusnul Hidayati, Ting Wei Goh, Ji-Sheng Gary Chan, Cheng-Chun Hsu, John See, Lai-Kuan Wong, Kai-Lung Hua, Yu Tsao, and Wen-Huang Cheng. Dress with style: Learning style from joint deep embedding of clothing styles and body shapes. IEEE Transactions on Multimedia, 23:365–377, 2020.
[13] Shintami Chusnul Hidayati, Cheng-Chun Hsu, Yu-Ting Chang, Kai-Lung Hua, Jianlong Fu, and Wen-Huang Cheng. What dress fits me best? Fashion recommendation on the clothing style for personal body shape. In Proceedings of the 26th ACM International Conference on Multimedia, pages 438–446, 2018.
[14] Wei-Lin Hsiao and Kristen Grauman. ViBE: Dressing for diverse body shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[15] Hyunwoo Hwangbo, Yang Sok Kim, and Kyung Jin Cha. Recommendation system development for fashion retail e-commerce. Electronic Commerce Research and Applications, 28:94–101, 2018.
[16] Pang Kaicheng, Zou Xingxing, and Wai Keung Wong. Modeling fashion compatibility with explanation by using bidirectional LSTM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3894–3898, June 2021.
[17] Yen-Liang Lin, Son Tran, and Larry S. Davis. Fashion outfit complementary item retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3311–3319, 2020.
[18] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015.
[19] Zhi Lu, Yang Hu, Yan Chen, and Bing Zeng. Personalized outfit recommendation with learnable anchors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12722–12731, 2021.
[20] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[21] Dongmei Mo, Xingxing Zou, Kaicheng Pang, and Wai Keung Wong. Towards private stylists via personalized compatibility learning. Expert Systems with Applications, 219:119632, 2023.
[22] Kaicheng Pang, Xingxing Zou, and Waikeung Wong. Dress well via fashion cognitive learning. In British Machine Vision Conference (BMVC), November 2022.
[23] Christopher J. Parker, Steven George Hayes, Kathryn Brownbridge, and Simeon Gill. Assessing the female figure identification technique's reliability as a body shape classification system. Ergonomics, 64(8):1035–1051, 2021.
[24] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[25] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[26] Hosnieh Sattar, Gerard Pons-Moll, and Mario Fritz. Fashion is taking shape: Understanding clothing preference based on body shape from online sources. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 968–977. IEEE, 2019.
[27] Karla Kristin Peavy Simmons. Body Shape Analysis Using Three-Dimensional Body Scanning Technology. North Carolina State University, 2002.
[28] Tianyu Su, Xuemeng Song, Na Zheng, Weili Guan, Yan Li, and Liqiang Nie. Complementary factorization towards outfit compatibility modeling. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4073–4081, 2021.
[29] Jie Sun, Qianyun Cai, Tao Li, Lei Du, and Fengyuan Zou. Body shape classification and block optimization based on space vector length. International Journal of Clothing Science and Technology, 31(1):115–129, 2019.
[30] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[31] Mariya I. Vasileva, Bryan A. Plummer, Krishna Dusad, Shreya Rajpal, Ranjitha Kumar, and David Forsyth. Learning type-aware embeddings for fashion compatibility. In Proceedings of the European Conference on Computer Vision (ECCV), pages 390–405, 2018.
[32] Xuewen Yang, Dongliang Xie, Xin Wang, Jiangbo Yuan, Wanying Ding, and Pengyun Yan. Learning tuple compatibility for conditional outfit recommendation. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2636–2644, 2020.
[33] Jeong Yim Lee, Cynthia L. Istook, Yun Ja Nam, and Sun Mi Park. Comparison of body shape between USA and Korean women. International Journal of Clothing Science and Technology, 19(5):374–391, 2007.
[34] Jiawei Zhao, Ke Yan, Yifan Zhao, Xiaowei Guo, Feiyue Huang, and Jia Li. Transformer-based dual relation graph for multi-label image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 163–172, 2021.
[35] Jiawei Zhao, Yifan Zhao, and Jia Li. M3TR: Multi-modal multi-label recognition with transformer. In Proceedings of the 29th ACM International Conference on Multimedia, pages 469–477, 2021.
[36] Ke Zhu and Jianxin Wu. Residual attention: A simple but effective method for multi-label recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 184–193, 2021.
