ViBA-Net: Body Shape in Fashion Compatibility
ViBA-Net: Body Shape in Fashion Compatibility
shape through visual-level information. Specifically, ViBA- (a). Outfit Composition (b). Human body type
Net consists of three modules: a body-shape embedding Figure 1. An example of the body-shape-aware fashion compati-
module, which extracts visual and anthropometric features bility task. The outfit is compatible with the inverted triangle and
of body shape from a newly introduced large-scale body top hourglass body shapes, but does not fit other body shapes.
shape dataset; an outfit embedding module, which learns studies [12–14, 26] represent the body shape merely rely-
the outfit representation based on visual features extracted ing on body measurement data while overlooking the valu-
from a try-on image and textual features extracted from able visual features of body shape, which limits their ability
fashion attributes; and a joint embedding module, which to provide precise recommendations. To effectively incor-
jointly models the relationship between the representations porate accurate body shape information into FRSs, lever-
of body shape and outfit. ViBA-Net is designed to generate aging valuable information from body images is essential.
attribute-level explanations for the evaluation results based Moreover, accurately representing outfits is also critical, as
on the computed attention weights. The effectiveness of the scaling and spatial relationships between clothing items
ViBA-Net is evaluated on two mainstream datasets through can impact how they fit and flatter different body shapes.
qualitative and quantitative analysis. Data and code are Therefore, conventional outfit representation methods used
released1 . in FCL, such as item-wise correlations [4, 31, 32] or graph
neural networks [5, 28], are insufficient for modeling the
1. Introduction
relationships between body shape and an outfit. Lastly, pro-
Fashion Recommendation Systems (FRSs) [2, 15] is not viding a reasonable explanation for the evaluation is es-
a new topic, but they still have great potential for economic sential for personalized FRSs. However, previous stud-
benefits. Previous works have mainly focused on fashion ies [12, 21, 22] have not achieved this.
compatibility learning (FCL) [6,16,17], which only consid- To this end, this paper proposes a Visual Body-shape-
ers the compatibility among fashion items. However, be- Aware Network (ViBA-Net) to model the relationships be-
sides the outfit itself, consumers will be more concerned tween body shape and outfit. The ViBA-Net consists of
about how it looks when worn. Figure 1 demonstrates how three modules: Body-shape Embedding Module (BEM),
fashion compatibility can vary depending on different body Outfit Embedding Module (OEM), and Joint Embedding
shapes. For instance, individuals with an inverted trian- Module (JEM). The BEM combines visual and anthropo-
gle body shape may find the outfit in Figure 1 (a) suitable, metric features to obtain a general representation of the
while those with a triangle body shape may not. Previous body shape. However, obtaining accurate visual features
* Corresponding author. from body images requires a diverse dataset with explicit
1 https://github.com/BenjaminPang/ViBA-Net body shape annotations, which is currently unavailable.
8056
Thus, we create a new dataset covering seven common body through multiple MLPs. These approaches all neglect the
shapes; each contains 4,000 3D body models with vary- visual features of human bodies. In this work, we propose
ing but similar shapes. Every model within the dataset to encode the body into a more comprehensive embedding
is accompanied by corresponding anthropometric data and incorporating anthropometric and visual features, which are
frontal view images, which offer the visual features of the extracted from body images.
respective body shape. The OEM learns the outfit embed- Body Shape Classification. Most body-aware methods
ding by incorporating visual and textual features of the out- proposed to classify body shapes using clustering ap-
fit. We propose to represent an outfit leveraging its try- proaches, such as using the k-means in [14] and the affin-
on appearance instead of separate item images because the ity propagation in [12, 13]. In [26], authors separate body
try-on image contains the scaling and spatial relationships shapes into two groups according to users’ sizes. How-
among individual items. For the textual aspect, we ex- ever, research on classifying body shapes has been exten-
ploit the fashion attributes information to enhance the out- sively investigated over the past two decades. Notably,
fit representation, where the attribute values are encoded Simmons [27] developed a well-known body shape clas-
into word embeddings. Finally, the JEM integrates rep- sification system, the Female Figure Identification Tech-
resentations of the body shape and outfit to compute the nique (FFIT), which uses anthropometric data from 3D
body-shape-aware embedding, which is then transformed body scans for body shape classification. Subsequent re-
by a linear function to obtain the final compatibility score. search [7, 23, 33] improved the FFIT, which has become a
The core of the OEM and JEM is a cross-modal atten- widely accepted standard for body shape classification. So,
tion layer, allowing them to merge features from different in this work, we introduce a body shape dataset that clas-
modalities. The hierarchical design of ViBA-Net facilitates sifies body shapes into seven well-known types using FFIT
the propagation of cross-modal interactions between fash- instead of clustering methods.
ion attributes and body shapes through the computed atten- Fashion Outfit Representation. How to represent the out-
tion maps, as visualized in Figure 6. We leverage these at- fit plays a crucial role in fashion recommendation. Early
tention maps to generate the attribute-level explanations for works addressing fashion compatibility Learning (FCL)
the prediction results. All experiments are conducted on problem [4,31,32] represented an outfit as pairwise relation-
two mainstream fashion compatibility datasets, i.e., Outfit ships between fashion items and mapped fashion item em-
for Your (O4U) [22] and Body-Diverse (BD) Dataset [14], beddings into a unified space using category information.
that all include body shape annotations. Both qualitative Beyond pairwise distance, some studies attempted to model
and quantitative results show the advancement of the ViBA- high-order interactions among items [19, 22, 28]. These ap-
Net. We summarize main contributions as follows: proaches have two limitations: 1. They omit the scaling
• We propose ViBA-Net to obtain better body-shape- and spatial relationships between individual clothing items
aware embeddings for fashion compatibility. We en- when encoding the outfit; 2. Using only item category infor-
hance the body-shape embedding by introducing vi- mation is inadequate because adopting more specific fash-
sual features extracted from body images and repre- ion attribute information is useful. To this end, we propose
senting the outfit using its try-on appearance. to use try-on appearance images to represent outfits and ex-
• We introduce a new dataset with 28,000 body samples ploit fashion attributes to enhance the model performance.
covering seven common body shapes, each with a 3D
body model, anthropometric data, and a frontal view 3. Body Shape Dataset
image. This dataset can also be useful for tasks such
Previous studies [13, 23] have introduced a few body
as virtual try-on and clothed human generation.
shape datasets. However, their number of body models is
• We conduct experiments on the O4U and BD datasets,
insufficient to represent body shapes. For example, Parker
demonstrating the superiority of ViBA-Net over other
et al. [23] analyzed 1,679 3D body scans, but only 10 and 62
state-of-the-art approaches.
human bodies are categorized as triangle and top hourglass
2. Related Work body shapes, respectively. Although Hidayati et al. [13]
introduced a dataset consisting of 3,150 individual celebri-
Body-shape-Aware Fashion Compatibility. With the de- ties with their body measurement, no body shape labels
velopment of FCL, researchers are increasingly aware of are annotated. In light of this, we present a new dataset
the importance of body shape to practical applications [12– for the body shape representation. It features a diverse ar-
14, 22, 26]. Hidayati et al. [13] represented body shapes of ray of 28,000 individual models, spanning seven prevalent
female celebrities using their body measurements. Sun et body shapes: bottom hourglass, inverted triangle, spoon,
al. [29] proposed to use 3D features to represent female up- top hourglass, triangle, hourglass, and rectangle. The con-
per body shapes. Hsiao et al. [14] extracted body shape fea- struction process involves five steps: 1. Randomly generat-
tures using body measurements and SMPL [18] parameters ing 200,000 3D body models using the SMPL method [18];
8057
2. Measuring anthropometric data, including bust, waist, ResNet-18 [11] model, which is trained on the body im-
high hip, and hip circumferences from these models; 3. Re- ages of the proposed body shape dataset with a split ratio
moving unrealistic models and generating 100,000 more re- of 80%, 10%, and 10% for training, validation, and test.
alistic bodies based on refined shape parameters; 4. classi- v̄k ∈ R1×512 is the visual features, and Fbody refers to the
fying body shapes using the FFIT algorithm [33]; and 5. forward function of ResNet with the last linear layer dis-
Capturing frontal view images for each model using an or- carded. The visual feature extraction process can be written
thographic camera. The details of constructing this dataset as v̄k = Fbody (Īk ).
are presented in Section 1 of the Supplementary Material. We measure the representative model to acquire the an-
thropometric data, denoted as ω̄ k = Fmeasure (T̄k ) ∈
4. Methodology R1×20 , where Fmeasure refers to the measuring process.
Since body shape parameters contain information for char-
In this section, we elaborate on the details of the pro- acterize the body shape, we concatenate β̄ k and ω̄ k , and
posed ViBA-Net: 1. Clarify the task formulation; 2. Present send it to a linear layer consisting of a linear transforma-
the representations of body type, try-on image, and fashion tion and a Rectified Linear Unit (ReLU) activation function.
attributes; 3. Describe the architecture of ViBA-Net. The resulting output is concatenated with v̄k to produce the
body-shape embedding, denoted as Ūk ∈ R1×1024 . For-
4.1. Task Formulation
mally, Ūk is calculated using the following equation:
Following [22], we formulate this task as a multi-label
\bar {\mathbf {U}}^k = \mathrm {Concat}(\mathrm {ReLU}(\mathrm {Concat}(\bar {\bm {\beta }}^k, \bar {\bm {\omega }}^k)\mathbf {W}_B + \mathbf {b}_B), \bar {\mathbf {v}}^k)
classification task. Given a training set T = {Oj , Y j }N j=1
containing N outfits, we denote Oj = {Xj , Gj } as the j-th (3)
outfit containing serveral individual clothing images Xj and where WB ∈ R30×512 and bB ∈ R1×512 are fully con-
structured fashion attributes Gj . Y j = {ykj |k = 1, · · · , K} nected layer’s weight matrix and bias vector, respectively.
refers to a set of ground truth labels for j-th outfit condi- The resulting body shape features will be sent to the joint
tioned on K body shapes, where ykj = 1 indicates that out- embedding module.
fit Oj is incompatible with k-th body shape. Our goal is 4.3. Try-on Image Representation
to devise a learning function F to predict the compatibility
score ŷkj between a query outfit Oj and k-th body shape: We leverage try-on images instead of individual cloth-
ing images to represent outfits. However, try-on images
\hat {y}^j_k=\mathcal {F}(\{ \mathbf {X}^j, \mathbf {G}^j, {\bm {\omega }}^k, \mathbf {I}^k \} | \mathbf {\Theta } ) (1) are not typically included in mainstream datasets for the
FCL task, such as Polyvore [10], Style4BodyShape [13], and
where ω k and Ik are the anthropometric data and front view O4U [22] to name a few. To address this, a Multi-layer
image of k-th body shape. Θ is the training parameters. Virtual Try-On Network (M-VTON) system is utilized to
synthesize separate item images while preserving clothing
4.2. Body-shape Representation details as much as possible. Details can be found in Section
2 of the Supplementary Material. After obtaining the try-on
We devise a Body-shape Embedding Module (BEM) to
image, we utilize a pre-trained ResNet model with its last
compute the embedding for the body shape by exploiting
pooling layer and linear layer discarded to extract its visual
both visual and anthropometric features extracted from a
features. The motivation behind encoding it into multiple
representative body model, as illustrated in the top-left cor-
region-level features is that they can provide more accurate
ner of Figure 2. To obtain the representative model for the
representations than a single global feature. Formally, the
k-th body shape, we first average the shape parameters of
feature extraction process can be expressed as:
all body models belonging to the set Uk , and then use the
SMPL model [18] to generate the representative model ac- \mathbf {S} = \mathcal {F}_{\mathrm {outfit}}(\tilde {\mathbf {X} }) = \{ \mathbf {x}_1, \cdots , \mathbf {x} _n \}; (4)
cording to the averaged parameters:
where S is the representation of try-on image containing
\bar {\mathbf {T}}^k = \mathcal {F}_\mathrm {SMPL}(\bar {\bm {\beta }}^k) = \mathcal {F}_\mathrm {SMPL}(\frac {1}{|\mathbf {U}^k|}\sum _{\mathbf {T}_i \in \mathbf {U}^k} \bm {\beta }_i) (2) 128 spatial features xi ∈ R512 , and Foutfit refers to the
forward function of the modified ResNet.
4.4. Fashion Attributes Representation
where T̄k is the representative 3D model of k-th body
shape, and β̄ k ∈ R1×10 is the averaged shape parameter The clothing items are associated with a set of fashion at-
vector. |Uk | means the size of set Uk . Then, we use an tributes manually recognized from various attribute dimen-
orthographic camera to capture the corresponding frontal sions. For the sake of explanation, we show three fashion
view image, denoted as Īk = Fortho (T̄k ). We extract vi- attributes in the bottom-left part of Figure 2. We utilize
sual features of k-th body shape from Īk by employing a the union of all attributes associated with each item in an
8058
Body-shape Embedding Module Joint Embedding Module
SMPL 3D model Frontal view image
Linear Layer
�
Body shape parameter 𝜷𝜷
outfit to represent the fashion attributes of the entire outfit. where W ∈ Rdq ×dv is the learnable weight matrix, and the
For fashion attribute value, we use a pre-trained GloVe [24] softmax operation is applied on the second dimension. Ac-
model to encode its text into a word embedding, denoted cording to the obtained attention distribution and value V,
as e ∈ Rdtext , where dtext = 300 is the dimensionality of the output of this block is computed by V̂ = αV, where
the word embedding. For fashion attribute dimension, we V̂ ∈ RNq ×dv is the fused feature vectors. The OEM aims to
j
encode it into a one-hot vector, denoted as c ∈ RNA , where acquire the outfit representation, denoted as Hj ∈ RL ×512 ,
NA = 15 is the number of all fashion attributes used in through integrating features of try-on image and fashion at-
this work. We then concatenate c and e to represent one tributes using the cross-modal attention block:
fashion attribute and then apply a linear transformation to
the concatenated vector. Suppose the j-th outfit possesses \mathbf {H}^j = \bm {\alpha }_o \cdot \mathbf {S}^j = \mathrm {softmax} (\mathbf {A}^j \mathbf {W}_o {\mathbf {S}^j}^T)\cdot \mathbf {S}^j (7)
Lj fashion attributes, this outfit’s attribute representation where Wo ∈ R512×512 is the learnable weight matrix and
j
Aj ∈ RL ×512 is computed by: j
αo ∈ RL ×128 is the attention maps calculated in OEM.
Then JEM learns the relationship between the k-th body
\mathbf {A}^j = \{\mathrm {ReLU}(\mathrm {Concat}(\mathbf {c}_l, \mathbf {e}_l)\mathbf {W}_A + \mathbf {b}_A )\}_{l=1}^{L^j} (5) shape features Ūk and the j-th outfit representation and out-
puts the compatibility vector between these two:
where WA ∈ R315×512 and bB ∈ R512 is the weight ma-
\hat {\mathbf {H}}^j_k = \bm {\alpha }_b \cdot \mathbf {H}^j = \mathrm {softmax}(\bar {\mathbf {U}}^k \mathbf {W}_b {\mathbf {H}^j}^T) \cdot \mathbf {H}^j (8)
trix and bias vector of the linear transformation.
4.5. Body-type-Aware Network Architecture where Ĥjk ∈ R1×512 is the body-shape-aware embedding,
and Wb ∈ R1024×512 is the learnable weight matrix in the
We employ the cross-modal attention block [20] in both j
JEM. αb ∈ R1×L is the attention maps computed in JEM.
the Outfit Embedding Module (OEM)and Joint Embed- We can observe that the second dimension of αb is the same
ding Module (JEM) of ViBA-Net to merge data represen- as the number of the fashion attributes associated with the j-
tations from different modalities. This mechanism im- th outfit. Based on this characteristic of the ViBA-Net, we
proves conventional attention mechanisms by introducing can obtain corresponding explanations based on the influ-
a learnable weight matrix in the score function, where two ence distribution of fashion attributes reflected in the atten-
modalities are connected by calculating their compatibility tion maps computed in JEM. We visualize αb in Figure 6
scores. Specifically, it takes two inputs denoted as a query to demonstrate the explainability possessed by ViBA-Net.
Q ∈ RNq ×dq and a value V ∈ RNv ×dv , and the attention Lastly, we compute the compatibility score by applying a
weights α ∈ RNq ×Nv is calculated as: linear transformation on Ĥjk :
\bm {\alpha } = \mathrm {softmax} (\mathbf {Q} \mathbf {W} \mathbf {V}^T) \label {compute_score} (6) \hat {y}^j_k = \hat {\mathbf {H}}^j_k \cdot \mathbf {W}_s + b_s (9)
8059
where Ws ∈ R512×1 and bs ∈ R are the linear transfor- Implementation Details. We adopt the SGD optimizer [25]
mation’s weights and bias, respectively. Since the task is with momentum factor equalling 0.9 and weight decay 5e-4.
formulated as a multi-label classification task, we use the We gradually decrease the learning rate according to:
binary cross entropy loss to measure the difference between
predicted scores ŷkj and target scores ykj . \mathrm {lr} = \mathrm {base\_lr} \times (1 - \mathrm {step\_num} / \mathrm {max\_step})^{0.9} (10)
5. Experiments where the base learning rate is 0.1. The maximum steps
and training batch size are set to 1,260 and 10, respectively.
We conduct experiments on two fashion compatibility During training, we save the checkpoint model correspond-
datasets to showcase the benefits of the proposed ViBA-Net ing to the highest mAP performance achieved on the vali-
model by addressing following research questions: dation set and evaluate the saved model on the test set. We
report the average evaluation results of five repeated exper-
• RQ1: Is the ViBA-Net superior to the current state-of-
iments for all experiments.
the-art methods?
• RQ2: To what extent do the individual components of 5.2. Comparative Results (RQ1)
ViBA-Net influence the model’s performance? Baselines. We compare the ViBA-Net with seven baseline
methods: (1) StyleMe [12], which extends AuxStyles [13]
• RQ3: What can ViBA-Net generate for explainations?
by using bidirectional symmetrical deep neural networks to
• RQ4: How does ViBA-Net perform in the perceptual learn a joint representation of outfits and body shapes. (2)
study? TDRG [34], an effective multi-object recognition model
that explores the structural and semantic aspect relations
5.1. Experimental Settings
through Graph Convolutional Network. We use it to learn
Datasets We evaluate the proposed network on two pub- the joint relation of the try-on image. (3) M3TR [35],
lic datasets: Outfit for You (O4U) [22] and Body-Diverse a multi-modal multi-label recognition model that incor-
(BD) [14] datasets. O4U contains 15,748 compatible out- porates global visual context and linguistic information
fits and 82,017 clothing items. Each item is associated with through ternary relationship learning. We embed the body
a product image and several fashion attributes. On aver- shape labels into the word embedding as the linguistic in-
age, the top item contains 6.64 fashion attributes, while the formation and use try-on appearances as input images. (4)
bottom item contains 3.77 attributes. We use the public CSRA [36], which captures spatial regions of objects from
training, validation, and testing data split provided by O4U different categories by effectively combining a simple spa-
to ensure a fair comparison. The BD dataset comprises tial attention score with class-specific and class-agnostic
889 dresses and 971 tops, spanning 57 individual fashion features. We train CSRA using try-on images as input. (5)
models. We classify these body models into three types FCN [22], which employs a convolutional layer to embed
(Bottom hourglass, Hourglass, and Rectangle) by aligning the outfit based on fashion attribute features and utilize a
their body measurements with the models in our body shape GCN to learn multi-label classifiers based on word embed-
dataset. We consider two scenarios for the dataset division: dings of body shapes. The compatibility scores are obtained
1). The ”easier” case involves seeing models from the test by applying the learned classifiers to the outfit embedding.
split during the training process denoted as the Joint ver- (6) Mo et al. [21], which learns the correlation between
sion; 2). In the more ”difficult” case, models from the test fashion images, fashion attributes, and physical attributes
split are not included in the training process, as termed the with two transformer encoders. (7) ViBE [14], which ap-
Disjoint version. Please refer to Section 3 of the supple- plies several MLPs to learn fashion clothing’s affinity with
mentary material for statistics details of the BD dataset. body measurements. (8) Body-aware CF [1], which is
Evaluation Metrics. For experiments on the O4U dataset, a collaborative filtering-based method utilizing the fashion
we employ a set of seven evaluation metrics to compare item and body measurement features.
the performance of different models. This practice aligns Quantitative Results on O4U. We present the quantitative
with prior works such as [9], [22], and [21], which tackle results on O4U dataset in Table 1. All baseline methods
multi-label classification problems. The metrics encompass are trained on the training set of O4U. The random method
Mean Average Precision (mAP), Average Per-Class Preci- means all predictions are given randomly. We observe
sion (CP), Recall (CR), F1 score (CF1), as well as Average that the proposed ViBA-Net achieves the best performances
Overall Precision (OP), Recall (OR), and F1 score (OF1). across all metrics. Specifically, it surpasses StyleMe by a
Notably, mAP, CF1, and OF1 hold greater significance due clear margin (+14.06 on mAP). This may be because the
to their ability to provide a holistic evaluation of model per- bidirectional symmetrical deep neural networks utilized in
formance. For experiments on the BD dataset, we evaluate StyleMe are limited in their ability to learn cross-modal re-
performances using the Area Under Curve (AUC) metric. lationships. Compared with the TDRG, M3TR, and CSRA
8060
Table 1. Evaluation results on O4U dataset.
Methods mAP CP CR CF1 OP OR OF1
Random 45.01 44.27 23.04 30.31 44.91 21.93 29.47
StyleMe [12] 49.08 37.50 56.05 44.94 62.81 77.70 69.47
TDRG [34] 54.66 50.80 63.60 56.48 65.42 78.85 71.51 Ground Truth
M3TR [35] 61.37 55.92 61.19 58.44 69.37 79.65 74.15 StyleMe
TDTG
CSRA [36] 61.38 56.63 61.18 58.82 71.82 76.79 74.22
M3TR
FCN [22] 62.34 56.96 62.41 59.55 71.42 78.14 74.62 CSRA
Mo et al. [21] 62.38 55.24 62.10 58.47 67.17 79.34 72.75 FCN
ViBE [14] 62.18 55.63 64.43 59.71 70.79 79.25 74.78 ViBE
ViBA-Net
ViBA-Net (Ours) 63.14 57.30 64.85 60.84 72.02 80.73 76.13
(a) (b)
8061
Table 2. Ablation results on representation learning. backbone: Table 3. Body shape classification accuracy comparing with avail-
utilizing backbone (ResNet-18) as multi-label classifier. w/o-body: able classifiers.
encoding the body shape into one-hot vector. w/o-try-on: encoding Available body shape classifiers
outfit using visual features from separate items. w/o-attr: remov- Ours
Lee et al. [33] Francis [8] Collings [3] Hidayati et al. [12]
ing fashion attribute data. 28.63% 31.84% 37.87% 76.83% 97.60%
Methods mAP CP CR CF1 OP OR OF1 Visual features Anthropometric features
backbone 57.71 54.47 57.54 55.96 67.53 76.39 71.68
w/o-body 60.57 55.68 60.71 57.97 67.46 73.84 70.46
w/o-anth. 62.73 56.85 64.25 60.32 71.75 79.91 75.61
w/o-visual 62.61 56.92 64.85 60.63 71.83 80.33 75.84
w/o-try-on 61.72 56.29 62.77 59.35 71.43 78.99 75.02
w/o-attr 61.45 55.83 63.32 59.34 70.61 79.50 74.79
Full model 63.14 57.30 64.85 60.84 72.02 80.73 76.13
8062
Table 4. Perceptual results of the compatibility models.
Methods StyleMe [12] TDRG [34] M3TR [35] CSRA [36] FCN [22] ViBA-Net (Ours)
OCs 49% 52% 51% 53% 59% 61%
ECs - - - - - 67%
8063
References ence on Computer Vision and Pattern Recognition (CVPR),
June 2020. 1, 2, 5, 6
[1] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao
[15] Hyunwoo Hwangbo, Yang Sok Kim, and Kyung Jin Cha.
Zheng, and Yong Yu. Svdfeature: a toolkit for feature-based
Recommendation system development for fashion retail e-
collaborative filtering. The Journal of Machine Learning Re-
commerce. Electronic Commerce Research and Applica-
search, 13(1):3619–3622, 2012. 5
tions, 28:94–101, 2018. 1
[2] Wen Chen, Pipei Huang, Jiaming Xu, Xin Guo, Cheng Guo,
[16] Pang Kaicheng, Zou Xingxing, and Wai Keung Wong. Mod-
Fei Sun, Chao Li, Andreas Pfadler, Huan Zhao, and Bin-
eling fashion compatibility with explanation by using bidi-
qiang Zhao. Pog: personalized outfit generation for fashion
rectional lstm. In Proceedings of the IEEE/CVF Conference
recommendation at alibaba ifashion. In Proceedings of the
on Computer Vision and Pattern Recognition (CVPR) Work-
25th ACM SIGKDD international conference on knowledge
shops, pages 3894–3898, June 2021. 1
discovery & data mining, pages 2662–2670, 2019. 1
[17] Yen-Liang Lin, Son Tran, and Larry S Davis. Fashion
[3] Kat Collings. The foolproof way to find out your real body
outfit complementary item retrieval. In Proceedings of
type. 7
the IEEE/CVF conference on computer vision and pattern
[4] Guillem Cucurull, Perouz Taslakian, and David Vazquez.
recognition, pages 3311–3319, 2020. 1
Context-aware visual compatibility prediction. In Proceed-
[18] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard
ings of the IEEE/CVF conference on computer vision and
Pons-Moll, and Michael J Black. Smpl: A skinned multi-
pattern recognition, pages 12617–12626, 2019. 1, 2
person linear model. ACM transactions on graphics (TOG),
[5] Zeyu Cui, Zekun Li, Shu Wu, Xiao-Yu Zhang, and Liang
34(6):1–16, 2015. 2, 3
Wang. Dressing as a whole: Outfit compatibility learning
based on node-wise graph neural networks. In The world [19] Zhi Lu, Yang Hu, Yan Chen, and Bing Zeng. Personalized
wide web conference, pages 307–317, 2019. 1 outfit recommendation with learnable anchors. In Proceed-
ings of the IEEE/CVF conference on computer vision and
[6] Lavinia De Divitiis, Federico Becattini, Claudio Baecchi,
pattern recognition, pages 12722–12731, 2021. 2
and Alberto Del Bimbo. Disentangling features for fash-
ion recommendation. ACM Transactions on Multimedia [20] Minh-Thang Luong, Hieu Pham, and Christopher D Man-
Computing, Communications and Applications, 19(1s):1– ning. Effective approaches to attention-based neural machine
21, 2023. 1 translation. arXiv preprint arXiv:1508.04025, 2015. 4
[7] Priya Devarajan and Cynthia L Istook. Validation of fe- [21] Dongmei Mo, Xingxing Zou, Kaicheng Pang, and Wai Ke-
male figure identification technique (ffit) for apparel soft- ung Wong. Towards private stylists via personalized com-
ware. Journal of Textile and Apparel, Technology and Man- patibility learning. Expert Systems with Applications,
agement, 4(1):1–23, 2004. 2 219:119632, 2023. 1, 5, 6
[8] Cherene Francis. Body shape calculator. 7 [22] Kaicheng Pang, Xingxing Zou, and Waikeung Wong. Dress
[9] Bin-Bin Gao and Hong-Yu Zhou. Learning to discover well via fashion cognitive learning. In British Machine Vi-
multi-class attentional regions for multi-label image recog- sion Conference (BMVC), November 2022. 1, 2, 3, 5, 6, 8
nition. IEEE Transactions on Image Processing, 30:5920– [23] Christopher J Parker, Steven George Hayes, Kathryn Brown-
5932, 2021. 5 bridge, and Simeon Gill. Assessing the female figure identi-
[10] Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S fication technique’s reliability as a body shape classification
Davis. Learning fashion compatibility with bidirectional system. Ergonomics, 64(8):1035–1051, 2021. 2
lstms. In Proceedings of the 25th ACM international con- [24] Jeffrey Pennington, Richard Socher, and Christopher D Man-
ference on Multimedia, pages 1078–1086, 2017. 3 ning. Glove: Global vectors for word representation. In
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Proceedings of the 2014 conference on empirical methods in
Deep residual learning for image recognition. In Proceed- natural language processing (EMNLP), pages 1532–1543,
ings of the IEEE conference on computer vision and pattern 2014. 4
recognition, pages 770–778, 2016. 3 [25] Sebastian Ruder. An overview of gradient descent optimiza-
[12] Shintami Chusnul Hidayati, Ting Wei Goh, Ji-Sheng Gary tion algorithms. arXiv preprint arXiv:1609.04747, 2016. 5
Chan, Cheng-Chun Hsu, John See, Lai-Kuan Wong, Kai- [26] Hosnieh Sattar, Gerard Pons-Moll, and Mario Fritz. Fashion
Lung Hua, Yu Tsao, and Wen-Huang Cheng. Dress with is taking shape: Understanding clothing preference based on
style: Learning style from joint deep embedding of clothing body shape from online sources. In 2019 IEEE winter con-
styles and body shapes. IEEE Transactions on Multimedia, ference on applications of computer vision (WACV), pages
23:365–377, 2020. 1, 2, 5, 6, 7, 8 968–977. IEEE, 2019. 1, 2
[13] Shintami Chusnul Hidayati, Cheng-Chun Hsu, Yu-Ting [27] Karla Kristin Peavy Simmons. Body shape analysis using
Chang, Kai-Lung Hua, Jianlong Fu, and Wen-Huang Cheng. three-dimensional body scanning technology. North Car-
What dress fits me best? fashion recommendation on the olina State University, 2002. 2
clothing style for personal body shape. In Proceedings of the [28] Tianyu Su, Xuemeng Song, Na Zheng, Weili Guan, Yan Li,
26th ACM international conference on Multimedia, pages and Liqiang Nie. Complementary factorization towards out-
438–446, 2018. 1, 2, 3, 5 fit compatibility modeling. In Proceedings of the 29th ACM
[14] Wei-Lin Hsiao and Kristen Grauman. Vibe: Dressing for di- international conference on multimedia, pages 4073–4081,
verse body shapes. In Proceedings of the IEEE/CVF Confer- 2021. 1, 2
8064
[29] Jie Sun, Qianyun Cai, Tao Li, Lei Du, and Fengyuan Zou.
Body shape classification and block optimization based on
space vector length. International Journal of Clothing Sci-
ence and Technology, 31(1):115–129, 2019. 2
[30] Laurens Van der Maaten and Geoffrey Hinton. Visualiz-
ing data using t-sne. Journal of machine learning research,
9(11), 2008. 7
[31] Mariya I Vasileva, Bryan A Plummer, Krishna Dusad,
Shreya Rajpal, Ranjitha Kumar, and David Forsyth. Learn-
ing type-aware embeddings for fashion compatibility. In
Proceedings of the European conference on computer vision
(ECCV), pages 390–405, 2018. 1, 2
[32] Xuewen Yang, Dongliang Xie, Xin Wang, Jiangbo Yuan,
Wanying Ding, and Pengyun Yan. Learning tuple compati-
bility for conditional outfit recommendation. In Proceedings
of the 28th ACM International Conference on Multimedia,
pages 2636–2644, 2020. 1, 2
[33] Jeong Yim Lee, Cynthia L Istook, Yun Ja Nam, and Sun
Mi Park. Comparison of body shape between usa and ko-
rean women. International Journal of Clothing Science and
Technology, 19(5):374–391, 2007. 2, 3, 7
[34] Jiawei Zhao, Ke Yan, Yifan Zhao, Xiaowei Guo, Feiyue
Huang, and Jia Li. Transformer-based dual relation graph
for multi-label image recognition. In Proceedings of the
IEEE/CVF International Conference on Computer Vision,
pages 163–172, 2021. 5, 6, 8
[35] Jiawei Zhao, Yifan Zhao, and Jia Li. M3tr: Multi-modal
multi-label recognition with transformer. In Proceedings
of the 29th ACM International Conference on Multimedia,
pages 469–477, 2021. 5, 6, 8
[36] Ke Zhu and Jianxin Wu. Residual attention: A simple but
effective method for multi-label recognition. In Proceedings
of the IEEE/CVF International Conference on Computer Vi-
sion, pages 184–193, 2021. 5, 6, 8
8065