obtain abundant and various food features from food images, like different ingredient information. This strategy can help to learn comprehensive and multiple fine-grained information as training progresses. In addition, our model incorporates richer context with multiple scales into local features via self-attention to enhance the local feature representation.

Extensive experiments on Food2K demonstrate the effectiveness of our proposed method. In addition, we provide extensive experiments comparing various state-of-the-art methods for image representation learning, including popular deep networks, fine-grained methods and existing food recognition methods. Furthermore, we also show that the networks learned on Food2K can benefit various food-relevant vision tasks, i.e., food recognition, food image retrieval, cross-modal recipe retrieval, food detection and segmentation, indicating the better generalization ability of Food2K. The networks developed on Food2K can be expected to serve as the backbone for more food-relevant vision tasks, especially emerging and more complex ones.

The contributions of our paper can be summarized as follows:
• We introduce a new large-scale, high-quality food recognition benchmark Food2K, which is the largest food image dataset with 2,000 categories and 1,036,564 images.
• We propose a deep progressive region enhancement network to learn food-oriented local features by progressive training, and further utilize self-attention to enhance local features for food recognition.
• We conduct an extensive evaluation on Food2K to verify the effectiveness of our approach, where extensive baselines on this benchmark are provided, including popular deep networks, fine-grained recognition methods and existing food recognition methods.
• We explore the ability of models trained on Food2K to transfer to various food-relevant tasks, including visual recognition, retrieval, detection, segmentation and cross-modal recipe retrieval, and demonstrate the better generality of Food2K on these tasks.

II. RELATED WORK

Food-Centric Datasets: Over the years, the size of food-centric datasets has grown steadily. For example, Bossard et al. [8] constructed the western food dataset ETH Food-101 with 101,000 images from 101 food categories. VIREO Food-172 [15] consists of 110,241 images from 172 Chinese food categories. Compared with these two datasets, FoodX-251 [11] was released with 158,846 images and 251 categories in the Fine-Grained Visual Categorization Challenge held in conjunction with CVPR 2019. However, these datasets fall short in terms of both comprehensive coverage of food categories and large-scale quantities of food images, like ImageNet [37]. Although the full set of ImageNet contains about 1,000 food-related categories [38], different from existing food datasets and Food2K, which contain food classes for direct eating, many food-relevant categories from ImageNet belong to nutrient composition (e.g., choline, vitamin), cooking methods (e.g., split, mix), kitchenware (e.g., mixer, kibble), etc. The reason is that the aim of ImageNet is generic object recognition, not particularly food recognition. Considering the important role of large-scale datasets in the continuous improvement of visual recognition algorithms, especially for deep learning based methods, we build a large-scale food recognition dataset Food2K with more comprehensive coverage of categories and a larger quantity of images. In Table I, we give statistics of existing food recognition benchmarks together with Food2K. The size of Food2K in both categories and images surpasses that of existing datasets by at least one order of magnitude. Although some datasets, such as UNICT-FD1200, have a larger number of categories, the quantity of food images for each category is very limited.

TABLE I: COMPARISON OF CURRENT FOOD RECOGNITION DATASETS

In addition, there are other food-relevant recipe datasets, such as Yummly-66K [39] and Recipe1M [17]. The best known is Recipe1M. Food2K and Recipe1M both belong to large-scale food-related datasets, but with two important differences: (1) Recipe1M is used for cross-modal embedding and retrieval between recipes and images, while the released Food2K aims at advancing scalable food visual feature learning. (2) Recipe1M mainly contains over 1 million structured cooking recipes, where each recipe is associated with some food images, while Food2K contains over 1 million images belonging to 2,000 food categories. We believe Food2K and Recipe1M are very complementary and jointly promote the development of visual analysis and understanding of food.

Food Image Analysis: The availability of more food datasets has further enabled progress in food recognition. More importantly, recognizing food directly from images is highly desirable for various food-related applications, such as nutrient assessment [5], [7], food logging [36] and self-service settlement [40]. For these reasons, we have seen an explosion of food recognition algorithms in recent years.

Although food recognition belongs to fine-grained analysis, it has its unique characteristics [1]. First, food images do not have a distinctive spatial layout. A large number of dishes have a deformable food appearance and thus lack rigid structures.
Food consists of ingredients. Ingredients from various types of food images are distributed randomly on a plate. There exists overlap among different ingredients in the same food image. Even the same ingredient may appear distinctly in different food images. Such complex ingredient distributions in food images make the task different from other ones like scene recognition, which has distinctive features such as buildings and trees. Second, food image recognition belongs to fine-grained classification, and thus shares the same problems as fine-grained classification, such as subtle differences among different categories. Existing fine-grained classification methods generally focus on discovering fixed semantic parts as one important part of the representation. However, such common semantic parts do not exist in many food categories. Therefore, we should re-design fine-grained categorization methods for food recognition.

In earlier years, various hand-crafted features, such as color, texture and SIFT, were utilized for food recognition [8], [41], [42]. In the deep learning era, because of its powerful capacity for feature representation, more and more works resort to different deep networks for food recognition [9], [10], [11], [36]. For example, Qiu et al. [9] propose PAR-Net to mine discriminative food regions to improve the classification performance. There are also some recent works on few-shot food recognition [14], [43]. For example, Zhao et al. [14] propose a fusion learning framework, which utilizes a graph convolutional network to capture inter-class relations between image representations and semantic embeddings of different categories for both few-shot and many-shot food recognition. In addition, there are many works [26], [28], [35], [44], [45] which introduce additional context information, e.g., GPS and ingredient information, to improve the recognition performance. For example, Zhou et al. [28] mine rich relationships among ingredients and restaurant information through a bipartite graph for food recognition. Min et al. [35] utilize ingredients as an additional supervised signal to localize multiple informative regions and fuse these regional features into the final representation for recognition. Different from these works, considering the characteristics of food images, we adopt a progressive training strategy to learn comprehensive and multiple local features, and further utilize self-attention to incorporate richer contexts with multiple scales to enhance local features. Like general object image analysis, which includes recognition, detection and segmentation, in addition to food recognition there are also some works on food image detection and segmentation towards more accurate nutritional information extraction [40], [46]. However, progress is still slow due to the lack of food-relevant datasets and the complex ingredient distributions in food images.

These above-mentioned food visual analysis methods generally use RGB food images, and thus probably do not achieve satisfactory performance for further nutritional content prediction since the volume information is not obtained. For that, RGB-D food analysis benchmarks for nutritional evaluation have been built [6], where additional depth images can be used to estimate the food volume for improved prediction of calories and macronutrients.

Multimodal Food Learning: Multimodal food image-recipe text joint learning is another food related topic in this community. There are mainly two lines for this study. The first line is cross-modal food image-recipe text retrieval [17], [47], [48], [49], [50], where the main idea is to learn a joint embedding of food images and recipes to support retrieval between food images and recipes. Due to the complex visual appearance of food images, effective visual feature learning is still one key. In addition, various types of text annotations should be further considered. The recent work [51] proposed a hierarchical transformer to achieve better cross-modal retrieval performance. The other line is cross-modal food image or cooking recipe generation [18], [52], [53], [54], [55]. For example, Wang et al. [18] proposed a structure-aware generation network to generate cooking instructions based on only food images and ingredients. All of these works also involve visual feature learning from food images. Constructing large annotated food image datasets can help the development of food visual recognition models, and also supports multimodal food learning tasks.

Food Computing: Food computing [1] has raised great interest recently for its various applications in health, culture, etc. It contains different tasks, such as food recognition [9], [11], [14], detection [40], segmentation [46], retrieval [47], [48], [50] and generation [18], [56]. Among these tasks, food recognition is an important and basic task for further supporting more complex food-relevant vision and multimodal tasks. Therefore, constructing large-scale food recognition datasets and designing advanced food recognition algorithms on these large-scale datasets are vital for the development of food computing.

III. FOOD2K DATASET

A. Dataset Construction

We collected this dataset from the catering website Meituan,1 which holds a huge number of food images including both eastern and western ones. We first obtained a large-scale raw dataset S_o from this website after application and authorization from Meituan. It consists of about seventy million food images uploaded by both catering staff from take-out online restaurants and users of Meituan. We then processed S_o to complete the Food2K construction in three phases, namely food category vocabulary construction, food image collection and labeling.

Building a Vocabulary of Food Categories: Considering that it is hard to obtain a comprehensive and standard food label list, we adopted a bottom-up method to construct the category vocabulary from the noisy and redundant labels associated with images in S_o. We first removed special characters and quantifiers, and then aggregated different labels belonging to the same food into one category according to the synonym set of food built by Meituan, leading to one set of food labels V_o. However, there still exist many labels corresponding to the same category. We thus carried out a secondary label aggregation. Considering that manually judging whether two food labels belong to the same dish or not is time-consuming and difficult, we conducted the following iterative process: we first selected one food label from V_o and manually selected several of its representative images

1 Meituan is China's leading e-commerce platform for services, especially well known in catering, which is similar to Yelp.
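To illustrate what the first aggregation pass looks like in practice, the sketch below normalizes raw label strings and groups them through a synonym dictionary. It is only a hypothetical illustration: the cleaning rules and the `SYNONYMS` map are placeholders we introduce, not the rules or synonym set actually used to build Food2K.

```python
import re
from collections import defaultdict

# Placeholder synonym map: each surface label points to a canonical dish name.
SYNONYMS = {"tomato and egg stir-fry": "scrambled eggs with tomato",
            "egg with tomato": "scrambled eggs with tomato"}

def normalize(label):
    """Strip special characters and simple quantifiers from a raw label."""
    label = re.sub(r"[^\w\s]", " ", label.lower())
    label = re.sub(r"\b(\d+|small|medium|large|portion)\b", " ", label)
    return re.sub(r"\s+", " ", label).strip()

def aggregate(raw_labels):
    """Group raw labels into candidate categories via the synonym set."""
    categories = defaultdict(set)
    for raw in raw_labels:
        name = normalize(raw)
        categories[SYNONYMS.get(name, name)].add(raw)
    return categories
```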
Fig. 1. (a) Some categories from Food2K and (b) an example of collecting images via labels and image-based retrieval.
Fig. 5. (a) Datasets distributed across number of images and categories and (b) distributions of categories on Food2K and typical datasets.
Fig. 6. (a) Various visual appearances for one category and (b) one example with more fine-grained annotation in Food2K.
Fig. 7. The framework of PRENet. (a) Global Feature Learning branch, which learns the global super-class features. (b) Progressive Local Feature Learning branch, which captures complementary multi-scale local features through a progressive training strategy. (c) Region Feature Enhancement branch, which incorporates contexts into local features through self-attention. Note that the predicted scores by classifier A in (a) and B in (b) are combined for the final prediction.
feature representation. Then we fuse the enhanced local features and the global ones from global feature learning into a unified one via the concat layer. During training, after progressively training the network stage by stage, we then train the whole network with the concat part, and further introduce the KL-divergence to increase the difference between stages for capturing more detailed features. For inference, considering the complementary outputs from each stage and the concatenated features, we combine their prediction results for the final food classification.
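To make this training and inference procedure concrete, the following is a minimal PyTorch-style sketch of a PRENet-like objective (our own illustration under stated assumptions, not the authors' released code): every stage classifier and the classifier on the concatenated features receives a cross-entropy term, the KL divergence between adjacent stage predictions is maximized so that stages do not collapse onto the same distribution, and the predicted scores are combined at inference. The names `stage_logits`, `concat_logits` and the weight `kl_weight` are placeholders we introduce, not identifiers from the paper.

```python
import torch.nn.functional as F

def prenet_loss(stage_logits, concat_logits, labels, kl_weight=1.0):
    """Hedged sketch of a PRENet-style objective.

    stage_logits : list of [B, num_classes] tensors, one per local stage.
    concat_logits: [B, num_classes] logits from the concatenated
                   global + local features.
    """
    # Classification loss for every stage output and for the concat output.
    loss = sum(F.cross_entropy(s, labels) for s in stage_logits)
    loss = loss + F.cross_entropy(concat_logits, labels)

    # KL divergence between adjacent stage distributions; subtracting it
    # maximizes the divergence so each stage attends to different details.
    for a, b in zip(stage_logits[:-1], stage_logits[1:]):
        kl = F.kl_div(F.log_softmax(a, dim=1), F.softmax(b, dim=1),
                      reduction="batchmean")
        loss = loss - kl_weight * kl
    return loss

def prenet_predict(stage_logits, concat_logits):
    """Combine the per-stage and concatenated predictions for inference."""
    scores = sum(F.softmax(s, dim=1) for s in stage_logits)
    return (scores + F.softmax(concat_logits, dim=1)).argmax(dim=1)
```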
A. Global-Local Feature Learning

Although food image recognition is a fine-grained visual recognition task, food images under different super-classes have obvious discriminative visual differences, and thus can be better recognized by a global representation. Those in different sub-classes under the same super-class have high inter-class similarity, as shown in Fig. 6(b), and we thus should pay more attention to more fine-grained local features.

Therefore, we extract and fuse both the global representation and their subtle visual differences. We use two sub-networks to extract the global and local features of food images, respectively. These two sub-networks can be two separate networks. However, in our network they share most of the layers of the same backbone for efficiency.

Global Feature Learning: Inspired by [59], based on an existing network (e.g., ResNet), for the output f^g of the last convolutional layer, we use Global Average Pooling (GAP) to extract the global feature f_{Glo}:

f_{Glo} = \mathrm{GAP}(f^g)   (1)

Progressive Local Feature Learning: The local feature sub-network aims to learn discriminative fine-grained features of food. Due to the diverse ingredients and cooking styles, the discriminative parts of the food image are multi-scale and irregular. As the first contribution, we adopt a progressive training strategy to solve this problem. In this strategy, we first train the lower stages, which have small receptive fields, then zoom out to a larger field surrounding this local region, and finish when we reach the whole image. This training strategy forces our model to extract finer discriminative local features, such as ingredient-relevant ones. After this process, we extract features from different layers to obtain multi-scale feature representations. Specifically, for the output f^i of each stage of the local feature sub-network, we use a convolutional block and a Global Maximum Pooling (GMP) layer to get its local feature vector f^i_{Loc}:

f^i_{Loc} = \mathrm{GMP}(f^i)   (2)

where f^i denotes the output from the ith stage of the network. Therefore, this strategy first learns more stable fine-grained information in shallower layers and gradually shifts attention to learning coarse-grained information in deeper layers as training progresses. Specifically, it can extract discriminative local fine-grained features, such as ingredients, when features with different granularities are sent to the network. However, simply using the progressive training strategy will not yield diverse fine-grained features, because the multi-scale information learned via progressive training may focus on similar regions. As the second contribution, we optimize the KL divergence between features from different stages to increase the difference between them and solve this problem. By maximizing the KL divergence between features from different stages, we force the multi-scale features to focus on different areas in different stages, which helps capture as many details as possible.

Particularly, we divide the training process into S steps, and train the first U − S + i stages at step i, where U is the total number of stages of the network. Because we concatenate all the global and local features in the final stage (also called the concat stage), the total number of steps is S + 1. In our method, for the output f^i_{Loc} from the ith stage of the network to be trained in each step of progressive training, F^i is used to process the output features. F^i consists of a convolutional layer, a batch normalization layer and a ReLU unit. Then we obtain the local feature representation c^i = F^i(f^i_{Loc}).

Furthermore, for multiple local features from different stages, we utilize the KL-divergence to increase the difference between stages, which helps capture as many details as possible. Under the progressive training strategy, the visual features extracted from different stages are projected to specific probability distributions which represent the visual semantic information. However, the iterative optimization may result in different probability distributions converging to the same distribution, which impairs the ability to extract diverse features. The KL-divergence can measure the similarity between two distributions. By maximizing the KL-divergence, this convergence can be suppressed and more fine-grained visual features can be extracted for recognition. The KL-divergence is calculated over the global features of every two adjacent outputs in each batch, where the reduction mode of the KL-divergence is batchmean:

L_{KL}(y_i, y_j) = \sum_{i=1}^{U} \sum_{j=U-i}^{U} y_i \log \frac{y_i}{y_j}   (3)

where y_i and y_j are the output distributions from different stages.

B. Region Feature Enhancement

Different from general fine-grained tasks, food images do not have fixed semantic information [1]. Most existing food recognition methods [10], [35] mine these discriminative features directly, ignoring the relationship between local features. Therefore, we adopt a self-attention mechanism to capture the relationship between different local features. This strategy aims to capture the co-occurring food features in the feature map. It is a revised non-local interaction [60] within the same-level feature map, and the output feature map has the same scale as its input. Specifically, we first extract the local feature representation f_{Loc} of the last S stages, and then obtain the enhanced features via self-attention as follows:

q^{(i)} = \mathrm{Conv}(f^{(i)}_{Loc})
k^{(j)}, v^{(j)} = \mathrm{Conv}(f^{(j)}_{Loc})
S_{i,j} = \mathrm{Softmax}\left( q^{(i)} {k^{(j)}}^{T} / \sqrt{d_{k^{(j)}}} \right)
f_{IJ} = \sum_{j} S_{i,j} v^{(j)}   (4)

where f^{(i)}_{Loc} and f^{(j)}_{Loc} are the ith and jth feature positions in f_{Loc}, q^{(i)} is the ith query, k^{(j)}, v^{(j)} are the jth key/value pair, d_{k^{(j)}} is the dimension of k^{(j)}, and S_{i,j} denotes the similarity between q^{(i)} and k^{(j)}. Thus we obtain the enhanced feature f_{IJ}. Finally, we concatenate these enhanced feature maps of the same size, and use convolutional layers to convert them into the same dimension.
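The enhancement in (4) is scaled dot-product self-attention over the spatial positions of a stage's feature map. Below is a minimal single-map sketch of that idea in PyTorch (our illustration; the use of 1x1 convolutions for Conv(·), the layer names `to_q`/`to_k`/`to_v` and the single shared map are simplifying assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class RegionEnhancement(nn.Module):
    """Hedged sketch of self-attention over spatial positions of a feature map."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions play the role of Conv(.) in Eq. (4).
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_loc):
        # f_loc: [B, C, H, W] local feature map from one stage.
        b, c, h, w = f_loc.shape
        q = self.to_q(f_loc).flatten(2).transpose(1, 2)   # [B, HW, C]
        k = self.to_k(f_loc).flatten(2)                   # [B, C, HW]
        v = self.to_v(f_loc).flatten(2).transpose(1, 2)   # [B, HW, C]
        # S_{i,j} = softmax(q_i k_j^T / sqrt(d_k)) over all positions j.
        attn = torch.softmax(q @ k / (c ** 0.5), dim=-1)  # [B, HW, HW]
        out = attn @ v                                    # [B, HW, C]
        # The output keeps the same scale as the input feature map.
        return out.transpose(1, 2).reshape(b, c, h, w)
```

In PRENet the enhancement is applied to the feature maps of the last S stages, whose outputs are then concatenated and projected to a common dimension; the sketch handles one map for brevity.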
TABLE II
COMPARISON OF OUR APPROACH (PRENET) TO BASELINES ON FOOD2K (%)
Fig. 9. Ablation study of PRENet on Food2K: (a) Different components. (b) Different number of learning stages K. (c) Each learning stage. (d) Different balance
parameters (α, β).
TABLE III
PERFORMANCE COMPARISON ON ETH FOOD-101 (%)
TABLE IV
RESULTS OF TRANSFERRING VISUAL REPRESENTATIONS LEARNED ON FOOD2K TO THREE DATASETS (%)
For example, our method outperforms typical fine-grained methods with ResNet50 as the backbone. Compared with existing food recognition methods, such as MSMVFA [12], our method also obtains the highest Top-1 classification accuracy of 90.74%. When we use the backbone model trained on Food2K, namely PRENet (SENet154+Pretrained), there is a further performance improvement.

B. Generalization Ability of Food2K

In this section, we conduct a comprehensive evaluation of the generalization ability of Food2K on various vision and multimodal tasks, including food recognition, food image retrieval, cross-modal recipe retrieval, food detection and segmentation.

1) Food Recognition: We assess the generalization of models learned on Food2K to ETH Food-101. In addition, we also conduct the evaluation on another two datasets from the multimedia field, namely Vireo Food-172 and ISIA Food-500. All presented experiments follow the same training-test splits as the original papers. Representative methods including baseline networks, fine-grained recognition methods and food recognition methods are used for evaluation. For each method, there are two settings. Take VGG16 for example: VGG16 denotes that we use the target dataset to fine-tune the ImageNet pre-trained network, while VGG16 + Fine-tuned on Food2K denotes that we first use Food2K to fine-tune the ImageNet pre-trained network, and then fine-tune it on the target dataset for evaluation.
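As a concrete reading of the "+ Fine-tuned on Food2K" setting, the sketch below chains two ordinary fine-tuning runs (ImageNet, then Food2K, then the target dataset). It is only an illustration: the optimizer settings, epoch counts and data loaders are placeholders we introduce, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

def finetune(model, loader, num_classes, epochs=30, lr=1e-3, device="cuda"):
    """Replace the classifier head and fine-tune the whole network."""
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            nn.functional.cross_entropy(model(images), labels).backward()
            optimizer.step()
    return model

def food2k_transfer(food2k_loader, target_loader, target_classes):
    """Setting 'X + Fine-tuned on Food2K': ImageNet -> Food2K -> target dataset."""
    model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    model = finetune(model, food2k_loader, num_classes=2000)   # 2,000 Food2K classes
    return finetune(model, target_loader, num_classes=target_classes)
```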
We report experimental results in Table IV. From Table IV, we can see that all the transferred features are better than training on the target datasets alone. For example, the network fine-tuned on Food2K improves Top-1 classification accuracy by 7.33%, 2.58%, 2.31%, 2.61% and 2.74% for VGG16, ResNet152, Inception V3, DenseNet161 and SENet154, respectively, on Vireo Food-172. These results show that features learned on Food2K generalize well to food recognition. The average performance improvement over several popular networks, including VGG16, ResNet152, Inception V3, DenseNet161 and SENet154, is 1.68%, 3.51% and 3.41% in Top-1 classification accuracy for ETH Food-101, Vireo Food-172 and ISIA Food-500, respectively, indicating that the higher performance gains are on Vireo Food-172 and ISIA Food-500 while the lowest performance gain is on ETH Food-101. The probable reason is as follows: both ISIA Food-500 and Food2K are Misc. (including both eastern and western cuisines), so their domain gap is relatively small. In addition, there is a larger proportion of eastern food categories in Food2K and Vireo Food-172 consists of eastern cuisines, also resulting in a smaller domain gap between Food2K and Vireo Food-172. For more complicated networks, we also observe higher gains on Vireo Food-172. Taking PAR-Net as an example, the performance improvement is 0.63%, 1.32% and 0.74% in Top-1 classification accuracy for ETH Food-101, Vireo Food-172 and ISIA Food-500, respectively.

We further compare the performance of transfer learning from different food datasets, and discuss the recognition performance of models trained on different ratios of Food2K. Details are in the supplementary materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2023.3237871.

2) Food Image Retrieval: We further validate the generalization ability of Food2K on food image retrieval. This task is to find food images in a database whose content is relevant to a given query. We conduct the evaluation on ETH Food-101, Vireo Food-172 and ISIA Food-500 with the same training-test splits as the original papers.

TABLE V: EVALUATING VISUAL REPRESENTATION LEARNED FROM ETH FOOD-101 AND FOOD2K FOR RETRIEVAL ON THREE DATASETS (%)
Each image in the test set is used in turn as the query, and the retrieval set is formed by all the remaining images of the test set. We initialize the networks using the convolutional layers of ResNet101. The models are optimized using Adam with an initial learning rate l_0 = 2 × 10^−5, an exponential decay l_0 exp(−0.1 i) over epoch i, momentum 0.9, weight decay 5 × 10^−4, margin τ = 0.85 for the contrastive and triplet losses, and a batch size of 32. All training images are resized to a maximum size of 362 × 362 while keeping the original aspect ratio [89]. mAP and Recall@1 are used as the evaluation metrics. The following methods are used: fine-tuning the network using the cross-entropy loss, and metric-learning based methods using the contrastive loss [87] and the triplet loss [88], respectively. ResNet101 is used as the backbone network for all methods.
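For reference, the retrieval fine-tuning configuration described above can be wired up as in the sketch below. It is our own simplification: the L2-normalized global-average-pooled embedding head and the way triplets are sampled are assumptions, while the margin, learning-rate decay and weight decay follow the values stated in the text.

```python
import math
import torch
import torch.nn as nn
from torchvision import models

class RetrievalNet(nn.Module):
    """ResNet101 trunk with an L2-normalized pooled embedding (assumed head)."""

    def __init__(self):
        super().__init__()
        trunk = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(trunk.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        emb = self.pool(self.features(x)).flatten(1)
        return nn.functional.normalize(emb, dim=1)

model = RetrievalNet()
criterion = nn.TripletMarginLoss(margin=0.85)            # tau = 0.85
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=5e-4)
# l0 * exp(-0.1 * epoch): call scheduler.step() once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=math.exp(-0.1))
```

Each training step embeds an (anchor, positive, negative) triplet of food images and calls criterion(model(a), model(p), model(n)); the contrastive-loss variant only swaps the criterion.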
The results in Table V show that the methods using the backbone fine-tuned from Food2K all achieve performance gains of various degrees on these benchmarks. The improvement on Vireo Food-172 is the highest: the average improvement of these methods is 4.04%, 5.28% and 4.16% in mAP for ETH Food-101, Vireo Food-172 and ISIA Food-500, which is consistent with the performance trend in food recognition. Further observation on Vireo Food-172 shows that when the performance of the base model is very high (probably close to saturation), the improvement from the model fine-tuned on Food2K is relatively small even when the domain gap is small. For example, there is only a 0.51% improvement for the triplet loss on Vireo Food-172. In contrast, a 4.32% improvement is obtained on ETH Food-101. Note that there is no performance improvement from using additional ETH Food-101 data for metric-learning methods in our context. We speculate that metric-learning methods are more sensitive to the larger domain gap between ETH Food-101 and these two target datasets (especially Vireo Food-172), which indirectly indicates that the higher diversity of categories and larger scale of images in Food2K enable better and more stable generalization on food image retrieval.

3) Cross-Modal Recipe Retrieval: We evaluate the generalization of Food2K on cross-modal recipe retrieval on Recipe1M, which is currently a popular task in the computer vision community [47]. We adopt the original data splits with 238,999 image-recipe pairs, 51,119 pairs and 51,303 pairs for training, validation and testing, respectively, and a similar experimental setup [47]. The evaluation metrics are the median retrieval rank (MedR) and the recall percentage at top K (Recall@K), i.e., the percentage of queries for which the matching answer is included in the top K results (K = 1, 5, 10).
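MedR and Recall@K follow directly from the rank of the matching item for each query; a short reference implementation of the two metrics is sketched below (our illustration, where `ranks` holds the 1-based position of the ground-truth match for each query).

```python
import numpy as np

def medr_and_recall(ranks, ks=(1, 5, 10)):
    """ranks: 1-based rank of the ground-truth item for each query."""
    ranks = np.asarray(ranks)
    medr = float(np.median(ranks))
    recall = {k: float(np.mean(ranks <= k)) * 100.0 for k in ks}
    return medr, recall

# Example: five queries whose correct matches were ranked 1, 3, 12, 2 and 7.
print(medr_and_recall([1, 3, 12, 2, 7]))   # (3.0, {1: 20.0, 5: 60.0, 10: 80.0})
```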
Table VI reports the experimental results of different methods using backbones from ImageNet, ETH Food-101 and Food2K, and shows that (1) all the methods obtain improvements with the models from ETH Food-101 and Food2K in MedR and R@K, and (2) the performance gain from Food2K is higher than that from ETH Food-101. These experimental results prove that the backbone trained on Food2K is more helpful for visual food embedding learning, and thus improves cross-modal embedding learning. This is because of the diversity and scale of Food2K. Note that for JE [47] on the 10k test size, because the original literature only uses MedR for comparison, we also adopt this metric for consistent comparison.

4) Food Detection: We assess the generalization of models from Food2K on food detection, where the task is to detect food items on food trays. For comparison, we also conduct the evaluation with models trained on ETH Food-101 for this task. As the food detection model is supposed to have the ability to detect negative samples (e.g., background), we add 2,500 non-food instance samples from [30] to ETH Food-101 and Food2K to fine-tune the backbone models for detection. We conduct the evaluation on two available datasets, UNIMIB2016 [98] and Oktoberfest [99], with their original training-test splits. Fig. 11 shows some examples. We train all the detectors using stochastic gradient descent with a batch size of 8. The learning rate is 10^−3 and 5 × 10^−3 on UNIMIB2016 and Oktoberfest, respectively, and the input image size is set to 512 × 512. The COCO-style mAP, AP50 and AP75 are adopted as metrics, where the IoU (Intersection over Union) thresholds for mAP, AP50 and AP75 are 0.50:0.05:0.95, 0.5 and 0.75, respectively. Two single-stage detectors (SSD [92], RetinaNet [93]) and four two-stage detection models (Faster R-CNN [94], PAN [95], Cascade R-CNN [96], Dynamic R-CNN [97]) with the same backbones mentioned in their papers are used for evaluation.
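For readers unfamiliar with the COCO protocol, the sketch below spells out the IoU computation and the 0.50:0.05:0.95 threshold sweep behind mAP, AP50 and AP75 (a generic illustration, not the evaluation code used in the paper).

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union for [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# COCO-style mAP averages AP over these ten IoU thresholds;
# AP50 and AP75 are the single-threshold cases at 0.50 and 0.75.
COCO_IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)
```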
Table VII reports the detection results of different methods using backbones from ImageNet, ETH Food-101 and Food2K, and shows that (1) all the methods obtain improvements with the models from ETH Food-101 and Food2K in mAP and AP75, and (2) the performance gain from Food2K is higher than that from ETH Food-101, indicating the advantage of Food2K in both categories and image numbers. Particularly, the average mAP of all detectors from Food2K on UNIMIB2016 (66.4%) is higher (by 0.9%) than that from ETH Food-101 (65.5%). A similar trend can be found on Oktoberfest.
TABLE VI
EVALUATING VISUAL REPRESENTATION LEARNED FROM ETH FOOD-101 AND FOOD2K FOR CROSS-MODAL RECIPE RETRIEVAL ON THE RECIPE1M DATASET (%)
TABLE VII
EVALUATING VISUAL REPRESENTATION LEARNED FROM ETH FOOD-101 AND FOOD2K FOR FOOD DETECTION ON UNIMIB2016 AND OKTOBERFEST(%)
It can be explained that this is because of the diversity of food categories present in Food2K. With the supplement of more new food categories from Food2K, the trained backbones can generalize better to food detection. In addition, considering that UNIMIB2016 provides more accurate annotations than Oktoberfest and that the environment background of UNIMIB2016 is steadier than that of Oktoberfest, the backbone delivers better visual food features, allowing the detectors to gain more obvious advantages in judging the categories of food instances after selecting positive proposals from the background and resulting in a better performance gain on UNIMIB2016 than on Oktoberfest. In addition, since two-stage models depend more on proposals based on accurately extracted features, they obtain a bigger increase in precision than the two single-stage detectors. The average mAP growth of the four two-stage models is 2.1% and the average mAP growth of the two single-stage models
TABLE VIII
EVALUATING VISUAL REPRESENTATION LEARNED FROM ETH FOOD-101 AND
FOOD2K FOR FOOD SEGMENTATION ON THE UEC-FOODPIX COMPLETE
DATASET (%)
Fig. 11. Comparison of food detection results via Dynamic R-CNN trained
on ETH Food-101 and Food2K.
However, they also fail to obtain better performance in large-scale food recognition on Food2K. We speculate that there are more complex visual patterns of food generated from different ingredients, accessories and arrangements as both the diversity and scale of the food data increase, and these methods are not suitable or robust for this case. As one initial attempt, we combined progressive training and self-attention to learn more stable and discriminative global and local features, resulting in good performance. More methods are worth further exploration. For example, transformers have recently made a tremendous impact in image recognition [106], where their performance is higher than that of CNNs on large-scale datasets. Food2K can provide sufficient training data to develop transformer-based food recognition methods and improve their performance.

2) Human vision evaluation on food recognition: Conducting human vision research on Food2K is also an interesting topic to study. Compared with human vision research on generic object recognition, it is probably more difficult to conduct such an evaluation on food recognition. For example, food has strong regional and cultural characteristics, and human subjects from different regions thus have a stronger bias in food recognition. Recent work [107] gives an initial empirical comparison between the human visual system and CNNs in the food recognition task. In order to avoid information overburden, the number of dishes to learn was restricted to 16 different types of food for the human subjects. More interesting problems can be further explored. For example, what is the upper bound of human performance on food recognition? What are the respective advantages and disadvantages of the human vision system and CNNs in recognizing food types and numbers of categories? Moreover, knowledge from other fields, e.g., food science, is probably needed for further explanation of experimental results.

3) Cross-X transfer learning for food recognition: We have verified the generalization of Food2K in various vision and multimodal tasks. We can study transfer learning from more aspects in the future. For example, food has its own geographical and cultural attributes. We can conduct cross-cuisine transfer learning. That means we use models trained on eastern cuisines for performance analysis on western cuisines, and vice versa. After more fine-grained scenario annotation, such as region-level or even restaurant-level annotation, we can conduct cross-scenario transfer learning for food recognition. In addition, we can also study cross super-class transfer learning for food recognition. For example, we can use models trained on the seafood super-class for performance analysis on the meat super-class. These interesting problems are worth deep exploration.

4) Large-Scale Few-Shot Food Recognition (LS-FSFR): Recently, there have been some works on few-shot food recognition on small/medium-scale food categories [14], [43]. In contrast, LS-FSFR is a more realistic task that aims to identify hundreds of novel food categories without forgetting those categories, where each novel category has only a few samples [108]. Food2K provides such a large-scale food dataset and test benchmark to support this task. In addition, the constructed food ontology can also help the method design of LS-FSFR as prior knowledge.

5) More applications on Food2K: We have verified the better generalization ability of Food2K on various tasks, including food image recognition, food image retrieval, cross-modal recipe retrieval, food detection and segmentation in this paper. Furthermore, Food2K can also support more novel applications. Food image generation is one novel and interesting application: it can synthesize new food images which are similar to those in real-life scenarios via Generative Adversarial Networks (GANs) [109]. For example, Zhu et al. [54] can generate highly realistic and semantically consistent images from given ingredients and instructions. Another work [52] aims to teach a machine how to make a pizza by building a generative model that mirrors the step-by-step procedure. Different GANs such as Lightweight GAN [110] can also be used to generate synthetic food images based on Food2K. Please refer to the supplementary materials, available online, for more details about the evaluation of food image generation on Food2K.

6) Extension of Food2K for more tasks: Researchers are encouraged to apply models trained on Food2K to more food-relevant tasks. Moreover, we hope Food2K will evolve over time. Considering that some works [15] have shown that ingredients can improve recognition performance, we plan to extend Food2K by providing richer attribute annotation to support food recognition at different semantic levels. We can also conduct region-level and pixel-level annotation on Food2K to enable a broader range of applications. In addition, we can also conduct some novel tasks, such as aesthetic assessment of food images, via annotating aesthetic attribute labels on Food2K [111].

VI. CONCLUSION

In this paper, we present Food2K, which has a larger data volume, larger category coverage and higher diversity compared with existing datasets and can serve as a new benchmark for scalable food recognition. It can benefit various vision and multimodal tasks, including food recognition, retrieval, detection, segmentation, and cross-modal recipe retrieval, owing to its better generalization ability. To date, Food2K is the largest food recognition dataset in diversity and scale. We believe it will enable the development of large-scale food recognition methods, and also help researchers utilize Food2K for future research on more food-relevant tasks, such as large-scale few-shot food recognition and transfer learning for food recognition from various aspects, such as cross-scenario, cross-cuisine and cross-super-class transfer learning.
REFERENCES [26] R. Xu, L. Herranz, S. Jiang, S. Wang, X. Song, and R. Jain, “Geolocalized
modeling for dish recognition,” IEEE Trans. Multimedia, vol. 17, no. 8,
[1] W. Min, S. Jiang, L. Liu, Y. Rui, and R. Jain, “A survey on food pp. 1187–1199, Aug. 2015.
computing,” ACM Comput. Surv., vol. 52, no. 5, pp. 1–36, 2019. [27] G. M. Farinella, D. Allegra, and F. Stanco, “A benchmark dataset to
[2] R. G. Boswell, W. Sun, S. Suzuki, and H. Kober, “Training in cognitive study the representation of food images,” in Proc. Eur. Conf. Comput.
strategies reduces eating and improves food choice,” Proc. Nat. Acad. Vis., 2014, pp. 584–599.
Sci., vol. 115, no. 48, pp. E11 238–E11 247, 2018. [28] F. Zhou and Y. Lin, “Fine-grained image classification by exploring
[3] T. David and C. Michael, “Global diets link environmental sustainability bipartite-graph labels,” in Proc. Conf. Comput. Vis. Pattern Recognit.,
and human health,” Nature, vol. 515, no. 7528, pp. 518–22, 2014. 2016, pp. 1124–1133.
[4] P. Rozin, “The selection of foods by rats, humans, and other animals,” in [29] M. Merler, H. Wu, R. Uceda-Sosa, Q.-B. Nguyen, and J. R. Smith, “Snap,
Advances in the Study of Behavior, vol. 6. New York, NY, USA: Academic eat, repeat: A food recognition engine for dietary logging,” in Proc. Int.
Press, 1976, pp. 21–76. Workshop Multimedia Assist. Dietary Manage., 2016, pp. 31–40.
[5] A. Meyers et al., “Im2Calories: Towards an automated mobile vision [30] A. Singla, L. Yuan, and T. Ebrahimi, “Food/non-food image classi-
food diary,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 1233–1241. fication and food categorization using pre-trained googlenet model,”
[6] Q. Thames et al., “Nutrition5k: Towards automatic nutritional under- in Proc. Int. Workshop Multimedia Assist. Dietary Manage., 2016,
standing of generic food,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern pp. 3–11.
Recognit., 2021, pp. 8903–8911. [31] G. M. Farinella, D. Allegra, M. Moltisanti, F. Stanco, and S. Battiato,
[7] Y. Lu, T. Stathopoulou, M. F. Vasiloglou, S. Christodoulidis, Z. Stanga, “Retrieval and classification of food images,” Comput. Biol. Med., vol. 77,
and S. Mougiakakou, “An artificial intelligence-based system to as- pp. 23–39, 2016.
sess nutrient intake for hospitalised patients,” IEEE Trans. Multimedia, [32] G. Ciocca, P. Napoletano, and R. Schettini, “Learning CNN-based fea-
vol. 23, pp. 1136–1147, 2021. tures for retrieval of food images,” in Proc. Int. Conf. Image Anal.
[8] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discrim- Process., 2017, pp. 426–434.
inative components with random forests,” in Proc. Eur. Conf. Comput. [33] X. Chen, H. Zhou, and L. Diao, “ChineseFoodNet: A large-scale image
Vis., 2014, pp. 446–461. dataset for Chinese food recognition,” 2017, arXiv: 1705.02743.
[9] J. Qiu, F. P.-W. Lo, Y. Sun, S. Wang, and B. Lo, “Mining discriminative [34] S. Hou, Y. Feng, and Z. Wang, “VegFru: A domain-specific dataset for
food regions for accurate food recognition,” in Proc. Brit. Mach. Vis. fine-grained visual categorization,” in Proc. Int. Conf. Comput. Vis., 2017,
Conf., 2019, pp. 588–598. pp. 541–549.
[10] N. Martinel, G. L. Foresti, and C. Micheloni, “Wide-slice residual net- [35] W. Min, L. Liu, Z. Luo, and S. Jiang, “Ingredient-guided cascaded
works for food recognition,” in Proc. IEEE/CVF Winter Conf. Appl. multi-attention network for food recognition,” in Proc. ACM Int. Conf.
Comput. Vis., 2018, pp. 567–576. Multimedia, 2019, pp. 99–107.
[11] P. Kaur, K. Sikka, W. Wang, S. J. Belongie, and A. Divakaran, “Foodx- [36] D. Sahoo et al., “FoodAI: Food image recognition via deep learning for
251: A dataset for fine-grained food classification,” in Proc. Conf. Com- smart food logging,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov.
put. Vis. Pattern Recognit. Workshop, 2019. Data Mining, 2019, pp. 2260–2268.
[12] S. Jiang, W. Min, L. Liu, and Z. Luo, “Multi-scale multi-view deep feature [37] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “ImageNet: A
aggregation for food recognition,” IEEE Trans. Image Process., vol. 29, large-scale hierarchical image database,” in Proc. Conf. Comput. Vis.
pp. 265–276, 2020. Pattern Recognit., 2009, pp. 248–255.
[13] L. Deng et al., “Mixed-dish recognition with contextual relation net- [38] K. Yanai and Y. Kawano, “Food image recognition using deep convo-
works,” in Proc. ACM Int. Conf. Multimedia, 2019, pp. 112–120. lutional network with pre-training and fine-tuning,” in Proc. IEEE Int.
[14] H. Zhao, K.-H. Yap, and A. Chichung Kot, “Fusion learning using se- Conf. Multimedia Expo Workshops, 2015, pp. 1–6.
mantics and graph convolutional network for visual food recognition,” in [39] W. Min, B.-K. Bao, S. Mei, Y. Zhu, Y. Rui, and S. Jiang, “You
Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 1711–1720. are what you eat: Exploring rich recipe information for cross-region
[15] J. Chen and C.-W. Ngo, “Deep-based ingredient recognition for cooking food analysis,” IEEE Trans. Multimedia, vol. 20, no. 4, pp. 950–964,
recipe retrieval,” in Proc. ACM Int. Conf. Multimedia, 2016, pp. 32–41. Apr. 2018.
[16] W. Min, L. Liu, Z. Wang, Z. Luo, X. Wei, and X. Wei, “ISIA Food- [40] E. Aguilar, B. Remeseiro, M. Bola nos, and P. Radeva, “Grab, Pay and Eat:
500: A dataset for large-scale food recognition via stacked global- Semantic food detection for smart restaurants,” IEEE Trans. Multimedia,
local attention network,” in Proc. ACM Int. Conf. Multimedia, 2020, vol. 20, no. 12, pp. 3266–3275, Dec. 2018.
pp. 393–401. [41] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food recognition
[17] J. Marín et al., “Recipe1M+: A dataset for learning cross-modal embed- using statistics of pairwise local features,” in Proc. Conf. Comput. Vis.
dings for cooking recipes and food images,” IEEE Trans. Pattern Anal. Pattern Recognit., 2010, pp. 2249–2256.
Mach. Intell., vol. 43, no. 1, pp. 187–203, Jan. 2021. [42] N. Martinel, C. Piciarelli, and C. Micheloni, “A supervised extreme learn-
[18] H. Wang, G. Lin, S. C. H. Hoi, and C. Miao, “Structure-aware generation ing committee for food recognition,” Comput. Vis. Image Understanding,
network for recipe generation from images,” in Proc. Eur. Conf. Comput. vol. 148, pp. 67–86, 2016.
Vis., 2020, pp. 359–374. [43] S. Jiang, W. Min, Y. Lyu, and L. Liu, “Few-shot food recognition via
[19] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, multi-view representation learning,” ACM Trans. Multimedia Comput.,
“PFID: Pittsburgh fast-food image dataset,” in Proc. Int. Conf. Image Commun. Appl., vol. 16, no. 3, pp. 87:1–87:20, 2020.
Process., 2009, pp. 289–292. [44] O. Beijbom, N. Joshi, D. Morris, S. Saponas, and S. Khullar,
[20] T. Joutou and K. Yanai, “A food image recognition system with multiple “Menu-match: Restaurant-specific food logging from images,” in Proc.
kernel learning,” in Proc. IEEE 16th Int. Conf. Image Process., 2009, IEEE/CVF Winter Conf. Appl. Comput. Vis., 2015, pp. 844–851.
pp. 285–288. [45] S. Horiguchi, S. Amano, M. Ogawa, and K. Aizawa, “Personalized
[21] H. Hoashi, T. Joutou, and K. Yanai, “Image recognition of 85 food classifier for food image recognition,” IEEE Trans. Multimedia, vol. 20,
categories by feature fusion,” in Proc. IEEE Int. Symp. Multimedia, 2010, no. 10, pp. 2836–2848, Oct. 2018.
pp. 296–301. [46] K. Okamoto and K. Yanai, “UEC-FoodPIX Complete: A large-scale food
[22] Y. Matsuda and K. Yanai, “Multiple-food recognition considering co- image segmentation dataset,” in Proc. Int. Conf. Pattern Recognit., 2021,
occurrence employing manifold ranking,” in Proc. Int. Conf. Pattern pp. 647–659.
Recognit., 2012, pp. 2017–2020. [47] A. Salvador et al., “Learning cross-modal embeddings for cooking recipes
[23] Y. Kawano and K. Yanai, “Automatic expansion of a food image dataset and food images,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2017,
leveraging existing categories with domain adaptation,” in Proc. Eur. pp. 3020–3028.
Conf. Comput. Vis., 2014, pp. 3–17. [48] H. Wang, D. Sahoo, C. Liu, E.-P. Lim, and S. C. Hoi, “Learning
[24] M. M. Anthimopoulos, L. Gianola, L. Scarnato, P. Diem, and S. G. cross-modal embeddings with adversarial networks for cooking recipes
Mougiakakou, “A food recognition system for diabetic patients based on and food images,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2019,
an optimized bag-of-features model,” IEEE J. Biomed. Health Inform., pp. 11 572–11 581.
vol. 18, no. 4, pp. 1261–1271, Jul. 2014. [49] D. P. Papadopoulos, E. Mora, N. Chepurko, K. W. Huang, F. Ofli,
[25] X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, “Recipe and A. Torralba, “Learning program representations for food images
recognition with large multimodal food dataset,” in Proc. IEEE Int. Conf. and cooking recipes,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Multimedia Expo, 2015, pp. 1–6. Recognit., 2022, pp. 16 559–16 569.
[50] H. Fu, R. Wu, C. Liu, and J. Sun, “MCEN: Bridging cross-modal [76] K. Yanai and Y. Kawano, “Food image recognition using deep convo-
gap between cooking recipes and dish images with latent vari- lutional network with pre-training and fine-tuning,” in Proc. IEEE Int.
able model,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, Conf. Multimedia Expo Workshops, 2015, pp. 1–6.
pp. 14 570–14 580. [77] H. Wu, M. Merler, R. Uceda-Sosa, and J. R. Smith, “Learning to make
[51] A. Salvador, E. Gundogdu, L. Bazzani, and M. Donoser, “Revamping better mistakes: Semantics-aware visual food recognition,” in Proc. ACM
cross-modal recipe retrieval with hierarchical transformers and self- Int. Conf. Multimedia, 2016, pp. 172–176.
supervised learning,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2021, [78] P. Pandey, A. Deepthi, B. Mandal, and N. B. Puhan, “FoodNet: Recog-
pp. 15 475–15 484. nizing foods using ensemble of deep networks,” IEEE Signal Process.
[52] D. P. Papadopoulos, Y. Tamaazousti, F. Ofli, I. Weber, and A. Torralba, Lett., vol. 24, no. 12, pp. 1758–1762, Dec. 2017.
“How to make a pizza: Learning a compositional layer-based GAN [79] S. Ao and C. X. Ling, “Adapting new categories for food recognition
model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, with deep representation,” in Proc. Int. Conf. Des. Minings Workshop,
pp. 7994–8003. 2015, pp. 1196–1203.
[53] F. Han, R. Guerrero, and V. Pavlovic, “CookGAN: Meal image synthesis [80] C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, and Y. Ma, “DeepFood:
from ingredients,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., Deep learning-based food image recognition for computer-aided dietary
2020, pp. 1439–1447. assessment,” in Proc. Conf. Inclusive Smart Cities Digit. Health, 2016,
[54] B. Zhu and C. Ngo, “CookGAN: Causality based text-to-image synthe- pp. 37–48.
sis,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5518–5526. [81] M. Bolanos and P. Radeva, “Simultaneous food localization and recog-
[55] A. Salvador, M. Drozdzal, X. Giró-i Nieto, and A. Romero, “Inverse nition,” in Proc. Int. Conf. Pattern Recognit., 2017, pp. 3140–3145.
cooking: Recipe generation from food images,” in Proc. IEEE/CVF Conf. [82] P. R. López, D. V. Dorta, G. C. Preixens, J. M. Gonfaus, and J. G. Sabaté,
Comput. Vis. Pattern Recognit., 2019, pp. 10 453–10 462. “Pay attention to the activations: A modular attention mechanism for
[56] A. Salvador, M. Drozdzal, X. Giró i Nieto, and A. Romero, “Inverse fine-grained image recognition,” IEEE Trans. Multimedia, vol. 22, no. 2,
cooking: Recipe generation from food images,” in Proc. Conf. Comput. pp. 502–514, Feb. 2020.
Vis. Pattern Recognit., 2019, pp. 10 445–10 454. [83] E. Aguilar, M. Bola nos, and P. Radeva, “Food recognition using fusion
[57] M. Nestle, Food Politics: How the Food Industry Influences Nutrition and of classifiers based on CNNs,” in Proc. Int. Conf. Image Anal. Process.,
Health, vol. 3. Berkeley, CA, USA: Univ. California Press, 2013. 2017, pp. 213–224.
[58] National food safety standard for uses of food additives (GB 31632– [84] H. Hassannejad, G. Matrella, P. Ciampolini, I. D. Munari, M. Mordonini,
2014), China Food Additives, no. 8X, 2015, Art. no. 28. and S. Cagnoni, “Food image recognition using very deep convolutional
[59] L. Zhang, S. Huang, W. Liu, and D. Tao, “Learning a mixture of networks,” in Proc. Int. Workshop Multimedia Assist. Dietary Manage.,
granularity-specific experts for fine-grained categorization,” in Proc. Int. 2016, pp. 41–49.
Conf. Comput. Vis., 2019, pp. 8331–8340. [85] N. Martinel, G. L. Foresti, and C. Micheloni, “Wide-slice residual net-
[60] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” works for food recognition,” in Proc. IEEE/CVF Winter Conf. Appl.
in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794–7803. Comput. Vis., 2018, pp. 567–576.
[61] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely [86] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and
connected convolutional networks,” in Proc. Conf. Comput. Vis. Pattern D. Batra, “Grad-CAM: Visual explanations from deep networks via
Recognit., 2017, pp. 4700–4708. gradient-based localization,” in Proc. Int. Conf. Comput. Vis., 2017,
[62] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. pp. 618–626.
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141. [87] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by
Weiqing Min (Senior Member, IEEE) is currently an Associate Professor with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences. His research interests include multimedia content analysis and food computing. He has authored or coauthored more than 50 peer-reviewed papers in relevant journals and conferences, including Patterns (Cell Press), ACM Computing Surveys, Trends in Food Science & Technology, IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Image Processing, IEEE Transactions on Multimedia, ACM MM, AAAI, etc. He has organized several special issues for international journals, such as IEEE Transactions on Multimedia and IEEE MultiMedia, as a Guest Editor. He was a recipient of the 2016 ACM TOMM Nicolas D. Georganas Best Paper Award and the 2017 IEEE MultiMedia Magazine Best Paper Award.

Zhiling Wang received the BE degree from the School of Geodesy and Geomatics, Wuhan University, Wuhan, China, in 2019, and the master's degree in computer science from the Chinese Academy of Sciences University, Beijing, China, in 2022. He is currently an algorithm engineer in the Meituan Vision AI Department, Beijing. His research interests include food computing, fine-grained image recognition and retrieval.

Yuxin Liu received the BE degree from the School of Computer Science and Technology, Shandong University, Qingdao, China, in 2020. He is currently working toward the PhD degree in computer science with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include computer vision, machine learning, and fine-grained and multi-label image recognition.

Mengjiang Luo received the BE degree from Yanbian University, Yanbian, China, in 2018. He is currently working toward the master's degree in computer science with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include multimedia content analysis, understanding, and food recognition.

Liping Kang graduated from the Xi'an Jiaotong University of Information Engineering in July 2013 and received the master's degree from the Chinese Academy of Sciences University in July 2016. She is currently an algorithm expert in the Meituan Vision AI Department, Beijing, China. Her current research interests include deep learning, computer vision, fine-grained image recognition and retrieval, and their applications. She has applied for 17 patents as the first inventor, 10 of which have been authorized.

Xiaoming Wei is currently the leader of the Vision Understanding Group, Computer Vision Division, at Meituan. His research interests focus on fine-grained image recognition, multimodal analysis and generation, etc. He has led his team to top rankings in several fine-grained recognition competitions, including Herbarium 2022 at FGVC9 (1st place) and Product Recognition at CVPR 2019 (2nd place). He has published more than 10 papers in CVPR, ECCV, IJCAI, ACM MM, AAAI, etc.

Xiaolin Wei received the PhD degree in computer science from Texas A&M University. He is now leading the Computer Vision Division at Meituan. His research areas include computer vision, machine learning, computer graphics, 3D vision, and augmented reality. He previously worked as a research engineer at Google, CEO of Virtroid, and principal engineer at Magic Leap. He has been granted more than 40 patents and has published more than 30 papers in SIGGRAPH, ICCV, ECCV, ACM MM, IJCAI, etc.

Shuqiang Jiang (Senior Member, IEEE) is a professor with the Institute of Computing Technology, Chinese Academy of Sciences (CAS), and a professor with the University of CAS. He is also with the Key Laboratory of Intelligent Information Processing, CAS. His research interests include multimedia analysis, multimodal intelligence, and food computing. He has authored or coauthored more than 200 papers on related research topics. He was supported by the National Science Fund for Distinguished Young Scholars in 2021. He has won the CAS International Cooperation Award for Young Scientists, the CCF Award of Science and Technology, the Wu Wenjun Natural Science Award for Artificial Intelligence, the CSIG Natural Science Award, and the Beijing Science and Technology Progress Award. He is an associate editor of ACM Transactions on Multimedia Computing, Communications, and Applications, vice chair of the IEEE CASS Beijing Chapter, and vice chair of the ACM SIGMM China Chapter.