
Large Scale Visual Food Recognition


Weiqing Min, Senior Member, IEEE, Zhiling Wang, Yuxin Liu, Mengjiang Luo, Liping Kang, Xiaoming Wei, Xiaolin Wei, and Shuqiang Jiang, Senior Member, IEEE

Abstract—Food recognition plays an important role in food choice and intake, which is essential to the health and well-being of humans. It is thus of importance to the computer vision community, and can further support many food-oriented vision and multimodal tasks, e.g., food detection and segmentation, cross-modal recipe retrieval and generation. Unfortunately, while generic visual recognition has advanced remarkably thanks to released large-scale datasets, progress still largely lags in the food domain. In this paper, we introduce Food2K, which is the largest food recognition dataset with 2,000 categories and over 1 million images. Compared with existing food recognition datasets, Food2K bypasses them in both categories and images by one order of magnitude, and thus establishes a new challenging benchmark for developing advanced models for food visual representation learning. Furthermore, we propose a deep progressive region enhancement network for food recognition, which mainly consists of two components, namely progressive local feature learning and region feature enhancement. The former adopts improved progressive training to learn diverse and complementary local features, while the latter utilizes self-attention to incorporate richer context with multiple scales into local features for further local feature enhancement. Extensive experiments on Food2K demonstrate the effectiveness of our proposed method. More importantly, we have verified the better generalization ability of Food2K in various tasks, including food image recognition, food image retrieval, cross-modal recipe retrieval, food detection and segmentation. Food2K can be further explored to benefit more food-relevant tasks including emerging and more complex ones (e.g., nutritional understanding of food), and the models trained on Food2K can be expected to serve as backbones to improve the performance of more food-relevant tasks. We also hope Food2K can serve as a large-scale fine-grained visual recognition benchmark, and contribute to the development of large-scale fine-grained visual analysis.

Index Terms—Food dataset, food recognition, large-scale datasets, fine-grained recognition.

Manuscript received 29 March 2021; revised 27 July 2022; accepted 14 January 2023. Date of publication 18 January 2023; date of current version 30 June 2023. This work was supported in part by the National Natural Science Foundation of China under Grants 61972378, 62125207, U1936203, and U19B2040, and in part by the Meituan-Dianping Group. Recommended for acceptance by J. Hoffman. (Corresponding author: Shuqiang Jiang.)
Weiqing Min, Zhiling Wang, Yuxin Liu, Mengjiang Luo, and Shuqiang Jiang are with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).
Liping Kang, Xiaoming Wei, and Xiaolin Wei are with Meituan Group, Beijing 100102, China (e-mail: [email protected]; [email protected]; [email protected]).
The dataset, code and models are publicly available at http://123.57.42.89/FoodProject.html.
This article has supplementary material provided by the authors and color versions of one or more figures available at https://doi.org/10.1109/TPAMI.2023.3237871.
Digital Object Identifier 10.1109/TPAMI.2023.3237871

I. INTRODUCTION

FOOD computing [1] has recently come into the focus of public attention as a new emerging area supporting many food-related issues, e.g., food choice [2] and healthy diets [3]. As a basic task in food computing, food recognition plays an important role for humans in identifying food for further food gathering to meet their survival needs [4]. It is also an essential step in many health applications, such as nutritional understanding of food [5], [6] and dietary management [7]. In addition, food image recognition is an important branch of fine-grained visual classification, and thus has important theoretical research significance. For these reasons, food recognition has been drawing more attention in computer vision and beyond [8], [9], [10], [11], [12], [13], [14].

Existing works mainly focus on utilizing medium-scale or small-scale image datasets for food recognition, such as ETH Food-101 [8], Vireo Food-172 [15] and ISIA Food-500 [16]. They are probably insufficient for building more complicated and advanced statistical models for food computing due to their inadequate food categories and images. Considering that large-scale datasets have been key enablers of progress in general image classification and understanding, a large-scale food image dataset is urgently needed for developing advanced food visual representation learning algorithms, and can further support various food-relevant tasks, such as cross-modal recipe retrieval and generation [17], [18].

To this end, we introduce a new large-scale benchmark dataset Food2K for food recognition as the main contribution. Food2K contains 1,036,564 images with 2,000 categories, belonging to different super-classes, such as vegetables, meat, barbecue and fried food. In contrast to existing datasets, such as ETH Food-101, the size of Food2K in both categories and images bypasses their size by one order of magnitude. In addition to the size, we have carried out rigorous data cleaning, iterative annotation and multiple professional inspections to guarantee its high quality. We hope that this large-scale, high-quality dataset will be a useful resource for developing advanced food image representation learning and understanding methods to support both popular and emerging food-relevant vision tasks. In addition, Food2K can be expected to be a large-scale fine-grained visual recognition benchmark to enable the development of fine-grained visual recognition.

Based on this dataset, we propose a deep progressive region enhancement network for food recognition. It can jointly learn diverse and complementary local and global features for food recognition. We adopt the progressive training strategy to


obtain abundant and various food features from food images, like different ingredient information. This strategy can help to learn comprehensive and multiple fine-grained representations as training progresses. In addition, our model incorporates richer context with multiple scales into local features via self-attention to enhance the local feature representation.

Extensive experiments on Food2K demonstrate the effectiveness of our proposed method. In addition, we provide extensive experiments comparing various state-of-the-art methods for image representation learning, including popular deep networks, fine-grained methods and existing food recognition methods. Furthermore, we also show that the networks learned on Food2K can benefit various food-relevant vision tasks, i.e., food recognition, food image retrieval, cross-modal recipe retrieval, food detection and segmentation, indicating the better generalization ability of Food2K. The networks developed on Food2K can be expected to serve as backbones to support more food-relevant vision tasks, especially emerging and more complex ones.

The contributions of our paper can be summarized as follows:
• We introduce a new large-scale, high-quality food recognition benchmark Food2K, which is the largest food image dataset with 2,000 categories and 1,036,564 images.
• We propose a deep progressive region enhancement network to learn food-oriented local features by progressive training, and further utilize self-attention to enhance local features for food recognition.
• We conduct extensive evaluation on Food2K to verify the effectiveness of our approach, and provide extensive baselines on this benchmark, including popular deep networks, fine-grained recognition methods and existing food recognition ones.
• We explore the ability of models trained on Food2K to transfer to various food-relevant tasks, including visual recognition, retrieval, detection, segmentation and cross-modal recipe retrieval, and demonstrate the better generality of Food2K on these tasks.

II. RELATED WORK

Food-Centric Datasets: Over the years, the size of food-centric datasets has grown steadily. For example, Bossard et al. [8] construct the western food dataset ETH Food-101 with 101,000 images from 101 food categories. VIREO Food-172 [15] consists of 110,241 images from 172 Chinese food categories. Compared with these two datasets, FoodX-251 [11], released for the Fine-Grained Visual Categorization Challenge held in conjunction with CVPR 2019, provides 158,846 images and 251 categories. However, these datasets fall short in terms of both comprehensive coverage of food categories and large-scale image quantity, unlike datasets such as ImageNet [37]. Although the full set of ImageNet contains about 1,000 food-related categories [38], different from existing food datasets and Food2K, which contain food classes for direct eating, many food-relevant categories from ImageNet belong to nutrient composition (e.g., choline, vitamin), cooking methods (e.g., split, mix), kitchenware (e.g., mixer, kibble), etc. The reason is that the aim of ImageNet is generic object recognition, not particularly food recognition. Considering the important role of large-scale datasets in the continuous improvement of visual recognition algorithms, especially for deep learning based methods, we build a large-scale food recognition dataset Food2K with more comprehensive coverage of categories and a larger quantity of images. In Table I, we give statistics of existing food recognition benchmarks together with Food2K. The size of Food2K in both categories and images bypasses the size of existing datasets by at least one order of magnitude. Although there are some datasets with more categories, such as UNICT-FD1200, the quantity of food images for each category is very limited.

TABLE I: COMPARISON OF CURRENT FOOD RECOGNITION DATASETS

In addition, there are other food-relevant recipe datasets, such as Yummly-66K [39] and Recipe1M [17]. The best known is Recipe1M. Food2K and Recipe1M both belong to large-scale food-related datasets, but with two important differences: (1) Recipe1M is used for cross-modal embedding and retrieval between recipes and images, while the released Food2K aims at advancing scalable food visual feature learning. (2) Recipe1M mainly contains over 1 million structured cooking recipes, where each recipe is associated with some food images, while Food2K contains over 1 million images belonging to 2,000 food categories. We believe Food2K and Recipe1M are complementary and can jointly promote the development of visual analysis and understanding of food.

Food Image Analysis: The availability of more food datasets has further enabled progress in food recognition. More importantly, recognizing food directly from images is highly desirable for various food-related applications, such as nutrient assessment [5], [7], food logging [36] and self-service settlement [40]. For these reasons, we have seen an explosion of food recognition algorithms in recent years.

Although food recognition belongs to fine-grained analysis, it has its unique characteristics [1]. First, food images do not have a distinctive spatial layout. A large number of dishes have deformable appearance and thus lack rigid structures.


Food consists of ingredients. Ingredients from various types of food images are distributed randomly on a plate, and different ingredients may overlap in the same food image. Even the same ingredient may appear distinctly in different food images. Such complex ingredient distributions make the task different from others like scene recognition, which relies on distinctive features such as buildings and trees. Second, food image recognition belongs to fine-grained classification, and thus shares its problems, such as subtle differences among different categories. Existing fine-grained classification methods generally focus on discovering fixed semantic parts as one important part of the representation. However, such common semantic parts do not exist in many food categories. Therefore, we should re-design the fine-grained categorization method for food recognition.

In earlier years, various hand-crafted features, such as color, texture and SIFT, were utilized for food recognition [8], [41], [42]. In the deep learning era, because of its powerful capacity for feature representation, more and more works resort to different deep networks for food recognition [9], [10], [11], [36]. For example, Qiu et al. [9] propose PAR-Net to mine discriminative food regions to improve classification performance. There are also some recent works on few-shot food recognition [14], [43]. For example, Zhao et al. [14] propose a fusion learning framework, which utilizes a graph convolutional network to capture inter-class relations between image representations and semantic embeddings of different categories for both few-shot and many-shot food recognition. In addition, there are many works [26], [28], [35], [44], [45] which introduce additional context information, e.g., GPS and ingredient information, to improve recognition performance. For example, Zhou et al. [28] mine rich relationships among ingredients and restaurant information through a bi-partite graph for food recognition. Min et al. [35] utilize ingredients as an additional supervision signal to localize multiple informative regions and fuse these regional features into the final representation for recognition. Different from these works, considering the characteristics of food images, we adopt a progressive training strategy to learn comprehensive and multiple local features, and further utilize self-attention to incorporate richer contexts with multiple scales to enhance local features. Like general object image analysis, which includes recognition, detection and segmentation, there are also, in addition to food recognition, works on food image detection and segmentation towards more accurate nutritional information extraction [40], [46]. However, progress is still slow owing to the lack of food-relevant datasets and the complex ingredient distributions in food images.

The above-mentioned food visual analysis methods generally use RGB food images, and thus probably do not achieve satisfactory performance for further nutritional content prediction, since volume information is not available. For that, RGB-D food analysis benchmarks for nutritional evaluation are built [6], where additional depth images can be used to estimate the food volume for improved prediction of calories and macronutrients.

Multimodal Food Learning: Multimodal food image-recipe text joint learning is another food-related topic in this community. There are mainly two lines for this study. The first line is cross-modal food image-recipe text retrieval [17], [47], [48], [49], [50], where the main idea is to learn a joint embedding of food images and recipes to support retrieval between food images and recipes. Due to the complex visual appearance of food images, effective visual feature learning is still one key. In addition, various types of text annotations should be further considered. The recent work [51] proposed a hierarchical transformer to achieve better cross-modal retrieval performance. The other line is cross-modal food image or cooking recipe generation [18], [52], [53], [54], [55]. For example, Wang et al. [18] proposed a structure-aware generation network to generate cooking instructions based only on food images and ingredients. All of these works also involve visual feature learning from food images. Constructing large annotated food image datasets can help the development of food visual recognition models, and also supports multimodal food learning tasks.

Food Computing: Food computing [1] has raised great interest recently for its various applications in health, culture, etc. It contains different tasks, such as food recognition [9], [11], [14], detection [40], segmentation [46], retrieval [47], [48], [50] and generation [18], [56]. Among these tasks, food recognition is an important and basic task for further supporting more complex food-relevant vision and multimodal tasks. Therefore, constructing large-scale food recognition datasets and designing advanced food recognition algorithms on these large-scale datasets are vital for the development of food computing.

III. FOOD2K DATASET

A. Dataset Construction

We collected this dataset from the catering website Meituan,¹ which holds a huge number of food images, including both eastern and western ones. We first obtained a large-scale raw dataset S_o from this website after application and authorization from Meituan. It consists of about seventy million food images uploaded by both catering staff from take-out online restaurants and users on Meituan. We then processed S_o to complete the Food2K construction in three phases, namely food category vocabulary construction, food image collection, and labeling.

¹ Meituan is China's leading e-commerce platform for services, especially well known in catering, which is similar to Yelp.

Building a Vocabulary of Food Categories: Considering that it is hard to obtain a comprehensive and standard food label list, we adopted a bottom-up method to construct the category vocabulary from the noisy and redundant labels associated with images in S_o. We first removed special characters and quantifiers, and then aggregated different labels belonging to the same food into one category according to the synonym set of food built from Meituan, leading to a set of food labels V_o. However, there still exist many labels corresponding to the same category. We thus carried out a secondary label aggregation. Considering that manually judging whether two food labels belong to the same dish is time-consuming and difficult, we conducted the following iterative process: we first selected one food label from V_o and manually selected several of its representative images


from S_o (10 in our setting). We then extracted their deep visual features (Inception-ResNet V2 in our setting), and used their averaged visual features as its visual description to retrieve images from S_o until we obtained the closest 50 food labels from retrieved labeled images as the candidate set. According to their images and associated labels, the candidate label set is verified to generate the final aggregated label via strict manual verification, and meanwhile these food labels are removed from V_o. For example, the food label set {Spaghetti with seafood, Italian seafood spaghetti, Angel seafood spaghetti} is aggregated into the label 'Seafood spaghetti'. We finally settled on a vocabulary with 2,000 frequently used categories, covering the daily diet with various food types, such as snack, bread and vegetables. Fig. 1(a) shows some categories from the vocabulary.

Fig. 1. (a) Some categories from Food2K and (b) an example of collecting images via labels and image-based retrieval.

Collecting Food Images: For each category from the constructed vocabulary, we aggregated food images based on its mapping M and then used these aggregated images to retrieve more images from the unlabeled ones of S_o to enlarge the dataset. Fig. 1(b) shows an example: the images with labels from the food category "Tom yum kung" can be obtained. An ideal method would be to clean up these candidate images via strict manual checks. However, this is impractical for a huge number of candidate images. Therefore, we cleaned up the candidate images via visual retrieval: for each food category, we similarly chose representative images (10 in our setting) and obtained their averaged visual features extracted from Inception-ResNet V2 as its visual description. Then we computed the visual similarity between it and its candidate images, and removed ones with low similarity. To ensure better diversity of food images and to lower the bias produced by visual retrieval with a certain deep network, we adopted a relatively low threshold, and for the images within this threshold, different sampling proportions are set according to their different visual similarity ranges. Different thresholds can be used for different networks to ensure low bias, so the dependence on a certain feature extraction network is small.² In addition, the original images on the Meituan website come from different sources. For example, the final constructed dataset includes not only food images from merchants, but also self-made food images taken by users themselves. Another way to ensure diversity is that we aggregated food images based on the label-images mapping. Each category comes from various original labels belonging to the same food, which gives the aggregated food images more diversity along fine-grained semantic differences. For example, the images with the food label set {Spaghetti with Seafood, Italian seafood spaghetti, Angel seafood spaghetti} are attached to the final category label 'Seafood Spaghetti'.

² Please refer to image samples from some categories via http://123.57.42.89/FoodProject.html.
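Both the retrieval-based cleaning just described and the earlier label aggregation reduce to one primitive: average the deep features of a few representative images and rank candidates by visual similarity to that average. A minimal sketch of that primitive is given below; it assumes the features (e.g., from Inception-ResNet V2) have already been extracted, and the threshold value and feature dimension are illustrative only, not the values used for Food2K.

```python
import numpy as np

def select_candidates(rep_feats, cand_feats, threshold=0.6):
    """Rank candidate images by cosine similarity to the averaged
    representative feature and keep those above a similarity threshold.

    rep_feats:  (R, D) features of ~10 manually chosen representative images.
    cand_feats: (N, D) features of candidate images retrieved by label.
    Returns indices of kept candidates, sorted by decreasing similarity.
    """
    # Averaged visual description of the category.
    center = rep_feats.mean(axis=0)
    center /= np.linalg.norm(center) + 1e-12

    # Cosine similarity between the category description and each candidate.
    cand = cand_feats / (np.linalg.norm(cand_feats, axis=1, keepdims=True) + 1e-12)
    sims = cand @ center

    keep = np.where(sims >= threshold)[0]
    return keep[np.argsort(-sims[keep])], sims

# Toy example with random vectors standing in for real deep features.
rng = np.random.default_rng(0)
rep = rng.normal(size=(10, 1536)).astype(np.float32)
cands = rng.normal(size=(1000, 1536)).astype(np.float32)
kept, sims = select_candidates(rep, cands, threshold=0.0)
print(len(kept), "candidates kept")
```

In practice the kept candidates would then be bucketed by similarity range and subsampled with different proportions, as described above, rather than taken wholesale.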
Fig. 2. Example annotations for Food2K. For each column, the blue dashed box denotes a qualified image for the category, while the red one is unqualified. Different types of unqualified images are shown, e.g., painted food, occlusion of the main part, missing important ingredients, and multiple categories in one image.

Annotating Food Images: After initial food image collection, we conducted exact duplicate removal. To further build a high-quality dataset, all the images were finally evaluated by a professional data annotation and quality inspection team. The annotators are divided into two groups. For each category, we first obtain some ground-truth images from Wikipedia or encyclopedic entries. Based on these reference images, the first group conducted the annotation for each category by removing unqualified images, such as unclear ones and ones with food parts occluded. Fig. 2 illustrates different types of unqualified images. Note that all the images in Food2K have a single dish label, since our primary goal is to provide a basic large-scale food recognition benchmark to support more complex food-relevant vision tasks. The second group then carried out the inspection for each image from this category to ensure that the


rate of data qualification reaches 99%. Otherwise, the images belonging to this category are re-annotated. Through this iterative annotation and multiple rounds of inspection, the high-quality Food2K dataset is finally built with 2,000 classes and 1,036,564 images.

Fig. 3. The ontology of Food2K.

Fig. 4. The distribution of images over each category in Food2K.

B. Dataset Statistics and Analysis

Considering that Food2K contains both western and eastern food, with the help of food experts, we built a unified food ontology by combining and adapting existing food classification systems from both western [30], [57] and eastern ones [58]. It is a hierarchical structure: there are 12 super-classes (like "Bread" and "Meat"), and there are some sub-classes for each super-class (like "Beef" and "Pork" in "Meat"). Each type contains many dishes (like "Curry Beef Brisket" and "Fillet steak" in "Beef"), where each dish generally contains several ingredients. Note that some groups are very broad in the ontology, like "Bread" and "Meat", and are thus divided into more detailed types. As shown in Fig. 3, the 2,000 dish categories from Food2K cover all 12 groups with both eastern food (e.g., Noodles and Sushi) and western food (e.g., Pasta and Dessert). Food2K contains 1,710 eastern food categories and 290 western ones.

Fig. 4 shows the number of images per category in decreasing order. The number of images per category is in the range [153, 1999], showing a larger class imbalance compared with existing food datasets. Some samples³ are also shown in Fig. 4.

³ Please refer to the full category list and samples from all categories via http://123.57.42.89/FoodProject.html.

Fig. 5 illustrates the scale of Food2K compared to existing food recognition datasets. As shown in Fig. 5(a), the number of images in Food2K bypasses that of the previous largest dataset by one order of magnitude. Fig. 5(b) provides an overview of the distribution of dishes over categories compared to typical datasets, including ETH Food-101 (western food), Vireo Food-172 (eastern food) and ISIA Food-500 (misc. food). We can see that for each category, the number of dishes in Food2K is larger than in existing datasets. Furthermore, some categories exist in Food2K but are missing or very rare in existing ones, such as Barbecue. Further statistics show that the categories overlapping between existing datasets (Food-101, Vireo Food-172 and ISIA Food-500) and Food2K number only 13, 75 and 40, respectively, and duplicate detection shows that there are no repeated images for any overlapping category. The probable reason is that images from Meituan are taken by catering staff and users on Meituan, and are not allowed to be crawled without authorization.

Besides more comprehensive coverage of dishes and a larger quantity of images, Food2K has the following characteristics. First, Food2K covers more diverse visual appearances and patterns. Fig. 6(a) shows some examples for each category. Different ingredients and their combinations, different accessories and different arrangements all lead to visual differences within the same category. For example, fresh fruit salad appears with different visual appearances due to its mixture of different fruit ingredients, while the large visual difference of "Rib eye steak" comes from its different accessories. Such unique characteristics of food lead to higher intra-class difference, making large-scale food recognition difficult. Second, Food2K contains more fine-grained annotation. Based on the constructed ontology of Food2K, the level of category annotation in Food2K is more fine-grained compared with other food datasets. Take Pizza as an example: some classic datasets such as Food-101 have only the pizza class. In contrast, as shown in Fig. 6(b), the pizza class in Food2K is further divided into more types. The subtle visual differences among different pizza images are mainly caused by their unique ingredients or by the same ingredient with different granularity, also leading to more difficulty in recognition. All of these factors make Food2K a new challenging large-scale food recognition benchmark.

IV. OUR METHOD

In this section, we present the proposed Progressive Region Enhancement Network (PRENet) in Fig. 7. PRENet mainly consists of progressive local feature learning and region feature enhancement. The former mainly adopts the progressive training strategy to learn complementary multi-scale finer local features, like different ingredient-relevant information.


Fig. 5. (a) Datasets distributed across number of images and categories and (b) distributions of categories on Food2K and typical datasets.

Fig. 6. (a) Various visual appearances for one category and (b) one example with more fine-grained annotation in Food2K.

Fig. 7. The framework of PRENet. (a) Global feature learning branch, which learns the global super-class features. (b) Progressive local feature learning branch, which captures complementary multi-scale local features through the progressive training strategy. (c) Region feature enhancement branch, which incorporates contexts into local features through self-attention. Note that the predicted scores from classifier A in (a) and classifier B in (b) are combined for the final prediction.

The region feature enhancement uses self-attention to incorporate richer contexts with multiple scales into local features to enhance the local feature representation. We then fuse the enhanced local features and the global ones from global feature learning into a unified representation via the concat layer. During training, after progressively training the networks from different stages, we train the whole network with the concat part, and further introduce the KL-divergence to increase the difference between stages for capturing more detailed features. For inference, considering the complementary outputs from each stage and the concatenated features, we combine their prediction results for the final food classification.


A. Global-Local Feature Learning

Although food image recognition is a fine-grained visual recognition task, food images under different super-classes have obvious discriminative visual differences, and thus can be better recognized by a global representation. Those in different sub-classes under the same super-class have high inter-class similarity, as shown in Fig. 6(b), and we thus should pay more attention to more fine-grained local features. Therefore, we extract and fuse both the global representation and these subtle visual differences. We use two sub-networks to extract the global and local features of food images, respectively. These two sub-networks could be two separate networks; however, they share most of the layers of the same backbone network in our design for efficiency.

Global Feature Learning: Inspired by [59], based on an existing network (e.g., ResNet), for the output f^g of the last convolutional layer, we use Global Average Pooling (GAP) to extract the global feature f_Glo:

f_Glo = GAP(f^g)   (1)

Progressive Local Feature Learning: The local feature sub-network aims to learn discriminative fine-grained features of food. Due to the diverse ingredients and cooking styles, the discriminative parts of a food image are multi-scale and irregular. As the first contribution, we adopt the progressive training strategy to solve this problem. In this strategy, we train the low stages first, which have a small receptive field, then zoom out to a larger field surrounding this local region, and finish when we reach the whole image. This training strategy forces our model to extract finer discriminative local features, such as ingredient-relevant ones. After this process, we extract features from different layers to obtain multi-scale feature representations. Specifically, for the output f^i of each stage of the local feature sub-network, we use a convolutional block and a Global Maximum Pooling (GMP) layer to get the local feature vectors f_Loc^i:

f_Loc^i = GMP(f^i)   (2)

where f^i denotes the output from the i-th stage of the network. Therefore, this strategy first learns more stable fine-grained information in shallower layers and gradually shifts attention to learning coarse-grained information in deeper layers as training progresses. Specifically, it can extract discriminative local fine-grained features such as ingredients when features with different granularities are sent to the network. However, simply using the progressive training strategy will not yield diverse fine-grained features, because the multi-scale information learned via progressive training may focus on similar regions. As the second contribution, we optimize the KL divergence between features from different stages to increase the difference between them. By maximizing the KL divergence between features from different stages, we force multi-scale features to focus on different areas in different stages, which can help capture as many details as possible.

Particularly, we divide the training process into S steps, and train the first U − S + i stages at step i, where U is the total number of stages of the network. Because we will concatenate all the global and local features in the final stage (also called the concat stage), the total number of steps is S + 1. In our method, for the output f_Loc^i from the i-th stage of the network to be trained at each step of progressive training, F_i is used to process the output features. F_i consists of a convolutional layer, a batch norm layer and a ReLU unit. We can then obtain the local feature representation c_i = F_i(f_Loc^i).
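To make (1), (2) and the F_i blocks concrete, the sketch below shows one possible PyTorch reading with a torchvision ResNet-50 backbone whose last three stages are used (matching U = S = 3 in Section V). It is an illustration of the equations under those assumptions, not the authors' released implementation; the channel sizes and the placement of the convolutional block relative to pooling follow one plausible interpretation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class StageFeatures(nn.Module):
    """Extract the global (GAP) and per-stage local (GMP) features of ResNet-50."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        net = resnet50(weights=None)  # ImageNet weights would be loaded in practice
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1)
        self.stages = nn.ModuleList([net.layer2, net.layer3, net.layer4])
        # F_i: convolution + batch norm + ReLU applied per stage.
        self.F = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, feat_dim, 1),
                          nn.BatchNorm2d(feat_dim),
                          nn.ReLU(inplace=True))
            for c in (512, 1024, 2048)
        ])

    def forward(self, x):
        x = self.stem(x)
        f_loc, c = [], []
        for stage, F_i in zip(self.stages, self.F):
            x = stage(x)
            f_loc.append(x)                        # f^i: raw stage output map
            c.append(F_i(x).amax(dim=(2, 3)))      # conv block F_i + global max pooling
        f_glo = x.mean(dim=(2, 3))                 # global average pooling, Eq. (1)
        return f_glo, f_loc, c

feats = StageFeatures()
f_glo, f_loc, c = feats(torch.randn(2, 3, 448, 448))
print(f_glo.shape, [t.shape for t in c])
```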
Furthermore, for the multiple local features from different stages, we utilize KL-divergence to increase the difference between stages, which can help capture as many details as possible. Under the progressive training strategy, the visual features extracted from different stages are projected as specific probability distributions which represent the visual semantic information. However, the iterative optimization may result in different probability distributions converging to the same distribution, which impairs the ability to extract diverse features. KL-divergence can measure the similarity between two distributions. By maximizing the KL-divergence, this convergence can be suppressed and more fine-grained visual features can be extracted for recognition. The KL-divergence is calculated over global features for every two adjacent outputs in each batch, where the reduction of the KL-divergence is batchmean:

L_KL(y_i, y_j) = Σ_{i=1}^{U} Σ_{j=U−i}^{U} y_i log(y_i / y_j)   (3)

where y_i and y_j are the output distributions from different stages.
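The stage-divergence term in (3) can be computed with F.kl_div over adjacent stage outputs using the batchmean reduction mentioned above. The sketch below is a hedged reading of that computation; whether the term is negated here or when it is added to the total loss is a convention choice.

```python
import torch
import torch.nn.functional as F

def stage_kl(outputs):
    """KL divergence between adjacent stage predictions (cf. Eq. (3)).

    outputs: list of S logit tensors, each of shape (batch, num_classes).
    Returns the summed KL over adjacent pairs; this quantity is to be
    maximized, i.e., subtracted from the total loss or given a negative weight.
    """
    kl = outputs[0].new_zeros(())
    for y_i, y_j in zip(outputs[:-1], outputs[1:]):
        # F.kl_div expects log-probabilities as input and probabilities as target.
        kl = kl + F.kl_div(F.log_softmax(y_i, dim=1),
                           F.softmax(y_j, dim=1),
                           reduction="batchmean")
    return kl

logits = [torch.randn(8, 2000) for _ in range(3)]   # three stages, 2,000 classes
print(stage_kl(logits))
```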
B. Region Feature Enhancement

Different from general fine-grained tasks, food images do not have fixed semantic information [1]. Most existing food recognition methods [10], [35] mine discriminative features directly, ignoring the relationship between local features. Therefore, we adopt a self-attention mechanism to capture the relationship between different local features. This strategy aims to capture the co-occurring food features in the feature map. It is a revised non-local interaction [60] within the same-level feature map, and the output feature map has the same scale as its input. Specifically, we first extract the local feature representation f_Loc of the last S stages, and then obtain the enhanced features via self-attention as follows:

q^(i) = Conv(f_Loc^(i))
k^(j), v^(j) = Conv(f_Loc^(j))
S_{i,j} = Softmax(q^(i) (k^(j))^T / sqrt(d_k^(j)))
f̃^(i) = Σ_j S_{i,j} v^(j)   (4)

where f_Loc^(i) and f_Loc^(j) are the i-th and j-th feature positions in f_Loc, q^(i) is the i-th query, k^(j) and v^(j) are the j-th key/value pair, d_k^(j) is the dimension of k^(j), and S_{i,j} denotes the similarity between q^(i) and k^(j). Thus we obtain the enhanced feature f̃. Finally, we concatenate these enhanced feature maps of the same size, and use convolutional layers to convert them into the same dimension.
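Equation (4) amounts to a scaled dot-product attention over the spatial positions of a stage feature map, with 1 x 1 convolutions producing queries, keys and values. The module below sketches that computation for a single map; the channel and embedding sizes are placeholders, and this is one plausible reading of (4) rather than the reference code.

```python
import torch
import torch.nn as nn

class RegionEnhance(nn.Module):
    """Self-attention over spatial positions of a feature map (cf. Eq. (4))."""
    def __init__(self, channels, dim=256):
        super().__init__()
        self.q = nn.Conv2d(channels, dim, 1)
        self.k = nn.Conv2d(channels, dim, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, f_loc):
        b, c, h, w = f_loc.shape
        q = self.q(f_loc).flatten(2).transpose(1, 2)   # (B, HW, dim)  queries q^(i)
        k = self.k(f_loc).flatten(2)                   # (B, dim, HW)  keys    k^(j)
        v = self.v(f_loc).flatten(2).transpose(1, 2)   # (B, HW, C)    values  v^(j)
        s = torch.softmax(q @ k / (q.size(-1) ** 0.5), dim=-1)   # S_{i,j}
        out = (s @ v).transpose(1, 2).reshape(b, c, h, w)        # enhanced features
        return out

enhance = RegionEnhance(channels=2048)
f_tilde = enhance(torch.randn(2, 2048, 14, 14))
print(f_tilde.shape)   # same spatial scale as the input, as stated in the text
```

Applying one such module per stage and concatenating the results (followed by 1 x 1 convolutions to match dimensions) reproduces the fusion described in the next paragraph.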

Authorized licensed use limited to: VTU Consortium. Downloaded on February 22,2024 at 06:47:57 UTC from IEEE Xplore. Restrictions apply.
MIN et al.: LARGE SCALE VISUAL FOOD RECOGNITION 9939

Finally, we obtain the S local features f_Loc^i after this step. By this strategy, our model enables features to interact across space and scales.

During training, after obtaining the global and local features, we combine them as the final representation f′ in the concat stage:

f′ = Concat(f_Glo, f_Loc^{U−S+1}, ..., f_Loc^{U})   (5)

C. Optimization and Inference

During optimization, we use an iterative procedure to update the parameters of the network. First, during the progressive learning stage, we utilize the cross entropy loss L_pro^i from the S stages to back-propagate and update the parameters of the corresponding part of the network. Note that all the parameters in the current stage are optimized, even if they have been updated in previous stages. Then, during the concat stage, we utilize another loss function to update the parameters of the whole network. Our network is trained in an end-to-end way.

During the progressive training, for the output from each stage, we utilize classifiers C_i^A to predict the corresponding probability distribution as:

y_i = C_i^A(c_i)   (6)

where classifier C_i^A consists of two fully-connected layers with batch normalization and ELU non-linearity, corresponding to the i-th stage; y_i is the prediction from the corresponding stage, and i ranges from U − S + 1 to U. Then, we adopt the following cross entropy loss:

L_pro^i = − Σ_{x_m ∈ M} y_i^m ln(y_i^m)   (7)

where M is the set of training data, x_m denotes the m-th sample and y_i^m is the corresponding category label.

For the unified representation, the prediction from the final concat stage is obtained by a classifier C_B as:

y′ = C_B(f′)   (8)

and we use another cross entropy loss:

L_con = − Σ_{i=1}^{N} y′_i ln(y′_i)   (9)

where y′ is the predicted label and N is the total number of training samples. In our method, L_con is kept for all runs and the number of steps is S + 1. Furthermore, we introduce another KL-divergence loss to increase the difference between stages for capturing more detailed features, resulting in the final loss function for the whole network:

L = α L_con + β L_KL   (10)

where α and β are balance parameters.

During the inference step, considering that the predictions from the different stages and the fused stage are complementary, we combine all outputs together to improve the recognition performance. Particularly, we add the scores across all stages (including the final score) with equal weights to predict the output class:

Y = Sum(y′, y_{U−S+1}, ..., y_U)   (11)

where Sum(·) denotes the weighted sum with equal weights.
different stages and the fused stage is complementary, we networks, such as DCL [71]. However, the performance trend
can combine all outputs together to improve the recognition on Food2K is not consistent with existing fine-grained datasets,
performance. Particularly, we add the scores across all stages such as birds and cars. For example, recently proposed PMG [64]


does not achieve the desired performance on Food2K compared with existing fine-grained datasets. The probable reason is that PMG does not consider the relations between local features, and it also does not take the global representation into consideration during feature learning. There are also some fine-grained methods which perform even worse than their backbones, such as HBP [70]. The probable reason is that HBP adopts bilinear pooling to capture the inter-layer part feature relations and extracts the co-occurring features to integrate a unified representation for classification. Therefore, the model may focus more on the same and common semantic parts, such as the bird's mouth. However, many food categories have non-rigid structures and do not have fixed semantic information. These experimental results demonstrate that directly adopting existing fine-grained methods does not necessarily achieve optimal performance for large-scale food recognition, indicating that we should design food-oriented networks for further performance improvement. The performance of food recognition methods such as PAR-Net is also not satisfactory. We speculate that these food recognition methods are more suitable for existing small or medium-scale food datasets, such as ETH Food-101 and VireoFood-172. From Table II, we can see that our method achieves competitive results on Food2K. Our method outperforms the backbone (ResNet50) by 2.24% and 1.47% in Top-1 and Top-5 classification accuracy, respectively, and it also outperforms PMG [64] by 1.74% in Top-1 classification accuracy, even though PMG adopts various granularities to learn local features. This verifies the advantage of combining the progressive training strategy and self-attention to enhance local feature representation.

TABLE II: COMPARISON OF OUR APPROACH (PRENET) TO BASELINES ON FOOD2K (%)

Fig. 8. Some classification results on Food2K. GT means the ground truth. The dishes in red are not correctly classified in the Top-1 results.

We further show some predicted examples by PRENet (ResNet50) in Fig. 8. We observe that there are still some wrongly predicted results, and they mainly come from more fine-grained confusion, e.g., "Shrimp pizza" and "Sausage pizza", or "Pickle" and "Spicy stir-fried rice cakes", revealing opportunities for algorithmic improvements to handle these challenges.

3) Ablation Study on PRENet: We conduct various ablation studies to understand the effectiveness of our method from different aspects, where ResNet50 is used as the backbone network.

Effect of Different Components: We study the effect of different components in our method, including the progressive learning strategy (PL), region feature enhancement (RE) and their combination. For comparison, we introduce another baseline, Simple Fusion (SF), which simply uses concatenated features from the last three layers without PL and RE for recognition. As shown in Fig. 9(a), we can see that: (1) the introduction of PL brings a recognition performance gain; (2) the combination of PL and RE gives a further performance boost, which shows that our method is effective in learning and enhancing local features via both PL and RE.

The Number of Learning Stages U: We study the effect of our method when changing the number of learning stages U. The results are reported in Fig. 9(b). It is clear that increasing U boosts the model performance. Our model achieves 81.45%, 82.11% and 83.03% Top-1 classification accuracy for U = 1 to 3 on Food2K, i.e., consecutive gains of 0.66% and 0.92%. However, we notice that the accuracy starts to decrease when U = 4. The possible reason is that low-stage layers mainly focus on class-irrelevant features. With deep progressive training, too many stages probably force the model to find class-relevant information there and introduce noise for the class, which may cause the model to generalize poorly on the test set, so the overall performance decreases.

Effect of Different Learning Stages: To better verify the contribution of each learning stage and the final concat stage, we also test the accuracy using the prediction from each stage separately. The results are reported in Fig. 9(c). We observe that the concat learning stage achieves the best Top-1 classification accuracy. This reveals that our method captures and fuses complementary information from different stages, and thus achieves the best recognition performance.

Balance Parameters α and β: We study the influence of the two balance parameters α and β in the total loss (10). As shown in Fig. 9(d), when α = 0 and β = 1, the total loss is only optimized by the KL divergence, and the model cannot converge.


With α increasing, our model performs better until it reaches a tipping point at α = 0.8 and β = 0.2. When α = 1, our model is only optimized by the cross entropy loss, and the Top-1 classification accuracy is reduced by 3.18%, which proves that introducing the KL divergence obtains better performance. This is because the KL divergence forces the multi-scale features to focus on different areas, and thus helps capture as many details as possible.

Fig. 9. Ablation study of PRENet on Food2K: (a) Different components. (b) Different numbers of learning stages U. (c) Each learning stage. (d) Different balance parameters (α, β).

Fig. 10. Visualization results of the proposed progressive learning on some samples from Food2K. The original method means we use feature maps from the last three layers of the network without PL for visualization.

4) Visualization Analysis: To gain further insight into our method, we conduct a visualization analysis via Grad-CAM [86]. We visualize the output feature maps from different learning stages and compare our method with the original one in Fig. 10. The original method means we use feature maps from the last three layers of the network without PL and RE for visualization. For these typical examples, the attentional regions expand as the stages go on, and more discriminative and detailed parts are included. Moreover, our method can capture various local features in different stages owing to the introduction of the KL-divergence. Take "Wasabi octopus" as an example (the second row in Fig. 10): the baseline only obtains limited information and its different feature maps tend to focus on similar parts. In contrast, for our method, Stage I pays more attention to the "vegetable leaf", while Stage II mainly focuses on the "octopus". For Stage III, the overall characteristics of the food can be captured, and thus both global and local features are utilized for recognition.
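Grad-CAM maps like those in Fig. 10 can be produced with a forward and a backward hook on the stage of interest; the minimal, generic sketch below (model and target layer are placeholders) weights the layer's activations by the spatially averaged gradients of the predicted class score.

```python
import torch

def grad_cam(model, layer, image, class_idx=None):
    """Minimal Grad-CAM: weight the chosen layer's activations by the
    spatially averaged gradients of the target class score."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        scores = model(image)                      # (1, num_classes)
        if class_idx is None:
            class_idx = int(scores.argmax(dim=1))
        model.zero_grad()
        scores[0, class_idx].backward()
        weights = grads[0].mean(dim=(2, 3), keepdim=True)   # GAP of gradients
        cam = torch.relu((weights * acts[0]).sum(dim=1))    # (1, H, W)
        cam = cam / (cam.max() + 1e-12)
    finally:
        h1.remove(); h2.remove()
    return cam

# Example with a torchvision backbone; any stage can be inspected.
from torchvision.models import resnet50
net = resnet50(weights=None).eval()
cam = grad_cam(net, net.layer3, torch.randn(1, 3, 448, 448))
print(cam.shape)
```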
TABLE III: PERFORMANCE COMPARISON ON ETH FOOD-101 (%)

5) Recognition Performance on Other Datasets: Considering that ETH Food-101 has been a standard benchmark in computer vision, we also conduct an evaluation on this dataset to verify the effectiveness of the proposed method. As shown in Table III, we can observe that when adopting the same setting, our method achieves the best Top-1 classification accuracy compared with existing methods. For


example, our method outperforms typical fine-grained methods with ResNet50 as the backbone. Compared with existing food recognition methods, such as MSMVFA [12], our method also obtains the highest Top-1 classification accuracy of 90.74%. When we use the backbone model trained on Food2K, namely PRENet (SENet154+Pretrained), there is a further performance improvement.

TABLE IV: RESULTS OF TRANSFERRING VISUAL REPRESENTATIONS LEARNED ON FOOD2K TO THREE DATASETS (%)

B. Generalization Ability of Food2K

In this section, we conduct a comprehensive evaluation of the generalization ability of Food2K in various vision and multimodal tasks, including food recognition, food image retrieval, cross-modal recipe retrieval, food detection and segmentation.

1) Food Recognition: We assess the generalization of models learned using Food2K to ETH Food-101. In addition, we also conduct the evaluation on another two datasets from the multimedia field, namely Vireo Food-172 and ISIA Food-500. All of the presented experiments follow the same training-test splits as in the corresponding papers. Representative methods, including baseline networks, fine-grained recognition and food recognition methods, are used for evaluation. For each method, there are two settings. Take VGG16 as an example: VGG16 denotes that we use the target dataset to fine-tune the ImageNet pre-trained network, while VGG16 + Fine-tuned on Food2K denotes that we first use Food2K to fine-tune the ImageNet pre-trained network, and then fine-tune it on the target dataset for evaluation.
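The two settings can be reproduced with standard fine-tuning: either start from ImageNet weights and train on the target dataset directly, or first fine-tune on Food2K (2,000 classes) and then swap the classifier head for the target dataset. A hedged sketch with ResNet152 follows; the checkpoint filename is hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152, ResNet152_Weights

def imagenet_baseline(num_target_classes):
    """Setting 1: ImageNet-pretrained backbone fine-tuned on the target dataset."""
    model = resnet152(weights=ResNet152_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)
    return model

def food2k_transfer(num_target_classes, food2k_ckpt="resnet152_food2k.pth"):
    """Setting 2: the same backbone first fine-tuned on Food2K (2,000 classes),
    then its head is replaced and the whole network is fine-tuned on the target."""
    model = resnet152(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 2000)
    state = torch.load(food2k_ckpt, map_location="cpu")   # hypothetical checkpoint
    model.load_state_dict(state, strict=False)
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)
    return model

model = imagenet_baseline(101)   # e.g., ETH Food-101 has 101 classes
```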
We report the experimental results in Table IV. From Table IV, we can see that all the transferred features are better than training on the target datasets alone. For example, the network fine-tuned on Food2K improves Top-1 classification accuracy by 7.33%, 2.58%, 2.31%, 2.61% and 2.74% for VGG16, ResNet152, Inception V3, DenseNet161 and SENet154, respectively, on Vireo Food-172. These results show that features learned on Food2K generalize well for food recognition. The average performance improvement over several popular neural networks, including VGG16, ResNet152, Inception V3, DenseNet161 and SENet154, is 1.68%, 3.51% and 3.41% in Top-1 classification accuracy for ETH Food-101, Vireo Food-172 and ISIA Food-500, respectively, indicating that the higher performance gains come from Vireo Food-172 and ISIA Food-500 while the lowest gain is from ETH Food-101. The probable reason is as follows: both ISIA Food-500 and Food2K are miscellaneous (including both eastern and western cuisines), so their domain gap is relatively small. In addition, there is a larger proportion of eastern food categories in Food2K and Vireo Food-172 consists of eastern cuisines, also resulting in a smaller domain gap between Food2K and Vireo Food-172. For more complicated networks, we also observe higher gains on Vireo Food-172. Take PAR-Net as an example: the performance improvement is 0.63%, 1.32% and 0.74% in Top-1 classification accuracy for ETH Food-101, Vireo Food-172 and ISIA Food-500.

We further compare the performance of transfer learning from different food datasets, and discuss the recognition performance with models trained on different ratios of Food2K. Details are in the supplementary materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2023.3237871.

2) Food Image Retrieval: We further validate the generalization ability of Food2K on food image retrieval. This task is to find food images in a database whose content is relevant to a given query. We conduct the evaluation on ETH Food-101, Vireo Food-172 and ISIA Food-500 with the same


training-test splits as in the corresponding papers. Each image in the test set is used in turn as the query, and the retrieval set is formed by all the remaining images of the test set. We initialize the networks using the convolutional layers of ResNet101. The models are optimized using Adam. The initial learning rate is l_0 = 2 × 10^−5 with an exponential decay l_0 exp(−0.1i) over epoch i, momentum 0.9, weight decay 5 × 10^−4, margin τ = 0.85 for the contrastive and triplet losses, and a batch size of 32. All training images are resized to a maximum size of 362 × 362 while keeping the original aspect ratio [89]. mAP and Recall@1 are used as the evaluation metrics. The following methods are used: fine-tuning the network using the cross-entropy loss, and metric-learning based methods using the contrastive loss [87] and triplet loss [88], respectively. ResNet101 is used as the backbone network for all methods.
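The metric-learning retrieval baselines can be assembled from standard PyTorch components using the hyper-parameters listed above (ResNet101 convolutional layers, Adam at l_0 = 2 × 10^−5 with exponential decay, margin 0.85). The sketch below shows the triplet-loss variant; triplet sampling and the pretrained weights to load are left abstract.

```python
import math
import torch
import torch.nn as nn
from torchvision.models import resnet101

class RetrievalNet(nn.Module):
    """ResNet-101 convolutional layers + global average pooling as the embedding."""
    def __init__(self):
        super().__init__()
        net = resnet101(weights=None)   # Food2K- or ImageNet-pretrained in practice
        self.backbone = nn.Sequential(*list(net.children())[:-2])

    def forward(self, x):
        f = self.backbone(x).mean(dim=(2, 3))
        return nn.functional.normalize(f, dim=1)

model = RetrievalNet()
criterion = nn.TripletMarginLoss(margin=0.85)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=5e-4)
# l_0 * exp(-0.1 * epoch) learning-rate decay, as described above.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch: math.exp(-0.1 * epoch))

anchor, positive, negative = (torch.randn(4, 3, 362, 362) for _ in range(3))
loss = criterion(model(anchor), model(positive), model(negative))
loss.backward()
```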
The results in Table V show that the methods using the back- from Food2K on food detection, where the task is to detect food
bone fine-tuned from Food2K all achieve the performance gain items from food trays. For comparison, we also conduct the
with various degrees on these benchmarks. The improvement evaluation on models trained from ETH Food-101 on this task.
in Vireo Food-172 is the highest: the average improvement As the food detection model is supposed to have the ability of
on these methods is 4.04%, 5.28% and 4.16% in mAP for detecting negative samples (e.g., background), we add 2,500
ETH Food-101, Vireo Food-172 and ISIA Food-500, which non-food instance samples from [30] to ETH Food-101 and
is consistent with the performance trend in food recognition. Food2K to fine-tune backbone models for detection. We conduct
Further observation on Vireo Food-172 shows that when the the evaluation on two available datasets UNIMIB2016 [98] and
performance from the base model is very high (probably close Oktoberfest [99] with original training-test splitting. Fig. 11
to saturation), the improvement from the model fine-tuned on shows some examples. We train all the detectors using stochastic
Food2K is relatively small even the domain gap is small. For gradient descent with a batch-size of 8. The learning rate is 10−3
example, there is only 0.51% improvement for the triplet loss in and 5 × 10−3 on UNIMIB2016 and Oktoberfest, respectively,
Vireo Food-172. In contrast, 4.32% improvement is obtained in and the input size of image is set to 512 × 512. The COCO style
ETH Food-101. Note that there is no performance improvement mAP, AP50 and AP75 are adopted as metrics, where the settings
using additional ETH Food-101 for metric-learning methods in of IoU (Intersection over Union) threshhold for mAP, AP50 and
our context. We speculate that metric-learning methods are more AP75 are 0.50:.05:.95, 0.5 and 0.75 respectively. Two single-
sensitive to the larger domain gap between ETH Food-101 and stage (SSD [92], RetinalNet [93]) and four two-stage detection
these two target datasets (especially Vireo Food-172), which models (Faster-RCNN [94],PAN [95],Cascade-RCNN [96], Dy-
indirectly indicates the higher diversity of categories and scale namic R-CNN [97]) with the same backbones mentioned in their
of images from Food2K enable better and stable generalization papers are used for evaluation.
on food image retrieval. Table VII reports the detection results of different methods
3) Cross-Modal Recipe Retrieval: We evaluate the general- using backbones from ImageNet, ETH Food-101 and Food2K,
ization of Food2K on cross-modal recipe retrieval on Recipe1M, and shows that (1) all the methods obtain the improvement
which is currently one popular task in the computer vision on the model from ETH Food-101 and Food2K in mAP and
community [47]. We adopt the original data splits with 238,999 AP75, and (2) the performance gain from Food2K is higher
image-recipe pairs, 51,119 pairs and 51,303 pairs for train- than ETH Food-101, indicating the advantage of Food2K in both
ing, validation and testing, respectively, and similar experiment categories and image numbers. Particularly, the average mAP of
setup [47]. The evaluation metrics are the median retrieval rank all detectors from Food2K on UNIMIB2016 (66.4%) is higher
(MedR) and the recall percentage at top K (Recall@K), i.e., the (0.9%) than it from ETH Food-101 (65.5%). Similar trend can
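For reference, the triplet-loss baseline described above can be sketched as follows (an illustrative PyTorch reimplementation with the listed hyper-parameters, not the code used to produce Table V; the triplet data loader, the epoch budget and the ImageNet initialization are assumptions standing in for the compared backbones):

    import math
    import torch
    import torch.nn as nn
    from torchvision import models

    class EmbeddingNet(nn.Module):
        def __init__(self):
            super().__init__()
            # ImageNet weights stand in here; the compared backbones would instead be
            # fine-tuned on ETH Food-101 or Food2K.
            backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
            self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc classifier

        def forward(self, x):
            f = self.features(x).flatten(1)
            return nn.functional.normalize(f, dim=1)  # L2-normalized 2048-D embedding

    model = EmbeddingNet()
    criterion = nn.TripletMarginLoss(margin=0.85)
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda i: math.exp(-0.1 * i))

    for epoch in range(30):                                # epoch budget: placeholder
        for anchor, positive, negative in triplet_loader:  # triplet_loader: hypothetical DataLoader
            optimizer.zero_grad()
            loss = criterion(model(anchor), model(positive), model(negative))
            loss.backward()
            optimizer.step()
        scheduler.step()                                   # learning rate follows l0 * exp(-0.1 * epoch)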


TABLE VI
EVALUATING VISUAL REPRESENTATION LEARNED FROM ETH FOOD-101 AND FOOD2K FOR CROSS-MODAL RECIPE RETRIEVAL ON THE RECIPE1M DATASET (%)

TABLE VII
EVALUATING VISUAL REPRESENTATION LEARNED FROM ETH FOOD-101 AND FOOD2K FOR FOOD DETECTION ON UNIMIB2016 AND OKTOBERFEST (%)

3) Cross-Modal Recipe Retrieval: We evaluate the generalization of Food2K on cross-modal recipe retrieval on Recipe1M, which is currently a popular task in the computer vision community [47]. We adopt the original data splits with 238,999 image-recipe pairs, 51,119 pairs and 51,303 pairs for training, validation and testing, respectively, and a similar experimental setup [47]. The evaluation metrics are the median retrieval rank (MedR) and the recall percentage at top K (Recall@K), i.e., the percentage of queries for which the matching answer is included in the top K results (K = 1, 5, 10).
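For clarity, both metrics can be computed from an image-recipe similarity matrix as sketched below (a generic NumPy helper of our own, assuming the i-th image is paired with the i-th recipe in the test fold):

    import numpy as np

    def medr_and_recall(sim, ks=(1, 5, 10)):
        # sim[i, j]: similarity between query i (e.g., an image) and candidate j (e.g., a recipe);
        # the ground-truth match of query i is assumed to be candidate i.
        order = np.argsort(-sim, axis=1)                 # candidates sorted by decreasing similarity
        ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(sim.shape[0])])
        medr = float(np.median(ranks))
        recall = {k: 100.0 * float(np.mean(ranks <= k)) for k in ks}
        return medr, recall

    sim = np.random.rand(1000, 1000)                     # stand-in for one 1k test fold
    print(medr_and_recall(sim))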


Table VI reports the experimental results of different methods using backbones from ImageNet, ETH Food-101 and Food2K, and shows that (1) all the methods obtain improvements with the models from ETH Food-101 and Food2K in MedR and R@K, and (2) the performance gain from Food2K is higher than that from ETH Food-101. These experimental results prove that the backbone trained on Food2K is more helpful for visual food embedding learning, and thus improves cross-modal embedding learning. This is because of the diversity and scale of Food2K. Note that for JE [47] on the 10k test size we only report MedR, since the original literature uses this metric alone; we adopt it for a consistent comparison.

4) Food Detection: We assess the generalization of models from Food2K on food detection, where the task is to detect food items from food trays. For comparison, we also conduct the evaluation on models trained from ETH Food-101 on this task. As the food detection model is supposed to have the ability of detecting negative samples (e.g., background), we add 2,500 non-food instance samples from [30] to ETH Food-101 and Food2K to fine-tune the backbone models for detection. We conduct the evaluation on two available datasets, UNIMIB2016 [98] and Oktoberfest [99], with their original training-test splits. Fig. 11 shows some examples. We train all the detectors using stochastic gradient descent with a batch size of 8. The learning rate is 10−3 and 5 × 10−3 on UNIMIB2016 and Oktoberfest, respectively, and the input image size is set to 512 × 512. The COCO-style mAP, AP50 and AP75 are adopted as metrics, where the IoU (Intersection over Union) thresholds for mAP, AP50 and AP75 are 0.50:.05:.95, 0.5 and 0.75, respectively. Two single-stage detectors (SSD [92], RetinaNet [93]) and four two-stage detection models (Faster R-CNN [94], PAN [95], Cascade R-CNN [96], Dynamic R-CNN [97]) with the same backbones mentioned in their papers are used for evaluation.
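The IoU criterion behind mAP, AP50 and AP75 can be stated compactly; the helper below is a generic sketch of our own (boxes given as [x1, y1, x2, y2]), not part of any particular detection toolbox used in these experiments:

    def iou(box_a, box_b):
        # Boxes are [x1, y1, x2, y2]; returns intersection-over-union in [0, 1].
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-12)

    # A detection counts as a true positive at threshold t if iou(pred, gt) >= t (with one-to-one matching);
    # AP50 fixes t = 0.5, AP75 fixes t = 0.75, and COCO mAP averages AP over the thresholds below.
    coco_thresholds = [0.50 + 0.05 * i for i in range(10)]
    print(iou([0, 0, 10, 10], [5, 5, 15, 15]), coco_thresholds)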

Table VII reports the detection results of different methods using backbones from ImageNet, ETH Food-101 and Food2K, and shows that (1) all the methods obtain improvements with the models from ETH Food-101 and Food2K in mAP and AP75, and (2) the performance gain from Food2K is higher than that from ETH Food-101, indicating the advantage of Food2K in both categories and image numbers. In particular, the average mAP of all detectors from Food2K on UNIMIB2016 (66.4%) is higher (by 0.9%) than that from ETH Food-101 (65.5%). A similar trend can be found on Oktoberfest. This can be explained by the diversity of food categories present in Food2K: with the supplement of more new food categories from Food2K, the trained backbones generalize better to food detection. In addition, considering that UNIMIB2016 provides more accurate annotations than Oktoberfest and that its environment background is more stable, a backbone that delivers better visual food features gives the detectors a more obvious advantage in judging the categories of food instances after selecting positive proposals from the background, resulting in a larger performance gain on UNIMIB2016 than on Oktoberfest. Moreover, since two-stage models depend more on proposals based on accurately extracted features, they obtain a bigger increase in precision than the two single-stage detectors: the average mAP growth of the four two-stage models is 2.1%, while that of the two single-stage models is 1.6% on the two target datasets. This confirms that more discriminative features from better backbones can provide more obvious performance gains for two-stage detectors, while single-stage detectors rely relatively more on the structure of the detection head.

Fig. 11. Comparison of food detection results via Dynamic R-CNN trained on ETH Food-101 and Food2K.

We then visualize some detection results in Fig. 11, and find that detectors with backbones from Food2K perform better in providing high-quality proposals for all methods. Meanwhile, representative examples chosen from the test samples of Oktoberfest show that the reason for the relatively lower AP50 of some R-CNN detectors (e.g., PAN [95]) from Food2K is that models that learn abundant food features tend to recognize all the food instances in an image even when they are wrongly labeled, and AP50 with an IoU threshold of 0.5 magnifies this "precise" flaw, as it is more tolerant of low-quality proposals and cannot reflect models with high precision, as mentioned in [96]. Therefore, several models with higher mAP and AP75 but lower AP50 can still be regarded as more accurate detectors.

5) Food Segmentation: We assess the generalization ability of Food2K on food image segmentation on the recently released UEC-FoodPix Complete dataset [46], which consists of 9,000 training images and 1,000 testing ones. The mean Intersection over Union (mIoU) and Pixel Accuracy (Pix Acc) are employed to evaluate the performance, where mIoU is a standard measurement for semantic segmentation that evaluates the overlap and the union between the prediction and the ground truth, and Pix Acc is a simpler measurement that is the accuracy over all pixels. Various models including FCN [100], SegNet [101], PSPNet [102], DUC_HDC [103], GCN [104] and DeepLabv3+ [105] are adopted for evaluation. We employ the same learning rate schedule ("poly" policy, momentum 0.9 and the same initial learning rate 0.01), a crop size of 400 × 400, fine-tuning of the batch normalization parameters when the output stride is 16, and random-scale data augmentation during training. For the backbones, we similarly add 2,500 non-food instance samples from [30] to ETH Food-101 and Food2K so that the fine-tuned backbone models have the ability of segmenting negative samples.
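The mIoU and Pix Acc metrics above can be made concrete with a small helper (a generic NumPy sketch of our own, not the evaluation code shipped with UEC-FoodPix Complete), computed from a per-pixel confusion matrix:

    import numpy as np

    def segmentation_metrics(pred, gt, num_classes):
        # pred, gt: integer label maps of identical shape.
        conf = np.zeros((num_classes, num_classes), dtype=np.int64)
        np.add.at(conf, (gt.ravel(), pred.ravel()), 1)   # rows: ground truth, columns: prediction
        pix_acc = np.diag(conf).sum() / conf.sum()
        union = conf.sum(0) + conf.sum(1) - np.diag(conf)
        iou = np.diag(conf) / np.maximum(union, 1)
        miou = iou[conf.sum(1) > 0].mean()               # average IoU over classes present in the ground truth
        return float(miou), float(pix_acc)

    gt = np.random.randint(0, 5, (64, 64))
    pred = np.random.randint(0, 5, (64, 64))
    print(segmentation_metrics(pred, gt, num_classes=5))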


TABLE VIII
EVALUATING VISUAL REPRESENTATION LEARNED FROM ETH FOOD-101 AND FOOD2K FOR FOOD SEGMENTATION ON THE UEC-FOODPIX COMPLETE DATASET (%)

Fig. 12. Comparison of food segmentation results via DeeplabV3 trained on ETH Food-101 and Food2K.

Table VIII reports the segmentation results of different methods. We can see that training the backbone with Food2K provides a higher performance gain for most of the methods. For example, we obtain more than 2 points of improvement over the ImageNet pre-trained model for the DeepLabv3+ model, and also an improvement over ETH Food-101. We also provide visual comparison results with DeeplabV3 on UEC-FoodPix Complete in Fig. 12, which show that our model can segment food regions more accurately than the others.

C. Discussion

Our final set of experiments demonstrates the generality of the features learned from Food2K for various vision and multimodal tasks, indicating its usefulness and value. We believe this is because of the higher diversity and larger scale of Food2K. Below we discuss some potential research problems and methods based on Food2K.

1) Large-scale robust food recognition: Based on our experimental results, although existing fine-grained recognition methods, e.g., PMG [64], obtain state-of-the-art performance on existing fine-grained datasets, they fail to obtain the desired performance on Food2K. In addition, some recently proposed food recognition methods, such as PAR-Net [9], have achieved better performance on small or medium-scale recognition datasets. However, they also fail to obtain better performance for large-scale food recognition on Food2K. We speculate that, as both the diversity and the scale of the food data increase, more complex visual patterns of food are generated from different ingredients, accessories and arrangements, and these methods are not suitable or robust for this case. As one initial attempt, we combined progressive training and self-attention to learn more stable and discriminative global and local features, resulting in good performance. More methods are worth further exploration. For example, transformers have recently made a tremendous impact in image recognition [106], where their performance is higher than that of CNNs on large-scale datasets. Food2K can provide sufficient training data to develop transformer-based food recognition methods and improve their performance; a minimal fine-tuning sketch in this direction is given after this list.

2) Human vision evaluation on food recognition: Conducting human vision research on Food2K is also an interesting topic to study. Compared with human vision research on generic object recognition, it is probably more difficult to conduct such an evaluation for food recognition. For example, food has strong regional and cultural characteristics, and human subjects from different regions thus have a stronger bias in food recognition. A recent work [107] gives an initial empirical comparison between the human visual system and CNNs on the food recognition task. In order to avoid information overload, the number of dishes to learn was restricted to 16 different types of food for the human subjects. More interesting problems can be further explored. For example, what is the upper bound of human performance on food recognition? What are the respective advantages and disadvantages of the human vision system and CNNs in recognizing different food types and numbers of categories? Moreover, knowledge from other fields, e.g., food science, is probably needed to further explain the experimental results.

3) Cross-X transfer learning for food recognition: We have verified the generalization of Food2K in various vision and multimodal tasks. We can study transfer learning from more aspects in the future. For example, food has its own geographical and cultural attributes, so we can conduct cross-cuisine transfer learning, that is, use models trained on eastern cuisines for performance analysis on western cuisines, and vice versa. After more fine-grained scenario annotation, such as region-level or even restaurant-level annotation, we can conduct cross-scenario transfer learning for food recognition. In addition, we can also study cross-super-class transfer learning for food recognition; for example, we can use models trained on the seafood super-class for performance analysis on the meat super-class. These interesting problems are worth deep exploration.

4) Large-Scale Few-Shot Food Recognition (LS-FSFR): Recently, there have been some works on few-shot food recognition with small/medium-scale food categories [14], [43]. In contrast, LS-FSFR is a more realistic task that aims to identify hundreds of novel food categories without forgetting those categories, where each novel category has only a few samples [108]. Food2K provides such a large-scale food dataset and test benchmark to support this task. In addition, the constructed food ontology can also help the method design of LS-FSFR as prior knowledge.

5) More applications on Food2K: We have verified the better generalization ability of Food2K in various tasks in this paper, including food image recognition, food image retrieval, cross-modal recipe retrieval, food detection and segmentation. Furthermore, Food2K can also support more novel applications. Food image generation is one novel and interesting application: it can synthesize new food images that are similar to those in real-life scenarios via Generative Adversarial Networks (GANs) [109]. For example, the method of Zhu et al. [54] can generate highly realistic and semantically consistent images from given ingredients and instructions. Another work [52] aims to teach a machine how to make a pizza by building a generative model that mirrors the step-by-step procedure. Different GANs, such as Lightweight GAN [110], can also be used to generate synthetic food images based on Food2K. Please refer to the supplementary materials, available online, for more details about the evaluation of food image generation on Food2K.

6) Extension of Food2K for more tasks: Researchers are encouraged to apply models trained on Food2K to more food-relevant tasks. Moreover, we hope Food2K will evolve over time. Considering that some works [15] have shown that ingredients can improve recognition performance, we plan to extend Food2K by providing richer attribute annotations to support food recognition at different semantic levels. We can also conduct region-level and pixel-level annotation on Food2K to enable a broader range of applications. In addition, we can also conduct some novel tasks, such as aesthetic assessment of food images via annotating aesthetic attribute labels on Food2K [111].
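As a possible starting point for the transformer-based direction mentioned in item 1), the sketch below fine-tunes a ViT backbone on Food2K with the timm library; the model choice, optimizer settings and data pipeline are illustrative assumptions rather than configurations evaluated in this paper:

    import timm
    import torch
    import torch.nn as nn

    # Illustrative only: a ViT-B/16 backbone with a 2,000-way head for Food2K.
    model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=2000)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for images, labels in food2k_loader:                 # food2k_loader: hypothetical DataLoader of 224x224 crops
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()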
VI. CONCLUSION

In this paper, we present Food2K with larger data volume, larger category coverage and higher diversity compared with existing datasets, which can serve as a new benchmark for scalable food recognition. It can benefit various vision and multimodal tasks, including food recognition, retrieval, detection, segmentation and cross-modal recipe retrieval, owing to its better generalization ability. To date, Food2K is the largest food recognition dataset in terms of diversity and scale. We believe it will enable the development of large-scale food recognition methods, and will also help researchers utilize Food2K for future research on more food-relevant tasks, such as large-scale few-shot food recognition and transfer learning for food recognition from various aspects, such as cross-scenario, cross-cuisine and cross-super-class transfer learning.


REFERENCES [26] R. Xu, L. Herranz, S. Jiang, S. Wang, X. Song, and R. Jain, “Geolocalized
modeling for dish recognition,” IEEE Trans. Multimedia, vol. 17, no. 8,
[1] W. Min, S. Jiang, L. Liu, Y. Rui, and R. Jain, “A survey on food pp. 1187–1199, Aug. 2015.
computing,” ACM Comput. Surv., vol. 52, no. 5, pp. 1–36, 2019. [27] G. M. Farinella, D. Allegra, and F. Stanco, “A benchmark dataset to
[2] R. G. Boswell, W. Sun, S. Suzuki, and H. Kober, “Training in cognitive study the representation of food images,” in Proc. Eur. Conf. Comput.
strategies reduces eating and improves food choice,” Proc. Nat. Acad. Vis., 2014, pp. 584–599.
Sci., vol. 115, no. 48, pp. E11 238–E11 247, 2018. [28] F. Zhou and Y. Lin, “Fine-grained image classification by exploring
[3] T. David and C. Michael, “Global diets link environmental sustainability bipartite-graph labels,” in Proc. Conf. Comput. Vis. Pattern Recognit.,
and human health,” Nature, vol. 515, no. 7528, pp. 518–22, 2014. 2016, pp. 1124–1133.
[4] P. Rozin, “The selection of foods by rats, humans, and other animals,” in [29] M. Merler, H. Wu, R. Uceda-Sosa, Q.-B. Nguyen, and J. R. Smith, “Snap,
Advances in the Study of Behavior, vol. 6. New York, NY, USA: Academic eat, repeat: A food recognition engine for dietary logging,” in Proc. Int.
Press, 1976, pp. 21–76. Workshop Multimedia Assist. Dietary Manage., 2016, pp. 31–40.
[5] A. Meyers et al., “Im2Calories: Towards an automated mobile vision [30] A. Singla, L. Yuan, and T. Ebrahimi, “Food/non-food image classi-
food diary,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 1233–1241. fication and food categorization using pre-trained googlenet model,”
[6] Q. Thames et al., “Nutrition5k: Towards automatic nutritional under- in Proc. Int. Workshop Multimedia Assist. Dietary Manage., 2016,
standing of generic food,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern pp. 3–11.
Recognit., 2021, pp. 8903–8911. [31] G. M. Farinella, D. Allegra, M. Moltisanti, F. Stanco, and S. Battiato,
[7] Y. Lu, T. Stathopoulou, M. F. Vasiloglou, S. Christodoulidis, Z. Stanga, “Retrieval and classification of food images,” Comput. Biol. Med., vol. 77,
and S. Mougiakakou, “An artificial intelligence-based system to as- pp. 23–39, 2016.
sess nutrient intake for hospitalised patients,” IEEE Trans. Multimedia, [32] G. Ciocca, P. Napoletano, and R. Schettini, “Learning CNN-based fea-
vol. 23, pp. 1136–1147, 2021. tures for retrieval of food images,” in Proc. Int. Conf. Image Anal.
[8] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discrim- Process., 2017, pp. 426–434.
inative components with random forests,” in Proc. Eur. Conf. Comput. [33] X. Chen, H. Zhou, and L. Diao, “ChineseFoodNet: A large-scale image
Vis., 2014, pp. 446–461. dataset for Chinese food recognition,” 2017, arXiv: 1705.02743.
[9] J. Qiu, F. P.-W. Lo, Y. Sun, S. Wang, and B. Lo, “Mining discriminative [34] S. Hou, Y. Feng, and Z. Wang, “VegFru: A domain-specific dataset for
food regions for accurate food recognition,” in Proc. Brit. Mach. Vis. fine-grained visual categorization,” in Proc. Int. Conf. Comput. Vis., 2017,
Conf., 2019, pp. 588–598. pp. 541–549.
[10] N. Martinel, G. L. Foresti, and C. Micheloni, “Wide-slice residual net- [35] W. Min, L. Liu, Z. Luo, and S. Jiang, “Ingredient-guided cascaded
works for food recognition,” in Proc. IEEE/CVF Winter Conf. Appl. multi-attention network for food recognition,” in Proc. ACM Int. Conf.
Comput. Vis., 2018, pp. 567–576. Multimedia, 2019, pp. 99–107.
[11] P. Kaur, K. Sikka, W. Wang, S. J. Belongie, and A. Divakaran, “Foodx- [36] D. Sahoo et al., “FoodAI: Food image recognition via deep learning for
251: A dataset for fine-grained food classification,” in Proc. Conf. Com- smart food logging,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov.
put. Vis. Pattern Recognit. Workshop, 2019. Data Mining, 2019, pp. 2260–2268.
[12] S. Jiang, W. Min, L. Liu, and Z. Luo, “Multi-scale multi-view deep feature [37] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “ImageNet: A
aggregation for food recognition,” IEEE Trans. Image Process., vol. 29, large-scale hierarchical image database,” in Proc. Conf. Comput. Vis.
pp. 265–276, 2020. Pattern Recognit., 2009, pp. 248–255.
[13] L. Deng et al., “Mixed-dish recognition with contextual relation net- [38] K. Yanai and Y. Kawano, “Food image recognition using deep convo-
works,” in Proc. ACM Int. Conf. Multimedia, 2019, pp. 112–120. lutional network with pre-training and fine-tuning,” in Proc. IEEE Int.
[14] H. Zhao, K.-H. Yap, and A. Chichung Kot, “Fusion learning using se- Conf. Multimedia Expo Workshops, 2015, pp. 1–6.
mantics and graph convolutional network for visual food recognition,” in [39] W. Min, B.-K. Bao, S. Mei, Y. Zhu, Y. Rui, and S. Jiang, “You
Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 1711–1720. are what you eat: Exploring rich recipe information for cross-region
[15] J. Chen and C.-W. Ngo, “Deep-based ingredient recognition for cooking food analysis,” IEEE Trans. Multimedia, vol. 20, no. 4, pp. 950–964,
recipe retrieval,” in Proc. ACM Int. Conf. Multimedia, 2016, pp. 32–41. Apr. 2018.
[16] W. Min, L. Liu, Z. Wang, Z. Luo, X. Wei, and X. Wei, “ISIA Food- [40] E. Aguilar, B. Remeseiro, M. Bola nos, and P. Radeva, “Grab, Pay and Eat:
500: A dataset for large-scale food recognition via stacked global- Semantic food detection for smart restaurants,” IEEE Trans. Multimedia,
local attention network,” in Proc. ACM Int. Conf. Multimedia, 2020, vol. 20, no. 12, pp. 3266–3275, Dec. 2018.
pp. 393–401. [41] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food recognition
[17] J. Marín et al., “Recipe1M+: A dataset for learning cross-modal embed- using statistics of pairwise local features,” in Proc. Conf. Comput. Vis.
dings for cooking recipes and food images,” IEEE Trans. Pattern Anal. Pattern Recognit., 2010, pp. 2249–2256.
Mach. Intell., vol. 43, no. 1, pp. 187–203, Jan. 2021. [42] N. Martinel, C. Piciarelli, and C. Micheloni, “A supervised extreme learn-
[18] H. Wang, G. Lin, S. C. H. Hoi, and C. Miao, “Structure-aware generation ing committee for food recognition,” Comput. Vis. Image Understanding,
network for recipe generation from images,” in Proc. Eur. Conf. Comput. vol. 148, pp. 67–86, 2016.
Vis., 2020, pp. 359–374. [43] S. Jiang, W. Min, Y. Lyu, and L. Liu, “Few-shot food recognition via
[19] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, multi-view representation learning,” ACM Trans. Multimedia Comput.,
“PFID: Pittsburgh fast-food image dataset,” in Proc. Int. Conf. Image Commun. Appl., vol. 16, no. 3, pp. 87:1–87:20, 2020.
Process., 2009, pp. 289–292. [44] O. Beijbom, N. Joshi, D. Morris, S. Saponas, and S. Khullar,
[20] T. Joutou and K. Yanai, “A food image recognition system with multiple “Menu-match: Restaurant-specific food logging from images,” in Proc.
kernel learning,” in Proc. IEEE 16th Int. Conf. Image Process., 2009, IEEE/CVF Winter Conf. Appl. Comput. Vis., 2015, pp. 844–851.
pp. 285–288. [45] S. Horiguchi, S. Amano, M. Ogawa, and K. Aizawa, “Personalized
[21] H. Hoashi, T. Joutou, and K. Yanai, “Image recognition of 85 food classifier for food image recognition,” IEEE Trans. Multimedia, vol. 20,
categories by feature fusion,” in Proc. IEEE Int. Symp. Multimedia, 2010, no. 10, pp. 2836–2848, Oct. 2018.
pp. 296–301. [46] K. Okamoto and K. Yanai, “UEC-FoodPIX Complete: A large-scale food
[22] Y. Matsuda and K. Yanai, “Multiple-food recognition considering co- image segmentation dataset,” in Proc. Int. Conf. Pattern Recognit., 2021,
occurrence employing manifold ranking,” in Proc. Int. Conf. Pattern pp. 647–659.
Recognit., 2012, pp. 2017–2020. [47] A. Salvador et al., “Learning cross-modal embeddings for cooking recipes
[23] Y. Kawano and K. Yanai, “Automatic expansion of a food image dataset and food images,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2017,
leveraging existing categories with domain adaptation,” in Proc. Eur. pp. 3020–3028.
Conf. Comput. Vis., 2014, pp. 3–17. [48] H. Wang, D. Sahoo, C. Liu, E.-P. Lim, and S. C. Hoi, “Learning
[24] M. M. Anthimopoulos, L. Gianola, L. Scarnato, P. Diem, and S. G. cross-modal embeddings with adversarial networks for cooking recipes
Mougiakakou, “A food recognition system for diabetic patients based on and food images,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2019,
an optimized bag-of-features model,” IEEE J. Biomed. Health Inform., pp. 11 572–11 581.
vol. 18, no. 4, pp. 1261–1271, Jul. 2014. [49] D. P. Papadopoulos, E. Mora, N. Chepurko, K. W. Huang, F. Ofli,
[25] X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, “Recipe and A. Torralba, “Learning program representations for food images
recognition with large multimodal food dataset,” in Proc. IEEE Int. Conf. and cooking recipes,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Multimedia Expo, 2015, pp. 1–6. Recognit., 2022, pp. 16 559–16 569.


[50] H. Fu, R. Wu, C. Liu, and J. Sun, “MCEN: Bridging cross-modal [76] K. Yanai and Y. Kawano, “Food image recognition using deep convo-
gap between cooking recipes and dish images with latent vari- lutional network with pre-training and fine-tuning,” in Proc. IEEE Int.
able model,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, Conf. Multimedia Expo Workshops, 2015, pp. 1–6.
pp. 14 570–14 580. [77] H. Wu, M. Merler, R. Uceda-Sosa, and J. R. Smith, “Learning to make
[51] A. Salvador, E. Gundogdu, L. Bazzani, and M. Donoser, “Revamping better mistakes: Semantics-aware visual food recognition,” in Proc. ACM
cross-modal recipe retrieval with hierarchical transformers and self- Int. Conf. Multimedia, 2016, pp. 172–176.
supervised learning,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2021, [78] P. Pandey, A. Deepthi, B. Mandal, and N. B. Puhan, “FoodNet: Recog-
pp. 15 475–15 484. nizing foods using ensemble of deep networks,” IEEE Signal Process.
[52] D. P. Papadopoulos, Y. Tamaazousti, F. Ofli, I. Weber, and A. Torralba, Lett., vol. 24, no. 12, pp. 1758–1762, Dec. 2017.
“How to make a pizza: Learning a compositional layer-based GAN [79] S. Ao and C. X. Ling, “Adapting new categories for food recognition
model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, with deep representation,” in Proc. Int. Conf. Des. Minings Workshop,
pp. 7994–8003. 2015, pp. 1196–1203.
[53] F. Han, R. Guerrero, and V. Pavlovic, “CookGAN: Meal image synthesis [80] C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, and Y. Ma, “DeepFood:
from ingredients,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., Deep learning-based food image recognition for computer-aided dietary
2020, pp. 1439–1447. assessment,” in Proc. Conf. Inclusive Smart Cities Digit. Health, 2016,
[54] B. Zhu and C. Ngo, “CookGAN: Causality based text-to-image synthe- pp. 37–48.
sis,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5518–5526. [81] M. Bolanos and P. Radeva, “Simultaneous food localization and recog-
[55] A. Salvador, M. Drozdzal, X. Giró-i Nieto, and A. Romero, “Inverse nition,” in Proc. Int. Conf. Pattern Recognit., 2017, pp. 3140–3145.
cooking: Recipe generation from food images,” in Proc. IEEE/CVF Conf. [82] P. R. López, D. V. Dorta, G. C. Preixens, J. M. Gonfaus, and J. G. Sabaté,
Comput. Vis. Pattern Recognit., 2019, pp. 10 453–10 462. “Pay attention to the activations: A modular attention mechanism for
[56] A. Salvador, M. Drozdzal, X. Giró i Nieto, and A. Romero, “Inverse fine-grained image recognition,” IEEE Trans. Multimedia, vol. 22, no. 2,
cooking: Recipe generation from food images,” in Proc. Conf. Comput. pp. 502–514, Feb. 2020.
Vis. Pattern Recognit., 2019, pp. 10 445–10 454. [83] E. Aguilar, M. Bola nos, and P. Radeva, “Food recognition using fusion
[57] M. Nestle, Food Politics: How the Food Industry Influences Nutrition and of classifiers based on CNNs,” in Proc. Int. Conf. Image Anal. Process.,
Health, vol. 3. Berkeley, CA, USA: Univ. California Press, 2013. 2017, pp. 213–224.
[58] National food safety standard for uses of food additives (GB 31632– [84] H. Hassannejad, G. Matrella, P. Ciampolini, I. D. Munari, M. Mordonini,
2014), China Food Additives, no. 8X, 2015, Art. no. 28. and S. Cagnoni, “Food image recognition using very deep convolutional
[59] L. Zhang, S. Huang, W. Liu, and D. Tao, “Learning a mixture of networks,” in Proc. Int. Workshop Multimedia Assist. Dietary Manage.,
granularity-specific experts for fine-grained categorization,” in Proc. Int. 2016, pp. 41–49.
Conf. Comput. Vis., 2019, pp. 8331–8340. [85] N. Martinel, G. L. Foresti, and C. Micheloni, “Wide-slice residual net-
[60] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” works for food recognition,” in Proc. IEEE/CVF Winter Conf. Appl.
in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794–7803. Comput. Vis., 2018, pp. 567–576.
[61] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely [86] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and
connected convolutional networks,” in Proc. Conf. Comput. Vis. Pattern D. Batra, “Grad-CAM: Visual explanations from deep networks via
Recognit., 2017, pp. 4700–4708. gradient-based localization,” in Proc. Int. Conf. Comput. Vis., 2017,
[62] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. pp. 618–626.
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141. [87] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by
[63] S. Min, H. Yao, H. Xie, Z.-J. Zha, and Y. Zhang, “Multi-objective matrix learning an invariant mapping,” in Proc. Conf. Comput. Vis. Pattern
normalization for fine-grained visual recognition,” IEEE Trans. Image Recognit., 2006, pp. 1735–1742.
Process., vol. 29, pp. 4996–5009, 2020. [88] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embed-
[64] R. Du et al., “Fine-grained visual classification via progressive multi- ding for face recognition and clustering,” in Proc. Conf. Comput. Vis.
granularity training of jigsaw patches,” in Proc. Eur. Conf. Comput. Vis., Pattern Recognit., 2015, pp. 815–823.
2020, pp. 153–168. [89] F. Radenović, G. Tolias, and O. Chum, “Fine-tuning CNN image retrieval
[65] C. Szegedy et al., “Going deeper with convolutions,” in Proc. Conf. with no human annotation,” IEEE Trans. Pattern Anal. Mach. Intell.,
Comput. Vis. Pattern Recognit., 2015, pp. 1–9. vol. 41, no. 7, pp. 1655–1668, Jul. 2019.
[66] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, [90] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large
inception-resnet and the impact of residual connections on learning,” in margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10,
Proc. Conf. Assoc. Advance. Artif. Intell., 2017, pp. 4278–4284. no. 2, pp. 207–244, 2009.
[67] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for [91] M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, and M. Cord,
image recognition,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2016, “Cross-modal retrieval in the cooking context: Learning semantic text-
pp. 770–778. image embeddings,” in Proc. Int. ACM SIGIR Conf. Res. Develop. Inf.
[68] Z. Sergey and K. Nikos, “Wide residual networks,” in Proc. Brit. Mach. Retrieval, 2018, pp. 35–44.
Vis. Conf., 2016, pp. 87.1–87.12. [92] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf.
[69] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang, “Learning to Comput. Vis., 2016, pp. 21–37.
navigate for fine-grained classification,” in Proc. Eur. Conf. Comput. Vis., [93] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for
2018, pp. 438–454. dense object detection,” in Proc. Conf. Comput. Vis. Pattern Recognit.,
[70] C. Yu, X. Zhao, Q. Zheng, P. Zhang, and X. You, “Hierarchical bilinear 2017, pp. 2980–2988.
pooling for fine-grained visual recognition,” in Proc. Eur. Conf. Comput. [94] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
Vis., 2018, pp. 595–610. time object detection with region proposal networks,” in Proc. Int. Conf.
[71] Y. Chen, Y. Bai, W. Zhang, and T. Mei, “Destruction and construction Neural Inf. Process. Syst., 2015, pp. 91–99.
learning for fine-grained image recognition,” in Proc. Conf. Comput. Vis. [95] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for
Pattern Recognit., 2019, pp. 5157–5166. instance segmentation,” in Proc. Conf. Comput. Vis. Pattern Recognit.,
[72] T. Hu, H. Qi, Q. Huang, and Y. Lu, “See better before looking closer: 2018, pp. 8759–8768.
Weakly supervised data augmentation network for fine-grained visual [96] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality
classification,” 2019, arXiv: 1901.09891. object detection,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2018,
[73] S. Kornblith, J. Shlens, and Q. Le, “Do better ImageNet models transfer pp. 6154–6162.
better?,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2661– [97] H. Zhang, H. Chang, B. Ma, N. Wang, and X. Chen, “Dynamic R-CNN:
2671. Towards high quality object detection via dynamic training,” in Proc. Eur.
[74] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” Conf. Comput. Vis., 2020, pp. 260–275.
in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2403–2412. [98] G. Ciocca, P. Napoletano, and R. Schettini, “Food recognition: A new
[75] P. McAllister, H. Zheng, R. Bond, and A. Moorhead, “Combining deep dataset, experiments, and results,” IEEE J. Biomed. Health Inform.,
residual neural network features with supervised machine learning al- vol. 21, no. 3, pp. 588–598, May 2017.
gorithms to classify diverse food image datasets,” Comput. Biol. Med., [99] A. Ziller, J. Hansjakob, V. Rusinov, D. Zügner, P. Vogel, and S. Günne-
vol. 95, pp. 217–233, 2018. mann, “Oktoberfest food dataset,” 2019, arXiv: 1912.05007.


[100] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for Mengjiang Luo received the BE degree from the
semantic segmentation,” in Proc. Conf. Comput. Vis. Pattern Recognit., School of Yanbian University, Yianbian, China, in
2015, pp. 3431–3440. 2018. He is currently working toward the master’s
[101] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep con- degree in computer science with the Key Labora-
volutional encoder-decoder architecture for image segmentation,” IEEE tory of Intelligent Information Processing, Institute
Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, of Computing Technology, Chinese Academy of Sci-
Dec. 2017. ences, Beijing, China. His research interests include
[102] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing multimedia content analysis, understanding and food
network,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2881– recognition.
2890.
[103] P. Wang et al., “Understanding convolution for semantic segmentation,” in
Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2018, pp. 1451–1460.
[104] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters– Liping Kang received the graduation degree from
improve semantic segmentation by global convolutional network,” in the Xi’an Jiaotong University of Information Engi-
Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4353–4361. neering in July 2013, and the master’s degree from
[105] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder- the Chinese Academy of Sciences University in July
decoder with atrous separable convolution for semantic image segmen- 2016. She is currently the algorithm expert in Meituan
tation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 801–818. Vision AI Department, Beijing, China. Her current
[106] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for research interests include deep learning, computer
image recognition at scale,” 2020, arXiv: 2010.11929. vision, fine-grained image recognition and retrieval,
[107] P. Furtado, M. Caldeira, and P. Martins, “Human visual system ver- and their applications. She has applied for 17 patents
sus convolution neural networks in food recognition task: An empiri- as the first inventor, and 10 have been authorized.
cal comparison,” Comput. Vis. Image Understanding, vol. 191, 2020,
Art. no. 102878.
[108] A. Li, T. Luo, Z. Lu, T. Xiang, and L. Wang, “Large-scale few-shot learn-
ing: Knowledge transfer with class hierarchy,” in Proc. Conf. Comput.
Vis. Pattern Recognit., 2019, pp. 7212–7220.
[109] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Int. Conf. Xiaoming Wei is currently the leader of Vision
Neural Inf. Process. Syst., 2014, pp. 2672–2680. Understanding group, Computer Vision Division at
[110] B. Liu, Y. Zhu, K. Song, and A. Elgammal, “Towards faster and stabilized Meituan. His research interests focus on fine-grained
GAN training for high-fidelity few-shot image synthesis,” in Proc. Int. image recognition, multimodal analysis and genera-
Conf. Learn. Representations, 2021. tion, etc. He has led the team and got top rankings
[111] K. Sheng, W. Dong, H. Huang, C. Ma, and B.-G. Hu, “Gourmet photog- in several fine-grained matches such as Herbarium
raphy dataset for aesthetic assessment of food images,” in Proc. Conf. 2022 FGVC9 (the 1st place), Product Recognition in
SIGGRAPH Asia Tech. Briefs, 2018, pp. 1–4. CVPR2019 (the 2nd place) . He has published 10+
papers in CVPR, ECCV, IJCAI, ACM MM, AAAI,
etc.
Weiqing Min (Senior Member, IEEE) is currently an
Associate Professor with the Key Laboratory of Intel-
ligent Information Processing, Institute of Computing
Technology, Chinese Academy of Sciences. His re-
search interests include multimedia content analysis
and food computing. He has authored or coauthored
more than 50 peer-referenced papers in relevant jour- Xiaolin Wei received the PhD degree in computer
nals and conferences, including Patterns(Cell Press), science from Texas A & M University. He is
ACM Computing Surveys, Trends in Food Science now leading Computer Vision Division with Meituan.
& Technology, IEEE Transaction on Pattern Anal- His research area includes computer vision, machine
ysis and Machine Intelligence, IEEE Transaction on learning, computer graphics, 3D vision and aug-
Image Processing, IEEE Transaction on Multimedia, ACM MM, AAAI, etc. mented reality. He worked as a research engineer with
He organized several special issues on international journals, such as IEEE Google, CEO of Virtroid and principal engineer of
Transaction on Multimedia and IEEE Multimedia as a Guest Editor. He was a Magic Leap. He has been granted 40+ patents and
recipient of the 2016 ACM TOMM Nicolas D. Georganas Best Paper Award published 30+ papers in SIGGRAPH, ICCV, ECCV,
and the 2017 IEEE Multimedia Magazine Best Paper Award. ACM MM, IJCAI, etc.
Zhiling Wang received the BE degree from the
School of Geodesy and Geomatics, Wuhan Univer-
sity, Wuhan, China, in 2019 and the master’s degree
in computer science from the Chinese Academy of
Sciences University, Beijing, China, in 2022. He is Shuqiang Jiang (Senior Member, IEEE) is a profes-
currently the algorithm engineer in Meituan Vision AI sor with the Institute of Computing Technology, Chi-
Department, Beijing. His research interests include nese Academy of Sciences(CAS) and a professor with
food computing, fine-grained image recognition and the University of CAS. He is also with the Key Lab-
retrieval. oratory of Intelligent Information Processing, CAS.
His research interests include multimedia analysis,
multimodal intelligence and food computing. He has
Yuxin Liu received the BE degree from the School of authored or coauthored more than 200 papers on the
Computer Science and Technology, Shandong Uni- related research topics. He was supported by National
versity, Qingdao, China, in 2020. He is currently Science Fund for Distinguished Young Scholars in
working toward the PhD degree in computer sci- 2021. He won the CAS International Cooperation
ence with the Key Laboratory of Intelligent Infor- Award for Young Scientists, the CCF Award of Science and Technology, Wu
mation Processing, Institute of Computing Technol- Wenjun Natural Science Award for Artificial Intelligence, CSIG Natural Science
ogy, Chinese Academy of Sciences, Beijing, China. Award, and Beijing Science and Technology Progress Award. He is the associate
His research interests include computer vision, ma- editor of ACM Transactions on Multimedia Computing, Communications, and
chine learning and fine-grained and multi-label image Applications, vice chair of IEEE CASS Beijing Chapter, vice chair of ACM
recognition. SIGMM China Chapter.

