41,000 labeled instances of apples. The object instances are small compared to the image size, and a single image may contain between 1 and 120 objects. We collected data from multiple fruit varieties over two years to create the largest and most diverse dataset of its kind. We hope that this dataset will provide an important stepping stone in advancing the field of precision agriculture.

The rest of the paper is organized as follows: In Section II, we introduce current datasets and testing methods, as well as some of the algorithms used as baselines. We then introduce the dataset and annotation procedure in Sections III and IV. Section V contains the dataset statistics, and we evaluate benchmark algorithms in Section VI.

II. RELATED WORK

Many computer vision techniques rely on large datasets for training, testing, and comparing different approaches to a given problem. They not only provide the means to train and evaluate new algorithms but encourage direct comparison of results. Ultimately, they provide the means for researchers to tackle new and more challenging research problems. The ImageNet [5], Pascal VOC [10], and COCO [20] datasets have made millions of labeled images available to the public and enabled breakthroughs in image classification and object segmentation. Similarly, researchers released specialized datasets for autonomous driving [4], [31], [11] or pedestrian detection [9], [6]. While precision automation and automated yield mapping have seen much research effort [1], [30], [26], [3], [14], [13], each of these papers used its own dataset of varying completeness and level of detail.

A. Fruit detection

The first step in a yield estimation or fruit picking pipeline is the detection of the fruit. Early methods mostly relied on static color thresholds for detection. The limitations of these methods were often compensated by adding additional sensors, such as thermal or Near Infrared (NIR) cameras. Gongal et al. [12] offer a comprehensive overview of these early detection methods. More recent papers used object detection networks to detect fruits [29], [1], [13]. Sa et al. [29] used a combination of NIR and RGB images of fruits in indoor environments. Their dataset contains only 122 images from which training and test data are extracted. Bargoti and Underwood [1] used a similar network for apple detection. They released their dataset of roughly 1000 image crops that they used for training and testing. The images are of size 308 × 202 pixels with circular annotations of the fruits. Stein et al. [30] used a Faster RCNN network to detect mango fruits. Chen et al. [3] used a Fully Convolutional Network (FCN) to compute feature maps; integrating these feature maps gives them a yield estimate. They split their dataset of 71 images 50/50 between training and testing. In our previous work [26], [27], [28], [13] we presented results on HD-sized images showing parts of an orchard row. We presented multiple methods, including a semi-supervised Gaussian Mixture Model (GMM), a Faster R-CNN object detector, and a semantic segmentation network. The training dataset contained 100 images, and the test set contained 207 images. In contrast, we have increased the size of the dataset by a factor of 3.5 for this work.

B. Fruit counting

After detection, the fruits need to be counted. Rahnemoonfar and Sheppard [22] used synthetic data to train a network to classify images according to fruit counts. They tested their approach on 100 annotated images. Chen et al. [3] used a fully convolutional network together with a regression head for counting. They used a total of 71 orange and 21 apple images from which they extracted image patches for training. Roy and Isler [27] proposed an unsupervised counting method based on Gaussian Mixture Models. They used a manually annotated dataset of 440 images for testing. In our previous work [14], we used a neural network to count clustered fruits. We trained a network on 13000 patches and tested our approach on 4 different datasets with a total of 2800 images.

C. Comparison of Datasets

Table I summarizes the problems with current fruit detection and counting datasets. Because the labeling effort is time-consuming and costly, researchers have focused on small datasets with little if any in-dataset variety.
TABLE I: Comparison of datasets used in recent research papers on apple detection and counting.

Fruit Detection:
Method | type | # train images | # test images | # annotations | # scenes | resolution | ground truth | public
Bargoti and Underwood [1] | outdoor | 729 | 112 | 5765 | 1 | 308 × 202 | circles | yes
Stein et al. [30] | outdoor | 1154 | 250 | 7065 | 1 | 500 × 500 | circles | yes
Sa et al. [29] | indoor | 100 | 22 | 359 | 1 | 1296 × 964 | boxes | yes
Liu et al. [21] | outdoor | 100 | - | - | 1 | 1920 × 1200 | boxes | no
MinneApple (ours) | outdoor | 670 | 331 | 41,325 | 17 | 1280 × 720 | polygons | yes

Fruit Counting:
Method | type | # train images | # test images | # scenes | resolution | public
Chen et al. [3] | outdoor | 47 | 45 | 1 | 1280 × 960 | no
Rahnemoonfar and Sheppard [22] | synthetic | 24,000 | 2,400 | 1 | 128 × 128 | no
MinneApple (ours) | outdoor | 64,597 | 5,764 | 6 | varying | yes
Acquired images are chopped into smaller chunks to increase the dataset size artificially. These crops show only a small portion of the original image, which is reflected in the small number of labeled fruits. While this technique increases dataset size, it does not increase dataset variation, and the developed methods are prone to overfitting. Another issue is that a single dataset is split into training and test sets. Such splits lead to in-dataset testing, which makes it impossible to analyze an algorithm's generalization capabilities.

The MinneApple dataset addresses these problems. The dataset contains only full-resolution images. Data for the train/test splits are taken from different tree rows and different years. We included a variety of apple species and illumination conditions to avoid overfitting. The MinneApple dataset gives researchers a tool to test their algorithms in an unbiased way and to compare against other approaches.

III. IMAGE COLLECTION

The data for this paper were collected at the University of Minnesota's Horticultural Research Center (HRC) between June 2015 and September 2016. Since this is a university orchard used for phenotyping research, it is home to a large variety of apple tree species. We collected video footage from different sections of the orchard using a standard Samsung Galaxy S4 cell phone. During data collection, we acquired video footage by facing the camera horizontally at a single side of a tree row and moving (on foot) along the tree row at approximately 1 m/s. Moving the camera at slow speeds mitigates motion blur effects. We then extracted every fifth image from these video sequences. For the test datasets, we extracted every 30th image.
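As a rough illustration of this frame subsampling step, the sketch below extracts every n-th frame from a video with OpenCV. The file names and output layout are placeholders, not part of the released dataset tooling.

```python
import os
import cv2

def extract_frames(video_path, out_dir, stride):
    """Save every `stride`-th frame of a video sequence as a PNG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{frame_idx:06d}.png"), frame)
            saved += 1
        frame_idx += 1
    cap.release()
    return saved

# Training sequences keep every 5th frame, test sequences every 30th.
extract_frames("orchard_row.mp4", "train_frames", stride=5)
extract_frames("orchard_row_test.mp4", "test_frames", stride=30)
```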
A. Detection and Segmentation Datasets

For detection and localization of the fruits, we collected 17 different datasets over two years, ten for training and seven for evaluation. We included fruits of different colors and at different stages of the ripening cycle. The datasets were taken either from the sunny or shady side of the tree row, and we spread out data capture over multiple days to get more varied illumination conditions. See Figure 2 for samples of the annotated images in our dataset.

Training Sets: We sampled ten datasets from six different tree rows for training purposes. Each dataset shows either the front (sunny) or back (shady) side of a tree row. From these ten datasets, we randomly selected and annotated 670 images of resolution 1280 × 720 pixels. All of these datasets were acquired in 2015 at the HRC, and they contain different apple varieties, fruits across different growing stages, and a variety of tree shapes.

Test Sets: To evaluate detection/segmentation and yield estimation performance, we arbitrarily chose four different sections of the orchard. We collected seven videos from these four segments in 2016. Acquiring datasets during different years guarantees the independence of the test set. Additionally, we collected yield estimation ground truth for three tree rows by hand collecting and counting per-tree yield and by measuring fruit diameters after harvest. Yield estimation includes additional steps, such as fruit tracking and tree row merging. Since we do not include a baseline for tracking, we provide only anecdotal results for yield estimation in this paper.

B. Counting Datasets

Training Sets: We provide two annotated datasets to train patch-based counting approaches. One of these datasets contains green apples and one contains red apples; both were acquired in 2015 from the sunny side of the tree row. In total, we obtained 13000 image patches, which we annotated manually with a ground truth count. Additionally, we extracted 4500 patches at random that do not contain apples as negative examples. See Figure 2 for samples of the annotated images in our dataset.
Fig. 2: Samples of annotated images of the detection, segmentation and counting datasets. The detection/segmentation datasets
are annotated with object instance masks, while the counting dataset contains image patches and a corresponding ground truth
count.
Test Sets: The test dataset consists of a total of 2874 image patches taken from four image sequences. Two of the test datasets contain red apples, one contains green apples, and one contains a mixture of colors. Additionally, we acquired the fourth dataset from a greater distance to test the algorithms' generalization capability for counting low-resolution fruits.

IV. IMAGE ANNOTATION

We next describe how we labeled images for training and evaluation. We follow the annotation method in [13]. Following established evaluation protocols, annotations for the train and validation data will be released, but not for the test data. To test your algorithm, please submit your results online.

Detection and segmentation: Fruits for the detection and segmentation datasets were annotated using the VGG Image Annotator (VIA) tool [8]. We used polygons to label fruits on trees in the foreground, while fruits on the ground and on trees in the background were not tagged. Additionally, we labeled the tree trunks where visible. Please note that we only provide annotations for fully or partially visible fruits. Each of the objects in the scene was then categorized as fruit or tree trunk. We used an internally recruited workforce for the instance labeling task. Due to the large number of instances per image, the small object size, and the many occlusions of fruit instances, labeling is an arduous task. Labeling a single image takes up to 30 minutes, which translates to roughly 18 work-hours per 1000 instances. As such, we chose to assign each image to only a single worker for labeling. Each worker is instructed in proper labeling techniques before they can begin to annotate. After the worker annotated the first ten image frames, we conducted an in-person review to give feedback and issue the first round of corrections. While the instruction and initial feedback improved annotation quality considerably, we performed an additional verification step to correct each object instance if necessary.
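As one possible way to turn such polygon annotations into per-instance label masks, the sketch below rasterizes a list of polygons with Pillow. The vertex-list input is a simplification and does not reproduce the exact JSON layout exported by the VIA tool.

```python
import numpy as np
from PIL import Image, ImageDraw

def polygons_to_instance_mask(polygons, height, width):
    """Rasterize polygons into an instance-label image: background is 0,
    the i-th polygon is written with label i (later polygons overwrite earlier ones)."""
    mask = Image.new("I", (width, height), 0)
    draw = ImageDraw.Draw(mask)
    for label, (xs, ys) in enumerate(polygons, start=1):
        draw.polygon(list(zip(xs, ys)), fill=label)
    return np.array(mask, dtype=np.int32)

# Two hypothetical apple annotations given as (all_points_x, all_points_y) pairs.
apples = [([100, 140, 140, 100], [200, 200, 240, 240]),
          ([300, 340, 320], [500, 500, 540])]
instance_mask = polygons_to_instance_mask(apples, height=720, width=1280)
print(instance_mask.max())  # -> 2 labeled instances
```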
Patch-based counting: For the patch-based counting method, we used a semi-supervised GMM detector [27] to detect image patches that are likely to contain fruits. The patches were cropped and annotated by hand with a single ground truth integer representing the count. Due to the small resolution of these patches and the large volume, the annotation task proved to be error-prone. We had two different workers annotate each image, and disparities were resolved by a third worker through a validation process.

V. DATASET STATISTICS

Here, we give an overview of the properties of MinneApple in comparison to other object detection datasets. These include COCO [20], ImageNet Detection [5], and PASCAL VOC [10]. Each of these datasets varies considerably in the number of annotated images, image types, number of categories, number of instances per image, and the size of the annotated objects. The MS COCO dataset was created to show common objects in their natural context. The goal of the ImageNet Detection dataset was to detect a large number of object categories. PASCAL VOC contains fewer categories but focuses on objects in natural images. Our MinneApple dataset, on the other hand, focuses on detecting many small objects in highly cluttered environments.

A summary of the datasets showing the number of instances per category is given in Figure 3a. MinneApple contains fewer categories but far more instances per category than most datasets. In this respect, it is comparable to other specialized datasets such as the Caltech Pedestrian Detection [7] dataset or the KITTI [11] dataset. Figure 3b shows the number of instances per image in comparison to other datasets. MinneApple contains 1.5 categories and 41.2 instances on average per image. In contrast, the COCO dataset has 3.5 categories and 7.7 instances, and the ImageNet and PASCAL VOC datasets both have fewer than two categories and three instances per image on average. The spread of the number of instances per image is also wider than for COCO, ImageNet, and PASCAL VOC: the MinneApple dataset can contain between 1 and 120 object instances per image, while the other datasets contain at most 15.

Finally, we analyze the average size of objects in the dataset. In general, smaller objects are harder to detect and require specialized network structures [17].
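A small sketch of how such statistics could be gathered, assuming the instance annotations are available as label images in which background is 0 and every apple carries a unique positive id (an assumption about the release format, not a documented interface):

```python
import numpy as np
from pathlib import Path
from PIL import Image

def mask_statistics(mask_dir):
    """Per-image instance counts and per-instance pixel areas from label masks."""
    counts, areas = [], []
    for path in sorted(Path(mask_dir).glob("*.png")):
        mask = np.array(Image.open(path))
        ids, sizes = np.unique(mask[mask > 0], return_counts=True)
        counts.append(len(ids))
        areas.extend(sizes.tolist())
    areas = np.asarray(areas, dtype=np.float64)
    return {
        "mean_instances_per_image": float(np.mean(counts)),
        "mean_object_area_px": float(areas.mean()),
        "fraction_small": float(np.mean(areas < 32 ** 2)),  # COCO-style "small" objects
    }

print(mask_statistics("train/masks"))  # hypothetical directory layout
```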
Fig. 3: a) Number of annotated instances per category for some common datasets in comparison. b) Distribution of number of
annotated object instances per image for COCO, ImageNet Detection, PASCAL VOC compared with MinneApple. While the
other datasets contain mainly 1-5 objects, ours contains up to 120 instances per image.
For COCO, PASCAL VOC, and ImageNet Detection, roughly 50% of all objects occupy no more than 10% of the image itself. The other 50% contains objects that occupy between 10 and 100% of the image (evenly distributed). Our MinneApple dataset contains almost exclusively small instances. The average object size is only 40 × 40 pixels in an image of 1280 × 720 pixels, making up only 0.17% of the original image size.

Fig. 5: Area distribution of the objects in our dataset. The dataset contains mainly small object instances with area < 50² pixels.

VI. ALGORITHMIC ANALYSIS

We run a set of state-of-the-art algorithms on each of the tasks of object detection, segmentation, and counting to establish a common baseline for future work.

A. Detection and Segmentation Baselines

For the following experiments, we take a subset of 600 images from our dataset for training. The leftover 30 images are used for validation during training. We test each algorithm individually on each of the 331 test images and report average performance over the test dataset.

Detection evaluation metrics: For bounding box detection, we follow established evaluation protocols used by other object detection datasets [10], [20]. We report Average Precision (AP) as our main evaluation metric. Namely, we use AP starting at an Intersection over Union (IoU) threshold of 0.5 and increase it in intervals of 0.05 up to 0.95 (shorthand notation is [email protected]:0.05:0.95). Additionally, we provide [email protected] and [email protected]. Since our dataset contains many small objects, we report AP scores for small (object area < 32² pixels), middle (32² ≤ object area ≤ 96²), and large objects (area > 96²).
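To make the size bins and the IoU-based matching concrete, here is a simplified single-image sketch (a greedy matcher, not the official COCO evaluation code, which additionally accumulates precision/recall curves over all images and thresholds):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def size_bin(box):
    """COCO-style size bins based on box area in pixels."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    return "middle" if area <= 96 ** 2 else "large"

def match_detections(gt_boxes, det_boxes, iou_thr):
    """Greedily match detections (sorted by descending score) to ground truth.
    Returns (true positives, false positives, false negatives) at this threshold."""
    unmatched = list(range(len(gt_boxes)))
    tp = 0
    for det in det_boxes:
        best, best_iou = None, iou_thr
        for gi in unmatched:
            iou = box_iou(gt_boxes[gi], det)
            if iou >= best_iou:
                best, best_iou = gi, iou
        if best is not None:
            unmatched.remove(best)
            tp += 1
    return tp, len(det_boxes) - tp, len(unmatched)

# Evaluate one image at every threshold in 0.5:0.05:0.95.
gt = np.array([[10, 10, 45, 45], [100, 100, 180, 190]], dtype=float)
det = np.array([[12, 11, 44, 46], [105, 98, 178, 188]], dtype=float)
for thr in np.arange(0.5, 1.0, 0.05):
    print(round(float(thr), 2), match_detections(gt, det, thr))
```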
We evaluate three different models.

Faster RCNN: The latest implementation of Faster RCNN [23] with a ResNet-50 backbone. The network uses pretrained COCO weights for initialization. Faster RCNN consists of a region proposal head and two branches for bounding box regression and classification. We used the parameters from the paper for optimization.

Tiled Faster RCNN: A reimplementation of Bargoti and Underwood's [1] proposed model. Due to memory constraints, they split the training images into 500 × 500 pixel chunks with an overlap of 50 pixels. The detections of the individual chunks are aggregated and filtered using non-maximum suppression. We added a ResNet-50 backbone and a Feature Pyramid Network (FPN) [18] head for region proposal to the network and trained it with focal loss [19]. The network was initialized with weights pretrained on COCO [20]. We follow [1], [13] in our choice of parameters for optimization.
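A rough sketch of this tile-and-merge inference scheme, using torchvision's off-the-shelf Faster R-CNN as a stand-in for the trained detector (tile size and overlap follow the description above; class filtering and batching are omitted):

```python
import torch
import torchvision
from torchvision.ops import nms

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

def detect_tiled(image, tile=500, overlap=50, iou_thr=0.5):
    """Run the detector on overlapping tiles and merge the boxes with NMS.
    `image` is a CHW float tensor with values in [0, 1]."""
    _, h, w = image.shape
    boxes, scores = [], []
    step = tile - overlap
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            crop = image[:, y:y + tile, x:x + tile]
            with torch.no_grad():
                out = model([crop])[0]
            b = out["boxes"].clone()
            b[:, [0, 2]] += x  # shift tile coordinates back into
            b[:, [1, 3]] += y  # full-image coordinates
            boxes.append(b)
            scores.append(out["scores"])
    boxes, scores = torch.cat(boxes), torch.cat(scores)
    keep = nms(boxes, scores, iou_thr)  # suppress duplicates in the overlap regions
    return boxes[keep], scores[keep]
```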
Mask RCNN: An implementation of Mask RCNN [15] with a ResNet-50 backbone, pretrained on the COCO dataset. In addition to the bounding box branches of Faster RCNN, Mask RCNN has an additional branch predicting the object instance mask.

If we compare the bounding box detection results in Table II, we find that processing the image in a tiled fashion performs worse than the detectors operating on the whole image. We hypothesize that this is due to the additional filtering step at the end, where non-maximum suppression is used to filter out overlapping bounding boxes. If we compare the two state-of-the-art object detectors, we find that Faster RCNN slightly outperforms Mask RCNN. This is somewhat unexpected, since Mask RCNN has access to additional information (the instance masks). Further, we find that all the detectors struggle on smaller object instances. Future research should focus on improving the object detection performance on small and medium-sized objects to achieve significant gains in overall performance. Our findings confirm those of Hoiem et al. [17], who found that object size is one of the main error factors in object detection.

Semantic segmentation evaluation metrics: While bounding box prediction is the method predominantly used for object detection, we recognize that there exist other methods which only achieve bounding box prediction after an additional post-processing step. These methods mainly include detection through semantic segmentation. To avoid explicit bias towards bounding box prediction methods, we introduce separate benchmark algorithms for semantic segmentation.
TABLE II: Fruit detection benchmark results for object detection approaches. Higher numbers are better and the bold marked numbers indicate the highest performing approach.

Method | AP [@0.5:0.05:0.95] | AP [@0.5] | AP [@0.75] | AP [small] | AP [middle] | AP [large]
Tiled FRCNN [1] | 0.341 | 0.639 | 0.339 | 0.197 | 0.519 | 0.208
For evaluation, we follow the established metrics used by the COCO dataset [20]. We report Intersection over Union (IoU) as the primary challenge metric. Additionally, we report the class IoU for apples, pixel accuracy, and class accuracy for apple pixels.
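As a reference for how these pixel-level metrics relate to one another, here is a minimal sketch for the binary apple-vs-background case (the exact definitions used for the benchmark may differ in detail):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Binary (apple vs. background) segmentation metrics from two label maps."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    apple_iou = tp / (tp + fp + fn + 1e-9)
    background_iou = tn / (tn + fp + fn + 1e-9)
    return {
        "iou": 0.5 * (apple_iou + background_iou),   # mean over both classes
        "class_iou": apple_iou,                      # IoU of the apple class only
        "pixel_acc": (tp + tn) / (tp + tn + fp + fn),
        "class_acc": tp / (tp + fn + 1e-9),          # recall of apple pixels
    }
```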
We evaluate four different models.

Semi-supervised GMM: A semi-supervised clustering method based on Gaussian Mixture Models (GMM), developed by Roy and Isler [26]. The model is pretrained on an unlabeled dataset, different from the ones contained in the train and test sets.

User-supervised GMM: The same model as in the semi-supervised case. The method uses human supervision to create a single model per tree row in the test set.

UNet (not pretrained): A semantic segmentation network based on a fully convolutional network architecture [24], [13]. The images in the train and test sets are split into 224 × 224 sized chunks, and the weights of the network are initialized randomly.

UNet (pretrained): The same model as the one before, but the weights are initialized from a network pretrained on ImageNet.

TABLE III: Fruit detection benchmark results for semantic segmentation approaches. Higher numbers are better and the bold marked numbers indicate the highest performing approach.

Method | IoU | Class IoU | Pixel Acc. | Class Acc.
Semi-supervised GMM [28] | 0.635 | 0.341 | 0.968 | 0.455
User-supervised GMM [28] | 0.649 | 0.455 | 0.959 | 0.634

B. Counting Baselines

We report baseline results for counting approaches which were previously published in [14] for completeness. We evaluate two approaches. GMM: An unsupervised method based on Gaussian Mixture Models. This method fits a mixture-of-Gaussians probability distribution to a previously segmented image. CNN: This method uses a network to classify the fruits into k distinct classes. The network is based on a ResNet50 [16] backbone, and we choose to classify six classes.
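A minimal sketch of such a count classifier, assuming the six classes correspond to discrete per-patch count bins (the exact class definition follows [14] and is not restated here):

```python
import torch
import torch.nn as nn
import torchvision

# ResNet-50 backbone with the final layer replaced by a 6-way count classifier.
model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 6)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(patches, counts):
    """One optimization step on a batch of fruit patches and count labels.
    `patches` is an (N, 3, H, W) float tensor, `counts` an (N,) long tensor of class indices."""
    optimizer.zero_grad()
    logits = model(patches)
    loss = criterion(logits, counts)
    loss.backward()
    optimizer.step()
    return loss.item()
```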
Table IV shows the counting accuracy. The ResNet50 network outperforms the GMM model on all of the test sets. However, the network exhibits considerable variation in counting performance. Datasets 1 and 2 contain red and mixed apples. Dataset 4 contains red apples, but the images were acquired from further away. The best performance is achieved on test dataset 3, which contains green apples. We believe that this is the case because the green fruits show considerably less color variation than the red and mixed fruits. The GMM method performs best on test dataset 1, as this dataset is closest to the GMM training data. For an in-depth analysis and qualitative comparison, we refer the reader to [14].

TABLE IV: Fruit cluster counting benchmark results.

Method | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4
GMM [27] | 88.0 % | 81.8 % | 77.2 % | 76.1 %
CNN [14] | 88.8 % | 92.68 % | 95.1 % | 88.5 %

C. Yield estimation

Fruit detection and counting are integral to solving the problem of yield estimation. However, yield estimation contains additional steps to map detections and counts to tree row yield. For one, we need to track fruits across the image sequence.
VII. CONCLUSION

We introduced a new dataset for detecting and segmenting apples in orchards and a second dataset for counting clustered fruits. With this collection of annotated object instances, we hope to help the advancement of object detection, segmentation, and counting of small objects in cluttered environments. In creating this dataset, we wanted to emphasize the need for a diverse and unbiased dataset, containing a large number of object instances and apple varieties between the individual tree rows. Dataset statistics and results from the baseline algorithms indicate that the images contain challenging scenarios for current state-of-the-art object detection algorithms.

There are several promising directions for future work to improve the performance of detection and counting algorithms using this dataset. Our analysis of state-of-the-art object detectors indicates that networks can gain in accuracy by putting a broader focus on small object instances (area < 32² pixels). Similarly, semantic segmentation networks may employ weighting schemes to address the class imbalance between foreground (object instances) and background pixels. We hope that our dataset will help computer vision researchers working on fruit detection.

To download and learn more about MinneApple, please see the project website: http://rsn.cs.umn.edu/index.php/MinneApple.

VIII. ACKNOWLEDGEMENTS

This work was supported by the USDA NIFA MIN-98-G02 and UMN MnDrive. The authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported within this paper (http://www.msi.umn.edu).

REFERENCES

[1] S. Bargoti and J. Underwood. Deep fruit detection in orchards. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3626–3633. IEEE, 2017.
[2] L. Calvin and P. Martin. The U.S. produce industry and labor: Facing the future in a global economy. (1477-2017-4011):57, 2010.
[3] S. W. Chen, S. S. Shivakumar, S. Dcunha, J. Das, E. Okon, C. Qu, C. J. Taylor, and V. Kumar. Counting Apples and Oranges With Deep Learning: A Data-Driven Approach. IEEE Robotics and Automation Letters, 2(2):781–788, 2017.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, Las Vegas, NV, USA, June 2016. IEEE.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, Miami, FL, June 2009. IEEE.
[6] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 304–311, Miami, FL, June 2009. IEEE.
[7] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian Detection: An Evaluation of the State of the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761, Apr. 2012.
[8] A. Dutta, A. Gupta, and A. Zisserman. VGG Image Annotator (VIA). 2016.
[9] A. Ess, B. Leibe, K. Schindler, and L. van Gool. Robust Multiperson Tracking from a Mobile Platform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1831–1846, Oct. 2009.
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[11] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, Providence, RI, June 2012. IEEE.
[12] A. Gongal, S. Amatya, M. Karkee, Q. Zhang, and K. Lewis. Sensors and systems for fruit detection and localization: A review. Computers and Electronics in Agriculture, 116:8–19, Aug. 2015.
[13] N. Häni, P. Roy, and V. Isler. A comparative study of fruit detection and counting methods for yield mapping in apple orchards. Journal of Field Robotics, Aug. 2019.
[14] N. Häni, P. Roy, and V. Isler. Apple Counting using Convolutional Neural Networks. In Intelligent Robots and Systems (IROS), 2018 IEEE International Conference on. IEEE, 2018.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv:1703.06870 [cs], Mar. 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA, June 2016. IEEE.
[17] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing Error in Object Detectors. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, Computer Vision – ECCV 2012, volume 7574, pages 340–353. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[18] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature Pyramid Networks for Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, Honolulu, HI, July 2017. IEEE.
[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal Loss for Dense Object Detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, Venice, Oct. 2017. IEEE.
[20] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs], May 2014.
[21] X. Liu, S. W. Chen, S. Aditya, N. Sivakumar, S. Dcunha, C. Qu, C. J. Taylor, J. Das, and V. Kumar. Robust Fruit Counting: Combining Deep Learning, Tracking, and Structure from Motion. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1045–1052, Madrid, Oct. 2018. IEEE.
[22] M. Rahnemoonfar and C. Sheppard. Deep Count: Fruit Counting Based on Deep Simulated Learning. Sensors, 17(12):905, Apr. 2017.
[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
[24] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[25] P. Roy, W. Dong, and V. Isler. Registering Reconstructions of the Two Sides of Fruit Tree Rows. In Intelligent Robots and Systems (IROS), 2018 IEEE International Conference on. IEEE, 2018.
[26] P. Roy and V. Isler. Surveying apple orchards with a monocular vision system. In International Conference on Automation Science and Engineering (CASE), pages 916–921. IEEE, Aug. 2016.
[27] P. Roy and V. Isler. Vision-Based Apple Counting and Yield Estimation. In D. Kulić, Y. Nakamura, O. Khatib, and G. Venture, editors, 2016 International Symposium on Experimental Robotics, volume 1, pages 478–487. Springer International Publishing, Cham, 2017.
[28] P. Roy, A. Kislay, P. A. Plonski, J. Luby, and V. Isler. Vision-based preharvest yield mapping for apple orchards. Computers and Electronics in Agriculture, 164:104897, Sept. 2019.
[29] I. Sa, Z. Ge, F. Dayoub, B. Upcroft, T. Perez, and C. McCool. DeepFruits: A Fruit Detection System Using Deep Neural Networks. Sensors, 16(8):1222, Aug. 2016.
[30] M. Stein, S. Bargoti, and J. Underwood. Image Based Mango Fruit Detection, Localisation and Yield Estimation Using Multiple View Geometry. Sensors, 16(11):1915, Nov. 2016.
[31] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling. arXiv:1805.04687 [cs], May 2018.