Comic Characters Detection Using Deep Learning
Figure 2 shows the base model used in YOLOv2, which has 19 convolutional layers and 5 max-pooling layers. For the detection task, YOLOv2 replaces the last convolutional layer with three 3×3 convolutional layers of 1024 filters each, followed by a final 1×1 convolutional layer with the number of outputs needed for detection.
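To make this architectural change concrete, the following is a minimal sketch of the detection head described above. It is our own illustration in PyTorch (the paper does not specify an implementation framework), and it omits YOLOv2's passthrough connection; the batch normalization and leaky ReLU after each convolution follow the YOLOv2 convention.

import torch.nn as nn

def yolov2_detection_head(num_classes: int, num_anchors: int = 5) -> nn.Sequential:
    """Three 3x3 conv layers with 1024 filters each, then a final 1x1 conv
    whose output channels encode, per anchor box: 4 coordinates, 1
    objectness score and num_classes class scores."""
    layers = []
    for _ in range(3):
        layers += [
            nn.Conv2d(1024, 1024, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
        ]
    # 1x1 convolution producing the detection outputs for each grid cell
    layers.append(nn.Conv2d(1024, num_anchors * (5 + num_classes), kernel_size=1))
    return nn.Sequential(*layers)

# e.g. a 5-classes character model: 5 anchors x (5 + 5) = 50 output maps
head = yolov2_detection_head(num_classes=5)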
… character appear. We made ground-truth bounding boxes for all characters, small or big, speaking or not, and in the background. Because the Sequencity comics are private, we cannot provide the images, but we provide the ground truth with the corresponding album names and page numbers for anyone who wants to access the images7. Sequencity612 is split into three sets: a training set of 500 images, a validation set of 50 images and a testing set of 62 images. The three image lists for the training, validation and testing sets will also be made publicly available. To our knowledge, this is the largest dataset for comic characters.
To create the ground truth for Sequencity612, we identified five different types of characters, which we classified into: human-like, near human-like, far human-like, animal-like and manga characters (from Japanese mangas). Three annotators produced the ground truth based on these five target classes.
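Purely as an illustration (the released ground-truth format is not described in detail here), one annotation record could look like the following Python structure, pairing an album name and page number with labeled boxes; the field names and values are our assumption, not the published schema.

# Hypothetical example of one Sequencity612 ground-truth record; the
# actual released format may differ. Boxes are (x_min, y_min, x_max, y_max).
annotation = {
    "album": "example-album-name",   # placeholder, not a real album
    "page": 12,
    "characters": [
        {"box": (34, 110, 180, 420), "class": "human-like"},
        {"box": (512, 60, 600, 200), "class": "animal-like"},
    ],
}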
B. Fahad18 dataset
The Fahad18 dataset is a collection of 586 images of 18 popular cartoon characters collected from Google [5]. The 18 cartoon characters in this dataset are: Bart, Homer, Marge, Lisa (The Simpsons), Fred and Barney (The Flintstones), Tom, Jerry, Sylvester, Tweety, Bugs, Daffy, Scooby, Shaggy, Roadrunner, Coyote, Donald Duck and Mickey Mouse. Over the whole dataset, the number of images per character ranges from 28 (Marge) to 85 (Tom). Note that an image may contain more than one character.
Fig. 4. Samples from the Ho42 dataset. Best viewed in color.

The eBDtheque dataset [8] consists of one hundred comic pages from America, Japan (manga) and Europe8. The dataset includes annotations for four different types of objects: text lines, balloons, panels and characters. The position property is defined by the coordinates of the bounding box that includes all the pixels of the object.

In this research, we ran experiments and compared our results on the available datasets from [5, 6, 7]. Because the methods of [5, 6, 7] require extracted panels in order to detect the characters inside them, we chose to use the same setting and extracted panels using the algorithm in [28]. Panel positions will be provided with the ground-truth link9. While the objectives of [6, 7] on the Sun60 and Ho42 datasets are not the same as ours, we have tried our model on these datasets to prove its effectiveness, as our model can detect more characters than existing works without additional knowledge. In contrast, we do not show a comparison with the methods of [6, 7] on the Sequencity612 dataset, because they perform poorly there: their requirements fit only a limited subset of this dataset. For example, the method evaluated on the Ho42 dataset needs images with repeated color characters on each comic page, which is not the case for Sequencity612. Similarly, Fahad's method [5] requires richly colored images.
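The panel-based setting can be summarized by the sketch below. It is our own illustration, not code from the paper: extract_panels and detect_characters are hypothetical stand-ins for the panel extraction algorithm of [28] and for the trained character detector, and the box convention is an assumption.

from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)

def extract_panels(page) -> List[Box]:
    """Stub standing in for the panel extraction algorithm of [28]."""
    raise NotImplementedError

def detect_characters(panel_crop) -> List[Box]:
    """Stub standing in for the trained character detector."""
    raise NotImplementedError

def detect_on_page(page) -> List[Box]:
    """Run the detector panel by panel, then map boxes back to page coordinates."""
    results: List[Box] = []
    for (px1, py1, px2, py2) in extract_panels(page):
        crop = page[py1:py2, px1:px2]  # NumPy-style image slicing
        for (x1, y1, x2, y2) in detect_characters(crop):
            # shift panel-local boxes into page coordinates
            results.append((x1 + px1, y1 + py1, x2 + px1, y2 + py1))
    return results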
IV. RESULTS & DISCUSSION
To evaluate detection performance, we follow the PASCAL VOC evaluation criteria [29]. We report the interpolated average precision (AP%) and the precision-recall curve. Average precision computes the average value of the precision p(r) over the recall interval from r = 0.0 to r = 1.0 (see Formula 1). The PASCAL Visual Object Classes challenge (a well-known benchmark for object detection in computer vision) computes average precision by averaging the precision over the set of evenly spaced recall levels {0, 0.1, 0.2, ..., 1.0}:

\mathrm{AP} = \frac{1}{11} \sum_{r \in \{0,\, 0.1,\, \ldots,\, 1.0\}} p_{\mathrm{interp}}(r)    (1)

where p_interp(r) is an interpolated precision that takes the maximum precision over all recalls greater than r:

p_{\mathrm{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})    (2)

Some visual results of our experiments (purple bounding boxes) are shown in Figures 1, 3 and 4.
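Formulas (1) and (2) translate directly into a short NumPy routine. This is a generic illustration of the PASCAL VOC 11-point metric, assuming precision and recall arrays obtained by ranking one class's detections by confidence (with matches at IoU >= 0.5, per [29]); the toy curve at the end is made up.

import numpy as np

def interpolated_ap(precision: np.ndarray, recall: np.ndarray) -> float:
    ap = 0.0
    for r in (i / 10 for i in range(11)):        # recall levels {0, 0.1, ..., 1.0}
        above = precision[recall >= r]           # precisions at recalls >= r
        p_interp = above.max() if above.size else 0.0   # formula (2)
        ap += p_interp / 11.0                    # formula (1)
    return ap

# Toy usage with a made-up precision-recall curve:
prec = np.array([1.0, 0.5, 0.67, 0.75, 0.6])
rec = np.array([0.2, 0.2, 0.4, 0.6, 0.6])
print(f"AP = {interpolated_ap(prec, rec):.3f}")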
8 http://ebdtheque.univ-lr.fr/database/
9 https://bitbucket.org/l3ivan/sequencity612
A. Sequencity612 dataset

The results of the experiments on the Sequencity612 dataset are presented in Figure 5. We compared the one-class detector against the 5-classes detector for the task of character detection. The precision-recall curves confirm the effectiveness of the 5-classes model (dividing the characters into multiple classes). Although the one-class model has a small advantage when we care only about precision (and recall matters less, depending on the application), the precision is usually better for the 5-classes model (when the recall > 0.3). Moreover, the 5-classes model can achieve a higher recall (0.88), while the recall of the one-class model reaches 0.81. Even if the five classes that we have identified do not fully represent the diversity of comic characters, each of them is still more homogeneous than a single catch-all class. This characteristic is the reason why the 5-classes model performs better than the one-class model (see Figure 5).

Fig. 5. Results on the Sequencity612 dataset for the one-class and 5-classes models. The precision-recall curves show that the one-class model is slightly better when the recall is small (<0.2). However, the 5-classes model gives better overall performance.

B. Fahad18 dataset

In [5], the authors used color attributes as an explicit color representation for object detection. They showed that their method is most effective for comic characters, for which color plays a pivotal role. In this section, we present the results of our approach on the Fahad18 dataset. The Fahad18 dataset is divided into two sets: a training set of 304 comic images and a testing set of 182 comic images. To evaluate detection performance, the authors follow the PASCAL VOC evaluation criteria [29]. Although the Fahad18 dataset is composed of cartoon images, whose style differs from that of comic images, we can demonstrate the effectiveness of the deep learning approach over the method in [5] by using the same setting as that paper to train and test our model. Table I shows the results of our 18-classes model on the Fahad18 dataset, together with the results of [5]. The Fahad18 dataset contains 586 images of 18 classes. The AP% for the 18 classes is shown, with the mean AP% over all classes in the last row. The proposed approach outperforms the method presented in [5] on the detection and recognition of 14/18 classes and gives a significant improvement of about 18.1% in mean AP, showing overall higher performance than [5]. However, there are 4/18 classes where the method in [5] gives better results than our approach: Bart, Marge, Lisa and Barney.
TABLE I. RESULTS ON THE FAHAD18 DATASET (AP% PER CLASS), WITH THE NUMBER OF TRAINING EXAMPLES PER CLASS

Class         Fahad [5]   This paper   Frequency in training set
bart             72.3        63.6          19
homer            40.4        60.6          17
marge            43.4        41.7          14
lisa             89.8        65.6          16
fred             72.8        78.5          30
barney           55.1        54.5          15
tom              32.8        84.5          43
jerry            52.3        59.1          42
sylvester        32.9        54.5          28
tweety           51.4        56.1          21
buggs            22.2        54.5          62
daffy            35.6        60.3          23
scooby           19.8        65.4          31
shaggy           25.2        61.4          26
roadrunner       21.9        60.5          21
coyote           10.0        63.1          26
donaldduck       27.9        42.4          25
micky            45.3        59.3          20
meanAP           41.7        59.8           -
Note that three of these four classes come from the same cartoon: The Simpsons. This lower AP% may originate from the small number of training samples. The last column of Table I shows the number of examples in the training set for each of the 18 classes. The four classes above are among the five classes with the fewest training instances (those five classes are: bart, homer, marge, lisa and barney). This is a potential indicator that the deep learning approach may be more sensitive to the number of class instances in the training set than other approaches.

We have also tested the model learnt from the Sequencity612 dataset directly on the Fahad18 dataset. This model can only find characters, without identifying the specific class among the 18 classes of Fahad18. The mean average precision (mAP) of this experiment is 43.77%. This result is well below the previous one, which is reasonable given the difference between cartoon images (Fahad18) and comic images (Sequencity612). We can see that this mAP is still higher than that of [5]; however, [5] can also determine the class of the characters.
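A small sketch of this class-agnostic evaluation, as we understand it (the data layout below is our assumption, not the paper's code): every label predicted by the Sequencity612 model, and every Fahad18 ground-truth label, is collapsed to one generic class before matching, so only localization is scored.

# Hypothetical illustration: collapse all labels to a single 'character'
# class so the 18-way classification is ignored during matching.
def to_class_agnostic(detections):
    """detections: list of (box, score, label) tuples."""
    return [(box, score, "character") for (box, score, _label) in detections]

preds = [((10, 20, 80, 160), 0.91, "human-like"),
         ((200, 40, 260, 120), 0.55, "animal-like")]
print(to_class_agnostic(preds))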
C. Sun60 and Ho42 datasets
For both datasets, the authors do not use any training: they applied their algorithms directly to detect characters. Therefore, we use these datasets to prove the effectiveness of our approach by applying the model trained on the Sequencity612 dataset to these two datasets without re-training it. Table III shows the results of character detection reported in [7]. While we know that the Sun60 dataset of 60 comic pages used in [7] is a subset of eBDtheque, we do not know exactly which 60 pages of eBDtheque were used. So we tested multiple sets of 60 comic pages randomly taken from eBDtheque and computed the average results. In [7], the authors did not report the precision-recall curve, only final values of precision and recall: 35.48% for the recall and 79.43% for the precision. Figure 6 depicts the results of our method. When the recall is 35.48%, the precision is about 85% (compared to 79.43% for [7]). And when the precision is 79.43%, the recall is about 51% (compared to 35.48% for [7]). This result is clearly better than [7].

Fig. 6. Results on the Sun60 dataset. Precision-recall curves in %.

TABLE III. RESULTS OF CHARACTER DETECTION IN [7]

Title          1     2     3     4     5     6     Mean
Recall (%)     23    36    50    33.3  23.6  47    35.48
Precision (%)  72    81.8  68    100   57    97.8  79.43

In [6], inexact graph matching is used to automatically localize the most frequent color-group apparitions and label them as main characters. Their experiment was carried out on the Ho42 dataset, where all pages contain at least one redundant character and each page consists of 4 panels. They evaluated the method by verifying whether the algorithm is able to detect redundancies in each comic page, not in the whole album. To evaluate algorithm performance, a detection is considered valid if the redundancy condition is true, i.e., if at least one redundant character is detected in a page. Under this criterion, at least one redundant character has been detected in 71.4% of the pages. Partial characters have been detected in 9.6% of the Ho42 dataset. For the rest, characters have been detected, but only once.

Fig. 7. Precision-recall curve of our method on the Ho42 dataset.

Fig. 7 depicts the result of our method on the Ho42 dataset. The precision-recall curve shows a very high performance. While [6] can detect at least 2 characters on 81% of the pages, our method can detect almost 92% of the characters in the whole dataset. Note that the algorithm in [6] can detect redundant characters without training a model, while our approach detects characters in the Ho42 dataset with a model trained on another, unrelated dataset (Sequencity612).

D. Conclusion

To make digital comic books usable on a large scale on
future devices, we first need to solve the problem of comic image understanding. However, progress on scene analysis, component extraction and story understanding is still insufficient for industrialization. Our goal was not to develop a new character detection method, but to introduce a state-of-the-art baseline method, enabled by the recent development of deep learning, and to propose a new large dataset with ground truth for comic characters. Experiments on four different datasets reaffirm the benefit of the deep learning approach for comic character detection. The deep learning approach needs offline training, because of its significant computational power requirements, and more data to get the most out of its performance, but we have shown in this paper that it already gives the best results in comparison with existing methods.

ACKNOWLEDGMENT

This work is supported by the University of La Rochelle (France), the town of La Rochelle and the PIA-iiBD ("Programme d'Investissements d'Avenir"). We are grateful to all authors and publishers of the comics images from the Fahad18, eBDtheque and Sequencity datasets for allowing us to use their works.

REFERENCES

[1] W. Sun and K. Kise, "Detection of exact and similar partial copies for copyright protection of manga," IJDAR, vol. 16, no. 4, pp. 331–349, 2016.
[2] W. Sun and K. Kise, "Similar partial copy detection of line drawings using a cascade classifier and feature matching," in IWCF, LNCS vol. 6540. Springer, 2010, pp. 126–137.
[3] P. A. Viola and M. J. Jones, "Robust real-time face detection," IJCV, vol. 57, no. 2, pp. 137–154, 2004.
[4] T. Kohei, J. Henry, and N. Tomoyuki, "Face detection and face recognition of cartoon characters using feature extraction," in IEVC'12, Kuching, Malaysia, 2012.
[5] F. S. Khan, R. M. Anwer, J. van de Weijer, A. D. Bagdanov, M. Vanrell, and A. M. Lopez, "Color attributes for object detection," in CVPR, 2012, pp. 3306–3313.
[6] H. N. Ho, C. Rigaud, J.-C. Burie, and J.-M. Ogier, "Redundant structure detection in attributed adjacency graphs for character detection in comics books," in 10th IAPR Int. Workshop on Graphics Recognition, USA, Aug. 2013.
[7] W. Sun, J.-C. Burie, J.-M. Ogier, and K. Kise, "Specific comic character detection using local feature matching," in ICDAR, 2013, pp. 275–279.
[8] C. Guérin, C. Rigaud, et al., "eBDtheque: a representative database of comics," in ICDAR, 2013, pp. 1145–1149.
[9] C. Rigaud, "Segmentation and indexation of complex objects in comic book images," Ph.D. dissertation, Univ. of La Rochelle, France, 2014.
[10] S. Medley, "Discerning pictures: How we look at and understand images in comics," Studies in Comics, 2010.
[11] N. Cohn, "The limits of time and transitions: Challenges to theories of sequential image comprehension," Studies in Comics, vol. 1, no. 1, pp. 127–147, 2010.
[12] H. A. Ahmad, S. Koyama, and H. Hibino, "Impacts of manga on Indonesian readers' self-efficacy and behavior intentions to imitate its visuals," Bulletin of JSSD, vol. 59, no. 3, 2012.
[13] B. Duc, L'art de la B.D.: Du scénario à la réalisation graphique, tout sur la création des bandes dessinées. Editions Glénat, 1997.
[14] J.-M. Lainé and S. Delzant, Le lettrage des bulles. Eyrolles, 2010.
[15] M. Iwata, A. Ito, and K. Kise, "A study to achieve manga character retrieval method for manga images," in DAS, 2014.
[16] T.-N. Le, M. M. Luqman, J.-C. Burie, and J.-M. Ogier, "A comic retrieval system based on multilayer graph representation and graph mining," in GbRPR. Springer, 2015.
[17] Y. Matsui, K. Ito, Y. Aramaki, T. Yamasaki, and K. Aizawa, "Sketch-based manga retrieval using Manga109 dataset," CoRR, vol. abs/1510.04389, 2015.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1106–1114.
[19] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[20] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE PAMI, vol. 35, no. 8, pp. 1798–1828, 2013.
[21] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, "Understanding neural networks through deep visualization," in ICML Workshop on Deep Learning, 2015.
[22] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014, pp. 580–587.
[24] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[25] M. Ueno, N. Mori, T. Suenaga, and H. Isahara, "Estimation of structure of four-scene comics by convolutional neural networks," in MANPU@ICPR, 2016.
[26] J. R. R. Uijlings, K. E. A. van de Sande, et al., "Selective search for object recognition," IJCV, vol. 104, no. 2, 2013.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," CoRR, vol. abs/1406.4729, 2014.
[28] C. Rigaud, N. Tsopze, J.-C. Burie, and J.-M. Ogier, "Robust frame and text extraction from comic books," in GREC, 2011.
[29] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," IJCV, vol. 88, no. 2, pp. 303–338, 2010.
[30] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015, pp. 91–99.
[31] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu, "Towards better analysis of deep convolutional neural networks," CoRR, vol. abs/1604.07043, 2016.
[32] W.-T. Chu and W.-C. Cheng, "Manga-specific features and latent style model for manga style analysis," in ICASSP. IEEE, 2016, pp. 1332–1336.
[33] R. Girshick, "Fast R-CNN," in Proc. of the 2015 IEEE ICCV, Washington, DC, USA, 2015, pp. 1440–1448.