Parametric Comics Creation From 3D Interaction
Ariel Shamir
Michael Rubinstein
Tomer Levinboim
Efi Arazi School of Computer Science, The Interdisciplinary Center
Contact Information:
Dr. Ariel Shamir
Efi Arazi School of Computer Science
The Interdisciplinary Center
P.O.B. 167
Herzliya 46150
Israel
Tel: +972-9-9527378
Email: [email protected]
Abstract
There are times when computer graphics is required to be succinct and simple. Carefully chosen, simplified, static images can portray the narration of a story as effectively as photo-realistic, dynamic 3D graphics. In this paper we present an automatic system which transforms dynamic graphics originating from real 3D virtual-world interactions into a sequence of comics images. The system traces events during the interaction and then analyzes and breaks them into scenes. Based on user-defined parameters of point-of-view and story granularity, it chooses specific time-frames to create static images, renders them, and applies post-processing to reduce their clutter. The system utilizes the same principle of intelligent reduction of details in both the temporal and spatial domains for choosing important events and depicting them visually. The end result is a sequence of comics images which summarizes the main happenings and presents them in a coherent but concise manner.
Keywords: Information Visualization, Non-Photorealistic Rendering, Comics, Succinct Graphics.
Introduction
Transforming continuous, dynamic 3D graphics into a sequence of static comics images raises challenges in two domains:
1. In the temporal domain, the challenge is to automatically detect and choose the most important or most interesting events in the interaction. The definition of interest depends on the point-of-view specified by the viewer.
2. In the spatial domain, the challenge is to depict these events with one or more images that convey their meaning faithfully. This includes choosing the points in time to portray an event, selecting camera parameters to create descriptive images, choosing a rendering style, and arranging the images' layout to convey the storyline.
Traditionally in comics, selecting or producing images and text that communicate the right message is the work of an artist. Indeed, we are still far from an artist's capability and expressiveness. However, this work goes one step towards this goal by presenting a system which is capable of extracting a sequence of important events from a continuous temporal storyline and converting them into a graphical representation automatically. The main contribution of this work is an automatic end-to-end system which transforms interactions in 3D graphics into a succinct representation depicted by comics. The system is based on principles of comics theory and can produce different comics sequences based on semantic parameters provided by the user, such as the point-of-view and the level-of-detail, using the same reduction principle in both the temporal and the spatial domains.
Related Work
Transformations of a virtual environment or game interactions into a static visualization were presented in [9] to trace user behavior, and in [23] to create an enhanced spectator view. However, these works mostly deal with global views of the happenings in the environment (e.g. a top view), and the goal of the visualization is not to convey a narrative but rather some global characteristics of the events. Another fine example of this type of transformation can be found in [33], where a system for rendering dynamic VR events using symbols from the visual arts is presented. These types of depictions could be utilized for converting specific events to comics, while our work deals more with the global extraction of a narrative storyline and its depiction.
Non-photorealistic rendering (NPR) is an active field of research in computer graphics [18]. A major trend in NPR is the definition of stylized filters that achieve certain painting or sketching effects. In our work we do not concentrate on artistic stylization, but rather follow a technique of detail reduction in order to enhance the essence of the image ([22] used the term minimal graphics). A similar abstraction approach is presented in [12], using a perceptual model of eye movements. We use the prior semantic knowledge from the virtual world and the story extraction to distinguish foreground from background, the main subject from superfluous details, etc. Our stylization is based on two techniques from image processing: edge detection and clustering. For edge detection and enhancement we use an anisotropic Laplacian operator [36], while the clustering and filtering are based on the k-means [29] and mean-shift [11] algorithms.
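As a rough stand-in for this stylization pass, the sketch below quantizes colors with k-means and darkens detected edges. It uses OpenCV's standard Laplacian and k-means rather than the paper's exact operators (the anisotropic Laplacian [36] and mean-shift [11]), and the threshold and cluster count are our own assumptions:

import cv2
import numpy as np

def cartoonize(img_bgr, k=8):
    """Cartoon-like stylization: k-means color quantization plus dark edge lines.

    A simplified illustration of the edge-detection + clustering idea, not the
    paper's implementation.
    """
    # color clustering: run k-means over the pixel colors
    pixels = img_bgr.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 3,
                                    cv2.KMEANS_RANDOM_CENTERS)
    quantized = centers[labels.flatten()].astype(np.uint8).reshape(img_bgr.shape)
    # edge detection and enhancement: Laplacian of a slightly blurred grayscale
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Laplacian(cv2.GaussianBlur(gray, (5, 5), 0), cv2.CV_8U, ksize=5)
    quantized[edges > 64] = 0  # draw strong edges as black lines
    return quantized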
Figure 2: A schematic view of the visual conversion process from log to comics frames, and the places where different types of comics transitions are used.
One of the primary principles of comics is the translation of time into space. Comics frames are inherently static, but they display dynamic information. The principle of depicting dynamic events using a sequence of still images is basic and natural to human perception. Still, when considering how to create a sequence of still images that conveys dynamic events, comics offers several possible transitions of time between frames [30]:
1. Moment-to-moment: breaks the action or motion into images based on time intervals. The result may look like a sequence of frames from a movie.
2. Action-to-action: breaks the action or motion into images based on the type or essence of the action. This is the transition most commonly used in comics.
3. Subject-to-subject: images switch from one action to another or from one character to another, but still remain within the same scene.
4. Scene-to-scene: consecutive images leap in time or space, usually signifying a change of scene.
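As a minimal illustration (the type and its names are our own sketch, not part of the paper's system), these four transition types could be encoded as an enumeration that a directing component consults when cutting frames:

from enum import Enum, auto

class Transition(Enum):
    """Frame-to-frame time transitions, following the comics taxonomy above."""
    MOMENT_TO_MOMENT = auto()    # fixed time intervals; movie-like frames
    ACTION_TO_ACTION = auto()    # cut by the essence of the action (most common)
    SUBJECT_TO_SUBJECT = auto()  # switch action/character within one scene
    SCENE_TO_SCENE = auto()      # leap in time or space; change of scene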
The logger is responsible for tracing all events in an interaction and storing them in a log file. To apply our techniques to real 3D virtual interaction, the logger must be incorporated inside the engine of the 3D world. After examining several options we settled on using a 3D game as the basis for the 3D interaction input. Not many open-source games allow modifications to the core engine (and it seems that this situation is not going to change anytime soon [17]). For our experiments we use Doom [38], a simple 3D shooting game. Although we use the multi-player version of the game, the possible events and interactions in such a game are somewhat limited. Nevertheless, we have designed our system in a modular fashion, and new types of events and interactions can easily be defined, recognized, and supported for any type of virtual interaction.
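A minimal sketch of such a logger hook follows. The class and method names, and the use of JSON rather than the paper's XML log format, are our own assumptions:

import json

class Logger:
    """Engine hook that records every change of state, grouped by tick."""

    def __init__(self, initial_world):
        # tick 0 holds the full initial geometry and entity states
        self.ticks = {0: initial_world}
        self.current = 0

    def advance_tick(self):
        """Called once per atomic time-unit of the simulation."""
        self.current += 1
        self.ticks[self.current] = []

    def record(self, entity_id, action, position, **changed_attrs):
        """Called by the engine whenever an entity's state changes."""
        self.ticks[self.current].append(
            {"entity": entity_id, "action": action,
             "position": position, "attributes": changed_attrs})

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.ticks, f)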
Each entity in the virtual world includes a set of attributes defining its state. These include its position, its possessions (e.g. tools or weapons), and sometimes a set of measures such as hunger, vitality, or strength. Furthermore, the state of an entity includes its current action, taken from a set of predefined possible actions such as walking, shooting, watching, picking, etc. An event in the virtual world is defined as any change in the state of any entity in the world. This means either the entity has changed its position or its action, or some of its attributes have changed (e.g. getting hungry or tired). Usually events are created as a result of an action carried out by a character, such as shooting, picking up an object, sitting down, or opening a door.
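As a sketch, with field names of our own choosing, the entity state and event records described above might look like:

from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class Entity:
    """State of one virtual-world entity."""
    ident: int
    position: Tuple[float, float, float]
    action: str                                        # e.g. "walking", "shooting"
    possessions: List[str] = field(default_factory=list)       # tools, weapons
    measures: Dict[str, float] = field(default_factory=dict)   # hunger, vitality

@dataclass
class Event:
    """Any change in the state of any entity, stamped with its tick."""
    tick: int
    entity: int
    attribute: str      # which part of the state changed, e.g. "position"
    new_value: Any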
Since many events can happen simultaneously, we measure the change-of-state of the world at some atomic time-unit called a tick. For each tick, the log stores a set of events defining all changes that occurred to all entities between the previous and the current tick. All relevant parameters of each event, such as the entity identifier, its position, and the action type, are stored in the log. The log also holds all the initial information regarding the geometry and entities of the world at time (tick) 0. Thus, at any tick t, the state of the world can be reconstructed by replaying all logged events from tick 0 up to t.
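Under the same illustrative data layout as the sketch above, reconstructing the world state at tick t amounts to replaying the log on top of the tick-0 snapshot:

import copy

def reconstruct_state(initial_world, log, t):
    """Replay all events from tick 1..t on top of the initial (tick-0) world.

    `initial_world` maps entity id -> Entity, and `log` maps tick -> list of
    Event records; both layouts are our own assumption, not the paper's XML.
    """
    world = copy.deepcopy(initial_world)
    for tick in range(1, t + 1):
        for ev in log.get(tick, []):
            # apply the recorded change of state to the affected entity
            setattr(world[ev.entity], ev.attribute, ev.new_value)
    return world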
Many narrational forms, such as theater and motion pictures, decompose a story into a sequence of scenes. Hence, one of the first steps in transforming a log of events into a coherent story is to separate it into scenes. This is done by the scener. However, the notion of a scene is difficult to define precisely. In general, a set of related actions or events that occur in one place at one time can be composed into a scene. Hence, the two main parameters defining a scene are usually time and space, and we can separate the events according to their location and time. Nevertheless, there may be many unrelated events occurring simultaneously at the same location, while only some of them are relevant to a specific story. Similarly, some events may occur in different locations but belong to the same scene (such as a car-chase scene). A simple scheme of separating events in time and space is therefore not sufficient, and some level of understanding and classification must be performed.
Separating events into scenes also depends on the point-of-view. Before transforming a sequence of events into a story there is a need to select a point-of-view, or a narrator. Story understanding is a complex problem; hence, we use a simpler scheme of modeling the interest or importance of events. We model the interactions between entities in the virtual world in order to recognize scenes. We analyze interactions, which involve two entities, instead of examining events, which are actions of a single entity. The interaction level of an entity e at tick t is a weighted combination of relation functions f \in F evaluated against all other entities e' \in E:

l(e,t) = \frac{\sum_{e' \in E} \sum_{f \in F} w_f \, R(t, f, e, e')}{\sum_{f \in F} w_f}

where R(t, f, e, e') measures the relation f between e and e' at tick t, and w_f is the weight assigned to that relation.
A cut between scenes should occur when two conditions hold: first, the interaction level is low, and second, there was a change of location. However, we also smooth the result using a Gaussian with local support, to gain an ease-in and ease-out effect in scenes and events:

h(e,t) = G(t) \ast \left( l(e,t) \cdot s(e,t) \right)

h(e,t) is a smoother version of l(e,t) which also incorporates the change-of-location information s(e,t) (Figure 4b); \ast denotes convolution over time with the Gaussian G.
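A numeric sketch of these two quantities, assuming the relation values R(t, f, e, e') and the change-of-location signal s(e,t) have already been computed and stored in arrays (all names and array shapes are our own illustration):

import numpy as np
from scipy.ndimage import gaussian_filter1d

def interaction_level(R, weights):
    """l(e,t): normalized weighted sum of relation values over all other
    entities and all relation functions.

    R has shape (T, F, E') with R[t, f, j] the value of relation f between the
    tracked entity e and entity j at tick t; weights[f] is w_f.
    """
    # sum over relation functions f and other entities j, weighted by w_f
    return np.einsum('tfj,f->t', R, weights) / weights.sum()

def smoothed_level(l, s, sigma=5.0):
    """h(e,t) = G(t) * (l(e,t) . s(e,t)): Gaussian smoothing of the product of
    the interaction level and the change-of-location signal, producing the
    ease-in/ease-out effect."""
    return gaussian_filter1d(l * s, sigma=sigma)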
We now use h(e,t) to segment the log of events into scenes. The beginning of a scene is signified by a transition from a period of low h(e,t) to a period of high h(e,t), i.e. \partial h(e,t)/\partial t > 0, and conversely, an end of scene is signified by \partial h(e,t)/\partial t < 0.
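A simplified segmentation sketch based on the sign of the discrete derivative of h (the noise threshold eps is our own addition):

import numpy as np

def segment_scenes(h, eps=1e-3):
    """Open a scene where h starts rising (dh/dt > 0) and close it where h
    starts falling (dh/dt < 0); a rough rendering of the rule above."""
    dh = np.diff(h)
    scenes, start = [], None
    for t, d in enumerate(dh):
        if start is None and d > eps:         # low -> high: scene begins
            start = t
        elif start is not None and d < -eps:  # high -> low: scene ends
            scenes.append((start, t))
            start = None
    if start is not None:                     # scene still open at the last tick
        scenes.append((start, len(h) - 1))
    return scenes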
Figure: (b) The view of the snapshots from (a) as seen in the game, without using our system's cameras. (c) The same type of idiom depicting a similar event from the blue player's story, only this time the player is getting hit.
Idiom                  Depiction
Change of Scene        Wide shot
Shooting               Two action-to-action shots
Conversation           Interchanging shots
Peak Interaction       Wide shot
Object Picking         Long shot
Shielded Interaction   First-person POV shots
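In code, this table could become a simple lookup that the director consults once an idiom is recognized (a hypothetical structure, not the paper's actual implementation):

# idiom -> camera depiction, taken from the table above
IDIOM_DEPICTION = {
    "change_of_scene":      "wide shot",
    "shooting":             "two action-to-action shots",
    "conversation":         "interchanging shots",
    "peak_interaction":     "wide shot",
    "object_picking":       "long shot",
    "shielded_interaction": "first-person POV shots",
}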
Figure 9: For each specific point in time, 18 camera positions are sampled around the event's central point (only 9 are shown). In this top view of the scene, this point is between the two main entities: the blue player and a yellow monster. The best image in terms of visibility (in this case the lower-right image) is chosen for the comics.
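A sketch of this camera sampling in 2D; the visibility_score callback (e.g. the fraction of the main entities left unoccluded in a test render) is an assumption of ours:

import math

def best_camera(center, radius, visibility_score, n=18):
    """Sample n camera positions on a circle around the event's central point
    and keep the one whose view scores highest for visibility."""
    cx, cy = center
    candidates = [(cx + radius * math.cos(2 * math.pi * k / n),
                   cy + radius * math.sin(2 * math.pi * k / n))
                  for k in range(n)]
    return max(candidates, key=visibility_score)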
Figure 8: A plot of the full path of the green player during the interaction. At different points in time and space, and based on the interaction level and idioms, the director chooses a snapshot. In this example the granularity is high and many shots were taken. The renderer is responsible for positioning the cameras (green pyramids) that shoot the comics frames. Some examples of images taken from different cameras are shown close-up at the top. Some cameras take two shots from the same position, for example to create an action-to-action transition.
Figure 11: An example of the output of the layout algorithm for placing text balloons in our comics. The direction of the balloons alternates between entities, and the text is broken into lines if it exceeds a certain size.
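For the text-breaking part, a minimal sketch (the character limit is our own placeholder, not the paper's value):

import textwrap

def balloon_lines(text, max_chars=24):
    """Break balloon text into short lines once it exceeds a size limit."""
    return textwrap.wrap(text, width=max_chars)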
Figure 12: An extract from the green player's comics sequence, created in the rendering style stressing the foreground. A larger portion with the cartoon-style rendering can be seen in Figure 13.
Results
We present results created from the interactions of a two-player episode of the Doom game. We denote the two players as the green player and the blue player. The original interaction takes 160 seconds and can be seen from the green player's point of view in the attached video (see regularInteraction.avi). All examples were created on an Intel Pentium 4 2.8GHz machine with 500MB of memory.
The log created from the interaction is 6.3MB in XML format and includes 5574 ticks. The scener first runs on the log to compose an interaction-script which includes all the iHoods for all entities. This takes 392 seconds and creates a script of size 683KB, which enables subsequent calls to the scener to be optimized and take only a few seconds. Hence, given a specific entity (point-of-view) and a threshold for scene cutting (granularity), the scener uses the interaction-script and the log to create a scened-log. This takes 3 seconds for each of the blue and green players. This log is larger (8.68MB) since it is now segmented into scenes, and at the start of each scene it contains a synchronization point for all entities in the world. This enables each scene to be portrayed independently, and may later be used to change the order of scenes.
The director uses the level-of-detail threshold to convert the scened-log into a directing-script. This script includes the high-level directives used for creating snapshots of the story over time; in our example it is 60.4KB in size. Lastly, the directing-script is used to create the comics images. We used several different thresholds to create several levels-of-detail of the story for both the green and the blue players (the reader is referred to http://www.faculty.idc.ac.il/arik/comics/ to view the full set of examples).
Assuming E is the number of entities in the world, T is the number of ticks, and V is the number of events, the most costly stage of the process besides rendering is the creation and update of the iHood for each entity, in O(T \cdot E^2 \cdot V). We use the game engine itself to render the basic images and to extract all the masks for stylization. Stylization is done by filtering all images in post-processing. Currently this takes between 90 seconds per image in focused-style (black and white background) and 150 seconds per image in cartoon-style.
Conclusion
This paper describes an end-to-end system for the automatic creation of comics from 3D graphics interaction. The major challenge met by this system is the transformation of elaborate and continuous graphic data into a discrete and succinct representation. In both the temporal and the visual domains we follow the same principle of abstraction: reducing details and focusing on major events and main characters. The system is built on top of a real 3D game engine and is able to trace and log the happenings in a multi-user world and transform them into comics automatically.
Doom as an interactive virtual environment has its limitations, both in terms of rendering quality and in terms of plot. Nevertheless, we have built our approach in a general and modular fashion. Although our examples come from the Doom game, our approach and solutions are applicable to many other scenarios. These include more complex games such as The Sims or massively multiplayer games, and other types of interaction applications such as training, educational, or medical ones. This work suggests that the goal of automatically extracting and generating a more compact representation of graphics while preserving its essence is feasible. This includes some level of semantic understanding, reduction of details, and the conversion of continuous 3D events to 2D images.
There are many possible extensions to this work and many enhancements that need to be addressed. In terms of
story understanding, causality between events should be
utilized to create better scene partitioning and to identify
important events. In terms of directing, more idioms, pos-
References
[1] Maneesh Agrawala, Doantam Phan, Julie Heiser, John Haymaker, Jeff Klingner, Pat Hanrahan, and Barbara Tversky. Designing effective step-by-step assembly instructions. ACM Transactions on Graphics, 22(3):828–837, 2003.
[12] Doug DeCarlo and Anthony Santella. Stylization and abstraction of photographs. In Proceedings of ACM SIGGRAPH 2002, pages 769–776, 2002.
[13] Steven K. Feiner and Kathleen R. McKeown. Automating the generation of coordinated multimedia explanations. Computer, 24(10):33–41, 1991.
[14] Alan Fern, Jeffrey Mark Siskind, and Robert Givan. Learning temporal, relational, force-dynamic event definitions from video. In Eighteenth National Conference on Artificial Intelligence, pages 159–166, 2002.
[15] D. Friedman and Y. Feldman. Knowledge-based formalization of cinematic expression and its application to animation. In Proc. Eurographics 2002, pages 163–168, Saarbrücken, Germany, 2002.
[16] D. Friedman, Y. Feldman, A. Shamir, and Z. Dagan. Automated creation of movie summaries in interactive virtual environments. In Proceedings of IEEE Virtual Reality 2004, pages 191–199, 2004.
[17] Adam Geitgey. Where are the good open source games? OSNews, http://www.osnews.com/story.php?news_id=8146, 2004.
[18] Bruce Gooch and Amy Ashurst Gooch. Non-Photorealistic Rendering. A K Peters, 2001.
Figure 13: The beginning of the comics page from the point of view of the green player. The full sequence and more examples can be found at http://www.faculty.idc.ac.il/arik/comics/.
[19] Nick Halper and Maic Masuch. Action summary for computer games: Extracting and capturing action for spectator modes and summaries. In Proceedings of the 2nd International Conference on Application and Development of Computer Games, pages 124–132, 2003.
[21] L. He, M. F. Cohen, and D. H. Salesin. The virtual cinematographer: A paradigm for automatic real-time camera control and directing. Computer Graphics, 30(Annual Conference Series, SIGGRAPH 96):217–224, 1996.
[22] I. Herman and D. J. Duke. Minimal graphics. IEEE Computer Graphics and Applications, 21(6):18–21, 2001.