
Parametric Comics Creation from 3D Interaction

Ariel Shamir
Michael Rubinstein
Tomer Levinboim
Efi Arazi School of Computer Science, The Interdisciplinary Center
Contact Information:
Dr. Ariel Shamir
Efi Arazi School of Computer Science
The Interdisciplinary Center
P.O.B. 167
Herzliya 46150
Israel
Tel: +972-9-9527378
Email: [email protected]

Abstract
There are times when computer graphics is required to be succinct and simple. Carefully chosen, simplified static images can portray the narration of a story as effectively as photo-realistic, dynamic 3D graphics. In this paper we present an automatic system which transforms dynamic graphics originating from real 3D virtual-world interactions into a sequence of comics images. The system traces events during the interaction and then analyzes them and breaks them into scenes. Based on user-defined parameters of point-of-view and story granularity, it chooses specific time-frames to create static images, renders them, and applies post-processing to reduce their clutter. The system utilizes the same principle of intelligent reduction of details in both the temporal and spatial domains for choosing important events and for depicting them visually. The end result is a sequence of comics images which summarizes the main happenings and presents them in a coherent but concise manner.
Keywords: Information Visualization, Non-Photorealistic Rendering, Comics, Succinct Graphics.

1 Introduction

One of the major ongoing efforts in computer graphics is striving for more elaborate, more complex and more realistic visual displays. This includes striving for higher rendering quality and for higher frame rates. Nevertheless, there are times when a more succinct use of graphics is required: for example, when there is no time to watch a whole movie or a whole game, when it is too expensive to transfer or store elaborate graphic data, or when the target display device is crude or even static. In such situations, the graphic complexity must be reduced both in terms of quantity and in terms of quality, both temporally and spatially. Nevertheless, the need to carry a message or deliver information calls for intelligent techniques of reduction where significant information is preserved and unimportant data is dismissed. This paper investigates this type of graphics reduction, presenting a system for automatic creation of comics from a full scenario 3D virtual game. In this scenario comic strips can be a useful medium for saving experiences or sharing adventures with others, similar to photo albums or stories of real world activities.

Storyboards and scripts use comic-like frames to narrate a story. Inversely, such displays can be created to summarize the main events in a film, a video or a virtual world interaction [6]. Viewing comics as juxtaposed pictorial and other images in deliberate sequence [30] covers not only art and entertainment, but other possible applications as well. For instance, in medical or scientific visualization juxtaposition can be used to compare or illustrate data. In education and training, visual juxtaposition can be used to explain structures, assemblies or other processes [13, 4, 1]. In graphics, comics-like metaphors can be used to represent interaction histories for editing operations [26], for programming by example [28], and to convey a synopsis of movement in a video sequence [5]. In fact, Tufte's small-multiples idiom used for envisioning any kind of information [37] can be interpreted as juxtaposed pictorial elements.

Our goal is to create a sequence of comics images which summarizes the main happenings in a virtual world environment and presents them in a coherent, concise and visually pleasing manner. An overview of our system can be seen in Figure 1. We start by obtaining a log of events which is based on the interactions of the players inside a 3D virtual world. While users interact with the game or move through the virtual world, a logger captures the events and stores them in a log file (Section 4). This log is analyzed and processed based on parameters given by the user, such as point-of-view and interests. Consequently, it is cut into logical units or scenes by the scener using a model for character interactions. Next, specific scenes and events are chosen according to the user's definition of interest (Section 5). These scenes are transformed into a series of comics strips by the director, which chooses time-frames and camera positions (Section 6). New images are created by the renderer, which is responsible for the style and abstraction of the images (Section 7) and for arranging their page layout (Section 8). This transformation creates a more succinct representation of the original graphics by reducing the complexity of both the temporal and the spatial domains. Hence, the challenge is twofold:

1. In the temporal domain the challenge is to automatically detect and choose the most important or most
interesting events in the interaction. The definition
of interest depends on the point-of-view specified by
the viewer.
2. In the spatial domain the challenge is to depict these
events with one or more images that will convey
their meaning faithfully. This includes choosing the
points in time to portray an event, selecting camera
parameters to create descriptive images, choosing a
rendering style and arranging their layout to convey
the storyline.
Traditionally in comics, selecting or producing images and text that communicate the right message is the work of an artist. Indeed, we are still far from an artist's capability and expressiveness. However, this work goes one step towards this goal by presenting a system which is capable of extracting a sequence of important events from a continuous temporal storyline and converting them into a graphical representation automatically. The main contribution of this work is the creation of an automatic end-to-end system which transforms interactions in 3D graphics into a succinct representation depicted by comics. The system is based on principles of comics theory and is capable of producing different comics sequences based on different semantic parameters provided by the user, such as point-of-view and level-of-detail, using the same reduction principle in both the temporal and the spatial domains.

Figure 1: An overview of the process of transforming 3D graphic interaction into comics.

2 Related Work

Automatic comics creation has been successfully utilized to portray chat sessions in Comic Chat [27]. Comic Chat converts text conversations to 2D comics frames by choosing iconic 2D figures with several pre-defined postures as representatives for the participants in the chat. Next, the text of the conversation is displayed inside balloons which are connected to the figures. As the main purpose of Comic Chat is to portray a textual chat, the text-balloon positioning it presents is more elaborate than the simple one we use. However, Comic Chat is inherently 2D in nature, while our system deals with the conversion of 3D world interaction to images, including extraction of a storyline, positioning of cameras, and layout of frames.

Automatic creation of comics is also closely related to automatic creation of animations or movies [24, 21, 2, 19, 16]. In both cases, a series of given events must be decomposed into scenes and a visual display must be created. In fact, the general outline of our system follows that of similar systems for movie creation by first segmenting the events into scenes, then selecting specific scenes or scene parts, and then transforming each one into a visual depiction. Moreover, some cinematographic rules must be followed in comics sequences as well; for example, the 180° camera-line rule also applies to comics images. In fact, a discussion of a framework similar to ours for the creation of a summary of computer games using video and possibly also static images can be found in [19]. Nevertheless, as a work in progress the results shown were limited, and were not oriented towards the creation of comics. Although video summarization [6, 20, 14] is also related to our work, most of the research in video has a somewhat different focus, which is image analysis and understanding. Our input comes from a 3D virtual world, and its different nature includes more elaborate semantic information.

Two previous approaches have been suggested for the preservation of cinematographic constraints in movie creation. One is based on rule-based constraint satisfaction [15] and the other uses idioms, which are stereotypical ways to capture some specific actions as a series of shots [10]. Our approach for transforming events into comics follows the latter, but extends the notion of idioms to the creation of different types of comics frame transitions based on comics principles. The language of comics is different from the language of cinematography. On the one hand it seems simpler, since only static images are created. On the other hand, the choices are more critical, as they must convey movements or actions using static images, which is more difficult. Furthermore, the general approach of automatic movie creation is inherently photo-realistic. In contrast, we follow the same path of detail reduction and abstraction in rendering as in the temporal story extraction. Since we do not have full geometric information for rendering, we create non-photo-realistic images by applying post-processing stylization on the rendered images with the intention of enhancing important details and reducing clutter.

Analyzing a log of events is directly connected to story understanding, which has been studied from the early days of artificial intelligence [8, 35] and is still an active research area [34, 31]. The more thorough the understanding of the story, the better the selection of interesting and important events will be. Nevertheless, our approach does not depend on deep story understanding [32]. Instead, we use a model of the interactions between entities or characters in the world to recognize scenes and choose events.

Transformations of a virtual environment or game interactions to a static visualization were presented in [9] to trace user behavior, and in [23] to create an enhanced spectator view. However, these works mostly deal with global views of the happenings in the environment (e.g. a top view), and the goal of the visualization is not to convey a narrative but rather some global characteristics of the events. Another fine example of this type of transformation can be found in [33], where a system for rendering dynamic VR events using symbols from the visual arts is presented. These types of depictions could be utilized for converting specific events to comics, while our work deals more with the global extraction of a narrative storyline and its depiction.

Non-photo-realistic rendering (NPR) is an active field of research in computer graphics [18]. A major trend in NPR is the definition of stylized filters to achieve certain painting or sketching effects. In our work, we do not concentrate on artistic stylization but rather follow a technique of detail reduction in order to enhance the essence of the image ([22] used the term minimal graphics). A similar abstraction approach is presented in [12], using a perceptual model of eye movements. We use the prior semantic knowledge from the virtual world and the story extraction to distinguish foreground from background, main subject from superfluous details, etc. Our stylization is based on two techniques from image processing: edge detection and clustering. For edge detection and enhancement we use an anisotropic Laplacian operator [36], while the clustering and filtering are based on the k-means [29] and mean-shift [11] algorithms.

Figure 2: A schematic view of the visual conversion process from log to comics frames, and the places where different types of comics transitions are used.

3 The Language of Comics

Comics carry a somewhat childish reputation; the images in comics are often deliberately simplified and non-photo-realistic. Still, comics are capable of invoking strong emotional reactions, creating identification and conveying a story in an effective and appealing manner. In fact, the use of symbolism and abstraction in comics can promote identification, since realistic images resemble specific people or places whereas symbolic figures can represent anyone or anywhere. For these reasons we can leverage the comics narrative to create effective and appealing succinct graphical depictions.

One of the primary principles of comics is the translation of time into space. Comics frames are inherently static, but they display dynamic information. The principle of depicting dynamic events using a sequence of still images is basic and natural to human perception. Still, when considering how to create a sequence of still images to convey dynamic events, comics present several possible transitions of time between frames [30]:

1. Moment-to-moment: breaks the action or motion into images based on time intervals. The result may look like a sequence of frames from a movie.

2. Action-to-action: breaks the action or motion into images based on the type or essence of the action. This is the transition most used in comics.

3. Subject-to-subject: images switch from one action to another or from one character to another, but still remain within the same scene.

4. Scene-to-scene: consecutive images leap in time or space, usually signifying a change of scene.

Another leading principle in comics is the use of visual icons. The meaning of icon here is taken in its general sense as any image used to represent a person, place, thing or idea (see e.g. Figure 1). The use of abstract icons as opposed to realism focuses our attention through simplification, by eliminating superfluous features, and creates higher identification and involvement.

We use these two principles to create a succinct depiction of 3D interaction: first by transforming a continuous sequence of events into a discrete set of static images which portray them (Figure 2), and second by using abstraction in the visual depiction instead of realism.


4 Logging Events in a 3D World

The logger is responsible for tracing all events in an interaction and storing them in a log file. To apply our techniques to real 3D virtual interaction, the logger must be incorporated inside the engine of the 3D world. After examining several options we settled on using a 3D game as the basis for the 3D interaction input. Not many open-source games allow modifications to the core engine (and it seems that this situation is not going to change anytime soon [17]). For our experiments we use Doom [38], a simple 3D shooting game. Although we use the multi-player version of the game, the possible events and interactions in such a game are somewhat limited. Nevertheless, we have designed our system in a modular fashion, and new types of events and interactions can easily be defined, recognized and supported for any type of virtual interaction.

Each entity in the virtual world includes a set of attributes defining its state. These include its position, its possessions (e.g. tools or weapons), and sometimes a set of measures such as hunger, vitality or strength. Furthermore, the state of an entity includes its current action from a set of predefined possible actions such as walking, shooting, watching, picking, etc. An event in the virtual world is defined as any change in the state of any entity in the world. This means either the entity has changed its position or its action, or some of its attributes have changed (e.g. getting hungry or tired). Usually events are created as a result of an action carried out by characters, such as shooting, picking up an object, sitting down, or opening a door.
Since many events can happen simultaneously, we
measure the change-of-state of the world at some atomic
time-unit called a tick. The log stores for each tick a set
of events defining all changes that occurred to all entities between the previous and the current tick. All related
parameters of each event such as the entity identifier, its
position, and the action type are stored in the log. The
log also holds all the initial information regarding the geometry and entities of the world at time (tick) 0. Thus, at
any tick t, the state of the world can be reconstructed by accumulating the events information from tick 0 to tick t.

When the difference between ticks is small, the granularity of events is finer, and when it is large, the granularity is coarser and some events may even be missed. To preserve all important events and still prevent the log from exploding, we use adaptive tick differences along with a filtering scheme. When concentrating on the story of specific characters, we store only events related to these characters. This means that a new tick is created and data is stored in the log when, and only when, there is a change of state of one of these characters. To keep track of positional information of the characters we also add uniform ticks with only positional information every second or so.
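To make the tick-and-event model concrete, here is a minimal sketch of such an adaptive logger in Python. The class and field names (EntityState, AdaptiveLogger, etc.) and the JSON output are illustrative assumptions, not the paper's actual data format; the paper only specifies that state changes of tracked characters trigger new ticks and that uniform positional ticks are added about once per second.

    import json
    from dataclasses import dataclass, field, asdict

    @dataclass
    class EntityState:
        position: tuple          # (x, y, z) in world coordinates
        action: str              # e.g. "walking", "shooting", "picking"
        attributes: dict = field(default_factory=dict)  # e.g. {"health": 80}

    class AdaptiveLogger:
        """Stores a tick only when a tracked character changes state,
        plus a uniform positional tick roughly every second."""
        def __init__(self, tracked_ids, positional_interval=1.0):
            self.tracked = set(tracked_ids)
            self.interval = positional_interval
            self.prev_states = {}
            self.last_positional = 0.0
            self.ticks = []

        def observe(self, now, world_states):
            # state-change ticks for tracked characters only (filtering scheme)
            changed = {eid: s for eid, s in world_states.items()
                       if eid in self.tracked and self.prev_states.get(eid) != s}
            if changed:
                self.ticks.append({"t": now,
                                   "events": {e: asdict(s) for e, s in changed.items()}})
            elif now - self.last_positional >= self.interval:
                # uniform tick: positions only, so paths can still be reconstructed
                self.ticks.append({"t": now, "positions": {
                    e: world_states[e].position for e in self.tracked if e in world_states}})
                self.last_positional = now
            self.prev_states.update(world_states)

        def dump(self, path):
            with open(path, "w") as f:
                json.dump(self.ticks, f)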

5 Recognizing Scenes: The iHood Model

Many narrational forms such as theater and motion pictures decompose a story into a sequence of scenes. Hence,
one of the first steps in transforming a log of events into
a coherent story is to separate it into scenes. This is done
by the scener. However, the notion of a scene is difficult to define precisely. In general, a set of related actions or events that occur in one place at one time can
be composed into a scene. Hence, the two main parameters defining a scene are usually time and space, and we
can separate the events according to their location and
time. Nevertheless, there may be many unrelated events
occurring simultaneously at the same location, when only
some of them are relevant to a specific story. Similarly,
some events may occur in different locations but belong
to the same scene (such as a car chase). A simple
scheme of separating events in time and space would not
be sufficient, and some level of understanding and classification should be performed.
Separating events into scenes also depends on the
point-of-view. Before transforming a sequence of events
into a story there is a need to select a point-of-view or
a narrator. Story understanding is a complex problem,
hence, we use a simpler scheme of modeling interest or
importance of events. We model the interactions between
entities in the virtual world in order to recognize scenes.
We analyze interactions, which involve two entities, instead of examining events, which are actions of a single
4

entity is a sphere S centered at the entity position and with


a given radius r (see Figure 3).
For each interaction type f F we define a weight
w f which is a real value number configured by the user.
The weights denote the importance of these interactions
in a specific story, e.g. from a specific point-of-view.
Hence, different choices of weights for different interactions eventually govern if an action will appear in a story
or not.
At any tick t we define the level of interaction of an
entity e as:
Figure 3: Two example of interaction neighborhoods
(iHoods) of two entities the orange and the blue. The image shows a top view of a 3D game environments where
each entity is represented by a circle. Entities are part
of an iHood of another entity either if they are within its
vicinity (large orange or blue circles) or if they interact
with it (dotted lines).

l(e,t) =

w f R(t, f , e, e0 ) =

e0 E f F

wf

R(t, f ,e,e0 )I(e,t)

For a specific entity e we can graph l(e,t) as a function


of the tick t (see Figure 4a). We find that the level of interaction has several peaks and valleys during the course
of interaction. Furthermore, we can change the shape of
this function by changing the weights of specific interaction types. Our key assumption is that high level of interaction reflects significance in the story and vice verse.
Hence, we would want to include the events around the
peaks of l(e,t) in any story involving entity e, and skip
the events when l(e,t) is low. This can be done using a
simple threshold over the value of l(e,t). Nevertheless,
this simple scheme does not create a good partitioning of
the story into scenes and is sensitive to local fluctuations
in l(e,t). The interaction level l(e,t) is an indication only
for one of the arguments defining a scene - the time of
the events. There is a need to augment it with the second
argument - the location of the events.
We define a second, change-of-location function s(e,t)
which is 0 almost always apart from a finite number of
times k where it is defined as a uniform spike with maximum height ck > 0. s(e,t) is similar to a series of k approximated delta functions. The times when s(e,t) 6= 0
are the times the entity e has moved from one physical location to another and the constant ck depends on the type
of location change. This information can either be found
in the log file or extracted from the entities position and
actions. For instance, moving from one room to another
and exiting or entering a building is signified by opening a
door or using a portal, going up or down stairs or elevators
is signified by the change in z coordinate etc. We subtract
the change-of-location function from the interaction-level
function, l(e,t) s(e,t), since the separation of scenes

entity. Furthermore, the interaction importance is factored


by the importance of the entities and actions, enabling the
creation of a bias toward a specific point-of-view or a specific type of event.
An interaction is defined as a relation between two entities in the world, for example see, shoot, pick, hold
can be interactions, but also spatial proximity can be defined as a feeble type of interaction. Let E be the set of
all entities in the world, F the set of all possible interactions, and N the natural numbers (representing ticks). An
interaction R is a function R : N F E E {0, 1}. If
two entities e1 and e2 interact at tick t in an event of type
f then R(t, f , e1 , e2 ) = 1, else it is 0.
An interaction-neighborhood or iHood for short, of entity e is the set of all active interactions at tick t of entity
e with any other entity in the world:
I(e,t) = {R|e0 E, f F, R(t, f , e, e0 ) = 1}
Each entity e in the virtual world owns an iHood, which
lists all entities it is currently interacting with, and in what
manner. These iHoods are updated each tick using two
procedures. The first is by continuously simulating all
events of the log to find events that involve the entity e.
The second is by examining the vicinity of entity e for the
entrance or departure of other entities. The vicinity of an
should occur when two conditions hold: first, the interactions are low, and second, there was a change of location. However, we also smooth the result using a Gaussian with local support to gain an ease-in and ease-out effect in scenes and events:

h(e,t) = G(t) ∗ (l(e,t) − s(e,t))

h(e,t) is a smoother version of l(e,t) which also incorporates the change-of-location information (Figure 4b).

We now use h(e,t) to segment the log of events into scenes. The beginning of a scene is signified by a transition from a period of low h(e,t) to a period of high h(e,t), i.e. ∂h(e,t)/∂t > 0, and conversely an end of scene is signified by h(e,t) going down, i.e. ∂h(e,t)/∂t < 0. Inside each scene, we still recognize interesting events as the times when a smoothed version of l(e,t) exceeds some threshold T (Figure 4c). An interesting scene is one that has at least one interesting event. Non-interesting scenes, such as periods when the character is waiting, will not be shown, since no interesting interaction has happened to the character, and hence l(e,t) remains low throughout the scene.
Choosing a specific character as the main character and using this mechanism results in a specific segmentation of the log into scenes from the point-of-view of this character. Changing the main character will create a different segmentation and a different story. Hence, this mechanism is a type of parametric temporal detail reduction, which is capable of emphasizing important information and removing insignificant data depending on the specific point-of-view. Furthermore, choosing different thresholds for defining interesting scenes and events results in different levels of detail in the story: the lower the threshold, the finer the granularity of the events in the story (see Section 9).
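The iHood bookkeeping above translates almost directly into code. The following is a minimal sketch, under the assumption that the log has already been expanded into per-tick interaction records (f, e, e′); the weight table, smoothing width and threshold names are illustrative, not values from the paper.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def interaction_level(ticks, entity, weights):
        """l(e,t): sum of interaction weights over the iHood of `entity` at each tick.
        `ticks` is a list of per-tick interaction sets: [{(f, e, e_other), ...}, ...]."""
        l = np.zeros(len(ticks))
        for t, interactions in enumerate(ticks):
            l[t] = sum(weights.get(f, 0.0)
                       for (f, e, e_other) in interactions if e == entity)
        return l

    def segment_scenes(l, location_spikes, sigma=5.0, event_threshold=1.0):
        """Scene boundaries from h(e,t) = G * (l - s); events where smoothed l > T."""
        h = gaussian_filter1d(l - location_spikes, sigma)
        rising = np.diff(h) > 0  # a scene starts where h begins to rise
        boundaries = [0] + [t + 1 for t in range(1, len(rising))
                            if rising[t] and not rising[t - 1]] + [len(l)]
        smoothed_l = gaussian_filter1d(l, sigma)
        scenes = []
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            events = [t for t in range(start, end) if smoothed_l[t] > event_threshold]
            if events:  # keep only "interesting" scenes
                scenes.append((start, end, events))
        return scenes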

Figure 4: (a) An example of the interaction level l(e,t) over time for the sequence shown in Figure 13. (b) The smoother version h(e,t), which also includes the change in location, and the scene partitioning based on it. Note that scene 3 could have been separated into two scenes using a slightly higher threshold. (c) Choosing interesting scenes and events is based on a threshold on the smoothed l(e,t). Note that scenes 4 and 5 would be cut out since no interesting event occurred inside them.

6 Converting to Visual Depiction

Once the interesting scenes have been selected from the log file, there is a need to convert them into a visual display. This is the main responsibility of the director. Currently, the director examines each scene independently, although inter-scene relations are certainly of importance and might be used in the future, for example, to change the
order of scenes. Our goal is to transform each scene into


a sequence of images which depict its main happenings
visually (Figure 2). This transformation includes three
major steps:

1. Choosing the specific events within the scene that


should be portrayed.
2. Using one of the possible comics temporal transitions of frames (Section 3) to portray the events.
3. Choosing the camera parameters for each specific
frame image: position, direction, zoom, etc.


Each scene is opened with one change-of-scene frame


which represents the scene-to-scene transition and creates
continuity in the storyline. This frame is usually a wide
angle image of the scene which gives the exposition of
the scene. Next, the main events in the scene are recognized using a threshold on a smoothed version of l(e,t)
within the scene (Figure 4c). If more than one event is
recognized in a scene, a subject-to-subject transition must
be used. Nevertheless, by depicting each event separately,
a subject-to-subject transition will emerge implicitly between the frames of the different events (Figure 5).
Each specific event can be displayed using a moment-to-moment or an action-to-action transition. A moment-to-moment transition is simple to create, as the event is shown with k frames at equal intervals in time. Nevertheless, it is rarely used in comics, and when it is, it is mainly to stress very important actions or to build tension by prolonging the perception of time (Figure 5(d)). A more appropriate and useful type of transition is the action-to-action transition. This transition depicts an event using fewer frames, presenting the significant movements in the event. Although even the same action performed by the same entity on different occasions may differ in the choice of significant frames, in most cases the frames depend on the type of action and its interval in time. Therefore, to depict an event using an action-to-action transition of frames we use predefined idioms depending on the type of event.
In general, an idiom is a mapping from a specific pattern of events in the scene to a specific sequence of time-frames. We can use the same idiom for different types of events, or, depending on the state of entities, choose different idioms to depict the same event. For instance, for a shooting sequence the director uses an idiom which displays two frames in an action-to-action transition manner: one just before the peak of the action (the peak in l(e,t)) and one a little after the outcome (see frames 5-8 in Figure 13 and Figure 5 (a) and (c)). A single peak in l(e,t) containing interactions with many entities is depicted by an idiom of a single wide-shot frame. However, when a player is shielded it is important to stress its personal point-of-view, and the director uses an idiom which defines first-person point-of-view shots (frames 12 and 16 in Figure 13).

Idioms can also define higher level mappings between otherwise individual events. In a conversation scene, many single events of interactions are replaced by one prolonged idiom which later enables the insertion of text balloons (frames 2-3 and 18-20 in Figure 13 and Figure 11). Figure 7 explains the origin of each frame of the comics example of Figure 13. The table in Figure 6 lists the primary idioms used for Doom comics. Although in this specific game the types of interactions are limited, the idiom mechanism is versatile and easily extendible to other fields, including other types of interactions.
Figure 5: The shooting-sequence idiom is composed of two sets of action-to-action transitions: one for the shooting entity and one for the target entity. This implicitly creates a subject-to-subject transition between the two sets. (a) A shooting sequence of the green player is depicted using an idiom composed of two action-to-action transitions and one subject-to-subject. (b) The view of the snapshots from (a) as seen in the game, without using our system's cameras. (c) The same type of idiom depicting a similar event from the blue player's story, only this time the player is getting hit. (d) An example of a moment-to-moment transition, which is not used in practice.

Idiom                  Depiction
Change of Scene        Wide shot
Shooting               Two action-to-action shots
Conversation           Interchanging shots
Peak Interaction       Wide shot
Object Picking         Long shot
Shielded Interaction   First-person POV shots
Figure 6: Examples of idioms used to depict interactions.
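A hypothetical sketch of how such an idiom table might be wired up in code (the pattern names and frame-selection helpers are illustrative stand-ins, not the paper's actual implementation):

    # Each idiom maps an event pattern to a frame selector. A selector takes the
    # event's tick interval and the interaction curve l(e,t) and returns the ticks
    # at which frames should be shot.

    def shooting_idiom(start, end, l):
        peak = max(range(start, end), key=lambda t: l[t])
        return [max(start, peak - 1), min(end - 1, peak + 2)]  # before peak, after outcome

    def single_shot_idiom(start, end, l):
        return [(start + end) // 2]  # one establishing frame (wide or long shot)

    IDIOMS = {
        "change_of_scene":      single_shot_idiom,
        "shooting":             shooting_idiom,
        "peak_interaction":     single_shot_idiom,
        "object_picking":       single_shot_idiom,  # stand-in for a single long shot
        "shielded_interaction": shooting_idiom,      # rendered as first-person POV shots
    }

    def direct_event(event_type, start, end, l):
        """Return the ticks the director should snapshot for this event."""
        return IDIOMS.get(event_type, single_shot_idiom)(start, end, l)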

7 Rendering: Visual Abstractions


As a result of the director's process, the story of a specific player or entity in the virtual world is portrayed using discrete points in time defined by idioms or by high interaction level. However, there is still a need to convert the 3D scene at these points in time into 2D images. This is done by the renderer, which is responsible for positioning the camera and choosing its parameters (Figure 8). The idioms used by the director define high-level directives. These include the main entity in the scene, or a point in space at the center of the scene, and a list of secondary entities that should be portrayed if possible. They can also include hints regarding the desired type of shot: close-up, medium, long, or wide shot. These directives are translated by the renderer into explicit camera parameters. For example, a central point or entity is translated into a direction, and the type of shot into a distance from the object.
Figure 7: Tracking the idiom origin of the frames of Figure 13.

Numerous works have addressed the problem of positioning a camera in a virtual environment under various constraints [3, 25, 7]. Nevertheless, most of them concentrate on dynamic shots and rely on cinematographic idioms. We use a simpler scheme to create the static shots for our comics. Using the center point of the scene, we create a circle whose radius is defined by the type of desired shot (wide, long, medium or close-up). A number of camera positions are sampled uniformly around the circle and shots are taken from these positions. If there is an obstacle such as a wall or a pillar between the camera and the center, the camera is advanced towards the center until the center is visible. Using the 180° rule between shots eliminates some of the images. The remaining images are weighted first based on the visibility of the primary entity and then on the visibility of the secondary entities. This is done by measuring the percentage of the image space the entity occupies in the resulting image. The image with the highest visibility is chosen for the comics (see Figure 9).

Figure 9: For each specific point in time, 18 camera positions are sampled around the event's central point (only 9 are shown). In this top view of the scene, this point is between the two main entities, the blue player and a yellow monster. The best image in terms of visibility (in this case the lower-right image) is chosen for the comics.
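A minimal sketch of this sampling-and-scoring loop, assuming helper functions (render_from, visible_fraction, has_obstacle) provided by a game-engine wrapper; these names, the radii and the simple weighted score are assumptions for illustration, and the 180° rule filter between consecutive shots is omitted here.

    import math

    SHOT_RADIUS = {"close-up": 2.0, "medium": 5.0, "long": 10.0, "wide": 20.0}

    def choose_camera(center, shot_type, primary, secondaries, engine, n_samples=18):
        """Sample cameras on a circle around the scene center; keep the most revealing one."""
        radius = SHOT_RADIUS[shot_type]
        best, best_score = None, -1.0
        for i in range(n_samples):
            angle = 2 * math.pi * i / n_samples
            cam = (center[0] + radius * math.cos(angle),
                   center[1] + radius * math.sin(angle),
                   center[2])
            # advance toward the center until no wall/pillar blocks the view
            while engine.has_obstacle(cam, center) and _dist(cam, center) > 0.5:
                cam = _lerp(cam, center, 0.1)
            image = engine.render_from(cam, look_at=center)
            # primary visibility dominates; secondary entities break ties
            score = engine.visible_fraction(image, primary) + \
                    0.1 * sum(engine.visible_fraction(image, s) for s in secondaries)
            if score > best_score:
                best, best_score = (cam, image), score
        return best

    def _dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def _lerp(a, b, t):
        return tuple(x + t * (y - x) for x, y in zip(a, b))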

To create the shading of the comics images, we follow the same principle of reduction in details as in the temporal selection. There are two options for achieving a certain style while rendering 3D graphics: the first is to use a specific shading style during rendering, and the second is to apply post-processing to the images. Several 3D rendering engines and techniques give objects a cartoon-like form. Unfortunately, in our case the game engine supports only the original dark-mood shading of Doom. Furthermore, the game itself has very low quality graphics and is based primarily on sprites and textures. Instead of full 3D data, we can acquire only fixed-resolution 2½D images from the game. Therefore, we first acquire the image in the original style. Next, we use the game to render foreground-only and background-only images to be used as masks.

Using these extracted images, we stylize the resulting image in post-processing. Our goal for the visual end results is to create the look-and-feel of comics, reduce the clutter of the original images, and enhance the main characters and events in the images. Several image processing techniques are combined to achieve the desired result: color clustering based on mean-shift is used for the background and k-means clustering for the foreground, while edge enhancement is achieved by applying an anisotropic Laplacian kernel to the image, creating stroke-like edges. Combining all these together creates the resulting images (Figure 10).
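As a rough sketch of such a post-processing pass, using standard OpenCV operators as stand-ins (the exact filters, parameters and the anisotropic variant of the Laplacian used in the paper are not reproduced here):

    import cv2
    import numpy as np

    def stylize(frame_bgr, fg_mask):
        """Comics-style pass: cluster colors (mean-shift background, k-means foreground),
        then overlay dark stroke-like edges from a Laplacian response."""
        # Background: mean-shift color clustering flattens textures into patches.
        out = cv2.pyrMeanShiftFiltering(frame_bgr, sp=10, sr=30)

        # Foreground: quantize to a handful of cartoon tones with k-means.
        fg_pixels = frame_bgr[fg_mask > 0].astype(np.float32)
        if len(fg_pixels) >= 6:
            criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
            _, labels, centers = cv2.kmeans(fg_pixels, 6, None, criteria, 3,
                                            cv2.KMEANS_RANDOM_CENTERS)
            out[fg_mask > 0] = centers[labels.flatten()].astype(np.uint8)

        # Edges: a plain Laplacian stands in for the anisotropic operator of the paper.
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        edges = cv2.Laplacian(gray, cv2.CV_16S, ksize=3)
        out[np.abs(edges) > 40] = 0  # paint strong edges as black strokes
        return out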
The renderer is also responsible for creating text balloons when a conversation takes place in the log. Currently, the idiom used for conversation is a simple alternation of images in front view of the speakers. Depending on the length of each image's text, the renderer chooses the size of the balloon and positions it to the left or right of the head, alternating between the speakers. The position takes into account the foreground mask of the speaker, the masks of other entities and the borders of the image. If the text does not fit into one balloon, it is distributed into several consecutive equivalent images (Figure 11).

Figure 8: A plot of the full path of the green player during the interaction. At different points in time and space, and based on the interaction level and idioms, the director chooses a snapshot. In this example, the granularity is high and many shots were taken. The renderer is responsible for positioning the cameras (green pyramids) to shoot the comics frames. Some examples of images taken from different cameras are shown closer-up at the top. Some cameras take two shots from the same position, for example, to create an action-to-action transition.

Figure 10: The 2½D image data extracted from the game engine and the different image processing techniques used to create the image stylization for our comics. There are two different styles: one is cartoon-like and one enhances the entities in the foreground by removing all color from the background.

8 Comics Layout

The last stage in the comics creation is the page layout. After rendering and stylization, all images are the same size and can be displayed consecutively, similar to the layout of Figure 11. Nevertheless, many comics pages do not use a sequence of uniform images, but present variations in the size, aspect ratio and positioning of the images to create effects such as stressing or prolonging. Imitating the full skills of a comics artist remains a desire. Still, we devise a layout algorithm for images that breaks the symmetry to create the look-and-feel of comics.

To simplify the problem, we constrain all rows in our page to the same height. This is often the case in many real comics sequences. Therefore, our layout algorithm becomes a one-dimensional problem of choosing the horizontal length of images in each row. Next, we classify the images into four basic classes: B (big), S (small), F (fixed), and N (neutral). The classification is based on their semantics, e.g. based on the idiom or event they portray. Peak-interaction images and change-of-scene images are classified as B, i.e. those that can be expanded horizontally. Images from an action-to-action pair or first-person point-of-view shots are classified as S, i.e. they can be condensed. Images that include text balloons are classified as F to protect the balloon layout. All other images are defined as N and can be expanded or condensed.

The layout process receives as input the length of the row in the comics. We always choose a whole multiple of the length of a regular image (usually k=4) as the length of a row. The layout proceeds in a greedy manner by fitting sets of up to k images, one row at a time, beginning from the first image until the last. At the beginning of each row it examines the next k images. If this set does not include any B-type image, then the k images are put in the row with a fixed size (see rows 2 and 6 in Figure 13). Otherwise, the first B-type image is put aside as a filler image. For all other images, the layout tries to comply with their constraints based on their types. Every B-type image is assigned a random expansion ratio between 1.2 and 2.0. Every S-type image is assigned a random condensing ratio between 0.7 and 1.0. Lastly, the filler image is expanded to compensate for all other images. If this is not sufficient, other N-type images are expanded. If no solution is found, the layout tries another round with different random values (see rows 4 and 5 in Figure 13). If after five trials it still fails, it inserts the images in their original sizes and proceeds to the next row (see the first row in Figure 13). This basic scheme is further enhanced for specific situations. For instance, when two consecutive B-type images are found in a row, the first is expanded as usual by a random ratio r1, and the second is expanded by r2 so that r1 + r2 = 3 (see the third row in Figure 13).

Once the expansion and condensing factors are set for a row of images, the images are pruned at the fringes, either horizontally (mostly for expansion) or vertically (mostly for condensing), to achieve the right aspect ratio. Pruning is done in a symmetric manner from the top and bottom, or left and right, respectively, with as little damage to the foreground mask as possible.

Figure 11: An example of the output of the layout algorithm for placing text balloons in our comics. The direction of the balloons alternates between entities, and the text is broken if it exceeds a certain size.

Figure 12: An extract from the green player's comics sequence created in the rendering style stressing the foreground. A larger portion with the cartoon-style rendering can be seen in Figure 13.


9 Results

We present results created from the interactions of a two-player episode of the Doom game. We denote the two players as the green player and the blue player. The original interaction takes 160 seconds and can be seen from the green player's point of view in the attached video (see regularInteraction.avi). All examples were created on an Intel Pentium 4, 2.8GHz, with 500MB of memory.
The log created from the interaction is 6.3MB in XML format and includes 5574 ticks. The scener first runs on the log to compose an interaction-script which includes all the iHoods for all entities. This takes 392 seconds, creating a script of size 683KB. This enables subsequent calls to the scener to be optimized and take only a few seconds. Hence, given a specific entity (point-of-view) and a threshold for scene cutting (granularity), the scener uses the interaction-script and the log to create a scened-log. This takes 3 seconds for the blue and green players. The resulting log is larger in size (8.68MB) since it is now segmented into scenes, and at the start of each scene it contains a synchronization point for all entities in the world. Later, this enables each scene to be portrayed independently, and may be used to change the order of scenes in the future.
The director uses the level-of-detail threshold to convert the scened-log into a directing-script. This script includes the high-level directives used for creating snapshots of the story over time; in our example it is 60.4KB. Lastly, the directing-script is used to create the comics images. We used several different thresholds to create several levels of detail of the story for both the green and the blue players (the reader is referred to http://www.faculty.idc.ac.il/arik/comics/ to view the full set of examples).

Assuming E is the number of entities in the world, T is the number of ticks and V is the number of events, the most costly stage of the process besides rendering is the creation and update of the iHood for each entity, which is O(T · E² · V). We use the game engine itself to render the basic images and extract all the masks for stylization. Stylization is done by filtering all images in post-processing. Currently this can take between 90 seconds in the focused style (black-and-white background) and 150 seconds in the cartoon style per image. The rendering stylization process is in fact the major bottleneck, since its complexity depends on the number of pixels rather than on the entities or timesteps.
In Figure 13 we present part of the results of the green player's story using the coarsest granularity. This means only the highlights of the interaction are shown. Lowering the threshold will yield more frames automatically, as can be seen on the above-mentioned web site. We used four different levels of detail for each of the two players, the blue and the green. We also present the comics result in the second rendering style; an example can be seen in Figure 12.

10 Conclusion

This paper describes an end-to-end system for the automatic creation of comics from 3D graphics interaction.
The major challenge met by this system is the transformation of elaborate and continuous graphic data into a
discrete and succinct representation. In both the temporal and the visual domains we follow the same principle
of abstraction by reducing details, and focusing on major
events and main characters. The system is built on top
of a real 3D game engine and is able to trace and log the
happenings in a multi-user world, and transform them into
comics automatically.
Doom as an interactive virtual environment has its limitations both in terms of rendering quality and in terms of plot. Nevertheless, we have built our approach in a general and modular fashion. Although our examples come from the Doom game, our approach and solutions are applicable to many other scenarios. These include more complex games such as The Sims or massively multiplayer games, and other types of interaction applications such as training, education or medicine. This work suggests that the goal of automatically extracting and generating a more compact representation of graphics while preserving its essence is feasible. This includes some level of semantic understanding, reduction of details, and the conversion of continuous 3D events to 2D images.
There are many possible extensions to this work and
many enhancements need to be addressed. In terms of
story understanding, causality between events should be
utilized to create better scene partitioning and to identify
important events. In terms of directing, more idioms, possibly for other applications, should be defined and used.


Furthermore, the current system for choosing transitions
based on idioms is almost memoryless. Remembering the
history of frames, idioms and plot, may create a smoother
and more interesting flow in the story.
In terms of rendering and stylization, the use of 3D rendering instead of 2D image post-processing will open up new possibilities. More elaborate camera placement and more interesting shots should be used (extreme angles, close-ups and more). The use of text in comics must also be investigated further. Text is often used as a form of narration; this is a powerful tool to advance the story and to fill gaps in terms of plot and atmosphere. In terms of layout, algorithms for automatically combining text and images to create effective displays, and algorithms for automatic page layout, are still in their infancy and should be investigated further. Lastly, the creation of comics and the creation of movies have much in common. Nevertheless, additional research is needed on camera movements and the dynamic flow of display to support the automatic creation of high-quality movies as well as better comics.

References

[1] Maneesh Agrawala, Doantam Phan, Julie Heiser, John Haymaker, Jeff Klingner, Pat Hanrahan, and Barbara Tversky. Designing effective step-by-step assembly instructions. ACM Transactions on Graphics, 22(3):828-837, 2003.

[2] D. Amerson and S. Kime. Real time cinematic camera control for interactive narratives. In Proceedings of AAAI SSS 2001, 2001.

[3] D. Amerson and K. Shaun. Real-time cinematic camera control for interactive narratives. In Proceedings of the American Association for Artificial Intelligence, page 14, 2000.

[4] Elisabeth Andre and Thomas Rist. Generating coherent presentations employing textual and visual material. Artificial Intelligence Review, 9(2-3):147-165, 1995.

[5] Jackie Assa, Yaron Caspi, and Daniel Cohen-Or. Action synopsis: Pose selection and illustration. ACM Transactions on Graphics, SIGGRAPH 2005 special issue (to appear), 2005.

[6] Matthew Brand. The inverse Hollywood problem: From video to scripts and storyboards via causal analysis. In Proceedings of the 14th Conference on Artificial Intelligence, pages 132-137, 1997.

[7] Fred Charles, Jean-Luc Lugrin, Marc Cavazza, and Steven J. Mead. Real-time camera control for interactive storytelling. In GAME-ON, 2002.

[8] E. Charniak. Toward a model of children's story comprehension. Technical Report AITR-266, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 1972.

[9] Luca Chittaro and Lucio Ieronutti. A visual tool for tracing users' behavior in virtual environments. In AVI '04: Proceedings of the Working Conference on Advanced Visual Interfaces, pages 40-47, New York, NY, USA, 2004. ACM Press.

[10] D. B. Christianson, S. E. Anderson, L.-W. He, D. H. Salesin, D. S. Weld, and M. F. Cohen. Declarative camera control for automatic cinematography. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 148-155, 1996.

[11] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603-619, May 2002.

[12] Doug DeCarlo and Anthony Santella. Stylization and abstraction of photographs. In Proceedings of ACM SIGGRAPH 2002, pages 769-776, 2002.

[13] Steven K. Feiner and Kathleen R. McKeown. Automating the generation of coordinated multimedia explanations. Computer, 24(10):33-41, 1991.

[14] Alan Fern, Jeffrey Mark Siskind, and Robert Givan. Learning temporal, relational, force-dynamic event definitions from video. In Eighteenth National Conference on Artificial Intelligence, pages 159-166, 2002.

[15] D. Friedman and Y. Feldman. Knowledge-based formalization of cinematic expression and its application to animation. In Proceedings of Eurographics 2002, pages 163-168, Saarbrucken, Germany, 2002.

[16] D. Friedman, Y. Feldman, A. Shamir, and Z. Dagan. Automated creation of movie summaries in interactive virtual environments. In Proceedings of IEEE Virtual Reality 2004, pages 191-199, 2004.

[17] Adam Geitgey. Where are the good open source games? OSNews, http://www.osnews.com/story.php?news id=8146, 2004.

[18] Bruce Gooch and Amy Ashurst Gooch. Non-Photorealistic Rendering. A K Peters, 2001.

Figure 13: The beginning of the Comics page from the point of view of the green player. The full sequence and more
examples can be found at http://www.faculty.idc.ac.il/arik/comics/.

[19] Nick Halper and Maic Masuch. Action summary for computer games: Extracting and capturing action for spectator modes and summaries. In Proceedings of the 2nd International Conference on Application and Development of Computer Games, pages 124-132, 2003.

[20] Alan Hanjalic, Reginald L. Lagendijk, and Jan Biemond. Automatically segmenting movies into logical story units. In Visual Information and Information Systems, pages 229-236, 1999.

[21] L. He, M. F. Cohen, and D. H. Salesin. The virtual cinematographer: A paradigm for automatic real-time camera control and directing. Computer Graphics, 30 (Annual Conference Series, SIGGRAPH 96):217-224, 1996.

[22] I. Herman and D. J. Duke. Minimal graphics. IEEE Computer Graphics and Applications, 21(6):18-21, 2001.

[23] Nathan Hoobler, Greg Humphreys, and Maneesh Agrawala. Visualizing competitive behaviors in multi-user virtual environments. In Proceedings of IEEE Visualization 2004, pages 163-170, October 2004.

[24] P. Karp and S. Feiner. Automated presentation planning of animation using task decomposition with heuristic reasoning. In Proceedings of Graphics Interface '93, pages 118-127, 1993.

[25] Kevin Kennedy and Robert E. Mercer. Planning animation cinematography and shot structure to communicate theme and mood. In Proceedings of the 2nd International Symposium on Smart Graphics, pages 1-8, 2002.

[26] David Kurlander. Chimera: Example-based graphical editing. In Watch What I Do: Programming by Demonstration, pages 271-290. MIT Press, Cambridge, MA, USA, 1993.

[27] David Kurlander, Tim Skelly, and David Salesin. Comic chat. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 225-236, 1996.

[28] Henry Lieberman. Mondrian: A teachable graphical editor. In Watch What I Do: Programming by Demonstration, pages 341-358. MIT Press, Cambridge, MA, USA, 1993.

[29] S. Lloyd. Least square quantization in PCM. IEEE Transactions on Information Theory, 28:129-137, 1982.

[30] Scott McCloud. Understanding Comics: The Invisible Art. Harper Perennial, 1994.

[31] E. T. Mueller. Story understanding. In Lynn Nadel, editor, Encyclopedia of Cognitive Science. Nature Publishing Group, London, 2002.

[32] E. T. Mueller. Understanding script-based stories using commonsense reasoning. Cognitive Systems Research, 5(4):307-340, 2004.

[33] Marc Nienhaus and Jurgen Dollner. Depicting dynamics using principles of visual art and narrations. IEEE Computer Graphics and Applications, 25(3):40-51, 2005.

[34] Shlomith Rimmon-Kenan. Narrative Fiction: Contemporary Poetics. Routledge, London, 2nd edition, 2002.

[35] R. C. Schank and R. P. Abelson. Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum, Hillsdale, NJ, 1977.

[36] Emanuele Trucco and Alessandro Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1st edition, 1998.

[37] Edward Tufte. Envisioning Information. Graphics Press, 1990.

[38] ZDOOM. http://zdoom.org, 2004.

