MagicQuill: An Intelligent Interactive Image Editing System
Figure 1. MagicQuill is an intelligent and interactive image editing system built upon diffusion models. Users seamlessly edit images using three intuitive brushstrokes: add, subtract, and color (A). An MLLM dynamically predicts user intentions from their brushstrokes and suggests contextual prompts (B1-B4). The examples demonstrate diverse editing operations: generating a jacket from a clothing contour (B1), adding a flower crown from head sketches (B2), removing the background (B3), and applying color changes to the hair and flowers (B4).
♡ Equal contribution. † Corresponding author.

Abstract

As a highly practical application, image editing encounters a variety of user demands and thus prioritizes excellent ease of use. In this paper, we unveil MagicQuill, an integrated image editing system designed to support users in swiftly actualizing their creativity. Our system starts with a streamlined yet functionally robust interface, enabling users to articulate their ideas (e.g., inserting elements, erasing objects, altering color, etc.) with just a few strokes. These interactions are then monitored by a multimodal large language model (MLLM) to anticipate user intentions in real time, bypassing the need for prompt entry. Finally, we apply the powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process the editing request with precise control. Please visit https://magicquill.github.io to try out our system.

1. Introduction

Performing precise and efficient edits on digital photographs remains a significant challenge, especially when aiming for nuanced modifications. As shown in Fig. 1, consider the task of editing a portrait of a lady where specific alterations are desired: converting a shirt to a custom-designed jacket, adding a flower crown at an exact position with a well-designed shape, dyeing portions of her hair in particular colors, and removing certain parts of the background to refine her appearance. Despite the rapid advancements in diffusion models [6, 10, 14, 19, 35–38, 47, 62, 68] and recent attempts to enhance control [20, 23, 48, 69], achieving such fine-grained and precise edits continues to pose difficulties, typically due to a lack of intuitive interfaces and models for fine-grained control. The challenges highlight the critical need for interactive editing systems that facilitate precise and efficient modifications. An ideal solution would empower users to specify what they want to edit, where to apply the changes, and how
the modifications should appear, all within a user-friendly interface that streamlines the editing process.

We aim to develop the first robust, open-source, interactive precise image editing system to make image editing easy and efficient. Our system seamlessly integrates three core modules: the Editing Processor, the Painting Assistor, and the Idea Collector. The Editing Processor ensures high-quality, controllable generation of edits, accurately reflecting users' editing intentions in color and edge adjustments. The Painting Assistor enhances the ability of the system to predict and interpret the users' editing intent. The Idea Collector serves as an intuitive interface, allowing users to input their ideas quickly and effortlessly, significantly boosting editing efficiency.

The Editing Processor implements two kinds of brushstroke-based guidance mechanisms: scribble guidance for structural modifications (e.g., adding, detailing, or removing elements) and color guidance for modification of color attributes. Inspired by ControlNet [66] and BrushNet [23], our control architecture ensures precise adherence to user guidance while preserving unmodified regions. Our Painting Assistor reduces the repetitive process of typing text prompts, which disrupts the editing workflow and creates a cumbersome transition between prompt input and image manipulation. It employs an MLLM to interpret brushstrokes and automatically predict prompts based on the image context. We call this novel task Draw&Guess. We construct a dataset simulating real editing scenarios for fine-tuning to ensure the effectiveness of the MLLM in understanding user intentions. This enables a continuous editing workflow, allowing users to iteratively edit images without manual prompt input. The Idea Collector provides an intuitive interface compatible with various platforms including Gradio and ComfyUI, allowing users to draw with different brushes, manipulate strokes, and perform continuous editing with ease.

We present a comprehensive evaluation of our interactive editing framework. Through qualitative and quantitative analyses, we demonstrate that our system significantly improves both the precision and efficiency of performing detailed image edits compared to existing methods. Our Editing Processor achieves superior edge alignment and color fidelity compared to baselines like SmartEdit [20] and BrushNet [23]. The Painting Assistor exhibits superior user intent interpretation capabilities compared to state-of-the-art MLLMs, including LLaVA-1.5 [31], LLaVA-Next [30], and GPT-4o [21]. User studies indicate that the Idea Collector significantly outperforms baseline interfaces in all aspects of system usability.

By leveraging advanced generative models and a user-centric design, our interactive editing framework significantly reduces the time and expertise required to perform detailed image edits. By addressing the limitations of current image editing tools and providing an innovative solution that enhances both precision and efficiency, our work advances the field of digital image manipulation. Our framework opens possibilities for users to engage creatively with image editing, achieving their goals easily and effectively.

2. Related Works

2.1. Image Editing

Image editing involves modifying the visual appearance, structure, or elements of an existing image [19]. Recent breakthroughs in diffusion models [17, 44, 49] have significantly advanced visual generation tasks, outperforming GAN-based models [15] in terms of image editing capabilities. To enable control and guidance in image editing, a variety of approaches have emerged, leveraging different modalities such as textual instructions [6, 11, 14, 32, 47, 65], masks [20, 23, 48, 69], layouts [10, 33, 68], segmentation maps [35, 62], and point-dragging interfaces [36–38]. Despite these advances, these methods often fall short when precise modifications at the regional level are required, such as alterations to object shape, color, and other details. Among the various methods, sketch-based editing approaches [22, 25, 34, 42, 59, 61, 64] offer users a more intuitive and precise means of interaction. However, current methods remain limited by the accuracy of the text signals input alongside the sketches, making it challenging to precisely control the attributes of the editing areas, such as color. To achieve precise control, we introduce two types of local guidance based on brushstrokes, scribble and color, thereby enabling fine-grained control over shape and color at the regional level.

2.2. MLLMs for Image Editing

Multi-modal large language models (MLLMs) extend LLMs to process both text and image content [16], enabling text-to-image generation [9, 28, 52, 53, 58], prompt refinement [60, 63], and image quality evaluation [51]. In the area of image editing, MLLMs have demonstrated significant potential. MGIE [13] enhances instruction-based image editing by using MLLMs to generate more expressive, detailed instructions. SmartEdit [20] leverages an MLLM for better understanding of and reasoning about complex instructions. FlexEdit [55] integrates an MLLM to understand image content, masks, and textual instructions. GenArtist [57] uses an MLLM agent to decompose complex tasks, guide tool selection, and enable systematic image generation, editing, and self-correction with step-by-step verification. Our system extends this line of research by introducing a more intuitive approach, utilizing an MLLM to simplify the editing process. Specifically, it directly integrates the image context with user-input strokes to
infer and translate the editing intentions, thereby automatically generating the necessary prompts without requiring repeated user input. This innovative task, which we term Draw&Guess, facilitates a continuous editing workflow, enabling users to iteratively refine images with minimal manual intervention.

2.3. Interactive Support for Image Generation

Interactive support enhances the performance and usability of generative models through human-in-the-loop collaboration [27]. Recent works have focused on making prompt engineering more user-friendly through techniques like image clustering [4, 12] and attention visualization [56]. Despite advancements in interactive support, a key challenge remains in bridging the gap between verbal prompts and visual output. While systems like PromptCharm [56] and DesignPrompt [39] use inpainting for interactive image editing, these tools typically offer only coarse-grained control over element addition and removal, requiring users to brush over areas before generating objects within those regions. Furthermore, users must manually input prompts to specify the objects they wish to generate. Our approach addresses these limitations by introducing fine-grained image editing through the use of brushstrokes. Additionally, we incorporate a multimodal large language model (MLLM) that provides on-the-fly assistance by interpreting user intentions and suggesting prompts in real time, thereby reducing cognitive load and enhancing overall usability.

3. System Design

Our system is structured around three key aspects: an Editing Processor with a strong generative prior, a Painting Assistor with instant intent prediction, and an Idea Collector with a user-friendly interface. An overview of our system design is presented in Fig. 2.

Figure 2. System framework consisting of three integrated components: an Editing Processor with dual-branch architecture for controllable image inpainting, a Painting Assistor for real-time intent prediction, and an Idea Collector offering versatile brush tools. This design enables intuitive and precise image editing through brushstroke-based interactions.

Our system introduces brushstroke-based control signals to give intuitive and precise control. These signals allow users to express their editing intentions by simply drawing what they envision. We designed two types of brushes, scribble and color, to accurately manipulate the edited image. The scribble brushes, the add brush and the subtract brush, aim to provide precise structural control by operating on the edge map of the original image. The color brush works with downsampled color blocks to enable fine-grained color manipulation of specific regions. Fig. 3 illustrates the workflow that converts the user's hand-drawn input signals into control conditions for faithfully inpainting the target editing area. Inspired by Ju et al. [23] and Zhang et al. [66], we attach two additional branches to the latent diffusion framework [44], with the inpainting branch giving content-aware per-pixel guidance for the re-generation of the editing area, and the control branch providing structural guidance. The model architecture is illustrated in Fig. 4. Further details will be discussed in Sec. 3.1.
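To make these two signals concrete, the sketch below shows one plausible way of turning the strokes into the conditions described above: add and subtract strokes edit the edge map of the original image, and the color stroke region is reduced to coarse, downsampled color blocks. The block size and the pooling scheme are illustrative assumptions, not values taken from our implementation.

```python
import torch
import torch.nn.functional as F

def build_conditions(edge_map, add_mask, subtract_mask, image, color_mask, block=16):
    """Assemble edge and color conditions from user strokes (illustrative sketch).

    edge_map:      (1, 1, H, W) edge map of the original image, values in [0, 1]
    add_mask:      (1, 1, H, W) binary mask of add-brush strokes
    subtract_mask: (1, 1, H, W) binary mask of subtract-brush strokes
    image:         (1, 3, H, W) original image in [0, 1]
    color_mask:    (1, 1, H, W) binary mask of the color-brush region
    """
    # Scribble guidance: add strokes insert new edges, subtract strokes erase them.
    edge_cond = torch.clamp(edge_map + add_mask, 0, 1) * (1 - subtract_mask)

    # Color guidance: average-pool the painted region into coarse color blocks,
    # then upsample back so the condition aligns spatially with the image.
    blocks = F.avg_pool2d(image * color_mask, kernel_size=block)
    weight = F.avg_pool2d(color_mask, kernel_size=block).clamp(min=1e-6)
    color_cond = F.interpolate(blocks / weight, size=image.shape[-2:], mode="nearest")
    color_cond = color_cond * color_mask  # keep guidance only where the user painted

    return edge_cond, color_cond
```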
To reduce the cognitive load for users to input appropriate prompts at every stage of editing, our system integrates an MLLM [29] as the Painting Assistor. This component analyzes user brushstrokes to deduce the editing intention based on the image context, thereby automatically suggesting contextually relevant prompts for editing. We have named this innovative task Draw&Guess.
Figure 4. Overview of our Editing Processor. The proposed architecture extends the latent diffusion UNet with two specialized branches: an inpainting branch for content-aware per-pixel inpainting guidance and a control branch for structural guidance, enabling precise brush-based image editing.

C = {E_cond, C_cond}. We adopt ControlNet [66] to insert conditional control into the middle and decoder blocks of the diffusion UNet. Let F^C(z_t, C, t; Θ_C)_i represent the output of the i-th layer of the ControlNet; the control feature insertion can then be formulated as

F(z_t, τ, t; Θ)_{⌊n/2⌋+i} += w_C · Z(F^C(z_t, C, t; Θ_C)_i),   (5)

where w_C is an adjustable hyperparameter that determines the control strength. Neither the inpainting branch nor the control branch alters the weights of the pre-trained diffusion model, making the module a plug-and-play component applicable to any community fine-tuned diffusion model. The control branch is trained using the denoising score matching objective, which can be written as

L = E_{z_t, t, ϵ∼N(0,I)} [ ‖ϵ − ϵ_c(z_t, C, t; {Θ, Θ_C})‖² ],   (6)

where ϵ_c is the combination of the denoising U-Net and the ControlNet model.
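A minimal PyTorch-style sketch of the feature insertion in Eq. (5) is given below. It assumes the control branch returns one residual per mirrored UNet layer and that Z(·) is the usual zero-initialized convolution; the module layout and names are illustrative rather than our exact implementation.

```python
import torch
import torch.nn as nn

class ZeroConv(nn.Module):
    """Zero-initialized 1x1 projection Z(.) so the control signal starts as a no-op."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        return self.proj(x)

def inject_control(unet_features, control_features, zero_convs, w_c=0.5):
    """Eq. (5): add scaled control residuals to the middle/decoder UNet features.

    unet_features:    list of n feature maps F(z_t, tau, t; Theta)_j from the frozen UNet
    control_features: feature maps F^C(z_t, C, t; Theta_C)_i from the control branch
    zero_convs:       one ZeroConv per control feature
    w_c:              control strength (0.5 in our default setting)
    """
    offset = len(unet_features) // 2  # floor(n/2): skip the encoder half
    for i, (feat_c, zc) in enumerate(zip(control_features, zero_convs)):
        unet_features[offset + i] = unet_features[offset + i] + w_c * zc(feat_c)
    return unet_features
```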
3.2. Painting Assistor

Prompt formatting. In our system, we implement two types of question answering (Q&A) [3] tasks to facilitate Draw&Guess. For the add brush, we utilize a prompt structured as follows: "This is a 'draw and guess' game. I will upload an image containing some strokes. To help you locate the strokes, I will give you the normalized bounding box coordinates of the strokes, where their original coordinates are divided by the padded image width and height. The top-left corner of the bounding box is at (x1, y1), and the bottom-right corner is at (x2, y2). Now tell me in a single word or a phrase, what am I trying to draw with these strokes in the image?" The Q&A output directly serves as the predicted prompt. For the subtract brush, we bypass the Q&A process, as the results demonstrate that prompt-free generation achieves satisfactory results.

For the color brush, the Q&A setup is similar: "The user will upload an image containing some contours in red color. To help you locate the contour, ... You need to identify what is inside the contours using a single word or phrase." (the repetitive part is omitted). The system extracts contour information from the color brush stroke boundaries. The final predicted prompt is generated by combining the stroke's color information with the Q&A output. To optimize response time, we constrain Q&A responses to concise, single-word or short-phrase formats.
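The sketch below illustrates how such a query could be assembled in practice: the stroke mask is reduced to a normalized bounding box, spliced into the question template, and, for the color brush, the answer is combined with the stroke's color. The helper ask_mllm stands in for a call to the fine-tuned LLaVA model and is hypothetical.

```python
import numpy as np

ADD_TEMPLATE = (
    "This is a 'draw and guess' game. I will upload an image containing some strokes. "
    "To help you locate the strokes, I will give you the normalized bounding box "
    "coordinates of the strokes. The top-left corner of the bounding box is at "
    "({x1:.2f}, {y1:.2f}), and the bottom-right corner is at ({x2:.2f}, {y2:.2f}). "
    "Now tell me in a single word or a phrase, what am I trying to draw with these "
    "strokes in the image?"
)

def stroke_bbox(mask, pad_w, pad_h):
    """Normalized bounding box of a binary stroke mask (H, W), divided by padded size."""
    ys, xs = np.nonzero(mask)
    return xs.min() / pad_w, ys.min() / pad_h, xs.max() / pad_w, ys.max() / pad_h

def predict_add_prompt(image, stroke_mask, ask_mllm):
    x1, y1, x2, y2 = stroke_bbox(stroke_mask, image.shape[1], image.shape[0])
    question = ADD_TEMPLATE.format(x1=x1, y1=y1, x2=x2, y2=y2)
    return ask_mllm(image, question)          # e.g. "cake"

def predict_color_prompt(image, color_mask, stroke_color_name, ask_mllm):
    # The color-brush question asks what lies inside the (red-outlined) contour;
    # the final prompt combines the answer with the stroke's color.
    answer = ask_mllm(image, "Identify what is inside the contours using a single word or phrase.")
    return f"{stroke_color_name} {answer}"    # e.g. "red vase"
```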
For the color brush Q&A task, accurate object recognition within contours is essential. LLaVA [31] inherently excels in object recognition tasks, making it adept at identifying the content within color brush stroke boundaries. However, the interpretation of add brush strokes poses a significant challenge due to the inherent abstraction of human hand-drawn strokes or sketches. To address this, we find it necessary to construct a specialized dataset to fine-tune LLaVA to better understand and interpret human hand-drawn brushstrokes.

Dataset Construction. We selected the Densely Captioned Images (DCI) dataset [54] as our primary source. Each image in the DCI dataset has detailed, multi-granular masks, accompanied by open-vocabulary labels and rich descriptions. This rich annotation structure enables the capture of diverse visual features and semantic contexts.

Step 1: Answer Generation for Q&A. The initial stage involves generating edge maps using PiDiNet [50] from images in the DCI dataset, as shown in Fig. 5b. We calculate the edge density within the masked regions and select the top 5 masks with the highest edge densities, as illustrated in Fig. 5c. The labels corresponding to these selected masks serve as the ground truths for the Q&A. To ensure the model focuses on guessing user intent rather than parsing irrelevant details, we clean the labels to keep only their noun components, emphasizing the essential elements.

Step 2: Simulating Brushstrokes with Edge Overlay. In the second part of the dataset construction, we focus on the five masks identified in the first step. Each mask undergoes random shape expansion to introduce variability. We use the BrushNet [23] model based on SDXL [41] to perform inpainting on these augmented masks with an empty prompt, as shown in Fig. 5d. Subsequently, the edge maps generated earlier are overlaid onto the inpainted areas, as in Fig. 5e. These overlay images simulate practical examples of how user hand-drawn strokes might alter an image.
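A condensed sketch of this two-step construction is shown below. Generic detect_edges and inpaint callables stand in for PiDiNet and the SDXL-based BrushNet, since the point of the sketch is the edge-density mask selection and the edge overlay rather than the exact backbones.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def edge_density(edge_map, mask):
    """Mean edge response inside a binary mask."""
    area = mask.sum()
    return (edge_map * mask).sum() / area if area > 0 else 0.0

def random_expand(mask, max_iter=10):
    """Randomly dilate a binary mask to introduce shape variability."""
    return binary_dilation(mask, iterations=np.random.randint(1, max_iter + 1))

def build_training_samples(image, masks, labels, detect_edges, inpaint, top_k=5):
    """Step 1 + Step 2 of the dataset construction (illustrative sketch).

    masks:  list of (H, W) binary masks from the DCI annotations
    labels: open-vocabulary labels aligned with `masks`, cleaned to noun components
    """
    edge_map = detect_edges(image)                        # stand-in for PiDiNet

    # Step 1: keep the top-k masks with the highest edge density; their labels
    # become the ground-truth answers for Draw&Guess.
    ranked = sorted(range(len(masks)),
                    key=lambda i: edge_density(edge_map, masks[i]), reverse=True)[:top_k]

    samples = []
    for i in ranked:
        # Step 2: randomly expand the mask, inpaint it with an empty prompt,
        # then overlay the original edges to mimic hand-drawn strokes.
        expanded = random_expand(masks[i])
        background = inpaint(image, expanded, prompt="")  # stand-in for BrushNet + SDXL
        overlay = np.where(edge_map[..., None] * masks[i][..., None] > 0.5, 1.0, background)
        samples.append({"image": overlay, "label": labels[i], "mask": masks[i]})
    return samples
```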
MLLM Fine-Tuning. Our dataset construction method effectively prepares the model to understand and predict user edits. The dataset contains a total of 24,315 images, categorized under 4,412 different labels, ensuring a broad spectrum of data for training. To optimize the performance of the MLLM on Draw&Guess, we fine-tuned the LLaVA model on this dataset.
Figure 5. Illustration of dataset construction process. (a) Original images from the DCI dataset; (b) Edge maps extracted from original
images; (c) Selected masks (highlighted in purple) with highest edge density; (d) Results after BrushNet inpainting on augmented masked
regions; (e) Final results with edge map overlay on selected areas. By overlaying edge maps on inpainted results, we simulate scenarios
where users edit images with brush strokes, as the edge maps resemble hand-drawn sketches. The bounding box coordinates of the mask
and labels are inherited from the DCI dataset.
Figure 6. Visual result comparison. The first two columns present the edge and color conditions for editing, while the last column shows
the ground truth image that the models aim to recreate. SmartEdit [20] utilizes natural language for guidance, but lacks precision in
controlling shape and color, often affecting non-target regions. SketchEdit [64], a GAN-based approach [15], struggles with open-domain
image generation, falling short compared to models with diffusion-based generative priors. Although BrushNet [23] delivers seamless
image inpainting, it struggles to align edges and colors simultaneously, even with ControlNet [66] enhancement. In contrast, our Editing
Processor strictly adheres to both edge and color conditions, achieving high-fidelity conditional image editing.
BrushNet [23], and ControlNet [66]. As illustrated in Fig. 6, the instruction-based method, SmartEdit, tends to produce outputs that are too random, lacking the precision required for accurate editing purposes. Similarly, while BrushNet enables region-specific modifications, it struggles to maintain predictable detail generation even with ControlNet enhancement, making precise manipulation challenging. In contrast, our model achieves more accurate edge alignment and color fidelity, which we attribute to our specialized design of the inpainting and control branches that emphasizes these aspects.

Table 1. Quantitative results and input condition comparisons between the baselines and ours. Our Editing Processor performs better than the baselines across all metrics, indicating its superiority in controllable generation over edge and color.

Method         Text  Edge  Color  LPIPS [67]  PSNR    SSIM
SmartEdit       ✓     ✗     ✗     0.339       16.695  0.561
SketchEdit      ✗     ✓     ✗     0.138       23.288  0.835
BrushNet        ✓     ✗     ✗     0.0817      25.455  0.893
Brush.+Cont.    ✓     ✓     ✓     0.0748      25.770  0.894
Ours            ✓     ✓     ✓     0.0667      27.282  0.902

We further conducted a quantitative analysis on our constructed test dataset from Sec. 3.2, which contains 490 images. Our model outperformed the baselines across all key metrics, as shown in Tab. 1. These results demonstrate significant improvements in controllable generation.

4.2. Prediction Accuracy

To evaluate the prediction accuracy of the Painting Assistor, we compared it with three state-of-the-art MLLMs: LLaVA-1.5 [31], LLaVA-Next [30], and GPT-4o [21] on our test dataset of 490 images from Sec. 3.2. Each model was prompted with images containing sketches and bounding box coordinates to generate semantic interpretations. The semantic outputs were assessed using three metrics: BERT [8], CLIP [43], and GPT-4 [2] similarity scores, which measure the closeness of the generated descriptions to the ground truth. For GPT-4 similarity, we ask GPT-4 to rate the semantic and visual similarity between the predicted response and the ground truth on a 5-point scale, where 1 means "completely different", 3 means "somewhat related", and 5 means "exactly the same".
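As one concrete example of these metrics, the snippet below computes the CLIP text-embedding cosine similarity between a predicted description and its ground-truth label using the Hugging Face CLIP implementation; the BERT and GPT-4 scores are obtained analogously (embedding similarity and a rating prompt, respectively). The model checkpoint chosen here is illustrative.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(prediction: str, ground_truth: str) -> float:
    """Cosine similarity between the CLIP text embeddings of two phrases."""
    batch = tokenizer([prediction, ground_truth], padding=True, return_tensors="pt")
    feats = model.get_text_features(**batch)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

print(clip_similarity("red vase", "crimson flower vase"))  # higher = closer to ground truth
```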
The evaluation results are presented in Tab. 2, illustrating that our model achieves the highest prediction accuracy
among all tested MLLMs. This superior performance indicates that our Painting Assistor more accurately captures and predicts the semantic meanings of user drawings.

Table 2. Performance comparison between our Painting Assistor and other MLLMs, demonstrating superior visual and semantic consistency in predictions.

To further qualitatively evaluate the Painting Assistor, we conducted a user study with 30 participants who freely edited images using our system. Participants rated the Painting Assistor on a 5-point scale for prediction accuracy (1: very poor, 5: excellent) and efficiency facilitation (1: significantly reduced, 5: significantly enhanced). As shown in Fig. 7, 86.67% of users rated prediction accuracy at least 4, validating the ability of our fine-tuned MLLM to interpret user intentions. Similarly, 90% rated efficiency facilitation 4 or above, confirming that Draw&Guess effectively streamlines the editing process by reducing manual prompt inputs. The average scores for accuracy and efficiency were 4.07 and 4.37, respectively.

Figure 7. User ratings for the Painting Assistor, focusing on its prediction accuracy and efficiency enhancement capabilities.

4.3. Idea Collection Effectiveness and Efficiency

Collecting user ideas effectively and efficiently is critical for the usability and adoption of interactive systems, especially in creative applications where user engagement is crucial. To evaluate the Idea Collector, we conducted a user study with 30 participants, comparing our system against a baseline system on the following dimensions:
• Complexity and Efficiency measures how streamlined and intuitive the user finds the system for creative editing.
• Consistency and Integration assesses whether the system maintains a cohesive interface and interaction design.
• Ease of Use captures the learnability of the system, especially for users with varying backgrounds.
• Overall Satisfaction reflects users' general satisfaction with the design, features, and usability of the system.

Baseline. The baseline system was implemented as a customized ComfyUI workflow, replacing our Idea Collector interface with an open-source canvas, Painter Node [40]. This setup enables us to focus on the value provided by our Idea Collector by controlling other variables.

Procedure. The study lasted approximately 30 minutes for each participant across the two systems (our system and the baseline). Each session began with a brief introduction to the system using the case illustrated in Fig. 1. Participants then had 5 minutes to freely explore and edit images. After using both systems, participants completed a questionnaire with 22 questions (10 questions per system covering all four dimensions, plus 2 questions regarding the Painting Assistor detailed in Sec. 4.2). We employed the System Usability Scale (SUS) [5] for scoring, using a Likert scale from 1 (strongly disagree) to 5 (strongly agree), to capture a global view of subjective usability for each system.
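For reference, the snippet below shows the conventional SUS scoring rule for ten 1-5 Likert responses (odd items positively worded, even items negatively worded, rescaled to a 0-100 score); it is included only to clarify how SUS aggregates Likert ratings, and the example responses are made up.

```python
def sus_score(responses):
    """Standard SUS scoring for ten 1-5 Likert responses (items 1, 3, 5, ... positive)."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5  # maps the 0-40 raw sum onto a 0-100 scale

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # e.g. 90.0
```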
As shown in Fig. 8, our system demonstrated significantly higher scores across all dimensions compared to the baseline, indicating the effectiveness of our Idea Collector. Further details can be found in the supplementary material.

5. Conclusion

In conclusion, our interactive image editing system MagicQuill effectively addresses the challenges of performing precise and efficient edits by combining the strengths of the Editing Processor, Painting Assistor, and Idea Collector. Our comprehensive evaluations demonstrate significant improvements over existing methods in terms of controllable generation quality, editing intent prediction accuracy, and user interface efficiency. For future work, we aim to expand the capabilities of our system by incorporating additional editing types, such as reference-based editing, which would allow users to guide modifications using external images. We also plan to implement layered image generation to provide better editing flexibility and support for complex compositions. Moreover, enhancing typography support will enable more robust manipulation of textual elements within images. These developments will further enrich our framework, offering users a more versatile and powerful tool for creative expression in digital image editing.
References

[1] Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. Gradio: Hassle-free sharing and testing of ml models in the wild. arXiv preprint arXiv:1906.02569, 2019.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[4] Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Grossman. Promptify: Text-to-image generation through interactive prompt exploration with large language models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–14, 2023.
[5] John Brooke et al. SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.
[6] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
[7] ComfyUI. The most powerful and modular diffusion model gui, api and backend with a graph/nodes interface. https://github.com/comfyanonymous/ComfyUI, 2024.
[8] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[9] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[10] Dave Epstein, Allan Jabri, Ben Poole, Alexei Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems, 36:16222–16239, 2023.
[11] Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, and Zeyu Wang. Dit4edit: Diffusion transformer for image editing. arXiv preprint arXiv:2411.03286, 2024.
[12] Yingchaojie Feng, Xingbo Wang, Kam Kwai Wong, Sijia Wang, Yuhong Lu, Minfeng Zhu, Baicheng Wang, and Wei Chen. Promptmagician: Interactive prompt engineering for text-to-image creation. IEEE Transactions on Visualization and Computer Graphics, 30(1):295–305, 2024.
[13] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
[14] Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Houqiang Li, Han Hu, et al. Instructdiffusion: A generalist modeling interface for vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12709–12720, 2024.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[16] Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, et al. Llms meet multimodal generation and editing: A survey. arXiv preprint arXiv:2405.19334, 2024.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[19] Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, Jiaxi Lv, Jianzhuang Liu, Wei Xiong, He Zhang, Shifeng Chen, and Liangliang Cao. Diffusion model-based image editing: A survey. arXiv preprint arXiv:2402.17525, 2024.
[20] Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024.
[21] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[22] Youngjoo Jo and Jongyoul Park. Sc-fegan: Face editing generative adversarial network with user's sketch and color. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1745–1753, 2019.
[23] Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. arXiv preprint arXiv:2403.06976, 2024.
[24] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
[25] Kangyeol Kim, Sunghyun Park, Junsoo Lee, and Jaegul Choo. Reference-based image composition with sketch via structure-aware diffusion model. arXiv preprint arXiv:2304.09748, 2023.
[26] Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[27] Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon, Jaemin Jo, Juho Kim, and Jinwook Seo. Large-scale text-to-image generation models for visual artists' creative works. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 919–933, New York, NY, USA, 2023. Association for Computing Machinery.
[28] Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. Advances in Neural Information Processing Systems, 36, 2024.
[29] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[30] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024.
[31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[32] Zhiheng Liu, Ruili Feng, Kai Zhu, Yifei Zhang, Kecheng Zheng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones: Concept neurons in diffusion models for customized generation. arXiv preprint arXiv:2303.05125, 2023.
[33] Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones 2: Customizable image synthesis with multiple subjects. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 57500–57519, 2023.
[34] Weihang Mao, Bo Han, and Zihao Wang. Sketchffusion: Sketch-guided image editing with diffusion model. In 2023 IEEE International Conference on Image Processing (ICIP), pages 790–794. IEEE, 2023.
[35] Naoki Matsunaga, Masato Ishii, Akio Hayakawa, Kenji Suzuki, and Takuya Narihira. Fine-grained image editing by pixel-wise guidance using diffusion models. arXiv preprint arXiv:2212.02024, 2022.
[36] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Dragondiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
[37] Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, and Chongxuan Li. The blessing of randomness: Sde beats ode in general diffusion-based image editing. arXiv preprint arXiv:2311.01410, 2023.
[38] Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
[39] Xiaohan Peng, Janin Koch, and Wendy E. Mackay. Designprompt: Using multimodal interaction for design exploration with generative ai. In Proceedings of the 2024 ACM Designing Interactive Systems Conference, pages 804–818, New York, NY, USA, 2024. Association for Computing Machinery.
[40] Aleksey Petrov. Comfyui custom nodes alekpet. https://github.com/AlekPet/ComfyUI_Custom_Nodes_AlekPet, 2024.
[41] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[42] Tiziano Portenier, Qiyang Hu, Attila Szabo, Siavash Arjomand Bigdeli, Paolo Favaro, and Matthias Zwicker. Faceshop: Deep sketch-based face image editing. arXiv preprint arXiv:1804.08972, 2018.
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[44] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[45] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[46] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[47] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
[48] Jaskirat Singh, Jianming Zhang, Qing Liu, Cameron Smith, Zhe Lin, and Liang Zheng. Smartmask: Context aware high-fidelity mask generation for fine-grained object insertion and layout control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6497–6506, 2024.
[49] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[50] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5117–5127, 2021.
[51] Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, et al. Dreamsync: Aligning text-to-image generation with image understanding feedback. In Synthetic Data for Computer Vision Workshop @ CVPR 2024, 2023.
[52] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
[53] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024.
[54] Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700–26709, 2024.
[55] Jue Wang, Yuxiang Lin, Tianshuo Yuan, Zhi-Qi Cheng, Xiaolong Wang, Jiao GH, Wei Chen, and Xiaojiang Peng. Flexedit: Marrying free-shape masks to vllm for flexible image editing. arXiv preprint arXiv:2408.12429, 2024.
[56] Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. Promptcharm: Text-to-image generation through multi-modal prompting and refinement. In Proceedings of the CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2024.
[57] Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing. arXiv preprint arXiv:2407.05600, 2024.
[58] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.
[59] Chufeng Xiao and Hongbo Fu. Customsketching: Sketch concept extraction for sketch-based image synthesis and editing. arXiv preprint arXiv:2402.17624, 2024.
[60] Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. In Forty-first International Conference on Machine Learning, 2024.
[61] Shuai Yang, Zhangyang Wang, Jiaying Liu, and Zongming Guo. Deep plastic surgery: Robust and controllable image editing with human-drawn sketches. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, pages 601–617. Springer, 2020.
[62] Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike, et al. Imagebrush: Learning visual in-context instructions for exemplar-based image manipulation. Advances in Neural Information Processing Systems, 36, 2024.
[63] Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. Idea2img: Iterative self-refinement with gpt-4v(ision) for automatic image design and generation. arXiv preprint arXiv:2310.08541, 2023.
[64] Yu Zeng, Zhe Lin, and Vishal M Patel. Sketchedit: Mask-free local image manipulation with partial sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5951–5961, 2022.
[65] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36, 2024.
[66] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[67] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[68] Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, and Yusuke Iwasawa. Paste, inpaint and harmonize via denoising: Subject-driven image editing with pre-trained diffusion model. arXiv preprint arXiv:2306.07596, 2023.
[69] Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. arXiv preprint arXiv:2312.03594, 2023.
MagicQuill: An Intelligent Interactive Image Editing System
Supplementary Material
A. Implementation Details

A.1. Editing Processor

Our Editing Processor is built upon Stable Diffusion v1.5 [44] and is compatible with all customized fine-tuned weights. We set the control parameters to an inpainting strength of w_I = 1.0 and a control strength of w_C = 0.5, and expand the mask region by 15 pixels during controllable inpainting. The generation process employs the Euler ancestral sampler with the Karras scheduler [24], requiring 20 steps per generation. On standard hardware, generating a 512 × 512 resolution image takes approximately 2 seconds with 15 GB of VRAM. For the control branch, we conduct fine-tuning on the LAION-Aesthetics dataset [46], specifically selecting images with aesthetic scores above 6.5. The training process spans 3 epochs with a learning rate of 5e−6 and a batch size of 8.
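These settings can be summarized as a small configuration, shown below together with the 15-pixel mask expansion implemented as a morphological dilation. OpenCV is used here purely for illustration; any equivalent dilation would do, and the dictionary keys are our own naming rather than an actual API.

```python
import cv2
import numpy as np

EDIT_CONFIG = {
    "base_model": "stable-diffusion-v1.5",   # any community fine-tune also works
    "inpainting_strength": 1.0,              # w_I
    "control_strength": 0.5,                 # w_C
    "sampler": "euler_ancestral",
    "scheduler": "karras",
    "steps": 20,
    "mask_expand_px": 15,
}

def expand_mask(mask: np.ndarray, radius: int = EDIT_CONFIG["mask_expand_px"]) -> np.ndarray:
    """Expand a binary (uint8, 0/255) edit mask by `radius` pixels before inpainting."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * radius + 1, 2 * radius + 1))
    return cv2.dilate(mask, kernel, iterations=1)
```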
A.2. Painting Assistor

We fine-tune a LLaVA-1.5 model with 7B parameters for the Draw&Guess task on our constructed dataset from Sec. 3.2, leveraging LoRA [18]. The LoRA rank and alpha are 64 and 16, respectively. The model is trained for 3 epochs with a learning rate of 2e−5 and a batch size of 8. Under 4-bit quantization, the model achieves real-time prompt inference within 0.3 seconds using only 5 GB of VRAM, enabling efficient on-the-fly prompt generation with satisfactory accuracy.
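A sketch of the corresponding fine-tuning setup is shown below, written with the PEFT library's LoraConfig for concreteness. The dropout value and target modules are illustrative assumptions; only the rank, alpha, epochs, learning rate, batch size, and 4-bit inference quantization come from the settings above.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                 # LoRA rank
    lora_alpha=16,        # LoRA scaling factor
    lora_dropout=0.05,    # assumed; not specified above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

TRAIN_CONFIG = {
    "base_model": "llava-1.5-7b",
    "epochs": 3,
    "learning_rate": 2e-5,
    "batch_size": 8,
    "quantization": "4-bit",   # inference-time quantization for ~0.3 s prompt prediction
}
```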
A.3. Idea Collector

Cross-platform Support. Besides Gradio, MagicQuill can also be integrated into ComfyUI as a custom node, as shown in Fig. 9. It is designed with customizable widgets for parameter settings and an extensible architecture for future platform integrations.

Figure 9. MagicQuill as a custom node in ComfyUI.

Usage Scenario. To demonstrate the user-friendly workflow of MagicQuill, we present an illustrative scenario: a user wants to modify an image of a complete cake, cutting a slice out of it, as shown in Fig. 2. The user begins by uploading the image through the toolbar, which provides access to a range of tools (Fig. 2-B). Using the add brush, the user outlines the slice to be cut directly on the canvas (Fig. 2-D). Meanwhile, the Draw&Guess feature introduced in Sec. 3.2 predicts that the user intends to manipulate a "cake" and suggests the relevant prompt automatically in the prompt area (Fig. 2-A). Afterward, the user switches to the subtract brush to fill in the outlined slice, visually marking the area to be removed from the cake. For additional precision, the eraser tool is available to refine the cut. Once the adjustments are made, the user generates the image by clicking the Run button (Fig. 2-F), which runs the model detailed in Sec. 3.1.

The resulting image appears in the generated image area (Fig. 2-E). Users can confirm changes via the tick icon to update the canvas, or click the cross icon to revert modifications. This workflow enables iterative refinement of edits, providing flexible control throughout the process.

B. Failure Case

B.1. Failure Case of Editing Processor

Scribble-Prompt Trade-Off. We observe quality degradation when user-provided add brush strokes deviate from the semantic content specified in the prompt, a common occurrence among users with limited artistic skills. This creates a fundamental trade-off: strictly following the scribble structure may compromise the generation quality with respect to the text prompt. To address this issue, we propose adjusting the edge control strength.

Figure 10. Illustration of the Scribble-Prompt Trade-Off. Given user-provided brush strokes (a) with the text prompt "man", we show generation results with different edge control strengths: (b) with a strength of 0.6 and (c) with a strength of 0.2.

As demonstrated in Fig. 10, when presented with an oversimplified sketch that substantially deviates from the prompt "man", a high edge strength of 0.6 produces results
that, while faithful to the sketch, appear inharmonious. By reducing the edge strength to 0.2, we achieve notably improved generation quality.

Colorization-Details Trade-Off. We observe a trade-off between colorization accuracy and detail preservation. Since our conditional image inpainting pipeline relies on downsampled color blocks and CNN-extracted edge maps as input, structural details in the edited regions may be compromised during the generation process.

Figure 12. Demonstration of semantic ambiguity in sketch interpretation. (a) User's sketch intended to represent a raspberry; (b) our Draw&Guess model incorrectly interprets the sketch as candy, leading to a misaligned generation; (c) the expected generation result with the correct raspberry interpretation.
Figure 15. The baseline system implemented in ComfyUI.
Figure 16. The questionnaire and user ratings comparing MagicQuill to the baseline system (1=strongly disagree, 5=strongly agree).
Figure 17. A gallery of creative image editing achieved by the participants of the user study using MagicQuill. Each pair shows the
original image and its edited version, demonstrating diverse user-driven modifications.