MagicQuill: An Intelligent Interactive Image Editing System
Figure 1. MagicQuill is an intelligent and interactive image editing system built upon diffusion models. Users seamlessly edit images using three intuitive brushstrokes: add, subtract, and color (A). An MLLM dynamically predicts user intentions from their brushstrokes and suggests contextual prompts (B1-B4). The examples demonstrate diverse editing operations: generating a jacket from a clothing contour (B1), adding a flower crown from head sketches (B2), removing the background (B3), and applying color changes to the hair and flowers (B4).
♡ Equal contribution. † Corresponding author.

Abstract

As a highly practical application, image editing encounters a variety of user demands and thus prioritizes excellent ease of use. In this paper, we unveil MagicQuill, an integrated image editing system designed to support users in swiftly actualizing their creativity. Our system starts with a streamlined yet functionally robust interface, enabling users to articulate their ideas (e.g., inserting elements, erasing objects, altering color, etc.) with just a few strokes. These interactions are then monitored by a multimodal large language model (MLLM) to anticipate user intentions in real time, bypassing the need for prompt entry. Finally, we apply the powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process the editing request with precise control. Please visit https://magicquill.github.io to try out our system.

1. Introduction

Performing precise and efficient edits on digital photographs remains a significant challenge, especially when aiming for nuanced modifications. As shown in Fig. 1, consider the task of editing a portrait of a lady where specific alterations are desired: converting a shirt to a custom-designed jacket, adding a flower crown at an exact position with a well-designed shape, dyeing portions of her hair in particular colors, and removing certain parts of the background to refine her appearance. Despite the rapid advancements in diffusion models [6, 10, 14, 19, 35–38, 47, 62, 68] and recent attempts to enhance control [20, 23, 48, 69], achieving such fine-grained and precise edits continues to pose difficulties, typically due to a lack of intuitive interfaces and models for fine-grained control. The challenges highlight the critical need for interactive editing systems that facilitate precise and efficient modifications. An ideal solution would empower users to specify what they want to edit, where to apply the changes, and how
the modifications should appear, all within a user-friendly interface that streamlines the editing process.

We aim to develop the first robust, open-source, interactive precise image editing system to make image editing easy and efficient. Our system seamlessly integrates three core modules: the Editing Processor, the Painting Assistor, and the Idea Collector. The Editing Processor ensures high-quality, controllable generation of edits, accurately reflecting users' editing intentions in color and edge adjustments. The Painting Assistor enhances the ability of the system to predict and interpret the users' editing intent. The Idea Collector serves as an intuitive interface, allowing users to input their ideas quickly and effortlessly, significantly boosting editing efficiency.

The Editing Processor implements two kinds of brushstroke-based guidance mechanisms: scribble guidance for structural modifications (e.g., adding, detailing, or removing elements) and color guidance for modification of color attributes. Inspired by ControlNet [66] and BrushNet [23], our control architecture ensures precise adherence to user guidance while preserving unmodified regions. Our Painting Assistor reduces the repetitive process of typing text prompts, which disrupts the editing workflow and creates a cumbersome transition between prompt input and image manipulation. It employs an MLLM to interpret brushstrokes and automatically predict prompts based on the image context. We call this novel task Draw&Guess. We construct a dataset simulating real editing scenarios for fine-tuning to ensure the effectiveness of the MLLM in understanding user intentions. This enables a continuous editing workflow, allowing users to iteratively edit images without manual prompt input. The Idea Collector provides an intuitive interface compatible with various platforms including Gradio and ComfyUI, allowing users to draw with different brushes, manipulate strokes, and perform continuous editing with ease.

We present a comprehensive evaluation of our interactive editing framework. Through qualitative and quantitative analyses, we demonstrate that our system significantly improves both the precision and efficiency of performing detailed image edits compared to existing methods. Our Editing Processor achieves superior edge alignment and color fidelity compared to baselines like SmartEdit [20] and BrushNet [23]. The Painting Assistor exhibits superior user intent interpretation capabilities compared to state-of-the-art MLLMs, including LLaVA-1.5 [31], LLaVA-Next [30], and GPT-4o [21]. User studies indicate that the Idea Collector significantly outperforms baseline interfaces in all aspects of system usability.

By leveraging advanced generative models and a user-centric design, our interactive editing framework significantly reduces the time and expertise required to perform detailed image edits. By addressing the limitations of current image editing tools and providing an innovative solution that enhances both precision and efficiency, our work advances the field of digital image manipulation. Our framework opens possibilities for users to engage creatively with image editing, achieving their goals easily and effectively.

2. Related Works

2.1. Image Editing

Image editing involves modifying the visual appearance, structure, or elements of an existing image [19]. Recent breakthroughs in diffusion models [17, 44, 49] have significantly advanced visual generation tasks, outperforming GAN-based models [15] in terms of image editing capabilities. To enable control and guidance in image editing, a variety of approaches have emerged, leveraging different modalities such as textual instructions [6, 11, 14, 32, 47, 65], masks [20, 23, 48, 69], layouts [10, 33, 68], segmentation maps [35, 62], and point-dragging interfaces [36–38]. Despite these advances, these methods often fall short when precise modifications at the regional level are required, such as alterations to object shape, color, and other details. Among the various methods, sketch-based editing approaches [22, 25, 34, 42, 59, 61, 64] offer users a more intuitive and precise means of interaction. However, current methods remain limited by the accuracy of the text signals input alongside the sketches, making it challenging to precisely control the attributes of the editing areas, such as color. To achieve precise control, we introduce two types of local guidance based on brushstrokes, scribble and color, thereby enabling fine-grained control over shape and color at the regional level.

2.2. MLLMs for Image Editing

Multi-modal large language models (MLLMs) extend LLMs to process both text and image content [16], enabling text-to-image generation [9, 28, 52, 53, 58], prompt refinement [60, 63], and image quality evaluation [51]. In the area of image editing, MLLMs have demonstrated significant potential. MGIE [13] enhances instruction-based image editing by using MLLMs to generate more expressive, detailed instructions. SmartEdit [20] leverages an MLLM for better understanding of and reasoning about complex instructions. FlexEdit [55] integrates an MLLM to understand image content, masks, and textual instructions. GenArtist [57] uses an MLLM agent to decompose complex tasks, guide tool selection, and enable systematic image generation, editing, and self-correction with step-by-step verification. Our system extends this line of research by introducing a more intuitive approach, utilizing an MLLM to simplify the editing process. Specifically, it directly integrates the image context with user-input strokes to
infer and translate the editing intentions, thereby automatically generating the necessary prompts without requiring repeated user input. This innovative task, which we term Draw&Guess, facilitates a continuous editing workflow, enabling users to iteratively refine images with minimal manual intervention.

2.3. Interactive Support for Image Generation

Interactive support enhances the performance and usability of generative models through human-in-the-loop collaboration [27]. Recent works have focused on making prompt engineering more user-friendly through techniques like image clustering [4, 12] and attention visualization [56]. Despite advancements in interactive support, a key challenge remains in bridging the gap between verbal prompts and visual output. While systems like PromptCharm [56] and DesignPrompt [39] use inpainting for interactive image editing, these tools typically offer only coarse-grained control over element addition and removal, requiring users to brush over areas before generating objects within those regions. Furthermore, users must manually input prompts to specify the objects they wish to generate. Our approach addresses these limitations by introducing fine-grained image editing through the use of brushstrokes. Additionally, we incorporate a multimodal large language model (MLLM) that provides on-the-fly assistance by interpreting user intentions and suggesting prompts in real time, thereby reducing cognitive load and enhancing overall usability.

3. System Design

Our system is structured around three key aspects: an Editing Processor with a strong generative prior, a Painting Assistor with instant intent prediction, and an Idea Collector with a user-friendly interface. An overview of our system design is presented in Fig. 2.

Figure 2. System framework consisting of three integrated components: an Editing Processor with dual-branch architecture for controllable image inpainting, a Painting Assistor for real-time intent prediction, and an Idea Collector offering versatile brush tools. This design enables intuitive and precise image editing through brushstroke-based interactions.

Our system introduces brushstroke-based control signals to give intuitive and precise control. These signals allow users to express their editing intentions by simply drawing what they envision. We designed two types of brushes, scribble and color, to accurately manipulate the edited image. The scribble brushes, the add brush and the subtract brush, aim to provide precise structural control by operating on the edge map of the original image. The color brush works with downsampled color blocks to enable fine-grained color manipulation of specific regions. Fig. 3 illustrates the workflow that converts the user's hand-drawn input signals into control conditions for faithfully inpainting the target editing area. Inspired by Ju et al. [23] and Zhang et al. [66], we attach two additional branches to the latent diffusion framework [44], with the inpainting branch giving content-aware per-pixel guidance for the re-generation of the editing area, and the control branch providing structural guidance. The model architecture is illustrated in Fig. 4. Further details will be discussed in Sec. 3.1.
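To make these two signals concrete, the sketch below shows one plausible way of turning the strokes into the conditions described above: add and subtract strokes edit the edge map of the original image, and the color stroke region is reduced to coarse, downsampled color blocks. The block size and the pooling scheme are illustrative assumptions, not values taken from our implementation.

```python
import torch
import torch.nn.functional as F

def build_conditions(edge_map, add_mask, subtract_mask, image, color_mask, block=16):
    """Assemble edge and color conditions from user strokes (illustrative sketch).

    edge_map:      (1, 1, H, W) edge map of the original image, values in [0, 1]
    add_mask:      (1, 1, H, W) binary mask of add-brush strokes
    subtract_mask: (1, 1, H, W) binary mask of subtract-brush strokes
    image:         (1, 3, H, W) original image in [0, 1]
    color_mask:    (1, 1, H, W) binary mask of the color-brush region
    """
    # Scribble guidance: add strokes insert new edges, subtract strokes erase them.
    edge_cond = torch.clamp(edge_map + add_mask, 0, 1) * (1 - subtract_mask)

    # Color guidance: average-pool the painted region into coarse color blocks,
    # then upsample back so the condition aligns spatially with the image.
    blocks = F.avg_pool2d(image * color_mask, kernel_size=block)
    weight = F.avg_pool2d(color_mask, kernel_size=block).clamp(min=1e-6)
    color_cond = F.interpolate(blocks / weight, size=image.shape[-2:], mode="nearest")
    color_cond = color_cond * color_mask  # keep guidance only where the user painted

    return edge_cond, color_cond
```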
To reduce the cognitive load for users to input appropriate prompts at every stage of editing, our system integrates an MLLM [29] as the Painting Assistor. This component analyzes user brushstrokes to deduce the editing intention based on the image context, thereby automatically suggesting contextually relevant prompts for editing. We have named this innovative task Draw&Guess.
Figure 4. Overview of our Editing Processor. The proposed architecture extends the latent diffusion UNet with two specialized branches: an inpainting branch for content-aware per-pixel inpainting guidance and a control branch for structural guidance, enabling precise brush-based image editing.

C = {E_cond, C_cond}. We adopt ControlNet [66] to insert conditional control into the middle and decoder blocks of the diffusion UNet. Let F^C(z_t, C, t; Θ_C)_i represent the output of the i-th layer of the ControlNet; the control feature insertion can then be formulated as

F(z_t, τ, t; Θ)_{⌊n/2⌋+i} += w_C · Z(F^C(z_t, C, t; Θ_C)_i),   (5)

where w_C is an adjustable hyperparameter that determines the control strength. Neither the inpainting branch nor the control branch alters the weights of the pre-trained diffusion model, making the module a plug-and-play component applicable to any community fine-tuned diffusion model. The control branch is trained using the denoising score matching objective, which can be written as

L = E_{z_t, t, ϵ∼N(0,I)} [ ‖ϵ − ϵ_c(z_t, C, t; {Θ, Θ_C})‖² ],   (6)

where ϵ_c is the combination of the denoising U-Net and the ControlNet model.
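A minimal PyTorch-style sketch of the feature insertion in Eq. (5) is given below. It assumes the control branch returns one residual per mirrored UNet layer and that Z(·) is the usual zero-initialized convolution; the module layout and names are illustrative rather than our exact implementation.

```python
import torch
import torch.nn as nn

class ZeroConv(nn.Module):
    """Zero-initialized 1x1 projection Z(.) so the control signal starts as a no-op."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        return self.proj(x)

def inject_control(unet_features, control_features, zero_convs, w_c=0.5):
    """Eq. (5): add scaled control residuals to the middle/decoder UNet features.

    unet_features:    list of n feature maps F(z_t, tau, t; Theta)_j from the frozen UNet
    control_features: feature maps F^C(z_t, C, t; Theta_C)_i from the control branch
    zero_convs:       one ZeroConv per control feature
    w_c:              control strength (0.5 in our default setting)
    """
    offset = len(unet_features) // 2  # floor(n/2): skip the encoder half
    for i, (feat_c, zc) in enumerate(zip(control_features, zero_convs)):
        unet_features[offset + i] = unet_features[offset + i] + w_c * zc(feat_c)
    return unet_features
```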
3.2. Painting Assistor

Prompt formatting. In our system, we implement two types of question answering (Q&A) [3] tasks to facilitate Draw&Guess. For the add brush, we utilize a prompt structured as follows: "This is a 'draw and guess' game. I will upload an image containing some strokes. To help you locate the strokes, I will give you the normalized bounding box coordinates of the strokes, where their original coordinates are divided by the padded image width and height. The top-left corner of the bounding box is at (x1, y1), and the bottom-right corner is at (x2, y2). Now tell me in a single word or a phrase, what am I trying to draw with these strokes in the image?" The Q&A output directly serves as the predicted prompt. For the subtract brush, we bypass the Q&A process, as the results demonstrate that prompt-free generation achieves satisfactory results.

For the color brush, the Q&A setup is similar: "The user will upload an image containing some contours in red color. To help you locate the contour, ... You need to identify what is inside the contours using a single word or phrase." (the repetitive part is omitted). The system extracts contour information from the color brush stroke boundaries. The final predicted prompt is generated by combining the stroke's color information with the Q&A output. To optimize response time, we constrain Q&A responses to concise, single-word or short-phrase formats.
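The sketch below illustrates how such a query could be assembled in practice: the stroke mask is reduced to a normalized bounding box, spliced into the question template, and, for the color brush, the answer is combined with the stroke's color. The helper ask_mllm stands in for a call to the fine-tuned LLaVA model and is hypothetical.

```python
import numpy as np

ADD_TEMPLATE = (
    "This is a 'draw and guess' game. I will upload an image containing some strokes. "
    "To help you locate the strokes, I will give you the normalized bounding box "
    "coordinates of the strokes. The top-left corner of the bounding box is at "
    "({x1:.2f}, {y1:.2f}), and the bottom-right corner is at ({x2:.2f}, {y2:.2f}). "
    "Now tell me in a single word or a phrase, what am I trying to draw with these "
    "strokes in the image?"
)

def stroke_bbox(mask, pad_w, pad_h):
    """Normalized bounding box of a binary stroke mask (H, W), divided by padded size."""
    ys, xs = np.nonzero(mask)
    return xs.min() / pad_w, ys.min() / pad_h, xs.max() / pad_w, ys.max() / pad_h

def predict_add_prompt(image, stroke_mask, ask_mllm):
    x1, y1, x2, y2 = stroke_bbox(stroke_mask, image.shape[1], image.shape[0])
    question = ADD_TEMPLATE.format(x1=x1, y1=y1, x2=x2, y2=y2)
    return ask_mllm(image, question)          # e.g. "cake"

def predict_color_prompt(image, color_mask, stroke_color_name, ask_mllm):
    # The color-brush question asks what lies inside the (red-outlined) contour;
    # the final prompt combines the answer with the stroke's color.
    answer = ask_mllm(image, "Identify what is inside the contours using a single word or phrase.")
    return f"{stroke_color_name} {answer}"    # e.g. "red vase"
```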
For the color brush Q&A task, accurate object recognition within contours is essential. LLaVA [31] inherently excels in object recognition tasks, making it adept at identifying the content within color brush stroke boundaries. However, the interpretation of add brush strokes poses a significant challenge due to the inherent abstraction of human hand-drawn strokes or sketches. To address this, we find it necessary to construct a specialized dataset to fine-tune LLaVA to better understand and interpret human hand-drawn brushstrokes.

Dataset Construction. We selected the Densely Captioned Images (DCI) dataset [54] as our primary source. Each image in the DCI dataset has detailed, multi-granular masks, accompanied by open-vocabulary labels and rich descriptions. This rich annotation structure enables the capture of diverse visual features and semantic contexts.

Step 1: Answer Generation for Q&A. The initial stage involves generating edge maps using PiDiNet [50] from images in the DCI dataset, as shown in Fig. 5b. We calculate the edge density within the masked regions and select the top 5 masks with the highest edge densities, as illustrated in Fig. 5c. The labels corresponding to these selected masks serve as the ground truths for the Q&A. To ensure the model focuses on guessing user intent rather than parsing irrelevant details, we clean the labels to keep only their noun components, emphasizing the essential elements.

Step 2: Simulating Brushstrokes with Edge Overlay. In the second part of the dataset construction, we focus on the five masks identified in the first step. Each mask undergoes random shape expansion to introduce variability. We use the BrushNet [23] model based on SDXL [41] to perform inpainting on these augmented masks with an empty prompt, as shown in Fig. 5d. Subsequently, the edge maps generated earlier are overlaid onto the inpainted areas, as in Fig. 5e. These overlay images simulate practical examples of how user hand-drawn strokes might alter an image.
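A condensed sketch of this two-step construction is shown below. Generic detect_edges and inpaint callables stand in for PiDiNet and the SDXL-based BrushNet, since the point of the sketch is the edge-density mask selection and the edge overlay rather than the exact backbones.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def edge_density(edge_map, mask):
    """Mean edge response inside a binary mask."""
    area = mask.sum()
    return (edge_map * mask).sum() / area if area > 0 else 0.0

def random_expand(mask, max_iter=10):
    """Randomly dilate a binary mask to introduce shape variability."""
    return binary_dilation(mask, iterations=np.random.randint(1, max_iter + 1))

def build_training_samples(image, masks, labels, detect_edges, inpaint, top_k=5):
    """Step 1 + Step 2 of the dataset construction (illustrative sketch).

    masks:  list of (H, W) binary masks from the DCI annotations
    labels: open-vocabulary labels aligned with `masks`, cleaned to noun components
    """
    edge_map = detect_edges(image)                        # stand-in for PiDiNet

    # Step 1: keep the top-k masks with the highest edge density; their labels
    # become the ground-truth answers for Draw&Guess.
    ranked = sorted(range(len(masks)),
                    key=lambda i: edge_density(edge_map, masks[i]), reverse=True)[:top_k]

    samples = []
    for i in ranked:
        # Step 2: randomly expand the mask, inpaint it with an empty prompt,
        # then overlay the original edges to mimic hand-drawn strokes.
        expanded = random_expand(masks[i])
        background = inpaint(image, expanded, prompt="")  # stand-in for BrushNet + SDXL
        overlay = np.where(edge_map[..., None] * masks[i][..., None] > 0.5, 1.0, background)
        samples.append({"image": overlay, "label": labels[i], "mask": masks[i]})
    return samples
```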
MLLM Fine-Tuning. Our dataset construction method effectively prepares the model to understand and predict user edits. The dataset contains a total of 24,315 images, categorized under 4,412 different labels, ensuring a broad spectrum of data for training. To optimize the performance of the MLLM on Draw&Guess, we fine-tuned the LLaVA model on this dataset.
Figure 5. Illustration of dataset construction process. (a) Original images from the DCI dataset; (b) Edge maps extracted from original
images; (c) Selected masks (highlighted in purple) with highest edge density; (d) Results after BrushNet inpainting on augmented masked
regions; (e) Final results with edge map overlay on selected areas. By overlaying edge maps on inpainted results, we simulate scenarios
where users edit images with brush strokes, as the edge maps resemble hand-drawn sketches. The bounding box coordinates of the mask
and labels are inherited from the DCI dataset.
Figure 6. Visual result comparison. The first two columns present the edge and color conditions for editing, while the last column shows
the ground truth image that the models aim to recreate. SmartEdit [20] utilizes natural language for guidance, but lacks precision in
controlling shape and color, often affecting non-target regions. SketchEdit [64], a GAN-based approach [15], struggles with open-domain
image generation, falling short compared to models with diffusion-based generative priors. Although BrushNet [23] delivers seamless
image inpainting, it struggles to align edges and colors simultaneously, even with ControlNet [66] enhancement. In contrast, our Editing
Processor strictly adheres to both edge and color conditions, achieving high-fidelity conditional image editing.
BrushNet [23], and ControlNet [66]. As illustrated in Fig. 6, the instruction-based method, SmartEdit, tends to produce outputs that are too random, lacking the precision required for accurate editing purposes. Similarly, while BrushNet enables region-specific modifications, it struggles to maintain predictable detail generation even with ControlNet enhancement, making precise manipulation challenging. In contrast, our model achieves more accurate edge alignment and color fidelity, which we attribute to our specialized design of the inpainting and control branches that emphasizes these aspects.

Table 1. Quantitative results and input condition comparisons between the baselines and ours. Our Editing Processor performs better than the baselines across all metrics, indicating its superiority in controllable generation over edge and color.

Method         Text  Edge  Color  LPIPS [67]  PSNR    SSIM
SmartEdit       ✓     ✗     ✗     0.339       16.695  0.561
SketchEdit      ✗     ✓     ✗     0.138       23.288  0.835
BrushNet        ✓     ✗     ✗     0.0817      25.455  0.893
Brush.+Cont.    ✓     ✓     ✓     0.0748      25.770  0.894
Ours            ✓     ✓     ✓     0.0667      27.282  0.902

We further conducted a quantitative analysis on our constructed test dataset from Sec. 3.2, which contains 490 images. Our model outperformed the baselines across all key metrics, as shown in Tab. 1. These results demonstrate significant improvements in controllable generation.

4.2. Prediction Accuracy

To evaluate the prediction accuracy of the Painting Assistor, we compared it with three state-of-the-art MLLMs: LLaVA-1.5 [31], LLaVA-Next [30], and GPT-4o [21] on our test dataset of 490 images from Sec. 3.2. Each model was prompted with images containing sketches and bounding box coordinates to generate semantic interpretations. The semantic outputs were assessed using three metrics: BERT [8], CLIP [43], and GPT-4 [2] similarity scores, which measure the closeness of the generated descriptions to the ground truth. For GPT-4 similarity, we ask GPT-4 to rate the semantic and visual similarity between the predicted response and the ground truth on a 5-point scale, where 1 means "completely different", 3 means "somewhat related", and 5 means "exactly the same".
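As one concrete example of these metrics, the snippet below computes the CLIP text-embedding cosine similarity between a predicted description and its ground-truth label using the Hugging Face CLIP implementation; the BERT and GPT-4 scores are obtained analogously (embedding similarity and a rating prompt, respectively). The model checkpoint chosen here is illustrative.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(prediction: str, ground_truth: str) -> float:
    """Cosine similarity between the CLIP text embeddings of two phrases."""
    batch = tokenizer([prediction, ground_truth], padding=True, return_tensors="pt")
    feats = model.get_text_features(**batch)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

print(clip_similarity("red vase", "crimson flower vase"))  # higher = closer to ground truth
```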
The evaluation results are presented in Tab. 2, illustrating that our model achieves the highest prediction accuracy
among all tested MLLMs. This superior performance indicates that our Painting Assistor more accurately captures and predicts the semantic meanings of user drawings.

Table 2. Performance comparison between our Painting Assistor and other MLLMs, demonstrating superior visual and semantic consistency in predictions.

To further qualitatively evaluate the Painting Assistor, we conducted a user study with 30 participants who freely edited images using our system. Participants rated the Painting Assistor on a 5-point scale for prediction accuracy (1: very poor, 5: excellent) and efficiency facilitation (1: significantly reduced, 5: significantly enhanced). As shown in Fig. 7, 86.67% of users rated prediction accuracy at least 4, validating the ability of our fine-tuned MLLM to interpret user intentions. Similarly, 90% rated efficiency facilitation 4 or above, confirming that Draw&Guess effectively streamlines the editing process by reducing manual prompt inputs. The average scores for accuracy and efficiency were 4.07 and 4.37, respectively.

Figure 7. User ratings for the Painting Assistor, focusing on its prediction accuracy and efficiency enhancement capabilities.

4.3. Idea Collection Effectiveness and Efficiency

Collecting user ideas effectively and efficiently is critical for the usability and adoption of interactive systems, especially in creative applications where user engagement is crucial. To evaluate the Idea Collector, we conducted a user study with 30 participants, comparing our system against a baseline system on the following dimensions:
• Complexity and Efficiency measures how streamlined and intuitive the user finds the system for creative editing.
• Consistency and Integration assesses whether the system maintains a cohesive interface and interaction design.
• Ease of Use captures the learnability of the system, especially for users with varying backgrounds.
• Overall Satisfaction reflects users' general satisfaction with the design, features, and usability of the system.

Baseline. The baseline system was implemented as a customized ComfyUI workflow, replacing our Idea Collector interface with an open-source canvas, Painter Node [40]. This setup enables us to focus on the value provided by our Idea Collector by controlling other variables.

Procedure. The study lasted approximately 30 minutes for each participant across the two systems (our system and the baseline). Each session began with a brief introduction to the system using the case illustrated in Fig. 1. Participants then had 5 minutes to freely explore and edit images. After using both systems, participants completed a questionnaire with 22 questions (10 questions per system covering all four dimensions, plus 2 questions regarding the Painting Assistor detailed in Sec. 4.2). We employed the System Usability Scale (SUS) [5] for scoring, using a Likert scale from 1 (strongly disagree) to 5 (strongly agree), to capture a global view of subjective usability for each system.
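For reference, the snippet below shows the conventional SUS scoring rule for ten 1-5 Likert responses (odd items positively worded, even items negatively worded, rescaled to a 0-100 score); it is included only to clarify how SUS aggregates Likert ratings, and the example responses are made up.

```python
def sus_score(responses):
    """Standard SUS scoring for ten 1-5 Likert responses (items 1, 3, 5, ... positive)."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5  # maps the 0-40 raw sum onto a 0-100 scale

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # e.g. 90.0
```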
As shown in Fig. 8, our system demonstrated significantly higher scores across all dimensions compared to the baseline, indicating the effectiveness of our Idea Collector. Further details can be found in the supplementary material.

5. Conclusion

In conclusion, our interactive image editing system MagicQuill effectively addresses the challenges of performing precise and efficient edits by combining the strengths of the Editing Processor, Painting Assistor, and Idea Collector. Our comprehensive evaluations demonstrate significant improvements over existing methods in terms of controllable generation quality, editing intent prediction accuracy, and user interface efficiency. For future work, we aim to expand the capabilities of our system by incorporating additional editing types, such as reference-based editing, which would allow users to guide modifications using external images. We also plan to implement layered image generation to provide better editing flexibility and support for complex compositions. Moreover, enhancing typography support will enable more robust manipulation of textual elements within images. These developments will further enrich our framework, offering users a more versatile and powerful tool for creative expression in digital image editing.
References

[1] Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. Gradio: Hassle-free sharing and testing of ml models in the wild. arXiv preprint arXiv:1906.02569, 2019.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[4] Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Grossman. Promptify: Text-to-image generation through interactive prompt exploration with large language models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–14, 2023.
[5] John Brooke et al. SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.
[6] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
[7] ComfyUI. The most powerful and modular diffusion model gui, api and backend with a graph/nodes interface. https://github.com/comfyanonymous/ComfyUI, 2024.
[8] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[9] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[10] Dave Epstein, Allan Jabri, Ben Poole, Alexei Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems, 36:16222–16239, 2023.
[11] Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, and Zeyu Wang. Dit4edit: Diffusion transformer for image editing. arXiv preprint arXiv:2411.03286, 2024.
[12] Yingchaojie Feng, Xingbo Wang, Kam Kwai Wong, Sijia Wang, Yuhong Lu, Minfeng Zhu, Baicheng Wang, and Wei Chen. Promptmagician: Interactive prompt engineering for text-to-image creation. IEEE Transactions on Visualization and Computer Graphics, 30(1):295–305, 2024.
[13] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
[14] Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Houqiang Li, Han Hu, et al. Instructdiffusion: A generalist modeling interface for vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12709–12720, 2024.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[16] Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, et al. Llms meet multimodal generation and editing: A survey. arXiv preprint arXiv:2405.19334, 2024.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[19] Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, Jiaxi Lv, Jianzhuang Liu, Wei Xiong, He Zhang, Shifeng Chen, and Liangliang Cao. Diffusion model-based image editing: A survey. arXiv preprint arXiv:2402.17525, 2024.
[20] Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024.
[21] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[22] Youngjoo Jo and Jongyoul Park. Sc-fegan: Face editing generative adversarial network with user's sketch and color. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1745–1753, 2019.
[23] Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. arXiv preprint arXiv:2403.06976, 2024.
[24] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
[25] Kangyeol Kim, Sunghyun Park, Junsoo Lee, and Jaegul Choo. Reference-based image composition with sketch via structure-aware diffusion model. arXiv preprint arXiv:2304.09748, 2023.
[26] Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[27] Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon, Jaemin Jo, Juho Kim, and Jinwook Seo. Large-scale text-to-image generation models for visual artists' creative works. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 919–933, New York, NY, USA, 2023. Association for Computing Machinery.
[28] Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. Advances in Neural Information Processing Systems, 36, 2024.
[29] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[30] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024.
[31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[32] Zhiheng Liu, Ruili Feng, Kai Zhu, Yifei Zhang, Kecheng Zheng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones: Concept neurons in diffusion models for customized generation. arXiv preprint arXiv:2303.05125, 2023.
[33] Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones 2: Customizable image synthesis with multiple subjects. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 57500–57519, 2023.
[34] Weihang Mao, Bo Han, and Zihao Wang. Sketchffusion: Sketch-guided image editing with diffusion model. In 2023 IEEE International Conference on Image Processing (ICIP), pages 790–794. IEEE, 2023.
[35] Naoki Matsunaga, Masato Ishii, Akio Hayakawa, Kenji Suzuki, and Takuya Narihira. Fine-grained image editing by pixel-wise guidance using diffusion models. arXiv preprint arXiv:2212.02024, 2022.
[36] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Dragondiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
[37] Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, and Chongxuan Li. The blessing of randomness: Sde beats ode in general diffusion-based image editing. arXiv preprint arXiv:2311.01410, 2023.
[38] Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
[39] Xiaohan Peng, Janin Koch, and Wendy E. Mackay. Designprompt: Using multimodal interaction for design exploration with generative ai. In Proceedings of the 2024 ACM Designing Interactive Systems Conference, pages 804–818, New York, NY, USA, 2024. Association for Computing Machinery.
[40] Aleksey Petrov. Comfyui custom nodes alekpet. https://github.com/AlekPet/ComfyUI_Custom_Nodes_AlekPet, 2024.
[41] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[42] Tiziano Portenier, Qiyang Hu, Attila Szabo, Siavash Arjomand Bigdeli, Paolo Favaro, and Matthias Zwicker. Faceshop: Deep sketch-based face image editing. arXiv preprint arXiv:1804.08972, 2018.
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[44] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[45] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[46] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[47] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
[48] Jaskirat Singh, Jianming Zhang, Qing Liu, Cameron Smith, Zhe Lin, and Liang Zheng. Smartmask: Context aware high-fidelity mask generation for fine-grained object insertion and layout control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6497–6506, 2024.
[49] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[50] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5117–5127, 2021.
[51] Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, et al. Dreamsync: Aligning text-to-image generation with image understanding feedback. In Synthetic Data for Computer Vision Workshop @ CVPR 2024, 2023.
[52] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
[53] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024.
[54] Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700–26709, 2024.
[55] Jue Wang, Yuxiang Lin, Tianshuo Yuan, Zhi-Qi Cheng, Xiaolong Wang, Jiao GH, Wei Chen, and Xiaojiang Peng. Flexedit: Marrying free-shape masks to vllm for flexible image editing. arXiv preprint arXiv:2408.12429, 2024.
[56] Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. Promptcharm: Text-to-image generation through multi-modal prompting and refinement. In Proceedings of the CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2024.
[57] Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing. arXiv preprint arXiv:2407.05600, 2024.
[58] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.
[59] Chufeng Xiao and Hongbo Fu. Customsketching: Sketch concept extraction for sketch-based image synthesis and editing. arXiv preprint arXiv:2402.17624, 2024.
[60] Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. In Forty-first International Conference on Machine Learning, 2024.
[61] Shuai Yang, Zhangyang Wang, Jiaying Liu, and Zongming Guo. Deep plastic surgery: Robust and controllable image editing with human-drawn sketches. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, pages 601–617. Springer, 2020.
[62] Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike, et al. Imagebrush: Learning visual in-context instructions for exemplar-based image manipulation. Advances in Neural Information Processing Systems, 36, 2024.
[63] Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. Idea2img: Iterative self-refinement with gpt-4v(ision) for automatic image design and generation. arXiv preprint arXiv:2310.08541, 2023.
[64] Yu Zeng, Zhe Lin, and Vishal M Patel. Sketchedit: Mask-free local image manipulation with partial sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5951–5961, 2022.
[65] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36, 2024.
[66] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[67] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[68] Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, and Yusuke Iwasawa. Paste, inpaint and harmonize via denoising: Subject-driven image editing with pre-trained diffusion model. arXiv preprint arXiv:2306.07596, 2023.
[69] Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. arXiv preprint arXiv:2312.03594, 2023.
MagicQuill: An Intelligent Interactive Image Editing System
Supplementary Material
A. Implementation Details

A.1. Editing Processor

Our Editing Processor is built upon Stable Diffusion v1.5 [44] and is compatible with all customized fine-tuned weights. We set the control parameters to an inpainting strength of w_I = 1.0 and a control strength of w_C = 0.5, and expand the mask region by 15 pixels during controllable inpainting. The generation process employs the Euler ancestral sampler with the Karras scheduler [24], requiring 20 steps per generation. On standard hardware, generating a 512 × 512 resolution image takes approximately 2 seconds with 15 GB of VRAM. For the control branch, we conduct fine-tuning on the LAION-Aesthetics dataset [46], specifically selecting images with aesthetic scores above 6.5. The training process spans 3 epochs with a learning rate of 5e−6 and a batch size of 8.
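These settings can be summarized as a small configuration, shown below together with the 15-pixel mask expansion implemented as a morphological dilation. OpenCV is used here purely for illustration; any equivalent dilation would do, and the dictionary keys are our own naming rather than an actual API.

```python
import cv2
import numpy as np

EDIT_CONFIG = {
    "base_model": "stable-diffusion-v1.5",   # any community fine-tune also works
    "inpainting_strength": 1.0,              # w_I
    "control_strength": 0.5,                 # w_C
    "sampler": "euler_ancestral",
    "scheduler": "karras",
    "steps": 20,
    "mask_expand_px": 15,
}

def expand_mask(mask: np.ndarray, radius: int = EDIT_CONFIG["mask_expand_px"]) -> np.ndarray:
    """Expand a binary (uint8, 0/255) edit mask by `radius` pixels before inpainting."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * radius + 1, 2 * radius + 1))
    return cv2.dilate(mask, kernel, iterations=1)
```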
A.2. Painting Assistor

We fine-tune a LLaVA-1.5 model with 7B parameters for the Draw&Guess task on our constructed dataset from Sec. 3.2, leveraging LoRA [18]. The LoRA rank and alpha are 64 and 16, respectively. The model is trained for 3 epochs with a learning rate of 2e−5 and a batch size of 8. Under 4-bit quantization, the model achieves real-time prompt inference within 0.3 seconds using only 5 GB of VRAM, enabling efficient on-the-fly prompt generation with satisfactory accuracy.
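A sketch of the corresponding fine-tuning setup is shown below, written with the PEFT library's LoraConfig for concreteness. The dropout value and target modules are illustrative assumptions; only the rank, alpha, epochs, learning rate, batch size, and 4-bit inference quantization come from the settings above.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                 # LoRA rank
    lora_alpha=16,        # LoRA scaling factor
    lora_dropout=0.05,    # assumed; not specified above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

TRAIN_CONFIG = {
    "base_model": "llava-1.5-7b",
    "epochs": 3,
    "learning_rate": 2e-5,
    "batch_size": 8,
    "quantization": "4-bit",   # inference-time quantization for ~0.3 s prompt prediction
}
```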
A.3. Idea Collector

Cross-platform Support. Besides Gradio, MagicQuill can also be integrated into ComfyUI as a custom node, as shown in Fig. 9. It is designed with customizable widgets for parameter settings and an extensible architecture for future platform integrations.

Figure 9. MagicQuill as a custom node in ComfyUI.

Usage Scenario. To demonstrate the user-friendly workflow of MagicQuill, we present an illustrative scenario: a user wants to modify an image of a complete cake, cutting a slice out of it, as shown in Fig. 2. The user begins by uploading the image through the toolbar, which provides access to a range of tools (Fig. 2-B). Using the add brush, the user outlines the slice to be cut directly on the canvas (Fig. 2-D). Meanwhile, the Draw&Guess feature introduced in Sec. 3.2 predicts that the user intends to manipulate a "cake" and suggests the relevant prompt automatically in the prompt area (Fig. 2-A). Afterward, the user switches to the subtract brush to fill in the outlined slice, visually marking the area to be removed from the cake. For additional precision, the eraser tool is available to refine the cut. Once the adjustments are made, the user generates the image by clicking the Run button (Fig. 2-F), which runs the model detailed in Sec. 3.1.

The resulting image appears in the generated image area (Fig. 2-E). Users can confirm changes via the tick icon to update the canvas, or click the cross icon to revert modifications. This workflow enables iterative refinement of edits, providing flexible control throughout the process.

B. Failure Case

B.1. Failure Case of Editing Processor

Scribble-Prompt Trade-Off. We observe quality degradation when user-provided add brush strokes deviate from the semantic content specified in the prompt, a common occurrence among users with limited artistic skills. This creates a fundamental trade-off: strictly following the scribble structure may compromise the generation quality with respect to the text prompt. To address this issue, we propose adjusting the edge control strength.

Figure 10. Illustration of the Scribble-Prompt Trade-Off. Given user-provided brush strokes (a) with the text prompt "man", we show generation results with different edge control strengths: (b) with a strength of 0.6 and (c) with a strength of 0.2.

As demonstrated in Fig. 10, when presented with an oversimplified sketch that substantially deviates from the prompt "man", a high edge strength of 0.6 produces results
that, while faithful to the sketch, appear inharmonious. By reducing the edge strength to 0.2, we achieve notably improved generation quality.

Colorization-Details Trade-Off. We observe a trade-off between colorization accuracy and detail preservation. Since our conditional image inpainting pipeline relies on downsampled color blocks and CNN-extracted edge maps as input, structural details in the edited regions may be compromised during the generation process.

Figure 12. Demonstration of semantic ambiguity in sketch interpretation. (a) User's sketch intended to represent a raspberry; (b) our Draw&Guess model incorrectly interprets the sketch as candy, leading to a misaligned generation; (c) the expected generation result with the correct raspberry interpretation.
Figure 15. The baseline system implemented in ComfyUI.
Figure 16. The questionnaire and user ratings comparing MagicQuill to the baseline system (1=strongly disagree, 5=strongly agree).
Figure 17. A gallery of creative image editing achieved by the participants of the user study using MagicQuill. Each pair shows the
original image and its edited version, demonstrating diverse user-driven modifications.