
Automated Virtual Product Placement and Assessment in Images Using Diffusion Models

Mohammad Mahmudul Alam*
University of Maryland, Baltimore County
Baltimore, MD, USA
[email protected]

Negin Sokhandan
Amazon Web Services (AWS)
Santa Clara, CA, USA
[email protected]

Emmett Goodman
Amazon Web Services (AWS)
Santa Clara, CA, USA
[email protected]

arXiv:2405.01130v1 [cs.CV] 2 May 2024

Abstract

In Virtual Product Placement (VPP) applications, the discrete integration of specific brand products into images or videos has emerged as a challenging yet important task. This paper introduces a novel three-stage fully automated VPP system. In the first stage, a language-guided image segmentation model identifies optimal regions within images for product inpainting. In the second stage, Stable Diffusion (SD), fine-tuned with a few example product images, is used to inpaint the product into the previously identified candidate regions. The final stage introduces an 'Alignment Module', which is designed to effectively sieve out low-quality images. Comprehensive experiments demonstrate that the Alignment Module ensures the presence of the intended product in every generated image and enhances the average quality of images by 35%. The results presented in this paper demonstrate the effectiveness of the proposed VPP system, which holds significant potential for transforming the landscape of virtual advertising and marketing strategies.

[Figure 1. An illustration of the proposed VPP system with an Amazon Echo Dot device. The input background image is shown in (a), and the inpainted output image is shown in (b), where an Amazon Echo Dot device is placed on the kitchen countertop by automatic identification of the optimal location. Panels: (a) Background, (b) Inpainting.]

* The author performed this work as an intern at Amazon Web Services (AWS). Accepted at the 6th AI for Content Creation (AI4CC) workshop at CVPR 2024. (Preprint)

1. Introduction

Virtual Product Placement (VPP) refers to the unobtrusive, digital integration of branded products into visual content, which is often employed as a stealth marketing strategy [15]. Advertising solutions utilizing VPP have significant appeal due to their high customizability, effectiveness across diverse customer bases, and quantifiable efficiency. Previous research underscores the impact of product placement within realms such as virtual reality [22] and video games [5]. With the recent advancements in generative AI technologies, the potential for product placement has been further expanded through the utilization of diffusion models. Significant research has focused on the development of controlled inpainting via diffusion models, albeit largely without an explicit emphasis on advertising applications [1, 8, 11]. However, these methods can be fine-tuned with a small set of 4 to 5 product sample images to generate high-quality advertising visual content.

In this paper, we propose a novel, three-stage, fully automated system that carries out semantic inpainting of products by fine-tuning a pre-trained Stable Diffusion (SD) model [18]. In the first stage, a suitable location is identified for product placement using visual question answering and text-conditioned semantic segmentation. The output of this stage is a binary mask highlighting the identified location. Subsequently, this masked region undergoes inpainting using a fine-tuned SD model.
This SD model is fine-tuned with the DreamBooth [19] approach, utilizing a few sample images of the product along with a unique identifier text prompt. Finally, the quality of the inpainted image is evaluated by a proposed Alignment Module, a discriminative method that measures the image quality, or the alignment of the generated image with human expectations. An illustration of the proposed VPP system is presented in Figure 1 with an Amazon Echo Dot device.

Controlled inpainting of a specific product is a challenging task. For example, the model may fail to inpaint the intended object at all. If a product is indeed introduced through inpainting, the product created may not be realistic and may display distortions of shape, size, or color. Similarly, the background surrounding the inpainted product may be altered in such a way that it either meaningfully obscures key background elements or even completely changes the background image. This becomes especially problematic when the background images contain human elements, as models can transform them into disturbing visuals. As a result, the proposed Alignment Module is designed to address these complications, with its primary focus being on the appearance, quality, and size of the generated product.

To exert control over the size of the generated product, morphological transformations, specifically erosion and dilation, are employed. By adjusting the size of the mask through dilation or erosion, the size of the inpainted product can be effectively increased or decreased. This allows the system to generate a product of an appropriate size.

In summary, the main contributions of this paper are twofold. The first pertains to the design of a fully automated Virtual Product Placement (VPP) system capable of generating high-resolution, customer-quality visual content. The second involves the development of a discriminative method that automatically eliminates subpar images, premised on the content, quality, and size of the product generated.

The remainder of this paper is organized as follows. In section 2 we delve into the related literature, with a specific emphasis on semantic inpainting methods utilizing diffusion models, and section 3 highlights the broad contributions of the paper. Next, the proposed end-to-end pipeline for automatic VPP is discussed in section 4. This includes a detailed examination of the three primary stages of the solution, along with the three sub-modules of the Alignment Module. Thereafter, we elucidate the experimental design and evaluation methodologies adopted and report the corresponding results in section 5. Subsequently, the deployment strategy and web application design are explained in section 6. Finally, the paper concludes with an outline of the identified limitations of our proposed methodology in section 7, complemented by a discussion on potential avenues for future research.

2. Related Works

Recently, there has been significant progress in developing semantic or localized image editing using diffusion models, largely without an explicit focus on digital marketing. Nevertheless, new generative AI approaches promise significant advances in VPP technology. For instance, in Blended Diffusion [1], the authors proposed a method of localized image editing using image masking and natural language. The area of interest is first masked and then modified using a text prompt. The authors employed a pre-trained CLIP model [17] along with pre-trained Denoising Diffusion Probabilistic Models (DDPM) [7] to generate natural images in the area of interest.

Similar to Blended Diffusion, Couairon et al. [3] proposed a method of semantic editing with a mask using a diffusion model. However, instead of taking the mask from the user, the mask is generated automatically. Nevertheless, a text query input from the user is utilized to generate the mask. The difference in noise estimates, as determined by the diffusion model based on the reference text and the query text, is calculated. This difference is then used to infer the mask. The image is noised iteratively during the forward process, and in the reverse Denoising Diffusion Implicit Model (DDIM) [21] steps, the denoised image is interpolated with the same-step output of the forward process using masking.

Paint by Word, proposed by Bau et al. [2], is also similar; however, instead of a diffusion model they utilized a Generative Adversarial Network (GAN) [4] with a mask for semantic editing guided by text. On the other hand, Imagic [8] also performs text-based semantic editing on images using a diffusion model but without using any mask. Their approach consists of three steps. In the beginning, the text embedding for a given image is optimized. Then the generative diffusion model is optimized for the given image with the fixed, optimized text embedding. Finally, the target and optimized embeddings are linearly interpolated to achieve alignment between the input image and the target text. Likewise, a semantic editing method using a pre-trained text-conditioned diffusion model focusing on the mixing of two concepts is proposed by [12]. In this method, a given image is noised for several steps and then denoised with a text condition. During the denoising process, the output of a denoising stage is also linearly interpolated with the output of a forward noise-mixing stage.

Hertz et al. [6] took a different approach to semantic image editing, where text and image embeddings are fused using cross-attention. The cross-attention maps are incorporated with the Imagen diffusion model [20]. However, instead of editing any given image, their approach edits a generated image using a text prompt, which is of limited interest where VPP is concerned.
Alternatively, Stochastic Differential Edit (SDEdit) [16] synthesizes images from stroke paintings and can edit images based on stroke images. For image synthesis, coarse colored strokes are used, and for editing, colored strokes on real images or image patches on target images are used as a guide. It adds Gaussian noise of a specific standard deviation to an image guide and then solves the corresponding Stochastic Differential Equation (SDE) to produce the synthetic or edited image.

To generate images from a prompt in a controlled fashion and to gain more control over the generated image, Li et al. proposed grounded text-to-image generation (GLIGEN) [11]. It feeds the model the embedding of guiding elements such as bounding boxes, key points, or semantic maps. Using the same guiding components, inpainting can be performed in a target image.

DreamBooth [19] fine-tunes a pre-trained diffusion model to expand the dictionary of the model for a specific subject. Given a few examples of the subject, a diffusion model such as Imagen [20] is fine-tuned using random samples generated by the model itself and new subject images by optimizing a reconstruction loss. The new subject images are conditioned using a text prompt with a unique identifier. Fine-tuning a pre-trained diffusion model with a new subject is of great importance in the context of VPP. Therefore, in this paper the DreamBooth approach is utilized to expand the model's dictionary by learning from a few sample images of the product.

3. Contributions

In this paper, a method of automated virtual product placement and assessment in images using diffusion models is designed. Our broad contributions are as follows:
1. We introduce a novel fully automated VPP system that carries out automatic semantic inpainting of the product in the optimal location using language-guided segmentation and fine-tuned stable diffusion models.
2. We propose a cascaded three-stage assessment module named 'Alignment Module' designed to sieve out low-quality images, which ensures the presence of the intended product in every generated output image.
3. Morphological transformations such as dilation and erosion are employed to adjust the size of the mask and thereby increase or decrease the size of the inpainted product, allowing the generation of a product of appropriate size.
4. Experiments are performed to validate the results by blind evaluation of the generated images with and without the Alignment Module, resulting in a 35% improvement in average quality.
5. The inpainted product generated by the proposed system is not only qualitatively more realistic compared to the previous inpainting approach [23] but also shows a superior quantitative CLIP score.

4. Methodology

[Figure 2. The block diagram of the proposed solution for the VPP system, where each of the three stages is distinguished by varied color blocks. In stage 1, a suitable placement for product inpainting is determined by creating a mask using the ViLT and CLIPSeg models: ViLT Visual Question Answering is asked "which object in the image has a flat surface area?", and CLIPSeg semantic segmentation segments the answer (e.g., "desk"). Next, in stage 2, semantic inpainting is performed in the masked area using the fine-tuned DreamBooth model. Finally, stage 3 contains the cascaded sub-modules of the Alignment Module (Content, Quality, and Volume scores) to discard low-quality images.]

4.1. Proposed Method

For semantic inpainting, we utilized the DreamBooth algorithm [19] to fine-tune Stable Diffusion using five representative images of the product and a text prompt with a unique identifier. Even with a limited set of five sample images, the fine-tuned DreamBooth model was capable of generating images of the product integrated with its background. Nevertheless, when inpainting was conducted with this fine-tuned model, the resulting quality of the inpainted product was significantly compromised. To enhance the quality of the product in the inpainted image, we augmented the sample images through random scaling and random cropping, consequently generating a total of 1,000 product images used to fine-tune SD.
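The augmentation step can be reproduced with standard image transforms. The following is a minimal sketch, assuming torchvision is available and the five product photos sit in a local folder; the paths, crop size, and output count are illustrative rather than the authors' exact settings.

```python
import random
from pathlib import Path
from PIL import Image
from torchvision import transforms

# Random scaling followed by random cropping, mirroring the augmentation
# described in section 4.1.
augment = transforms.Compose([
    transforms.Resize(640),
    transforms.RandomAffine(degrees=0, scale=(0.7, 1.3)),  # random scaling
    transforms.RandomCrop(512),                            # random cropping
])

src_dir, dst_dir = Path("product_samples"), Path("augmented_samples")
dst_dir.mkdir(exist_ok=True)
samples = list(src_dir.glob("*.jpg"))  # the five representative product images

for i in range(1000):  # 1,000 augmented product images in total
    img = Image.open(random.choice(samples)).convert("RGB")
    augment(img).save(dst_dir / f"aug_{i:04d}.jpg")
```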
4.2. Product Localization Module

The proposed VPP system operates in three stages. A core challenge in product placement lies in pinpointing a suitable location for the item within the background. In the first stage, this placement is indicated via the generation of a binary mask.
To automate this masking process, we leveraged the capabilities of the Vision and Language Transformer (ViLT) Visual Question Answering (VQA) model [9] in conjunction with the Contrastive Language-Image Pretraining (CLIP) [17]-based semantic segmentation method named CLIPSeg [13]. Notably, each product tends to have a prototypical location for its placement. For example, an optimal location for an Amazon Echo Dot device is atop a flat surface, such as a desk or table. Thus, by posing a straightforward query to the VQA model, such as "Which object in the image has a flat surface area?", we can pinpoint an appropriate location for the product. Subsequently, the identified location's name is provided to the CLIPSeg model, along with the input image, resulting in the generation of a binary mask for the object.
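Both models used in this stage have public checkpoints. The sketch below, assuming the Hugging Face weights dandelin/vilt-b32-finetuned-vqa and CIDAS/clipseg-rd64-refined stand in for the paper's ViLT and CLIPSeg models, shows how the placement query can be turned into a binary mask; the 0.7 threshold mirrors the CLIPSeg default mentioned in section 6, and the file name is a placeholder.

```python
import torch
from PIL import Image
from transformers import (CLIPSegForImageSegmentation, CLIPSegProcessor,
                          ViltForQuestionAnswering, ViltProcessor)

image = Image.open("background.jpg").convert("RGB")
question = "which object in the image has a flat surface area?"

# Stage 1a: ViLT VQA names a candidate placement surface (e.g., "desk").
vqa_proc = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vqa = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vqa_inputs = vqa_proc(image, question, return_tensors="pt")
location = vqa.config.id2label[vqa(**vqa_inputs).logits.argmax(-1).item()]

# Stage 1b: CLIPSeg turns the location name into a heatmap, thresholded to a mask.
seg_proc = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
seg = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")
seg_inputs = seg_proc(text=[location], images=[image], padding=True,
                      return_tensors="pt")
with torch.no_grad():
    heatmap = torch.sigmoid(seg(**seg_inputs).logits)
mask = (heatmap > 0.7).float()  # binary placement mask for the inpainting stage
```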

4.3. Product Inpainting Module

In the second stage, the input image and the generated binary mask are fed to the fine-tuned DreamBooth model to perform inpainting on the masked region. Product inpainting presents several challenges: the product might not manifest in the inpainted region; if it does, its quality could be compromised or distorted, and its size might be disproportionate to the surrounding context. To systematically detect these issues, we introduce the third stage: the Alignment Module.
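In code, this stage reduces to a standard diffusers inpainting call. The sketch below assumes the DreamBooth-fine-tuned weights have been exported as a Stable Diffusion inpainting checkpoint; the checkpoint path and file names are placeholders, and the prompt reuses the unique identifier quoted in section 5.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load the DreamBooth-fine-tuned Stable Diffusion inpainting weights (placeholder path).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "path/to/dreambooth-finetuned-sd-inpaint", torch_dtype=torch.float16
).to("cuda")

background = Image.open("background.jpg").convert("RGB").resize((512, 512))
mask = Image.open("placement_mask.png").convert("L").resize((512, 512))

# The unique identifier ("sks") ties the prompt to the fine-tuned subject.
result = pipe(
    prompt="A photorealistic image of a sks Amazon Alexa device",
    image=background,
    mask_image=mask,
    num_inference_steps=50,
).images[0]
result.save("inpainted.png")
```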

4.4. Product Alignment Module

The Alignment Module comprises three sub-modules: Content, Quality, and Volume. The Content sub-module serves as a binary classifier, determining the presence of the product in the generated image. If the product's probability of existence surpasses a predefined threshold, then the Quality score is calculated for that image. This score evaluates the quality of the inpainted product in relation to the sample images originally used to train the SD model. Finally, if the image's quality score exceeds the set quality threshold, the Volume sub-module assesses the product's size in proportion to the background image. The generated image is accepted and presented to the user only if all three scores within the Product Quality Alignment Module meet their respective thresholds.

Within the Content module, an image captioning model [14] is employed to generate a caption, which is then refined by incorporating the product's name. The super-class name of the product can also be utilized. Both the captions and the inpainted image are fed into the CLIP model to derive a CLIP score. If the modified caption scores above 70%, it is inferred that the product exists in the inpainted image. The Quality module contrasts the mean CLIP image features of the sample images with the CLIP image feature of the generated image. The greater the resemblance of the inpainted product to the sample images, the higher the quality score. A threshold of 70% has been established. The Volume module finally gauges the size of the inpainted product. The generated image is processed through the CLIP model, accompanied by three distinct textual size prompts. Given that size perception can be subjective and varies based on camera proximity, a milder threshold of 34% (slightly above a random guess) has been selected. The comprehensive block diagram of the proposed VPP system is illustrated in Figure 2, with the three stages distinguished by varied color blocks. The block diagrams for each sub-module can be found in Figure 3.

[Figure 3. Block diagram of each of the components of the Alignment Module. The Content sub-module is built using a pre-trained caption generator and CLIP models, shown in (a): a generated caption such as "a small dog sitting on a desk next to a computer" is refined by adding the name of the intended product (e.g., "a small dog sitting on a desk next to a computer with an echo dot"), and a CLIP score decides whether the product exists. For the Quality sub-module, the image features of the same CLIP model are utilized, shown in (b): the cosine similarity between the mean CLIP image feature of the sample images and that of the generated image gives the quality score. Finally, in the Volume sub-module, the same CLIP model with three different size text prompts ("too large {product}", "regular size {product}", "too small {product}") is used to determine the product size, shown in (c).]
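All three scores can be computed with a single CLIP backbone. The following condensed sketch assumes openai/clip-vit-base-patch32 as that backbone, a caption already produced by the captioning model for the generated image, and a product name string; the 70%/70%/34% thresholds follow the values described above, while the exact scoring details are an interpretation of the description rather than the authors' code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def content_score(generated, caption, product):
    # Plain caption vs. caption augmented with the product name; the score is
    # the probability mass CLIP places on the product-bearing caption.
    texts = [caption, f"{caption} with an {product}"]
    inputs = proc(text=texts, images=generated, return_tensors="pt", padding=True)
    return clip(**inputs).logits_per_image.softmax(-1)[0, 1].item()

def quality_score(generated, sample_images):
    # Cosine similarity between the generated image and the mean sample embedding.
    inputs = proc(images=[generated] + sample_images, return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return torch.cosine_similarity(feats[0], feats[1:].mean(0), dim=0).item()

def volume_score(generated, product):
    # CLIP score over three size prompts; the score is the "regular size" probability.
    sizes = [f"too large {product}", f"regular size {product}", f"too small {product}"]
    inputs = proc(text=sizes, images=generated, return_tensors="pt", padding=True)
    return clip(**inputs).logits_per_image.softmax(-1)[0, 1].item()

def accept(generated, caption, sample_images, product,
           t_content=0.7, t_quality=0.7, t_volume=0.34):
    # Cascade: content, then quality, then volume, each against its threshold.
    return (content_score(generated, caption, product) > t_content
            and quality_score(generated, sample_images) > t_quality
            and volume_score(generated, product) > t_volume)
```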
The Volume sub-module provides insights regarding the size of the inpainted product. To modify the product's size, the mask's dimensions must be adjusted. For this task, morphological transformations, including mask erosion and dilation, can be employed on the binary mask. These transformations can either reduce or augment the mask area, allowing the inpainting module to produce a product image of the desired size. The relationship between alterations in the mask area and the size of the inpainted product across various erosion iterations is depicted in Figure 4. Approximately 25 iterations of erosion consume around 3 milliseconds, making the operation highly cost-effective.

[Figure 4. Application of erosion to the mask, where a kernel of size (5 × 5) is used for 0, 10, 20, and 25 iterations, shown in the figure consecutively. The resulting output is presented at the bottom of the corresponding mask to show the size reduction of the generated product in the output image.]
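The erosion illustrated in Figure 4 can be reproduced with OpenCV. A minimal sketch using the (5 × 5) kernel from the caption follows; the file names are placeholders, and 0 iterations simply corresponds to the unmodified mask.

```python
import cv2
import numpy as np

# Load the binary placement mask produced in stage 1.
mask = cv2.imread("placement_mask.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((5, 5), np.uint8)

# Shrinking the mask shrinks the inpainted product; dilating it does the opposite.
for iters in (10, 20, 25):
    eroded = cv2.erode(mask, kernel, iterations=iters)
    cv2.imwrite(f"mask_eroded_{iters}.png", eroded)

enlarged = cv2.dilate(mask, kernel, iterations=10)  # enlarge the product instead
cv2.imwrite("mask_dilated_10.png", enlarged)
```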
5. Experimental Results

Experiments were conducted to evaluate the performance of the proposed VPP system. For these experiments, five sample images of an "Amazon Echo Dot" were chosen. 1,000 augmented images of each product, created from these five sample images, were used to fine-tune the DreamBooth model using the text prompt "A photorealistic image of a sks Amazon Alexa device." The model was fine-tuned for 1,600 steps, employing a learning rate of 5 × 10⁻⁶ and a batch size of 1.

The fine-tuned model can inpaint products into the masked region. However, issues such as lack of product appearance, poor resolution, and disproportionate shape persist. The goal of the proposed Alignment Module is to automatically detect these issues. If identified, the problematic images are discarded, and a new image is generated from different random noise. Only if a generated image meets all of the module's criteria is it presented to the user. Otherwise, a new image generation process is initiated. This loop continues for a maximum of 10 iterations.
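The generate-and-filter loop can be summarized in a short driver. This sketch assumes the inpainting pipeline and the accept(...) helper from the earlier sketches (both hypothetical names) and caps the number of attempts at 10 as described above.

```python
def generate_with_alignment(pipe, background, mask, prompt, caption,
                            sample_images, product, max_attempts=10):
    """Regenerate from fresh random noise until the Alignment Module accepts an image."""
    for attempt in range(1, max_attempts + 1):
        candidate = pipe(prompt=prompt, image=background, mask_image=mask).images[0]
        # In practice the caption is regenerated for each candidate image.
        if accept(candidate, caption, sample_images, product):
            return candidate, attempt      # accepted image and attempt count
    return None, max_attempts              # nothing acceptable within the budget
```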
5.1. Assessing Alignment Module

To assess the effectiveness of the Alignment Module, images were generated both with and without it. For each sub-module, as well as for the overall Alignment Module, 200 images were generated: 100 with the filter activated and 100 without (referred to as the "Naive" case).

To prevent bias, all images were given random names and were consolidated into a single folder. These images were also independently evaluated by a human, whose scores served as the ground truth. This ground truth information was saved in a separate file for the final evaluation, which followed a blindfolded scoring method. All the experiments were also repeated for another product, named "Lupure Vitamin C".

5.2. Evaluation Metrics

The evaluation and scoring method of each of the sub-modules of the Alignment Module is described in the following segments.
• Content Score: For the image content score, images are categorized into two classes: 'success' if the product appears, and 'failure' otherwise. When the content module is utilized, the Failure Rate (FR), defined as the ratio of Failure to Success, is below 10% for both of the products.
• Quality Score: For the quality score, images are rated on a scale from 0 to 10: 0 indicates the absence of a product, and 10 signifies a perfect-looking product. To evaluate in conjunction with the CLIP score, both the Mean Assigned Quality Score (MAQS) and Mean Quality Score (MQS) are calculated. MAQS represents the average score of images labeled between 0 and 10, while MQS is the output from the quality module, essentially reflecting cosine similarity.
• Volume Score: For the volume module, images are also rated on a scale from 0 to 10: 0 for a highly unrealistic size, and 10 for a perfect size representation. When evaluating the volume module, the content module is not utilized. Since the size score necessitates the presence of a product, images without any product are excluded from this evaluation. To gauge performance, the Mean Assigned Size Score (MASS) is calculated in addition to the CLIP score.

5.2.1 Overall Results

The results of individual evaluations are presented in Table 1. It can be observed from this table that using any of the sub-modules consistently produced better outcomes compared to when no filtering was applied, across various metrics. The results of the comprehensive evaluation, encompassing all sub-modules, can be found in Table 2.

Table 1. Individual evaluation of content, quality, and volume sub-modules within the overall Alignment Module. "Naive" represents the outputs without any filtering sub-modules. Content classifies the presence of the product in the generated images. Quality measures the proximity of the generated product to the sample product images used to fine-tune the diffusion model. Finally, Volume identifies the size category of the product.

                        Naive    Content         Naive          Quality              Naive          Volume
Amazon     Success        72       94     CLIP   32.49 ± 3.69   33.80 ± 2.69   CLIP  32.58 ± 3.70   33.42 ± 2.69
Echo Dot   Failure        28        6     MAQS    4.41 ± 3.23    6.41 ± 1.90   MASS   3.01 ± 2.68    4.81 ± 2.31
           FR          38.89%    6.38%    MQS     0.75 ± 0.14    0.83 ± 0.06
Lupure     Success        87      100     CLIP   24.61 ± 2.40   25.23 ± 2.66   CLIP  24.22 ± 3.01   24.51 ± 2.89
Vitamin C  Failure        13        0     MAQS    5.65 ± 2.85    6.47 ± 1.09   MASS   5.64 ± 3.05    7.14 ± 1.53
           FR          14.94%    0.00%    MQS     0.81 ± 0.13    0.86 ± 0.04

Table 2. Comparison of the proposed method with and without using the Alignment Module, in addition to the Paint-By-Example (PBE) [23] inpainting model. The "Naive" performance represents the generated output without applying the Alignment Module. The "Alignment" column represents the generated outputs where the three cascaded filtering sub-modules, i.e., the Alignment Module, are used.

               Amazon Echo Dot                                Lupure Vitamin C
       PBE            Naive           Alignment       PBE            Naive           Alignment
CLIP   31.44 ± 3.43   32.85 ± 3.19    33.85 ± 2.54    27.01 ± 2.10   24.71 ± 2.64    24.89 ± 2.90
MAQS    1.13 ± 1.30    4.65 ± 3.60     6.31 ± 2.39     1.75 ± 1.51    6.60 ± 3.01     7.81 ± 1.13
MASS    1.22 ± 1.60    3.05 ± 2.98     4.70 ± 2.81     2.43 ± 2.07    6.25 ± 3.08     7.30 ± 1.59
MQS     0.64 ± 0.08    0.75 ± 0.14     0.82 ± 0.05     0.67 ± 0.06    0.82 ± 0.12     0.86 ± 0.05
FR     78.57%         29.87%           0.00%          38.89%         17.64%           0.00%

[Figure 5. Inpainted product images from Paint-by-Example (PBE). PBE generates high-quality images, which explains the higher CLIP score in the case of Lupure Vitamin C. However, the inpainted product does not look similar to the desired product at all, resulting in very poor mean assigned quality and size scores. Output images for Amazon Echo Dot are shown in (a) and (b), and for Lupure Vitamin C in (c) and (d).]

[Figure 6. Empirical performance of the Alignment Module for Amazon Echo Dot. Noticeably, no output is generated without any product when the Alignment Module is employed. Moreover, the mean quality score has increased from 4.65 to 6.31.]

5.3. Comparison with Paint-By-Example

The proposed method is compared with the Paint-By-Example (PBE) [23] inpainting model, and Table 2 shows the performance comparison of the proposed method along with PBE. PBE can generate very high-quality images; however, the inpainted product in the generated image does not resemble the desired product at all, as shown in Figure 5, resulting in very poor MAQS and MASS. In contrast, the inpainted product of our proposed method closely resembles the original product, as shown in Figure 7.
5.4. Frequency Distribution

The frequency distribution and density function of the assigned quality scores in the "Naive" and "Alignment" cases for Amazon Echo Dot are presented in Figure 6. The density mean has shifted from 4.65 to 6.31 when the Alignment Module is adopted, indicating the effectiveness of the proposed module.

6. Path to Production

6.1. Product API

The location identifier, fine-tuned model, and Alignment Module are combined to develop an easy-to-use VPP Streamlit web app¹. This app is hosted on Amazon SageMaker using an "ml.p3.2xlarge" instance, which provides a single V100 GPU with 16GB of GPU memory. The demo app's interface is illustrated in Figure 8. In the top-left 'Image' section, users can either upload their own background image or choose from a selection of sample background images to generate an inpainted product image.

The web app provides extensive flexibility for tuning the parameters of the Alignment Module so that users can comprehend the effects of these parameters. In the 'seed' text box, a value can be input to control the system output. The segmentation threshold for CLIPSeg defaults to 0.7, but users can refine this value using a slider. Within the 'Mask Params' section, the number of dilation and erosion iterations can be set and visualized in real time.

The filter, represented by the Alignment Module, can be toggled on or off. The 'Max Attempt' slider determines the number of regeneration attempts if the model doesn't produce a satisfactory output. However, if a seed value is specified, the model will generate the output only once, regardless of the set value. Lastly, in the 'Filter Params' section, users can fine-tune the threshold values for each sub-module of the Alignment Module, specifically for content, quality, and volume.

The "show stats" button beneath the input image displays the mask alongside details of the model outputs. These details include the seed value, placement, generated and modified captions, and the content, quality, and volume/size scores. By visualizing the mask and its area, users can apply erosion or dilation to adjust the product's size. The default threshold values for content, quality, and volume are 0.7, 0.7, and 0.34, respectively. While these values can be adjusted slightly higher, it is recommended to also set 'Max Attempt' to 10 in such cases. A higher threshold means that the generated output is more likely to fail the criteria set by the Alignment Module.

¹ Streamlit: https://streamlit.io/
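A stripped-down version of the controls described above takes only a few lines of Streamlit. This is an illustrative sketch of the interface layout under the stated defaults, not the deployed application.

```python
import streamlit as st

st.title("Virtual Product Placement Demo")

uploaded = st.file_uploader("Image", type=["jpg", "png"])       # background image
seed = st.text_input("seed", value="")                          # optional fixed seed
seg_threshold = st.slider("CLIPSeg threshold", 0.0, 1.0, 0.7)

with st.expander("Mask Params"):
    dilation = st.slider("dilation iterations", 0, 50, 0)
    erosion = st.slider("erosion iterations", 0, 50, 0)

use_filter = st.checkbox("Enable Alignment Module", value=True)
max_attempts = st.slider("Max Attempt", 1, 10, 10)

with st.expander("Filter Params"):
    t_content = st.slider("content threshold", 0.0, 1.0, 0.7)
    t_quality = st.slider("quality threshold", 0.0, 1.0, 0.7)
    t_volume = st.slider("volume threshold", 0.0, 1.0, 0.34)

if st.button("show stats"):
    st.write("mask, placement, captions, and content/quality/volume scores go here")
```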
6.2. Future Considerations for Product Scalability

Fine-tuning Stable Diffusion using DreamBooth can take up to 30 minutes, depending on dataset size, image resolution, and extent of training. When considering a customer with hundreds or thousands of products, this process could take days to complete model training across different products. Our pipeline is deployed on Amazon SageMaker, a managed service that supports the automatic scaling of deployed endpoints. This service can dynamically accommodate large computational needs by provisioning additional instances as required. As such, fine-tuning 100 SD models for 100 different products would still only take about 30 minutes if 100 instances were utilized in parallel.

The fine-tuned models are stored in an Amazon S3 (Simple Storage Service) bucket, with each model being 2.2 GB in size. Consequently, 100 fine-tuned models would occupy approximately 220 GB of storage space. A pertinent question arises: can we strike a space-time trade-off by training a single model with a unique identifier for each product? If this is feasible, the space requirement would be reduced to a consistent 2.2 GB. However, that one model would need more extensive training; specifically, training steps would increase by a factor of 100 for 100 products, thereby lengthening the computation time. This approach remains untested and warrants future exploration [10].

7. Conclusion

In this paper, we present a novel, fully automated, end-to-end pipeline for Virtual Product Placement. The proposed method automatically determines a suitable location for product placement in a background image, performs product inpainting, and finally evaluates image quality to ensure only high-quality images are presented for the downstream task.

Using two different example products, experiments were conducted to evaluate the effectiveness of the proposed pipeline, the performance of the individual sub-modules, and the overarching Alignment Module. Notably, upon employing the Alignment Module, the Failure Rate (FR) plummeted to 0.0% for both investigated products. Additionally, images produced with the Alignment Module achieved superior CLIP, quality, and size scores.

Qualitatively, the produced images present a clean and natural semantic inpainting of the product within the background image. The accompanying web application facilitates pipeline deployment by enabling image generation through a user-friendly interface with extensive image fine-tuning capabilities. The high-quality integration of products into images underscores the potential of the proposed VPP system in the realms of digital marketing and advertising.
[Figure 7. Qualitative results of the proposed VPP system. Experiments are performed using two different products: Amazon Echo Dot (top) and Lupure Vitamin C (bottom). The original training images are shown on the left, and then the pairs of background and inpainted output images are presented side by side. Panel titles: "Amazon Echo Dot Background and Inpainted Images" and "Lupure Vitamin C Background and Inpainted Images".]

[Figure 8. The interface of the VPP web app demo, built using Streamlit and hosted on Amazon SageMaker. The uploaded background image is shown under the title "Input Image", and the inpainted image with an Amazon Echo Dot is shown under the title "Output Image". Moreover, the generated mask produced by the location identifier and the other intermediate details of the proposed VPP system are also presented in the interface.]
References

[1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
[2] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv preprint arXiv:2103.10951, 2021.
[3] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
[4] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
[5] Zachary Glass. The effectiveness of product placement in video games. Journal of Interactive Advertising, 8(1):23–32, 2007.
[6] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[7] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[8] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
[9] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
[10] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
[11] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. GLIGEN: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
[12] Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. MagicMix: Semantic mixing with diffusion models. arXiv preprint arXiv:2210.16056, 2022.
[13] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7086–7096, 2022.
[14] Ziyang Luo, Yadong Xi, Rongsheng Zhang, and Jing Ma. A frustratingly simple approach for end-to-end image captioning. arXiv preprint arXiv:2201.12723, 2022.
[15] John McDonnell and Judy Drennan. Virtual product placement as a new approach to measure effectiveness of placements. Journal of Promotion Management, 16(1-2):25–38, 2010.
[16] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
[17] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[18] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[19] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[20] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[21] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[22] Ye Wang and Huan Chen. The influence of dialogic engagement and prominence on visual product placement in virtual reality videos. Journal of Business Research, 100:493–502, 2019.
[23] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
