A Very Unofficial Midjourney Manual

(CC-BY) shambibble / @shambibble / shambibble.com

Note: The following is an unaffiliated fan-based document for version 5.2 of Midjourney as of fall 2023
(except the addendum, from winter 2024). Prompting is highly indeterminate and while the examples
shown here (except the variation chapter) are from first-roll grids, consider all tips and tricks “nudges”
that improve your odds of getting what you want rather than sure-fire solutions. Expect to tinker, re-roll,
and re-mix whenever chasing a specific result. Please support the official documentation.

🌅 HIGH-LEVEL OVERVIEW 🌅
✏️ PROMPT TWEAKING ✏️
⚙️ SIMPLE PARAMETERS ⚙️
🔢 MULTI-PROMPTS 🔢
📷 IMAGE PROMPTS 📷
🌱 Vs, Zs, and Ps 🌱
HASTY MJ 6 ADDENDUM

I am neither a programmer nor an artist, nor privy to any special insider information, just a guy who
prompts a lot, and therefore this manual should not be read by anyone
🌅 HIGH-LEVEL OVERVIEW 🌅

It would be a rough but not inaccurate analogy to say that the Midjourney AI is like a brain
with two halves: the right-brained, slightly stoned diffuser, which trains on how to shape
coherent and creative subjects out of random noise, and the left-brained clip guide, a
semantic egghead critic that tells the diffuser how it’s doing based on a gigantic matrix of
text-to-image pairs.

Diffuser AIs have been around for a minute; early versions of MJ share lineage with the
now-venerable Disco Diffusion. To put things very non-technically, diffusers train to pick
out signals from noise with a two-step process that you can mimic yourself with two
Photoshop filters: adding noise to an image, so the AI continues to recognize it even under
a bunch of static, and de-noising, or training the AI to remove the static noise and
reconstruct the original image.
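If you'd like to see that two-step process as code, here's a toy numpy sketch of the idea (my own
illustration with a cheating "oracle" denoiser, nothing resembling MJ's actual internals):

```python
# Toy noising/denoising demo, pure numpy, no ML. A real diffuser learns to
# *predict* the noise; here we cheat with an oracle to show the arithmetic.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))                # stand-in for a training image

t = 0.7                                     # noise level: 0 = clean, 1 = static
noise = rng.standard_normal(image.shape)
noisy = (1 - t) * image + t * noise         # the "add noise" Photoshop step

# The "de-noise" step: subtract the predicted noise and rescale. A trained
# model estimates the noise; we just hand it the answer for illustration.
predicted_noise = noise
reconstructed = (noisy - t * predicted_noise) / (1 - t)

print(np.allclose(reconstructed, image))    # True
```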

Think of it like showing the AI several million clouds that you’ve told it are shaped like
certain things, and then asking it to tell you what it thinks an entirely new cloud represents.
If you’re a math person, I recommend this presentation from NVIDIA, which has lots of Greek
letters and long division symbols for your eyes to glaze over. I’m not much for differential
equations myself, but this process can be intuited just watching MJ render prompts you
generate; you’ll see early on most of them start as random blurry smears of pixels and then
slowly cohere into something recognizable.

An illustration of noising/denoising from the above-linked presentation.

This diffusion is why every MJ /imagine starts with a seed value. The concept of seed value
should be familiar if you’ve ever played a roguelike, or Minecraft, or any game with pseudo-
random procedural generation: each seed value is a string of numbers used by the game
engine to generate a particular level layout (or enemy distribution or whatever else is being
randomized). When the seed changes (usually when you restart the game), the level
changes too.
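The mechanics are the same reproducibility trick any seeded pseudo-random generator gives you; a
quick numpy sketch (purely illustrative, since MJ's actual noise generator isn't public):

```python
# Same seed in, same "starting static" out; nearby seeds share nothing.
import numpy as np

def starting_noise(seed, shape=(8, 8)):
    return np.random.default_rng(seed).standard_normal(shape)

print(np.array_equal(starting_noise(1234), starting_noise(1234)))  # True
print(np.array_equal(starting_noise(997), starting_noise(998)))    # False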
On MJ, the seed value is mathematically extrapolated into a big ol’ square of random noise
for the diffuser to shape into something coherent. Every time you enter a prompt, or re-roll
one, you’re also re-rolling a seed value. This is not visible during general use, but if you
react to a grid with the ✉️ emoji, you can get the seed ####, and then repeat your prompt
with --seed #### at the end. Running the exact same prompt with the exact same seed
twice will get you an identical result. But...

Left: a lone ziggurat rises from the windswept sands of arabia --seed 1234
Right: a lone ziggurat rises from the windswept arabian sands --seed 1234

...as you can see above, even the tiniest bit of noise can visibly alter outcomes. A human
would read these two prompts identically, but to MJ, “sands of arabia” vs “arabian sands”
was enough to make the diffuser see the result very differently in terms of style,
composition, shape of the pyramid, etc. And of course, adjacent seed numbers don’t really
correspond to each other at all, because of how pseudo-random the noise function is;
--seed 997 isn’t going to be especially close to --seed 998.

You don’t really need to over-think seeds or even specify them most times; it’s only
necessary in cases where we want to do A/B comparisons and be (more) sure differences
aren’t random. Pick any number (better yet, pick several in advance, sometimes seeds are
lucky), the important thing is to keep it constant for whatever test you’re running at the
time.

Diffusers alone aren’t especially good at direction; they can spot patterns in the noise, but
as the noise starts to overwhelm the image, the denoising results look less and less like the
coherent original images we started with, so fully random noise might lead to fully random
results. The diffuser needs a guide to help it along, and that’s where the CLIP (short for
contrastive language-image pretraining) guide AI comes in.

As the acronym suggests, CLIP can train on any big set of text-to-image pairs. For both file
size and copyright reasons, public datasets don’t actually contain any picture files; instead
they’re mostly pairs of links going to those images on the internet, paired with English-
language descriptions of those images. Stock images, photographs with captions,
genre/style tags, artist metadata, all of these things can be used to train CLIP. By itself, CLIP
works like a mini version of Google Image search that looks at any place on the internet
where images are consistently paired with (English) text descriptions. (Finding relevant web
images from text searches was the original purpose of CLIP).
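The original CLIP weights are public, so you can poke at this matching behavior yourself; a minimal
sketch using the Hugging Face wrappers (the image filename is a placeholder for any picture you
have on hand):

```python
# Score how well each caption matches an image in CLIP's shared
# text/image embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_photo.jpg")  # placeholder: any local image
inputs = processor(
    text=["a ziggurat in the desert", "a dachshund in a sweater"],
    images=image, return_tensors="pt", padding=True,
)
with torch.no_grad():
    out = model(**inputs)
print(out.logits_per_image.softmax(dim=1))  # probability each caption matches
```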

Eventually, someone (Katherine Crowson) got the bright idea to use CLIP as a combination
model and art critic for generating entirely new images rather than retrieving ones from
search. This was originally done as a combo with GANs (generative adversarial networks, as
seen on thispersondoesnotexist.com) but was soon applied to diffusers as well, which
became more popular, and the lion’s share of current image AIs are some descendant of
“CLIP-guided diffusion.”

A searchable archive of LAION-400M, one of the text-image datasets used to train the original CLIP.

It’s important to keep in mind there is very little manual curation of these text-to-image
datasets used to train CLIP. Images go on and off the internet all the time, descriptions get
changed, and they can always be inaccurate to begin with. The databases mostly rely on
sheer size to get over this, but that size also means most of the curation has to be done
with AI, too. The “aesthetic scores” in these datasets are largely assigned by AIs
extrapolating from a much smaller pool of actual human ratings. The fact that CLIP is
ultimately training on billions of haphazard human descriptions has a couple of
implications for prompting that can be worth thinking through.

For instance, one of the most common beginner mistakes on MJ is to add the word
“realistic” or even “photorealistic” to a prompt for something that looks like a photograph.
But “photorealistic” is a painting style, and including that will actually steer the clip guide
away from photographs. Photographers do not describe their work as “photorealistic,” they
instead describe it with stuff like camera brands or focal lengths, and those should be the
kinds of things you add to your prompt if you want something with more photograph-like
depth to it.

Left: 20-year-old mexican woman in a garden, portrait, photorealistic --seed 1821
Right: 20-year-old mexican woman in a garden, portrait, canon 35mm photograph --seed 1821

More seriously, the fact that the text is ultimately scraped from the Internet, in English,
means that it will reflect pretty much any societal bias noticeable enough to show up in
that text. There are billions more Asian and African people on this planet, but since most of
them aren’t tagging photos on Flickr in English, “man” will almost always get you an Anglo-
looking man when you don’t otherwise specify an ethnicity or nationality.

And the exceptions to this can be even more embarrassing; it’s de rigueur for every media
thinkpiece on art AI to note that “nurse” tends to produce women, or “terrorist” tends to
produce swarthy Arab-looking guys (again, think about the kinds of bias you’d get searching
a term on Google Images; CLIP will probably reflect it).
Another consideration is that the AI does poorly with underspecified concepts with wide
visual variety. If you simply prompt a “bird” in MJ, CLIP will be stuck trying to create a
sensible gestalt out of pictures of eagles, robins, parrots, crows, flamingos, and penguins
(and at all different zoom levels, angles, perched or flying, male or female peacocks, etc.)
The same goes for concepts like “uniform” or “car.” Not recipes for realistic output without
lots of detail.

Left: a car --seed 303
Right: a white 1981 pontiac firebird trans-am --seed 303

Notice how the style shifted from artistic to photographic with no explicit prompting other
than all those extra real-life car details, simply because including those in the prompt gets
it thinking about real-life photographs of these cars that it saw while looking at Craigslist or
dealership websites.

What you also might not appreciate at first is that this principle of specificity also applies to
people. Are you trying to prompt someone very famous, who should be in the dataset a
million times, but they end up looking like wish.com versions of themselves? Well, think
about how they’re represented in the database. Bob Marley is probably going to need a
photograph cue to get past all the graffiti/stoner art. Arnold Schwarzenegger has been
famous for so long, you’ll need to specify a year or an age or it’ll do a weird average of his
old face and young skin. Someone famous for decades AND caricatured a lot? They’ll
probably need both.
Left: richard nixon giving peace sign gesture --seed 1974
Right: 61-year-old richard nixon giving peace sign gesture, 1974 file photograph, white house records --seed 1974

A large part of prompt engineering is just taking a step back and trying to think less like
yourself and more like a silly machine that has never once inhabited the physical three-
dimensional world, knows nothing about history or science, and lacks any understanding of
reality beyond the words people use to describe pictures of it. So before we run off and start
throwing in parameters and multi-prompts and God knows what else, let’s take a minute to
sit and talk “plain English” prompting, which can be useful not just with MJ, but DALL-E,
Firefly, StableDiffusion forks, and any other text-to-image service that might show up in the
future (release Imagen you cowards).
✏️PROMPT TWEAKING ✏️

Neural networks are weird. The associations they end up making are often opaque even to
the programmers, and even with tricks like keeping the seed value constant, a lot of this
stuff is more art than science. It’s a shiny black box that will absolutely not tell you the
rules; they’re just things we have to guess at by looking at input and output. Please take
everything in this section with some low-confidence salt.

You don’t actually have to do this most of the time. Please.

Well, except for a few dev-confirmed things: capitalization and punctuation. Since a lot of
the text-image pairs used to train the clip guide didn’t have consistent capitalization, that’s
discarded. Lowercase, caps lock, and normal sentences all get parsed the same way. For
punctuation, there are only three kinds of punctuation that officially matter in the sense
that they alter how MJ parses your prompt: double dashes --, double colons ::, and curly
brackets { }. These represent parameters, multi-prompt breaks, and permutation sets,
respectively, which will be discussed later, so set them aside for now. Any other
punctuation is just treated like the rest of your prompt.

The fundamental unit of a prompt is a “token,” which is sort of equivalent to a word, but not
quite. Tokens are determined by feeding the AI a lot of text and letting it discover what the
most common strings are. Most common words get their own token. Uncommon words get
broken up into multiple sub-tokens. A borderline-common verb might show up often
enough for its stem to be tokenized, but have endings like “-ing” or “-ed” split into separate
tokens. (This is a feature, not a bug, it helps the AI learn English and gives it a shot at
contextualizing new words the same way we do, by root similarities).

One hard rule of prompting is that CLIP maxes out at 77 tokens. If you want to get right up
to the limit, use this NovelAI widget and select CLIP from the dropdown, but if you just want
an easy rule of thumb, treat the maximum as 50 words (to account for long/unusual words
getting split tokens). This limit is one of many reasons why it can be a bad idea to enlist
ChatGPT to write prompts for you.
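If you'd rather count tokens in code than in a widget, the openly published OpenAI CLIP tokenizer
behaves like this (an assumption on my part that MJ's CLIP-derived tokenizer splits text similarly;
the exact model isn't public):

```python
# Counting tokens with the open OpenAI CLIP tokenizer via Hugging Face.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(tok.model_max_length)          # 77, the hard ceiling discussed above

prompt = "a lone ziggurat rises from the windswept sands of arabia"
ids = tok(prompt)["input_ids"]       # includes begin/end marker tokens
print(len(ids))                      # well under the limit

# Rare words split into sub-word pieces rather than one token:
print(tok.tokenize("ziggurat"))
```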

Left: A portrait of a British coal miner, aged around 45, with a tired expression on his dirty face, and short scruffy grey hair. The background of the image is a dark winter’s day. The camera used is a Nikon D780 with a 50mm lens. The composition is centered on the miner's face. The perspective is slightly angled from below to capture the [77 tokens, everything after this was ignored] miner's pride and strength. The lighting is low and warm, with the yellow glow from the match illuminating the miner's face. The depth of field is shallow, with the miner's face and upper body in focus and the background blurred. The photo quality is high, with sharp details and rich colors. This image captures the rugged looks and hard work of a British miner, with a sense of pride and dignity. The camera settings are carefully chosen to emphasize the subject's expressive features and convey the emotional depth of the scene: an aperture of f/2 to create a shallow depth of field, isolating the subject from a soft, muted background; an ISO of 100 for optimal image quality and minimal noise; and a shutter speed of 1/200 sec to capture every nuance of the subject's countenance. The composition is skilfully illuminated by a combination of natural light and a soft box, which together sculpt the man’s face, emphasizing the rugged looks. The warm, golden tones of the light further enhance the sense of pride and contentment radiating from the subject, creating a visually arresting and deeply moving portrait that resonates with viewers on a profound level. The background is kept simple and unobtrusive, allowing the viewer's focus to remain on the subject and his emotive expression. The subdued colors and soft textures of the background complement the subject's features, further contributing to the overall harmony of the composition. --seed 1972

Right: British coal miner, 45 years old, tired expression with short grizzled hair, winter daytime background, medium shot, angled from below, sharp and gritty style, soft lighting, capturing pride and dignity, bokeh blur, Nikon D780 photograph --seed 1972
Notice also that all the noise in the first prompt drowned out important visual details that
were before the cutoff, like the low angle and the soft daytime setting. Language models
are not familiar with MJ, which launched after their training windows, and “visual” thinking
is not natural for them. If you just ask for a prompt “for Midjourney” they tend to assume it
works like they do, writing solid bricks of “instructional” prose that burn the token limit on
needless verbiage like “The image should be X,” when MJ only needs to hear “X.”

You might like the left image better; it’s still a cool image, after all. But at this stage in the AI
game, the measure of a prompt isn’t whether it produces a cool image. MJ is so great at
diffusing style from noise that keysmashes and chaos prompts can produce cool images
too. Skillful prompting is whether it produces the cool image you specified. For this goal, it’s
best to keep your descriptions concise and pointed.

One other quirk of the CLIP token scheme is that punctuation is always chopped on both
sides. Lots of people swear by things like using hyphens or parentheses to link concepts
together, or using exclamation points for emphasis, but there is little systematic proof
these things work. Unless someone has knock-down A/B proof from multiple --seed pairs,
it can provisionally be assumed that any unnatural noise between concepts tends to
separate rather than link them. But if a word naturally comes with a dash, like “close-up,”
keep that punctuation in, as it will help MJ recognize and apply the pattern.

Left: a man eating chicken on a crowded street --seed 369
Right: a man-eating chicken on a crowded street --seed 369

We now exit the realm of objectively verifiable facts and tiptoe into well-grounded
speculation. Once CLIP has converted your prompt into a string of tokens, it then starts
paying “attention” to them. What “attention” means (grossly oversimplified) is that some
combinations of tokens, like “man” plus “dash” plus “eating” above, activate associations in
the neural network to further conceptualize common phrases. This is why a phrase like
“#-year-old,” despite consisting of 5+ separate tokens, reliably produces the patterns of
physical details we associate with age.

That trained attention is also combined with a weaker default attention value for simple
proximity (which is why you can mash pretty much any nonsense with the “____punk” or
“____core” tokens for some extra flavor) and the resulting invisible attention map is MJ’s
holistic hundred-dimensional “idea” of your prompt.

Some tokens don’t mean very much. Articles like “a” or “the” have slight, random effects on
prompts, with some key exceptions where meaning totally changes:

Left: cure --seed 1982
Right: the cure --seed 1982

But for the most part, they are treated like noise, enough to bounce the seed into a
different composition sometimes, but without a discernible, systematic difference. In
general, prompting like a patent claim and trying to refer back to something you
introduced with “the” rarely works. When possible, you should try and put every detail
about an element into one consecutive string of words, because referring to it at separate
points in the prompt will simply jumble it up.
Left: a mask hangs on the stucco wall, the mask is ornately carved from wood --seed 10001
Right: ornate wooden mask hanging on stucco wall --seed 10001

Here, the attempted back-reference crushes everything that comes between it, making MJ
forget “stucco” and ornately carve everything in the picture from wood. But in the second
prompt, all the mask details come before it, and the mask itself is cleanly separated from
the wall with “hanging on,” giving us our desired material assignment.

Two more frequent casualties of the attention mechanisms are short one-word
prepositions and conjunctions. If the only words you use to indicate a relationship are “in”
or “on,” MJ will have a tough time distinguishing between the two.

Left: dachshund in a sweater --seed 2222
Right: dachshund on a sweater --seed 2222


Provisional conclusion: When trying to relate aspects of a prompt, it pays to use
prepositions instead of conjunctions, and pays even more to use verbs instead of
prepositions, such as “wearing X” instead of “in X” or “embroidered on X” instead of “on X.”
You want to avoid linking concepts with multiple noise-sized words (passive voice
especially).

Word order also has a significant effect on prompts. MJ, like any AI that trains on English,
has a significant left-to-right bias. The first thing in your prompt will be the strongest
influence on the resulting composition, and the last thing the weakest. This lets you treat
the order of your words as an informal weighting system.

Left: harsh antarctic landscape, wide-angle shot, hyperrealistic digital painting, vapor contrails, cinematic ambience, bustling futuristic city --seed 90
Right: bustling futuristic city, wide-angle shot, hyperrealistic digital painting, vapor contrails, cinematic ambience, harsh antarctic landscape --seed 90

Here, by putting the “harsh antarctic landscape” in front, and “bustling futuristic city” at the
end, we ensured the central focus of the render was on the landscape, with the city being
consigned to the periphery of the shot (and the word “bustling” more or less omitted
entirely). Switching them around restores at least some semblance of civilization to the
output, while the Antarctic isn’t quite as harsh now with liquid water and some patches of
bare dirt. But some tokens are even stronger than the Antarctic; in fact, some ideas are so
coherent in the database they overpower the prompt even if you put them at the very end.
Left: grainy pixelated security video of carmen sandiego sneaking through french museum at midnight, 240p found footage, stealing priceless artwork --seed 1503
Right: grainy pixelated security video of carmen sandiego sneaking through french museum at midnight, 240p found footage, stealing the mona lisa --seed 1503

The most-reproduced image ever, a 2D painting with identical perspective? Carmen doesn’t
stand a chance, MJ yeets all the other concepts that came before it and rushes straight
towards a parody portrait. All the “carmen sandiego” tokens can do is toss a red fez on. This
example is a little cute, since the left side shows you can simply substitute “priceless
artwork” and get an acceptable image. But things get hairier (literally) when your big boy
token is part of a word.

Left: catfish, nature photograph --seed 500
Right: siluriformes catfish, nature photograph --seed 500
“Cat” will wreck anyone’s day if they want a catfish, or a catwalk, or some cattail grass. To
steer MJ off that singular focus, we “go up a level” of generality: look up catfish on Wikipedia
and find the scientific name for the class, then slap it on the front of the prompt to steer MJ
out of its rut. This is a version of the “humanoid” trick, where people found out quickly that
it was difficult for Midjourney to draw men or women with non-standard skin, such as
green or purple, for sci-fi/fantasy settings, even with “green-skinned” at the front of the
prompt. However, changing “woman” to “humanoid female” made things much easier; by
adding the “up one level” detail, MJ can be “focused out” and its grasp on the idea loosened.

Conversely, there are some concepts that aren’t very coherent, which might need help no
matter where you put them in the prompt. This is where you break out Roget’s Thesaurus.
For instance, if “top-down view” isn’t even working at the front of your prompt, or is so
weak you’re only getting slightly elevated views, you can load up with “overhead planform
top-down view” to try and force a given perspective. Makes sense, right? If a concept is
weak, you want to broaden the reference space as much as possible.

But what’ll really bake your noodle is this synonym stacking can also work to improve the
sampling resolution for verbs.

Left: archer drawing a bow and arrow --seed 1415
Right: archer nocking drawing aiming a bow and arrow --seed 1415

If you aren’t much for flowery language, you can also just emphasize by repeating words,
either together (big boost) or at the beginning and end of the prompt (little boost). Another
neat option for reinforcing stuff, where applicable, is to use an emoji (although they aren’t
as powerful as in previous versions of MJ, so using them on their own won’t get you much).

For those suffering from total prompter’s block, MJ has a /describe feature where it will
return text descriptions of an image you provide it, and spit back four semi-randomized
descriptions which you can then run as standalone prompts. It has been confirmed by the
devs that this is not the full-fat CLIP model returning those suggestions. It’s a small,
separate aesthetics model whose descriptions may or may not be interpreted similarly by
MJ. (I suspect that it’s the same image recognition model powering the “Explore Related”
recommendation section on the gallery website.)

As a result, running one of the four prompts directly as delivered is usually not going to get
you especially close to your original image. Here’s how it did with a faux-Ghibli screenshot I
made by request in prompt-craft, next to the image produced when I hit the first
suggestion.

Left: 1️⃣ a girl wearing an omakase is riding a frog in the forest, in the style of dark orange and light cyan, animated film pioneer, ricoh ff-9d, g. willow wilson, q hayashida, angela barrett, animals and people --ar 3:2
Right: 1️⃣ run by itself

Seasoned prompt vets will spot a few issues right away. “In the style of [two colors]” is a
common construction in /describe that does very little for MJ. Having both “animated film”
and a camera model (“ricoh ff-9d”) in the prompt is going to actively confuse its references.
And mystifyingly, comics writers (like G. Willow Wilson) turn up in /describe even though
prompting them is useless, and will get you their face, or the average of every artist they’ve
worked with (or for Alan Moore, both). Even newbies might notice “girl wearing an
omakase,” a trademark lol-random tokensmash (others include “mommys-on-the-phone-core”
and “no mikado mp4”) that seems more like an in-joke than anything else.
There are things it conspicuously doesn’t do, like attempt any kind of tagging for ethnicity or
body types. Presumably this is to avoid embarrassment (be warned it won’t hesitate to
clock your trans friends) but I’m not sure people running /describe on their photos and
having to add “fat” or “black” because the “generic person” result doesn’t look a damn thing
like them is an improvement from a mood standpoint.

You should think of /describe mostly as a brainstorming assistant rather than a full prompt
generator. I find the best way to use it is to pick and choose a few tokens that I either know
will work, or at least don’t know won’t work, from all four, plus using the original image as
an image prompt (see chapter 5).

The other main prompt-assist tool is /shorten, which purports to tell you which of your
words are the most important by bolding the important ones and crossing out the
unimportant ones. As with /describe, this tool is better than nothing if you’re completely
lost; certainly it will improve most chatGPT brick prompts. But if you’re already writing your
own prompts, then you should take the analysis with a pile of salt.

Left: simple lineart of heavy-duty 18-wheeler truck, vector art, monochrome coloring book on white background --seed 18
Right (first shortened prompt suggestion): simple, heavy-duty 18-wheeler truck, coloring --seed 18

If I were to pick a phrase here doing the least work, it would probably be “heavy-duty,” since
that’s implied from “18-wheeler truck,” so it’s curious that /shorten saw it as so important.
And you certainly don’t need to be a prompt genius to predict what will happen when
“monochrome coloring book” is shortened to “coloring.” It’s also highly inconsistent and
relative importances can change entirely from substituting one synonym for another, even
if you know from separate testing that they have similar effects.

So let’s get back to the big picture. Keeping in mind that prompts are free-form, there are
four elements that you should always consider: subject, background, framing, and style. By
consider them, I don’t necessarily mean include them. “Extreme close-up” framing doesn’t
need a background (prompting one might cause you to zoom out), and “abstract painting”
style doesn’t need framing for a virtual camera. Just think through what you’re doing and
keep the descriptions (and omissions) harmonized instead of conflicting.

As a generic prompt skeleton, I default to [subject] in [background], [framing], [style]
before considering any details or reordering. MJ is sensitive to style cues and will
readily pick up on them even buried 20-30 words deep in a prompt, while the same is not
true of subjects. However, your experience might differ. Just remember that if MJ is
ignoring something important, or getting a relationship wrong, shifting the words around
and adding/removing details should be your first response before trying to fiddle with any
advanced stuff like multi-prompts and numerical weights.

Left: vintage sci-fi robot in rococo palace garden, expressionist charcoal sketch --seed 1927
Right: vintage sci-fi robot in rococo palace garden, dynamic perspective, expressionist charcoal sketch --seed 1927

Like all AI prompting, there’s an element of telling MJ “believe in yourself.” Say you don’t
have any good ideas for what you want the framing to be, or you explicitly want to leave it
up to chance. Here, I added the vaguely encouraging “dynamic perspective” and the result
really improved the subject and the background. The people I made fun of at the start, who
append 70 words of “octane render” and “award-winning” for crispiness? They don’t have
the wrong idea, necessarily. They just wildly overestimate the words needed to say “make it
pop,” probably because they’re copy-pasting a giant katamari ball that people have been
adding to since version 2, most of which isn’t necessary, contradicts itself, and distracts
from your idea rather than improving it.

Finally, a word on noise. You might think that me saying tiny words don’t matter means you
should always strip them from your prompts. It’s up to you, but more often than not, I don’t
bother. If I’m really sweaty and trying to do something difficult, maybe I’ll shift to terse
computer-speak, but even when given prose, MJ has an outstanding capacity to evoke the
vibe of words it clearly doesn’t understand when drilled on their meanings individually. This
is one of the central selling points of the current engine!

I bring you precious contraband, and ancient tales from distant lands;
Of conquerors, and concubines, and conjurers from darker times;
Betrayal and conspiracy, sacrilege and heresy;
But I feel alright, I feel alright tonight
--ar 2:1 --s 500 --w 300 --c 50 --seed 101

Of course I went ahead and opened the seal on all those cool parameters, so let’s talk
about those.
⚙️SIMPLE PARAMETERS ⚙️

Since MJ runs via command line interface, there’s a list of parameters you can put after
your prompt with a double-dash. The ability to fiddle with these dials gives MJ a fun middle
ground for prompting between the complexity of rolling massive crispy prompts for offline
StableDiffusion, and the simplistic pure English prompting of other online services like
DALLE. In order of most to least important:

--ar #:#

Aspect ratio (default 1:1) is arguably the most important parameter in your toolkit,
because MJ is very sensitive to how it influences composition. If you want a head-to-toe
shot of a person, for instance, your prompt will be more effective in a vertical aspect ratio.
For multiple subjects next to each other, you want landscape.

Left: confident middle-aged samoan woman standing on a beach, full shot, fujifilm finepix photograph, golden hour --seed 99
Right: confident middle-aged samoan woman standing on a beach, full shot, fujifilm finepix photograph, golden hour --ar 2:3 --seed 99

One reason aspect ratio is so fundamental to influencing prompts is that it completely
changes the noise pattern generated by the initial seed number. Two prompts, one at
default square and one at --ar 4:3, will therefore not necessarily look much like each
other at all, since the starting noise will be distributed in a totally different way.

The upside of aspect ratio control is that the model is good at letting it influence the layout
and composition. It has an intuition that a vertical aspect ratio is used for selfies, and that
TV/movie screenshots are inspiration for widescreen aspect ratios. It’s not perfect, but
compared to SD models, it’s quite refined and you can do some very cool things with it.

pixel art of platforming level for 2D side-scrolling game, jumping puzzles, desert theme, pixel
art, parallax scrolling, 16-bit retro game --ar 12:1 --seed 1985

Notably, MJ allows extreme aspect ratios. You can go out to somewhere around 14:1
widescreen for some niche uses. (The true max is 4096px on the long sides, or 128:9).
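If you want the arithmetic, here's a two-line sketch (assuming simple proportional scaling and
rounding on my part; only the 4096px cap and the ratios come from the paragraph above):

```python
# Pixel dimensions implied by a 4096px cap on the long side.
def max_dims(ar_w, ar_h, long_side=4096):
    scale = long_side / max(ar_w, ar_h)
    return round(ar_w * scale), round(ar_h * scale)

print(max_dims(128, 9))  # (4096, 288)
print(max_dims(12, 1))   # (4096, 341), the pixel-art strip above
```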

Aspect ratio’s biggest downside is the same as its upside: it can be more difficult to do
something MJ considers counter-intuitive to your aspect ratio. Say you want that Samoan
lady in widescreen, because she’s reclining, or because the camera is zoomed out. You may
need a few of the tricks from chapter 2 in order to get the framing right.

Left: confident middle-aged samoan woman standing on a beach, full shot, fujifilm finepix photograph, golden hour --ar 2:1 --seed 99
Right: wide shot of confident middle-aged samoan woman walking along empty beach, golden hour, fujifilm finepix photograph --ar 2:1 --seed 99

Here you can see the full grids showing the bag of tricks in action. I changed “full shot” to
“wide shot” and moved it to the front, “standing” became “walking” to increase focus on her
lower body, and “empty” was added to the beach to emphasize the background more. Note
that at no point did I say “zoom out” (any mention of “zoom” tends to zoom in) or “showing
the whole body” or anything awkward like that. To rizz MJ outside the aspect ratio comfort
zone, always favor implication rather than direct instruction. The same goes for the
opposite case where you want a vertical aspect ratio but just a headshot (because you want
them looking up at something above them, for instance, like a pretty skybox or a UFO). In
order to avoid MJ drawing a comically long neck, it might take some experimentation with
word order/repetition, or more advanced techniques from later chapters to get it right.

Left: double-bit viking battle-axe, wooden shaft, symmetrical, black background, honed edge --seed 13
Right: double-bit viking battle-axe, wooden shaft, symmetrical, black background, honed edge --ar 1:2 --seed 13

And remember to always leave enough room for your subject. Weapons especially will “just
work” better in a tall/long aspect ratio that fits their handle/barrel.

--s # / --style

Stylize (default 100, range 0-1000) represents the strength of MJ’s “house style.” The higher
the stylize value, the more opinionated MJ will be about your prompt. Maximum stylize will
produce “prettier” images with more coherence; the drawback is that portions of your
prompt are more likely to be ignored. This is the strongest parameter after aspect ratio,
and since it’s the only one on by default, tends to be the most influential on MJ’s overall
“look.” It’s been one of the most confusing and counter-intuitive parameters to use since its
inception, and version 5 is no different.

For starters, --s 0 does not turn stylize off. To do that, you need to use the entirely
separate parameter --style raw. Yes, there are two different parameters named “style”
and “stylize,” so I’ll only refer to stylize as --s for the rest of this section. And they are both
compatible, so the true “absolute zero” house style command is --style raw --s 0. You
could think of this as --s being a big control knob for house effects under default style,
and a finer control knob when used with --style raw.

Given all this, it’s hard for me to understand why they didn’t just scale the command more
broadly and put them on the same gradient; there’s very little difference between
something like --s 650 and --s 660 anyway. But let’s forge ahead, sticking to minimum
and maximum --s for comparison’s sake (more about raw style later).

Left: the maritime conquest of corinth, amphibious landing by men-at-arms, realistic digital illustration --ar 5:4 --s 0 --seed 146
Right: the maritime conquest of corinth, amphibious landing by men-at-arms, realistic digital illustration --ar 5:4 --s 1000 --seed 146

I’ve mentioned “coherence” a lot and this side-by-side hopefully illustrates what I mean by
that. The minimum image is a bit scatterbrained, there’s scale issues with the soldiers, two
of them are standing on water, three-and-a-half of them are on a raft, and it looks like two
different kinds of boats are merging with each other under the Austrian flag. The maximum
image, on the other hand, is simply better. There’s some grit issues that could still use a
few variation rolls of TLC, but it's got cleaner ships and a better sense of scale, with siege
engines being assembled on a clearly Greek shoreline.

With MJ now on native megapixel resolution, high --s kinda does what high --q values
used to in earlier versions, now that it no longer needs 2x render to do it: it makes a
prompt look more “epic,” producing zoomed-out shots with clear separation between
multiple subjects, or foregrounds/backgrounds. So just max it out, right?
Left: architectural sketch of the great zulu temple at ulundi --s 0 --seed 4444
Right: architectural sketch of the great zulu temple at ulundi --s 1000 --seed 4444

Not quite. Remember, the drawback of MJ house style is that, in the course of making your
prompt more epic and stylish, it may ignore parts of it entirely. This is not a long prompt,
and “architectural sketch” is the first two words. But that wasn’t really doing it for MJ, so at
--s 1000 our instructions get junked in favor of a neat little slice-of-life painting of
something more like Frank Lloyd Wright than anything else, while --s 0 faithfully gives us
something more bland and sketch-like.

To better understand when --s may be beneficial or harmful, consider that MJ “house
style” combines two influences: direct developer nudges and periodic user feedback in the
form of website votes, which get incorporated into the algorithm when they train a new
version. These both trade-off against your prompt in different ways.

Since MJ is not open-sourced, developer nudges can only be speculated on, but one has
been consistently talked about and was verifiable in previous versions of MJ: the nudge
towards that “realistic digital painting” style for Corinth. One thing you’ll notice if you use
raw style for a bit (or have experimented with stock StableDiffusion or other “raw” AIs) is
that everything can look like an uncanny photobash. AIs train on digital photography,
cartoon clip art, the old masters, and everything in between. To mitigate this, there’s a
universal nudge towards the “trending on artstation” aesthetic (not the actual preprompt,
I’m sure) which steers photographic tokens towards less realism, and stylized art tokens
towards more realism. Here’s another example of high --s leading MJ astray:
Left: anime keyframe of redhead woman wearing denim jacket, standing pensively in neon city at night, flat color cel animation, 1980s retro artstyle --ar 4:3 --s 0 --seed 8888
Right: anime keyframe of redhead woman wearing denim jacket, standing pensively in neon city at night, flat color cel animation, 1980s retro artstyle --ar 4:3 --s 1000 --seed 8888

I tried to prompt something that could be an actual screenshot from a 1980s anime, but
--s 1000 is cranking “realistic” up way too high to pay attention to any of the anime,
animation, or retro artstyle cues. You can get simple styles out of high --s values, or
detailed styles out of low --s values; I could perhaps fight the above tendency with a few
more cues or repeats. But parameters are free, while every extra token in your prompt
dilutes the others, so avail yourself of high/low values when appropriate. (Happily, --s is
much more compatible with photography prompts than in prior versions, and unless you
intentionally want something plain, can be unreservedly recommended to do exactly what
it says and stylize things more, albeit at the risk of ignoring unstylish tags.)

Left: abandoned ferris wheel, collapsing in the swampy ruins of an amusement park, decaypunk, zeiss 35mm photograph --ar 4:3 --s 0 --seed 8
Right: abandoned ferris wheel, collapsing in the swampy ruins of an amusement park, decaypunk, zeiss 35mm photograph --ar 4:3 --s 1000 --seed 8
Now consider user feedback. The benefits of this are obvious both from an artistic and a
cynical perspective. People tend to upvote good stuff, like faces with two eyes, so high --s
values reduce basic AI boo-boos like the half-soldier in the Corinth example, and MJ
producing popular stuff is great for subscriber retention. However, thanks to Society,
people don’t just upvote good stuff, they upvote the same kind of good stuff, which
definitely contributes to the “stiff” feeling that you can get at high --s levels. Just have a
look at the trending gallery on the website, and see how heavy everything is on steampunk
themes, splashes of paint, and more generally, portraits and photorealism.

Left: a screaming comes across the sky; it has happened before, but there is nothing to compare it to now --ar 14:11 --s 0 --seed 1973
Right: a screaming comes across the sky; it has happened before, but there is nothing to compare it to now --ar 14:11 --s 1000 --seed 1973

Here, a prose prompt courtesy Thomas Pynchon leads to an incredibly evocative drawing
at minimum --s, while the max --s just gives us a typically pretty photorealistic render of
a man screaming. Ironically, the word “stylized,” which tends to be a power word in
prompts for less realism and more exaggerated, dramatic proportions, often works against
the --s parameter’s taste for photorealism. Other low --s use cases to keep in mind
include wanting conventionally unattractive subjects (where you’re fighting the “pretty”
tendency) or plain linework where you don’t even want shading, let alone fancy
photorealism.

And for extremes, you should be unafraid to bust out --style raw, aka v5 launch style.
This was the first MJ version to launch with a “neutral” style. Many people disliked this as
they’d gotten used to the heavy stylization of previous versions enabling short, vague
prompts to still get beautiful results. But now that it’s an optional parameter, I still find
myself reaching for it, for the same reasons as --s 0, to maximize the attention paid to my
details, or for way outside-the-box requests.

Left: primitive mspaint web art of medeival castle --s 0 --seed 1066
Right: primitive mspaint web art of medeival castle --style raw --seed 1066

Here’s a good example of where raw style understands the assignment in a fundamental
way that even minimum stylize can’t reach. The --s 0 result is still nudged enough by the
“illustration” tendency to look more like a basic painting on physical media than it does
mspaint. The raw version still looks a little inauthentic (it’s hard for diffusion models to do
true pixel art, and the word “pixel” is too dominant to add to either prompt without
conjuring a retro game) but much, much closer to what we’re going for, with jagged edges
and solid colors.

Raw style can do everything that normal stylize can do, it’ll just be a bit harder sometimes,
like driving with a stick shift instead of an automatic transmission. If you’re coming from
stock or lightly tuned StableDiffusion, you already prompt like this. You’ll also want it for
some very specific image prompting tricks later on where we want to keep MJ’s influence
minimal. But it does absolutely lack any default “vibes” or ability to evoke mood from brief
prompts like normal stylize. It defaults even more towards photorealism and style cues
need to come sooner in the prompt, possibly repeated more, and maybe even augmented
with multi-prompts (see chapter 4).
In particular, it’s much less receptive to “plain English” prompting. Give it those Steve Earle
lyrics from the end of last chapter, and it’ll read that, think it looks like a fancy quote, and
put garbage text next to a generic pretty face, like it’s attempting a meme.

I bring you precious contraband, and ancient tales from distant lands;
Of conquerors, and concubines, and conjurers from darker times;
Betrayal and conspiracy, sacrilege and heresy;
But I feel alright, I feel alright tonight
--ar 2:1 --style raw --s 500 --w 300 --c 50 --seed 101

Bottom line: Be aware of MJ’s house style and how it’ll be slanting your results. Many of the
examples used in this section are intentionally simplified for illustration purposes, and as
you get more proficient at prompting MJ, you will probably find yourself leaning towards
lower values of --s and even playing with raw style. However, for newer users, the default
style (or even some slight bumps) is a decent starting balance between respecting and
improving your prompt. It’s the default for a reason.

--w #

Weird (default 0, range 0-3000) is the newest addition to the toolkit and provides some
much-needed relief for people who find --s too confining. Weird converts the “push” of
stylize from a positive vector to a negative one, resulting in images that are specifically
directed not to resemble MJ’s average output. According to dev comments, it’s scaled to be
roughly equivalent to corresponding --s values, but the scale goes a little farther out (up
to 3000) because while stylize would get totally monotonous past 1000 (every image would
be a steampunk girl in a cloudy flying city no matter the prompt), very high weird values
leave more flexibility. “Away from stylize’s convergence point” is a more open-ended
direction in hundred-plus dimensional latent space.

Left: sketchy crayon drawing of floating islands seen through fish-eye lens, dreamlike landscape, day-glow colors, rotating orientation --ar 4:3 --seed 10
Right: sketchy crayon drawing of floating islands seen through fish-eye lens, dreamlike landscape, day-glow colors, rotating orientation --ar 4:3 --w 500 --seed 10

The positive contribution of “weird” here is obvious. At default settings, MJ is just too
fanciful to bother with a “sketchy crayon drawing,” giving us more of a concept art
watercolor. But by upping weird to 500 (5x the influence of default stylize), we got a much
more “crayon-like” drawing that veers off the beaten path a little bit and even tries to
capture the “seen through fish-eye lens” portion (which default only hints at with circular
halos.)

The biggest difference between using weird and using stylize is that ultra-high levels of
stylize will ignore your prompt entirely, while ultra-high levels of weird will still attempt to
interpret as much of it as possible, just in a way that may be outrageously ugly. This makes
weird more useful than stylize when dealing with long prompts that might have lots of style
cues and details that MJ may find difficult to focus on or disaggregate.

This also makes weird less useful when dealing with shorter prompts. Remember, since
stylize is also responsible for “cleaner” looking images with sharper details and less grit, if
you bump up the weird parameter without giving it any specific instructions on where to go
other than “away from stylize,” it may end up putting out an image that looks like “default
stylize but fried with simulated JPEG artifacts,” since that’s a valid axis of comparison it
might hit on during diffusion.
Left: houston texas downtown skyline, daytime, aerial view, panoramic photograph --ar 5:2 --s 1000 --seed 713
Right: houston texas downtown skyline, daytime, aerial view, panoramic photograph --ar 5:2 --s 0 --w 3000 --seed 713

This short prompt failed in different ways with max stylize and max weird. Max stylize, as
we’ve come to expect, simply doesn’t find “Houston” (or “daytime”, or “aerial view”) very
compelling and instead gives us a lovely eye-level sunset view of Seattle, complete with two
(?) space needles and clear Pacific waters (take it from a native, Buffalo Bayou is more of a
nondescript industrial brown).

Max weird gives us something much more down-to-earth, gesturing at a few Houston-like
buildings, but you’ll notice the output has a haze and a grit to it, like our drone’s camera
flew through a smokestack and fogged its lens. It’s also less panoramic (overweirding past
1000 is still prone to ignoring prompt terms, especially ones that happen to align well with
stylize, which favors panoramic photos in this aspect ratio). This is therefore an example of
a prompt where you’re better off keeping both weird and stylize on the low side.

However, because stylize has a default value and weird doesn’t, it can often pay to add a
small weird value to apply a brake to MJ’s samey tendencies while still providing some of
the coherence benefits from default or low doses of stylize. Throwing in --w 100 to
balance the default, or a smaller value to balance out a lower level of stylize, is one of those
generally-useful dashes of spice.
Left: a neoexpressionist painting of what depression feels like from the inside, cool palette, wild brush strokes --ar 4:5 --s 75 --seed 20
Right: a neoexpressionist painting of what depression feels like from the inside, cool palette, wild brush strokes --ar 4:5 --s 75 --w 75 --seed 20

Go ahead and tab over to Google image search “neoexpressionism” if you’re unfamiliar
with the term. Once you have, it’ll hopefully be clear how a stylize value of 75, without the benefit of
weird, steered the original image into something a bit too realistic to be called that, even
though it did a fine job with the prompted palette and brushwork. A matching 75 value for
weird, though, distorts everything in a very evocative way that takes it much more towards
the wild, anatomically incorrect renditions associated with this style.

Just as I recommend working in lower-than-default stylize values, I recommend working in a
higher-than-default weird value as you become more familiar with MJ prompting, although
in the main I would default to keeping stylize equal to or slightly higher than weird, unless
purposely seeking “weird” subject matter. (“Weird” is a word that, unlike “stylize,” goes well
with the parameter that bears its name.)

And more generally, they can both synergize even at high levels. When you’re prompting
extremely weird things, the ability of high stylize to pull the image together with clear and
striking details can suddenly become relevant to truly elevating the “weirdness” capability
of --w.
Left: surrealist cinematic still of giant carnivorous plant chomping at camera, sharp teeth flanked by twisted shadowy vines, distorted perspective, botanical garden nightmares --ar 5:3 --w 800 --seed 1031
Right: surrealist cinematic still of giant carnivorous plant chomping at camera, sharp teeth flanked by twisted shadowy vines, distorted perspective, botanical garden nightmares --ar 5:3 --s 600 --w 800 --seed 1031

I don’t hate the first image, it’s good in a very lo-fi, SCP file film kind of way. If you wanted
something that could’ve been a “real” surrealist cinematic still from the 70s or something,
low stylize is the way to go. But for when you want something explicitly crazy and unreal,
that is very clearly “a photorealistic something that could not actually have existed,” high
weird and high stylize play off nicely.

--c #

Chaos (default 0, range 0-100) does more or less what it says on the tin. If --s exerts
influence towards MJ’s house style, and --w exerts influence away from it, --c exerts a
random influence on your prompt. Specifically, it does this by telling the four images of the
grid to diffuse away from each other. (For this reason, all the examples given here will be
full grids; apologies to those of you who aren’t reading at 200% zoom.)

Left: perspective view of gotham city, stylized noir --ar 3:2 --w 75 --seed 1939
Right: perspective view of gotham city, stylized noir --ar 3:2 --w 75 --c 80 --seed 1939
Above, with default stylize and a modest weird buff, the grid mostly agrees on what the
prompt should depict: a dark city lit by the moon. Adding 80% chaos sends all four grid
images careening to Goth Batman or an orange sunset or our Antarctic base from chapter
2. Be warned: using chaos is explicitly telling MJ to disobey your prompt, unlike stylize or
weird where its level of disobedience will depend on your specifics. Chaos is not a good
prompt engineering tool, especially if you already know where you’re going.

But I do still reach for it sometimes when I’m fumbling around without a clue, as diverging
those four results can increase your prompt’s “surface area” and make it quicker to find
that sweet spot in latent space. When chaos was first launched, it was pitched as a tool for
upcoming “big grids” of 3x3 and 4x4 results, but as these proved intractable in Discord
compared to macros, its friend now is the --repeat parameter. Power users on the pro
plan who are using MJ semi-professionally should give it a deep look.

Beyond what is stated in the official documentation, there have been inconsistent
descriptions about the actual chaos mechanism. At one point, it was compared to getting
the diffuser drunk, making it less “sure” about where to go, improvising different concepts
as it staggered around. However, it was also described as throwing those different
concepts into the prompt directly (not as words, but as the abstract matrix numbers that
CLIP trains to correlate with words). This mystery is compounded by chaos having some
abilities other params don’t. Unlike stylize or weird, it can influence pure image prompts
(i.e., the /blend command in chapter 5), and has the ability to diverge remixes (see chapter
6) farther from each other as well as the initial /imagine.

Left: retro 1980s cartoon of red panda with striped tail, standing on grassy hill overlooking vibrant beach sunset, surrounded by tall coconut trees, distant seagulls flying --ar 4:3 --s 0 --seed 12
Right: retro 1980s cartoon of red panda with striped tail, standing on grassy hill overlooking vibrant beach sunset, surrounded by tall coconut trees, distant seagulls flying --ar 4:3 --s 0 --c 90 --seed 12
Another counter-intuitive aspect of it is that super-high levels of chaos don’t really make
your prompt sketchy, no matter how detailed it is, which you’d expect if the diffuser was
having trouble deciding where to go (and you do see sketchy effects with complicated
prompts on ultra-low --q values, or early MJ versions where the diffuser was limited by
render time). The right-side images are perfectly fine looking; even with 90% chaos and
zero stylize, you can tell what they are. They just aren’t what we prompted at all. A creepy
anthropomorphic panda holding a surfboard? Some beachy-themed pixel and vector art? A
purple landscape? Shit’s wack.

Another strange aspect of chaos is that it scales in a very jagged, unpredictable way.

Left: amazon basics exorcism machine --ar 4:5 --s 150 --w 100 --seed 800
Right: amazon basics exorcism machine --ar 4:5 --s 150 --w 100 --c 5 --seed 800

You can see why they normalized it to 100 instead of 1000; small changes just don’t perturb
a prompt much. Running the same prompt and same seed with --c 1 (or even --c 5, as
seen here) is like a subtle variation, as opposed to something like stylize or weird, which are
basically log curves where a similarly proportioned difference between --s 0 and --s 50
will often bounce all four quadrants. If you keep increasing the chaos, they’ll eventually all
“bounce” to a different layout entirely, but with chaos, most prompts seem to have just a
handful of “steps” between null and max chaos where they shift into true alternate
renditions (for instance there’ll be one at 8, another at 21, another at 29, another at 44,
randomly another at 45, etc.)
If you’re a minimalist prompter who enjoys “finding” things in the latent space more than
designing them, chaos is an indispensable tool. It’s a nice, lazy way to give simple prompts
some flair, when you don’t feel like adding more detail.

Left: something bizarre, stylized watercolor marker --ar 5:7 --w 1000 --seed 2
Right: something bizarre, stylized watercolor marker --ar 5:7 --w 1000 --c 100 --seed 2

Before the introduction of weird, chaos also played the role of an alternative/complement
to --s, since the randomness often served to jog stylize out of its visual rut, albeit at the
cost of more prompt disobedience. Now, both parameters can play this role to some
extent, but while they are compatible with each other, high values of --w (since they can
already diverge from each other unpredictably) tend to swamp high values of --c, so you
don’t get as much from adding chaos to already-weird prompts, as seen above where the
max-weird grid looks almost exactly like the max-weird + max-chaos grid. This also
demonstrates how max-chaos can be a very blunt instrument, as the one definite thing we
put in our prompt (the “watercolor marker” style cue) is pretty well ignored by 3/4 of the
grids on max-chaos.

Since it defaults to zero and has a more humble name compared to things like “stylize” and
“quality” which sound obviously good, the advantages of chaos aren’t immediately obvious
to the casual MJ user. This makes it one of the most underrated tools in the kit for taking
advantage of MJ’s surprising creativity. Unlike weird, I wouldn’t necessarily adopt a “default”
chaos value for myself, as it’s now a much more situational tool than either of its siblings.
But if you’re prompting with a more exploratory bent, it’s quite useful to throw on a chaos
value of 10-20 to see what happens, and super-high chaos values can often provoke new
ideas or produce striking images for use with the image prompting and remixing options of
chapters 5-6.

--q #

Quality (default 1, else 0.25 / 0.5) determines the amount of time MJ spends rendering a scene. It will round to one of these fixed settings, so there's no such thing as "quality 0.75"; it will just process as "quality 1." Because quality directly scales render time, lower values drain your credits more slowly on metered mode.

Despite the name, it's a much less important setting than in previous versions; you could
run a lot of prompts at 0.5 quality and not notice any major differences from the default
(and incidentally stretch your GPU time twice as far, I recommend 0.5 quality as a default if
you’re on a plan without relax mode).
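If you do want half quality as a standing default, one way (assuming the Discord bot's /prefer command still behaves as it did when this was written) is to set it as an automatic suffix appended to every prompt:

/prefer suffix --q 0.5

Run /prefer suffix again with no value (or reset from /settings) to clear it.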

Left: photograph of submarine interior, los angeles-class nuclear fast-attack submarine, SSN-688, US navy, 1990s LCD monitors --ar 16:9 --q 0.5 --seed 1990
Right: photograph of submarine interior, los angeles-class nuclear fast-attack submarine, SSN-688, US navy, 1990s LCD monitors --ar 16:9 --seed 1990

The differences between the two images above are pretty subtle. Half-quality didn't have time to render a chair, and the monitors are a little less well-developed, especially the weird droopy circular ones in the back. But those are minor flaws in a very busy render for half the GPU time, and that's before you realize that I'm actually lying and switched the prompt captions; the one labeled --q 0.5 up there is full-quality. Even quarter-quality has clear merits from a cost-benefit standpoint:
Left: victorian gentleman holding rose bouquet, flat-shaded anime on white background --ar 11:14 --seed 1875 --s 0 --q 0.25
Right: victorian gentleman holding rose bouquet, flat-shaded anime on white background --ar 11:14 --seed 1875 --s 0 --q 1

You’d have a hard time calling that quarter-quality image worse than the full one. Sure, the
gripping hand is a little fuzzy, but that’s no worse than the single giant rose bloom, and the
hand in pocket is at least clear, while the full render sees him awkwardly sticking only his
pinky in there. For 4x the GPU efficiency, it’s a pretty good deal.

Low quality values are also utilitarian workhorses; if I want to see if MJ even has a concept
of something, I run a quick one-word prompt at low quality and raw style so I can see if it’s
anything close to what the word means.

In general, you should try lowering quality if prompting stylized illustrations, like cartoons,
or logo design, or anything that calls for clean shapes. It’s also recommended if you prompt
something very simple that doesn’t have much for MJ to “chew on” (and you don’t want the
directed/random improvisation that comes with stylize/chaos).
Left: flat bronze medallion engraved with tiger design, medieval indian mandala style, isolated on black background --q 0.25 --seed 456
Right: flat bronze medallion engraved with tiger design, medieval indian mandala style, isolated on black background --seed 456

Notice how the linework on the medallion border and the tiger is broader in the quarter-quality grid than in the full-quality grid, whose engravings can get super-fine to the point of texturing. This is a good indication of what you can expect from turning quality down. It's also not a coincidence that both of the examples I showed of quarter-quality
turning out decent results had plain white or black backgrounds. Any kind of involved
background (which MJ will usually attempt unless prompted otherwise) requires at least a
half-quality render.

Left: crowd of ancient chinese villagers gathering under the RGB glow of a single anachronistic neon obelisk at midnight during typhoon --ar 14:11 --q 0.5 --seed 4321
Right: crowd of ancient chinese villagers gathering under the RGB glow of a single anachronistic neon obelisk at midnight during typhoon --ar 14:11 --seed 4321
So if half-quality is almost always fine, and quarter-quality still works for a ton of simpler
prompts, when do you actually need full quality? Mostly, you need it for long prompts that
have multiple details about multiple aspects of the image. Here, the diffuser only really had
time for the first half of the prompt, and --q 0.5 simply couldn’t register the “neon
obelisk” and “midnight” portions before it had to wrap things up, so we get a generic tower
and what might be the eye of the typhoon, I guess.

You can also try using low quality settings to fish for a good result and then try to re-run its
seed at full quality, although this is not 100% reliable; often the extra rendering time causes
the noise to bounce in an entirely different way. (I've had some prompts that only worked
on --q 0.5.)

--no and --iw

These parameters require some context and are discussed in chapters 4 & 5, respectively.

--stop #

Stop (default 100, range 10-100) will stop a render at a predetermined percentage. It's there to deal with the problem of the AI going off the rails in the last few percent of a render. It's sort of similar to low quality settings (and also costs proportionally fewer GPU hours) but acts in a different way.

Left: a grimly hyperrealistic portrait of the devil himself --seed 666 --q 0.5
Right: a grimly hyperrealistic portrait of the devil himself --seed 666 --stop 50
Think of it like the difference between counting to 100 by twos, and counting to 50 by ones.
When you prompt half quality (0.5) and default stop, MJ goes from beginning to end while
skipping every other step. When you prompt default quality and half stop (50), MJ does the
first 50 steps, not skipping any, then stops. So while low-quality outputs tend to be a little
sloppy, with sketchy details if you zoom in, low-stop outputs tend to be blurry and lack any
details at all.

There are two main use cases for stop: high stop values, for when you want to use the 80% or 90% image as a final result because it's actually better; or lower stop values (around 50), for when you like a general composition or vibe and want to use it as a base for remixing in another MJ engine or Stable Diffusion.
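A sketch of that second workflow (the prompt and link here are stand-ins; image prompts are covered in the next chapter): render a low-stop base, then feed the blurry result back in as an image prompt to re-develop it:

/imagine misty harbor at dawn, moody oil painting --ar 3:2 --stop 50
/imagine https://s.mj.run/yourblurrybase misty harbor at dawn, moody oil painting --ar 3:2 --iw 1.5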

--tile

Another straightforward parameter, --tile gets you a seamless repeating image at the
edges, which is obviously great for texture work, wallpapers, or backgrounds.
(Unfortunately, it's currently a pure boolean parameter, and there is no way to make only the horizontal or vertical edges tile.)

Left: large irregular ledgestones, anasazi silver, contrasting color range, dry weathered albedo texture --c 5 --tile --seed 989
Right: tiled 4x

You can use --tile with something weird, like a portrait or a horizon, but it’s likely to
either not work, or simply do something lazy like draw a frame around the image so it tiles
as a grid.

--repeat # (and permutations)


If you thought the last two were self-explanatory, guess what this one does! Put a number
after and let it rip. If you have a baller-tier subscription, this is an okay substitute for bigger
grids as long as you’re happy with your prompt (or spending more GPU fishing).

Permutations are even more useful. By using curly brackets {} and (heavy sigh) commas, you can macro up several variations of a prompt. This works on everything in the Discord window, including parameters and image prompts. You can therefore do things like --c {0,2,5,10} to test chaos levels. You can even use more than one set of brackets, but be conscious of the multiplication: every combination gets queued as its own grid. Since the prompt segment you want to permute will likely have commas in it, escape them with a backslash \ so they don't count as permutation breaks (and heavily sigh with me in the suggestions channel).
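To make the multiplication concrete, a hypothetical example (arbitrary subjects and values):

/imagine {cozy, derelict} lighthouse at {dawn, midnight} --ar 4:5 --c {0, 25}

queues 2 × 2 × 2 = 8 separate grids, one per combination. And with escaped commas:

/imagine {red\, rusty, blue\, gleaming} bicycle on white background

queues just two: a "red, rusty bicycle" and a "blue, gleaming bicycle."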

guess i ought to talk about --niji

The parameter --niji will use a parallel version of the model (niji・journey) heavily tuned to anime stylization. Instead of default and raw style, it's default and four styles ("original", "expressive", "cute", "scenic"), none of which are analogous to raw (i.e., they each point in a specific, different direction from default).

Left: 45-year-old somali woman with curly black hair wearing luxurious cyan bathrobe in dark marble bathroom --niji --seed 987
Right: 45-year-old somali woman with curly black hair wearing luxurious cyan bathrobe in dark marble bathroom, sony dslr photograph --niji --seed 987 --s 0

As we've seen, it's far from impossible to get anime out of the normal v5 model, and by the same token, the niji model doesn't need much prodding to produce photorealism. Zero stylize and some cues going the other way are all it takes to flip the models' output.
They operate on the same codebase and function almost identically in terms of practical
use; all the parameters above apply to them in the same way (as will the remainder of the
document), the only difference is that the --s value represents “anime-ness” instead of
photorealism and --w will weird against it accordingly. Which of these you use more is
probably a matter of convenience; if you prompt more anime than you do all other styles
combined, then you might as well use the model tuned for it.

But upstream of style, there are some inherent advantages to using the model tuned on
more stylized, abstract subject matter.

Left: mercenary femme fatale wearing eyepatch --seed 5
Right: mercenary femme fatale wearing eyepatch --niji --style expressive --seed 5

For example, anything involving facial irregularity is a difficult ask for v5. The base model
trained so heavily on photographs it got very bad at drawing “wrong” faces. Prompting a
character with purple eyes is likely to end up with purple eyeliner smeared around blue
eyes, and prompting an eyepatch will give you sunglasses or blindfolds or bandanas all day.
But niji (default and expressive styles especially) has a much looser grip on faces, human bodies, and other concepts v5 is "stiff" about (like overfitting every melee weapon to a sword), since it has to contend with all of the wildly different ways they can be drawn; unless you absolutely need a photograph, it's much easier to direct.

Also, I don’t want to harp on it too much, but niji has (thus far) been spared the heavy
bonks applied over v5’s lifetime that narrowed the PG-13 content policy to something more
like the Hays Code. When the alpha launched, several prompt mainstays (“full body,” for
instance) caused flagrant TOS violations, and in the course of brute forcing the clothes back
on, v5 now applies a hidden “no shirt, no service” negative vector to any prompt seeking
bare-chested men or women in bikinis. It’s not impossible, but I’m not inclined to get into
the nuts and bolts for obvious reasons; just use niji for anything “PG-13 sexy.”

Niji has significant downsides compared to v5, though. My biggest issue is that having five
different styles is great if your prompt definitely fits one, but if it fits more than one, or
you’re just exploring, you end up spending a lot of time fussing with prompts in parallel.
Two of them (default and original) don’t even have useful names, while expressive, cute,
and scenic at least give some kind of indication as to their uses. But forking your workflow
can lead to frustrating GPU burn.

Also, niji styles are tuned very heavily, even more so than v5. There are certain quirks you can just about guarantee will appear at --s 500 and above, and with the exception of the default style, they don't do great when you try to prompt "against type" without affirmatively cranking --s down to 50 or less and adding a healthy --w value. Making a giant table with multiple examples for each style would be a bit much even for me but lol jk here it is:

Left: tsundere astronaut --ar 4:5 --niji --s 500 --seed 2001
Right: stock trading floor, eldritch horror --ar 4:5 --niji --s 500 --seed 2001

Default is, sensibly, the broadest niji setting and the easiest to direct stylistically. In practice it mostly subsumes cute and scenic for me, since you can get similar output with one or two prompt keywords without suffering from the same flaws as those presets. It does tend towards swirling, sparking Epic Shots, but what do you want, it's midjourney; just turn down --s.
Original niji style was, well, kind of a shitshow. It took direction the least well of all the settings, and that hasn't changed, and it has an oppressive default pose of a woman touching her swirly hair strands. Despite this, if you keep it away from female close-ups, there is much to recommend, as it's very intricate and well-suited to detailed shots. Just be prepared to endure some gacha.

Expressive is unique in that it's very opinionated about its style, but that style is very "loose" and sketchy, making it even easier to direct than default with regards to subject. If you're using niji because some concept is too tightly on-model, this is your easy mode. It has distinctive low-contrast color schemes, the plainest background tendencies, and tends to be the most prurient style.

Cute works about like you'd expect it to, and if you're looking to mostly do very simple chibi stuff it could be fine. The problem is that even with healthy weirding, it's inconsistent at doing cute things with not-cute stuff; such prompts tend to end up in some kind of sloppy zero-weight latent space, or worse, start spelling out words it can't render cutely as gibberish text.

Scenic also mostly does what it says, but it shares the foibles of both original's pose (when there's a close-up) and cute's text (when there's a not-scenic concept), plus in any widescreen aspect ratio (where it should shine) it's bar-none the worst letterboxer of the lot. One silver lining: its --s 0 point seems to be the closest to "raw anime" of the five styles, so give it a look occasionally.
--v

Older versions of MJ can be accessed with the version parameter --v. Some of these
versions aren’t compatible with newer parameters, while others were backported, and still
others work but totally differently. For instance, --style used to be free-form like --no,
but then in v4 this was obviated and it was used to archive old default styles, and then
changed again in v5 to introduce the presets. Slightly older versions of v5 are also available
at --v 5a, --v 5, and --v 5.1 (default stylize roughly ascending in power from all-raw
5a to current levels) with identical compatibility.

LEGACY VERSIONS

--v 1: --ar Y | --s N | --w N | --c Y | --q Y (0.25, 0.5, 2, 5) | --tile Y | --no Y | --iw Y | --style Y (freeform)
Comments: Early versions were based on a heavily tuned cc12m_1. Sketchy and incoherent like most '20-'21 era diffusers, but image prompts were a huge help.

--v 2: --ar Y | --s N | --w N | --c Y | --q Y (0.25, 0.5, 2, 5) | --tile Y | --no Y | --iw Y | --style Y (freeform)
Comments: First signs of photorealistic coherence (for some subjects). Text savant; can reliably spell with vrolls. Sometimes, foreshadowing is relatively obvious.

--v 3: --ar Y | --s Y (625–60000; 2500 default) | --w N | --c Y | --q Y (0.25, 0.5, 2, 5) | --tile Y | --no Y | --iw Y | --style Y (freeform)
Comments: Last version on the original code base. This one got POPULAR. Great vibes, stylize is a bit heavy-handed. My v3-era version of this document is archived here.

--test / --testp: --ar Y (limited to 2:3 or 3:2) | --s Y (1250–5000; 2500 default) | --w N | --c N (uses --creative) | --q N | --tile Y | --no N | --iw N (no image prompts) | --style N
Comments: Failed attempt at integrating MJ w/ StableDiffusion. Very few features worked and GPU cost was high. --test is tuned for art and --testp for photography.

--v 4: --ar Y | --s Y (0–1000; 100 default) | --w N | --c Y | --q Y (0.25, 0.5) | --tile N | --no Y | --iw N (only default weight) | --style Y (4a, 4b for past styles, cursed)
Comments: First totally in-house MJ engine. Still the best one to use /blend with, IMO. Aspect ratios are broken in the archived styles (fixed in the final 4c release).
🔢 MULTI-PROMPTS 🔢

Multi-prompts :: are hard separators signified by a double colon, and they break the
prompt into two or more independent targets for the diffuser to match. They are one of
the most misunderstood and ill-used aspects of MJ, and for 90% of use cases (this goes
even for people tryhard enough to read this guide), the result is probably achievable with
carefully worded single prompts.

First, a quick toy example to illustrate what’s going on when you do this:

Left: crab cake --seed 123
Right: crab :: cake --seed 123

When “crab cake” is a single target for the diffuser, we get a delicious breaded seafood
appetizer. When “crab” and “cake” are targeted individually, we get lots of pastry cakes with
crab toppings, or crabs with pastry cake carapaces.

You might be surprised to see that, even though “crab” and “cake” were separated, we
didn’t get a crab and cake illustrated separately except in one grid. Don’t take the term
“hard separator” too literally: it only separates the words, not necessarily the resulting
artistic concepts. The independent diffusers don’t talk to each other, so as long as there is
something identifiable as “crab” and something identifiable as “cake” they will both be
happy with the result.

This fundamental equivocation underlies a lot of casual multiprompt misuse. People hear
that the prompts are “independent” and think that if they want two characters in their
prompt they should put each of them in a multiprompt. But since neither prompt knows
about the other, they have no overarching direction that there are multiple people in the
scene, and this makes the two targets more likely to merge than in a normal prompt. (This
can be advantageous if you’re trying to generate cryptids or furries). So rather than thinking
of multi-prompts as discrete subjects, think of them more like overlays, or emphasis. We
layered the “cake” concept over the “crab” concept and then hit the button.

Start with the easiest and most intuitive way to use multi-prompts: --no. Despite being one
of the most powerful words in plain English, “no” is very hard for MJ to pay attention to for
the same reasons as most short linking words. Unless “no X” is common enough in the
dataset to get a ton of attention (even the reliable-in-old-versions “no makeup” is iffy these
days), for the most part, MJ will not parse negatives. And it’s even more hopeless if you
want to try and negate things via prose like “without.”

The --no parameter is actually a shortcut to placing all of the things after it in their own
multi-prompt, with a negative half-weight (::-0.5), meaning whatever comes after --no
(or before the ::-0.5 in text) is weighted slightly negatively.
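As a toy illustration of that equivalence (the subject is arbitrary), these two prompts should behave identically:

beach picnic on checkered blanket --no ants
beach picnic on checkered blanket :: ants::-0.5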

Left: 30-year-old dominican woman with curly black hair wearing hoodie with hood down, stylized urban illustration --s 50 --seed 30
Right: 30-year-old dominican woman with curly black hair wearing hoodie, stylized urban illustration --no hood --s 50 --seed 30

Here, MJ doesn't understand what we're asking for in the prompt itself. "Hoodie with hood down" is just too comprehensive an ask, even if we try to jog it by prompting hair details. But by cleaving at the concept from both ends, and saying "hoodie" + "--no hood," we can nudge it into showing what we want.

It’s an incredibly useful tool; if your prompt is doing something you don’t want it to,
you should slap it on like Flex Tape™. For instance, one common problem with MJ
prompts is that asking for a “sketch” or a “drawing” will often cause MJ to “break
the fourth wall” and supply a photograph of a sketch or a drawing, complete with stray
implements and canvas/paper background.

Left: soninke goddess wagadu, color pencil sketch --s 25 --w 50 --seed 5005
Right: soninke goddess wagadu, color pencil sketch --no photograph --s 25 --w 50 --seed 5005

Simply adding --no photograph does two salutary things: it knocks the photographic
pencils off of your desk, and also slightly steers the “pencil sketch” further away from
photography (it’s still not very “sketchy”; that usually requires --style raw).

This strategy of “dumb negative prompting” is a lot more powerful than it looks at first
glance, and because we don’t generally think as much about what we aren’t prompting, you
can make a lot of progress on an idea by simply taking what you don’t like about the first
draft and adding it to a --no. Garbage text in your logo? --no text. Unwanted split-
screen effect? --no split-screen. This process sounds stupid but I cannot emphasize
enough that it gets results; I’ve gotten centaurs (notoriously stubborn due to humans and
horses being so coherent) just from two rounds of this. Start with --no photograph;
notice it has a rider, so --no photograph, saddle; now it's a deer, so --no photograph, saddle, antlers, and suddenly you've got bona fide centaurs.

Another area where --no excels is moderating strong concepts in reverse. Since it’s a
shortcut to a half-weight negative prompt, it can do some things single affirmative prompts
can’t. From MJ’s inception, it’s been difficult to get a “heavyset” character that isn’t outright
fat, since the positive words for this (pudgy, chubby) tend to be used as euphemisms and
there’s not much difference between them and “fat” in effect. So if you want a “husky” fella,
I recommend instead adding antonyms to the negative prompt.

Left: chubby guy wearing college sweatshirt drinking pint glass of beer --s 200 --seed 1981
Right: guy wearing college sweatshirt drinking pint glass of beer --no skinny --s 200 --seed 1981

Despite (or due to) its power, --no is not free, and a lot of the downsides are universal to
multi-prompts. One common refrain you’ll hear among people who don’t like using it is that
it tends to wreck the style of the prompt. Remember, in the car example from chapter 1,
we shifted from artstation to photography simply by adding more details to the prompt
that cued it in on different references. The same thing can happen here, except harder to
predict, because "the opposite of x" in MJ's high-dimensional latent space means "the most dissimilar image possible" and not what we humans conceive of as "opposite." The latent-space opposite of "black" almost certainly isn't "white," for instance, because both are used
a lot to describe humans, or monochrome images. (When negative weights briefly worked
with image prompts due to a bug, negative prompting some images of one of my female
characters produced what looked like photographs from a furniture catalog.)

So using --no can jar your artistic prompt back to photography, or vice versa, based on the
unpredictable whims of the machine associating your --no with its stylistic opposite. You
can often correct this by also adding a “style antonym” to --no along with whatever you’re
trying to negate (e.g., add “photograph” if the --no ends up crushing your artistic cues
with photography), so that the stylistic direction of each multiprompt is more in sync.
time --no clock, hourglass --seed 1200

Additionally, there are some distinctions that are just too fine for MJ to make, and if you
have something too close to the same concept in both the positive and negative prompts,
the result can be noisy or nonsensical. (Default stylize likes deer a lot for some reason.)

So let’s keep this in mind and return to regular multiprompts, reiterating that they are
difficult to work with and you should only break them out when you’ve tried like hell to
make it work with pure words and parameters, like I did here:

Left: anthropomorphic rhinoceros --seed 120
Right: anthropomorphic horned male rhinoceros wearing boxing gloves and trunks, standing in MMA gym, full shot, stylized illustration, rhino horn --s 50 --seed 120
“Anthropomorphic rhinoceros” by itself works just fine, 4/4 grids have a horn. But once you
start adding more stuff to the prompt, such as costume, background, framing, and style, MJ
goes 0/4 on drawing the horn, indicating that part of the concept is very weak (rhino
conservation PSAs?). I tried weakening --s, and repeating the concept with “horned male
rhinoceros" and "rhino horn" to no avail. Even niji-expressive wasn't bailing me out. So let's multi-prompt instead, emphasizing "rhino horn" alone rather than repeating it in the main prompt.

Left: anthropomorphic horned male rhinoceros wearing boxing gloves and trunks, standing in MMA gym, full shot, stylized illustration :: rhino horn --seed 120
Right: rhino horn --seed 120

Keep in mind that even without any extra numbers, doing this inherently gives it >50% of
the weight, since we also kept one mention in the original. The horn is (mostly) back, but
the cost is obvious: all the details in the original prompt are half as strong. They’re less
anthropomorphic, less stylized, the gloves and trunks are ignored, the best-stylized one has
no gym setting.

The reasons are clear when you look at the prompt “rhino horn” in isolation. Remember, MJ
is merging the two prompts at equal weight, and “rhino horn” by itself is a heaping helping
of photorealism (of either rhinos or mounted rhino heads) and nothing else in our prompt.
And this is a mild error; a common newbie mistake is to weight this second half of the
prompt up, thinking that “rhino horn::2” translates to something like “double weight.” But it
really translates to more like 70% of the weight (2/3 plus “horned male rhinoceros” in the
original). This independence is also why the multiprompt says “rhino horn” rather than
merely “horn,” otherwise MJ would be attempting to merge our original image with the
concept of a brass instrument or something.

So what do we do? We can't just add the missing stuff to the other multiprompt; that would only recreate the same issue of the other elements distracting from the horn. Instead, we use a light weight on that single concept, keeping the emphasis strictly on the horn without making it so dominant that it starts shoving out the original prompt.

Left: anthropomorphic horned male rhinoceros wearing boxing gloves and trunks, standing in MMA gym, full shot, stylized illustration :: rhino horn::0.5 --seed 120
Right: anthropomorphic male rhinoceros wearing boxing gloves, standing on MMA gym background, stylized illustration :: rhino horn::0.55 --seed 120

By adding "rhino horn" as a half-weight multiprompt, we're starting to get somewhere. We've got two respectable horns in that grid while also getting back a lot of our stylization. From here, there are two kinds of fine-tuning we could try: nudging the multiprompt a little farther up or down (I'd try 0.4 and 0.6), or tweaking our original prompt language to make the costume less shonky. You can see the final version of this image on the right (one v-roll off the original prompt, and sadly the trunks just wouldn't play).

These fractional prompts (which go by a bunch of different names: promptlets, sliders, etc.) are clumsy to type out, but they're the most precise way of working with multiprompts, accomplishing something similar to "weighting" words in the StableDiffusion sense, where you can surround a word with special brackets to make it more or less important to the diffuser. If you want to weight a given concept up 10%, add a single multiprompt, with just that concept, at a weight of ::0.1. Starting with these kinds of marginal changes means the prompt you're working with won't be as significantly disrupted by the addition of a multiprompt as it would be with --no or a full-weight multi.
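As a hypothetical template (the concept being nudged is arbitrary):

cyberpunk street market at night, rainy reflections :: neon signage::0.1

If the nudge is too subtle, step it up to ::0.2 or ::0.3 before ever reaching for a full-weight multi.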

Now let’s apply this thinking to the task of taming --style raw. As discussed earlier, the
cost of raw style’s increased prompt obedience is that it’s less inventive and, well, stylish.
Since you’re probably using raw for long prompts with lots of details anyway, piling on even
more synonyms and messing with word order isn’t that desirable. Instead, the best way to
jazz it up is a dedicated multiprompt with all your style cues plus one or two words
describing the core subject (for overlap), and then scale that multiprompt up from half-
weight to full-weight.

Left: dark elf female with black skin and black hair, glowing eyes, wearing gleaming white armor in mysterious forest --ar 4:5 --seed 86
Right: dark elf humanoid female with black skin and black hair, glowing eyes, wearing gleaming white armor in mysterious forest --ar 4:5 --style raw --seed 86

First, the setup: this side-by-side shows why we’re using raw style in the first place (as well
as the word “humanoid”): because we want actual dark-elf style black skin, as close to #000
as we can get it. Default stylize, sensibly enough, tends to interpret “black skin” as a human
of African descent (it’s also more likely to draw weird headgear instead of elf ears).

Now let’s say I like where the right side is going but want to shift to an anime style. I put this
at the front of my prompt to give it the most attention:
Left: anime-style dark elf humanoid female with black skin and black hair, glowing eyes, wearing gleaming white armor in mysterious forest --ar 4:5 --style raw --seed 86
Right: dark elf humanoid female with black skin and black hair, glowing eyes, wearing gleaming white armor in mysterious forest :: dark elf ranger, flat-shaded anime style, cel animation, OVA screenshot::0.5 --ar 4:5 --style raw --seed 86

There's some movement towards anime, but it's pretty slight, definitely adjacent to the "D&D concept art" part of latent space. So instead, I threw on another half-weight multiprompt with "dark elf ranger" plus a bunch of "make anime pls" style cues. The style effect is much more faithful, even if the ears are still a little fraught (I would probably add --no horns or --no helm to give it another nudge; v5 is weird about elf ears).

Incidentally, this is why I always recommend working with fractional weights. There are a lot of prompt templates (and worse, chatGPT meta-prompt templates) that encourage crazy stuff like weighting parts of your prompt ::12 and ::7. Boosting random parts of your prompt to random degrees makes it very hard to tweak in a controllable way. And since --no is equivalent to ::-0.5, it stops having any real effect at these scales and forces you to do arithmetic to derive a manual negative prompt weight roughly balanced to half your total positive weights. It's a needless pain in the ass when all of the values are normalized on the back end anyway (there's no difference between wordA::16 wordB::4 and wordA:: wordB::0.25).
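To put numbers on that arithmetic (a toy example with arbitrary words): with castle::12 dragon::7, your positive weights total 19, so a tacked-on --no fog contributes fog::-0.5, under 3% of the positive total, instead of the roughly half-strength it carries against an ordinary unweighted prompt. To restore --no's usual proportion you'd have to hand-write something like fog::-9.5. Stay normalized (castle:: dragon::0.58 says the same thing) and --no keeps working as designed.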

This is all perfectly fine if you’re a low-intent “discover” guy and not a high-intent “design”
guy. I love a good gacha prompt as much as anyone, it’s important to do silly noisy prompts
once in a while to keep some humility/perspective about how cool they can look. What I’ve
hoped to disabuse here is the notion (again, helped along by Youtube clickbait) that
multiprompting is somehow more precise, or the “pro” way to prompt that gives you extra
control. More independent prompts always introduce more noise, more unseen
correlations, more randomness that might mess up other parts of your prompt. They
inevitably mean less control.

not even a v-roll!

And many things that used to require them in older versions to even get within sniffing distance of your result now simply don't. You should use them when they're necessary to your concept and you're willing to fight MJ a little harder, not as a default.
📷 IMAGE PROMPTS 📷

An inexplicably-still-unique feature of MJ, image prompts let you use one or more images
as “inspiration.” You simply drop link(s) to image(s) at the start of your prompt, after
/imagine and before the words. You control their collective influence by the image weight
parameter: --iw 0.5 to make them half as strong, --iw 2 (the max) to double it.

Images can be hosted on discord, imgur, or anywhere else that allows linking. The first time
you use them, MJ will abbreviate the link, which can be useful if you want to use more than
3-4 images that you hosted on discord, where links are long enough to exceed maximum
prompt length. (These abbreviated links are only pointers. If the image is deleted from
wherever it’s hosted, the MJ link won’t work.)
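Putting the pieces together, a hypothetical image-prompted /imagine (the links are placeholders) looks like:

/imagine https://s.mj.run/firstref https://s.mj.run/secondref cozy cabin interior, warm lighting --iw 0.75 --ar 3:2

Links first, then words, then parameters; --iw scales the collective pull of all the linked images at once.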

So what does it mean to use an image as "inspiration?" Unlike variation rolls, MJ is not doing an "init" directly off the image. Instead, MJ reads the image with its clip engine, translates it into a bunch of text prompts (in its own internal language, not English), crams all those text prompts into a dense bouillon cube, and tosses it into the diffuser stock-pot.

Quick note for returning enthusiasts: image weight no longer directly trades off with
multiprompt weights. You can, in lieu of using --iw, weigh images individually
(http://s.mj.run/something ::0.8 http://s.mj.run/somethingelse ::0.2) but these weights and
multiprompt weights normalize independently now.

Image prompting is basically MJ letting one side draw what the other side describes to it
over the phone. (Except since we don’t know MJ’s language, it can actually pass on some
details more precisely than us prompting them by text would be!)
Left: jolly dude, party animal wearing hawaiian shirt, stylized cartoon isolated on black background
Right: NASA file photograph (not a Midjourney™ image)

The most basic way to use image prompts is with the /blend command, which does a pure
no-text prompt with two or more images. It’s identical to running an /imagine command
with images + no text, just trades off the ability to specify parameters with the convenience
of uploading directly without futzing with right click menus and URL tags. (I’ll often run a
/blend first to get the short URLs then /imagine based on what I think is needed).
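Sketched out with placeholder links, that two-step habit looks like:

/blend (upload your two images into the pop-up fields)
/imagine https://s.mj.run/shortlinkA https://s.mj.run/shortlinkB whatever text steering you've decided on --iw 1.5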

https://s.mj.run/JsXV1M8iAeM https://s.mj.run/yBJhPTqP4iY (the above two images)

Blending image prompts with no text will give you an average of all the aspects of the
images you submitted. Here, we lost some of the details of the man like the Hawaiian shirt
and lei, I guess because flowers aren’t very compatible with the moon. He’s also more
realistically shaded. But it also applied a slight toon effect to the photographic lunar
surface, and darkened it considerably since the man was isolated on a black background.

https://s.mj.run/JsXV1M8iAeM https://s.mj.run/yBJhPTqP4iY
cartoon of man wearing hawaiian shirt and lei partying on the moon --iw 2 --style raw

This is where the text prompt comes in; we can max out image weight (--iw 2) and use
the text to guide MJ on what elements to emphasize. Now we can keep the man’s shirt and
lei while also brightening up the surface a bit by reminding MJ that it’s the moon.

There are a few rules for images used in a prompt: you aren’t allowed to submit a single
image as a prompt (only multiple images or single image + text). This is enforced by a
duplicate scanner, which is reasonably lenient, so you can easily submit something like two
v-rolls of a single image to do bootleg “single” blends. Less lenient is the NSFW scanning,
which due to Society, tends to hit women particularly hard. Almost anything with enough
flesh tones showing will be rejected, even if it’s just cleavage or legs.

The most practical use of image prompting is for visual mockups. As mentioned previously,
even with all the tools explained so far, there are certain very imaginative concepts that MJ
will be frustratingly bad at taking instruction for. This especially can be the case when
fighting against extremely coherent ideas with strong opinions about how they “should”
look. Take one recent example from prompt-craft: someone asked about generating a
picture of a boxer dog with a camera for a head.
boxer dog with vintage camera head --no eyes, nose, ears --s 0 --ar 3:2

No matter what words or prompt tactics they used, with any combination of parameters,
any remix trickery, MJ would only ever generate a picture of a boxer dog next to a camera,
or at best, wearing a camera around its neck. The idea of a "boxer dog" is just too strongly associated with certain facial features, which MJ feels obliged to put into the image.

Image prompts are a great way to help get concepts like these across the line, and this is
especially true when the image prompt is itself a composite of other MJ outputs. So I had
MJ give me two separate ink outlines in two separate prompts: one for the camera, and one
for the dog.

Left: line art of vintage camera, ink outline on white background, stylized vector
Right: line art of boxer dog, ink outline on white background, stylized vector

Both component images being b/w ink outlines makes them versatile as image prompt
mockups (if you don’t prompt it otherwise, MJ tends to naturally color images anyway) and
as a bonus, easy to combine in Photoshop or Krita or Paint.NET. Just layer the camera
above the head as a “multiply” layer, and then paint out the dog’s head under it.
I want to re-emphasize that this is not init painting. We’re not trying to get MJ to paint over
and fancy up this drawing directly. We’re using this image mockup to communicate
information, specifically that the camera is instead of the dog’s head and not in addition to
it, which will hopefully reinforce our text. As a result, these mockups can be pretty crude,
since MJ will read and clip the image prompt rather than use its pixels, so professional-level
retouching is mostly wasted on it. You can photobash MJ generations, creative-commons
clip art, even MS Paint can help a bit.

https://s.mj.run/sZPJuJdDtI0 boxer dog with vintage camera on neck --ar 3:2 --s 0 --iw 2

Armed with the mockup image, and shoving image weight to maximum, MJ finally picks up
what we’re laying down and gives us a nice surreal camera-dog, complete with convenient,
uh, neck strap. It even splashes some color back on the image since we didn’t prompt for
continued line art.
Let’s get a bit more abstract. What if we want to image prompt style alone, but with new
subject matter? Consider Kuno Veeber, an artist I picked randomly from an MJ reference
spreadsheet because the name sounds funny. Wikipedia informs me that he’s an Estonian
oil painter and lists his styles as “cubism,” “constructivism,” and “expressionism.” He’s a
negative entry in the spreadsheet, indicating that his style is not trained.

Not Midjourney™ images. Obviously.

Above, you’ll see eight paintings of his that I pulled off Wikimedia and Flickr. I picked out
four portraits and four still lifes, with a variety of color schemes and subjects. The idea is
that this variety will cancel out when prompted en masse, leaving the focus on the unifying
style. The only adjustments I made were cropping out frames and edge tearing, since I
don’t want it to trigger drawing frames on the outside.

Now let’s run a quick test (--style raw to minimize any effects from MJ house style) and
confirm the spreadsheet’s claim that his name isn’t strong enough to trigger a distinct style:
woman in enchanted forest, oil painting by kuno veeber --ar 3:2 --style raw --seed 12345

It’s an oil painting, I guess, but it doesn’t look much like a cubist joint from the 1920s. We’re
not getting any of the broad shapes and blank faces in his work. I’m pretty satisfied that
he’s not in the database in any recognizable way. So let’s lock --seed 12345 here to
ensure like-for-like comparisons going forward, then load up those eight paintings as image
prompts at default weight and see what happens.

https://s.mj.run/OwisKksWDNQ https://s.mj.run/6Tryle_jbTs https://s.mj.run/9i47mNeALNc https://s.mj.run/QwhQZ7pR0tU https://s.mj.run/Y-HDCxH30MA https://s.mj.run/f45no3NX6Z8 https://s.mj.run/pPH3uptn6C0 https://s.mj.run/5DO_Huhk710 woman in enchanted forest, oil painting by kuno veeber --ar 3:2 --style raw --seed 12345
Even at default --iw, you can see an immediate difference with the prompts added. The
shapes are broader, the human figure is looser and less defined, and MJ has even more of
an oil painting vibe than it did before. This is definitely closer than with just the name. And
unlike the mockup prompting above, we’ve done nothing via text or increased --iw to help
it along.

This is intentional; for style transfers, we want minimalist prompting because too many
genre tags will “generify” the results along those lines, and high image weights here will
start crowding out your subjects with elements from the original images. Let me
demonstrate this with two less-successful variations on the idea:

Left: [the same 8 refs] woman in enchanted forest, oil painting by kuno veeber --ar 3:2 --style raw --iw 2 --seed 12345
Right: [the same 8 refs] woman in enchanted forest, cubist oil painting by kuno veeber, 1920s constructivism, expressionist art --ar 3:2 --style raw --seed 12345

Doubling --iw to 2 starts crowding out “forest” by bringing in elements from the reference
images, like multiple subjects and the block of houses, and doesn’t really yield a
corresponding improvement in style. Meanwhile, the genre tags make the shapes far more
angular and precise than they were in the original set of images. I suspect this is because
Picasso is so popular and repeated in the dataset that the term “cubist” gets overtrained by
his examples. Default with minimal text reinforcement is the way to go here.
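Boiled down to a template (everything in brackets is a placeholder): all your reference links at default --iw, your new subject, and the lightest possible medium tag:

/imagine [reference URLs] [new subject], oil painting by [artist name] --style raw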
[the same 8 refs]
epic sci-fi space battle, oil painting by kuno veeber --ar 3:2 --style raw --seed 12345

And here I’ve stretched the original concept a little more by prompting subject matter
completely outside anything Veeber ever painted; the style comes through. I know “what if
one guy’s stuff was anachronistic” is a basic way to use AI. But look at that fake brushwork!

Now let’s check in with the original image prompting killer app: recurring characters. There
are several ways to do this: you can generate a character entirely from scratch in MJ, or you
can try to model the character on an existing photograph (or photographs). Before we go
through the first workflow, a quick discursion to set expectations on the second:
Midjourney is not an Instagram filter, nor is it an image editor. There’s no way to tell it “this,
but only change one specific thing.”

This especially holds true if you prompt a self-portrait. People often see others making
seemingly flawless self-reproductions in MJ, but prompt themselves and come away
dissatisfied. While there are some qualities that MJ finds easier to reproduce than others,
for the most part, this is a psychological effect. Precisely because you see your own face in
the mirror every morning, you’re more apt to pick up on MJ’s subtle distortions of your own
self-image, compared to a stranger’s selfie and corresponding fake MJ selfie.

To illustrate the current theoretical limits of MJ’s facial reproduction, I took a random face
from thispersondoesnotexist, and tweaked a copy of the same image just enough to cheat
the de-dupe filter (sorry, trade secrets). But even this doesn’t give us the exact same guy.
Left: a fake guy
Right: /blend of this ref, twice

MJ smooths out the face and applies a virtual hue slider to the pinkish skin (MJ likes skin on
a beige-brown spectrum that can look like foundation more than anything else). You’ll also
notice that the background has flipped from horizontal to vertical, and the clothing got
slightly altered; the collar around his neck is flared instead of flat, and lined in bright green
instead of yellow. While we can get the faces similar, costume and setting, even when
prompted, will have enough continuity issues to distract a reader. So what is it good for?

https://s.mj.run/LziAdzXIZA0 40-year-old man, stylized pencil sketch caricature on white background --iw 0.75 --style raw --no photograph, photorealistic, monochrome
Rather than thinking of MJ as a filter for existing references, it helps to instead think of it as
an actual artist riffing on your character. The more slack you cut it stylistically, the happier
you’ll be with the result. Remember from chapter 2 to avoid direct instruction and reinforce
your image, rather than reference it directly. Stuff like “this but in pencil” or “the same
person, do not alter” will be confusing noise, and we especially want to minimize that here.
Just a brief description (I find age works well to ensure things aren’t smoothed too much),
your positive and negative style tags, and a lower weight of --iw 0.75 is enough to
produce a very evocative caricature.

But can you make noise work for you? Let’s investigate by making a character from scratch.

Left: hasmik sargsyan, 38-year-old armenian woman, greying black hair, close-up photograph on neutral portrait background --seed 2023
Right: hasmik sargsyan, 38-year-old armenian woman, greying black hair, close-up photograph on neutral portrait background --no makeup --style raw --s 0 --w 5 --seed 2023

First, a brief exhortation to realism and intentionality: the left image is prettier, but the right
image is better. Default stylize converges portraits towards a uniform ideal that’s reinforced
each time you use that image, doubly so if you start with it. Your character will have no
character. Many Youtubers have “tutorials” for creating OCs where they prompt things like
“beautiful young woman” on default stylize. Spoiler: virtually any method works with this,
because they all look the goddamn same! Here, I’ve got absolute zero stylize (--style raw
--s 0), plus --no makeup, plus a dash of weird (chaos would also work here), all to break
out of the samey, ten-years-younger-than-prompted model look. If your character is
beautiful, you should still prompt that rather than use default stylize. Then at least the
adjectives can be influenced by your other details instead of a generic yassify filter.
Anyway, you can see the noise I brought to the generation: A fake name, which I got from
the top 20 lists for Armenia, so this is basically calling her “Jane Thompson.” I did a quick
Google Image search, just in case there was some Armenian celebrity with this name, but
the top row results were just different-looking LinkedIn and Facebook profiles, so no one
source should be dominating here. Then you’ve got the A/S/L and hair, plus I’ve also
specified a --seed, although I’ll demonstrate shortly how this is not all that useful.

Finally, the frame/background cues are important to minimize details, like setting or
costume, that MJ can’t consistently get right anyway. (I might crop the right result further if
the black t-shirt gets in the way of prompting other clothing.) We want just the head and
any above-neck accessories like glasses or piercings, nothing else. Now we have a base
image prompt for Hasmik and can include it with subsequent prompts to re-invoke her
character. Armed with this face, let’s illustrate the powers and limitations of --iw.

Left: https://s.mj.run/KGeXTcwehqA hasmik sargsyan --style raw --s 0 --iw 2
Right: https://s.mj.run/KGeXTcwehqA hasmik sargsyan, short ponytail, angry furious enraged --style raw --s 0 --iw 2

On the left you can see the image successfully replacing almost every detail from the
original prompt. Just the name and the image was enough to get us about the same
theoretical maximum consistency /blend did, with no A/S/L, no seed, nothing else required.
Then on the right, we try to do something with it... and things don’t go so well. We tried
prompting a different hairstyle and facial expression, but the ponytail is messy and the
“angry furious enraged” emotion is just a mildly sour (and distinctly more youthful) face.
The problem is that image weight applies to everything. Not just your character’s features,
but also the zoom, the style, the costume, and even the expression on their face. If you
want to change even one of those things, you should expect to shift back down to at least
--iw 1.5, and changing more than one will probably require lower-than-default. This is
the value of “character noise” such as the name, A/S/L, etc. If you want to take the image
somewhere, you’ll need to lower the weight, and this means that, like our moon party
animal from earlier, text can help steer the interpretation of the image prompt to focus it
on what’s consistent rather than what changes.

Left: https://s.mj.run/KGeXTcwehqA angry 38-year-old armenian woman, short ponytail, close-up photograph, furious scowl --no makeup --style raw --s 0 --iw 1.5
Right: https://s.mj.run/KGeXTcwehqA hasmik sargsyan, 38-year-old armenian pianist, wearing slinky red dress, playing grand piano, concert photograph, wide shot --style raw --s 0 --iw 0.75

Here, --iw 1.5 plus restoring some of our details helped the angry expression translate
more; it’s still a little subtle, but fine for a photograph. More stylized, exaggerated
expressions may need lower weights. And we can go down to --iw 0.75 (which remains
my “all-rounder” starting weight; normal default is just a bit too strong for most things) for
changing multiple aspects: costume, zoom, and posing for an action shot.

If you remember image prompts being more flexible on older versions of MJ, you’re not
wrong, because those older versions defaulted to a weight of 0.25 (in addition to
processing at a much lower resolution). You may need to go back to that for big transitions
like going from close-up to a true head-to-toe shot, or from photographs to stylized art.
Left: https://s.mj.run/KGeXTcwehqA smiling hasmik sargsyan, 38-year-old armenian woman, wearing slinky red cocktail dress and pumps, standing on concert stage, spot lighting, sony a7 photograph, grand piano in background --ar 9:16 --style raw --s 0 --iw 0.75
Right: the same prompt at --iw 0.25

This lighter weight has other advantages; her smile is more pronounced in the second
image, and with the --iw reduced to a strong hint, it could invent interesting compositions
like her pose in front of the piano. Don't fret about the face going a little more off-model as you zoom out; MJ is inherently less accurate at a distance, simply because there are fewer pixels to diffuse. (We can buff the face back on-model next chapter.)

We’ve generated a lot of fake photographs, but what about translating them to different
styles? These transitions trade off a bit more harshly. I would start at default or less,
especially from photographic sources; there’s a funneling effect where the transition from
photo to a realistic painting, which you’d think would tolerate a higher --iw than photos to
rough sketches, actually needs lower values to shake the photography. (This isn’t
necessarily true going from non-photographic sources to photographs.)
Left: https://s.mj.run/KGeXTcwehqA stylized caricature of 38-year-old armenian woman, color pencil sketch --no photograph, photorealistic --style raw --s 25 --w 25 --q 0.5 --iw 0.5
Right: https://s.mj.run/KGeXTcwehqA anime-style portrait of hasmik sargsyan, 38-year-old armenian woman, greying black hair and brown eyes, simple background --niji --s 300 --w 50 --iw 0.5

Left: https://s.mj.run/YTucGtLq-Jo oil-on-canvas portrait of 38-year-old armenian woman, greying black hair, thick brush strokes, stylized painting --iw 0.75 --s 0 --q 0.5
Right: https://s.mj.run/Z75zOBSrn-o stylized lineart of 38-year-old armenian woman, greying dark hair, monochrome ink outline on white background --s 0 --no color

A couple of things to note here: you can be much less afraid of stylize when going to non-photographic styles. I still use raw style for the pencil sketches, because I think it withstands the negative prompts better, but both of the bottom two are done in standard mode (still --s 0 to keep the age right), and for the anime portrait I even told niji to bump things up (otherwise your character will end up in an SD-esque uncanny valley of photorealistic anime). Eagle-eyed viewers will note that for the bottom two styles, we used a variant of the mockup strategy from earlier by "pre-treating" the image prompts with a simple Photoshop filter (angled brush for the painting, stamp for the outline). This is a great strategy for style transitions because it enables higher image weights that stay truer to the source image, since we already did the "hard work" of moving MJ off raw photography. It's also why it remains a good idea to keep a "bank" of portraits for your characters in various styles; that way you can start with whatever base image is closest to your destination.

Combining mockups and characters is one of the only feasible ways of surmounting the classic AI problem of getting two different-looking characters into the same scene. Prompt each of them in profile view in isolation, then do a background remove and paste them over a suitable one. As with the camera dog, these intermediate composites can be laughably crude; they just need to give our language a little boost.

https://s.mj.run/hPitFdUD9Qs middle-aged puerto rican man with flattop facing middle-aged armenian woman with greying black hair, cheerful desert encounter, wide shot, profile view, stylized matte painting --ar 3:2 --iw 0.75
(ayyy we made it!)

Puppeteering is real now! It just takes a lot of hard work behind the magic. (Or at least it did
until the next chapter. 🦥)
🌱 Vs, Zs, and Ps 🌱

Except for the mockup trick, everything up until now has been pretty much one-shot
prompts. But there’s of course a fourth dimension to prompting: after your first grid you
have a whole suite of buttons and prompt chaining techniques available. (First things first: make sure remix mode is on. It defaults to giving you the opportunity to alter your prompt, which you can simply decline if you just want a straight variation, and that one extra click is way easier than changing modes.)
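If yours is off, it can be flipped in the /settings menu or, assuming the shortcut still works as of this writing, toggled directly:

/prefer remix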

First you have the classic AI button, the variation, which re-initializes the prompt four times
on the grid image you selected, using it as a base and adding a second layer of random
noise for the engine to re-diffuse. This is different from image prompting and more similar
to img2img. MJ, in classic fashion, doesn’t give you the noise slider directly, just nuanced
buttons: “subtle” variation for low noise, “strong” variation for high noise, and “regional”
variation for targeted noise. Let’s start with subtle variations:

RGB gamer keyboard, overhead product photograph, isolated on black background, ANSI 104-key QWERTY layout --ar 3:1
(12 variations deep)

Subtle variations are great for developing small things, like correcting hands or hoping a
piece of clothing renders a bit more sensibly. They’re also the cheapest and quickest-
rendering option, taking almost as little GPU as a quarter-quality prompt. Not a bad deal
given how many MJ prompts have results that are "almost perfect, but X." 12 variations cost the equivalent of about 3 new grids, but no way would re-rolling the prompt 3 more times have gotten us a result as good as the one refined through subtle variations. We have the right amount of
keys, they’re all shaped correctly, and almost all in the right spot (the function row
stubbornly refused to group properly). Even the letter caps are correct, although things get
sketchy on the outer keys and numpad, and the spacebar LED broke.

Left: happy eagle holding soccer ball, stylized mascot lineart --style raw --no photograph --q 0.5
Right: 1 subtle variation

However, since subtle variations keep the depth map of your original prompt intact, there
are some things they will struggle to change. Errors that take up a significant chunk of the
image (that would involve redrawing more than about 5% of the total area) will often
persist even through several subtle variations. There’s a lot of stuff I don’t like about the
first image up there. The eagle’s face looks doofy. He has talon wings. The soccer ball is
blue. The collar is weird. None of these things are fixed consistently by a subtle variation.

Left: 1 strong variation
Right: U2
A strong variation, however, clears all of those right up. Now he’s got a better expression,
wings for hands, a b/w soccer ball, and that weird torn collar has been phased into a
proper shirt. You can think of strong variations as a kind of super-image-prompt, or a
“scramble” button for the components of your image. The content, style, and primary color
scheme of the image will stick around, but their locations, angles, orientations, and even
some secondary colors are free to shift around.

But what if you have a grid that’s absolutely perfect, except for one visible error that’s big
enough to persist through even a strong variation? This is when you turn to the newest
weapon, regional variation (aka inpainting). Clicking this will bring up a GUI that lets you
highlight the portions of the image you want regenerated.

pixel art of naga snakelady hybrid, long tail, retro 8-bit RPG sprite on white background --ar 4:3 --niji --style expressive --s 50 --w 50 --no calligraphy

1 regional variation 🙂

Here, niji expressive provided us with an excellent pixel arrangement, but rendered our
beastie with a spurious foot. Subtle variation did nothing, while strong variation stepped on
our good luck by separating the human and snake into separate characters. So I simply
highlighted the foot and clicked it away.

Keep in mind that region vary does make some barely perceptible changes to the non-
varied regions. I wouldn’t hesitate to recommend spamming subtle/strong on a “close”
prompt, but for region, I would only use it when there’s a clear, easily delineated error. It’s
better at subtraction than addition.
sphinx of black quartz, judge my vow:: transparent quartz::0.5 --style raw

1 regional variation 🙁

One reason for that is that regional vary is bad at preserving perspective and style. If you
try to use it to improve a face, or a hand gesture, you're liable to get a cartoon face on a
photograph or vice-versa, or a giant/shrunken hand positioned at a very odd angle.
Compared to most other image AIs, MJ's inpainting doesn't take a whole lot of information
from the surrounding pixels.

Now, I said to activate remix at the beginning of this chapter, and so far, we haven’t used it
by changing a prompt. (Remix mode will automatically do variations if it detects no change.)
Doing this will inherently change a few things, independent of whatever you type.
something awesome --ar 4:5 --niji --style original --c 100

1 strong variation of V1 / 1 strong remix of V1 (--c 75)
First thing to note: Variations do not reconsider your parameters, but remixes do. We had a
max chaos prompt (vague words & --c 100) in the original grid. The strong variations,
however, mostly preserve the results in the first grid: the girl, the guitar, the pose, the
framing. They’re about as far apart as the soccer eagle example, which had no chaos. As
long as you aren’t changing the prompt, chaos (as well as any other parameters [and image
prompts!]) will stop having an effect. But changing anything in the prompt, even turning
chaos down to --c 75, provokes Midjourney to look at all of those things afresh. And since
our prompt itself didn’t have much guidance, refreshing its chaos is enough to push the
results very far apart again, such that their only consistency is “a cool portrait of a girl”.

I intentionally selected a dramatic demonstration above. Subtle remixes, on the other
hand, will inherently mute any impact from changing prompts. But this can be useful, too,
as in this alternative "remix painting" workflow for image prompting characters:

character sheet, full body turnaround, dark-haired mature woman wearing slinky red dress, white background --niji --ar 3:4

1 subtle remix, adding https://s.mj.run/uxoaN0X4l9Y

It was a huge pain in the ass to get a full zoom-out on our Armenian pianist earlier. We had
to lower image weight all the way down to 0.25 to get it done in a single prompt, which led
to lots of tradeoffs regarding quality and prompt direction. Much easier is this two-step
process: simply generate a generic model first, without saying anything more about the
face than light/dark complexion, or maybe hair color. Then do a subtle remix with just the
character's face (the anime-style face we made for her earlier, cropped a little tighter to
exclude clothes and background) injected as an image prompt, and it will "paint on."

This is not without its own tradeoffs: you can see the image as a whole got a bit “blurrier.”
Subtle remixing, especially, tends to “disrupt” the pixels in a way that can often be
mitigated by doing a few extra variations after the remix, which allows the image to “settle”
a little bit into its new format. The remix-injection workflow could easily have gotten its own
chapter but instead I will link this outrageously detailed tutorial by Vic Gnarly which I see no
need to recapitulate further. (Animation workflows [!] are a bit outside the scope of this
manual.)

This also allows you to "stabilize" remixes that are veering too far off course by
copying the original image itself and putting it in the mix. Using that plus the --iw
parameter turns it into a bootleg method of throttling remix strength (max weight, weakest
remix, although obviously the effect will be limited if there are lots of existing image
prompts).
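
As a hypothetical sketch (the URL here is a placeholder for your own image's link, not a real one): do a subtle remix, adding https://s.mj.run/YOUR-OWN-GRID to the front of the prompt along with --iw 2. The closer --iw gets to max, the more the pasted-in original anchors the result, and the weaker the remix effectively becomes.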

Did I mention you can change any parameter in a remix?

"DON'T INVENT THE TORMENT 6-7 variations (in --v 2) remix to regional remix to
NEXUS":: front cover of
--v 5.2 --style raw --niji --style expressive
classic sci-fi novel
--ar 5:8 --s 200 --w 200
"DON'T INVENT THE TORMENT
NEXUS" --v 2 --ar 5:8 --c 100 --ar 5:8

Remember how the legacy chart mentioned versions 2 and 3 of the engine are really good
at resolving multiprompted text? With subtle and regional remixes, you can get the best of
both worlds! The workflow here (a hypothetical walkthrough follows the list):
1. Multiprompt your text in version 2 or 3 using the above template (use --s 625 if
using --v 3). Don't worry too much about the art looking mushy or changing
drastically; we're here for the words.
2. Vary them (these versions have subtle only, but because they're only 256 pixels, that
does a lot more for text) until the letters resolve.
3. Remix to either --v 5.2 --style raw or --niji --style cute. Both of these
are prone to rendering text, and are more likely to preserve lettering on a remix.
4. Then, regional remix all the area around the text to whatever version you find
aesthetically appealing, and prompt chaos or some art keywords.
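
As a purely illustrative walkthrough (invented phrase, settings copied from the template above, not a tested run):

"THE LAZY DOG":: front cover of classic sci-fi novel "THE LAZY DOG" --v 2 --ar 5:8 --c 100

vary until "THE LAZY DOG" is legible

remix to "THE LAZY DOG":: front cover of classic sci-fi novel "THE LAZY DOG" --v 5.2 --style raw --ar 5:8

regional remix the area around the title to pulp sci-fi spaceship illustration --v 5.2 --ar 5:8 --c 50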

This is a painstaking, max-effort, four-step process for when you need whole phrases ("the
quick brown fox jumps over the lazy dog" is possible if you're stubborn enough). There are
shorter alternatives for those who want just one word or a few letters. Firstly, if your text
wouldn't work well as a square island on the regional vary screen (because it's at an angle,
or integrated with an object somehow), you can start in the current version, dip into the old
version just for the text, then dip back out:

"MVB" text:: "MVB" text on 1 remix of V2 to 1 remix of V2 back to


football helmet, monochrome ink --v 3 --s 625 --v 5.2
outline on white background --v
5.2 --q 0.5 --style raw

We couldn’t start in the older versions here, because the helmet would look crappy. But by
getting the helmet first, dipping into version 3 for one remix, and then returning to current
year, we scooped the correct letters. The helmet is a little dinged up from the trip (you can
see the facemask is a little irregular and it's picked up some schmutz) but the letters are big
enough that we can comfortably run a few more subtle variations to clean it up if needed.

And in some cases, it’s not necessary to go to older versions at all. If v5 raw can get you
within one or two letters of your desired word, the easiest thing to do is regional remix the
errant letter and replace the entire prompt with just the correct letter. If you’re replacing
two letters, I recommend repeating the letters in a multiprompt with and without quotes. If
you’re replacing three letters, well, you might want to consider remixing to an older
version. Current versions can only really be relied on to get the first and last letter of a
prompt correct.
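
For instance (a hypothetical, untested sketch): if your sign reads "CAFF" instead of "CAFE," regionally select the second F and replace the whole prompt with just E. If both of the last two letters came out wrong, try something like "FE":: FE as the replacement multiprompt, repeating the letters with and without quotes.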

cabana:: cabana with "CABANA" text on straw pavilion, swimming pool, outdoor deck photograph --style raw

1 region remix to B --style raw

Image injection and parameter fiddling are sideshows; the most powerful thing you can do
is of course remixing the prompt words themselves. The general concept behind this is
“scaffolding,” where you start with a prompt for a broad first idea, then exploit the resulting
composition to go to a more specific second idea (similar to image prompt mockups.)

https://s.mj.run/V2wd7oWqmSU line art of praying mantis, isolated on white background, ink outline, vector art --ar 3:2 --style raw --s 0 --iw 0.75 --no color

subtle remix to mechanical mantis, steampunk, biopunk --ar 3:2 --no boring

Here’s a typical subtle remix use case. MJ didn’t have a solid enough grasp of a praying
mantis to get even a normal one without help from an image prompt (most of what was
generated was mosquitos or crickets). But since the image prompt was an actual mantis, it
would also be fighting against the steampunk theme. Rather than painstakingly fish for the
exact balance of forces needed for the result in a single prompt, I just prompted the mantis
first, then ditched the image prompt and remixed with steampunk cues second.

abstract watercolor painting, ink wash, deep colors --style raw --s 50 --w 25 --ar 7:8

strong remix to abstract watercolor portrait of woman with bob cut and glasses, ink wash, deep colors --style raw --s 50 --w 25 --ar 7:8

For strong remixes, there’s very little point in changing the prompt wholesale, since that
will result in what might as well be a whole new image. For these, it’s best to keep
everything you like about the image identical and just substitute or add things as you want
them to appear or lose/receive emphasis. One of the most common strong remixes I do is
adding multiprompt reinforcement to something where I like the vibe and style a lot, but
the subject isn’t quite right because some detail got drowned out.

For regional remixes, especially when trying to add things rather than subtract them, it’s
important to give MJ at least some prompt clues about the perspective or zoom of the
things that it’s adding relative to the whole image. So if you are trying to add a detail in one
small part of the image, having something like “distant” or “wide shot” or even “tiny” can
help guide it towards keeping consistent scale.
two-headed chiroptera monster, fantasy illustration --ar 4:3

regional remix to distant riders --ar 4:3

I liked the bat monster, invoked via scientific class and the “fantasy illustration” qualifier as
demonstrated in chapter 2. Then I remixed just the top part of the image, and replaced the
entire prompt with “distant riders,” which was sufficient to ensure they appeared, and
weren’t randomly facing the side or something (which might have happened if I didn’t add
“distant”, since the most common “rider” perspective is a side profile view).

joe biden and joe biden give speech remarks on white house rose garden, file photograph --ar 5:3 --style raw

regional remix to anime hatsune miku gives speech --ar 5:3 --niji --s 250

One other thing to note about region remix is that it can help to negate, in your second
prompt, the parts of the first prompt you don't want to keep. The classic example is
wanting to generate two different people. Here, I used it to blend anime and live-action,
which is normally pretty tough to do; Miku was showing up as a bad cosplay or hideous
CGI until I threw niji stylize into the mix.

Now let’s talk about zooming, which is much less complicated. There are three buttons:
zooming out 1.5x, zooming out 2x, and custom zoom, which takes you to a prompt edit
window. The first two are pretty self-explanatory and I probably don’t need an example
here. It’s a decent way to get something like “a TV screen showing one image inside another
image” by first prompting the inner details, and then remixing to the outer details on the
zoom.

Custom zoom has two utilities which probably aren't obvious, though. The first, and most
straightforward: by running --zoom 1, MJ will regenerate the outer border of the
image. This is invaluable if you have accidental black bars that would interfere with
zooming or panning, or just make the image look crappy.

The second is that by altering the aspect ratio to something taller or wider than the starting
one, you can do a selective --zoom 1 that expands the image only horizontally, or only
vertically.

lineart --s 500 --w 200 --c 100 --ar 2:3 --niji

custom zoom to lineart --s 500 --w 200 --c 100 --ar 3:2 --niji --zoom 1

You can change aspect ratio directly in a remix, but since no new image data is being
generated, it will squash the image (subtle) or drastically recompose it (strong).
Custom zoom is the easiest and most consistent way to move from widescreen to portrait
and vice versa. I wasn't really picky about the surroundings here, since this was a chaos
prompt, but the same considerations apply here as in regional remixes if you want your new
image portions to look different from the original ones.
Two considerations to keep in mind for zooms are that they don't add new pixels to your
image, and they have a tendency to vignette. The pixels you zoom out from are shrunk but
otherwise unchanged. Zooming out from a single face, therefore, still has a tendency to
make the face look "pasted-on," because it retains much more detail than the freshly
generated content around it. You can counteract the vignetting somewhat by adding words
like "bright" or "light" or "vivid" into your prompt as you zoom out.

The only way to add new pixels to your image and make it actually larger than it was is to
pan, or extend the canvas in a direction. However, this will lock you out of remix options
until you do a custom zoom (or use the "make square" button, which is just a shortcut to an
--ar 1:1 custom zoom) to reduce the total pixel count to an MJ standard. Panning should
therefore usually be the last thing you do, since any other action will require losing pixels.

pale goth vampire with white hair and frilly black dress looking up in field of metallic gold leaf roses under black midnight sky, scenic wide --niji --ar 3:2 --s 400 --w 10 --no red, moon

pan up to pale goth vampire with white hair and frilly black dress looking up at red blood moon in field of metallic leaf flowers, scenic wide --niji --s 400 --w 10 --no yellow

Here’s a pan where I wanted two things cutting against archetypes: gold roses and a red
moon. Any attempt at singleton prompting this will inevitably give you a yellowish moon
and a field of red flowers, the color attraction is too strong. Therefore, I generated the
bottom first, which let me focus solely on breaking the rose archetype by specifying “gold
leaf” and “--no red” with the moon negated as well since I didn’t want to get one before I
prompted it.
Then the up pan negated the yellow (switching to simply “metallic”) while prompting a red
blood moon. And since we used a pan, this got us a larger resolution (1344px 2 vs 1024px2)
than we would’ve gotten prompting straight to a square and doing regional remixes on the
moon.

Between remixing, custom zooms, and panning, it's possible to brute-force just about any
composition, subject, and details you want in Midjourney. The question isn't whether
you can, but whether it's worth the trouble. (Doing all of this in a command line is not
ideal, and I can't imagine how weird this must feel to people who didn't learn computers on
a DOS prompt.) Once there's a website, though, and these things can all be done with
sliders and the same mouse selection that regional variations use, I expect many of the
things in this guide that currently take a lot of effort to become trivial.
HASTY MJ 6 ADDENDUM

As is tradition, I don’t want to rewrite this entire guide for an even-numbered MJ version,
especially when it’s still an alpha, so here is a disconnected string of notes and
observations that you should take with even more salt than what came before:

The most important change is that MJ has finally slipped the surly bonds of CLIP. The max
prompt length is now somewhere in the neighborhood of 350 words, and prompts
are now case-sensitive. No one can tell you exactly where the token breaks are anymore,
so the precise length is a guess.

And even better, the length is actually usable! MJ has recovered its ability to pick up on style
cues deeper into a prompt, so you should need to front-load them less often than in v5
(front-loading is still a valid troubleshooting step if your prompted style struggles).

As a result of the increased length, MJ is more tolerant of paragraph-length prose. I have
seen no seed-locked evidence yet that it is better than terse single sentences, but it
certainly ain't fatal like in v5, and which prompting style you prefer is much more a matter
of taste than it was before. I would still recommend against chatGPT prompts, though,
because chatGPT still tends to load its full sentences up with visually meaningless mood
words.

Instead, use that extra prompt length to direct the scene. In particular, you will want to
specify the relative positions of stuff when setting up a scene with two subjects. MJ really
rewards picturing something in your head first and then prompting it. Directions like “on
the left/right” have shifted from totally ineffective to more-or-less mandatory if you
want to see the increased composition control.
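
A hypothetical sketch of that kind of stage direction (untested, invented subjects): armored knight on the left facing a red dragon on the right, ruined stone bridge between them, wide shot, fantasy illustration --ar 16:9.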

In general, you do still want to prompt in a terse fashion when assigning properties to
different elements in a scene. Back-referencing something in the first few words much later
in the prompt still tends to crush what's in between, so I still recommend doing all the
details at once and making sure that the distance between the details and the subject they
apply to is 0-2 words max.
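
As an invented illustration of that adjacency rule: prefer something like woman with emerald eyes and silver hair browsing a night market over a woman browsing a night market, her eyes emerald and her hair silver. The back-reference in the second version tends to crush the market details sitting in between.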

Cowbelling simple repeats of the same word has almost no effect now, but stacking
multiple synonyms in a prompt is still good meta.
Double quotes are now a soft-encoded text signal. Writing something "in here" makes
MJ much more likely to spell it out as text (remember it’s case sensitive). I believe this is
soft, i.e., from improved dataset standardization during training and not a hard command
like :: because you'll still get words on occasion without them (generally by prompting
neon signs or similar) and sometimes on high stylize even "double quotes" will fail to make
things appear until the prompt gets an extra hint (generally the word "text" near them, like
title text “IT PROMPTS AT NIGHT”).

The summer 2022 multiprompt meta for text is, at long last, very deprecated and you
should avoid it. The independence of the multiprompts now works against you by
spawning duplicate text. However, I have noticed (with low confidence) an improvement by
repeating the word in both double and single quotes within a single prompt.
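
In practice (same low confidence as the observation itself), that repetition might look like: movie poster with title text "NEXUS" 'NEXUS', minimalist design --ar 2:3.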

Both modes, but especially raw, are much less likely to accidentally render text, so while
chatGPT may not be good for deliberate construction, it makes for decent vibes-based
prompting. (If you want to get vibes-based outcomes from shorter, lazier inputs, a small
dose of --w or --c might be more vital than before to jazz things up).

I say “small” because weird in particular has been massively overbuffed in this version of
MJ. There is very little difference between --w 100 and --w 3000, and they will trump
any values of --s or --c. Until someone wakes up and realizes this is very stupid, I would
keep the values in the single-digit range to start and build up very slowly.

The "center of mass" between raw and default style is much closer. (This may change in
future updates.) In v5, voidprompting in raw got you product photos; voidprompting v6 raw
gets you mostly the same pretty girls you get in default, just desaturated photographs with
plain backgrounds vs epic paintings of flying fish. There's also a universal shift back
towards painterly textures and splashes, more similar to v4 default than v5, albeit
much easier to prompt past, so you won't notice it that much unless you do minimalist
unstyled prompts.

Multiprompts "work differently" although no elaboration has been offered and the
fundamental examples in the chapter still work, although many may no longer be
necessary. In particular, multiprompting styles is still good meta for when in-prompt styling
isn't enough to overcome an extensively described subject, and negative prompts are still
necessary to knock things out in many cases (prose "no" works in slightly broader
situations but mostly you still want to avoid telling the AI not to think about pink
elephants).

Negative multiprompts are also much more powerful. While --no is still a fine shortcut, a
downstream effect of the prompt comprehension is that you can often get what you
want with a lighter negative weight, which in turn will have much less drastic effects
on the rest of your prompt. I’d start with :: glasses::-0.1 for something like
knocking glasses off a character who MJ thinks should wear them for whatever reason.
Don’t start with --no unless it’s something you know (lol) will have a strong confounder.

Image prompts now have a max of --iw 3, although in every test prompt I have run so far,
the differences between --iw 2 and --iw 3 are so minimal that I probably couldn't tell
which is which in a blind pick, so the scaling seems quite similar to v5. Images can still
overrule your prompt quite easily on default, so I am sticking to my recommendation of
--iw 0.75 as a starting sweet spot for image prompts.

General remixes seem to be noised up from v5. Subtle remixes are a bit less subtle
(think more scaffolding, less doom-rolling fingers) and strong remixes can be changed
nearly wholesale while sticking surprisingly well to the original style of the prompt. If you
like something's style, swap out the content entirely for something else in a strong remix
and it's probably a lot more portable than earlier versions. Additionally, strong remixing to
a new aspect ratio no longer stretches the image stupidly, and can now recompose it
without needing the middle step of custom zoom (especially if helped along with --c).
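
A hypothetical content swap of that kind (invented prompts): start with watercolor fox curled in autumn leaves --ar 4:5, then strong remix to watercolor lighthouse on a stormy cliff --ar 4:5; in v6 the wash and palette should carry over surprisingly well.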

However, regional vary (inpainting) and pan are somewhat nerfed, in terms of their ability
to radically alter the image outside of the normal prompting parameters. The example
from the last chapter where I remixed “distant riders” on a two-headed bat now requires
you to remix to “distant riders atop a two-headed bat” to give the new direction context.
