Dan Luu - How Completely Messed Up Practices Become Normal
There’s the company that is perhaps the nicest place I’ve ever worked,
combining the best parts of Valve and Netflix. The people are amazing and
you’re given near total freedom to do whatever you want. But as a side effect
of the culture, they lose perhaps half of new hires in the first year, some
voluntarily and some involuntarily. Totally normal, right?
There’s the office where I asked one day about the fact that I almost never
saw two particular people in the same room together. I was told that they
had a feud going back a decade, and that things had actually improved – for
years, they literally couldn’t be in the same room because one of the two
would get too angry and do something regrettable, but things had now
cooled to the point where the two could, occasionally, be found in the same
wing of the office or even the same room. These weren’t just random people,
either. They were the two managers of the only two teams in the office.
Normal!
There’s the company whose culture is so odd that, when I sat down to write a
post about it, I found that I’d not only written more than for any other single
post, but more than all other posts combined (which is well over 100k words
now, the length of a moderate book). This is the same company where
someone recently explained to me how great it is that, instead of using data
to make decisions, we use political connections, and that the idea of making
decisions based on data is a myth anyway; no one does that. What’s not only
normal, but the only possible way to do things is to use your political capital
to push your personal agenda through.
There’s the company that created multiple massive initiatives to recruit more
women into engineering roles, where women still get rejected in recruiter
screens for not being technical enough after being asked questions like “was your
experience with algorithms or just coding?”, as is normal in the industry.
There’s the company where I worked on a four person effort with a multi-
hundred million dollar budget and a billion dollar a year impact, where
requests for things that cost hundreds of dollars routinely took months or
were denied.
You might wonder if I’ve just worked at places that are unusually screwed up.
Sure, the companies are generally considered to be ok places to work, and
two of them are considered to be among the best places to work, but maybe
I’ve just ended up at places that are overrated. But I have the same
experience when I hear stories about how other companies work, even places
with stellar engineering reputations, except that it’s me that’s shocked and
my conversation partner who thinks their story is normal.
There are the companies that use @flaky, which include the vast majority of
Python-using SF Bay area unicorns. If you don’t know what this is, this is a
library that lets you add a Python annotation to those annoying flaky tests
that sometimes pass and sometimes fail. When I asked multiple co-workers
and former co-workers from three different companies what they thought
this did, they all guessed that it re-runs the test multiple times and reports a
failure if any of the runs fail. Close, but not quite. It’s technically possible to
use @flaky for that, but in practice it’s used to re-run the test multiple times
and report a pass if any of the runs pass. The company that created @flaky
is effectively a storage infrastructure company, and the library is widely used
at its major competitor. Marking tests that expose potential bugs as passing
is totally normal; after all, that’s what ext2/ext3/ext4 do with write errors.
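To make the two readings concrete, here’s a minimal sketch of how the decorator tends to show up in test suites (this assumes the open source flaky pytest plugin and its max_runs/min_passes parameters; the test bodies and their helper are hypothetical):

    import random

    from flaky import flaky

    def fetch_replica_status():
        """Hypothetical dependency that fails intermittently."""
        return "ok" if random.random() < 0.7 else "timeout"

    # What people guessed @flaky does: re-run and require every run to pass.
    @flaky(max_runs=3, min_passes=3)
    def test_replica_status_strict():
        assert fetch_replica_status() == "ok"

    # How it's used in practice: re-run and report a pass if any single run
    # passes (min_passes defaults to 1), which papers over intermittent bugs.
    @flaky(max_runs=3)
    def test_replica_status_in_practice():
        assert fetch_replica_status() == "ok"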
As far as I can tell, what happens at these companies is that they started by
concentrating almost totally on product growth. That’s completely and totally
reasonable, because companies are worth approximately zero when they’re
founded; they don’t bother with things that protect them from losses, like
good ops practices or actually having security, because there’s nothing to
lose (well, except for user data when the inevitable security breach happens,
and if you talk to security folks at unicorns you’ll know that these happen).
The result is a culture where people are hyper-focused on growth and ignore
risk. That culture tends to stick even after the company has grown to be worth
well over a billion dollars, and the companies have something to lose. Anyone
who comes into one of these companies from Google, Amazon, or another
place with solid ops practices is shocked. Often, they try to fix things, and
then leave when they can’t make a dent.
Google probably has the best ops and security practices of any tech company
today. It’s easy to say that you should take these things as seriously as
Google does, but it’s instructive to see how they got there. If you look at the
codebase, you’ll see that various services have names ending in z, as do a
curiously large number of variables. I’m told that’s because, once upon a
time, someone wanted to add monitoring. It wouldn’t really be secure to have
google.com/somename expose monitoring data, so they added a z.
google.com/somenamez. For security. At the company that is now the best in the
world at security.
Google didn’t go from adding z to the end of names to having the world’s
best security because someone gave a rousing speech or wrote a convincing
essay. They did it after getting embarrassed a few times, which gave people
who wanted to do things “right” the leverage to fix fundamental process
issues. It’s the same story at almost every company I know of that has good
practices. Microsoft was a joke in the security world for years, until multiple
disastrously bad exploits forced them to get serious about security. Which
makes it sound simple, but if you talk to people who were there at the time,
the change was brutal. Despite a mandate from the top, there was vicious
political pushback from people whose position was that the company got to
where it was in 2003 without wasting time on practices like security. Why
change what’s worked?
You can see this kind of thing in every industry. A classic example that tech
folks often bring up is hand-washing by doctors and nurses. It’s well known
that germs exist, and that washing hands properly very strongly reduces the
odds of transmitting germs and thereby significantly reduces hospital
mortality rates. Despite that, trained doctors and nurses still often don’t do
it. Interventions are required. Signs reminding people to wash their hands
save lives. But when people stand at hand-washing stations to require others
walking by to wash their hands, even more lives are saved. People can ignore
signs, but they can’t ignore being forced to wash their hands.
The data are clear that humans are really bad at taking the time to do things
that are well understood to incontrovertibly reduce the risk of rare but
catastrophic events. We will rationalize that taking shortcuts is the right,
reasonable thing to do. There’s a term for this: the normalization of deviance.
It’s well studied in a number of other contexts including healthcare, aviation,
mechanical engineering, aerospace engineering, and civil engineering, but
we don’t see it discussed in the context of software. In fact, I’ve never seen
the term used in the context of software.
Turning off or ignoring notifications because there are too many of them and
they’re too annoying? An erroneous manual operation? This could be straight
out of the post-mortem of more than a few companies I can think of, except
that the result was a tragic death instead of the loss of millions of dollars. If
you read a lot of tech post-mortems, every example in Banja’s paper will feel
familiar even though the details are different.
Once again, this could be from an article about technical failures. That makes
the next section, on why these failures happen, seem worth checking out.
The reasons given are:
People don’t automatically know what should be normal, and when new
people are onboarded, they can just as easily learn deviant processes that
have become normalized as reasonable processes.
The thing that’s really insidious here is that people will really buy into the
WTF idea, and they can spread it elsewhere for the duration of their career.
Once, after doing some work on an open source project that’s regularly
broken and being told that it’s normal to have a broken build, and that they
were doing better than average, I ran the numbers, found that the project was
basically worst in class, and wrote something about the idea that it’s possible
to have a build that nearly always passes with pretty much zero effort. The
most common comment I got in response was, “what kind of fantasy land is
this guy living in? Let’s get real. We all break the build at least a few times a
week”. This stuff isn’t rocket science, but once people get convinced that
some deviation is normal, they often get really invested in the idea.
The example in the paper is of someone who breaks the rule that you should
wear gloves when finding a vein. Their reasoning is that wearing gloves
makes it harder to find a vein, which may result in their having to stick a
baby with a needle multiple times. It’s hard to argue against that. No one
wants to cause a baby extra pain!
The second worst outage I can think of occurred when someone noticed that
a database service was experiencing slowness. They pushed a fix to the
service, and in order to prevent the service degradation from spreading, they
ignored the rule that you should do a proper, slow, staged deploy. Instead,
they pushed the fix to all machines. It’s hard to argue against that. No one
wants their customers to have degraded service! Unfortunately, the fix
exposed a bug that caused a global outage.
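For contrast, a staged deploy in its simplest form looks something like the sketch below. This is a generic outline, not that company’s actual tooling; deploy_to, healthy, and the stage definitions are hypothetical stand-ins:

    import time

    # Hypothetical stand-ins for whatever your deploy tooling provides.
    def deploy_to(hosts, build):
        print(f"deploying {build} to {len(hosts)} hosts")

    def healthy(hosts):
        # In practice: check error rates, latency, and service-level metrics.
        return True

    # Widen the blast radius gradually: one canary, then a slice, then the rest.
    STAGES = [
        ["canary-1"],
        [f"pod-a-{i}" for i in range(10)],
        [f"pod-b-{i}" for i in range(100)],
    ]

    def staged_deploy(build, bake_seconds=600):
        for hosts in STAGES:
            deploy_to(hosts, build)
            time.sleep(bake_seconds)  # let the change bake before widening
            if not healthy(hosts):
                raise RuntimeError("rollout halted: health check failed")
        print("rollout complete")

The point of the bake time and health checks is that a bad fix takes out one canary or one slice of machines instead of every machine at once.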
As companies grow up, they eventually have to impose security that prevents
every employee from being able to access basically everything. And at most
companies, when that happens, some people get really upset. “Don’t you
trust me? If you trust me, how come you’re revoking my access to X, Y, and
Z?”
Facebook famously let all employees access everyone’s profile for a long
time, and you can even find HN comments indicating that some recruiters
would explicitly mention that as a perk of working for Facebook. And I can
think of more than one well-regarded unicorn where everyone still has access
to basically everything, even after their first or second bad security breach.
It’s hard to get the political capital to restrict people’s access to what they
believe they need, or are entitled, to know. A lot of trendy startups have core
values like “trust” and “transparency” which make it difficult to argue
against universal access.
There are people I simply don’t give feedback to because I can’t tell if they’d
take it well or not, and once you say something, it’s impossible to un-say it. In
the paper, the author gives an example of a doctor with poor handwriting
who gets mean when people ask him to clarify what he’s written. As a result,
people guess instead of asking.
In most company cultures, people feel weird about giving feedback. Everyone
has stories about a project that lingered on for months after it should have
been terminated because no one was willing to offer explicit feedback. This is
a problem even when cultures discourage meanness and encourage
feedback: cultures of niceness seem to have as many issues around speaking
up as cultures of meanness, if not more. In some places, people are afraid to
speak up because they’ll get attacked by someone mean. In others, they’re
afraid because they’ll be branded as mean. It’s a hard problem.
I was shocked the first time I saw this happen. I must have been half a year
or a year out of school. I saw that we were doing something obviously
non-optimal, and brought it up with the senior person in the group. He told
me that he didn’t disagree, but that if we did it my way and there was a
failure, it would be really embarrassing. He acknowledged that my way
reduced the chance of failure without making the technical consequences of
failure worse, but it was more important that we not be embarrassed. Now
that I’ve been working for a decade, I have a better understanding of how
and why people play this game, but I still find it absurd.
Solutions
Let’s say you notice that your company has a problem that I’ve heard people
at most companies complain about: people get promoted for heroism and
putting out fires, not for preventing fires; and people get promoted for
shipping features, not for doing critical maintenance work and bug fixing.
How do you change that?
The simplest option is to just do the right thing yourself and ignore what’s
going on around you. That has some positive impact, but the scope of your
impact is necessarily limited. Next, you can convince your team to do the
right thing: I’ve done that a few times for practices I feel are really important
and are sticky, so that I won’t have to continue to expend effort on convincing
people once things get moving.
But if the incentives are aligned against you, it will require an ongoing and
probably unsustainable effort to keep people doing the right thing. In that
case, the problem becomes convincing someone to change the incentives,
and then making sure the change works as designed. How to convince people
is worth discussing, but long and messy enough that it’s beyond the scope of
this post. As for making the change work, I’ve seen many “obvious” mistakes
repeated, both in places I’ve worked and those whose internal politics I know
a lot about.
Small companies have it easy. When I worked at a 100 person company, the
hierarchy was individual contributor (IC) -> team lead (TL) -> CEO. That was
it. The CEO had a very light touch, but if he wanted something to happen, it
happened. Critically, he had a good idea of what everyone was up to and
could basically adjust rewards in real-time. If you did something great for the
company, there’s a good chance you’d get a raise. Not in nine months when
the next performance review cycle came up, but basically immediately. Not
all small companies do that effectively, but with the right leadership, they
can. That’s impossible for large companies.
At large company A (LCA), they had the problem we’re discussing and a
mandate came down to reward people better for doing critical but
low-visibility grunt work. There were too many employees for the mandator
to directly make all decisions about compensation and promotion, but the
mandator could review survey data, spot check decisions, and provide
feedback until things were normalized. My subjective perception is that the
company never managed to achieve parity between boring maintenance work
and shiny new projects, but got close enough that people who wanted to
make sure things worked correctly didn’t have to significantly damage their
careers to do it.
It’s sort of funny that this ends up being a problem about incentives. As an
industry, we spend a lot of time thinking about how to incentivize consumers
into doing what we want. But then we set up incentive systems that are
generally agreed upon as incentivizing us to do the wrong things, and we do
so via a combination of a game of telephone and cargo cult diffusion. Back
when Microsoft was ascendant, we copied their interview process and asked
brain-teaser interview questions. Now that Google is ascendant, we copy
their interview process and ask algorithms questions. If you look around at
trendy companies that are younger than Google, most of them basically copy
their ranking/leveling system, with some minor tweaks. The good news is
that, unlike many companies people previously copied, Google has put a lot
of thought into most of their processes and made data-driven decisions. The
bad news is that Google is unique in a number of ways, which means that
their reasoning often doesn’t generalize, and that people often cargo cult
practices long after they’ve become deprecated at Google.
This kind of diffusion happens for technical decisions, too. Stripe built a
reliable message queue on top of Mongo, so we build reliable message
queues on top of Mongo [1]. Our co-worker live edits the production database
to run tests, so we live edit the production database to run tests. It’s cargo
cults all the way down [2].
Let’s look at how one of the suggested solutions, “pay attention to weak
signals”, interacts with a single example, the “WTF WTF WTF” a new person
gives off when they join the company.
“Pay attention to weak signals” sure sounds like good advice, but how do we
do it? Strong signals are few and far between, making them easy to pay
attention to. Weak signals are abundant. How do we filter out the ones that
aren’t important? And how do we get an entire team or org to actually do it?
These kinds of questions can’t be answered in a generic way; this takes real
thought. We mostly put this thought elsewhere. Startups spend a lot of time
thinking about growth, and while they’ll all tell you that they care a lot about
engineering culture, revealed preference shows that they don’t. With a few
exceptions, big companies aren’t much different. At LCB, I looked through
the competitive analysis slide decks and they’re amazing. They look at every
last detail on hundreds of products to make sure that everything is as nice
for users as possible, from onboarding to interop with competing products. If
there’s any single screen where things are more complex or confusing than
any competitor’s, people get upset and try to fix it. It’s quite impressive. And
then when LCB onboards employees, a third of them are missing at least one
of an alias/account, an office, or a computer, a condition which can persist
for weeks or months. The competitive analysis slide decks talk about how
important onboarding is because you only get one chance to make a first
impression, and then employees are onboarded with the impression that the
company couldn’t care less about them and that it’s normal for quotidian
processes to be pervasively broken. LCB can’t even get the basics of
employee onboarding right, let alone really complex things like acculturation.
This is understandable – external metrics like user growth or attrition are
measurable, and targets like how to tell if you’re acculturating people so that
they don’t ignore weak signals are softer and harder to determine, but that
doesn’t mean they’re any less important. People write a lot about how things
like using fancier languages or techniques like TDD or agile will make your
teams more productive, but having a strong engineering culture is a much
larger force multiplier.
Thanks to Ezekiel Benjamin Smithburg and Marc Brooker for introducing me to the term
Normalization of Deviance, and Kelly Eskridge, Leah Hanson, Sophie Rapoport, Ezekiel
Benjamin Smithburg, Julia Evans, Dmitri Kalintsev, Ralph Corderoy, Jamie Brandon, and
Victor Felder for comments/corrections/discussion.
1. People seem to think I’m joking here. I can understand why, but try
Googling mongodb message queue. You’ll find statements like “replica sets in
MongoDB work extremely well to allow automatic failover and
redundancy”. Basically every company I know of that’s done this and
has anything resembling scale finds this to be an operational nightmare,
but you can’t actually find blog posts or talks that discuss that aspect of
it. All you see are the posts and talks from when they first tried it and
are in the honeymoon period. This is common with many technologies.
People really don’t like admitting that they based their infra on a
fundamentally bad idea, so you’ll mostly find glowing recommendations
even when, in private, people will tell you what a disaster the project
was. Today, if you do the search mentioned above, you’ll get a ton of
posts talking about how amazing it is to build a message queue on top of
Mongo, this footnote, and maybe a couple of blog posts by Kyle
Kingsbury depending on your exact search terms.
If there were an acute failure, you might see a postmortem, but while
we’ll do postmortems for “the site was down for 30 seconds”, we rarely
do postmortems for “this takes 10x as much ops effort as the alternative
and it’s a death by a thousand papercuts”, “we architected this thing
poorly and now it’s very difficult to make changes that ought to be
trivial”, or “a competitor of ours was able to accomplish the same
thing with an order of magnitude less effort”. I’ll sometimes do informal
postmortems by asking everyone involved oblique questions about what
happened, but more for my own benefit than anything else, because I’m
not sure people really want to hear the whole truth. This is especially
sensitive if the effort has generated a round of promotions, which seems
to be more common the more screwed up the project. The larger the
project, the more visibility and promotions, even if the project could have
been done with much less effort.↩
2. I’ve spent a lot of time asking about why things are the way they are,
both in areas where things are working well, and in areas where things
are going badly. Where things are going badly, everyone has ideas. But
where things are going well, as in the small company with the
light-touch CEO mentioned above, almost no one has any idea why
things work. It’s magic. If you ask, people will literally tell you that it
seems really similar to some other place they’ve worked, except that
things are magically good instead of being terrible for reasons they
don’t understand. But it’s not magic. It’s hard work that very few people
understand. Something I’ve seen multiple times is that, when a VP
leaves, a company will become a substantially worse place to work, and
it will slowly dawn on people that the VP was doing an amazing job of not
only supporting their direct reports, but also making sure that everyone
under them was having a good time. It’s hard to see until it changes,
though.↩