Better APIs: Quality, Stability, Observability
Mikael Vesavuori
2024
Contents
Copyright . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Found anything wrong? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
What will you learn? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Why should you care? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Prior art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Project resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Workshop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Workshop assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
How to follow along . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
My solution pitch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
The application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Tech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Technical components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Architecture diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Solution diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Deployment diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Microservices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
API documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Fake user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Feature toggles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Weighing between a hardware- or a software-oriented approach . . . . . . 33
The benefits of hardware-segregated environments . . . . . . . . . . . 33
The drawbacks of hardware-segregated environments . . . . . . . . . 33
The benefits of a software-defined, dynamic environment . . . . . . . . 33
The drawbacks of a software-defined, dynamic environment . . . . . . 34
Yes, you can mix these patterns with a hardware-separated environment 34
Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Make your processes known . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
SOLID principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Clean architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Baseline tooling and plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Continuous Integration and Continuous Deployment . . . . . . . . . . . . . . 42
Refactor continuously (“boy scout rule”) . . . . . . . . . . . . . . . . . . . . . 44
Trunk-Based Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Test-Driven Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Generate documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Unit testing (and more) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Creating coverage, fast . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Test positive and negative states . . . . . . . . . . . . . . . . . . . . . . . 53
Test automation in CI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Synthetic testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Automated scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Generate a software bill of materials . . . . . . . . . . . . . . . . . . . . . . . . 58
Open source license compliance . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Release versioned software and produce release notes . . . . . . . . . . . . 61
Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Lifecycle management and roadmap . . . . . . . . . . . . . . . . . . . . . . . . 63
API schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
API schema validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
API client version using headers . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Branch by abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Beta functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Feature toggles (“feature flags”) . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Authorization and role-based access . . . . . . . . . . . . . . . . . . . . . . . 77
Canary deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
How we can do better . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Observability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
AWS baseline observability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Structured logger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Alerting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Service discoverability and metadata . . . . . . . . . . . . . . . . . . . . . . . 91
Additional observability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Setting up Bunyan as a logger . . . . . . . . . . . . . . . . . . . . . . . . 94
Copyright
Cover image adapts photographic material shared by Pawel Czerwinski on Unsplash. All
relevant ownership of the original photograph remains with Pawel Czerwinski.
Found anything wrong?
Writing technical books is challenging. While concepts and ideas may remain relevant for
years, practical examples that rely on ever-changing technologies can become outdated
quickly.
If you find anything incorrect, not working, or otherwise unusual, I’d greatly appreciate
your feedback. I’ll do my best to incorporate updates as soon as possible.
Introduction
Writing and maintaining APIs can be hard. While the cloud, serverless, and the microser-
vices revolution made it easier and more convenient to set an API skeleton up, age-old
issues like software quality — SOLID, etc — and understanding the needs of the API
consumers still persist.
Werner Vogels, legendary CTO of Amazon, famously stated his API rules like this:
This book presents an application and a made-up (but “real-ish”) scenario that, taken
together, practically demonstrate a range of techniques or methods, patterns, implemen-
tations, as well as tools, that all help enhance quality, stability, and observability of ap-
plications:
Quality means our applications are well-built, functional, safe and secure, maintainable,
and are built to high standards.
Stability means that our application can withstand external pressure and internal change,
without failing at predictably providing its key business values.
Observability means that we can understand, from the outputs of our application, what is going on inside of it.
Monitoring is for running and understanding other people’s code (aka “your
infrastructure”).
Observability is for running and understanding your code – the code you write,
change and ship every day; the code that solves your core business problems.
Of the three above concepts, stability is the most misunderstood one, and it will be the
biggest and most pronounced component here. It will be impossible to reach what Vogels
is pointing to, without addressing the need for stability.
Information
Caveat: No single example project or book can fully encompass all details in-
volved in such a complex territory as this, but at least I will give it a try!
You will be especially interested in this project if you have ever been involved in situations
like the ones below, and want to have ideas for how to address them:
• Your team was unable to deliver new features because changes would mean breaking
them for someone else
• You built something but don’t know who your consumers are
• You looked up what DORA metrics are and laughed out the words “yeah not at my
company, no sirree, never”
• You say that you are “autonomous” but someone always keeps freezing deploy-
ments to the shared staging environment so the only actual autonomous thing is
localhost
• You say that you “implemented” continuous delivery, but it’s still too painful to
integrate and release without a gatekeeper and crashing a hundred systems
• You have heard “we cannot have any confidence in our systems if we don’t do real,
manual scheduled all-hands-on-deck testing with frozen versions of all external sys-
tems” so many times you actually have started to believe that ludicrous statement
• You wonder if things would be better with more infra and more environments, but
start having nightmares when you do some back-of-the-napkin math on how many
you’d need, not to mention the burden of supporting them logically
There are a million more of those, but you get the point: it's a dark and strange place to be in.
While it's fun to throw around the old Piranesi drawing or Happiness in Slavery video as self-deprecating memes, I think it is Escher who brilliantly captures the "positive" side of this mess: that things become very, very strange when multiple realities and gravities are seen together, at once. Even if one "reality" is perfectly stable and sound, the work of architects is to support all the realities that need to work together.
So: We need to get back some of that control so the totality does not look more like chaos
than whatever it is that we are attempting to do.
At the end of the day, our work is about supporting the shared goals (business, organi-
zation, or personal goals, if it’s a pet project) and letting us stay safe, sound, and happy
working professionals while doing so.
Prior art
There’s also a previous piece of work you can look at: multicloud-serverless-canary, which
might pique your interest if you want to see more on the Azure and GCP side of CI and
canaries.
Project resources
Feel free to check out the generated static website on Cloudflare Pages and the API docs
on Bump.
Workshop
This section outlines the prep work, resources, and high-level details on how I would approach creating an API that is high-quality, stable, and observable.
• Scenario
• Workshop assignment
• My solution pitch
• The application
Scenario
You are supporting your friends who are making a new social online game by providing an API service that churns out fake users. For their initial MVP, they asked you to create a service that just returns hardcoded feedback; an object like so: { "name": "Someguy Someguyson" }. This was enough while they built their network code and first Non-Player Character engine.
import {
  APIGatewayProxyResult,
} from "aws-lambda";

/**
 * @description The controller for our "fake user" service, in its basic or naive shape.
 */
export async function handler(): Promise<APIGatewayProxyResult> {
  try {
    return {
      statusCode: 200,
      body: JSON.stringify({
        name: "Someguy Someguyson",
      }),
    };
  } catch (error) {
    return {
      statusCode: 500,
      body: JSON.stringify(error),
    };
  }
}
• They are ready for a bit more detailed user data, meaning more fields.
• They are also considering pivoting the game to use cats, instead of humans.
• Of course, it's also ideal if the team can be sensibly isolated from any work on new features done by individual contributors in the open-source community; that work will need to be thoroughly documented and coded to a standard that matches the high ambitions and makes it easy to contribute.
• Oh, and it would be good with a dedicated beta feature set, too.
For the near future, the API must be able to handle all these use-cases. It would also be perfect if the API could keep stable interfaces (for new clients and old) and avoid adding several endpoints, as the development overhead is already big for the two weekend game programmers.
Workshop assignment
If you want to, here’s an example assignment for you to get going quickly.
Given the scenario, how would you approach addressing the requirements?
The workshop addresses typical skills needed by a technical lead or solution architect, prioritizing making considered architectural choices and being able to clearly communicate them.
This can be done in two ways, either with or without code, depending on the audience.

Suggested timeframe (with code): 3 hours. You can start from src/FakeUserBasic/index.ts.
The output should be a diagram or other visual artifact, and if also coding, a set of code
that demonstrates a full or partial solution.
At the end of the period, present your solution proposal (~5-10 minutes), or if you are
alone, go through what you’ve produced. Maybe even share on X, Reddit, LinkedIn…?
You can…
• Go on a guided tour: Grab a coffee, just read and follow along with links and
references to the work.
or
• Do this as a full-on workshop: Clone the repo, run npm install and npm
start, then read about the patterns and try it out in your own self-paced way.
Commands
The below commands are those I believe you will want to use. See package.json for more
commands!
• npm run build: Package and build the code with Serverless Framework
Prerequisites
• Amazon Web Services (AWS) account with sufficient permissions so that you can
deploy infrastructure. A naive but simple policy would be full rights for Cloud-
Watch, Lambda, API Gateway, X-Ray, S3, and CodeDeploy.
• GitHub account to host your Git fork and for running CI with GitHub Actions.
• Suggested: For example a Cloudflare account for hosting your static documenta-
tion on Cloudflare Pages.
• Optional: A Bump account to host your API description. You can remove the Bump
section from .github/workflows/main.yml if you want to skip this.
If you don't want to use Bump, go ahead and remove the part at the end of .github/workflows/main.yml.
Go to the Bump website and create a free account and get your token (accessible under
Automatic deployment, see the Your API key section).
We will use Mockachino as a super-simple mock backend for our feature toggles. This way
we can continuously change the values without having to redeploy a service or anything
else.
It's really easy to set up. Go to the website and paste this payload into the HTTP Response Body:
{
"error": {
"enableBetaFeatures": false,
"userGroup": "error"
},
"legacy": {
"enableBetaFeatures": false,
"userGroup": "legacy"
},
"beta": {
"enableBetaFeatures": true,
"userGroup": "beta"
},
"standard": {
"enableBetaFeatures": false,
"userGroup": "standard"
},
"dev": {
"enableBetaFeatures": true,
"userGroup": "dev"
},
"devNewFeature": {
"enableBetaFeatures": true,
"enableNewUserApi": true,
"userGroup": "devNewFeature"
},
"qa": {
"enableBetaFeatures": false,
"userGroup": "qa"
}
}
Change the path from the standard users to toggles. Click Create.
You will get a “space” in which you can administer and edit the mock API. You’ll see a
link in the format https://www.mockachino.com/spaces/YOUR_RANDOM_ID.
Install all dependencies with npm install, then set up husky pre-commits with npm run prepare.
For the next step you will need to be authenticated with AWS and have sufficient privi-
leges to deploy the stack to AWS. Once you are authenticated, make the first deployment
from your machine with npm run deploy.
We do this so that the dynamic endpoints are known to us; we have a logical dependency
on these when it comes to our test automation.
4. Update references
Next, update the environment value in serverless.yml (around lines 35-36) to reflect
your Mockachino endpoint:
environment:
TOGGLES_URL: https://www.mockachino.com/YOUR_RANDOM_ID/toggles
Next, also update the following files to reflect your Mockachino endpoint:
• jest.env.js (line 2)
Continue by updating the following files to reflect your FakeUser endpoint on AWS:
• api/schema.yml (line 8)
• tests/load/k6.js (line 6)
If you connect this repository to GitHub you will be able to use GitHub Actions to run a
sample CI script with all the tests, deployments, and stuff. The CI script acts as a template
for how you can tie together all the build-time aspects in a simple way. It should be easily
portable to whatever CI platform you might otherwise be running.
You’ll need a few secrets set beforehand if you are going to use it:
• FAKE_USER_ENDPOINT: Your AWS endpoint for the FakeUser service, in the for-
mat https://RANDOM.execute-api.REGION.amazonaws.com/shared/fakeUser
(known after the first deployment)
• BUMP_TOKEN: Your token for Bump which will hold your API docs (just skip if you
don’t want to use it; also remove it from the CI script in that case)
If you have this repo in GitHub you can also very easily connect it through Cloudflare
Pages to deploy the documentation as a website generated by TypeDoc.
You need to set the build command to npm run build:hosting, then the build output
directory to typedoc-docs.
You can certainly use something like Netlify if that’s more up your alley.
Now that all of the configuration is done, you can deploy the project manually or through CI.
Great work!
My solution pitch
We're going to cut down on environment sprawl by using a single, dynamic environment instead of multiple hardware-segregated environments (like dev, staging, prod setups) to deliver our stable, high-quality, and observable application.
The only other environment we’ll use is a non-user-facing CI environment for non-
production testing.
We’ll use feature toggles and canary releases to do this safely in our single produc-
tion environment, now being able to separate a (business-oriented) release from a mere
technical deployment.
• We can run different sets of functionality based on whether you are a "trusted" user, or deliver a default feature set if you are not.
• We can roll out the application over a period of time, and stop delivering it (and roll back) if we encounter errors.
• We top it off by using built-in observability in AWS to monitor our application and
become alerted if something severe happens.
Later in this guide, you can read more about the implementation patterns and how they
are organized into three areas (quality, stability, observability).
The application
• The second version (beta) includes a name and some other fields retrieved from
an external API, plus a profile image (of a cat) from another external API.
• Finally, a new feature is developed on top of the second version, using a third exter-
nal API. This feature is hidden under a feature toggle named enableNewUserApi.
Based on the user toggles, the service calls out to the following external APIs:
• JSONPlaceholder @ https://jsonplaceholder.typicode.com/users
• RandomUser @ https://randomuser.me/api/
The feature toggles are fetched, as has been stated previously, from Mockachino, where
you will have to create an endpoint with the toggles payload.
Tech
This section presents the tech stack of this project and other purely technical concerns.
• Technical components
• Architecture diagrams
• API documentation
Technical components
• Optional: Bump
• AWS API Gateway, to route incoming requests, handle request validation and au-
thorize the user
Architecture diagrams
Diagrams that try to make sense of the various views of the solution.
Solution diagram
Deployment diagram
Microservices
FakeUser
• Controllers: FakeUserController
• Usecases: createFakeUser
• Frameworks
• Config: endpoints, userMetadata

FakeUserBasic
• Controllers: FakeUserBasic

FeatureToggles
• Controllers: FeatureTogglesController, AuthController
• Usecases: getUserFeatureToggles
• Config: userPermissions
API documentation
These are API docs that apply to the full “finished” state.
Fake user
Required headers:
X-Client-Version: 1 | 2
GET {{API_URL_BASE}}/shared/fakeUser
Feature toggles
Send the user name (email) of a user to get their feature toggles. An unknown (or missing
etc.) user will return default toggles.
POST {{API_URL_BASE}}/shared/featureToggles
{
"userName": "[email protected]"
}
Weighing between a hardware- or a software-oriented approach
Let’s go through pros and cons for a traditional vs “modern” approach to environments.
• You start expecting that all systems are in similar, co-deployed stages
• The implicit reasoning starts becoming that you “should” or “can” only have a low
degree of variability in configuration
• There may be significant cost overhead with a higher count of static environments
• There is most likely a significant complexity overhead with a higher count of static
environments
• Scales to more intricate and realistic scenarios (such as testing system X in mode
A with customer type B in configuration D etc.)
• Will become harder to work with in the local scope (i.e. the actual code), and more
so if there are many branches
Yes, you can mix these patterns with a hardware-separated environment
You can certainly use the patterns seen in this project in a more "traditional" hardware-separated environment. However, the benefits become more pronounced as you also shed some of the overhead and weight of classical environments.
Quality
This section represents overall quality-enhancing activities that can be done to ensure
your product is built with a solid engineering foundation.
• SOLID principles
• Clean architecture
• Trunk-Based Development
• Test-Driven Development
• Generate documentation
• Test automation in CI
• Synthetic testing
• Automated scans
Make your processes known
Any non-trivial software development context requires some form of common ground to keep all of the work together, and for the team to correctly claim they have done their preparation work.
Example 1: See PROJECT.md which is the start page for the generated website. It uses
a basic template structure to aid in a team providing the right information. The file
provides (among other things) information on the project governance model with key
roles in the project and their responsibilities, outlines the requirements process, and
points out where and how to follow work on the tasks/requirements. This file is just an
opinionated starter to ensure that base-level questions around the project/product are
answered.
Example 3: Reading CONTRIBUTING.md makes it clear how the team, and external parties,
can contribute to the code base. This document also states some basic principles around
code standards, code reviews, bug reporting, and similar “soft tech” issues. Don’t under-
estimate the need to be clear on expectations, whether these are highly detailed nitpicky
bits or general guidance: Your project will do better with a good contribution document.
See Mozilla’s guide for more information.
Example 4: For our CODE_OF_CONDUCT.md, we are using the Contributor Covenant, which
is one standard to communicate baseline values and norms. Of course, enforcement and
such are still in your hands. While it’s made for open source circles, something like this
makes sense to have in corporate contexts as well.
SOLID principles
The very first (technical) thing is to respect that good code, regardless of programming language, is good code even over time. Follow wise conventions like SOLID to guide your daily work.
Information
See for example this Stack Overflow article or Khalil Stemmler’s write-up for a
concise introduction.
Example: In our project, one example of the dependency inversion principle comes into play when calling the betaVersion() function in src/FakeUser/controllers/FakeUserController.ts, as we send in the toggles for it (and createFakeUser()) to use. Because this happens already in the controller or bootstrapping phase of the application, we ensure that the dynamic values (i.e. the toggles) are always present throughout the full call chain without the need to import them deeper inside the app.
/**
 * @description Handle the new (v2) beta version.
 */
async function betaVersion(
  toggles: Record<string, unknown>
): Promise<APIGatewayProxyResult> {
  const response = await createFakeUser(toggles); // Run use case

  return {
    statusCode: 200,
    body: JSON.stringify(response),
  };
}
The single responsibility principle should hopefully also be evident throughout most of
the code.
Clean architecture
The second thing, and very much cross-functional in regards to quality and stability, is
having an understandable, concise and powerful software architecture.
Example 1: You can see a clear taxonomy for how the overall project and the microser-
vices are organized by simply browsing the folder structure and seeing how code is linked
together. Let’s look at the FakeUser service:
FakeUser
└───config
└───contracts
└───controllers
└───entities
└───frameworks
└───usecases
What we’re seeing is a somewhat simplified Clean Architecture structure. One of several
tenets of Robert Martin’s Clean Architecture concept is to produce acyclic code. You can
see that there are no cyclical relations in the Arkit diagrams. This, among other touches,
means that our code is easy to understand, easy to test and debug and that it is easy to
make stable, almost entirely by just logically organizing the code!
Like Martin, I’m also taking cues from Domain Driven Design, where we use the “entity”
concept to refer to “rich domain models”, as opposed to anemic domain models:
Information
Read more about the Anemic Domain Model anti-pattern on Martin Fowler’s
site.
Example 2: While the examples in the code part of this project may be a bit contrived
(bear in mind that they need to balance simplicity with meaningful examples), you can
see how the User entity (at src/FakeUser/entities/User.ts) has not just data, but also
business logic and internal validation on all the different operations it can perform. There
is no need to leak such internal detail anywhere else; the only thing we add to that scenario
is that we externalize the validation logic so that those functions can be independently
tested (for obvious reasons private class methods are not as easily testable).
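To make that concrete, here is a minimal sketch of the idea (not the project's actual User entity): the validation function lives outside the class so it can be tested on its own, while the entity still enforces it on every operation.

// Hypothetical, simplified illustration of a "rich" entity with externalized validation
export const validateName = (name: string): boolean =>
  typeof name === "string" && name.length > 0 && name.length <= 100;

export class User {
  private name = "";

  public applyName(name: string): void {
    // The entity guards its own invariants on every operation
    if (!validateName(name)) throw new Error("Invalid name!");
    this.name = name;
  }

  public viewUserData(): Record<string, unknown> {
    return { name: this.name };
  }
}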
Information
To keep it short here, I’ll just refer to Robert Martin’s original post on clean ar-
chitecture and Clean architecture for the rest of us for more details. Also, see
Khalil Stemmler’s article on how CA and DDD intersect if that floats your boat.
Clean architecture isn’t a revolutionary concept: it’s just the best and most logical realiza-
tion (I feel) so far for questions around code organization that have lingered for decades.
Baseline tooling and plugins
The collective impact of several (on their own) small tools can make the difference be-
tween misery and joy very tangible.
This project uses two very common—but hugely effective—tools, namely ESLint and Pret-
tier. These two ensure that you have a baseline, pluggable way of ensuring similar stan-
dards (and automation of them) across a team. I’d not write many lines without those
tools around.
Really, one of the very first things you want to make sure of is that the code looks and
reads the same, regardless of who wrote it. Using these tools, now you can.
Success
Don’t forget to enable “fix on save”! Also, consider the VS Code plugins for ES-
Lint and Prettier if you are using VS Code.
When it comes to more IDE-centric plugins in the security department, I highly recom-
mend the Snyk Vulnerability Scanner (the successor to vulncost) for Visual Studio Code.
Other nice ones include:
• Checkov
• DevSkim
Example: You'll see that this project has configuration files such as .eslintrc and .prettierrc lying around.
Continuous Integration and Continuous Deployment
We’re at the very basics of classic agile (and extreme programming) when we state that
Continuous Integration (or CI) and Continuous Delivery or Deployment (CD) are things
to strive for.
However, even 20 years later it still seems like these notions are pretty far away in many organizations. To be frank, the problem is, as Atlassian writes, that "agile isn't agile without continuous delivery".
Information
For more on the relation between delivery and deployment (and more), read this
article by Atlassian.
Success
You can take a simple test, called the DORA DevOps Quick Check to check your
DORA performance metrics.
When it comes to our practices, we can make meaningful steps towards this by opting
for smaller releases and having a limit on work-in-progress. This keeps the focus tighter,
features smaller, and the release cadence flowing smoother and faster. Everyone wins.
Example: You can see that we have CI running in GitHub with our script located at .github/workflows/main.yml. Every commit runs the full pipeline and deploys new code to AWS. The other "soft practices" are somewhat less relevant for a single author and can't be easily demonstrated.
CI/CD is nowadays 20% a technology problem (good and easy tooling is available in
abundance) and 80% a people and process problem. Using DORA metrics and CI/CD,
you can start setting a high mark on the objective delivery improvements that these tools
and practices help enable when compared to traditional waterfall teams. This is of course
contingent on you having a situation that allows such flexibility of operating (potentially)
differently than your organization already does things.
Refactor continuously ("boy scout rule")
First, what is refactoring exactly? In the words of Martin Fowler, who wrote one of the
definitive books on the subject,
It takes perseverance, good communication, and business folks that truly understand
software to get dedicated time for improving the solutions we work on. Most people will
unfortunately not work in teams with dedicated refactoring time. What to do, then?
Rather than think of "change" as a drastic, singular, large-scale, and long-term event, it's better to see change as a stream of small, manageable events that we can shape. In his classic book Clean Code, Robert Martin wrote about the notion that every change in a codebase should also include some form of improvement: the boy scout rule, which in the engineering context means "always leave the code better than you found it".
Information
Read a short summary here and a longer article on continuous refactoring here.
Moreover, the “boy scout rule” is definitely colored by other (at the time) contemporary
management ideas like kaizen that work well in agile/lean contexts.
But what about our early work, when we are just starting on a new feature or product?
In that case, I personally love, and truly resonate with, Martin’s notion to start with
“degenerate tests” (also in the book, Clean Craftsmanship):
We begin with the degenerate tests. We return an empty list if the input list
is empty, or if the number of requested elements is zero. […] Note that I am
following the rule of gradually increasing complexity. Rather than worrying
about the whole problem of random selection, I’m first focusing on tests that
describe the periphery of the problem.
We call this: “Don’t go for the Gold”. Gradually increase the complexity of
your tests by staying away from the center of the algorithm for as long as
possible. Deal with the degenerate, trivial, and simple administrative tasks
first.
Practically it means that we start coding and testing from a perspective where the bound-
ary functionality works, but has no complexity or elegance. Then, bit by bit, we add
necessary complexity (dealing with harder problems more realistically) while subtract-
ing unintended complexity (using SOLID principles, etc.) so we end up with something
that worked early on, yet evolved into a coherent and good piece of work.
It would be wrong to assume that all code necessarily has this evolutionary spiral into
something better—in fact, I think it’s correct to say that most code, unfortunately, grows
worse. Again, we must remember that all code is a liability. It is the efficient pruning
and nurturing, methodically done, that allows code to actually grow better with time.
Refactoring is the key to this, and we should do it as early and often as possible, ideally
within minutes of our first working code.
Example: Hard to point to something “post-fact”, but every single bit has been contin-
uously enhanced and refactored (sometimes removed) since starting this project. This
very book has, as well, gone from a README file to becoming a full Gitbook project!
Information
Go ahead and check out Refactoring.guru for lots of ways to approach making
practical code improvements. Also, see the reference list at the end for even
more materials.
Trunk-Based Development
There are two main patterns for developer teams to work together using ver-
sion control. One is to use feature branches, where either a developer or a
group of developers create a branch usually from trunk (also known as main
or mainline), and then work in isolation on that branch until the feature they
are building is complete. When the team considers the feature ready to go,
they merge the feature branch back to the trunk.
The second pattern is known as trunk-based development, where each devel-
oper divides their own work into small batches and merges that work into the
trunk at least once (and potentially several times) a day. The key difference
between these approaches is scope. Feature branches typically involve multi-
ple developers and take days or even weeks of work. In contrast, branches in
trunk-based development typically last no more than a few hours, with many
developers merging their individual changes into the trunk frequently.
For me, Trunk Based Development (TBD) captures an essential truth: That most branch-
ing models are just too complicated, and have too many adverse effects when it comes
to merging code and staying agile. Even if certain models (IMHO) balance those aspects
fairly well, like GitHub Flow (not to be mixed up with GitFlow!), nothing is simpler and
more core to the agile values than TBD: Just push to main (or master if you aren’t up with
the times) and trust your tooling to handle it.
Danger
Note that even Vincent Driessen, the original conceiver of GitFlow, nowadays
actively discourages the use of GitFlow in modern circumstances.
Trunk Based Development is worth reading about, if nothing else because it seems mis-
understood by some.
Example 1: I can’t easily point to some evidence here, but the full history of this project
(both the code and the book/guide) has been handled this way.
The pros and cons of TBD are of course only truly, fully visible when seen together with the other practices and tools you have. There needs to be a certain maturity in place before doing this while remaining safe. This project should represent perfectly valid conditions under which TBD can be used instead of less agile strategies.
Success
You can usually set up various branching strategies and restrictions in your CI
tool, to effectively require TBD as part of the workflow.
Example 2: To truly support TBD, I've added husky to run pre-commit hooks for various activities, such as testing. This way we get to know even before the code is "sent off" if it's in shape to reach the CI stage.
Test-Driven Development
Test-driven development is a practice that a lot of people swear by. I’m a 50/50 person
myself—sometimes it makes sense to me, sometimes I just write the tests after I feel I’m
out of the weeds when it comes to the first implementation. No need to be a fundamen-
talist; be a pragmatist! The important thing is that there are tests, not so much how and
when they came. I’d still note that for this advice, I will assume that you have some kind
of rigid standards—it’s just too easy to skip the tests!
However, in case you want to be a good-spirited TDD crusader then I’ve made it easy to
do so.
Example: Just run npm run test:unit:watch and Jest will watch your tests and source
code. You can also modify what Jest “watches” in the interactive dialog.
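If you want a starting point for that watch loop, a first test could be as small as the sketch below. It targets the basic handler from the scenario earlier in this book, and the import path is an assumption rather than the project's actual test layout.

// Hypothetical first unit test for the basic "fake user" handler
import { handler } from "../../src/FakeUserBasic/index"; // Path is an assumption

describe("FakeUserBasic", () => {
  test("It should return a 200 response with a name", async () => {
    const response = await handler();
    expect(response.statusCode).toBe(200);
    expect(JSON.parse(response.body).name).toBeDefined();
  });
});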
Generate documentation
Documentation… Love it or hate it, but professionally produced software absolutely re-
quires documentation at least as good as the software itself.
• API documentation
So being lax about what types of docs we are talking about is maybe not too helpful. But
we can slice the cake differently, and try to bucket the above into two major categories:
We should attempt to reach the point where as much as possible of the documentation can
be generated and output through automation. Good candidates for automated document
generation would be dependency graphs, some of our architecture views, and technical
documentation from our comments, types, and code structure.
Now let’s look at how we’ve done a bit in the code part of this project.
Example 1: Open up any TS file in the src folders. Here, we use the JSDoc standard for documenting source code. Since we are using TypeScript, the style I am using discards "params", instead mostly focusing on the descriptive and referential aspects of JSDoc.
/**
 * @description Check if JSON is really a string.
 * @see https://stackoverflow.com/questions/3710204/how-to-check-if-a-string-is-a-valid-json-string-in-javascript-without-using-try
 */
const isJsonString = (str: string): Record<string, unknown> | boolean => {
  try {
    JSON.parse(str);
  } catch (e) {
    return false;
  }
  return true;
};
Example 2: Then, we use TypeDoc for generating docs from the source code. You can
generate documentation into the typedoc-docs folder by running npm run docs. These
are the documents uploaded to the static website as well. This brings us full circle on
certain documentation being automatically updated and uploaded, as soon as we push
any changes!
Example 3: For diagrams, we use Arkit for generating diagrams from your software
architecture. As with the docs, these are also generated with npm run docs and presented
on the published website.
Example 4: Taken together, this means that you can easily make rich documentation
available in a central, visible location (like a static website in this case) in the CI stage.
See .github/workflows/main.yml for the CI script itself. Note that Cloudflare Pages is
set up outside of GitHub so you won’t see too much of that integration in the script.
Unit testing (and more)
Information
More code is bad. Less code is good. Only add code when the other options
have been exhausted. This also applies to tests.
You are writing tests to increase your confidence in your system. If you are
not confident about aspects of your system, do more testing. When you have
to make compromises due to time and budget constraints, always prioritize
testing the areas where you need the most confidence.
Testing is fundamentally about building confidence in our code. We need to have enough
tests to accurately be able to say that our code is covered for its critical use-cases and that
we feel confident about the code (and tests!) we wrote.
Example: You’ll see that the tests under tests/ are segmented into several wider cat-
egories. The tests under tests/unit are for the individual relevant layers such as con-
trollers, entities, and frameworks.
If you are writing tests after having created the initial code—functions, classes, etc—then a
good (and fast!) way to create baseline coverage and confidence is by writing tests for the
high-level layers, such as the controllers. Some people call these fuller, outer-boundary
tests “component tests” since they address full use cases that likely encapsulate multiple
smaller functions.
Testing this way, you have the immediate benefit of understanding your code as an API,
as this is the closest to—or even the exact same—code that will be “the real deal” going
out into production. These tests, therefore, tend to be very similar when writing for the
integration and contract test use-cases: After all, they all try to address the same, or at
least similar, things through more or less the exact same API.
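As an illustration, a component-style test of the FakeUser controller might look like the sketch below. The event shape and import path are assumptions, based on the headers and authorizer context described elsewhere in this book.

// Hypothetical component-level test: exercise the controller through its public handler
import { handler } from "../../src/FakeUser/controllers/FakeUserController"; // Path is an assumption

test("It should return a fake user for a version 1 client", async () => {
  // Minimal stand-in for an API Gateway event
  const event: any = {
    headers: { "X-Client-Version": "1" },
    requestContext: { authorizer: { principalId: "[email protected]" } },
  };

  const response = await handler(event);
  expect(response.statusCode).toBe(200);
});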
Information
Spending even a bit of time raising the overall coverage to 90% or more will pro-
vide valuable confidence to you and your team. Remember that there are dimin-
ishing returns after a certain point, and you should feel comfortable about just
leaving some things untested, especially so if the uncovered areas are not able to
create meaningful problems.
For “frameworks” (such as utilities and deterministic calculations) these should be writ-
ten so that they are very easy to test in isolation. These tend to be the easiest and least
messy parts to test. On the other hand, though, they are functionally the inverse of broad
tests: Less messy, but only covers a small subset of your overall codebase. Remember to
test both “success” (positive) and “failure” (negative) states as far as logically possible
and meaningful.
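A tiny sketch of what that can look like for a validator-style function follows; the function's exact name and location are assumptions, but the pairing of positive and negative cases is the point.

// Hypothetical positive and negative tests for a small validator in the "frameworks" layer
import { validateClientVersion } from "../../src/FakeUser/frameworks/validateClientVersion"; // Path is an assumption

describe("validateClientVersion", () => {
  test("It should accept a known client version (positive state)", () => {
    expect(validateClientVersion("1")).toBe(true);
  });

  test("It should reject an empty client version (negative state)", () => {
    expect(validateClientVersion("")).toBe(false);
  });
});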
If you later discover unknown test cases that need to be added, to guard against those,
we just add them to the unit test collection. Some people call these “regression tests”—
though I never call anything a regression test, as the test collection overall acts as a re-
gression suite—but it all leads to the same effect in the end.
And if there is something testers, QAs, and people in the testing sphere seem to love,
it’s semantics. Screw the semantics and go for confidence instead, using all the tools you
need to get there!
Test automation in CI
First, let’s look at what I personally call “defensive testing”, that is, any typical testing
that we do to test the resiliency and functionality of our solution.
We need to establish clear boundaries and expectations of where our work ends, and someone else's work begins. I really wish this were a technical topic with clear answers, but the more I work with people, the more I keep getting surprised by how little teams do to precisely define their boundaries.
Information
It's well worth understanding (and practically implementing!) the concept of a bounded context in your product/project.
Example 2: Next up we set up contract testing using TripleCheck CLI. Contract testing
means that we can easily verify if our assumptions of services and their interfaces are
correct, but we skip verifying the exact semantics of the response. It’s enough that the
shape or syntax is right.
Information
For wider scale and bigger system landscapes, consider using the TripleCheck
broker to be able to store and load contracts from a centralized source.
Example 3: During the CI stage we will deploy a complete, realistic stack with the most
recent version. First, we’ll do some basic smoke tests to verify we don’t have a major
malfunction on our hands. In reality, smoke tests are just lighter types of integration tests.
Example 4: When it comes to actual integration testing of the real service we’ll do it after
we’ve seen our smoke tests pass. My solution is a home-built thingy that makes some
calls and evaluates the expected responses with the received data using ajv.
See tests/integration/index.ts.
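The gist of such a home-built check is small enough to sketch here. Assume the endpoint, headers, and expected shape below are illustrative rather than the project's actual ones.

// Minimal sketch: call the deployed endpoint and validate the response shape with ajv
import Ajv from "ajv";
import fetch from "node-fetch";

const expectedShape = {
  type: "object",
  properties: {
    name: { type: "string" },
  },
  required: ["name"],
};

async function checkFakeUser(endpoint: string): Promise<void> {
  const response = await fetch(endpoint, { headers: { "X-Client-Version": "1" } });
  const data = await response.json();

  const validate = new Ajv().compile(expectedShape);
  if (!validate(data))
    throw new Error(`Unexpected response: ${JSON.stringify(validate.errors)}`);
}

checkFakeUser(process.env.FAKE_USER_ENDPOINT || "").catch((error) => {
  console.error(error);
  process.exit(1);
});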
Example 5: See .github/workflows/main.yml for the CI script. These scripts are not
magic – get acquainted with them, and just as with regular code, make these easy to read
and understand.
You’ll see all the overall steps covered sequentially (I’ve cut out all non-name informa-
tion):
If that battery doesn't cover your needs you can just spend a bit of time extending it with your specifics. However, this script should provide a more than ample basis for production circumstances.
Synthetic testing
Synthetic testing sounds cool and intriguing and vaguely futuristic, but it’s really no more
than a directed stream of traffic from a non-human source, such as a computer.
Running synthetic traffic is something that can certainly stress the existing instance/server/machine/environment, but in our case we could also leverage it during the deployment window (when rolling out a canary deployment) to more fully exercise the system, flushing out any issues. This would also give us a higher probability of hitting any issues, which we may not do during a (for example) 10-minute window with low organic traffic.
Example: We can use load testing to run various larger-scale synthetic traffic volumes as
one-offs to stress the API. Under tests/load/k6.js you can see our k6 script that we run
in CI. The use-case we have here is to ensure that our system responds correctly when
provided with a range of inputs rather than just doing the quick poke-and-feel we did
with the smoke test.
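For orientation, a k6 script for that kind of use-case can be as small as the sketch below (illustrative, not the project's actual tests/load/k6.js; the environment variable name is an assumption).

// Hypothetical k6 load test: exercise the endpoint with a handful of virtual users
import http from "k6/http";
import { check } from "k6";

export const options = {
  vus: 10, // Ten concurrent virtual users...
  duration: "30s", // ...for thirty seconds
};

export default function () {
  const response = http.get(`${__ENV.FAKE_USER_ENDPOINT}`, {
    headers: { "X-Client-Version": "1" },
  });

  check(response, {
    "status is 200": (res) => res.status === 200,
  });
}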
Synthetics can be run with almost anything: Examples include a regular API client like
Insomnia, a load testing tool like k6 or Artillery, a SaaS product like Checkly, or a managed
tool like AWS CloudWatch Synthetics. Even curl works fine! If you want to set something
up to continuously run against your endpoint, I recommend using an easy-to-use tool like
CloudWatch Canaries or Checkly, directing them to your endpoint.
Automated scans
In this age of DevOps, we don’t want to forget the security portion of our responsibility.
Perhaps the more proper term is DevSecOps, which is more and more making headway
in the developer community.
To support Dev(Sec)Ops, a best practice is to use various types of scans to automate often
boring, sometimes hard, sometimes also mandated requirements e.g. for compliance and
security aspects.
Since there is no DevOps without automation, the tooling we adopt needs to work both in CI and locally, and provide meaningful confidence and improvements to our delivery. GitLab writes about their view on "5 benefits of automated security", which they summarize as:
There exists a lot of tooling in this space these days. We’re going to use some of the more
well-known, free and open-source options.
Example 1: We use Trivy to check for vulnerabilities in packages (among other places).
Example 2: Then, we use Checkov to scan for misconfigurations, and also create an
infrastructure-as-code SBOM (“software bill-of-materials”).
Generate a software bill of materials
Answering the question "What went into making your software, besides swearing and broken deadlines?"
We need to understand what our software is composed of—this is called “software com-
position analysis” (SCA).
For certain cases (such as regulated industries) this is extremely important, down to the
requirement of knowing each and every dependency and what they themselves are built
out of… For our case, though, we are creating the SBOM to understand at “face value”
what software (and risks) we are bundling together.
Example: We create a Software Bill of Materials (or SBOM) similarly to how we ran au-
tomated scans, except this time we do it for our packages using Syft. This is, yet again,
visible together with the other tools running in the CI script.
Open source license compliance
Make sure you make all the open-source angels out there happy by complying with obli-
gations and license restrictions.
Open source is fantastic, you don’t need me to tell you about it! However, consuming
(and sometimes redistributing) open source is not always a very trivial matter. Especially
when we start having to comply with license obligations, like providing license files, at-
tributing people, and so on.
Information
I’ve previously written on this topic on Medium, Open source license compli-
ance, the TL;DR version. Some other good resources for open source and licens-
ing include:
• TLDRLegal
To deal with potential legal issues, we’ll set up checks to allow only permissive, good,
and well-established open-source licenses to be used in our own software.
Example: In package.json we have two scripts that run as part of the pre-commit hook
and in CI:
These verify that we only use a set of allowed open-source licenses, using license-compliance,
and can also use license-compatibility-checker to check for compatibility between our li-
cense and our used ones.
Because we are doing server-side applications (i.e. a backend or API), we are not redis-
tributing any code, making our obligations easier to handle and less messy. Webpack will
bundle all licenses as well, so we should be all set.
Release versioned software and produce release notes
Each release should be uniquely versioned. We should also keep release notes, but it’s
one of those things that can be easy to miss.
…I really wish this was something more enterprise teams did, but I find it to be less
common than in the open-source or product development communities. It kind of makes
sense, but this needn’t be the case.
So, how do we make it easy “to do the right thing”? We’ll add a tool to help us!
Example: Instead of writing manual release notes, we use Standard Version to output
them for us from our commits into the CHANGELOG.md file. Practically, this works simply
by running npm run release.
Information
This should be easy and powerful enough to help set an example in this area.
Stability
It's easy to go overboard with tooling in the first portion of a product's lifecycle, but we mustn't lose track of providing a stable experience, even as we work on the product.
This section brings out a battery of ways in which we can continue from the overall
quality-enhancing practices to delivering without flinching.
• API schema
• Branch by abstraction
• Beta functionality
• Canary deployment
Lifecycle management and roadmap
More and more companies and products are using public roadmaps (see for example GitHub’s
public roadmap). Public or not, ensuring that other enterprise teams and similar stake-
holders know what you are doing should be considered a bare minimum.
We can follow a basic convention where we divide an API's lifecycle into design, lifetime, sunset, and deprecation phases.
Information
Refer to the pattern Aggressive Obsolescence and articles by Nordic APIs and
Stoplight.
Beyond these, we add the notion of being removed which is the point at which a given
feature has been completely purged from the source code.
Example: We’d keep a roadmap like the below in our docs. Imagine that today is 25
November 2021 and see below what our codebase would represent at that point in time:
This takes only a few minutes to set up, but already gives others clear visibility into the
plans and current actions affecting your API.
API schema
Let’s make an API schema, unless you want others to literally have to conduct black box
penetration testing to understand your API.
An API schema describes our API in a standardized way. API schemas can be validated,
tested, and linted to ensure that they correspond to given standards. It’s important to
understand that in most cases the API schema is not the API itself.
• Or go with services like Stoplight, Bump, Readme, and API clients like Insomnia
that sometimes have capabilities to design APIs, too.
When you actually have a schema, make sure to make it accessible and visible
(that’s our reason for using Bump in the code part of this book).
There are a few ways to think about schemas, like "API design-first", in which we design the API and generate the actual code from the schema. Our way is more traditional since we create the code and keep the schema mostly as a representation of the implementation (however, a very important representation!).
Example: See api/schema.yml for the OpenAPI 3 schema. Since our approach is manual, we have to implement any security and/or validations on our end in code. In our case, this is both for ingoing and outgoing data. Ingoing data can be seen handled at src/FakeUser/controllers/FakeUserController.ts in checkInput(), and outgoing data is handled in src/FakeUser/entities/User.ts and its various validation functions like validateName().
/**
 * @description Check and validate input.
 */
function checkInput(event: APIGatewayProxyEvent): string {
  const clientVersion =
    event?.headers["X-Client-Version"] || event?.headers["x-client-version"];
  const isClientVersionValid = validateClientVersion(clientVersion || "");

  const userId = event?.requestContext?.authorizer?.principalId;
  const isUserValid = validateUserId(userId || "");

  if (!isClientVersionValid || !isUserValid)
    throw new Error("Invalid client version or user!");
Success
Strongly consider using security tooling like 42Crunch’s VS Code plugin for Ope-
nAPI. Note also that because this is intended as a public API, the OAS security
object is not present.
Information
For GraphQL, consider if something like Apollo Studio might be a way to cover
this area for your needs.
API schema validation
AWS API Gateway offers request (schema) validation. Schema validation is done with
JSON schemas which are similar to, but ultimately not the same as, OpenAPI schemas.
These validators allow our gateway to respond to incorrect requests without us needing
to do much of anything about them, other than provide the validator. Also, we have the
benefit of our Lambda functions not running if the in-going input is not looking the way
we expect it to.
Once again, since we have a manual approach, any validation schemas need to be handled
separately from our code and the OpenAPI schema.
Information
We only use validators for POST requests, which means the FeatureToggles
function is in scope, but not the FakeUser function.
FeatureToggles:
  handler: src/FeatureToggles/controllers/FeatureTogglesController.handler
  description: Feature toggles
  events:
    - http:
        method: POST
        path: /featureToggles
        request:
          schema:
            application/json: ${file(api/FeatureToggles.validator.json)}
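The referenced validator file is a plain JSON Schema document (API Gateway validators use JSON Schema draft-04). A minimal sketch of what api/FeatureToggles.validator.json could contain, based on the request body shown earlier, might be:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "FeatureToggles request validator",
  "type": "object",
  "properties": {
    "userName": {
      "type": "string"
    }
  },
  "required": ["userName"]
}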
API client version using headers
One typical way to define expectations on an API is to use versioning. While there are
several ways to do this—for example, refer to this article from Nordic APIs—we are going
to use a header to decide which API backend to actually activate for the request.
/**
 * @description Check and validate input.
 */
function checkInput(event: APIGatewayProxyEvent): string {
  const clientVersion =
    event?.headers["X-Client-Version"] || event?.headers["x-client-version"];
  const isClientVersionValid = validateClientVersion(clientVersion || "");

  const userId = event?.requestContext?.authorizer?.principalId;
  const isUserValid = validateUserId(userId || "");

  if (!isClientVersionValid || !isUserValid)
    throw new Error("Invalid client version or user!");
So, while it may be non-standard, in this context version 2 of the API represents the beta,
meaning that version 1 represents the current (or “stable”, “old”) variant.
With this, we have created a way to dynamically define our response simply through a
header, without resorting to separate codebases or separate deployments. No need for
anything more complicated, as long as we handle this logic in a well-engineered way.
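From a client's perspective, opting in to the beta is then nothing more than setting the header. A sketch of such a call, using the endpoint format from the API documentation earlier in this book:

// Hypothetical client call: ask for version 2 (the beta) of the fake user API
async function getBetaFakeUser(apiUrlBase: string) {
  const response = await fetch(`${apiUrlBase}/shared/fakeUser`, {
    headers: { "X-Client-Version": "2" },
  });
  return response.json();
}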
Branch by abstraction
Get rid of heavy-handed branching and just branch in code instead. But how?
How do we do better than using branches? Well… not using branches! But how to deal
with changes bigger than we want to contain in a single commit?
Paul Hammant seems to be the originator, if not of the pattern, then at least of the term.
He’s also clear on this being smarter than using multiple branches.
This pattern works especially well when making significant changes to existing code. I
might be harsh here, but there might well be severe code smells already present, since
abstracting this way should be easy with well-engineered and nicely separated code.
It’s all pretty simple, actually. The full eight steps are:
Example: While our example might be too lightweight, and involves "new" rather than old code, we do have a "beta" version and a "current" version as two code paths (see src/FakeUser/controllers/FakeUserController.ts, lines 37-44), abstracted in the controller.
This is src/FakeUser/controllers/FakeUserController.ts:
/**
* Run current version for:
* - Legacy users
* - If missing version header
* - If version header is explicitly set to an older version
*/
if (
!clientVersion ||
parseFloat(clientVersion) < BETA_VERSION ||
toggles.userGroup === "legacy"
)
return currentVersion();
// Run beta version for everyone else
else return await betaVersion(toggles);
If you need a hint, the encapsulation of the versions into their own “use-cases” makes it
very easy to package completely different functionality into the same deployable artifact.
Beta functionality
We have built in a “beta functionality” concept that is propagated from feature toggles
into our services. This is a catch-all for new features that we want to test, and which may
not yet be ready for wider release. This means that our services need to have distinct
checks for this, though.
As you can see in the next section on feature toggles, we can also use user groups to
segment features, as well as classic, individual toggles that can be used on a per-user
basis.
What gives? Aren't these beta features just like any other toggles? Yes. In this project, we define "beta features" as a feature-level, wide bucket across user groups, while user grouping is a dynamically wide bucket (basically just audience segmentation). This way, we can granularly define both user groups and beta usage.
/**
 * @description This is where we orchestrate the work needed to fulfill our use case "create a fake user".
 */
export async function createFakeUser(
  toggles: Record<string, unknown>
): Promise<UserData | UserDataExtended> {
  // Use of Cat API is same in all cases
  const user = new User(toggles.enableBetaFeatures as boolean);
  const imageResponse = await getImage("CatAPI");
  user.applyUserImageFromCatApi(imageResponse); // <-- Rich entity object has dedicated functionality for differing data sources

  if (toggles.enableNewUserApi) {
    const dataResponse = await getData("RandomUser");
    user.applyUserDataFromRandomUser(dataResponse); // <-- Rich entity object has dedicated functionality for differing data sources
  }
  // Else return regular response
  else {
    const dataResponse = await getData("JSONPlaceholder");
    user.applyUserDataFromJsonPlaceholder(dataResponse); // <-- Rich entity object has dedicated functionality for differing data sources
  }

  return user.viewUserData();
}
It’s time to go from static configs to dynamic configurations and spend less time deploy-
ing.
The code part of this book uses a handcrafted, simple feature flags engine and a small
toggle configuration.
While a technically trivial solution, this enables a dynamic configuration that can be de-
tached from the source code itself. In effect, this means we are one big step closer to
separating a (technical) deployment from a (business-oriented) release.
Unknown (or missing, non-existing, null…) users get the standard user group access.
Information
Note that feature toggle services and implementations differ in exactly how they apply toggles and how they work. In our case, we apply group-level toggles rather than individual flags per request, if for nothing else than simplicity of demonstration.
Example: The full configuration you can use as your template looks like the one below.
Notice how it’s segmented on the group level:
{
  "error": {
    "enableBetaFeatures": false,
    "userGroup": "error"
  },
  "legacy": {
    "enableBetaFeatures": false,
    "userGroup": "legacy"
  },
  "beta": {
    "enableBetaFeatures": true,
    "userGroup": "beta"
  },
  "standard": {
    "enableBetaFeatures": false,
    "userGroup": "standard"
  },
  "dev": {
    "enableBetaFeatures": true,
    "userGroup": "dev"
  },
  "devNewFeature": {
    "enableBetaFeatures": true,
    "enableNewUserApi": true,
    "userGroup": "devNewFeature"
  },
  "qa": {
    "enableBetaFeatures": false,
    "userGroup": "qa"
  }
}
You can update the flags as you want, to get the effects you need, without redeploying the code!
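The flag engine itself is not reproduced here, but as a minimal sketch of how such a handcrafted engine can resolve toggles from the configuration above—the function and file names are my assumptions, not the repository's—it could be as simple as:

// Minimal sketch of a toggle resolver; paths and names are assumptions.
import toggleConfig from "./config/toggles.json"; // The configuration shown above

type Toggles = Record<string, unknown>;

/**
 * Resolve the toggles for a given user group.
 * Unknown (or missing) groups fall back to the "standard" group,
 * mirroring the behavior described earlier.
 */
export function getToggles(userGroup?: string): Toggles {
  const config = toggleConfig as Record<string, Toggles>;
  return config[userGroup || ""] || config["standard"];
}

// Example usage:
// const toggles = getToggles("devNewFeature");
// if (toggles.enableNewUserApi) { /* run the new code path */ }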
Information
You’ll often see tutorials and such talking about authentication, which is about how we
can verify that a person really is the person they claim to be. This tends to be mostly a
technical exercise.
Authorization, on the other hand, is knowing what this person is allowed to do.
Only trivial systems require no authorization, so prepare to think about how you want your model to work and how to construct your permission sets. Rather than a technical concern, this is more a question of logic and business rules than anything else.
Warning
functions:
  Authorizer:
    handler: src/FeatureToggles/controllers/AuthController.handler
    description: ${self:service} authorizer
  FakeUser:
    handler: src/FakeUser/controllers/FakeUserController.handler
    description: Fake user
    events:
      - http:
          method: GET
          path: /fakeUser
          authorizer:
            name: Authorizer
            resultTtlInSeconds: 30 # See: https://forum.serverless.com/t/api-gateway-custom-authorizer-caching-problems/4695
            identitySource: method.request.header.Authorization
            type: request
/**
 * @description Get user's authorization level keyed for their name. Fallback is "standard" features.
 */
function getUserAuthorizationLevel(user: string): string {
  const authorizationLevel = userPermissions[user];
  if (!authorizationLevel) return "standard";
  else return authorizationLevel;
}
And src/FeatureToggles/config/userPermissions.ts:
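The listing is not reproduced here, but it is essentially a permission map keyed on user identity. A rough sketch—the user names and groups below are placeholders, not the repository's actual data—could look like:

// Hypothetical permission map; keys are user identities, values are the
// user groups consumed by the feature toggles. Real data will differ.
export const userPermissions: Record<string, string> = {
  "[email protected]": "dev",
  "[email protected]": "devNewFeature",
  "[email protected]": "beta",
  "[email protected]": "qa"
};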
Information
Canary deployment
The traditional way to deploy software is as one huge chunk that becomes active the instant it lands on a machine. If the code is a complete failure, you have zero time to verify and correct this before the failure is apparent to users.
This notion is what makes managers ask for counter-intuitive things like code
freeze and all-hands-on-deck deployments. This is dumb and wrong and helps
no-one. Let’s forever end those days!
Matt Casperson, writing for The New Stack, deftly portrays the journey that many are now making toward a new "truth" about testing best practices:
[…] What I really wanted to do was leverage the existing microservice stack
deployed to a shared environment while locally running the one microservice
I was tweaking and debugging. This process would remove the need to reim-
plement live integrations for the sake of isolated local development, which
was appealing because these live integrations would be the first things to be
replaced with test doubles in any automated testing anyway. It would also
create the tight feedback loop between the code I was working on and the ex-
ternal platforms that validated the output, which was necessary for the kind
of “Oops, I used the wrong quotes, let me fix that” workflow I found myself
in.
My Googling led me to “Why We Leverage Multi-tenancy in Uber’s Microser-
vice Architecture”, which provides a fascinating insight into how Uber has
evolved its microservice testing strategies.
The post describes parallel testing, which involves creating a complete test en-
vironment isolated from the production environment. I suspect most devel-
opment teams are familiar with test environments. However, the post goes on
to highlight the limitations of a test environment, including additional hard-
ware costs, synchronization issues, unreliable testing and inaccurate capacity
testing.
The alternative is testing in production. The post identifies the requirements
to support this kind of testing:
There are two basic requirements that emerge from testing in production,
which also form the basis of multitenant architecture:
• Traffic Routing: Being able to route traffic based on the kind of traffic
flowing through the stack.
• Isolation: Being able to reliably isolate resources between testing and
production, thereby causing no side effects in business-critical microser-
vices.
Success
See these brilliant articles for more justification and why this is important to un-
derstand:
OK, so what can we do about it? In serverless.yml at line ~85, you'll see type: AllAtOnce.
FakeUser:
  [...]
  deploymentSettings:
    type: AllAtOnce
    alias: Live
    alarms:
      - FakeUserCanaryCheckAlarm
This means that we get a classic deploy === release pattern. When the deployment is
done, the new function version is immediately active with a clear cut-off between the
previous and the current (new) version.
There are considerations and problems with this approach. In our AWS circumstances,
running on Lambda, we won’t face downtime while the instance switches over, and even
a half-good PaaS solution won’t create massive headaches either.
Instead of being overly defensive, let’s simply embrace the uncertainty, as it’s already
there anyway.
Using a canary release is one way to get those unknown unknown effects happening with
real production traffic in a safe and controlled manner. This is where the (sometimes mis-
understood) concept testing-in-production really kicks in—trying to answer questions no
staging environment or typical test can address. Like a canary in the mines of old, our
canary will die if something is wrong, effectively stopping our roll-out.
• 90% of the traffic will pass to whatever function version that was already deployed
and active…
• …while the remaining 10% of traffic will be directed to the “canary” version of the
function.
• The alarm configuration (defined on lines 75-83) looks for a static value of 3 or
more errors on the function (I assume all versions here?) during the last 60-second
window.
• After 5 minutes, given that nothing has fired the alarm, the new version takes all of the traffic.
This is serverless.yml:
FakeUser:
  [...]
  alarms:
    - name: CanaryCheck
      namespace: 'AWS/Lambda'
      metric: Errors
      threshold: 3
      statistic: Sum
      period: 60
      evaluationPeriods: 1
      comparisonOperator: GreaterThanOrEqualToThreshold
  deploymentSettings:
    type: Canary10Percent5Minutes
    alias: Live
    alarms:
      - FakeUserCanaryCheckAlarm
You can either manually send “error traffic” with the [email protected] Authoriza-
tion header, or use the AWS CLI to manually toggle the alarm state. See AWS docs for
how to set the alarm state, similar to:
Information
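As a sketch—the alarm name is a placeholder and must match the alarm actually created in your stack—manually flipping the alarm into its error state can look like this:

# Force the alarm into the ALARM state to exercise the canary rollback (alarm name is a placeholder)
aws cloudwatch set-alarm-state \
  --alarm-name "FakeUserCanaryCheckAlarm" \
  --state-value ALARM \
  --state-reason "Manually testing canary rollback"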
Use the above with the OK state value to reset the alarm when done.
This specific solution is rudimentary, but indicative enough of how a canary solution
might begin to look. I highly recommend using a deployment strategy other than the
primitive “all-at-once” variety.
Information
See this article at Google Cloud Platform for more information on deployment
and test strategies.
Observability
Finally, this last section is all about putting eyeballs (and alarms and metrics…!) on the
product and making it operable as a modern solution.
• Structured logger
• Alerting
• Additional observability
There’s a lot written and discussed when it comes to observability vs monitoring. Let’s go
with Google’s definition taken from DORA:
Further, let's also add that monitoring is classically said to consist of three "pillars": logs, metrics, and traces. In the cloud, these are all typically pretty easy to set up, and we should aim to at least have them under control.
Information
Of the three pillars, tracing is maybe the least well-understood. Do read Light-
step’s good introductory article on tracing if you are interested in that area!
Information
Taken together, these give us a very good level of baseline observability right out of the
box. Let’s look at serverless.yml:
provider:
  [...]
  tracing:
    apiGateway: true
    lambda: true

plugins:
  - serverless-plugin-aws-alerts

custom:
  alerts:
    dashboards: true
Information
Refer to the CloudWatch Logs Insights query syntax if you want to set up some nice log queries. Also note that a shortcoming of CloudWatch Logs is that it isn't ideal for cross-service logs, though it's just fine for checking on a specific service. This is the intention behind the Honeycomb addition coming up below.
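For example—assuming the structured log format shown later in this chapter, with its level, userId, and correlationId fields—a Logs Insights query for recent errors could look like:

fields @timestamp, message, userId, correlationId
| filter level = "ERROR"
| sort @timestamp desc
| limit 20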
You can now see dashboards (metrics) in both the Lambda function view and CloudWatch
(plus the logs themselves) and get the full X-Ray tracing. It’s just as easy as that! Done
deal.
Structured logger
Structured logging should be an early thing you introduce into your stack.
Another best practice is to treat logs as a source of enriched data rather than as plain,
individual strings. To do so, we need to have a structured approach to outputting them.
Information
Good folks like Yan Cui have written and presented on this matter many times
and you can certainly also opt-in to turnkey solutions like lambda-powertools.
I’ve provided a basic one that also uses getUserMetadata() to get metadata (correlation
ID and user ID) that has been set in the environment at an early stage in the controller.
The full implementation is in src/FakeUser/frameworks/Logger.ts.
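As a rough sketch of the shape such a logger can take—this is not the repository's exact code; in particular, the environment variable names are assumptions—consider:

type Level = "INFO" | "WARN" | "ERROR";

// Assumption: the controller has already placed these values in the environment
function getUserMetadata() {
  return {
    userId: process.env.USER_ID || "",
    correlationId: process.env.CORRELATION_ID || ""
  };
}

export class Logger {
  public log(message: string) {
    console.log(JSON.stringify(this.createLog(message, "INFO")));
  }

  public warn(message: string) {
    console.log(JSON.stringify(this.createLog(message, "WARN")));
  }

  public error(message: string) {
    console.log(JSON.stringify(this.createLog(message, "ERROR")));
  }

  private createLog(message: string, level: Level) {
    const { userId, correlationId } = getUserMetadata();
    return {
      message,
      level,
      timestamp: Date.now(),
      userId,
      correlationId
    };
  }
}

Usage is then a matter of importing it and calling, for instance, new Logger().warn("Something looks off").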
Information
As opposed to some solutions, in our case, the Logger will not replace the vanilla
console.log() (etc) so you will need to import it everywhere you want to use
it.
Using it, your logs will then all follow the format (src/FakeUser/frameworks/Logger.ts):
{
  message: "My message!",
  level: "INFO" | "WARN" | "ERROR" <based on the call, i.e. logger.warn() etc.>,
  timestamp: <timestamp>,
  userId: <userId from metadata>,
  correlationId: <correlationId from metadata>
};
Alerting
Monitor as you will, but don’t forget to set alarms to inform you of any incoming dumpster
fires!
Alerts (or alarms; same thing) are usually connected to metrics. When the metric crosses its threshold, the alert goes off. Easy peasy.
This is serverless.yml:
custom:
  alerts:
    dashboards: true

functions:
  FakeUser:
    [...]
    alarms:
      - name: CanaryCheck
        namespace: 'AWS/Lambda'
        metric: Errors
        threshold: 3
        statistic: Sum
        period: 60
        evaluationPeriods: 1
        comparisonOperator: GreaterThanOrEqualToThreshold
    deploymentSettings:
      [...]
      alarms:
        - FakeUserCanaryCheckAlarm
Information
You can extend this behavior to, for example, communicate the alert to an SNS topic which in turn can inform a pager system, Slack/Teams, or send an email to a relevant person. A better way of doing this would probably be a shared service that can also keep a stored log of all events, rather than just relaying the alert directly to its destination.
Service discovery seems to be a very hot topic among the Kubernetes crowd. With server-
less FaaS like we are using here (AWS Lambda), that’s not really an interesting discus-
sion. However, discoverability is not just a technical question—it’s also something that
absolutely relates to the social, organic, and management layers.
At some point, that single function will be one service, which will soon maybe become
hundreds of services and then thousands of functions. How to keep track of them?
To some extent, it's possible to get a high-level picture inside of AWS or by being a CLI crusader. Unfortunately, if you are neither, there is no real option other than buying, building, or adapting some open-source solution.
Information
One such open-source solution is Spotify’s Backstage which offers a broad set of
ideas—no wonder since it’s labeled as “an open platform for building developer
portals”. It’s a bit heavy, but some pretty big players are starting to use it. For a
super-lightweight AWS-based and flexible solution, you might want to consider
catalogist written by yours truly.
For a lighter-weight system, I’d argue that the reasonable data to pull in would be a subset
of the metadata (and processes and output) we have generated and written previously.
In a corporate context, we'd want to ensure that the source of truth for the state of our systems and applications resides with the codebase—not in a closed tool like SharePoint or Confluence.
Example: See manifest.json for a napkin sketch of how one could work with service
metadata if you had somewhere to send it and store it, during the CI stage. This format is
also very similar to the one used in catalogist.
This is manifest.json:
{
"spec": {
"lifecycle": "production",
"type": "service",
"name": "my-project",
"team": "ThatTeam",
"responsible": "Someguy Someguyson",
"system": "something",
"domain": "bigarea",
"tags": ["typescript", "backend"],
"securityClass": "Public",
"dataSensitivity": "Sensitive",
"l3ResolverGroup": "ThatTeam",
"slo": "99.95"
},
"api": [
{
"FakeUser": "./api/schema.yml"
}
],
"metadata": {
"annotations": {
"sbom": "./outputs/sbom-output.txt",
"typedoc": "./typedoc-docs/",
"arkit": "./assets/"
},
"description": "The place to be, for great artists",
"generation": 1,
"labels": {
"example.com/custom": "custom_label_value"
},
"links": [
{
"url": "https://admin.example-org.com",
"title": "Admin Dashboard",
"icon": "dashboard"
}
]
}
}
Additional observability
Many modern observability tools are based on the OpenTelemetry standard, making a choice grounded in that standard a fairly future-proof decision. Using OpenTelemetry can be a bit of a slog, but I'll show you a way to move into richer observability while keeping the work quite manageable!
Information
Success
If you want to toy with Honeycomb, look no further than their playground.
If you want to try Honeycomb with this project, it’s pretty easy if we use their Lambda
extension:
7. Ready to go! You should see data coming into Honeycomb shortly if you start using
your live endpoints.
You might want to use Bunyan rather than our custom logger if the logs don’t quite
show up structured the way we output them.
Install bunyan and its typings with npm install bunyan @types/bunyan.
Open up src/FakeUser/frameworks/Logger.ts and add this to the top of the file:
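The snippet itself is not reproduced in this excerpt, but it presumably amounts to importing Bunyan and creating a logger instance, roughly like this (the logger name is a placeholder):

import bunyan from "bunyan";

// "log" is the instance used by the method changes below; the name is a placeholder
const log = bunyan.createLogger({ name: "fake-user-service" });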
For the log(), warn() and error() methods, change the existing console.log()-dependent
implementation lines to:
• log.info(createdLog);
• log.warn(createdLog);
• log.error(createdLog);
Lastly, in the createLog() method, go ahead and remove the level field, as bunyan adds
that itself.
Now you can use Bunyan instead of regular console.log() or our own custom one!
If this is an area you enjoy and you have a similar preference as I do to lightweight,
simple-as-in-dumb tools, then you might enjoy the Mikro family of tools:
• MikroLog
• MikroTrace
• MikroMetric
I've built these three lightweight, open-source observability packages and designed them specifically to streamline your AWS serverless experience. These tools are all tiny, zero-config, and optimized for AWS Lambda environments, making them ideal for cloud-native applications. Each package is built with simplicity, minimalism, and effectiveness in mind, ensuring you get all the necessary functionality without the bloat.
MikroMetric is your go-to for seamlessly integrating with AWS CloudWatch, provid-
ing a straightforward syntax for managing metrics without the complexity of raw EMF.
MikroLog offers a clean, structured logging solution that eliminates unnecessary fields
and complexities, allowing for easy log management across multiple observability plat-
forms. MikroTrace simplifies tracing by offering OpenTelemetry-like semantics with a
focus on JSON logs, making it easier to integrate with AWS and Honeycomb. All three
packages share the benefits of being extremely lightweight (around 2 KB gzipped), having
only one dependency (aws-metadata-utils), and achieving 100% test coverage, ensuring
they are both reliable and easy to use.
Books
• Accelerate: The Science of Lean Software and DevOps: Building and Scaling High
Performing Technology Organizations, by Nicole Forsgren, Jez Humble, Gene Kim
• Refactoring: Improving the Design of Existing Code (2nd Edition), by Martin Fowler
• DDD Resources
• FeatureFlags.io
• Branch By Abstraction?
• Refactoring.guru
• APIsecurity.io
• Why SOLID principles are still the foundation for modern software architecture
• Khalil Stemmler: How to Test Code Coupled to APIs and Databases; also available
as video (see below)
Video
Thank you for investing your time, energy, and money into reading this book. With every
book and article I write, I strive to make it as useful as possible. Books allow us to delve
deeper—or sometimes broader—into topics than we typically can at work or in short-
form articles. Technical books, in particular, are unique creatures: they are both products
of their time and, when well-crafted, can become timeless resources within their field. I
hope this book remains relevant for (at least a few!) years to come.
I write the books I wish I had read earlier in my career and life. I’ve tried to be generous
with references to other content, such as books and articles that have helped me improve
in this subject. There are so many great authors out there and so much knowledge to
keep up with.
If you found this book helpful, I would greatly appreciate it if you could rate it on the
platform where you purchased it.
Please don’t be a stranger! Connect with me on LinkedIn or wherever else I may be when
you’re reading this.
Once again, thank you, and I hope you found value in the time we spent together.