Service Mesh
The New Infrastructure for Microservices
Hanna Prinz
Eberhard Wolff
ISBN 978-3-9821126-1-9
innoQ Deutschland GmbH
Krischerstraße 100 · 40789 Monheim am Rhein · Germany
Phone +49 2173 33660 · WWW.INNOQ.COM
1 Intro
Microservices are still the most hyped approach to software architecture. That has
led to a huge number of new technologies to support microservices. Docker [1], for
instance, is a great tool to package microservices in self-contained images and to run
many microservices on a single machine, but with separate file systems and network
interfaces. Kubernetes [2] runs Docker containers in a cluster and enables load balancing
as well as failover. It also adds features like service discovery and routing. Nowadays,
Kubernetes has become a very important infrastructure technology and has use cases
far beyond microservices.
If you are interested in the concepts of microservices, there is also the free
Microservices Primer [3] and the Microservices Book [4]. The free Microservices Recipes [5]
booklet talks about some other technologies for implementing microservices, as does the
Practical Microservices [6] book.
[1] https://www.docker.com
[2] https://kubernetes.io/
[3] https://microservices-book.com/primer.html
[4] https://microservices-book.com/primer.html
[5] https://practical-microservices.com/recipes.html
[6] https://practical-microservices.com/
2 What is a Service Mesh?
A service mesh is a dedicated infrastructure component that facilitates observing,
controlling, and securing communication between services. Unlike earlier approaches
such as Enterprise Service Buses (ESBs) or API Gateways, a service mesh embraces
the distributed nature of modern microservice applications and focuses only on
networking rather than business concerns.
2.1 Architecture
A service mesh is composed of two layers, the data plane and the control plane. The
data plane consists of a number of service proxies, one deployed alongside each
microservice instance. This is called the sidecar pattern: Capabilities that every service
needs are extracted into an additional container (the sidecar) that is placed next to
every service instance. Up to now, typical use cases for a sidecar have been basic
monitoring and encryption of network connections. For that purpose, a service proxy was
deployed as a sidecar which intercepts all network traffic. The configuration of these
service proxy sidecar containers and any update to it had to be performed manually.
In a service mesh, the service proxies that make up the data plane are configured
automatically by the second layer of a service mesh: the control plane. Any change to
the behavior of the service mesh configured by the developer is applied to the control
plane and automatically distributed to the service proxies. As shown in figure 2.1, the
control plane also processes telemetry data that is collected by the service proxies.
This architecture adds powerful features like monitoring, circuit breaking, canary
releasing, and automatic mTLS (mutual TLS authentication) to a microservice
application without the need to change a single line of application code.
Figure 2.1: Service Mesh Architecture
The diverse landscape of service mesh implementations forced tool developers and
users to bind their software to a specific service mesh. This led Microsoft, HashiCorp,
Buoyant, and Solo.io to create Service Mesh Interface [4] (SMI), an API specification for
service mesh features. Users and tools binding to SMI are able to use service mesh
features independently of the implementation. Although SMI is still young, adapters
implementing a part of SMI already exist for the three major service meshes:
Linkerd 2, Istio, and Consul.
[1] https://flagger.app
[2] https://squash.solo.io
[3] https://www.kiali.io
[4] https://smi-spec.io
3 Why Service Meshes?
By adding proxies to the data plane, service meshes make it possible to solve some
basic problems for microservices infrastructures. This chapter explains the features
service meshes provide in more detail.
3.1 Monitoring
A microservices system should include a monitoring infrastructure that collects
information from all microservices and makes it accessible. This is required to
keep track of the metrics for the huge number of distributed microservices. You can
implement alarms and further analysis based on these metrics.
The proxies in the data plane can measure basic information about the network traffic
such as latencies or throughput. However, it is also possible to get more information
from the network traffic. To do so, the service mesh needs to understand the protocol
and interpret it. For example, HTTP has some defined status codes that enable the service
mesh to determine whether a request was successful or not. That way the service mesh
can also measure error rates and the like.
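As a minimal sketch of how such metrics could be derived, the following Python code aggregates request records per destination. The RequestRecord type, its field names, and the rule that only 5xx status codes count as errors are assumptions made for this illustration; they are not taken from any particular service mesh.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RequestRecord:
    """One observed request, as a proxy could record it (hypothetical shape)."""
    destination: str
    status: int        # HTTP status code of the response
    latency_ms: float  # time between request and response


def summarize(records):
    """Aggregate request count, error rate, and mean latency per destination."""
    by_destination = {}
    for record in records:
        by_destination.setdefault(record.destination, []).append(record)
    summary = {}
    for destination, group in by_destination.items():
        # Assumption: only 5xx responses count as failures of the called service.
        errors = sum(1 for r in group if r.status >= 500)
        summary[destination] = {
            "requests": len(group),
            "error_rate": errors / len(group),
            "mean_latency_ms": mean(r.latency_ms for r in group),
        }
    return summary
```

Because the records only contain what is visible on the network, the same function works for every microservice regardless of its programming language or framework.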
Of course, the service mesh cannot look into the microservices. So internal metrics
about thread pools, database connection pools, and the like are beyond what the
service mesh can provide. However, it is possible for the service mesh to use metrics
that a platform like Kubernetes provides and measure these, too.
Monitoring through the service mesh has two advantages:
• It has no impact on the code of the microservice at all. Metrics are measured only
by the proxy, so any microservice will report the same metrics – no matter in what
programming language it is written or which framework it uses.
• The metrics give a good impression about the state of the microservice. The metrics
show the performance and reliability that a user or client would see. This is enough
to ensure that service level agreements and quality of service are met.
So metrics provided by a service mesh might be enough to manage a microservices
system. However, for a root cause analysis, additional metrics from inside the
microservices might be useful. In that case, the service mesh is still helpful because it
provides a standardized monitoring environment that the service mesh already uses for
its own metrics.
Tracing
Tracing solves a common problem in microservices systems. A request to a
microservice might result in other requests. Tracing helps to understand these
dependencies, thus facilitating root cause analysis.
The proxies are able to intercept each request. But to do tracing, it is important to
figure out which incoming request caused which outgoing requests. This can only
be done inside the microservices. Usually, each request has some metadata such as
a unique ID, for example as part of the HTTP headers, and that information is then
transferred to each outgoing request. The data for each request is then stored in a
centralized infrastructure.
A unique ID is not just valuable for tracing, but it is also useful to mark log entries that
belong to one original request across all microservices.
The code of the microservices has to transfer the IDs of each request. So the service
mesh cannot add tracing transparently to microservices. This compromises an
important benefit of service meshes.
In addition, tracing is only useful for very specific cases. Most errors can be analyzed by
looking into just one microservice and treating the other ones as black boxes. Tracing
only adds value in very complex scenarios.
The tracing data can also be used to create dependency graphs. However, such a graph
can be generated by observing the traffic between microservices. There is no need for
an ID in each request and therefore no need to change the code of the microservices.
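The following Python sketch illustrates this idea under simple assumptions: each observed call is reduced to a pair of caller and callee names, and the function name dependency_graph is hypothetical. No request ID is involved.

```python
from collections import defaultdict


def dependency_graph(observed_calls):
    """Map each caller to the sorted list of services it was seen calling.

    observed_calls is an iterable of (source, destination) name pairs,
    for example as reported by the proxies.
    """
    graph = defaultdict(set)
    for source, destination in observed_calls:
        graph[source].add(destination)
    return {source: sorted(targets) for source, targets in graph.items()}
```

Repeated calls between the same two services collapse into a single edge, which is all a dependency graph needs.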
3.2 Logging
Logging is another important technology to gain more insight into a system. A service
mesh collects information about the network communication. It could write that
information also to a log file. Access logs that contain entries for each HTTP access
are an important source of information to evaluate the success of web sites. So a log
that contains only information about network traffic might be useful.
However, often log files are used to analyze errors in microservices. To do that, the log
files have to include detailed information. The microservices will need to provide that
information and log it. So just like with tracing, the microservice needs to be modified.
In fact, for logging the service mesh adds little value because usually the information
about the network traffic is not that useful to understand problems in the system.
The logging support of a service mesh has the advantage that developers do not have
to care about these logs at all. Besides, the logs are uniform no matter what kind
of technology is used in the microservices and how they log. Enforcing a common
logging approach and logging format takes some effort. This is particularly true for
microservices that use different technologies. So although the logs of the service mesh
might not include information from inside the microservices, they are easy to get. Such
a log might be better than no log at all or no uniform log.
3.3 Resilience
Resilience means that individual microservices still work even if other microservices
fail. If a microservice calls another microservice and the called microservice fails, this
will have an impact. Otherwise, the microservice would not need to be called at all.
So the calling microservice will behave differently and might not be able to respond
successfully to each request. However, the microservice must still respond. It must
not block a request because then other microservices might be blocked and an error
cascade might occur. Also delays in the network communication might lead to such
problems.
3.3.1 Circuit Breaker
Circuit breakers are one way to add some resilience to a microservices system. A
conventional circuit breaker would cut a circuit if the circuit has been short-circuited.
That protects the circuit from overheating. Circuit breakers in software systems do
something similar: They protect a part of the system by cutting off some of the traffic.
Because service meshes add proxies to the communication between the microservices,
circuit breakers can be added without changing the code.
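To make the mechanism concrete, here is a minimal in-process sketch of a circuit breaker in Python. It is a generic illustration and not how a service proxy implements the feature; the class name, the default threshold of three failures, and the 30 second cool-down are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of sending more traffic to a failing service.
                raise CircuitOpenError("circuit is open")
            # Cool-down elapsed: allow a trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit completely
        return result
```

The clock is passed in so that the behavior can be checked without waiting; in a service mesh the equivalent logic lives in the proxy, which is why the application code stays unchanged.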
3.3.2 Retry
Retries repeat the failed requests. They are an obvious way to increase resilience: If
the failure is transient, the retry can make the request succeed. However, retries also
increase the number of requests. So the called microservice will have a higher load and
might become unstable. A circuit breaker might be used to solve this problem.
3.3.3 Timeout
Timeouts make sure that the calling microservice is not blocked for too long. Otherwise,
if all threads are blocked, the calling microservice might not be able to accept any more
requests. So a timeout will not make it more likely that a request succeeds; rather, it
makes the request fail faster so that fewer resources are blocked.
3.4 Routing
Any microservices system needs some way to route requests between microservices
and to route a request from the outside to the correct microservice. Implementing
these features can be very simple. A reverse proxy might be enough to route requests
from the outside to the correct microservices.
• Canary releasing is a deployment process. First one instance of the new version of
a microservice is deployed while there are still some instances of the old version
around. Step by step, more and more traffic is routed to the new version. At the
same time, the new version is monitored. So if there is a problem with the new
version, it will become obvious while some instances of the old version are still
around and it is therefore easy to roll back. Advanced routing supports this process
by splitting the traffic between the two versions. This might be done randomly or
according to specific segments like devices or geo regions.
• Another way to mitigate the risk of a deployment is mirroring. In that case the new
and the old version of a microservice run in parallel. Both receive the traffic and
respond to it. It is possible to look at different behaviors in detail and to determine
whether the new version works correctly. Istio automatically discards the responses
of the new service. In that case, the risk for deploying a new version is essentially
non-existent.
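The traffic splitting used for canary releasing can be sketched as follows. This Python example assigns a request to the stable or the canary version based on a hash of a request key, so that the same user always sees the same version. The function name, the use of a user ID as the key, and the hashing approach are assumptions for illustration; a service mesh may split traffic randomly instead.

```python
import hashlib


def choose_version(request_key, canary_percent):
    """Return "canary" for roughly canary_percent of all keys, else "stable"."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key to a stable bucket between 0 and 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Raising canary_percent step by step moves more keys to the new version, and setting it back to 0 is the rollback.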
It is quite easy to implement these features once the proxies of the service mesh are in
place. The only difference might be that not just traffic between microservices must
be handled, but also traffic from the outside.
3.5 Security
Obviously, it is possible to encrypt the network traffic between the proxies. However,
there is still the network traffic between the proxy and the microservice. Usually the
proxy and the microservice run on the same machine. So while the traffic might look
like network traffic, it really goes through a loop-back device and not through a real
network. Therefore, service meshes can implement encryption and confidentiality
transparently.
certificate. It is therefore possible to implement authentication and ensure that a call
actually originates from a specific microservice.
Of course, it is then possible to limit what a microservice can do. For example, the
proxies might not allow specific requests to other microservices. That means that the
service mesh can implement authorization for microservices.
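A minimal sketch of such an authorization check, assuming that the identity of the caller has already been established, for example from its certificate. The Rule type and the deny-by-default behavior are assumptions for this illustration and do not mirror the policy format of a specific service mesh.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str         # identity of the calling microservice
    destination: str    # microservice that is being called
    methods: frozenset  # HTTP methods the caller may use


def is_allowed(rules, source, destination, method):
    """Deny by default; allow only calls that match an explicit rule."""
    return any(
        rule.source == source
        and rule.destination == destination
        and method in rule.methods
        for rule in rules
    )
```

With rules that let shipping and invoicing read from order, a POST from shipping to order or any call in the opposite direction would be rejected by the proxy.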
For that reason, it is important to keep the additional latency of the proxies as small as
possible. For example, it is possible to collect data about requests in the proxy and send
it to the service mesh infrastructure later. That way sending the information does not
increase the latency even further. Also, updates, for example to security policies, are
distributed asynchronously to the proxies. That means updates to such policies might
take some time until they are actually enforced in every proxy. Istio is optimized for
low latency.
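The idea of collecting data in the proxy and sending it later can be sketched like this. The class and the send_batch callback are hypothetical; in a real proxy, flush would typically be triggered by a timer outside the request path.

```python
class TelemetryBuffer:
    """Collect request data in memory and hand it over in batches."""

    def __init__(self, send_batch):
        self._send_batch = send_batch  # hypothetical callable that ships a list of events
        self._pending = []

    def record(self, event):
        """Called while handling a request: only appends to memory."""
        self._pending.append(event)

    def flush(self):
        """Called later, outside the request path: sends everything collected so far."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send_batch(batch)
```

The request only pays for an in-memory append, while the comparatively slow network call to the service mesh infrastructure happens afterwards.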
Of course, the service mesh itself needs to run. This will consume memory and
CPU. However, hardware is becoming cheaper constantly and adding some hardware
resources to improve the reliability of a system might be acceptable.
3.7 Legacy
Service meshes are very useful for microservices architectures because they solve
many challenges of distributed systems. The microservice architecture has been
around for years and its popularity is still growing. But many teams have experienced
that slicing their monolithic application takes a long time. A service mesh can add
monitoring, routing, resilience, and security features to legacy parts of the application
and facilitate integrating legacy and hybrid applications into modern architectures.
Of all service mesh implementations, Istio offers the best support for this scenario
because it is not limited to Kubernetes. It can integrate legacy systems that run on
different infrastructures using service mesh expansion [2].
For routing technologies, reverse proxies or API gateways might be alternatives. They
are specialized in providing just these features. Limiting the feature set might make
them easier to handle than a service mesh. However, it also means that they do not
provide all the other features a service mesh has to offer.
[2] https://istio.io/docs/setup/kubernetes/additional-setup/mesh-expansion/
For security, an infrastructure that provides certificates must be set up, certificates
must be distributed, and the communication must be encrypted. While there are
alternatives, they also require a modification to the infrastructure and have a limited
set of features. In addition, non-authorized microservices must not be able to get
certificates from the infrastructure. Service meshes control the proxies that need to
receive the certificates. So it is quite easy for a service mesh to ensure that certificates
are distributed securely because it controls all of the parts of the communication
infrastructure.
3.9 Conclusion
Service meshes are based on a rather simple idea: Add proxies to the communication
between microservices and add an infrastructure to configure the proxy and evaluate
the data the proxies send. However, it turns out that this idea is quite powerful. It
provides basic monitoring, security, analysis of the dependencies, and resilience. It is
not necessary to change any code to enjoy these benefits.
However, for tracing the microservices have to forward the unique IDs for each call.
So in that case some changes to the code are needed. Service meshes can provide only
basic logging for the network traffic. That might be of little value. And of course the
additional infrastructure comes with a cost: The latency increases and
the additional infrastructure also consumes memory and CPU.
However, all in all, service meshes solve challenges that are quite important for
microservices systems and are therefore a good addition to a microservices system.
4 Example with Istio
Typically, an example that shows a technology in action is a great way to understand
how the technology actually works. This chapter shows a system consisting
of multiple microservices and demonstrates how a service mesh can add value to a
microservices architecture. The example runs on Kubernetes and uses Istio as a service
mesh.
4.1 Istio
Istio is the most popular service mesh and was developed by Google and IBM. Just
like Kubernetes, it is an Open Source reimplementation of a part of Google’s internal
infrastructure. Istio implements all service mesh features described in the previous
chapter such as metrics, logging, tracing, traffic routing, circuit breaking, mTLS, and
authorization. Although Istio is designed to be platform-independent, it started with
first class support for Kubernetes.
Figure 4.1 reflects how the service mesh is located between the orchestrator (top) and
the application (bottom). The four core components of Istio make up the control
plane: Galley, Pilot, Mixer, and Citadel. They communicate with the service proxies
to distribute configurations, receive recorded network traffic and telemetry data, and
manage certificates. Istio uses Envoy [1] as service proxy, a widely adopted open source
proxy that is used by other service meshes, too.
In addition to the typical service mesh control and data plane, Istio also adds
infrastructure services. They support the monitoring of microservice applications.
Instead of developing its own tools, Istio integrates established applications such as
Prometheus, Grafana, and Jaeger and the service mesh dashboard Kiali. The image
shows that the Istio control plane interacts with the orchestrator, which today is in
most cases Kubernetes.
[1] https://www.envoyproxy.io
Figure 4.1: Istio Architecture
clearly affects the usability. Istio also adds a number of components (marked as Istio
Core and integrated components in the figure) to an application which increases
technical complexity.
4.2 Overview
The example in this chapter contains three microservices: Users can enter new orders
via a web interface of the order microservice. The data about the order is transferred
to the invoicing microservice that will create an invoice and present it to the user.
The shipping microservice will use the data about the order to create a shipment. The
invoicing and the shipping microservice present a web interface to the user, too.
• Istio provides the Ingress Gateway. It forwards HTTP requests to the microservices.
In addition to the features of the Kubernetes Ingress, the Istio Gateway supports
Istio’s features mentioned previously, such as monitoring or advanced routing.
• Apache httpd provides a static HTML page that serves as the home page for the
example. The page has links to each microservice.
• Order, shipping, and invoicing are microservices. Shipping and invoicing poll data
about the orders from the order microservice using REST. Istio understands HTTP
and REST, so it can support this interface very well. The data format is based
on JSON. The feed contains a simple JSON document with a list of links to the
individual orders.
• All three microservices use the same Postgres database.
Sidecar
As mentioned in chapter 2, the idea behind a sidecar is to add another Docker container
to each Kubernetes pod. Actually, if you list the Kubernetes pods with
kubectl get pods, you will notice that for each pod it says 2/2:
So while there is just one Docker container configured for each Kubernetes pod, two
Docker containers are in fact running. One container contains the microservice, and
the other contains the sidecar that enables the integration into the Istio
infrastructure.
Istio automatically injects these containers into each pod. During the installation
described previously, the command
kubectl label namespace default istio-injection=enabled
marked the default namespace so that Istio injects sidecars for each pod in that
namespace. Namespaces are a concept in Kubernetes to separate Kubernetes resources.
In this example, the default namespace contains all Kubernetes resources that the user
provides. The namespace istio-system contains all Kubernetes resources that belong to
Istio itself.
The sidecars contain the proxies. Istio routes all traffic between the microservices
through these proxies as described in chapter 2.
Of course, this approach supports only metrics that the proxy can measure. This
includes all the information about the request, such as its duration or the status code.
Also, information about the Kubernetes infrastructure – for example, CPU utilization
– can be measured. However, data about the internal state of the microservice is not
measured. To get that data, the microservice would need to report it to the monitoring
infrastructure.
Prometheus
The documentation of the example [5] explains how to run Prometheus in
the example.
Prometheus stores all metrics and also provides a simple UI to analyze the metrics.
Figure 4.3 shows the byte count for requests and the different destinations: the order
microservice and also the Istio component that measures the telemetry data.
Grafana
For more advanced analysis of the data, Istio provides an installation of Grafana. The
documentation [6] explains how to use Grafana with the example.
The Grafana installation in Istio provides predefined dashboards. Figure 4.4 shows an
example of the Istio service dashboard for the shipping microservice. The dashboard
shows metrics such as the request volume, the success rate, and the duration.
This gives a great overview of the state of the service.
Istio also supplies other dashboards. The Istio performance dashboard provides a
general overview of the state of the Kubernetes cluster with metrics such as memory
consumption or CPU utilization. The Istio mesh dashboard shows a global metric
about the number of requests the service mesh processes and their success rates.
So just by installing Istio, basic monitoring for all microservices is already in place.
[5] https://github.com/ewolff/microservice-istio/blob/master/HOW-TO-RUN.md#prometheus
[6] https://github.com/ewolff/microservice-istio/blob/master/HOW-TO-RUN.md#grafana
Figure 4.3: Prometheus with Istio Metrics
Figure 4.4: Grafana with Istio Dashboard
4.5 Tracing
As explained in section 3.1, tracing might be important to trace calls across
microservices and to do a root cause analysis based on that information. For tracing,
Istio uses Jaeger [7].
Figure 4.5 shows an example of a trace for a request to the shipping microservice. The
user started a poll for new data on the order microservice. Then the service contacted
the Istio Mixer to make sure the policies are enforced.
Figure 4.6 shows a different type of information Jaeger provides: the dependencies
between the microservices. The shipping and invoicing microservices use the order
microservice to receive information about the latest orders. The order microservice
reports metrics to Mixer. And finally, the order microservice is accessed by the Istio
gateway when external requests are forwarded to it. This information about
dependencies might be useful to get an overview of the architecture of the system.
[7] https://www.jaegertracing.io/
[8] https://github.com/ewolff/microservice-istio/blob/master/HOW-TO-RUN.md#tracing
Figure 4.5: Jaeger Trace
To understand which incoming request caused which outgoing requests, Jaeger relies
on specific HTTP headers. The values in the headers of the incoming requests have to
be added to any outgoing request. This means that tracing cannot be transparent to
the microservices. They have to include some code to forward the tracing headers from
the incoming request to each outgoing request.
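The forwarding logic itself is small. The following Python sketch selects the tracing headers of an incoming request so that they can be attached to outgoing calls. The header names are the B3 and Envoy headers commonly listed for Istio; treat the exact list as an assumption and check it against the Istio version in use. The function name is hypothetical.

```python
# Assumed list of tracing headers; verify against the Istio version in use.
TRACING_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def tracing_headers_to_forward(incoming_headers):
    """Return the tracing headers of an incoming request for reuse on outgoing calls."""
    # HTTP header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    return {name: lowered[name] for name in TRACING_HEADERS if name in lowered}
```

A microservice would merge the returned dictionary into the headers of every HTTP request it sends while handling the incoming request.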
Figure 4.6: Jaeger Dependencies
Figure 4.7: Dependencies in Kiali
4.7 Logging
As explained in section 3.2, service meshes are of little help for logging. However,
Istio can forward information about each HTTP request to a logging infrastructure.
That information can be used to analyze, for example, the number of requests to
certain URLs and status codes. A lot of statistics for web sites rely on this kind of
information.
The format in the logs can be configured. A log entry may contain any information
Mixer received from the request.
Figure 4.8: Logging in the Example
Figure 4.8 shows how logging is implemented in the example. Each microservice must
directly write JSON data to the Elasticsearch server. So there is no need to write any log
files which makes the system easier to handle. The need to parse the log information
has also been eliminated; Elasticsearch can directly process it.
The Istio infrastructure could log to the same Elasticsearch instance, too. However, for
the example, it was decided that this is not necessary. With the current system, it is
easy to find problems in the implementation by searching for log entries with severity
error. Also logging each HTTP request adds little value. Information about the HTTP
requests is probably already included in the logs of the microservices.
4.8 Resilience
Resilience means that a microservice should not fail if other microservices fail. It is
important to avoid failure cascades that could bring down the complete microservices
system.
Such scenarios are hard to simulate. Usually the network is reasonably reliable. It
would be possible to implement a stub microservice that returns errors, but that would
require some effort.
However, Istio controls the network communication through the proxies. It is
therefore possible to add delays and errors to specific microservices. Then the other
microservices can be checked to see if they are resilient against the delays and failures.
Fault Injection
The configuration is in the file fault-injection.yaml from the example. It makes
100 percent of the calls to the order microservice fail with HTTP status 500. See the
documentation [14] about how to apply this configuration to your system.
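What the proxy does with such a configuration can be sketched in a few lines of Python. The wrapper below is an illustration only; the function name, the handler signature, and the injectable random source are assumptions and not part of Istio.

```python
import random


def with_fault_injection(handler, abort_percent, abort_status=500, rng=random.random):
    """Wrap handler so that abort_percent of all calls return abort_status instead.

    handler takes a request and returns a (status, body) tuple.
    rng returns a float in [0, 1) and can be replaced for testing.
    """
    def wrapped(request):
        if rng() * 100 < abort_percent:
            # The real service is never called for an aborted request.
            return abort_status, "fault injected"
        return handler(request)
    return wrapped
```

With abort_percent set to 100, every call fails without reaching the service, which matches the behavior described for the example.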
Actually, the microservice will still work after applying the fault injection. If you add
a new order to the system, though, it will not be propagated to the microservices
shipping and invoicing. You can make those microservices poll the order microservice
by pressing the pull button in the web UI of the shipping and invoicing microservices.
In that case, an error will be shown. So the system is already quite resilient because
it uses polling. If the shipping microservice called the order microservice, for
example to fulfill a request, the shipping service would fail after the fault injection
if no additional logic is implemented to handle such a failure.
[14] https://github.com/ewolff/microservice-istio/blob/master/HOW-TO-RUN.md#fault-injection
Delay Injection
Another possibility is to inject a delay; see the documentation [15]. If you make the
shipping microservice poll the order microservice, it will take longer but it will work
fine. Otherwise the system just works normally. So again, due to polling the system is
already quite resilient.
[15] https://github.com/ewolff/microservice-istio/blob/master/HOW-TO-RUN.md#delay-injection
Circuit Breaker
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: order-circuit-breaker
spec:
  host: order.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        http2MaxRequests: 1
    outlierDetection:
      consecutiveErrors: 1
      interval: 1m
      baseEjectionTime: 10m
      maxEjectionPercent: 100
The previous listing shows a configuration of a circuit breaker for Istio. See the
documentation [16] for information about how to apply it. The configuration has the
following settings:
These limits protect the microservice from too much load. And, if an instance has
already failed, it is excluded from the work. That gives the instance a chance to
recover.
The limits in the example are very low to make it easy to trigger the circuit breaker. If
you use the load.sh script to access the order microservice’s web UI, you will need to
run a few instances of the script in parallel to receive 5xx error codes returned by the
circuit breaker.
If the circuit breaker does not accept a request because of the defined limits, the calling
microservice will receive an error. So the calling microservice is not protected from a
failing microservice.
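The effect of such limits can be illustrated with a small Python sketch. It admits a fixed number of concurrent requests and rejects the rest immediately. The class name is hypothetical, the status 503 is an assumption for the 5xx code mentioned above, and the sketch is not thread-safe.

```python
class RequestLimiter:
    """Admit at most max_requests concurrent requests and reject the rest."""

    def __init__(self, max_requests=1):
        self.max_requests = max_requests
        self.active = 0

    def start_request(self):
        """Return 200 if the request is admitted, 503 if the limit is reached."""
        if self.active >= self.max_requests:
            return 503  # fail fast instead of queueing more work
        self.active += 1
        return 200

    def finish_request(self):
        """Free the slot of a completed request."""
        self.active = max(0, self.active - 1)
```

With a limit of one, a second request that arrives while the first is still running is rejected, which is why several parallel instances of a load script are needed to see the errors.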
[16] https://github.com/ewolff/microservice-istio/blob/master/HOW-TO-RUN.md#circuit-breaker
Retry and Timeout
Retries and timeouts as explained in section 3.3 can be defined
in the same part of the Istio configuration. That makes it easy to define a timeout per
retry and a maximum time span that all retries together might take.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-retry
spec:
  hosts:
  - order.default.svc.cluster.local
  http:
  - retries:
      attempts: 20
      perTryTimeout: 5s
      retryOn: connect-failure,5xx
    timeout: 10s
    route:
    - destination:
        host: order.default.svc.cluster.local
The previous listing shows a part of the file retry.yaml. It configures retries and
timeouts for the order microservice. Calls to the order microservice are retried up
to 20 times. For each retry, a timeout of 5 seconds is configured. However, there is
also a total timeout of 10 seconds. So if the retries don’t succeed within 10 seconds,
the call will fail. Istio’s default timeout is 15 seconds.
retryOn defines when a request is considered failed. In this case any HTTP status code
5xx or connection failures such as timeouts are considered failed requests.
The rest of the file is not shown in the listing. It adds retries to the Istio Ingress gateway
for the order microservice.
See the documentation [17] for information about how to make the order service fail and
then fix the problem by applying this configuration to the system.
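How a per-try timeout and a total timeout interact can be sketched in Python. The function below is an illustration under stated assumptions: attempt is a hypothetical callable that receives the timeout for one try and raises an exception on failure, and the default values mirror the settings of the listing. It is not Istio's implementation.

```python
import time


def call_with_retries(attempt, retries=20, per_try_timeout_s=5.0,
                      total_timeout_s=10.0, clock=time.monotonic):
    """Call attempt(timeout_s) until it succeeds or the budget is exhausted.

    The first try plus up to `retries` retries are made, but never beyond
    the total timeout.
    """
    deadline = clock() + total_timeout_s
    last_error = None
    for _ in range(retries + 1):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # the total timeout wins over the retry count
        try:
            # A single try never gets more time than is left overall.
            return attempt(min(per_try_timeout_s, remaining))
        except Exception as error:
            last_error = error
    raise TimeoutError("retries exhausted or total timeout reached") from last_error
```

If every try runs into its 5 second timeout, only two tries fit into the total budget of 10 seconds, so the limit of 20 retries is only reached when the individual tries fail quickly.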
• Istio provides a list of tasks18. They offer hands-on guidance for gaining more
experience with Istio and are therefore a great addition to this chapter.
• Istio provides support for security. See the security task19 for some hands-on
exercises for this feature. The exercises cover encrypted communication,
authentication, and authorization.
17 https://github.com/ewolff/microservice-istio/blob/master/HOW-TO-RUN.md#retry
18 https://istio.io/docs/tasks/
19 https://istio.io/docs/tasks/security/
• Istio also supports advanced routing. For example, the traffic shifting task20 shows
how to do A/B testing with Istio. The mirroring task21 shows mirroring. With
mirroring, two versions of a microservice receive the same traffic, but only the
responses of one version are returned to the caller. Mirroring can be used to make
sure that the new and the old version behave in the same way.
• This example22 discusses how Istio supports logs, showing how log information can
be composed from Mixer’s data. The example outputs the logs to stdout and not
to a log infrastructure.
• Also, this example23 shows how Fluentd24 collects the logs Istio provides from all
microservices. The logs are stored in Elasticsearch and evaluated with Kibana.
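The mirroring mentioned in the list above can be sketched as an Istio configuration. The following is a minimal sketch and not taken from the example project: it assumes that two versions of the order microservice are deployed with the hypothetical pod labels version: v1 and version: v2, and the resource names order-versions and order-mirror are likewise made up for illustration. All regular traffic is routed to v1, while v2 receives a copy of each request.

```yaml
# Sketch only: subset names, labels, and resource names are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: order-versions
spec:
  host: order.default.svc.cluster.local
  subsets:
  # Each subset selects the pods of one version by label.
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-mirror
spec:
  hosts:
  - order.default.svc.cluster.local
  http:
  - route:
    # Callers only ever see responses from v1.
    - destination:
        host: order.default.svc.cluster.local
        subset: v1
      weight: 100
    # v2 receives a copy of each request; its responses are discarded.
    mirror:
      host: order.default.svc.cluster.local
      subset: v2
```

Traffic shifting for A/B testing uses the same structure without the mirror entry: the route then lists both subsets as destinations with weights that add up to 100, for example 90 and 10.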
4.10 Conclusion
The example shows how Istio provides a complete monitoring environment and basic
monitoring information for the microservices system without any additional code. The
microservices are treated as black boxes, i.e., the metrics only cover the communication
between the microservices and the infrastructure. However, this could in fact be
enough for a production system, as it shows the performance as experienced by the
user.
Some code has to be added to the microservices for tracing so that they forward the
tracing headers to the outgoing HTTP requests. Still, the complete infrastructure is
provided by Istio.
Istio does not provide an infrastructure for logging, but it can log information for each
HTTP request to a logging infrastructure. However, logging is often used to look inside
the microservices and understand what they actually do. This can only be implemented
in the code of the microservices. Still, Istio could at least provide some very basic
information in the logs without any impact on the code or the microservices.
20 https://istio.io/docs/tasks/traffic-management/traffic-shifting/
21 https://istio.io/docs/tasks/traffic-management/mirroring/
22 https://istio.io/docs/tasks/telemetry/metrics-logs/
23 https://istio.io/docs/tasks/telemetry/fluentd/
24 https://www.fluentd.org/
Istio makes it possible to simulate problems in the system by injecting delays and
faults. That is useful to test the system’s resilience. Istio’s circuit breaker and retries
even help to implement resilience. If a microservice fails, the system still needs to
compensate for that failure. Dealing with a failed service must be covered by the
domain logic. The microservices in the example are self-contained, i.e., all data for the
logic is stored in the microservice. So a failed microservice has very limited impact.
However, this is a feature of the architecture and not of the service mesh.
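Delay injection of the kind described above can be sketched as follows. This is a minimal sketch under assumptions: the resource name order-delay and the values (a fixed delay of 7 seconds applied to every request) are chosen for illustration and are not taken from the example project.

```yaml
# Sketch only: resource name and values are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-delay
spec:
  hosts:
  - order.default.svc.cluster.local
  http:
  - fault:
      delay:
        # Hold back each affected request for 7 seconds before forwarding it.
        fixedDelay: 7s
        # Share of requests that are delayed, in percent.
        percentage:
          value: 100
    route:
    - destination:
        host: order.default.svc.cluster.local
```

Replacing the delay entry with an abort entry that sets httpStatus (for example 500) together with a percentage makes the proxy answer with an error instead of forwarding the request. Both variants make it possible to observe how callers behave when the order microservice is slow or failing.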
5 Other Service Meshes
The previous chapter discussed Istio. While Istio is the most popular service mesh, the
market is quite diverse and worth evaluating. The first service mesh implementation
was Linkerd, developed in 2015. In 2017, Google and IBM joined forces to create the
Istio service mesh after they found out that they had been working on similar ideas.
The public attention Istio received through Google and IBM as main contributors
was amplified by media campaigns and conference talks. By the end of 2017, Linkerd
announced a new, more opinionated service mesh only for Kubernetes that was first
named Conduit and later Linkerd 2. By 2018, the term service mesh was ubiquitous and
more products and companies joined the party. Consul added service mesh features
and AWS announced their own service mesh implementation AWS App Mesh.
5.1 Linkerd 2
Linkerd 21 is the successor of Linkerd. Although Linkerd had been adopted in large
production systems, the software was too complex to configure and did not perform
well. These limitations led Buoyant to develop a completely new service mesh.
The result, formerly Conduit and today called Linkerd 2, was presented in December
2017 as an open source project, written in Go and Rust. Linkerd 2 is an incubation
project of the CNCF2 (Cloud Native Computing Foundation) and currently the only
service mesh in the CNCF portfolio.
Although Kubernetes is the most common platform to be used with Istio, Istio is
designed to work with any environment. The complexity of Istio is to some extent
caused by this platform-neutral design. While the first version of Linkerd was designed
in a similar way, Linkerd 2 is committed to Kubernetes only. Linkerd 2 implements
most service mesh features such as monitoring, routing, retries, timeouts, and mTLS.
Circuit breaking, tracing, and authorization are missing. While most service mesh
implementations use the Envoy service proxy, Linkerd includes its own service proxy
implementation (linkerd-proxy).
1 https://linkerd.io
2 https://www.cncf.io
Rewriting Linkerd for usability and performance while integrating with Kubernetes
seems to have worked out. The API of Linkerd 2 is much cleaner and more consistent
than Istio’s. It introduces only one Kubernetes CRD (Custom Resource Definition)
and provides a carefully considered dashboard, shown in figure 5.1. Kubernetes users
should seriously consider it since it includes most service mesh features in a
production-ready state, has a small resource and performance footprint, and provides
an excellent developer experience.
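To give an impression of the configuration style, the following is a minimal sketch of a Linkerd 2 ServiceProfile, assuming that this is the CRD meant above. The service name, the route, and all values are hypothetical and only illustrate how a retry and a timeout could be declared for one route of an order service.

```yaml
# Sketch only: service name, route, and values are assumptions.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # A ServiceProfile is named after the fully qualified name of the service.
  name: order.default.svc.cluster.local
  namespace: default
spec:
  routes:
  - name: GET /order
    condition:
      method: GET
      pathRegex: /order
    # Failed requests on this route may be retried.
    isRetryable: true
    # Requests on this route time out after 5 seconds.
    timeout: 5s
  # Limits the extra load that retries may add to the service.
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```

Instead of a fixed number of attempts per request, the retry budget caps retries as a share of the regular traffic, which keeps retries from amplifying load on a service that is already struggling.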
5.2 Consul
Consul3, originally a solution for service discovery, recently added service mesh
features. Consul supports service discovery, authorization, mTLS, monitoring through
metrics, and routing capabilities. The latter were added by integrating with the Envoy
proxy that Istio and other service meshes are also using. Resilience features like circuit
breaking, retry, and timeout are yet to be developed. Like Istio, Consul does not
depend on any specific orchestration platform, but is compatible with Kubernetes and
Nomad.
3 https://www.consul.io
AWS App Mesh is the most recent service mesh implementation, but since AWS has the
biggest market share for cloud computing, it is expected to stay and grow.
As shown in figure 5.2, Istio has overtaken all other service mesh implementations in
terms of feature completeness and configurability. But this rich feature set has turned
Istio into a complex component that can be hard to manage in practice. In cases where
not all features and their customizability are required, Linkerd 2, Consul, and AWS App
Mesh might be better choices.
4 http://aws.amazon.com/app-mesh/
5 https://github.com/aws/aws-app-mesh-controller-for-k8s
Figure 5.2: Features of service mesh implementations as of August 2019
Another criterion is the platform. If the whole application runs in Kubernetes anyway,
users can benefit from the simplicity of Linkerd 2. If Consul or AWS is already in
use, the corresponding service mesh implementation might cause the least friction.
If multiple clusters or legacy applications outside of the cluster are involved, Istio
provides appropriate concepts and configuration as mentioned in chapter 4.
Many microservice applications have to deal with higher latency than monoliths.
Communication in a monolith is always a method call in the same process.
Communication between microservices goes through the network stack and therefore
has a higher latency. A service mesh adds further latency, which is not completely
offset by improved load balancing strategies. In addition, the sidecars double the
number of running containers, which has an impact on resource consumption. Fair
benchmarks in this space are hard to find because of the different feature sets and
configurations. Most results show that although Istio’s performance has improved over
time, Linkerd 2 performs better under high load and uses far fewer resources.
6 Conclusion
Microservices are here to stay. And service meshes are likely to be their long-term
companions. Many monitoring, routing, resilience, and security features implemented
inside each microservice can and will be taken over by a service mesh over time. These
concepts have already proven themselves in the Google infrastructure. In some cases,
the service mesh cannot – and should not – fully implement the desired functionalities.
A service mesh has a black-box view of the microservice, so internal information of the
microservice is hidden from it. For example, the service mesh can only provide metrics
about requests but not internal metrics like thread or database pool sizes. Therefore,
in the example in chapter 4 logging had to be implemented in the microservices despite
the service mesh.
Asynchronous Communication
A service mesh is not limited to traditional microservices following the
request-response paradigm. Microservices that communicate asynchronously also benefit
from monitoring and security features as well as from tools like Prometheus, Grafana,
Jaeger, or Kiali.
However, asynchronous microservices will probably not use HTTP at all, so routing is
of little use. The resilience features are not useful either. Asynchronous microservices
will eventually work through the messages in a queue. So if a microservice is currently
not available, processing the message will simply take more time, until the
microservice is available again. The system should be able to deal with that, though.
Maturity
Current implementations such as Istio, Linkerd, Consul, and AWS App Mesh are
already mature and stable enough to be considered for production systems. However, it
is essential to carefully test and compare the technologies with regard to the individual
use case. As the example in chapter 4 shows, a service mesh is a piece of infrastructure
and can be added and removed easily without changing application code. So it is
relatively easy to test a service mesh for a concrete application without investing too
much effort.
Overhead
One of the few drawbacks of service meshes is the mental overhead they add for the
developers. But the challenges service meshes solve must be dealt with in a microservices
system anyway, which raises the question whether there is any simpler alternative
to a service mesh. Compared to other available solutions, the mental overhead of a
service mesh appears to be the smaller problem. The complexity of a service mesh
seems rather necessary to deal with the complexity of the distributed microservice
architecture itself.
However, not all service meshes have the same complexity. Istio provides many
advanced features. That increases complexity but might only pay off in very few
scenarios. Linkerd 2, on the other hand, has fewer features, but is also less complex.
For example, Linkerd 2 just defines one Kubernetes CRD (Custom Resource Definition)
while Istio has many. These add to the complexity of the configuration of the service
mesh. However, Istio can also be stripped down. It is possible to exclude any part of
the system from the installation.
Recent efforts such as the SMI (Service Mesh Interface) aim to reduce mental
overhead and to standardize service mesh features. They will also pave the way for
even more tools on top of service mesh capabilities.
The second drawback is the technical overhead: a service mesh adds to the latency and
the resource demands. However, this overhead will decrease since all current service
mesh implementations are constantly being improved.