{"id":79781,"date":"2022-10-11T09:00:00","date_gmt":"2022-10-11T16:00:00","guid":{"rendered":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/?post_type=lf_case_study&#038;p=79781"},"modified":"2024-02-15T22:15:05","modified_gmt":"2024-02-16T06:15:05","slug":"datadog","status":"publish","type":"lf_case_study","link":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/case-studies\/datadog\/","title":{"rendered":"Datadog"},"content":{"rendered":"<section class=\"hero alignfull \">\n\n<figure class=\"hero__figure\">\n\t<picture><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 2880px)\"><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 1920px)\"><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 1440px)\"><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 1200px)\"><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 1024px)\"><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 768px)\"><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 600px)\"><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 414px)\"><source srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 375px)\"><source 
srcset=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" media=\"(min-width: 0px)\">\n\t<img decoding=\"async\" src=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-content\/uploads\/2022\/10\/Bits_Mascot-1-scaled.jpg\" class=\"hero__image\" alt=\"Bits the datadog plush logo\">\n\t<\/picture><\/figure>\n\n<div class=\"container wrap hero__text-overlay title-wrapper\">\n\n<span>\n\t<\/span>\n\n<h1 class=\"is-style-case-study-title\">Datadog<\/h1>\n\n<\/div>\n<\/section>\n\n\t\n\n\n\n<section\n\tclass=\"wp-block-lf-case-study-highlights \">\n\n\t\t<p\n\t\tclass=\"case-study-title is-style-spaced-uppercase\">By the numbers<\/p>\n\t\n\t<div class=\"case-study-highlights columns-three\">\n\n\t\t<!-- col 1 -->\n\t\t<div class=\"case-study-highlights__col\">\n\n\t\t\t\t\t\t<p\n\t\t\t\tclass=\"case-study-highlights__heading is-style-spaced-uppercase has-large-font-size has-text-color has-purple-color\">\n\t\t\t\t10 trillion+<\/p>\n\t\t\t\t\t\t\t<p\n\t\t\t\tclass=\"case-study-highlights__text has-large-font-size\">Data points every day<\/p>\n\t\t\t\n\t\t<\/div>\n\n\t\t<!-- col 2 -->\n\t\t<div class=\"case-study-highlights__col\">\n\n\t\t\t\t\t\t<p\n\t\t\t\tclass=\"case-study-highlights__heading is-style-spaced-uppercase has-large-font-size has-text-color has-purple-color\">\n\t\t\t\t18,500+<\/p>\n\t\t\t\t\t\t\t<p\n\t\t\t\tclass=\"case-study-highlights__text has-large-font-size\">Customers using SaaS product<\/p>\n\t\t\t\n\t\t<\/div>\n\n\t\t<!-- col 3 -->\n\t\t<div class=\"case-study-highlights__col\">\n\n\t\t\t\t\t\t<p\n\t\t\t\tclass=\"case-study-highlights__heading is-style-spaced-uppercase has-large-font-size has-text-color has-purple-color\">\n\t\t\t\t10,000+<\/p>\n\t\t\t\t\t\t\t<p\n\t\t\t\tclass=\"case-study-highlights__text has-large-font-size\">Nodes &amp; Pods across multiple cloud providers<\/p>\n\t\t\t\n\t\t<\/div>\n\t<\/div>\n\n<\/section>\n\t\n\n\n<h3 class=\"wp-block-heading\">Problem: 
Limited scalability with iptables, IPVS, and kube-proxy<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Datadog runs tens of Kubernetes clusters across multiple cloud providers. Some of them have up to 4,000 nodes, which makes for tens of thousands of hosts in their infrastructure. Datadog\u2019s SaaS product serves over 18,500 customers with millions of hosts reporting in, resulting in tens of trillions of data points per day. Such a large and growing infrastructure comes with challenges.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For example, enabling cross-cluster communication in some environments isn&#8217;t simple, and ensuring that it is secure and encrypted complicates the process even further. There was also a need for routable pod IPs, which improve performance and enable the direct cross-cluster communication often needed to reduce the management burden of data stores such as Kafka or Cassandra. With many clusters and IPs routed between them, extra care must be taken to properly manage the IP address spaces and cross-cluster service discovery.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Datadog\u2019s initial approach was to use cloud-specific solutions, like <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/github.com\/lyft\/cni-ipvlan-vpc-k8s\">Lyft\u2019s CNI plugin<\/a> for AWS. However, that meant finding additional solutions for other cloud providers. In addition, maintaining very different approaches and technologies across multiple clouds (many without operators to automate tasks) and managing provider and solution differences was time-intensive. These solutions addressed neither the need for streamlined network policies nor the requirement for traffic encryption.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Another challenge was service load balancing. The usual approach is to use iptables, but as the environment grows, the rule count also grows, making this difficult to scale. 
This can push update time past 10 seconds in some cases, and it also lengthens the time needed to walk through the rules to find the ones that apply (&#8220;matching time&#8221;). In the words of Laurent Bernaille, Staff Engineer at Datadog:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-style-smaller-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\u201cWhen you scale, and you have a large number of services and endpoints, iptables becomes challenging. <em>We can use iptables for load balancing, but it was not designed for it.\u201d <\/em><\/p>\n<cite> Laurent Bernaille, Staff Engineer, Datadog<\/cite><\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">As an alternative, they tried <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/en.wikipedia.org\/wiki\/IP_Virtual_Server\">IPVS<\/a>, which was more powerful but brought another set of challenges, especially because IPVS was not designed for client-side load balancing. Connection tracking had to be done twice, once for IPVS and once for netfilter, and IPVS lacked feature parity with iptables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The pain of using these solutions grew even more intense because they were difficult to debug, and a long time could pass between finding an issue and getting a fix into production. As a result, the team asked itself: \u201cWhat if we could dynamically program these features?\u201d<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Solution: Cilium and eBPF powering the Kubernetes networking data plane<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Datadog went to <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/kubecon.io\/\">KubeCon + CloudNativeCon<\/a> and spoke with other teams in the hallway track to find out what solutions they were considering for their scaling problems, and many of them were considering Cilium. 
When they spoke with the Cilium development team, they learned that all the features Datadog found interesting were already in the works or almost done. They started using some of the features in early beta, including running v1.6 in production, and gave feedback to the team.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-style-smaller-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\u201cOf course we saw some small issues and edge cases, but our interaction with upstream has always been very good and we have engineers who hadn\u2019t contributed before that were quickly able to become contributors.\u201d <\/p>\n<cite> Laurent Bernaille, Staff Engineer, Datadog<\/cite><\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">As part of the new architecture, Cilium was used to replace kube-proxy, addressing the growing pains with iptables and IPVS. Leveraging eBPF\u2019s efficiency also allowed Datadog to enforce network policies, which strengthened their security and was required for a certification and for regulated environments. Cilium has now become the default CNI for Kubernetes at Datadog and runs on almost every node of their SaaS offering. This allows them to abstract away cloud provider differences and run a consistent network data plane.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-style-smaller-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\u201cAs the universal default Kubernetes CNI, Cilium can be used on multiple clouds, allowing network policies to be enforced across clouds where they are needed.\u201d <\/p>\n<cite> Laurent Bernaille, Staff Engineer, Datadog<\/cite><\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Beyond the SaaS offering, Datadog also has an integration with Cilium that lets their customers send Cilium\u2019s metrics and logs into Datadog. The team is always looking for ways to improve both their infrastructure and Cilium. 
They are currently working on a few improvements around supporting multiple CIDR ranges in Cilium and are testing <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.youtube.com\/watch?v=QTSS6ktK8hY\">Bandwidth Manager<\/a> to guarantee better throughput.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-style-smaller-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">&#8220;eBPF and Cilium helped us to push the boundaries both within operations and also with product development. To do things safer, faster and more easily than what we could have with traditional techniques.&#8221; <\/p>\n<cite> Laurent Bernaille, Staff Engineer, Datadog<\/cite><\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Datadog has started using Envoy for more use cases and is very interested in the sidecarless approach developed in the context of <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/isovalent.com\/blog\/post\/cilium-service-mesh\/\">Cilium Service Mesh<\/a>. They are also looking outside the cluster and considering using eBPF to create a smarter and more efficient network edge in the future. Using <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/cilium.io\/blog\/2022\/04\/12\/cilium-standalone-L4LB-XDP\/\">Cilium L4LB XDP<\/a> to create their own load balancers rather than relying upon a cloud provider would allow them to provide a consistent experience to end users.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\u201cThe overlay features in a single product, compatibility with multiple cloud providers, and ability to just run it. 
These three things are what made Cilium an obvious choice for us.\u201d <\/p>\n<cite> Laurent Bernaille, Staff Engineer, Datadog<\/cite><\/blockquote>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Logs Told Us It Was DNS, It Felt Like DNS, It Had To Be DNS, I... Laurent Bernaille &amp; Elijah Andrews\" width=\"500\" height=\"281\" src=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.youtube.com\/embed\/NunyPkN0n3c?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Hear more about Datadog\u2019s journey in their videos from <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.youtube.com\/watch?v=6nlv_VCsjpQ\">eBPF Summit<\/a> and <a href=\"https:\/\/2.zoppoz.workers.dev:443\/https\/www.youtube.com\/watch?v=NunyPkN0n3c\">KubeCon + CloudNativeCon Europe 2022<\/a>.<\/p>\n","protected":false},"featured_media":79785,"template":"","meta":{"_acf_changed":false,"content-type":"","lf_case_study_long_title":"How Datadog uses Cilium & eBPF to power their data plane","lf_case_study_key_stat":"10+ Trillion","lf_case_study_key_stat_label":"Data points every 
day","lf_case_study_company_logo":79786,"lf_case_study_homepage_company_logo":102086,"lf_case_study_homepage_image":102079,"footnotes":""},"lf-project":[1175,126],"lf-country":[432],"lf-product-type":[],"lf-cloud-type":[709],"lf-challenge":[279,1302,281,1215,779,244,239],"lf-industry":[270],"class_list":["post-79781","lf_case_study","type-lf_case_study","status-publish","has-post-thumbnail","hentry","lf-project-cilium","lf-project-kubernetes","lf-country-north-america","lf-cloud-type-multi","lf-challenge-automation","lf-challenge-compliance","lf-challenge-latency","lf-challenge-observability","lf-challenge-performance","lf-challenge-scaling","lf-challenge-security","lf-industry-software"],"acf":[],"_links":{"self":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf_case_study\/79781","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf_case_study"}],"about":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/types\/lf_case_study"}],"version-history":[{"count":5,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf_case_study\/79781\/revisions"}],"predecessor-version":[{"id":81568,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf_case_study\/79781\/revisions\/81568"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/media\/79785"}],"wp:attachment":[{"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/media?parent=79781"}],"wp:term":[{"taxonomy":"lf-project","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf-project?post=79781"},{"taxonomy":"lf-country","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf-country?post=79781"},{"taxonomy":"lf-product-type","embeddable
":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf-product-type?post=79781"},{"taxonomy":"lf-cloud-type","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf-cloud-type?post=79781"},{"taxonomy":"lf-challenge","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf-challenge?post=79781"},{"taxonomy":"lf-industry","embeddable":true,"href":"https:\/\/2.zoppoz.workers.dev:443\/https\/www.cncf.io\/wp-json\/wp\/v2\/lf-industry?post=79781"}],"curies":[{"name":"wp","href":"https:\/\/2.zoppoz.workers.dev:443\/https\/api.w.org\/{rel}","templated":true}]}}