Daniel Jacobson
• @daniel_jacobson
• http://www.linkedin.com/in/danieljacobson
• http://www.slideshare.net/danieljacobson
Sangeeta Narayanan
• @sangeetan
• http://www.linkedin.com/in/sangeetanarayanan/
I have added more detail in the
notes field for each slide to provide
additional context
Strategy
Lessons
Implementation
Lessons
Know Your Audience
Target Audience Dictates
Everything Else
The target audience should be
the single biggest influence on
your API design
Small Set of Known Developers
SSKDs
Large Set of Unknown Developers
LSUDs
Both
SSKDs and LSUDs
No matter what…
Figure this out first!
Target Audience Influence
• Team Identity
• Staffing Decisions
• System Architecture
• SLAs
• Development Velocity
• Security Needs
Top 10 Lessons Learned from the Netflix API - OSCON 2014
Netflix API : Key Responsibilities
2008
• Broker data between internal services and
public developers
• Grow community of public developers
• Optimize design for reusability
Evangelists
Partner
Engagement
and Support
API
Engineers
Technical
Writer
QA
Specialists
Private API Public API
< 0.3% of total
API traffic *
* 11 years' worth of public API requests = one day of private API requests
Netflix API : Key Responsibilities
Today
• Broker data between services and devices
• System resiliency
• Scaling the system
• High velocity development
• Insights
The consumers of the API are now
Netflix subscribers
We are now responsible for ensuring
subscribers can stream
Application
Engineers
Platform
Engineers
Technical
Writer
Tools and
Automation
Engineers
Team is now 6x its
size from 2010
Separation of Concerns
Primary Responsibilities of APIs
• Data Gathering
– Retrieving the requested data from one or many local
or remote data sources
• Data Formatting
– Preparing a structured payload to the requesting
agent
• Data Delivery
– Delivering the structured payload to the requesting
agent
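A minimal sketch of this separation (illustrative only, not Netflix's actual code): the three responsibilities become independently swappable stages, so a consumer can pick its format and delivery without touching how data is gathered.

```python
import json

def gather(user_id, catalog):
    # Data gathering: owned by the provider; consumers don't care how.
    return {"user": user_id, "titles": catalog.get(user_id, [])}

def format_payload(data, fmt):
    # Data formatting: each consumer wants its own representation.
    if fmt == "json":
        return json.dumps(data, sort_keys=True)
    if fmt == "csv":
        return ",".join(data["titles"])
    raise ValueError("unsupported format: " + fmt)

def deliver(payload):
    # Data delivery: returned directly here; a real system might stream it.
    return payload

def handle_request(user_id, fmt, catalog):
    # The pipeline keeps the three concerns independently replaceable.
    return deliver(format_payload(gather(user_id, catalog), fmt))

catalog = {"u1": ["Movie A", "Movie B"]}
print(handle_request("u1", "csv", catalog))
```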
There are two players in APIs
API Provider API Consumer
API Provider
PROVIDES
API Consumer
CONSUMES
Traditional API Interactions
API Provider
PROVIDES
EVERYTHING
API Consumer
CONSUMES
"Everything" means the API provider does:
• Data Gathering
• Data Formatting
• Data Delivery
• (among other things)
Traditional API Interactions
Why do most API providers provide
everything?
• Many APIs have a large set of unknown and
external developers
• Generic API design tends to be easier for
teams closer to the source
• Centralizing API functions makes them
easier to support
Data Gathering | Data Formatting | Data Delivery
API Consumer: Doesn't care how data is gathered, as long as it is gathered | Cares a lot about the format for its specific use | Cares a lot about how the payload is delivered
API Provider: Cares a lot about how the data is gathered | Only cares about the format to the extent it is easy to support | Only cares that the delivery method is easy to support
Separation of Concerns
To be a better provider, the API should address the
separation of concerns of the three core functions
One Size Doesn’t Fit All
Embrace the Differences
Screen Real Estate
Controllers
Technical Capabilities
Resource-Based API
vs.
Experience-Based API
Resource-Based Requests
• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recommendations
• /catalog/titles/movie
• /catalog/titles/series
• /catalog/people
REST API
[Diagram: devices make many granular requests across the network border to the REST API, which gathers data from dependent services — recommendations, movie data, similar movies, auth, member data, A/B tests, start-up, ratings.]
Experience-Based Requests
• /ps3/homescreen
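To make the contrast concrete, a toy sketch (all function and endpoint names are hypothetical): a resource-based client pays one network round trip per resource, while an experience-based endpoint composes the same data server-side in a single call.

```python
# Both interaction styles over the same toy back-end.

def get_recommendations(uid):
    return ["Title 1", "Title 2"]

def get_ratings(uid):
    return {"Title 1": 4}

def get_member(uid):
    return {"name": "Ada"}

def resource_based_client(uid):
    # Resource-based: the device crosses the network border once per resource.
    responses = [get_recommendations(uid), get_ratings(uid), get_member(uid)]
    return len(responses)  # number of round trips

def ps3_homescreen(uid):
    # Experience-based: one request; a device-specific server-side script
    # (Groovy at Netflix) fans out in-process and shapes the payload.
    return {"rows": get_recommendations(uid),
            "greeting": "Hello, " + get_member(uid)["name"]}

print(resource_based_client("u1"), ps3_homescreen("u1")["greeting"])
```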
JAVA API
[Diagram: the device makes a single request across the network border to a device-specific adapter script, which calls the Java API in-process; the Java API gathers data from the same dependent services — recommendations, movie data, similar movies, auth, member data, A/B tests, start-up, ratings.]
Client Adapter Code
Be Pragmatic, Not Dogmatic
Common API Debates
• XML / JSON
• REST / SOAP
• OAuth / Other
• Versioning
• Hypermedia
Who Cares!?!?
Just Solve Problems for your
Audience
Embrace Change
Impermanence and Versionless APIs
v1.0
v1.5
v2.0
Versioning for APIs
1.0
1.5
2.0
3.0?
4.0?
5.0?
2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Eliminate Versioning?
1.0
1.5
2.0
New Architecture
2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
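One way to sketch the versionless idea (illustrative, not Netflix's code; names invented): evolve at the method level inside the Java API layer — add a new method, migrate the calling scripts, deprecate the old one — rather than versioning the device-facing interface.

```python
import warnings

class JavaApiFacade:
    """Method-level evolution instead of URL versioning (names invented)."""

    def get_title(self, title_id):
        # The new method; calling scripts are migrated with the SSKD teams.
        return {"id": title_id, "name": "Title %d" % title_id}

    def get_title_name(self, title_id):
        # The old method stays alive while callers migrate, then gets removed.
        warnings.warn("get_title_name is deprecated; use get_title",
                      DeprecationWarning)
        return self.get_title(title_id)["name"]

api = JavaApiFacade()
print(api.get_title(7)["name"])
```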
JAVA API
[Diagram: device-specific scripts call the Java API across the network border; the Java API fronts the dependent services — recommendations, movie data, similar movies, auth, member data, A/B tests, start-up, ratings.]
Act Fast, React Fast
Favor Velocity Over Completeness
Delivery Using Buckets
Testing
Production Traffic
Old Code (Baseline) New Code (Canary)
~1% Traffic
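A minimal sketch of the canary split (the hashing scheme here is an assumption, not Netflix's implementation): deterministically route roughly 1% of production requests to the new code so a given request id always lands on the same side.

```python
import hashlib

def route(request_id, canary_pct=1.0):
    # Hash to a stable bucket out of 10,000 so routing is deterministic.
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 10000
    return "canary" if bucket < canary_pct * 100 else "baseline"

sent = [route("req-%d" % i) for i in range(10000)]
print("canary share:", sent.count("canary") / len(sent))
```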
Deployments
Old Code New Code
Production Traffic
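Red/Black in miniature (a toy model, not the actual Netflix tooling): both clusters stay running, and promotion or rollback is just repointing where traffic goes.

```python
class RedBlackRouter:
    def __init__(self, old_cluster, new_cluster):
        # Both clusters run simultaneously; only one takes traffic.
        self.clusters = {"old": old_cluster, "new": new_cluster}
        self.active = "old"

    def handle(self, request):
        return self.clusters[self.active](request)

    def promote(self):
        # Route 100% of production traffic to the new code.
        self.active = "new"

    def rollback(self):
        # Fast, automated rollback: just flip traffic back.
        self.active = "old"

router = RedBlackRouter(lambda r: "v1:" + r, lambda r: "v2:" + r)
router.promote()
print(router.handle("homescreen"))
```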
Enable Others to
Act Fast, React Fast
JAVA API
[Diagram: the Java API fronting the dependent services, with the scripting tier above it.]
Dynamically deployed
endpoints
Statically deployed
libraries
Dynamically deployed
endpoints
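A sketch of the dynamic-endpoint idea (hypothetical API, not Netflix's code): endpoint handlers are registered into a live routing table at runtime, so client teams ship and roll back without a server redeploy.

```python
class EndpointRegistry:
    def __init__(self):
        self.routes = {}    # live handlers
        self.previous = {}  # last version, kept for fast rollback

    def deploy(self, path, handler):
        # New endpoints go live without restarting the API server.
        if path in self.routes:
            self.previous[path] = self.routes[path]
        self.routes[path] = handler

    def rollback(self, path):
        # Rollback is just re-pointing at the prior handler.
        self.routes[path] = self.previous[path]

    def dispatch(self, path, *args):
        return self.routes[path](*args)

registry = EndpointRegistry()
registry.deploy("/ps3/homescreen", lambda uid: {"version": 1, "user": uid})
registry.deploy("/ps3/homescreen", lambda uid: {"version": 2, "user": uid})
registry.rollback("/ps3/homescreen")
print(registry.dispatch("/ps3/homescreen", "u1"))
```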
Dependency Canaries
[Diagram, current flow: the Personalization Service team builds, tests, deploys its service, and releases a client library; the API team must then integrate the library, build, test, and deploy the API server before a UI script can access the data. Iterations take hours or days.]
[Diagram, proposed flow: the Personalization Service team builds, tests, deploys its service, and publishes directly to the API, so a UI script can access the data without waiting on the API team's integrate/build/test/deploy cycle. Iterations in minutes?]
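The proposed pass-through model (described in the speaker notes) might look roughly like this sketch — the API forwards key-value data without a static domain-model update; all names here are illustrative.

```python
class PassThroughApi:
    """Dictionary-based pass-through (sketch; names invented)."""

    def __init__(self):
        self.store = {}

    def publish(self, service, key, value):
        # A dependency exposes a new data element with no API redeploy.
        self.store[(service, key)] = value

    def get(self, service, key):
        return self.store.get((service, key))

api = PassThroughApi()
api.publish("personalization", "topPicksRow", ["Title A", "Title B"])
print(api.get("personalization", "topPicksRow"))
```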
Internal Developers Need
Engagement Too
Documentation
Tools
REPL
Trainings
Failure is Inevitable
~5,000,000,000
Requests per day
~35
Dependencies
~600
Libraries
Things will break!
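The speaker notes point to Hystrix here, which implements bulkheading and circuit breakers. A minimal circuit breaker in that spirit (illustrative, not the Hystrix API): after enough failures the breaker opens and serves a fallback instead of hammering the broken dependency.

```python
class CircuitBreaker:
    """Minimal circuit breaker in the spirit of Hystrix (not its real API)."""

    def __init__(self, fn, fallback, max_failures=3):
        self.fn = fn
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0

    def call(self, *args):
        if self.failures >= self.max_failures:
            # Circuit is open: skip the failing dependency entirely.
            return self.fallback(*args)
        try:
            result = self.fn(*args)
            self.failures = 0  # a success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args)

def broken_dependency(uid):
    raise RuntimeError("dependency down")

breaker = CircuitBreaker(broken_dependency, fallback=lambda uid: [])
print([breaker.call("u1") for _ in range(5)])
```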
Scale at All Costs
[Chart: API requests per month, in billions — growing from near zero in June 2010 into the tens of billions by June 2012.]
Incoming Traffic
Predictive Auto Scaling
Predicted vs. Actual RPS
Reactive + Predictive Autoscaling
[Chart: number of instances over time.]
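A Scryer-flavored sketch (the numbers and averaging scheme are invented for illustration, not Scryer's algorithm): forecast each hour's RPS from the same hour in prior weeks, then size the fleet ahead of the load instead of reacting to it.

```python
import math

def predict_instances(history, capacity_per_instance=1000, headroom=1.2):
    # history: list of weeks, each a list of hourly RPS samples.
    hours = len(history[0])
    # Week-over-week average for each sampled hour.
    forecast = [sum(week[h] for week in history) / len(history)
                for h in range(hours)]
    # Provision ahead of the predicted load, with headroom.
    return [max(1, math.ceil(rps * headroom / capacity_per_instance))
            for rps in forecast]

history = [[8000, 20000, 12000],   # last week's RPS for three sample hours
           [10000, 22000, 14000]]  # this week's RPS for the same hours
print(predict_instances(history))
```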
Strategy Lessons
1. Know Your Audience
2. Separation of Concerns
3. One Size Doesn't Fit All
4. Be Pragmatic, Not Dogmatic
5. Embrace Change
Implementation Lessons
1. Act Fast, React Fast
2. Enable Others to Act Fast, React Fast
3. Internal Developers Need Engagement Too
4. Failure is Inevitable
5. Scale at All Costs
http://github.com/Netflix
Daniel Jacobson
• @daniel_jacobson
• http://www.linkedin.com/in/danieljacobson
• http://www.slideshare.net/danieljacobson
Sangeeta Narayanan
• @sangeetan
• http://www.linkedin.com/in/sangeetanarayanan/
Editor's Notes
  • #4: The lessons that we discuss in these slides fall into two buckets: strategy and implementation.
  • #7: In some cases, the audience will be a small set of known developers (SSKDs). These developers are generally engineers within your company or one with whom you are partnering.
  • #8: In other cases, the audience may be a large set of unknown developers. This audience is typically associated with public APIs.
  • #9: And in some cases, the API will target both audience types.
  • #11: This is a short list of the things that the target audience will influence.
  • #12: For Netflix, we started out with a public API, with the audience being a large set of unknown developers. There were no internal use cases at launch.
  • #14: Based on the target audience of unknown developers, we staffed accordingly. The team was relatively small, with skills around development, evangelism, partnering, testing and documentation.
  • #15: As streaming became more critical to the company, we started having devices use the API. Our first mistake was that we were probably too late to pivot our architecture based on our change in target audience. At the time, we had many devices call into our REST API, the same one that we used for the unknown developers.
  • #16: But eventually, the data demonstrated that the architectural change was needed. This chart shows that the private API completely dwarfs the public API in terms of requests. The private API does about five billion requests per day while the public API does between one and two million. This disparity clearly demonstrates the need for us to target the API to the small set of known developers – Netflix's UI engineers – who build the vast majority of the experiences on Netflix devices.
  • #19: Given the shift in responsibilities, we positioned the team accordingly, hiring for skills mostly around engineering.
  • #20: And the team size grew by about 6x in the last few years. If the target audience was still the public API, it is likely that the team size would have grown, but less significantly (perhaps 2x) in that time frame.
  • #31: API consumers care a lot about data formatting and delivery, but each consumer, in such a diverse ecosystem, cares about them differently. Some devices may want an XML payload delivered as a complete document, while others may need JSON, protocol buffers, or some other format, potentially delivered as streamed bits. Because of these diverse needs, we need to separate out the concerns to better enable the consumers to get what they need.
  • #32: Most companies focus on a small handful of device implementations, most notably Android and iOS devices.
  • #33: At Netflix, we have more than 1,000 different device types that we support. Across those devices, there is a high degree of variability. As a result, we have seen inefficiencies and problems emerge across our implementations. Those issues also translate into issues with the API interaction.
  • #34: For example, screen size could significantly affect what the API should deliver to the UI. TVs with bigger screens can potentially fit more titles and more metadata per title than a mobile phone. Do we need to send all of the extra bits for fields or items that are not needed, requiring the device itself to drop items on the floor? Or can we optimize the delivery of those bits on a per-device basis? Different devices have different controllers as well. Some, like the iPad, allow for fast swipe interactions, so the content needs to be there for the entire row. Other devices, like smart TVs or some game consoles, have LRUD controllers, which at least gives us the opportunity to fetch the data as the row gets navigated. And the technical capabilities of the devices will influence the interactions as well. Some have more computing power or memory, which will influence how much data you can process on the device vs. how much needs to be gathered in real time.
  • #35: We evolved our discussion towards what ultimately became a discussion between resource-based APIs and experience-based APIs.
  • #36: The original one-size-fits-all API was very resource oriented with granular requests for specific data, delivering specific documents in specific formats.
  • #37: The interaction model looked basically like this, with (in this example) the PS3 making many calls across the network to the OSFA API. The API ultimately called back to dependent services to get the corresponding data needed to satisfy the requests.
  • #38: We have decided to pursue an experience-based approach instead. Rather than making many API requests to assemble the PS3 home screen, the PS3 will potentially make a single request to a custom, optimized endpoint.
  • #39: In an experience-based interaction, the PS3 can potentially make a single request across the network border to a scripting layer (currently Groovy), in this example to provide the data for the PS3 home screen. The call goes to a very specific, custom endpoint for the PS3 or for a shared UI. The Groovy script then interprets what is needed for the PS3 home screen and triggers a series of calls to the Java API running in the same JVM as the Groovy scripts. The Java API is essentially a series of methods that individually know how to gather the corresponding data from the dependent services. The Java API then returns the data to the Groovy script who then formats and delivers the very specific data back to the PS3.
  • #45: Our original REST API had granular endpoints and generic interaction models. This leads to different versions when significant changes are made. The REST API had three primary versions before our move to the experience-based API.
  • #46: If we had persisted with the REST API, we very likely would have continued to add versions while needing to support the old ones. The need to support prior versions stems from older device implementations that may not be able to be updated or retired, thus forcing us to maintain these endpoints for a long time (perhaps as long as 10 years).
  • #47: Our target with the experience-based API was to build an architecture that allowed us to be versionless. Through SSKDs, separation of concerns, abstraction layers, and interaction optimizations, we are able to move to a deprecation model.
  • #48: The primary goal is to limit versioning in the device-to-server interaction. Ideally, we can deprecate effectively in the server interactions as well, but that is sometimes more difficult. Back to our architecture view, the data can now flow from the services into the Java APIs. We expose granular methods (think data elements rather than resources) to the scripting tier. If a method needs to change, we can add a new method and then work closely with the SSKDs to migrate the calling scripts, enabling us to deprecate the old method. If we are not able to move the scripts, we can insulate the devices from the change either in the Java layer or in the scripting tier.
  • #50: Several years ago, we were deploying changes roughly every two weeks. We would accumulate changes over that time and then drop them into production all at once. Think of it as gathering water in a bucket.
  • #51: What we found was that our releases were unpredictable, sometimes resulting in outages, broken functionality, or incomplete work. Accordingly, we decided to slow down, changing our release cycles to three weeks. We figured that would give us more time to test our work. In other words, we got a larger bucket.
  • #52: Over time, however, we learned that the longer release cycle didn’t improve predictability or quality. Instead, it just slowed us down. In response, we moved aggressively towards continuous delivery. Instead of delivering water in buckets, we had a steady stream of water from a hose. This enabled us to have smaller changes, more isolated and testable, pushed to production instead of having bigger releases with more complexity.
  • #53: This is how code flows through the system. We have multiple canary releases per day. Internal envs are deployed ~8 times/day in 3 AWS regions. Prod deployments happen 2-3 times/week and can be triggered on demand.
  • #54: This dashboard lets us track the status of our master branch at any time. Builds that fail at any step in the pipeline are stopped from going further.
  • #55: A quick word on Testing. We follow the ‘Operate what you Build’ model where developers are responsible for shepherding their changes all the way through to production. We provide them with the tools necessary to help them gain confidence in the quality of their code. One such tool is the automated Canary Analyzer.
  • #56: Canary Analysis is the process wherein a small percent of traffic is routed to the new code and its performance is compared against the old code based on 1000s of metrics.
  • #57: A detailed report gives further insight into potential problem areas. In this case, our canary gives a score of 87%, which means it is likely not ready for release.
  • #58: In tandem with canaries, we use Red/Black deployments as well.
  • #59: The Red/Black process allows us to run production code in one cluster while we spin up the new code in a second one. As the new code proves itself, we can route all traffic to it and eventually shut down the old. It also allows us to have a fast, automated rollback in the event that the new code is seeing problems.
  • #61: Our architecture enables us to move faster because of the scripting tier. But this also puts us in a position to help our consuming teams and dependency teams move faster as well.
  • #62: Let's peek under the hood of the API Server. Client teams deploy endpoints dynamically based on their own schedule. Their cycles are completely independent of server deployments. Newly deployed endpoints are live and ready to take traffic within minutes.
  • #63: Endpoint Activity Dashboard shows recent deployment activity. Rollbacks can be performed in a matter of minutes as well.
  • #64: Our dependent services provide us with client libraries that get compiled into our JVM upon deployment. These libraries typically expose static interfaces, which means changes to the interfaces require coding and deployments outside our control. Similar to the dynamic endpoints, we also have an opportunity to improve the nimbleness and velocity around these libraries.
  • #65: One such improvement is dependency canaries, where we are evaluating our new code against the dependencies. This is a dashboard that provides insights into these canaries.
  • #66: Making the interaction with the consumers of the API dynamic has led to increased agility on the UI side. We are also exploring ways to increase the speed of iteration on the dependencies side. The current interaction model uses static domain models and client libraries to handle the data flow through the API. This results in long iteration cycles for even the simplest of use cases. We are actively pursuing an approach where our dependencies will be able to expose new data through a dynamic pass-through model based on a dictionary of key-values.
  • #67: The idea is that this model will avoid the static update cycle on the API end, thereby resulting in shorter iteration cycles. This will require investment in things like safety checks and discoverability of the API. We are instrumenting the API layer to inspect traffic at runtime and provide insights into API usage.
  • #69: One of the early mistakes that we made in this new architecture was not treating internal developers like we did public developers. We don't need the same degree of evangelism, but we do need to maintain strong communications with the client teams while providing robust tools and systems to help them be better developers in our system. An example of us being late to this is represented by our endpoint dashboard. One of our teams went from having about 30 scripts to about 500 in a matter of weeks. Each of these scripts is dynamically compiled into the JVM, occupying permgen space. As the script count shot up, we hit limits in our permgen, which resulted in an outage. And an outage in our layer means people cannot stream Netflix. Of course, there is nothing like an outage to kickstart new behaviors. As a result, we immediately set up alerts and then focused more heavily on building tools to support the developers.
  • #70: Included in that effort is comprehensive documentation.
  • #71: We built an array of tools as well, including this REPL.
  • #72: And prepared frequent trainings and videos.
  • #74: Nobody has a 100% SLA, so things will fail
  • #75: In fact, a few years ago, we had many failures on a routine basis.
  • #76: Many of those failures were a result of failures in a dependent service that we did a poor job of protecting against. Because we are the last step before delivering content to the customers, we have a unique opportunity to help protect customers from such failures.
  • #77: Hystrix allows us to be resilient to failure by implementing the bulk-heading and circuit breaker patterns. Hystrix is open source and available at our github repository.
  • #78: Failure Simulation and Game Day exercises are a key part of the overall story. The Simian Army is a fleet of monkeys that simulate failures and alert us to non-conformities in an automated manner. Chaos Monkey periodically terminates AWS instances in production to see how the system responds to the instance disappearing. Latency Monkey introduces latencies and errors into a service to see how it responds and lets us assess the customer quality of experience. Conformity Monkey alerts us to variations in versions of applications across regions. The monkeys are also available in our open source github repository.
  • #80: Because of our pivot to the private API and the explosion of devices consuming it, our traffic grew tremendously in a few years (and continues to grow at very fast rates). Scaling our systems to support this growth is absolutely critical to the success of the company. Techniques, such as throttling are not an option because that only serves to limit the interactions from our streaming subscribers. Instead, we need to be able to handle any load that our devices throw at us. This manifests in many ways, but the following is a detail on one of them – instance scaling.
  • #81: Let’s go back to the traffic chart. The pattern is predictable with higher peaks on the weekends
  • #82: To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix). Scryer evaluates needs based on historical data (week over week, month over month metrics), adjusts instance minimums based on algorithms, and relies on Amazon Auto Scaling for unpredicted events
  • #83: This graph shows that Scryer's predictions are in line with actual RPS. In production, Scryer allows us to get instances into production prior to the need (different from Amazon's reactive autoscaling engine, which triggers the ramp-up based on immediate need and must wait until server start-up completes). Because the instances are there in advance, Scryer smooths out load averages and response times, which in turn improves the customer experience.
  • #84: This is an example of what Scryer looks like during an outage. When actual traffic dropped because of an outage, the reactive autoscaling engine would have downsized the farm. In this case, Scryer kept the farm sized correctly so that we were able to deal with the traffic spike after the recovery.
  • #85: As a side benefit (not the initial intent), Scryer also allows us to be more precise with our instance counts, reducing inefficiencies.