Apache Knox - Load Balancing
Overview
Load Balancing HTTP
Apache Knox and Load Balancing
Considerations
Example Implementation
Overview
Architecture
Technology Used
References
Overview
Load balancing HTTP services can be quite involved. This document covers some specifics about HTTP load
balancing and Apache Knox. It is meant as a guide to what to consider and does not necessarily cover all
cases.
Load Balancing HTTP
There are quite a few things to be aware of when load balancing HTTP. Each of these can drastically affect the
backend service depending on how the backend HTTP service and load balancer are configured. Related
topics:
● HTTP vs TCP load balancing
● TLS termination for HTTP load balancing
● DNS
○ Domains/Subdomains
○ TTL for DNS changes
● URL rewriting
● Health checks
● Caching
Apache Knox and Load Balancing
There are many different load balancers and many ways to configure them. There is no recommended/supported
setup, and definitely not one that ships as part of HDP. There is a rough guide to using
Apache HTTPD with Apache Knox.
It also depends on which features you are looking to load balance: APIs vs. UIs vs. KnoxSSO. HTTP can be
load balanced in a variety of ways, and it matters whether the context path is changed, how caching is handled,
how cookies are handled, which domains are used, etc.
Considerations
● HTTP vs TCP load balancing
○ Affects SPNEGO authentication, since the service principal is tied to the hostname
● Apache Knox doesn’t load balance backends
○ It will fail over if necessary, but all traffic goes to a single backend instance until that instance dies
○ Putting a load balancer behind Knox can be tricky due to SPNEGO
● Load balancing Kerberos services without Knox, or behind Knox, is technically possible but painful
○ [Link]
● HiveServer2 is stateful
○ Make sure that all Knox instances point to the same HS2 instance if load balancing
○ Ensure clients don’t get redirected to a random HS2 instance
○ Be careful when doing rolling restarts of Knox/HS2
● Non-stateful services (WebHDFS, HBase) are much easier to load balance
○ No need to worry about sticky sessions or about clients going to the same backend
● HTTP clients need to be able to handle cookies/redirects if doing load balancing (see the sketch after this list)
○ Be aware of 3xx and 4xx responses that could be caused by redirect/cookie handling
○ curl has "-L", "--location-trusted", "-c", and "-b" flags to help
● DNS/URL rewriting - changing the context path
○ Will cause issues with UIs that do not handle the rewriting correctly
○ Can also cause cookie issues if cookies are bound to a domain/path
● DNS changes between LB/Knox instances
○ Can cause issues with KnoxSSO cookies, depending on how the DNS is set up
○ Cookies are typically bound to domains
● Load balancing KnoxSSO
○ Make sure the same signing key is used across nodes; otherwise the JWT cannot be verified and users
hit spurious “signed out” issues
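To make the client-side cookie/redirect handling above concrete, below is a minimal Python sketch using the requests library to call WebHDFS through a load-balanced Knox endpoint. The hostname, topology name, credentials, and file path are hypothetical, and the sketch assumes the topology uses HTTP Basic authentication; adjust for your environment. Note that, much like curl without --location-trusted, requests drops the Authorization header when a redirect crosses hosts.

```python
# Minimal sketch: an HTTP client that handles cookies and redirects when
# calling WebHDFS through a load-balanced Knox endpoint.
# Hostname, topology ("default"), credentials, and file path are hypothetical.
import requests

KNOX_BASE = "https://knox.example.com:8443/gateway/default"  # hypothetical LB hostname

# A Session persists cookies across requests, so load-balancer sticky-session
# cookies (and any Knox-issued cookies) are sent back on later calls --
# the equivalent of curl's "-c"/"-b" flags.
session = requests.Session()
session.auth = ("someuser", "somepassword")          # assumes HTTP Basic auth in the topology
session.verify = "/etc/pki/tls/certs/ca-bundle.crt"  # trust store for the LB/Knox TLS cert

# WebHDFS OPEN replies with a redirect; allow_redirects=True follows it
# (the equivalent of curl's "-L"). Unexpected 3xx/4xx responses usually
# point at redirect or cookie handling problems.
resp = session.get(
    f"{KNOX_BASE}/webhdfs/v1/tmp/example.txt",
    params={"op": "OPEN"},
    allow_redirects=True,
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, len(resp.content))
```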
Example Implementation
Overview
At my last job, we load balanced 4 Knox instances and handled ~10,000 requests per minute across
HBase/WebHDFS/Hive services on ~3 clusters. Most were HBase calls concentrated on the production cluster.
We expanded to multiple data centers and used DNS/load balancing to handle failover when necessary. This
configuration worked well and allowed for zero-downtime upgrades, failover, and load balancing for maintenance.
We only focused on API calls - no UIs or SSO (we started with HDP 2.5.x, which didn’t support Knox proxying
UIs/SSO). We would have had issues with UIs/SSO if we had gone that route, given how we had configured
DNS and the multiple LBs.
All client access to the clusters was over HTTP; there was no SSH access. To reach the HDP clusters, all
traffic was funneled through HTTP and Knox: Spark via Livy, HBase REST, WebHDFS read/write, Hive queries,
and Sqoop via Oozie to ingest RDBMS tables.
Architecture
Technology Used
● Load balancing
○ F5 hardware HTTP LB with sticky sessions
■ Provided hardware failover and was part of the “corporate LB”
■ Avoided TCP load balancing since HTTP was the “default” setup
■ Provided HTTP TLS termination that matched corporate standards
■ Pointed to 4 software load balancers (Traefik)
■ Health checks removed instances that were down
○ Traefik software HTTP LB with sticky sessions
■ The HDP team controlled Traefik
● Didn’t need to wait on the F5 team for LB changes
■ Traefik and Knox were colocated on the same servers
■ Traefik load balanced across the 4 Knox instances
■ Health checks to see if Knox nodes were up/responding
■ TLS termination at Traefik
● Stricter than corporate standards, since the F5 was the only client
● Wildcard TLS certificate to easily LB different domains
■ URL rewriting (only to have “nice” URLs)
● Example rewriting:
○ [Link]
○ [Link]
● Apache Knox
○ Each instance provided access to multiple clusters
■ Allowed for the failure of all but one Knox instance
○ LDAP bind accounts/passwords were per host, not per cluster
■ Enabled changing the LDAP password with zero downtime
● DNS
○ Delegated subdomain for F5, Traefik, and other uses
■ *.[Link]
■ Low TTL on delegated subdomain
■ Used dnsmasq reading from /etc/hosts on multiple hosts
○ Setup details
■ [Link] pointed to the F5 in the main data center
● In a major data center failure, [Link] would be repointed to the F5 in the secondary
data center
● In reality, it was easy for specific applications to fail over to [Link]
○ Not all applications needed to fail over
■ [Link] pointed to the F5 in the main data center
● dc2 pointed to the backup data center
● Could easily test “failover” by pointing to dc2
■ [Link] pointed to Traefik 1 in the main data center
● Repeat for N Traefik nodes
● Allowed pinpointing a single Traefik instance for testing
○ Cookies could be used to pinpoint a specific Traefik instance through the F5 (via F5 cookies) or a
specific backend behind Traefik (via Traefik cookies); see the sketch after this list
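To make the cookie-based pinpointing concrete, here is a minimal Python sketch reusing the same hypothetical hostname and credentials as the earlier example. The cookie names are assumptions: F5 persistence cookies are commonly named BIGipServer<pool>, and the Traefik sticky cookie name depends on how stickiness is configured.

```python
# Minimal sketch: use sticky-session cookies to pin requests to a specific
# backend through the F5 -> Traefik -> Knox chain described above.
# Hostname, credentials, and cookie names/values are hypothetical.
import requests

KNOX_BASE = "https://knox.example.com:8443/gateway/default"  # hypothetical

session = requests.Session()
session.auth = ("someuser", "somepassword")

# First request: the load balancers set their sticky/persistence cookies.
resp = session.get(
    f"{KNOX_BASE}/webhdfs/v1/",
    params={"op": "GETFILESTATUS"},
    timeout=30,
)
resp.raise_for_status()

# Inspect which cookies came back. Reusing the same Session resends them,
# so later calls land on the same F5 pool member and Traefik backend.
for cookie in session.cookies:
    print(f"{cookie.name} = {cookie.value}")

# To deliberately target a particular backend for testing, replay a captured
# cookie value (name and value here are placeholders).
pinned = requests.Session()
pinned.auth = session.auth
pinned.cookies.set("BIGipServerknox_pool", "captured-value-from-above")
```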
References
● [Link] (Using Apache Knox with Apache HTTP Server + mod_proxy + mod_proxy_balancer)
● [Link]
● LB and HiveServer2
○ Cookies - [Link]
○ SPNEGO - [Link]
● [Link]
● [Link]