externalTrafficPolicy Local is a configuration option for Kubernetes Services that configures nodes to only forward external traffic to local Service endpoints, reducing latency and preserving the client source IP. The ins and outs of networking in GKE is my favourite video for understanding this configuration option.
This post describes how a workload exposed using a Service of type LoadBalancer with externalTrafficPolicy Local can be configured to avoid requests being dropped when performing a rolling update.
Although the concepts should apply to other Kubernetes environments, for simplicity we will consider the specific example of Nginx Ingress Controller running behind a Service of type LoadBalancer with GKE subsetting enabled; GKE subsetting reconciles LoadBalancer Services using ingress-gce instead of cloud-provider-gcp.
PreStop Lifecycle Hook
When using externalTrafficPolicy Local, kube-proxy serves a health endpoint which only passes when there is at least one running and ready Pod selected by the Service on the local node. This health endpoint is probed by the load balancer to ensure that packets are only forwarded to nodes that are running Pods to receive them.
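As an illustration, a Service of this kind might look roughly as follows (a hand-written sketch; the names, labels, and ports are hypothetical, and healthCheckNodePort is normally allocated automatically by Kubernetes rather than set by hand):

```yaml
# Hypothetical Service exposing Nginx Ingress Controller; names and ports
# are illustrative, not taken from this post.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
  - name: http
    port: 80
    targetPort: 80
  # Allocated automatically when externalTrafficPolicy is Local; kube-proxy
  # serves the health endpoint described above on this port on every node.
  healthCheckNodePort: 32000
```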
However, when the last Pod selected by the Service on the local node is terminated, there is a race condition between when the Pod is stopped and when the load balancer stops forwarding packets.
To ensure that the Pod continues to serve until the load balancer stops sending packets, a PreStop lifecycle hook can be configured to delay the kubelet sending a TERM signal to the Nginx Ingress Controller container process until the load balancer health check has failed. In our case the health check takes 6 seconds to fail, and we add a small amount of additional time to allow kube-proxy to observe the terminating Pod and start failing the health endpoint:
```yaml
containers:
- name: controller
  lifecycle:
    preStop:
      sleep:
        seconds: 10
```
Note that typically we expect the load balancer to remove nodes from load balancing more quickly than the health check because the load balancer controller watches EndpointSlices.
Note also that Nginx Ingress Controller ships with a wait-shutdown binary that is meant to be used together with the --shutdown-grace-period flag instead of a sleep lifecycle hook; however, using this binary causes the readiness probe to start failing straight away, which can cause traffic to be blackholed. See GitHub issue #13689 for details.
Readiness Probe
Once the load balancer health check has failed new connections will no longer be sent to the node, however packets corresponding to existing connections will continue to be routed. In our case load balancer connection draining is configured to be 30 seconds, so we want to continue serving until these connections have been closed.
One option is to simply extend our PreStop lifecycle hook to cover the 30 second draining period so that the load balancer resets the connections before the Nginx process is shut down; however, this does not give Nginx a chance to drain HTTP requests gracefully (e.g. by sending GOAWAY frames on HTTP/2 connections).
Instead, by ensuring that Nginx Ingress Controller Pods remain ready after receiving a TERM signal from the kubelet, we can take advantage of KEP-1669: Proxy Terminating Endpoints to ensure that existing connections continue to be forwarded until they are closed by Nginx.
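To make this concrete, a draining Pod's endpoint is represented in its EndpointSlice roughly as follows (a hand-written sketch with hypothetical names and addresses; per the EndpointSlice API, the ready condition is never true for terminating endpoints, while serving reflects the readiness probe):

```yaml
# Sketch of an EndpointSlice entry for a terminating but still serving Pod.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: ingress-nginx-controller-abc12
  namespace: ingress-nginx
addressType: IPv4
endpoints:
- addresses:
  - 10.0.1.5
  conditions:
    ready: false      # never true once the Pod is terminating
    serving: true     # readiness probe still passing, so connections drain here
    terminating: true # the Pod has a deletion timestamp
```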
As described in #13689, the Nginx Ingress Controller readiness probe currently starts failing as soon as the TERM signal is received, so we need to make sure that terminationGracePeriodSeconds is set such that the Pod is killed before it can be marked as not ready, to avoid packets being blackholed. In other words, we need:

lifecycleHookPreStopSleepSeconds + (readinessProbePeriodSeconds * (readinessProbeFailureThreshold - 1)) >= terminationGracePeriodSeconds
The following configuration satisfies this requirement, since 10 + 10 * (3 - 1) = 30 >= 30:
```yaml
containers:
- name: controller
  lifecycle:
    preStop:
      sleep:
        seconds: 10
  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
terminationGracePeriodSeconds: 30
```
Readiness Gate
So far we have discussed how to ensure that connections are closed gracefully when terminating a Pod, but we also need to ensure that new Pods are successfully registered with the load balancer on start up before marking the Pod as ready. Without this, capacity could be reduced or the load balancer could even temporarily have no nodes to forward packets to!
Pod readiness gates can help here by checking that the Pod is registered with the load balancer before marking the Pod as ready. When using GKE container-native load balancing, we can use the cloud.google.com/load-balancer-neg-ready readiness gate out of the box. This works because the load balancer health check is made directly to the Pod and so is decoupled from the kube-proxy health check. However, when using a GKE Service of type LoadBalancer, there is a cyclic dependency on the readiness of the node; the readiness gate would not pass until the load balancer has marked the local node as ready which would not happen until the readiness gate passes.
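For reference, with container-native load balancing the readiness gate appears in the Pod spec roughly as follows (GKE normally injects it automatically for Pods backed by NEGs; this fragment is only illustrative):

```yaml
# Pod spec fragment; GKE adds this readiness gate automatically for Pods
# behind container-native (NEG) load balancing.
spec:
  readinessGates:
  - conditionType: cloud.google.com/load-balancer-neg-ready
```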
Since the GKE Service of type LoadBalancer implementation considers both ready and not ready Pods we could implement a readiness gate that waits for the local node to be added into the load balancer but just does not wait for it to be healthy, taking advantage of the fact that GCP internal passthrough Network Load Balancers distribute new connections to all backend VMs if they are all unhealthy. This does not provide the same guarantees as the GKE container-native load balancing readiness gate, but it does protect against an Nginx Ingress Controller rolling update progressing even if the load balancer controller is down (e.g. during an upgrade of the GKE control plane).
As a simpler workaround, the readiness probe can be delayed using initialDelaySeconds for an amount of time that gives a high chance that the local node has been registered before the Pod is marked as ready (e.g. 20 seconds, slightly longer than the Lease duration):
```yaml
containers:
- name: controller
  lifecycle:
    preStop:
      sleep:
        seconds: 10
  readinessProbe:
    failureThreshold: 3
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 20
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
terminationGracePeriodSeconds: 30
```
Note that GKE recommends using minReadySeconds, which works great when performing a rolling update but is not considered in other circumstances that cause Pods to be terminated and recreated (e.g. evictions caused by nodes being drained or due to vertical Pod autoscaling).
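For completeness, minReadySeconds is set at the top level of the Deployment spec; a minimal sketch (the value is illustrative):

```yaml
# Hypothetical Deployment fragment; Pods must stay ready for this long
# before being considered available during a rolling update. It is not
# honoured when Pods are evicted and recreated outside a rollout.
apiVersion: apps/v1
kind: Deployment
spec:
  minReadySeconds: 30
```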
Conclusion
As described above, there are a number of configuration options that can improve the chance of zero downtime deployments when running behind a Service of type LoadBalancer with externalTrafficPolicy Local, but they do not guarantee this even when all cluster components are working correctly.
However, the remaining race conditions become increasingly rare as we increase the number of replicas and reduce voluntary Pod disruptions by using a PodDisruptionBudget or surge updates.
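Both mitigations can be expressed in configuration; a sketch under the assumption of the hypothetical controller labels used earlier (names and numbers are illustrative):

```yaml
# Hypothetical PodDisruptionBudget limiting voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-nginx-controller
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
---
# Surge rolling update strategy fragment for the Deployment: bring up a
# replacement Pod before taking an old one down.
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```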
In the future there may be additional options to mitigate these issues, for example Pod termination gates.