Microservice Governance - Resilience Patterns - Part 1

Hey guys, nice to see you again. This is the second blog in a series on how to govern a complex Microservice Architecture. The first one, which discusses Routing Patterns, is here.
Because a Microservices Architecture is highly distributed, you have to be extremely careful to prevent a problem in a single service or service instance from cascading up and out to the consumers of that service. How we build our applications to respond to failures is one of the most critical parts of software development in a Microservices Architecture, and it is exactly the problem that Resilience Patterns aim to resolve. So I will spend two articles on Resilience Patterns, covering the following individual patterns by breaking each one down into the problem to solve, the common design, and an implementation that leverages a Service Mesh.

  • Part-1:
  1. Client-side Load Balancing
  2. Retries
  3. Timeouts
  • Part-2:
  4. Circuit Breakers
  5. Fallbacks
  6. Bulkheads

Resilience Pattern-1: Client-side Load Balancing

Question To Solve

How to cache the locations of your service instances on the service client so that calls to multiple instances of a service are load balanced across all the healthy instances?

Common Design

We introduced client-side load balancing when discussing the Service Discovery Pattern. It involves having the client look up all instances of a service from a Service Discovery service and then cache the physical locations of those service instances. Whenever a service consumer needs to call that service, the client-side load balancer returns a location from the pool of service locations it maintains. Because the load balancer sits on the client side, it can detect whether a server-side service instance is throwing errors or performing poorly. If it sees an issue, it can remove the unhealthy instance from the pool of available service locations and prevent future service calls from hitting that instance.
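
As a concrete illustration, here is a minimal Python sketch of such a client-side load balancer. All names are made up for illustration; it is not tied to any particular library.

import itertools

# Minimal client-side load balancer sketch: cache instance locations
# fetched from service discovery, hand them out round-robin, and evict
# instances observed to be unhealthy.
class ClientSideLoadBalancer:
    def __init__(self, instances):
        # instances: "host:port" strings cached from service discovery
        self.pool = list(instances)                     # currently healthy
        self._cycle = itertools.cycle(list(instances))  # fixed rotation order

    def next_instance(self):
        if not self.pool:
            raise RuntimeError("no healthy instances available")
        # Skip over instances that have been evicted from the pool.
        while True:
            candidate = next(self._cycle)
            if candidate in self.pool:
                return candidate

    def mark_unhealthy(self, instance):
        # Remove a misbehaving instance so future calls avoid it.
        if instance in self.pool:
            self.pool.remove(instance)

lb = ClientSideLoadBalancer(["10.0.0.1:8080", "10.0.0.2:8080"])
target = lb.next_instance()   # -> "10.0.0.1:8080"
lb.mark_unhealthy(target)     # evict after observing errors
print(lb.next_instance())     # -> "10.0.0.2:8080"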

Actually, for monolithic applications, load balancing is commonly designed on the server side with DNS and a Load Balancer. But this solution is not suitable for a Microservice Architecture because of the following two issues:

  • The Load Balancer is a Single Point of Failure: the services behind it can no longer serve traffic when the load balancer itself malfunctions, even if those services are perfectly healthy.
  • The client-side services cannot balance load based on the pressure on each instance of the server-side services; they keep firing requests and can only react after bad responses come back or requests time out. This wastes plenty of network resources and delays the feedback loop, which is not aligned with the Fail-fast design principle.

Implementation in AWS App Mesh

We don’t need to declare any resources to implement this pattern, because the Envoy proxies automatically load balance traffic from all clients in the mesh and keep the set of load-balancing endpoints up to date based on health checks and service registration.
Regarding the load balancing algorithm: though Envoy supports multiple load balancers with different algorithms, App Mesh can currently (as of 2021-03-09) only leverage the Round Robin algorithm. This is the default, so we don’t need to configure any related settings for Envoy.

Resilience Pattern-2: Retries

Question To Solve

How do we reconnect to the server-side service when the initial call fails?
The Retries Pattern resolves this problem.

Common Design

We need to specify a configuration on a service that declares the maximum number of times a client attempts to connect to it if the initial call fails. We call this a Retry setting. Retries can enhance service availability and application performance by ensuring that calls don’t fail permanently because of transient problems such as a temporarily overloaded service or network. The retry behavior for HTTP requests is to retry n times before returning the error.
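
To make this concrete, here is a minimal Python sketch of a retry wrapper; the names, including the TransientError type that marks retryable failures, are hypothetical and not a specific library's API.

import time

class TransientError(Exception):
    """Stand-in for a retryable failure such as a 503 or a connection reset."""

# Retry a call up to max_retries times on transient errors, mirroring the
# "retry n times before returning the error" behavior described above.
def call_with_retries(call, max_retries=2, backoff_seconds=0.1):
    attempt = 0
    while True:
        try:
            return call()
        except TransientError:
            attempt += 1
            if attempt > max_retries:
                raise  # retry budget exhausted, surface the error
            time.sleep(backoff_seconds * attempt)  # simple linear backoff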

Implementation in AWS App Mesh

Retrying for a server-side service is configured as a Retry Policy on a route in the server-side service’s Virtual Router, rather than on the client side. A Retry Policy enables clients to protect themselves from intermittent network failures or intermittent server-side failures. It is optional but recommended. Even if you want the virtual service to reach a virtual node directly, without a virtual router, App Mesh may automatically create a default Virtual Route with the default Envoy route retry policy for each virtual node provider. So our convention is to always define a Virtual Router for a Virtual Node and set the Retry Policy explicitly, even when there is only one Virtual Node provider for a Virtual Service.
As an example, imagine that service sw-foo-service has a heavy calculation workload API that replies with a couple of megabytes of data. This API takes a long time to respond and is prone to failure from network fluctuations while the data is being transmitted, so we set up a separate Retry Policy with more retries for it. All the other APIs get a default policy that retries twice to deal with initial request failures.
Please refer to the HttpRetryPolicy API Reference to get more details on each configuration.

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  labels:
    app: sw-foo-service
  name: sw-foo-service-router
  namespace: sw-foo-service
spec:
  listeners:
    - portMapping:
        port: 8392
        protocol: http
  routes:
    - name: foo-heavy-feature-route
      httpRoute:
        match:
          prefix: /v1/heavy/data-calculate
        action:
          weightedTargets:
            - virtualNodeRef:
                name: sw-foo-service-heavy
              weight: 1
        retryPolicy:
          maxRetries: 3
          perRetryTimeout:
            unit: s
            value: 60
          httpRetryEvents:
            - server-error # HTTP status codes 500, 501, 502, 503, 504, 505, 506, 507, 508, 510, and 511
            - gateway-error # HTTP status codes 502, 503, and 504
          tcpRetryEvents:
            - connection-error
    - name: default
      httpRoute:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeRef:
                name: sw-foo-service
              weight: 1
        retryPolicy:
          maxRetries: 2
          perRetryTimeout:
            unit: ms
            value: 1000
          httpRetryEvents:
            - server-error
            - gateway-error

Resilience Pattern-3: Timeouts

Question To Solve

Though we can retry when a call returns a failure, what about a call that never returns at all? How do we prevent client services from waiting indefinitely for replies, and ensure that calls succeed or fail within a predictable timeframe? The Timeouts Pattern resolves this kind of issue.

Common Design

A Remote Procedure Call (RPC) always ends in one of 3 response states:

  • Success
  • Failure
  • Timeout

We can handle Failure with Retry settings, but we have to handle Timeout by specifying a configuration on a service that declares how long a client should wait for replies from that service. We call this a Timeout setting. Beyond that, this setting should be tuned service by service to make it appropriate for your application. On one hand, a too-long timeout results in excessive latency while waiting for replies from failing services. On the other hand, a too-short timeout can cause calls to fail unnecessarily while waiting for an operation that involves multiple services to return.
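
As an illustration, here is a minimal Python sketch that enforces a timeout on a call using only the standard library; the function name is hypothetical and not tied to any RPC framework.

from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Run the call in a worker thread and stop waiting after timeout_seconds,
# so the caller fails within a predictable timeframe instead of hanging.
def call_with_timeout(call, timeout_seconds):
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(call)
        # Success and Failure propagate from result(); a call that does not
        # reply in time raises TimeoutError, the third outcome listed above.
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        raise RuntimeError(f"call timed out after {timeout_seconds}s")
    finally:
        # Stop waiting; note the worker thread itself may still be running.
        pool.shutdown(wait=False)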

Implementation in AWS App Mesh

To implement this pattern, we need to set HTTP or HTTP/2 Timeout in both Virtual Router and Virtual Node as below,

Virtual Router

  • Request timeout
    An object that represents a per-request timeout. If you specified a Retry Policy, then the duration you set here should always be greater than or equal to the per-retry timeout multiplied by the Max retries defined in the Retry Policy, so that all retry attempts can complete (see the sketch after this list). The default value is 15 seconds.
    Note: If you specify a timeout greater than the default, make sure that the timeout specified for the listener of every participating virtual node is also greater than the default. However, if you decrease the timeout to a value lower than the default, updating the timeouts at the virtual nodes is optional.
  • TCP Connection Idle duration
    The amount of time the TCP proxy allows a connection to exist with no upstream or downstream activity. If no traffic is detected within this idle timeout, the proxy can delete the connection. The default setting is 300 seconds.
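
To make the relationship between the Request timeout and the Retry Policy concrete, here is a quick sanity check in Python; the helper name is made up for illustration, and the values mirror the two routes in the example below.

# The route's per-request timeout should cover the full retry budget,
# i.e. perRetryTimeout x maxRetries.
def retry_budget_ok(request_timeout_s, per_retry_timeout_s, max_retries):
    return request_timeout_s >= per_retry_timeout_s * max_retries

print(retry_budget_ok(180, 60, 3))  # heavy route: 3 x 60s fits in 180s -> True
print(retry_budget_ok(10, 5, 2))    # default route: 2 x 5s fits in 10s -> True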

Virtual Node

  • Request timeout
    The request timeout for a listener. The default is 15 seconds.
    Note: If you specify a timeout greater than the default, make sure to set up a virtual router and a route with a timeout greater than the default. However, if you decrease the timeout to a value lower than the default, updating the timeouts at the Route is optional.
  • TCP Connection Idle duration
    The idle duration for a listener. The default is 300 seconds.

As an example, let's imagine the same scenario as in the Retries Pattern. The heavy calculation workload API needs a longer timeout than the other APIs, so we create a separate Virtual Route for it with its own timeout settings.

  • Virtual Router

    apiVersion: appmesh.k8s.aws/v1beta2
    kind: VirtualRouter
    metadata:
      name: sw-foo-service-router
      namespace: sw-foo-service
    spec:
      listeners:
        - portMapping:
            port: 8080
            protocol: http
      routes:
        - name: foo-heavy-feature-route
          httpRoute:
            match:
              prefix: /v1/heavy/data-calculate
            action:
              weightedTargets:
                - virtualNodeRef:
                    name: sw-foo-service-heavy
                  weight: 1
            retryPolicy:
              maxRetries: 3
              perRetryTimeout:
                unit: s
                value: 60
              httpRetryEvents:
                - server-error  # HTTP status codes 500, 501, 502, 503, 504, 505, 506, 507, 508, 510, and 511
                - gateway-error # HTTP status codes 502, 503, and 504
              tcpRetryEvents:
                - connection-error
            timeout:
              perRequest:
                unit: s
                value: 180 # >= perRetryTimeout (60s) x maxRetries (3)
              idle:
                unit: s
                value: 600
        - name: default
          httpRoute:
            match:
              prefix: / # default match with no priority
            action:
              weightedTargets:
                - virtualNodeRef:
                    name: sw-foo-service
                  weight: 1
            retryPolicy:
              maxRetries: 2
              perRetryTimeout:
                unit: s
                value: 5
              httpRetryEvents:
                - server-error 
                - gateway-error
              tcpRetryEvents:
                - connection-error
            timeout:
              perRequest:
                unit: s
                value: 10
              idle:
                unit: s
                value: 600
    
  • Virtual Node for service default deployment

    apiVersion: appmesh.k8s.aws/v1beta2
    kind: VirtualNode
    metadata:
      name: sw-foo-service
      namespace: sw-foo-service
    spec:
      podSelector:
        matchLabels:
          app: sw-foo-service
      listeners:
        - portMapping:
            port: 8080
            protocol: http
          healthCheck:
            ...
          timeout:
            perRequest:
              unit: s
              value: 5
            idle:
              unit: s
              value: 600
      serviceDiscovery:
        awsCloudMap:
          namespaceName: foo.prod.softwheel.aws.local
          serviceName: sw-foo-service
    
  • Virtual Node for the heavy deployment

    apiVersion: appmesh.k8s.aws/v1beta2
    kind: VirtualNode
    metadata:
      name: sw-foo-service-heavy
      namespace: sw-foo-service
    spec:
      podSelector:
        matchLabels:
          app: sw-foo-service
          feature: heavy-calculation
      listeners:
        - portMapping:
            port: 8080
            protocol: http
          healthCheck:
            ...
          timeout:
            perRequest:
              unit: s
              value: 60
            idle:
              unit: s
              value: 600
      serviceDiscovery:
        awsCloudMap:
          namespaceName: foo.prod.softwheel.aws.local
          serviceName: sw-foo-service
          attributes:
          - key: feature
            value: heavy-calculation
    

Wrap Up

We discussed how to leverage Resilience Patterns to build a fault-tolerant Microservice Architecture, and introduced the following 3 patterns by breaking each one down into what it is, the problem it solves, and how to implement the solution in AWS App Mesh:

  1. Client-side Load Balancing Pattern
  2. Retries Pattern
  3. Timeouts Pattern

Stay tuned. We will talk about the remaining 3 patterns, Circuit Breakers, Fallbacks, and Bulkheads, in Part 2 of Resilience Patterns.

Haili Zhang
