Chaos engineering: how to test latency resilience

Distributed software systems are all around us. If you care about the availability of the system you build, a distributed architecture is a natural choice: it lets the system keep running even when some replicas of a service are down. But alongside that advantage, the distributed nature of a system brings its own issues. A higher probability of network, application, and infrastructure failures (more servers, more problems) can cause retry storms, outages due to traffic overflow, cascading failures, and so on. This is especially true for microservice-driven systems, and these problems lead straight to the thing we wanted to avoid: service unavailability.
This is where chaos engineering comes in. “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Popularized by Netflix, chaos engineering and testing is a handy way to find the sometimes unexpected issues that prevent your system from being resilient. Read more about the principles of chaos engineering.
So use it
Moving to practice, there are two main ways to test your system against rare but disruptive real-world events: standalone tools or injecting failures directly in the codebase.
The most popular standalone tool is probably the original one — Chaos Monkey by Netflix. Such tools work mostly at the infrastructure level: stopping virtual machines or servers, removing data centers from the network, introducing latency at the network adapters, and so on.
However, it is sometimes more convenient to inject “disruptive events” at the code level. Since most problems occur at the inter-service communication level, and HTTP is the most popular way for services to communicate, a straightforward and easy solution is to add chaos through your HTTP client.
At my current job as a .NET developer, our HTTP client of choice is Vostok.ClusterClient, so I developed a library called ClusterClient.Chaos that injects latency with a specified probability, to test how the system reacts to unwanted, random extra latency on HTTP calls between services.
Example
Here is a synthetic example of chaos testing a system using ClusterClient.Chaos and latency injection.
Let’s say our client app (e.g. a backend web app) has to make three requests to different endpoints (e.g. backend microservices) to gather the required information.

Our usability specialist has stated that it is fine for the user if gathering this information takes no longer than 1 second. Usually it does, but in periods of high load on the services extra latency is added and users see an error page because a timeout error happens. In reality, the user would prefer to get the page even without the full information, so that their workflow is not interrupted. Let’s try to find this problem in advance and prevent a flurry of angry emails from users.
Here is the original code of our service method that gathers the information shown to the user.
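(The exact snippet lives in the ClusterClient.Chaos repository; below is a minimal sketch of such a method, with hypothetical endpoint URLs and a plain HttpClient standing in for ClusterClient.)

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public class PageInfoService
{
    private readonly HttpClient client;

    public PageInfoService(HttpClient client) => this.client = client;

    public async Task<List<string>> GetPageInfoAsync()
    {
        // Call the three endpoints one by one, awaiting each response before starting the next call.
        return new List<string>
        {
            await client.GetStringAsync("http://service-a/info"),
            await client.GetStringAsync("http://service-b/info"),
            await client.GetStringAsync("http://service-c/info")
        };
    }
}
```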
Pretty straightforward: call the endpoints one by one and combine the results into a list.
Let’s test this method against our one-second timeout requirement.
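A test along these lines can express the requirement; this is a sketch using NUnit and a Stopwatch against the hypothetical service above, not the exact test from the repository.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;
using NUnit.Framework;

[TestFixture]
public class PageInfoServiceTests
{
    [Test]
    public async Task GetPageInfo_should_take_less_than_one_second()
    {
        var service = new PageInfoService(new HttpClient());

        var stopwatch = Stopwatch.StartNew();
        await service.GetPageInfoAsync();
        stopwatch.Stop();

        // The usability requirement: the whole method must finish within 1 second.
        Assert.That(stopwatch.Elapsed, Is.LessThan(TimeSpan.FromSeconds(1)));
    }
}
```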
This test passes every time. But it is synthetic and does not account for the real world’s random network and other failures. Let’s inject some extra latency with a 5% probability. With the ClusterClient.Chaos library, this amounts to adding a latency-injection step to the ClusterClient configuration.
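The exact ClusterClient.Chaos configuration is shown in the repository; to keep the sketch here self-contained with plain HttpClient, the same idea can be expressed as a DelegatingHandler that delays a request with a given probability (illustrative code, not the library’s API):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// With a given probability, delay the outgoing request by a fixed amount before sending it.
public class LatencyInjectionHandler : DelegatingHandler
{
    private readonly Random random = new Random();
    private readonly double rate;
    private readonly TimeSpan latency;

    public LatencyInjectionHandler(HttpMessageHandler innerHandler, double rate, TimeSpan latency)
        : base(innerHandler)
    {
        this.rate = rate;
        this.latency = latency;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (random.NextDouble() < rate)
            await Task.Delay(latency, cancellationToken);

        return await base.SendAsync(request, cancellationToken);
    }
}

// Usage: 5% of requests get 2 extra seconds of latency.
// var chaoticClient = new HttpClient(
//     new LatencyInjectionHandler(new HttpClientHandler(), rate: 0.05, latency: TimeSpan.FromSeconds(2)));
// var service = new PageInfoService(chaoticClient);
```

Passing an HttpClient wrapped in such a handler to the service above reproduces the “rare extra latency” scenario inside the test.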
Now our test fails almost every run (not literally always, because of the randomness). So we have found an issue in our codebase that is hard to catch with the usual testing techniques. The next step is to fix it. A first attempt could be to make the service calls parallel, so the total time should be roughly a third of the original.
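A parallel version of the method (same sketch, same hypothetical endpoints) might look like this:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public class PageInfoService
{
    private readonly HttpClient client;

    public PageInfoService(HttpClient client) => this.client = client;

    public async Task<List<string>> GetPageInfoAsync()
    {
        // Start all three calls at once and wait for every one of them to complete.
        var tasks = new[]
        {
            client.GetStringAsync("http://service-a/info"),
            client.GetStringAsync("http://service-b/info"),
            client.GetStringAsync("http://service-c/info")
        };

        return (await Task.WhenAll(tasks)).ToList();
    }
}
```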
But if we run the same test against this method, it fails as well, for an obvious reason: the total execution time is now roughly equal to the longest of the three calls, so if any single call takes more than 1 second, the test still fails.
But we are on the right track. As our requirement says, incomplete information on the page is a better option than a timeout error. So we can configure timeouts on the individual requests and stop waiting for service calls that run longer than we expect. Graceful degradation in action.
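One way to sketch this (still with plain HttpClient; the 800 ms per-call budget is an illustrative value, not taken from the original code) is to give each call its own cancellation token and drop the results that do not arrive in time:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class PageInfoService
{
    private readonly HttpClient client;

    public PageInfoService(HttpClient client) => this.client = client;

    public async Task<List<string>> GetPageInfoAsync()
    {
        var urls = new[] { "http://service-a/info", "http://service-b/info", "http://service-c/info" };

        // Fire all calls in parallel, each with its own time budget (an illustrative 800 ms,
        // chosen so the whole method stays under the 1-second requirement).
        var tasks = urls.Select(url => GetOrDefaultAsync(url, TimeSpan.FromMilliseconds(800)));
        var results = await Task.WhenAll(tasks);

        // Drop the calls that were too slow and render the page with whatever we have.
        return results.Where(result => result != null).ToList();
    }

    private async Task<string> GetOrDefaultAsync(string url, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                var response = await client.GetAsync(url, cts.Token);
                return await response.Content.ReadAsStringAsync();
            }
            catch (OperationCanceledException)
            {
                // Graceful degradation: the page is shown without this piece of information.
                return null;
            }
        }
    }
}
```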
Run the test again (several times) and we see that this helps: the user (almost) always gets the page. The page sometimes shows incomplete information, but that trade-off usually means better service availability and a better user experience. It is not always the right choice, though.
Conclusion
This example shows how chaos testing can help you find issues that prevent a system from being truly resilient, and then verify your changes under the same problematic conditions.
The full code can be found on GitHub as part of the ClusterClient.Chaos repository. Some helpful links: