Making APIs Resilient to Transient Faults Using Polly
Introduction
In our previous article (.NET Nakama, 2022 July), we saw that transient faults are inevitable temporary errors (especially in microservice and cloud-based applications). An API is resilient when it can recover from transient failures and continue functioning (in a way that avoids downtime or data loss). We also learned the main strategies to handle transient faults. Transient fault handling may seem complicated, but libraries like Polly can simplify it.
In this article, we will use the Polly library to apply and combine the Retries, Circuit-Breaker, Network Timeout, and Fallbacks strategies. In our use case, we need to execute an action in our Web API (GetData), which retrieves and merges data from two third-party providers (via HTTP) to return them to our consumers (clients), as shown in Figure 1.
The Polly library and Resilience Policies
Polly is a .NET resilience and transient-fault-handling library. Simply put, it provides a quick way to define and apply strategies (policies) to handle transient faults. Polly targets .NET Standard 1.1 and .NET Standard 2.0+, which means that it can be used almost everywhere, including:
- .NET Framework 4.6.1 and 4.7.2
- .NET Core 2.0+, .NET Core 3.0, and later.
The Polly NuGet package can be installed via the NuGet UI or from the NuGet Package Manager Console (for example, with the Install-Package Polly command).
The strategies that we learned in .NET Nakama (2022, July), and that we will use in this article, each have an equivalent Polly policy, as we can see in the following table.
Table 1. - The Polly Policies-Strategies to Handle Transient Faults
Strategy | Polly Policy | Description |
---|---|---|
Retries | Retry Policy (with and without waiting) | Let’s retry after a short delay. Maybe the fault will self-correct. |
Circuit-Breaker | Circuit Breaker Policy | When a system is seriously struggling, failing fast is better to give that system a break. |
Network Timeout | Timeout Policy | Don’t wait forever! Beyond a specific waiting time, a successful result is unlikely and worthless. |
Fallbacks | Fallback Policy | Things will still fail! Plan what you will do when that happens. |
Combination of Multiple Strategies | Policy Wrap | Different faults require different strategies. By combining multiple policies we increase the resiliency. |
Using Polly in 3 Steps
Step 1: Specify the Faults That the Policies Will Handle
We need to apply some policies (strategies) to handle transient faults. So, our first step is to define how to recognize these faults, which can be identified either from:
- Exceptions thrown (such as HttpRequestException, Exception, etc.), or
- Returned results (specifying the fault, e.g., in a related property such as Status, ErrorCode, etc.). In this case, we assume that the exceptions are handled with a try-catch, and a corresponding result will be returned.
Handle Thrown Exceptions
In the following example code, we will handle the HttpRequestException and OperationCanceledException exceptions.
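A minimal sketch of such a definition, assuming the Polly v7-style fluent API. The class and field names are illustrative and are not taken from the tutorial project:

```csharp
using System;
using System.Net.Http;
using Polly;

public static class ThrownExceptionsExample
{
    // A builder that treats HttpRequestException and OperationCanceledException
    // as the faults to handle. It is not a policy yet: Step 2 turns it into one.
    public static readonly PolicyBuilder Builder = Policy
        .Handle<HttpRequestException>()
        .Or<OperationCanceledException>();
}
```

Any other exception type is not handled by policies created from this builder and simply propagates to the caller.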
In the following example code, we will handle all exceptions. In addition, we state that our execution code and our fallback policies will return a nullable MyCodesResponse class. In this way, we can define policies for execution code that is not void (i.e., that returns something).
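As a hedged sketch (Polly v7-style API assumed), the generic Policy<TResult> entry point fixes the return type. The body of MyCodesResponse below is a placeholder, since the article does not show its members:

```csharp
#nullable enable
using System;
using Polly;

// Placeholder for the response class named in the text (members are assumed).
public class MyCodesResponse
{
    public string? Code { get; set; }
}

public static class AllExceptionsExample
{
    // Handles every exception and declares that the executed code
    // (and any fallback) returns a nullable MyCodesResponse.
    public static readonly PolicyBuilder<MyCodesResponse?> Builder =
        Policy<MyCodesResponse?>.Handle<Exception>();
}
```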
Handle Returned Results
In the following example code, we will get a MyResponseDTOClass object and handle the cases in which the MyStatusCode property is either InternalServerError or BadGateway.
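One possible shape of this definition is sketched below. It assumes that MyStatusCode is of type System.Net.HttpStatusCode, which the article does not state, and the DTO is reduced to that single property:

```csharp
using System;
using System.Net;
using Polly;

// Assumed shape of the DTO: only the property used by the policy is shown.
public class MyResponseDTOClass
{
    public HttpStatusCode MyStatusCode { get; set; }
}

public static class ReturnedResultsExample
{
    // Treats a returned object as a fault when its status code is 500 or 502.
    public static readonly PolicyBuilder<MyResponseDTOClass> Builder = Policy
        .HandleResult<MyResponseDTOClass>(r => r.MyStatusCode == HttpStatusCode.InternalServerError)
        .OrResult(r => r.MyStatusCode == HttpStatusCode.BadGateway);
}
```

With this builder, no exception needs to be thrown: the policy inspects the value that the executed code returns.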
Step 2: Specify How the Policy Should Handle the Faults
In this step, we define the policies (scenarios) with their thresholds and how we will combine them. The following code samples show how we can define policies based on the policyBuilder of Step 1. However, there are cases, such as the TimeoutPolicy, in which we should use the Polly static methods. To learn more about the different constructors of each policy, see the Polly documentation.
Step 3: Execute Code through the Policy
It’s time to apply the policy to the code that communicates with the third-party provider. In the following examples, we use the circuit-breaker policy to execute the ThirdPartyProviderCommunication() function. In addition, we can see how to get the response.
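A minimal sketch of the execution step, assuming the Polly v7-style API. ThirdPartyProviderCommunication() is replaced here by a stand-in that returns a fixed string, and the breaker thresholds are illustrative:

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using Polly;

public static class ExecuteExample
{
    private static readonly IAsyncPolicy<string> circuitBreaker =
        Policy<string>.Handle<Exception>().CircuitBreakerAsync(3, TimeSpan.FromMinutes(1));

    // Stand-in for the real HTTP call to the provider.
    private static Task<string> ThirdPartyProviderCommunication() =>
        Task.FromResult("data");

    // Option 1: get the result directly; handled exceptions are rethrown on failure.
    public static Task<string> GetAsync() =>
        circuitBreaker.ExecuteAsync(() => ThirdPartyProviderCommunication());

    // Option 2: capture the outcome instead of letting the exception escape.
    public static async Task<string?> GetOrNullAsync()
    {
        PolicyResult<string> result =
            await circuitBreaker.ExecuteAndCaptureAsync(() => ThirdPartyProviderCommunication());
        return result.Outcome == OutcomeType.Successful ? result.Result : null;
    }
}
```

ExecuteAndCaptureAsync is convenient when the caller prefers to inspect the outcome (and FinalException) rather than wrap the call in a try-catch.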
Handle Transient Faults with Polly Policies
Policy Objects VS HttpClient Factory
To use Polly, we have two options. We can use Policy Objects or the HttpClient factory (ASP.NET Core 2.1 onwards). As the names suggest:
- Policy Objects: Can be used everywhere we want to apply the Polly policies.
- HttpClient Factory: Adds Polly policies directly to the HttpClient factory so that they are applied to every outgoing call.
The HttpClient factory (IHttpClientFactory) in ASP.NET Core can be registered and used to pre-configure and create HttpClient instances in an application. The IHttpClientFactory offers additional benefits over using the HttpClient directly. If you are interested, Larkin K., et al. (2022, June 29) describe how we can use the IHttpClientFactory (basic usage, named clients, etc.).
We aim to apply the Polly policies in the communication code with our third-party providers. In our case, the communication is performed via HTTP. So, both options are applicable. However, this will not always be the case. For example, different communication protocols or existing HTTP client libraries may be used that do not support Polly by default. For such cases, we can select the Policy Objects option.
In this article, we will start from the basics and use the Policy Objects to apply Polly policies, which we can use everywhere.
The Tutorial Project
For the sake of our use case, we implemented two dummy providers (ProviderExampleApi1 & ProviderExampleApi2), which return random weather forecasts. In addition, we can define their execution delay and error response in each API request to simulate the transient faults.
The WebApiPolly project represents our API, which provides endpoints to simulate the different transient-fault scenarios. These endpoints communicate with our two providers to retrieve and combine the available results. For that purpose, two separate services have been implemented (assuming that each provider has a different API contract). In our example, we have applied Polly policies only to the integration of Provider2 (Provider2Integration.cs).
In the following sections, we will see in detail:
- How we defined each policy (as async),
- How we combined them, and
- How we simulated transient-fault scenarios to investigate our system’s behavior.
Basic Structure
In the Provider2Integration.cs file, we can see how we implemented the HTTP communication and applied the Polly policies. Let’s see some important details here:
- We have registered the IProvider2Integration with Transient lifetime in the Dependency Injection (DI) container. You can decide how to register your services depending on your project requirements. For a better understanding of Dependency Injection and lifetimes, read the .NET Nakama (2020, November) article.
- The HttpClient is static. The HttpClient is intended to be instantiated once per application rather than per use (.NET 6.0 Documentation).
- Our policy object (AsyncPolicyWrap) is static. This object holds the state that the policies need. For example, the circuit-breaker has to keep track of the consecutive errors, so we cannot create a new policy object per request.
Handle All Exceptions
We will handle all exceptions and define that our execution code and fallback policies will return a nullable Provider2GetResponse class.
Fallbacks Policy
We intend to use several policies to reduce and handle transient faults. However, there will be actions that will still fail. Using a fallback policy, we plan what we will do in those cases. In the following example, we will log (in console) these cases and return a null value. We could return a default or substitute value depending on each use case.
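A hedged sketch of such a fallback policy follows, assuming the Polly v7-style API. Provider2GetResponse is an empty placeholder here, and the log message is illustrative:

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using Polly;

// Empty placeholder for the provider response class used in the tutorial.
public class Provider2GetResponse { }

public static class FallbackExample
{
    // Handle all exceptions; executed code and fallback return Provider2GetResponse?.
    private static readonly PolicyBuilder<Provider2GetResponse?> policyBuilder =
        Policy<Provider2GetResponse?>.Handle<Exception>();

    // When the wrapped code still fails, log to the console and return null.
    // The cast tells the compiler that null is the fallback value, not a delegate.
    public static readonly IAsyncPolicy<Provider2GetResponse?> Fallback =
        policyBuilder.FallbackAsync(
            (Provider2GetResponse?)null,
            outcome =>
            {
                Console.WriteLine($"Fallback: returning null. Reason: {outcome.Exception?.Message}");
                return Task.CompletedTask;
            });
}
```

To return a default or substitute value instead, the null argument can be replaced with that value.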
Retry Policy with Exponential Backoff
In this policy, we will retry the failed executions up to maxRetries times (e.g., 2) and wait between the retries for a duration calculated based on the number of retry attempts. So, if we set the max retries to two, the maximum number of executions would be three (initial execution + two retries). In this example, we are using a simple function (waitTime = 2 ^ retryAttempt, i.e., 2 raised to the power of retryAttempt) to calculate the waitTime:
- 2 ^ 1 = 2 seconds
- 2 ^ 2 = 4 seconds
- 2 ^ 3 = 8 seconds
- etc.
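The waiting times above can be reproduced with a sketch like the following (Polly v7-style API assumed; the string result type and the log message are illustrative). Note that in C# the ^ operator is a bitwise XOR, so the power is computed with Math.Pow:

```csharp
using System;
using Polly;

public static class RetryExample
{
    private const int maxRetries = 2;

    // waitTime = 2 to the power of retryAttempt, in seconds.
    public static TimeSpan WaitTime(int retryAttempt) =>
        TimeSpan.FromSeconds(Math.Pow(2, retryAttempt));

    // Retries up to maxRetries times, waiting WaitTime(attempt) before each retry.
    public static readonly IAsyncPolicy<string> Retry = Policy<string>
        .Handle<Exception>()
        .WaitAndRetryAsync(
            maxRetries,
            WaitTime,
            (outcome, delay) =>
                Console.WriteLine($"Execution failed. Retrying in {delay.TotalSeconds} s"));
}
```

With maxRetries set to 2, a call that keeps failing waits 2 seconds and then 4 seconds before the final exception is rethrown.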
Circuit-Breaker Policy
In this circuit-breaker policy, we break the circuit after breakCurcuitAfterErrors consecutive exceptions and keep the circuit broken for keepCurcuitBreakForMinutes minutes. In addition, we define what to do when the circuit state changes to open (onBreak) and when the circuit state changes to closed (onReset). In our case, we write an informational console log.
It is essential to notice that we have used an additional fallback policy for the circuit-breaker to handle the BrokenCircuitException, keep a related log, and return an alternative response. We needed this because we would like to stop the retry policy when the circuit is opened (blocked).
When choosing a value for breakCurcuitAfterErrors, we must consider that the circuit-breaker also counts the failed retry executions.
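The two policies described in this section could look like the sketch below (Polly v7-style API assumed). The values 6 and 1 are inferred from the scenario described later in the article, Provider2GetResponse is an empty placeholder, and the log messages are illustrative:

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

// Empty placeholder for the provider response class used in the tutorial.
public class Provider2GetResponse { }

public static class CircuitBreakerExample
{
    private const int breakCurcuitAfterErrors = 6;      // assumed value
    private const int keepCurcuitBreakForMinutes = 1;   // assumed value

    // Opens the circuit after N consecutive handled exceptions.
    public static readonly IAsyncPolicy<Provider2GetResponse?> CircuitBreaker =
        Policy<Provider2GetResponse?>
            .Handle<Exception>()
            .CircuitBreakerAsync(
                breakCurcuitAfterErrors,
                TimeSpan.FromMinutes(keepCurcuitBreakForMinutes),
                // onBreak: the circuit changes to open
                (outcome, breakDelay) =>
                    Console.WriteLine($"Circuit opened for {breakDelay.TotalMinutes} min"),
                // onReset: the circuit changes back to closed
                () => Console.WriteLine("Circuit closed"));

    // Turns the BrokenCircuitException of an open circuit into a null result,
    // so that an outer retry policy that handles only exceptions stops retrying.
    public static readonly IAsyncPolicy<Provider2GetResponse?> BrokenCircuitFallback =
        Policy<Provider2GetResponse?>
            .Handle<BrokenCircuitException>()
            .FallbackAsync(
                (Provider2GetResponse?)null,
                outcome =>
                {
                    Console.WriteLine("Circuit is open: the call was not attempted");
                    return Task.CompletedTask;
                });
}
```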
Network Timeout Policy
In our example, we are using an HttpClient, on which we can set the Timeout. However, this will not always be the case. We may communicate using a client that does not support a timeout. In such cases, we can use the Polly timeout policy. In our example, we will time out after timeoutInSeconds and write a related log. The TimeoutStrategy has the following two options:
- Optimistic: The called code honors the CancellationToken and cancels when needed.
- Pessimistic: The called code may not honor the CancellationToken.
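A minimal sketch of an optimistic timeout policy follows, assuming the Polly v7-style API. The Create and RunAsync helpers, the string result type, and the log message are illustrative and not part of the tutorial project:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

public static class TimeoutExample
{
    // Builds a timeout policy through the static Policy.TimeoutAsync method.
    public static IAsyncPolicy<string> Create(int timeoutInSeconds) =>
        Policy.TimeoutAsync<string>(
            TimeSpan.FromSeconds(timeoutInSeconds),
            TimeoutStrategy.Optimistic,
            (context, timeSpan, abandonedTask) =>
            {
                Console.WriteLine($"Timed out after {timeSpan.TotalSeconds} s");
                return Task.CompletedTask;
            });

    // With the optimistic strategy, the executed delegate must pass the
    // policy's CancellationToken on to the code that does the actual work.
    public static Task<string> RunAsync(
        IAsyncPolicy<string> policy,
        Func<CancellationToken, Task<string>> call) =>
        policy.ExecuteAsync(call, CancellationToken.None);
}
```

When the time limit is exceeded, the policy throws a TimeoutRejectedException, which outer policies (retry, circuit-breaker, fallback) can then handle like any other exception.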
Policy Wrap
The Polly policies can be combined in any order using a PolicyWrap. However, we should consider the ordering points that are described in the Polly documentation. In the following example, we combined all the studied policies based on the typical policy ordering.
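One way to express this combination is sketched below (Polly v7-style API assumed). The ordering, from outermost to innermost, is an assumption based on the descriptions in the previous sections, and the thresholds and the empty Provider2GetResponse class are placeholders:

```csharp
#nullable enable
using System;
using Polly;
using Polly.CircuitBreaker;
using Polly.Timeout;
using Polly.Wrap;

// Empty placeholder for the provider response class used in the tutorial.
public class Provider2GetResponse { }

public static class PolicyWrapExample
{
    private static readonly PolicyBuilder<Provider2GetResponse?> policyBuilder =
        Policy<Provider2GetResponse?>.Handle<Exception>();

    // Static, so that the circuit-breaker state is shared between requests.
    // The first policy is the outermost one; the last one is the innermost.
    public static readonly AsyncPolicyWrap<Provider2GetResponse?> Wrap =
        Policy.WrapAsync<Provider2GetResponse?>(
            // 1. Last resort: return null when everything else has failed.
            policyBuilder.FallbackAsync((Provider2GetResponse?)null),
            // 2. Retry with exponential backoff (2 s, then 4 s).
            policyBuilder.WaitAndRetryAsync(
                2, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))),
            // 3. Convert an open circuit into a null result so that the retry stops.
            Policy<Provider2GetResponse?>
                .Handle<BrokenCircuitException>()
                .FallbackAsync((Provider2GetResponse?)null),
            // 4. Circuit-breaker: it sits inside the retry, so it counts every attempt.
            policyBuilder.CircuitBreakerAsync(6, TimeSpan.FromMinutes(1)),
            // 5. Timeout applied to each individual attempt.
            Policy.TimeoutAsync<Provider2GetResponse?>(
                TimeSpan.FromSeconds(5), TimeoutStrategy.Optimistic));
}
```

A call is then executed through the wrap in the same way as through a single policy, for example with Wrap.ExecuteAsync(...).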
Transient-Fault Scenarios Simulation
The tutorial project is configured as “Multiple Startup Projects” to start the two example providers and our main Web API project together. So, you just need to click the Start button as shown in Figure 2.
The following table shows the endpoints that simulate the different transient-fault scenarios and their names in the provided Postman collection. The complete code of the tutorial is available on GitHub, together with the Postman collection to test it quickly.
Postman Request Name | API GET Endpoints |
---|---|
Happy Path Scenario: No errors | https://localhost:7083/weatherforecasts |
Continuous-Failures (Provider 2 is down) | https://localhost:7083/weatherforecasts/continuous-failures |
Timeout-Errors (Provider 2 delay to respond) | https://localhost:7083/weatherforecasts/timeout-errors |
Transient-Faults (Random errors or/and delays on Provider 2) | https://localhost:7083/weatherforecasts/transient-faults |
Continuous Exceptions and Timeouts Scenarios
To test the retry and fallback policies, we can send the Continuous-Failures and the Timeout-Errors requests and investigate the produced console logs. For example, in the following figures, we can see:
- All executions fail either by general exception or timeout error (Figures 3 & 4).
- The initial execution and the two retries (Figures 3 & 4).
- The fallback policy returns a null value when the communication is not possible (Figures 3 & 4).
- The circuit-breaker policy opened (blocked) the circuit on the 6th consecutive failed execution (Figure 5).
- The circuit remained open for one minute (as configured) and did not accept requests, to give that system a break (Figure 5).
- After one minute, one execution was attempted, and because a failure occurred, it opened the circuit again for another minute (Figure 5).
Transient-Fault Scenarios
The Transient-Faults endpoint produces random errors and/or delays. In the following figure, we can see an execution example in which the first two executions failed (due to an error and a timeout). However, the third attempt was successful. Thus, in this request, the client received the results.
Summary
Transient fault handling may seem complicated, but libraries like Polly can simplify it. This article presented the three basic steps to use the Polly library. In addition, we applied and combined the Retries, Circuit-Breaker, Network Timeout, and Fallbacks policies to improve the resiliency of our Web API.
Using the provided source code and Postman collection, we simulated continuous and random failures (exceptions and/or timeouts). Finally, we investigated our system’s behavior when the Polly policies are applied. As we saw, combining these policies provides a powerful tool that reduces and handles transient faults to provide resilient APIs.
References
- .NET 6.0 Documentation (Accessed on 2022 July). HttpClient Class. https://docs.microsoft.com/en-us/dotnet/api/system.net.http.httpclient?view=net-6.0
- .NET Nakama (2020, November 4). ASP.NET Core Web API Fundamentals. https://www.dotnetnakama.com/blog/asp-dotnetcore-webapi-fundamentals/#basics-of-dependency-injection-in-aspnet-core
- .NET Nakama (2022, July 4). Strategies to Handle Transient Faults in Web APIs. https://www.dotnetnakama.com/blog/strategies-to-handle-transient-faults-in-web-apis/
- Larkin K., et al. (2022, June 29). Make HTTP requests using IHttpClientFactory in ASP.NET Core. https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-6.0