Transient timeouts and the retry rabbit hole (.NET 4.5)

Tulara Webster
Campaign Monitor Engineering
Jun 25, 2018


A month ago I would have told you retrying on timeouts in C# was as simple as chucking Polly at the thing. Oh, to be young and naïve and using ambiguous language!

Turns out, there are some easy mistakes to make, and here is a post summarising my learnings. All code samples shown are runnable in LinqPad.

TL;DR: If you want to retry an HTTP call, HttpClient’s delegating handlers don’t let you distinguish between external task cancellations and actual request timeouts. And if you don’t use them, you have to be careful to make sure your request and its content haven’t been disposed when you retry. Skip to the last attempt to see how we solved this.

Task:

Campaign Monitor has a high-performance in-house file storage service we use for things like storing campaign content.
The mission in this instance was to add retries, and a circuit breaker, to the client library that makes (a lot of) HTTP calls to this service.

Attempt 1: Polly + DelegatingHandler = easy retries! right?!

Ingredient 1: Polly.

I like Polly. It’s a resilience and transient-fault-handling library that allows for configuring such wonders as retries and circuit breakers in a fluent manner. It handles async and is thread safe. It’s so wonderful that it’s been integrated into .NET Core 2.1. However, alas, some of our code is still operating in .NET 4.5 so here we’ve had to put the pieces together ourselves.

You can use it for anything that might throw an exception at you (in the real world, you’ll most likely be doing this when that failure is transient):
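As a minimal sketch (assuming the Polly NuGet package is referenced; the flaky operation and the `RetryDemo` name are made up for illustration), a retry policy wrapped around something that fails twice before succeeding could look like this:

```csharp
using System;
using Polly;

public static class RetryDemo
{
    public static string Run()
    {
        // Hypothetical flaky operation: fails on the first two calls, then succeeds.
        var attempts = 0;
        Func<string> flaky = () =>
        {
            attempts++;
            if (attempts < 3) throw new TimeoutException("transient failure");
            return "ok";
        };

        // Retry up to 3 times on TimeoutException, waiting a little longer each time.
        var retryPolicy = Policy
            .Handle<TimeoutException>()
            .WaitAndRetry(3, attempt => TimeSpan.FromMilliseconds(50 * attempt));

        var result = retryPolicy.Execute(() => flaky());
        return result + " after " + attempts + " attempts";
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // ok after 3 attempts
    }
}
```

Anything the policy does not handle (here, any exception other than `TimeoutException`) is rethrown straight away without a retry.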

Ingredient 2: .NET’s DelegatingHandler.

This guy has been around since .NET 4.5 and sits in between the HttpClient and its built-in HttpClientHandler à la:

You can read more about it here and here but the essence of it is: do something fun like log or retry on any request the client issues. In code:
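A minimal sketch of a logging handler, under a few assumptions: `LoggingHandler`, `StubHandler` and the `example.invalid` URL are illustrative names, and the stub stands in for the real `HttpClientHandler` so the sample runs without a network.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Logs each request on the way out and each response on the way back.
public class LoggingHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        Console.WriteLine("--> " + request.Method + " " + request.RequestUri);
        var response = await base.SendAsync(request, cancellationToken);
        Console.WriteLine("<-- " + (int)response.StatusCode);
        return response;
    }
}

// Hypothetical stand-in for HttpClientHandler so the sketch needs no network.
public class StubHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));
    }
}

public static class PipelineDemo
{
    public static async Task<HttpStatusCode> RunAsync()
    {
        // In real code the InnerHandler would be a new HttpClientHandler().
        var pipeline = new LoggingHandler { InnerHandler = new StubHandler() };
        using (var client = new HttpClient(pipeline))
        {
            var response = await client.GetAsync("http://example.invalid/files/1");
            return response.StatusCode;
        }
    }

    public static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult());
    }
}
```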

Put the two ingredients together, et voilà! Neat little request client with retries.
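Roughly, and only as a sketch: the handler below wraps `base.SendAsync` in a Polly retry policy. The retry count, the back-off and the `FailsTwiceHandler` stub are assumptions for illustration, and the Polly NuGet package is assumed to be referenced. Note that it treats every `TaskCanceledException` as retryable, which is exactly the weakness discussed next.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;

public class RetryHandler : DelegatingHandler
{
    // Retries connection failures and *any* task cancellation.
    private readonly IAsyncPolicy _retryPolicy = Policy
        .Handle<HttpRequestException>()
        .Or<TaskCanceledException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(50 * attempt));

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        return _retryPolicy.ExecuteAsync(() => base.SendAsync(request, cancellationToken));
    }
}

// Hypothetical stub: the first two calls fail, the third succeeds.
public class FailsTwiceHandler : HttpMessageHandler
{
    public int Calls;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(1, cancellationToken);
        if (Interlocked.Increment(ref Calls) <= 2)
            throw new HttpRequestException("simulated transient failure");
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

public static class RetryHandlerDemo
{
    public static async Task<int> RunAsync()
    {
        var stub = new FailsTwiceHandler();
        using (var client = new HttpClient(new RetryHandler { InnerHandler = stub }))
        {
            var response = await client.GetAsync("http://example.invalid/files/1");
            Console.WriteLine(response.StatusCode + " after " + stub.Calls + " calls");
            return stub.Calls;
        }
    }

    public static void Main()
    {
        RunAsync().GetAwaiter().GetResult();
    }
}
```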

Only… There’s a problem.

Exhibit A. Timeouts.
HttpClient bubbles timeouts up as task cancellations.

An impractical timeout for illustrative purposes
resulting in a TaskCanceledException
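To reproduce this without a slow server, here is a small sketch that uses a hypothetical `NeverRespondsHandler` stub in place of the network and a 50 ms timeout (both made up for the demo):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stub: never answers, it just waits until it is cancelled.
public class NeverRespondsHandler : HttpMessageHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(Timeout.Infinite, cancellationToken);
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

public static class TimeoutDemo
{
    public static async Task<string> RunAsync()
    {
        using (var client = new HttpClient(new NeverRespondsHandler()))
        {
            client.Timeout = TimeSpan.FromMilliseconds(50); // impractically short
            try
            {
                await client.GetAsync("http://example.invalid/files/1");
                return "completed";
            }
            catch (TaskCanceledException ex)
            {
                // No TimeoutException here: the timeout arrives as a cancellation.
                return ex.GetType().Name;
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult());
    }
}
```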

Exhibit B. External Task Cancellations

If someone closes a connection (e.g. gives up on a browser tab), the client will cancel all of its outstanding requests for, say, images. These are passed down as task cancellations.

The following shows instances of “System.Threading.Tasks.TaskCanceledException: A task was canceled”.
A peak of 2000 requests that have been cancelled in the space of a minute? Yep, that’ll break your circuit.

Sure enough, instances of “Polly.CircuitBreaker.BrokenCircuitException: The circuit is now open and is not allowing calls”:

A retry handler like that in Attempt 1 was responsible for sending these requests. It's now attempting to retry requests that no-one will ever see the result of.

What a sad and lonely thought. 😓

Problem:
The cancellation token can’t tell us where the cancellation came from.

If we are outside the HttpClient we could just catch the exception and ask our own cancellation token if it was cancelled. True = someone outside my method cancelled the request. False = timeout (because the HttpClient must have cancelled it!).
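A sketch of that check from the caller's side. The `Classify` helper, the stub handler and the timings are all assumptions for illustration:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stub: never answers, it just waits until it is cancelled.
public class NeverRespondsHandler : HttpMessageHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(Timeout.Infinite, cancellationToken);
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

public static class CancellationSourceDemo
{
    // Outside HttpClient we still hold the caller's own, unlinked token.
    static async Task<string> Classify(HttpClient client, CancellationToken callerToken)
    {
        try
        {
            await client.GetAsync("http://example.invalid/files/1", callerToken);
            return "completed";
        }
        catch (OperationCanceledException) // TaskCanceledException derives from this
        {
            return callerToken.IsCancellationRequested ? "cancelled by caller" : "timed out";
        }
    }

    public static async Task<string[]> RunAsync()
    {
        using (var impatient = new HttpClient(new NeverRespondsHandler()) { Timeout = TimeSpan.FromMilliseconds(50) })
        using (var patient = new HttpClient(new NeverRespondsHandler()) { Timeout = TimeSpan.FromSeconds(30) })
        using (var callerCts = new CancellationTokenSource())
        {
            // Case 1: nobody cancels, so HttpClient's own timeout fires.
            var first = await Classify(impatient, CancellationToken.None);

            // Case 2: the caller gives up long before the 30 second timeout.
            callerCts.CancelAfter(TimeSpan.FromMilliseconds(50));
            var second = await Classify(patient, callerCts.Token);

            return new[] { first, second };
        }
    }

    public static void Main()
    {
        foreach (var line in RunAsync().GetAwaiter().GetResult())
            Console.WriteLine(line);
    }
}
```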
The thing is, you can’t distinguish exhibit A from exhibit B once you’re inside a DelegatingHandler. The first thing the HttpClient does when it gets your request is link your cancellation token together with its own. It is the combined token that is passed down through the handlers.

HttpClient.SendAsync

The cancellation token given to your Delegating handler will show IsCancellationRequested == true for both an external task cancellation request and an http client timeout:
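Here is a minimal sketch that shows this for the timeout case without touching the network. `TokenInspectingHandler` and the stub are hypothetical names. The caller's token is never cancelled, and the sample prints the state of the caller's token first, then the state of the token the handler was given:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stub: never answers, it just waits until it is cancelled.
public class NeverRespondsHandler : HttpMessageHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(Timeout.Infinite, cancellationToken);
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

// Records what the handler's token looks like when the request is cancelled.
public class TokenInspectingHandler : DelegatingHandler
{
    public bool HandlerTokenCancelled;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        catch (OperationCanceledException)
        {
            HandlerTokenCancelled = cancellationToken.IsCancellationRequested;
            throw;
        }
    }
}

public static class LinkedTokenDemo
{
    public static async Task<bool[]> RunAsync()
    {
        var inspector = new TokenInspectingHandler { InnerHandler = new NeverRespondsHandler() };
        using (var callerCts = new CancellationTokenSource()) // never cancelled
        using (var client = new HttpClient(inspector) { Timeout = TimeSpan.FromMilliseconds(50) })
        {
            try
            {
                await client.GetAsync("http://example.invalid/files/1", callerCts.Token);
            }
            catch (OperationCanceledException)
            {
                // Expected: the HttpClient timeout fired.
            }
            return new[] { callerCts.Token.IsCancellationRequested, inspector.HandlerTokenCancelled };
        }
    }

    public static void Main()
    {
        foreach (var flag in RunAsync().GetAwaiter().GetResult())
            Console.WriteLine(flag);
    }
}
```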

Resulting in:
False
True

That’s a bummer… what does it mean?

1st Lesson: If we want to retry a TaskCanceledException caused as a result of a transient timeout, but not as a result of a cancellation request by the caller, we can’t use a DelegatingHandler.

Attempt 2: Differentiate timeouts and task cancellations

Alright, fine. We’ll retry the request from outside the HttpClient call. Who needs you anyway, DelegatingHandler.

Ha, and I still got to use a Polly retry policy 😉 Let’s just execute that…
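A sketch of what such a naive outside-the-client retry might look like, assuming the Polly NuGet package and a hypothetical stub handler that always times out. It deliberately reuses one `HttpRequestMessage` across attempts, and HttpClient refuses to send the same message twice, so the second attempt fails with an `InvalidOperationException` instead of retrying:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;

// Hypothetical stub: never answers, it just waits until it is cancelled.
public class NeverRespondsHandler : HttpMessageHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(Timeout.Infinite, cancellationToken);
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

public static class NaiveOutsideRetryDemo
{
    public static async Task<string> RunAsync()
    {
        var callerToken = CancellationToken.None;

        // Retry only cancellations that the caller did not ask for, i.e. timeouts.
        var retryOnTimeout = Policy
            .Handle<TaskCanceledException>(_ => !callerToken.IsCancellationRequested)
            .RetryAsync(2);

        using (var client = new HttpClient(new NeverRespondsHandler()) { Timeout = TimeSpan.FromMilliseconds(50) })
        {
            // Mistake on purpose: one request object shared by every attempt.
            var request = new HttpRequestMessage(HttpMethod.Get, "http://example.invalid/files/1");
            try
            {
                await retryOnTimeout.ExecuteAsync(() => client.SendAsync(request, callerToken));
                return "completed";
            }
            catch (InvalidOperationException ex)
            {
                // Thrown on the second attempt: the message was already sent once.
                return ex.GetType().Name;
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult());
    }
}
```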

Oh… hmm. You win this round, HttpClient. But, I can totally just recreate this request!

Unit test the new retry handler. Manually test the integration with a few Fiddler-motivated retries. Ready to go!

2nd Lesson: To recreate HttpRequestMessage, you have to recreate the HttpContent object (and test retries on your goddamn writes!)
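One way to respect this lesson, sketched under assumptions (the Polly NuGet package, a made-up `createRequest` factory and a stub handler that hangs on the first call only), is to build a brand-new request and a brand-new content object for every attempt:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;

// Hypothetical stub: the first call hangs until cancelled, later calls echo the body.
public class TimesOutOnceHandler : HttpMessageHandler
{
    public int Calls;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (Interlocked.Increment(ref Calls) == 1)
            await Task.Delay(Timeout.Infinite, cancellationToken);
        var body = await request.Content.ReadAsStringAsync();
        return new HttpResponseMessage(HttpStatusCode.OK) { Content = new StringContent(body) };
    }
}

public static class FreshRequestDemo
{
    public static async Task<string> RunAsync()
    {
        var callerToken = CancellationToken.None;
        var retryOnTimeout = Policy
            .Handle<TaskCanceledException>(_ => !callerToken.IsCancellationRequested)
            .RetryAsync(2);

        // Every attempt gets its own request AND its own content object,
        // so nothing disposed by a previous attempt is ever reused.
        Func<HttpRequestMessage> createRequest = () =>
            new HttpRequestMessage(HttpMethod.Put, "http://example.invalid/files/1")
            {
                Content = new StringContent("campaign content")
            };

        using (var client = new HttpClient(new TimesOutOnceHandler()) { Timeout = TimeSpan.FromMilliseconds(50) })
        {
            var response = await retryOnTimeout.ExecuteAsync(
                () => client.SendAsync(createRequest(), callerToken));
            return await response.Content.ReadAsStringAsync();
        }
    }

    public static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult()); // campaign content
    }
}
```

This only works when the payload can be rebuilt cheaply; a one-shot stream would need to be buffered first.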

Otherwise you might end up bringing down a production API for half an hour… 😬

Snippet of a log from the resulting mayhem

From here, I solicited advice from prolific C sharpers (my colleagues) and we came up with the following options:

  • clone http content using a memory stream
  • create an undisposable http request object
  • stop using httpClient’s Timeout property and time out the request using our own cancellation token

So in the pursuit of simplicity I opted for option 3.

It’s worth noting that this has been rectified in .NET Core; HttpClient no longer disposes of the request content there.

Attempt 3: HttpClient.Timeout you’re no friend of mine.

Moving back to the delegating handler model described in Attempt 1. This time we have another handler which will manage a cancellation token coordinating the request timeout. And let’s actually throw a relevant exception while we’re at it.

Note the linkedToken; this is what HttpClient was doing under the hood. Now we have our own, unpolluted token which we can check for cancellation. If either the inherited one or our timeout token is cancelled, that is passed through to HttpClient via the linked token.
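A sketch of such a timeout handler, with assumed names and timings and a stub in place of the network. It links its own timeout token with the inherited one and turns a timeout into a `TimeoutException`, while cancellations from the caller pass through untouched:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class TimeoutHandler : DelegatingHandler
{
    private readonly TimeSpan _timeout;

    public TimeoutHandler(TimeSpan timeout) { _timeout = timeout; }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        using (var timeoutCts = new CancellationTokenSource(_timeout))
        using (var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
            cancellationToken, timeoutCts.Token))
        {
            try
            {
                return await base.SendAsync(request, linkedCts.Token);
            }
            catch (OperationCanceledException)
                when (timeoutCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
            {
                // Only our own timeout token fired, so this really is a timeout.
                throw new TimeoutException("Request timed out after " + _timeout);
            }
        }
    }
}

// Hypothetical stub: never answers, it just waits until it is cancelled.
public class NeverRespondsHandler : HttpMessageHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(Timeout.Infinite, cancellationToken);
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

public static class TimeoutHandlerDemo
{
    public static async Task<string> RunAsync(TimeSpan handlerTimeout, CancellationToken callerToken)
    {
        var pipeline = new TimeoutHandler(handlerTimeout) { InnerHandler = new NeverRespondsHandler() };
        using (var client = new HttpClient(pipeline) { Timeout = Timeout.InfiniteTimeSpan })
        {
            try
            {
                await client.GetAsync("http://example.invalid/files/1", callerToken);
                return "completed";
            }
            catch (TimeoutException) { return "timed out"; }
            catch (OperationCanceledException) { return "cancelled by caller"; }
        }
    }

    public static void Main()
    {
        Console.WriteLine(RunAsync(TimeSpan.FromMilliseconds(50), CancellationToken.None)
            .GetAwaiter().GetResult());
        using (var callerCts = new CancellationTokenSource(TimeSpan.FromMilliseconds(50)))
        {
            Console.WriteLine(RunAsync(TimeSpan.FromSeconds(30), callerCts.Token)
                .GetAwaiter().GetResult());
        }
    }
}
```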

Throw that together with the appropriate RetryHandler (catching timeouts only):
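As a sketch (assuming the Polly NuGet package; names, retry count and back-off are illustrative), a retry handler that only reacts to `TimeoutException` could look like the following. To keep the sample short, a scripted stub throws the exceptions a timeout handler or a caller cancellation would produce further down the pipeline.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;

public class RetryOnTimeoutHandler : DelegatingHandler
{
    // TimeoutException only: plain task cancellations are left alone.
    private readonly IAsyncPolicy _retryPolicy = Policy
        .Handle<TimeoutException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(50 * attempt));

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        return _retryPolicy.ExecuteAsync(ct => base.SendAsync(request, ct), cancellationToken);
    }
}

// Hypothetical stub: throws the given exception for the first N calls, then returns 200.
public class ScriptedHandler : HttpMessageHandler
{
    private readonly Func<Exception> _failure;
    private readonly int _failures;
    public int Calls;

    public ScriptedHandler(Func<Exception> failure, int failures)
    {
        _failure = failure;
        _failures = failures;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        await Task.Delay(1, cancellationToken);
        if (Interlocked.Increment(ref Calls) <= _failures) throw _failure();
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

public static class RetryOnTimeoutDemo
{
    // Returns how many times the innermost handler was called.
    public static async Task<int> CountCallsAsync(Func<Exception> failure, int failures)
    {
        var stub = new ScriptedHandler(failure, failures);
        using (var client = new HttpClient(new RetryOnTimeoutHandler { InnerHandler = stub }))
        {
            try { await client.GetAsync("http://example.invalid/files/1"); }
            catch (OperationCanceledException) { /* cancellation: not retried */ }
            return stub.Calls;
        }
    }

    public static void Main()
    {
        // Two timeouts, then success: three calls in total.
        Console.WriteLine(CountCallsAsync(() => new TimeoutException(), 2).GetAwaiter().GetResult());
        // A cancellation: one call, no retries.
        Console.WriteLine(CountCallsAsync(() => new TaskCanceledException(), 1).GetAwaiter().GetResult());
    }
}
```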

So now, if we cancel the request via the token from outside the client request we won’t see a retry.
One more gotcha: the default Timeout on HttpClient is 100 seconds. If you make your timeout configurable (like we have outside these examples), some crazy person might set it to a timespan greater than that default. HttpClient’s own timeout would then fire first and surface as a plain task cancellation, and we would lose our retries!
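One possible guard, offered as an assumption rather than a description of what we shipped: switch off HttpClient's built-in timeout so that only the timeout handler in the pipeline decides when a request has taken too long. The `ClientFactory` helper is a made-up name.

```csharp
using System;
using System.Net.Http;

public static class ClientFactory
{
    // 'pipeline' is assumed to already contain a handler that enforces the
    // request timeout; HttpClient's own 100 second default is disabled so it
    // can never fire first.
    public static HttpClient Create(HttpMessageHandler pipeline)
    {
        return new HttpClient(pipeline)
        {
            Timeout = System.Threading.Timeout.InfiniteTimeSpan
        };
    }

    public static void Main()
    {
        using (var client = Create(new HttpClientHandler()))
        {
            Console.WriteLine(client.Timeout == System.Threading.Timeout.InfiniteTimeSpan);
        }
    }
}
```

The trade-off is that a pipeline built without a timeout handler would then wait forever, so this only makes sense when the two are always configured together.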

Success!

Retries are happening (this is the last day of data):

at a healthy rate:

And the ones we’ve logged are retrying for the right reasons:

🙌🙌🙌🙌
