Systems are susceptible to transient failures when communicating with downstream dependencies, which impacts service availability, especially in the cloud, where communication between services goes over unreliable channels.
Using a proper retry strategy is crucial to recover quickly from minor glitches, but it has to be done carefully and follow a set of best practices.
I firmly believe that having no retry configuration, or an overly conservative one, is as bad as aggressive retries; they just impact the system's availability in different ways.
𝗕𝗲𝘀𝘁 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀
Every component in the call chain must have a time-to-live (timeout) and a retry configuration that work in harmony. Async workflows and batch processes can tolerate higher latency, so a retry strategy is especially valuable there, but it has to be bounded (e.g., 3 retries).
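As a rough sketch of this idea (assuming a Python service calling an HTTP dependency with the requests library; the URL, timeout, and retry cap below are illustrative, not prescriptive):

```python
import time

import requests  # assumed HTTP client; any client with per-request timeouts works

MAX_ATTEMPTS = 3           # hard cap so a failing dependency is not hammered forever
PER_ATTEMPT_TIMEOUT = 2.0  # seconds; must fit within the caller's own time-to-live


def call_with_retries(url: str) -> requests.Response:
    """Call a downstream dependency with a per-attempt timeout and a bounded retry count."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            # Each attempt gets its own timeout so a hung connection cannot
            # consume the whole end-to-end budget.
            return requests.get(url, timeout=PER_ATTEMPT_TIMEOUT)
        except requests.RequestException as exc:
            last_error = exc
            if attempt == MAX_ATTEMPTS:
                break
            time.sleep(0.1 * attempt)  # simple delay; see the jitter sketch further below
    raise RuntimeError(f"dependency call failed after {MAX_ATTEMPTS} attempts") from last_error
```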
Systems should have a mechanism to throttle bursts of traffic (including retries) to avoid retry storms, which can hinder the system's recovery during major outages.
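One common way to implement such throttling is a retry budget. A minimal token-bucket sketch, assuming the refill rate and capacity below would be tuned per service:

```python
import threading
import time


class RetryBudget:
    """Token-bucket retry budget: retries are only allowed while tokens remain,
    so a burst of failures cannot turn into a retry storm."""

    def __init__(self, tokens_per_second: float = 1.0, max_tokens: float = 10.0):
        self._rate = tokens_per_second
        self._max = max_tokens
        self._tokens = max_tokens
        self._last_refill = time.monotonic()
        self._lock = threading.Lock()

    def allow_retry(self) -> bool:
        with self._lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, up to the bucket's capacity.
            self._tokens = min(self._max, self._tokens + (now - self._last_refill) * self._rate)
            self._last_refill = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            # No budget left: drop the retry instead of adding load during an outage.
            return False
```

Callers would check `allow_retry()` before each retry and give up (or fail fast) when the budget is exhausted.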
Use exponential back-off with jitter when configuring retries: the waiting time should increase exponentially (e.g., first retry after 100 ms, second after 200 ms, third after 400 ms).
Back-off alone isn't enough: when many failed requests retry at the same moments, they cause contention and overload the dependency again, so adding randomness (jitter) to the back-off spreads the retries out in time, as in the sketch below.
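For illustration, here is one way to compute jittered back-off delays; the "full jitter" variant and the specific base/cap values are assumptions, not the only option:

```python
import random
import time

BASE_DELAY = 0.1    # 100 ms initial back-off
MAX_DELAY = 2.0     # cap so a single retry never waits too long
MAX_ATTEMPTS = 3


def backoff_with_jitter(attempt: int) -> float:
    """Exponential back-off with full jitter: pick a random delay between 0 and an
    exponentially growing cap, which spreads retries from many clients over time."""
    exp_delay = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0, exp_delay)


def retry_with_backoff(operation):
    """Run `operation`, retrying failures with jittered exponential back-off."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return operation()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            time.sleep(backoff_with_jitter(attempt))
```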
Retries can add latency, so measure their impact on the end-to-end flow; control-plane and data-plane calls react differently to that added delay, so it's up to you to decide the best retry strategy for your use case.
Overly aggressive retry strategies (too-short intervals or too-frequent retries) can adversely affect the downstream service's recovery during major outages.