Saturday, March 26, 2022

Retry Strategy - Transient Fault Handling

 

Systems can be susceptible to transient failures when communicating with downstream dependencies, impacting service availability, especially in the cloud, where communication between services goes over unreliable channels.
Using a proper retry strategy is crucial to recover quickly from minor glitches, but it has to be done carefully and follow a set of best practices.
I firmly believe that having no retries, or an overly conservative retry configuration, is just as bad as retrying aggressively; they simply hurt the system's availability in different ways.
𝗕𝗲𝘀𝘁 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀
Every component in the call chain must have a time-to-live (timeout) and a retry configuration that work in harmony. Async workflows and batch processes can tolerate higher latency, so a retry strategy is crucial there, but the number of attempts has to be bounded (e.g., 3 retries).
Systems should have a mechanism to throttle bursts in traffic to avoid retry storms, which can delay the system's recovery during major outages.
Use exponential back-off with jitter when configuring retries: the waiting time should increase exponentially with each attempt (e.g., 100 ms before the first retry, 200 ms before the second, and 400 ms before the third).
Back-off alone isn't very helpful, because clients that failed at the same time will retry at the same time and cause contention and overload all over again; adding randomness (jitter) to the back-off interval spreads those retries out in time (see the sketch at the end of this post).
Retries add latency, so you have to measure their impact on the end-to-end flow; control-plane and data-plane operations react differently to this effect, hence it's up to you to decide the best retry strategy for your use case.
Overly aggressive retry strategies (too-short intervals or too many attempts) can hinder the downstream service's recovery during major outages.
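To make the practices above concrete, here is a minimal, hypothetical sketch in Python (not part of the original post): bounded retries with a per-attempt timeout, exponential back-off capped at a maximum delay, full jitter, and a small token-bucket "retry budget" to throttle retry storms. Every name and number in it (MAX_ATTEMPTS, BASE_DELAY_S, RetryBudget, and so on) is an illustrative assumption to be tuned for your own workload, not a prescription.

import random
import time

# Hypothetical knobs; tune them for your own workload.
MAX_ATTEMPTS = 3              # bounded retries (e.g., 3 retries)
BASE_DELAY_S = 0.1            # first back-off interval: 100 ms
MAX_DELAY_S = 2.0             # cap so the back-off never grows unbounded
PER_ATTEMPT_TIMEOUT_S = 1.0   # time-to-live for a single call


def backoff_with_full_jitter(attempt: int) -> float:
    """Exponential back-off (100 ms, 200 ms, 400 ms, ...) with full jitter.

    Full jitter picks a random delay in [0, capped back-off] so that clients
    that failed at the same time do not all retry at the same time.
    """
    exp_delay = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return random.uniform(0, exp_delay)


class RetryBudget:
    """A tiny token-bucket-style budget that throttles bursts of retries.

    Each retry consumes a token; successful calls slowly refill the bucket.
    When the bucket is empty, we fail fast instead of piling more retries
    onto an already struggling dependency (a "retry storm").
    """

    def __init__(self, max_tokens: float = 10.0, refill_per_success: float = 0.1):
        self.max_tokens = max_tokens
        self.tokens = max_tokens
        self.refill_per_success = refill_per_success

    def can_retry(self) -> bool:
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def record_success(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.refill_per_success)


def call_with_retries(operation, budget: RetryBudget):
    """Call `operation(timeout)` with bounded, jittered retries."""
    for attempt in range(MAX_ATTEMPTS + 1):  # 1 initial try + MAX_ATTEMPTS retries
        try:
            result = operation(PER_ATTEMPT_TIMEOUT_S)
            budget.record_success()
            return result
        except Exception:
            last_attempt = attempt == MAX_ATTEMPTS
            if last_attempt or not budget.can_retry():
                raise  # out of attempts or out of retry budget: fail fast
            time.sleep(backoff_with_full_jitter(attempt))


if __name__ == "__main__":
    budget = RetryBudget()

    def flaky_call(timeout: float) -> str:
        # Stand-in for a real downstream call that respects `timeout`.
        if random.random() < 0.5:
            raise TimeoutError("transient failure")
        return "ok"

    try:
        print(call_with_retries(flaky_call, budget))
    except Exception as exc:
        print(f"gave up after retries: {exc}")

Full jitter (a random delay between zero and the capped back-off) is just one common flavor; equal or decorrelated jitter work too, and the right caps and budget sizes depend on whether the path is control-plane or data-plane.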

Tuesday, February 15, 2022

How do I move to the next level?

If you are wondering how you could move to the next level, say from Junior to Senior, or from Senior to Tech Lead, here is an idea!

Take one of the ideas I post here from time to time (I will try to post something every week), compile a small document that briefly describes the initiative and its benefits, and present it to your team.

You don't have to implement it entirely, but you could become the go-to person for this topic in your team from now on ;)

I'd also go above and beyond and create a backlog story with refined sub-tasks that are ready for pick-up in future sprint planning sessions!

If you are at a Senior level, I think it's even better to use such ideas as onboarding projects for new hires; it makes for a great onboarding experience ;)

Sunday, February 13, 2022

Standard Operating Procedure (SOP)

 

In some companies, every team is solely responsible for the design and development of their systems and what comes after that, such as deployments, monitoring, and troubleshooting of errors.
An SOP document is crucial: it provides detailed, step-by-step instructions for completing specific processes.
For example, when a failure occurs (Alarm A is triggered), a team member should find the steps to troubleshoot and mitigate this error in the SOP doc!
 
𝐆𝐨𝐨𝐝 𝐒𝐎𝐏
  •  Go to logs for stage X (Link)
  •  Check the dashboard for stage X (Link)
  •  Update the ticket with what you found in the logs, and a link to the graphs
  •  Ensure the Dead Letter Queue (DLQ) is empty; if not, apply the SOP section for retrying DLQ messages (Link)
  •  Check if there is a customer impact (usually customer X, Y, and Z are impacted by such error) - (Link to clients dashboard)
  •  Check if the error is because of a downstream dependency (Link to dependency H and R dashboards)
  •  Update the ticket with your findings above, and engage on-calls for client/dependency teams if necessary.
  •  ....
𝐖𝐡𝐞𝐧 𝐝𝐨 𝐭𝐞𝐚𝐦𝐬 𝐝𝐞𝐟𝐢𝐧𝐞 𝐒𝐎𝐏𝐬?
 
Unfortunately, most teams write the SOP sections for specific errors only after those errors occur. In our team, we started to change that strategy: we now identify all possible points of failure during design and development and add tasks to update the SOP doc with a section for each failure type!
For existing systems, I highly recommend reading about the Pre-Mortem process.

Technology Evolves - Can Facts Change Too?

 

In the past few weeks, I was involved in several discussions about topics that I had always thought were over-studied and that pretty much everyone was aligned around!
 
My first impression was: why are we discussing this now? There are tons of articles and books citing case studies around these topics, and they all reached a similar conclusion 🧐
 
Then, I decided to take a step back and listen carefully to the other person's insights and the reasoning that led them to that conclusion!
The thoughts and arguments were good. They weren't strong enough to fully convince me, but if you look at it from a different perspective, you may find it crucial to keep challenging such thoughts, opinions, and facts from time to time.
 
Technology has advanced so much, and businesses have evolved dramatically in recent years, so maybe the facts that made sense ten years ago aren't relevant anymore.
 
 End of text 🤦‍♂️