My Experience

Saturday, March 26, 2022

Retry Strategy - Transient Fault Handling

Systems can be susceptible to transient failures when communicating with downstream dependencies, which impacts service availability, especially in the cloud, where communication between services goes over unreliable channels.

Using a proper retry strategy is crucial to recover quickly from minor glitches, but it has to be done carefully and follow a set of best practices.

I firmly believe that having no retries, or an overly conservative retry configuration, is as bad as aggressive retries; they just impact the system's availability differently.

𝗕𝗲𝘀𝘁 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀

Every component in the call chain must have a time-to-live (timeout) and a retry configuration that work in harmony. Async workflows and batch processes should tolerate higher latency, and here a retry strategy is crucial, but it has to be limited (e.g., 3 retries).

Systems should have a mechanism to throttle bursts in traffic to avoid retry storms, which can slow down the system's recovery during major outages.

Use exponential back-off with jitter when configuring retries: the waiting time should increase exponentially (e.g., first retry after 100ms, second after 200ms, and third after 400ms).

Back-off alone isn't very helpful, as many failed requests retrying at the same moment cause contention and overload the service again; adding randomness (jitter) to the waiting time (back-off) spreads the retries out over time.
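A minimal sketch of exponential back-off with jitter could look like the following. The function names, the doubling factor, the cap, and the "full jitter" variant (a uniformly random wait between 0 and the current cap) are assumptions made for illustration, not a prescribed implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper: upper bound (in ms) of the back-off window for a
// 1-based retry attempt. The window doubles on every attempt and never
// exceeds max_ms.
inline std::int64_t backoffCapMs(int attempt, std::int64_t base_ms, std::int64_t max_ms) {
    assert(attempt >= 1 && base_ms > 0 && max_ms >= base_ms);
    std::int64_t delay = base_ms;
    for (int i = 1; i < attempt; ++i) {
        if (delay > max_ms / 2) return max_ms;  // next doubling would exceed the cap
        delay *= 2;                             // 100 -> 200 -> 400 -> ...
    }
    return delay;
}

// Hypothetical helper: "full jitter" picks a uniformly random wait in
// [0, cap], so clients that failed together do not retry together.
template <typename Rng>
std::int64_t backoffWithJitterMs(int attempt, std::int64_t base_ms, std::int64_t max_ms, Rng& rng) {
    std::uniform_int_distribution<std::int64_t> dist(0, backoffCapMs(attempt, base_ms, max_ms));
    return dist(rng);
}
```

A caller would typically seed one generator per process (for example a std::mt19937_64), sleep for the returned number of milliseconds before each retry, and stop after a small, fixed number of attempts.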


Retries may impact latency, so you have to measure their impact on the end-to-end flow. Control-plane and data-plane operations react differently to this effect; hence, it's up to you to decide the best retry strategy for your use case.

Overly aggressive retry strategies (Too short intervals or too frequent retries) can adversely affect the downstream service recovery during major outages.

Tuesday, February 15, 2022

How do I move to the next level?

If you are wondering how you could move to the next level, from Junior to Senior or from Senior to Tech Lead, here is an idea!

Take some of the ideas I post here from time to time (I will try to post something every week), compile a small document that briefly describes the initiative and its benefits, and present it to your team.

You don't have to implement it entirely, but you could become the go-to person for this topic in your team from now on ;)

I'd also go above and beyond and create a backlog story with refined sub-tasks that are ready for pick-up in future sprint plannings!

If you are at a Senior level, I think it's more convenient to use such ideas as onboarding projects for new hires; it will be a great onboarding experience ;)

Sunday, February 13, 2022

Standard Operating Procedure (SOP)

𝐒𝐭𝐚𝐧𝐝𝐚𝐫𝐝 𝐎𝐩𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐏𝐫𝐨𝐜𝐞𝐝𝐮𝐫𝐞 (𝐒𝐎𝐏)

In some companies, every team is solely responsible for the design and development of their systems and what comes after that, such as deployments, monitoring, and troubleshooting of errors.

An SOP document is crucial to provide detailed step-by-step instructions to complete specific processes.

For example, when a failure occurs (Alarm A is triggered), a team member should find the steps to troubleshoot and mitigate this error in the SOP doc!

𝐆𝐨𝐨𝐝 𝐒𝐎𝐏

Go to logs for stage X (Link)
Check the dashboard for stage X (Link)
Update the ticket with what you found in the logs, and a link to the graphs
Ensure the Dead Letter Queue (DLQ) is empty; if not, apply the SOP section for retrying DLQ messages (Link)
Check if there is a customer impact (usually customer X, Y, and Z are impacted by such error) - (Link to clients dashboard)
Check if the error is because of a downstream dependency (Link to dependency H and R dashboards)
Update the ticket with your findings above, and engage on-calls for client/dependency teams if necessary.
....

𝐖𝐡𝐞𝐧 𝐝𝐨 𝐭𝐞𝐚𝐦𝐬 𝐝𝐞𝐟𝐢𝐧𝐞 𝐒𝐎𝐏𝐬?

Unfortunately, most teams write the SOP sections for specific errors only after they occur. In our team, we started to change our strategy: we always define all possible points of failure during design and development, and add tasks to update the SOP doc with a section for each failure type!

For existing systems, I highly recommend reading about the Pre-Mortem process.

Technology Evolves, Facts can change too?

In the past few weeks, I was involved in several discussions about topics that I had always thought were over-studied, and that pretty much everyone was aligned on!

My first impression was: why are we discussing this now? There are tons of articles and books citing case studies around them, and they all reached a similar conclusion.

Then, I decided to take a step back and listen carefully to the other person's insights and the reasoning that led them to such a conclusion!

The thoughts and arguments were good. They weren't strong enough to fully convince me, but if you think about it from a different perspective, you may find it crucial to continue challenging those thoughts, opinions, and facts from time to time.

Technology has advanced so much, and businesses have crazily evolved recently. So maybe those facts that made sense ten years ago aren’t relevant anymore.

Sunday, June 21, 2020

Modulo Operation Tricks

When you solve a problem that might produce values beyond integer limits (e.g., values that wouldn't fit into a 64-bit integer), you're often asked to print the value modulo a large prime (e.g., 10^9 + 7).

To be able to use the modulo operation correctly, you need to understand its properties.

Trick #1 Multiplication/Summation of values

For example, if you are multiplying a set of numbers (a, b, c, d), then it's useful to know that:

(a*b*c*d)%N = ((a%N)*(b%N)*(c%N)*(d%N))%N = (((a*b*c)%N)*(d%N))%N

The same holds for addition: (a+b)%N = ((a%N)+(b%N))%N.

The above is quite useful when you want to always keep calculations to fit into the given limits.
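As a small sketch of Trick #1, the product can be reduced after every multiplication so that no intermediate value leaves the 64-bit range. The name productMod is hypothetical, and the sketch assumes the modulus is at most 2^32 so that the product of two reduced values still fits into an unsigned 64-bit integer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: product of all values modulo mod, reducing after
// every multiplication. Assumes 0 < mod <= 2^32, so (mod - 1)^2 still fits
// into an unsigned 64-bit integer.
inline std::uint64_t productMod(const std::vector<std::uint64_t>& values, std::uint64_t mod) {
    assert(mod > 0 && mod <= (1ULL << 32));
    std::uint64_t result = 1 % mod;  // the empty product is 1 (or 0 when mod == 1)
    for (std::uint64_t v : values) {
        result = (result * (v % mod)) % mod;
    }
    return result;
}
```

For instance, productMod({2, 3, 4, 5}, 7) reduces 120 modulo 7 without ever needing the full product when the inputs are large.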

Trick #2 Multiplicative Inverse

Division is a little trickier, though; you can't simply assume that:

(a/b)%N = (a%N/b%N)

Although that would be awesome, it's unfortunately not correct. To solve this, you have to understand the modular multiplicative inverse of a number.

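So to calculate (a/b)%N, you can convert it to ((a%N)*power(b, N-2))%N, where power is computed modulo N. This works when N is prime and b is not a multiple of N; I will not explain how or why, so please refer to the wiki page on the modular multiplicative inverse for more details.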
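A hedged sketch of Trick #2 follows. It relies on Fermat's little theorem, so it assumes the modulus p is prime and that b is not a multiple of p. The names powerMod and divideMod are hypothetical, and the modulus is assumed to be at most 2^32 so intermediate products fit into 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: (base^exp) % mod by repeated squaring.
// Assumes 0 < mod <= 2^32 so every intermediate product fits in 64 bits.
inline std::uint64_t powerMod(std::uint64_t base, std::uint64_t exp, std::uint64_t mod) {
    assert(mod > 0 && mod <= (1ULL << 32));
    std::uint64_t result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = (result * base) % mod;  // use this bit of the exponent
        base = (base * base) % mod;                   // square for the next bit
        exp >>= 1;
    }
    return result;
}

// Hypothetical helper: (a / b) % p for a prime p, where b is not a multiple
// of p. The inverse of b is b^(p-2) % p by Fermat's little theorem.
inline std::uint64_t divideMod(std::uint64_t a, std::uint64_t b, std::uint64_t p) {
    assert(p >= 2 && b % p != 0);
    return ((a % p) * powerMod(b, p - 2, p)) % p;
}
```

When a is not divisible by b, the result is not the integer quotient but the value q that satisfies (q*b)%p == a%p, which is what contest problems usually ask for.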

Trick #3 Negative Numbers

Another problem you may face during programming contests is calculating the modulo of a negative number, which doesn't always work as intended.

For example, -1%10 results in -1 rather than 9. This depends on the programming language you use: the trick below is useful for Java and C++, but not needed in Python, which already produces 9 😊.

To overcome this, add N to the remainder and take the modulo again: ((x%N)+N)%N, so ((-1%10)+10)%10 = 9%10 = 9 👍.
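Trick #3 can be wrapped in a small helper. The name floorMod is hypothetical, and the sketch assumes n is positive and small enough that adding it to a remainder cannot overflow a 64-bit integer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: always returns a value in [0, n), even for negative x.
// In C++ the sign of x % n follows x, so the remainder is shifted by n and
// reduced once more.
inline std::int64_t floorMod(std::int64_t x, std::int64_t n) {
    assert(n > 0);
    return ((x % n) + n) % n;
}
```

Reducing with x % n first matters: adding n only once to a raw value such as -25 would still leave it negative.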

That's it! I will keep this post up to date with any other tricks I find in the future.

Sunday, May 24, 2020

Speak so that others want to listen

How many times have you been in a meeting or a group discussion where it was your turn to say something, and someone either interrupted you, or waited until you finished and then carried on talking or simply switched the topic?

I'm pretty sure this has happened many times throughout your career and will continue to happen even after you read this article and practice its techniques, but I can promise it will be less often than before.

I've been observing others and myself in meetings, and I found that most of the discussions followed the same scenario I've described above, so I started to look for ways to improve the way I speak and to spot the mistakes I and others often make.

I've found many techniques and suggestions that help someone speak effectively and make people more engaged with what he/she says; I've chosen only 3 of them to talk about in this article.

Note: The following techniques are not only useful when speaking; rule #3, for example, is a handy technique for writing persuasive, easy-to-read content (e.g., an email).

1. The power of the pause

This technique is one of the essential methods one could learn and practice to improve the effectiveness of his contributions in discussions and meetings.

Adding a pause before you talk allows the other person to finish; it keeps you from jumping in with the first thing that comes to your mind and avoids the risk of interrupting the other person if he wants to add something.

Adding pauses when you move from one point to the next grabs your audience's full attention.

This technique will make people feel more comfortable talking to you and, of course, actively listening to you and what you say. It will also make you and the others more engaged in the discussion.

2. Speak with clarity and confidence

How many times have you been asked (or asked someone) to repeat what you/he/she's just said (multiple times)?

One mistake we all make (both native and non-native speakers) is speaking too fast and swallowing the endings of some words, which might change the meaning or cause confusion. If a listener loses focus for a few seconds to think about what you said, or finds it difficult to follow, he will either ask you to repeat what you just said or stop listening because he is still thinking about the words you mispronounced.

So next time, try to speak at a slower pace and make yourself clear when speaking. You will see the difference in how confident you become and how engaged the audience is. Practicing this technique will improve your cadence, and later it will happen naturally when you speak.

One tip I liked is to speak slowly, as if you are giving your phone number to someone and want to make sure he gets it right and is able to write it down.

3. The rule of three

Structuring your content, for example, when writing an Email or in a presentation in three parts, makes it very easy for your audience to follow and remember and makes you (the author) as a speaker or writer appear more knowledgeable, credible, and convincing.

As you can see, I followed this approach in this article, and after reading through it, I genuinely believe that you will be able to memorize all three techniques quickly next time you speak or write about something.

Do you know more techniques? Please comment on the article and share them with other readers and me. And as a little practice, write your review using the above methods about the topic and share it with friends.

Saturday, April 6, 2019

The Beatles Problem A - Codeforces #549 DIV1

A few days ago I was trying to solve The Beatles problem, which was a little bit tricky; I had to analyze it and collect enough observations to come up with a good solution. In this article I will go through my observations step by step, and I hope it's easy for you to digest.

The editorial of the problem is very concise and not easy for everyone to understand, so I will be as detailed as possible in this article to make sure you get the idea by going through my observations one by one and, of course, combining them to code a solution.

Observation #1:

Let's assume the jump we take each time is L. How many jumps do we need to get back to S (the start position)? Let's find out by example:

Note: For this observation we can completely ignore a and b

Assume we have n = 2, k = 3 and we start at S = 2:

If L = 1 then the path will be {2, 3, 4, 5, 6, 1, 2} => 6 jumps!

If L = 2 then the path will be {2, 4, 6, 2} => 3 jumps!

If L = 3 then the path will be {2, 5, 2} => 2 jumps!

From the above example we figured out that #jumps = #cities/gcd(#cities, L)

Where gcd is Greatest Common Divisor.
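The formula above can be sanity-checked against a direct simulation. This is only a sketch: the function names are hypothetical, std::gcd assumes a C++17 compiler, and the start position is taken as offset 0 because rotating the circle does not change the number of jumps.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>

// Hypothetical helper: walk around a circle of `cities` cities in jumps of
// length l and count the jumps until the start position is reached again.
inline std::int64_t jumpsBySimulation(std::int64_t cities, std::int64_t l) {
    assert(cities > 0 && l > 0);
    std::int64_t position = 0;  // offset relative to the start S
    std::int64_t jumps = 0;
    do {
        position = (position + l) % cities;
        ++jumps;
    } while (position != 0);
    return jumps;
}

// Hypothetical helper: the closed form from Observation #1.
inline std::int64_t jumpsByFormula(std::int64_t cities, std::int64_t l) {
    assert(cities > 0 && l > 0);
    return cities / std::gcd(cities, l);
}
```

With 6 cities, both functions give 6, 3, and 2 jumps for L = 1, 2, and 3, matching the paths listed in the example.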

Observation #2:

We need to figure out L, as it's much easier to brute force a solution with it (I will show you why in the next observation).

To be honest, I didn't get to that observation until I took a small hint from a solution which I couldn't understand at first (the solution was checking 4 possible values to find L) :O but after some manual work I figured it out.

The 2 X markers on the line are the first and second cities. From the problem statement we know that city1 has its nearest restaurant at distance A and city2 has its nearest restaurant at distance B -> Nice!

We have 2 possibilities for the nearest restaurant of the first city: it's either before or after it (marked as A1, A2). The same goes for the nearest restaurant of the second city (marked as B1, B2).

We also know from the problem statement that the distance between any two neighboring restaurants is K. That means the distance between A* and B* is i*k, where i is any value between 1 ... n (we will see why it's bounded by n in the next observation).

- Remember we need to figure out the distance between the 2 Xs which is ???

- Yes, correct - its L

That leaves us with the following 4 possibilities:

Possibility #1: Restaurants are located at (A1, B1):

-> L = i*k + (B1 - A1)

Possibility #2: Restaurants are located at (A1, B2):

-> L = i*k + (-B2 - A1)

Possibility #3: Restaurants are located at (A2, B1):

-> L = i*k + (A2 + B1)

Possibility #4: Restaurants are located at (A2, B2):

-> L = i*k + (A2 - B2)

Observation #3:

Now we need to iterate to find the value of L, but what are the boundaries to find L?

We know that L = i*k + c, where c is one of the possible values {(A+B), (A-B), (-A-B), (B-A)}.

We also know we have only N*K cities, so i could be any value between (1 ... N), right?

Now with the above 3 observations we can easily code a solution; here is my C++ solution for this problem.

#include <algorithm>
#include <iostream>
#include <numeric>
using namespace std;

int main() {
        long long n, k, a, b;
        cin >> n >> k >> a >> b;

        // Candidate offsets c from Observation #2, so that L = i*k + c.
        long long possible[] = {a+b, a-b, -a-b, b-a};

        long long cities = n*k;
        long long mn = cities, mx = 1;

        for (long long i = 1; i <= n; i++) {
                for (int j = 0; j < 4; j++) {
                        // Normalize the offset into [0, k).
                        long long x = possible[j]%k;
                        if (x < 0) x += k;

                        // Observation #1: #jumps = #cities/gcd(#cities, L).
                        // std::gcd requires C++17.
                        long long jumps = cities/gcd(cities, i*k+x);
                        mn = min(mn, jumps);
                        mx = max(mx, jumps);
                }
        }

        cout << mn << " " << mx << endl;
        return 0;
}

If you have questions/feedback or spot a mistake please let me know in the comments.