Friday, September 18, 2015

Monitoring tools and libraries

This quarter I was assigned the task of designing and implementing a solution to monitor the accuracy of our results and the health of our system (e.g., request durations, error rate, cache ratio).

At first I thought it might be a silly task, but it turned out to be an exciting one that I learned a lot from, since I used many open source technologies during the project.

In this post I will describe some of these systems and libraries. They might be useful to many of you for tracking the health of your system and getting a good indication of how well it is doing. ;)

Metrics Library


Simply put, if you want to monitor your system, you first need to expose data that can be measured. I used this library to expose metrics via JMX, but it also supports other ways to expose and report them (e.g., console reporter, HTTP, CSV).
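To make the idea of exposing a value over JMX concrete, here is a minimal sketch that uses only the JDK's own javax.management API rather than the metrics library itself. The names JmxExposureSketch, RequestStats, and the object name passed in are hypothetical and chosen only for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustration only: plain JDK JMX, not the metrics library's reporter.
final class JmxExposureSketch {

    // Standard MBean contract: the interface name must be the class name + "MBean".
    public interface RequestStatsMBean {
        long getErrorCount();
    }

    public static final class RequestStats implements RequestStatsMBean {
        private final AtomicLong errors = new AtomicLong();

        public void recordError() {
            errors.incrementAndGet();
        }

        @Override
        public long getErrorCount() {
            return errors.get();
        }
    }

    // Registers a new RequestStats bean under the given (hypothetical) object name.
    static RequestStats register(String objectName) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            RequestStats stats = new RequestStats();
            server.registerMBean(stats, new ObjectName(objectName));
            return stats;
        } catch (Exception e) {
            throw new IllegalStateException("Could not register MBean", e);
        }
    }

    // Reads the value back through the MBean server, as a JMX client would.
    static long readErrorCount(String objectName) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return (Long) server.getAttribute(new ObjectName(objectName), "ErrorCount");
        } catch (Exception e) {
            throw new IllegalStateException("Could not read attribute", e);
        }
    }
}
```

Once registered, the ErrorCount attribute should be visible to any JMX client attached to the process (JConsole, for example).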

A metric can be one of the following types:

Counters


Use this metric to report a count that increases and decreases over time (e.g., bookings, errors, completed requests).
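As a rough illustration of what a counter does, here is a minimal thread-safe sketch built on the JDK only. CounterSketch is a hypothetical name and this is not the library's own class, just the same idea in a few lines.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustration only: a counter that can go up and down and be read at any time.
final class CounterSketch {
    // LongAdder copes well with many threads updating the same count.
    private final LongAdder count = new LongAdder();

    void inc() {
        count.increment();
    }

    void dec() {
        count.decrement();
    }

    long getCount() {
        return count.sum();
    }
}
```

For example, you could call inc() when a request starts and dec() when it finishes to track the number of requests in flight.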

 

Histogram


I found this metric type very useful for measuring the statistical distribution of a series of values, such as request durations. For example, you can update the histogram with the duration of every completed request and then read the mean, standard deviation, 75th percentile, and so on at any point in time to see how well your system is performing.
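The statistics mentioned above can be sketched as follows. This is a simplified illustration under my own assumptions: it keeps every recorded value in memory, uses the population standard deviation, and uses the nearest-rank method for percentiles. A real metrics library would typically sample values instead of storing all of them. HistogramSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: stores all values, which is fine for small examples.
final class HistogramSketch {
    private final List<Long> values = new ArrayList<>();

    synchronized void update(long value) {
        values.add(value);
    }

    synchronized double mean() {
        if (values.isEmpty()) return 0.0;
        double sum = 0.0;
        for (long v : values) sum += v;
        return sum / values.size();
    }

    // Population standard deviation (divides by n, not n - 1).
    synchronized double stdDev() {
        if (values.isEmpty()) return 0.0;
        double mean = mean();
        double squares = 0.0;
        for (long v : values) squares += (v - mean) * (v - mean);
        return Math.sqrt(squares / values.size());
    }

    // Nearest-rank percentile; quantile is a fraction, e.g. 0.75 for the 75th percentile.
    synchronized long percentile(double quantile) {
        if (values.isEmpty()) return 0L;
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(quantile * sorted.size());
        int clamped = Math.min(Math.max(rank, 1), sorted.size());
        return sorted.get(clamped - 1);
    }
}
```

With request durations as input, percentile(0.75) answers the question "how long did the slowest quarter of requests take, at minimum?".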

 

Timer


I didn't use this metric type, but it is mainly used to measure, for example, the duration of a request together with the rate of requests per second.
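Since I didn't use timers myself, the following is only a sketch of the concept with JDK classes: it records how long each task took and derives a mean rate from the number of tasks since the timer was created. TimerSketch and its method names are hypothetical, not the library's API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustration only: combines a duration measurement with a simple mean rate.
final class TimerSketch {
    private final long startNanos = System.nanoTime();
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    // Runs the task and records its duration, even if the task throws.
    void time(Runnable task) {
        long begin = System.nanoTime();
        try {
            task.run();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - begin);
            count.incrementAndGet();
        }
    }

    long getCount() {
        return count.get();
    }

    double meanDurationMillis() {
        long c = count.get();
        return c == 0 ? 0.0 : totalNanos.get() / 1_000_000.0 / c;
    }

    // Tasks per second, averaged over the whole lifetime of this timer.
    double meanRatePerSecond() {
        long elapsed = System.nanoTime() - startNanos;
        return elapsed <= 0 ? 0.0 : count.get() * 1_000_000_000.0 / elapsed;
    }
}
```

Wrapping request handling in time(...) would give both the average request duration and the requests per second from a single metric.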

 

Health Checks


This one is also very useful if you have multiple subsystems whose health you need to check, such as the health of a database connection.
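A health check registry can be sketched roughly like this: each subsystem registers a named check, and running all checks yields one status per subsystem, with exceptions treated as unhealthy. HealthCheckSketch, the status strings, and the check names in the usage note are hypothetical and not taken from the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Illustration only: named checks that report healthy or unhealthy.
final class HealthCheckSketch {
    private final Map<String, Callable<Boolean>> checks = new LinkedHashMap<>();

    void register(String name, Callable<Boolean> check) {
        checks.put(name, check);
    }

    // Runs every check; a check that throws is reported as unhealthy with its message.
    Map<String, String> runAll() {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, Callable<Boolean>> entry : checks.entrySet()) {
            String status;
            try {
                status = entry.getValue().call() ? "HEALTHY" : "UNHEALTHY";
            } catch (Exception e) {
                status = "UNHEALTHY: " + e.getMessage();
            }
            results.put(entry.getKey(), status);
        }
        return results;
    }
}
```

A database check, for instance, could run a trivial query and return true if it succeeds, so a refused connection shows up as an unhealthy "database" entry.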

I used this library and exposed all metrics via the JMX reporter. The results were great, especially when combined with other monitoring tools like the one I describe below.

Grafana


You can use Grafana to visualize your metrics and measure your system health over time.

I used Grafana to create multiple dashboards that track the error rate, cache ratio, no-results rate, and the accuracy of our results.

I would recommend it to anybody who wants to visualize and measure their system's health and accuracy.

Here are some useful points about Grafana:
  • Datasource - You can define multiple data sources, and each type supported by Grafana (e.g., InfluxDB, Graphite) has its own query editor.
  • User - It supports user authentication and authorization via LDAP, a database, or Google authentication.
  • Dashboard - You can group graphs in one dashboard and create multiple dashboards to track your system.
  • Row - In one dashboard you can have multiple rows to organise how the graphs are laid out.
  • Panel - There are multiple panel types, but for me the most important was the graph, which you define with the not very powerful query editor :D.

You can also define the time period shown on each graph and the refresh rate of the whole dashboard.

