Web Service Analysis and Monitoring with Grafana & InfluxDB

Metrics: why and how

Author: Lorenzo Berni

During the development of web services, performance monitoring and usage tracking often receive little attention. Maintaining a system of metrics with adequate visualization lets us find bottlenecks and bugs before they turn into real problems.

My use case

I first dealt with metrics when I was asked to monitor the performance of a REST API, to check which endpoints were under the most load and to keep track of the HTTP response codes. To these parameters, strictly related to the API, I then added further metrics on the use of the underlying databases and their performance.

The analysis of metrics allows, at a glance, the immediate identification of certain anomalous API behaviors, behaviors that would otherwise go completely unnoticed for a long period of time.

Although reasonable, the requirements of a metric system are stringent:

Why Grafana

There are many reasons why I turned to Grafana:

Interesting metrics for a REST API

Filling your API with metrics is not only a useless exercise but can actually be counter-productive. For this reason, it is always advisable to perform a thorough analysis of what needs to be monitored. Receiving tons of useless alerts is barely more useful than not receiving them at all.

Here are some parameters that might be interesting to monitor:

While adding large quantities of uninteresting parameters is not recommended, it is fundamentally important to attach as much information as possible to the metrics you do collect. Knowing that many requests are failing is a good starting point, but knowing that the errors come from Android users in the United States running the latest version of the app, released a few hours earlier, is much more useful.
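As a rough illustration of attaching context to a metric, the sketch below encodes one error event with its dimensions as tags, using InfluxDB's line protocol. The measurement name, tag names and values are hypothetical, and writing points directly in line protocol is an assumption made for the example: the stack described in this article goes through statsd and the Graphite protocol instead.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape the characters that line protocol treats as separators."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def tagged_point(measurement: str, tags: dict, value: float) -> str:
    """Build one line-protocol point: measurement,tag=value,... value=<number>."""
    # Sorting tags by key keeps the output deterministic.
    tag_part = "".join(
        f",{_escape(key, ', =')}={_escape(str(val), ', =')}"
        for key, val in sorted(tags.items())
    )
    return f"{_escape(measurement, ', ')}{tag_part} value={value}"


# Hypothetical error event carrying the context discussed above.
line = tagged_point(
    "http_errors",
    {"platform": "android", "country": "US", "app_version": "2.4.0"},
    1,
)
print(line)
```

With tags like these, a single error counter can later be grouped or filtered by platform, country or app version instead of needing one metric per combination.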

Details on the metric stack

statsd

At the first level of the stack, in direct contact with the REST API, is statsd. The API talks to it through a statsd client, a simple Python module that implements the communication protocol over UDP. Statsd was chosen for this thankless task because it can accumulate the metrics it receives and output aggregates alongside the metrics themselves.
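To make the UDP exchange concrete, here is a minimal sketch of what a statsd client does on the wire. It is not the client module used in the project: the class name, metric names, host and port are assumptions for illustration (8125 is the port statsd conventionally listens on).

```python
import socket


def format_metric(name: str, value, metric_type: str) -> bytes:
    """Encode one metric in the statsd text format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


class TinyStatsdClient:
    """Fire-and-forget statsd client: one UDP datagram per metric."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125):
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload: bytes) -> None:
        try:
            self._sock.sendto(payload, self._addr)
        except OSError:
            # A metrics failure should never break the API request itself.
            pass

    def incr(self, name: str, count: int = 1) -> None:
        """Increment a counter ("c" type)."""
        self._send(format_metric(name, count, "c"))

    def timing(self, name: str, millis: float) -> None:
        """Record a duration in milliseconds ("ms" type)."""
        self._send(format_metric(name, millis, "ms"))


client = TinyStatsdClient()
client.incr("api.users.get.200")       # one more 200 response on this endpoint
client.timing("api.users.get.time", 35)  # the request took 35 ms
```

Because UDP is connectionless, the calls return immediately even when no statsd server is listening, which keeps the instrumentation overhead on the API low.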

InfluxDB

For the persistence of the accumulated data I chose InfluxDB, a robust database written in Go that can be queried with an SQL-like language. InfluxDB was designed to store time series data and performs well on both writes and queries. It also keeps data on the filesystem in an extremely compact format. Communication between statsd and InfluxDB takes place via the Graphite protocol.
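For reference, the Graphite plaintext protocol carries one sample per line in the form "path value timestamp". The sketch below only formats such a line; the metric path is a hypothetical example of what a statsd flush might emit, not a value taken from the project.

```python
def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one sample as '<path> <value> <timestamp>' plus a trailing newline."""
    if " " in path:
        # Spaces separate the three fields, so they cannot appear in the path.
        raise ValueError("metric paths must not contain spaces")
    return f"{path} {value} {int(timestamp)}\n"


# Hypothetical aggregate for one endpoint, with a Unix timestamp in seconds.
sample = graphite_line("stats.timers.api.users.get.mean", 12.5, 1700000000)
print(sample, end="")
```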

Grafana

Visualization, as mentioned above, is delegated to Grafana.

Grafana supports many data sources, including InfluxDB, and provides graphical tools within the web app itself for configuring queries and changing the appearance of the results.

Conclusions

I was very satisfied with Grafana as well as with the entire stack dedicated to metrics. That said, I found Grafana's graphical dashboard configuration interface very cumbersome. Another flaw was that I could not find a way to configure a set of ready-to-use dashboards on a clean installation of Grafana. To build a reproducible, working environment on Docker, I was forced to write scripts that import the dashboards on first boot through the Grafana HTTP API.
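The original import scripts are not shown here, but a minimal sketch of the idea could look like the following. It builds a request for Grafana's dashboard endpoint (POST /api/dashboards/db). The URL, API key and dashboard content are placeholders, and authenticating with a bearer token is an assumption about how the instance is configured.

```python
import json
import urllib.request


def build_import_request(grafana_url: str, api_key: str, dashboard: dict) -> urllib.request.Request:
    """Prepare the HTTP request that creates or overwrites one dashboard."""
    body = {
        # A null id tells Grafana to treat this as a new dashboard.
        "dashboard": {**dashboard, "id": None},
        # Overwrite an existing dashboard with the same uid or title.
        "overwrite": True,
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder values: replace with the real instance URL, key and dashboard JSON.
request = build_import_request(
    "http://localhost:3000/",
    "example-api-key",
    {"title": "REST API overview", "panels": []},
)
print(request.get_method(), request.full_url)
```

A first-boot script would load each exported dashboard JSON file, build a request like this one and send it with urllib.request.urlopen(request) once Grafana is reachable.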

Remove statsd from the stack

Although statsd is a mature and reliable tool, by the end of the project I concluded that it was not worth including an entire additional service solely to obtain temporary aggregates: InfluxDB is powerful and performant enough to carry out aggregations on the fly. In the final setup of the project, eliminating statsd would have removed a dependency, along with the related configuration and maintenance issues.
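As a sketch of what on-the-fly aggregation could look like, the helper below builds an InfluxQL query that averages a field over fixed time windows. The measurement and field names are hypothetical, and the function assumes its arguments come from trusted code, since it performs no escaping.

```python
ALLOWED_FUNCTIONS = {"MEAN", "MIN", "MAX", "SUM", "COUNT"}


def aggregate_query(
    measurement: str,
    field: str = "value",
    func: str = "MEAN",
    window: str = "1m",
    lookback: str = "1h",
) -> str:
    """Build an InfluxQL query that aggregates one field per time window."""
    if func not in ALLOWED_FUNCTIONS:
        raise ValueError(f"unsupported aggregation: {func}")
    return (
        f'SELECT {func}("{field}") FROM "{measurement}" '
        f"WHERE time > now() - {lookback} GROUP BY time({window})"
    )


# Average response time per minute over the last hour (hypothetical measurement).
query = aggregate_query("api.users.get.time")
print(query)
```

A query of this kind, configured in a Grafana panel, would stand in for the per-interval aggregates that statsd computes before forwarding the data.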

Give the TICK stack offered by InfluxData a chance

Another point that I would certainly have addressed, given infinite time and budget, is a possible migration to the entire TICK stack made available by InfluxData (the creators of InfluxDB).

This stack consists of: