Monitoring through Metrics

Siddheshwar Kumar
3 min read · May 22, 2024

This post covers one of the traditional ways of monitoring a service or application. It explains what metrics are, how they work, and the kind of database used to store them.

What are Metrics

Metrics are numbers that capture some important piece of information about a service or environment. They can capture low-level aspects like CPU, memory and disk space; high-level aspects like number of errors, latency, requests served and the backlog of a queue; and even business insights like daily active users (DAU) or orders placed in a day. These values are collected at a regular interval (say every 10 seconds) so that the trend becomes visible.

A metric is basically a tuple of (time, measured_value) which, when plotted over a time window, can reveal interesting insights and trends. The tuples are stored as a time-series data set, which is why metrics are also known as time series data. If we capture such tuples for errors in a service, we can get the error rate over the last hour (or day, or month, or all of these). Similarly, we can plot a latency graph and see how it correlates with changes in CPU usage.
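To make the tuple idea concrete, here is a minimal Python sketch of getting an error rate over a time window from (time, measured_value) tuples. The function names and the choice to keep errors and requests as two separate series are assumptions made for illustration; a real metrics system would run this as a query inside the TSDB.

```python
from typing import List, Tuple

# One data point: (unix_timestamp_seconds, measured_value)
Point = Tuple[int, float]


def sum_in_window(points: List[Point], now: int, window_seconds: int) -> float:
    """Sum the values whose timestamp falls in (now - window_seconds, now]."""
    start = now - window_seconds
    return sum(value for ts, value in points if start < ts <= now)


def error_rate(errors: List[Point], requests: List[Point],
               now: int, window_seconds: int) -> float:
    """Fraction of requests that failed within the window (0.0 if no requests)."""
    total = sum_in_window(requests, now, window_seconds)
    if total == 0:
        return 0.0
    return sum_in_window(errors, now, window_seconds) / total


# Hypothetical readings taken every 10 seconds.
errors = [(100, 1), (110, 0), (120, 3), (130, 2)]
requests = [(100, 50), (110, 40), (120, 60), (130, 50)]

last_20s = error_rate(errors, requests, now=130, window_seconds=20)
last_60s = error_rate(errors, requests, now=130, window_seconds=60)
```

The same two series answer both the "last 20 seconds" and the "last 60 seconds" question; only the window changes.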

Metrics can carry optional tags or labels for grouping and searching. This is where we record details such as which node or service a particular value belongs to.

cpu.load host=host01,region=eu-west 1613707265 50

cpu.load host=host02,region=eu-west 1613707265 60

  • ..

In the above example, cpu.load is the metric name, and host and region are labels/tags. The last two values are the timestamp (in Unix epoch seconds) and the actual CPU load at that time.
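As an illustration, the sample lines above can be split into their parts with a few lines of Python. This is only a sketch for the space-separated layout shown in this post (name, tags, timestamp, value); the DataPoint class and parse_line function are hypothetical names, and real wire formats differ between products.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DataPoint:
    name: str              # metric name, e.g. "cpu.load"
    tags: Dict[str, str]   # labels/tags, e.g. {"host": "host01"}
    timestamp: int         # Unix epoch seconds
    value: float           # measured value


def parse_line(line: str) -> DataPoint:
    """Parse 'name k1=v1,k2=v2 timestamp value' into a DataPoint."""
    name, tag_part, ts, value = line.split()
    # Each tag is a key=value pair; pairs are separated by commas.
    tags = dict(pair.split("=", 1) for pair in tag_part.split(","))
    return DataPoint(name, tags, int(ts), float(value))


point = parse_line("cpu.load host=host01,region=eu-west 1613707265 50")
```

After parsing, point.tags["region"] can be used to group or filter readings, which is exactly the role labels play in a query.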

How data gets stored

A single reading at one point in time does not give a meaningful picture. Instead, we need to keep taking readings at a regular interval (say every 30 seconds). Once these have been captured for some time, they show how the system behaved over the last 5 minutes, the last 24 hours, the last 30 days, etc.

A Time Series Database (TSDB) is a specialized database designed to efficiently store and retrieve time-stamped data, i.e. a value for each point in time.

TSDB vs Other DBs

It’s possible to handle time series data with regular relational or non-relational data stores, but TSDBs are optimized for ingestion rate, query performance, data compression, analytics and time-series-specific data management features. For example, computing a moving average over a rolling time window is not trivial in a relational DB. Supporting tagging/labeling of data requires an index for each label. Relational DBs are also generally not designed for a constant heavy write load. NoSQL DBs like Cassandra can handle this kind of load, but optimizing them for time series data requires expertise.
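To show what a moving average over a rolling time window actually computes, here is a plain Python sketch. It is not how any particular TSDB implements it; the function name is hypothetical and the points are assumed to arrive already sorted by timestamp. A TSDB typically offers this kind of calculation as a built-in query function.

```python
from collections import deque
from typing import List, Tuple

Point = Tuple[int, float]  # (unix_timestamp_seconds, measured_value)


def rolling_average(points: List[Point], window_seconds: int) -> List[Point]:
    """For each point, average all values in (ts - window_seconds, ts].

    Assumes `points` is sorted by timestamp in ascending order.
    """
    window = deque()   # points currently inside the window
    total = 0.0        # running sum of the values in `window`
    result = []
    for ts, value in points:
        window.append((ts, value))
        total += value
        # Evict points that have fallen out of the trailing window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        result.append((ts, total / len(window)))
    return result


# Hypothetical readings taken every 30 seconds, averaged over 60 seconds.
readings = [(0, 10), (30, 20), (60, 30), (90, 40)]
smoothed = rolling_average(readings, window_seconds=60)
```

Keeping a running sum and a queue means each point is added and removed once, so the cost stays linear in the number of points regardless of the window size.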

How a TSDB works

The source hosts (the ones which generate metrics data) produce timestamped data points. This data is sent to a collecting agent, where it is processed and aggregated. This is where the mass of data might be down-sampled to more manageable averages or units, for example storing one value per 1 or 2 minute interval. The collecting agent then writes the processed data to the TSDB.
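The down-sampling step can be sketched as follows: raw points are grouped into fixed-size time buckets and each bucket is replaced by its average. The function name and the choice of the average as the aggregate are assumptions for illustration; an agent could just as well keep the max, min or a percentile.

```python
from collections import defaultdict
from typing import List, Tuple

Point = Tuple[int, float]  # (unix_timestamp_seconds, measured_value)


def downsample(points: List[Point], bucket_seconds: int = 60) -> List[Point]:
    """Replace raw points with one averaged point per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align the timestamp to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical raw readings collapsed into 1-minute averages.
raw = [(0, 10), (10, 20), (50, 30), (60, 40), (110, 60)]
per_minute = downsample(raw, bucket_seconds=60)
```

Five raw points become two stored points here, which is the trade-off: less storage and faster queries in exchange for losing detail inside each bucket.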

Limitations of Metrics / TSDB

Metrics give you an aggregate understanding of what’s happening across all instances of a given service, and even let you narrow a query to specific groups of services, but they cope poorly with high (effectively unbounded) cardinality.

InfluxDB builds indexes on labels to facilitate fast lookup of time series by label. The key here is that labels should have low cardinality, i.e. a small set of possible values. A person's name is high-cardinality data, whereas an Availability Zone is low cardinality, since in most cases we use 3 AZs.
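A rough way to see why cardinality matters: every distinct combination of label values is its own time series, and a label index needs an entry for every distinct (label, value) pair. The sketch below is a toy model only, not InfluxDB's actual index; both function names are hypothetical.

```python
from collections import defaultdict
from math import prod
from typing import Dict, List, Set, Tuple


def max_series_count(label_values: Dict[str, List[str]]) -> int:
    """Upper bound on distinct series: product of each label's cardinality."""
    return prod(len(values) for values in label_values.values())


def build_label_index(series: List[Dict[str, str]]) -> Dict[Tuple[str, str], Set[int]]:
    """Map each (label, value) pair to the ids of the series carrying it."""
    index = defaultdict(set)
    for series_id, labels in enumerate(series):
        for key, value in labels.items():
            index[(key, value)].add(series_id)
    return index


# Low-cardinality labels keep the number of series small.
low = max_series_count({"az": ["a", "b", "c"], "host": ["host01", "host02"]})

# Adding one high-cardinality label (say 1000 hypothetical user names)
# multiplies the number of possible series by 1000.
high = max_series_count({
    "az": ["a", "b", "c"],
    "host": ["host01", "host02"],
    "user": [f"user{i}" for i in range(1000)],
})

index = build_label_index([
    {"host": "host01", "region": "eu-west"},
    {"host": "host02", "region": "eu-west"},
])
```

Looking up index[("region", "eu-west")] returns both series ids at once, which is the fast path the index is built for; the cost is that the index grows with every new label value.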

Solutions

There are multiple well-known products for metrics. InfluxDB and Prometheus are two of the most popular TSDBs. Both rely on an in-memory cache combined with on-disk storage.


Siddheshwar Kumar

Principal Software Engineer; enjoy building highly scalable systems. Before moving to this platform I used to blog on http://geekrai.blogspot.com/