Prometheus monitoring and VictoriaMetrics

In November 2019 we started a project to migrate our existing monitoring infrastructure, based on Prometheus v1, to a new monitoring stack.

Our Prometheus v1 instance served us for a year and was reliable, but it consumed almost 2TB of storage (uncompressed, 4 bytes per sample on average). Even on an instance with 32GB of RAM, it ran out of memory (OOM) several times when we tried to query more than 3 months of metrics.

Diagram of our old architecture:

Old monitoring architecture with Prometheus 1

The new monitoring cluster has to be:

  • highly available
  • more performant
  • more space-efficient, as we store up to 2 years of time series (TS) data.

Prometheus v2 itself is more efficient than v1, but we decided to check what storage options were available. We came up with the following list:

TSDB              Bytes per datapoint (from public sources)
Prometheus 2      2
Thanos            1.07
Cortex            ?
M3DB              1.45
VictoriaMetrics   0.4
TimescaleDB       29

Fast-forward: we decided to give VictoriaMetrics a try.
Here are the results of a real-world disk usage test. A test instance ran VictoriaMetrics and Prometheus 2 side by side for some time and collected 5.46B datapoints.

TSDB                                   Bytes per datapoint   Change vs. Prometheus 2 without compression
Prometheus 2 without compression       1.15
Prometheus 2 with ZFS compression      0.713                 -38%
VictoriaMetrics without compression    0.336                 -70%
VictoriaMetrics with ZFS compression   0.293                 -74%
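To make the comparison easy to re-check, here is a minimal Python sketch that derives the relative savings from the measured bytes per datapoint. It assumes the percentages are all taken against Prometheus 2 without compression; the names `measurements`, `BASELINE` and `reduction_percent` are ours, chosen for illustration.

```python
# Bytes per datapoint measured in the disk usage test (values from the table above).
measurements = {
    "Prometheus 2 without compression": 1.15,
    "Prometheus 2 with ZFS compression": 0.713,
    "VictoriaMetrics without compression": 0.336,
    "VictoriaMetrics with ZFS compression": 0.293,
}

# Assumption: every percentage is relative to uncompressed Prometheus 2.
BASELINE = "Prometheus 2 without compression"


def reduction_percent(value: float, baseline: float) -> float:
    """Return how much smaller `value` is than `baseline`, in percent."""
    return (1.0 - value / baseline) * 100.0


reductions = {
    name: reduction_percent(size, measurements[BASELINE])
    for name, size in measurements.items()
}

for name, pct in reductions.items():
    print(f"{name}: {measurements[name]} bytes/datapoint, {pct:.1f}% smaller than baseline")
```

The computed savings come out at roughly 38.0%, 70.8% and 74.5%, so the figures in the table appear to be rounded down.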

At our current load (17K samples per second), VictoriaMetrics shows consistently faster query times than Prometheus 2 (10ms on average vs 500ms on the same hardware), has stable and low RAM usage (goodbye OOMs), and uses just 5% of the CPU on a t3.large instance.
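For a rough feel of what these numbers mean for the 2-year retention target, here is a back-of-the-envelope projection in Python. It is only a sketch: it assumes the ingestion rate stays constant at 17K samples per second and that the bytes per datapoint measured in the test carry over to the full retention period, neither of which is guaranteed. The helper `projected_bytes` is a name we made up for this example.

```python
SAMPLES_PER_SECOND = 17_000           # current load, from the text
RETENTION_YEARS = 2                   # retention target, from the requirements
SECONDS_PER_YEAR = 365 * 24 * 3600    # assumption: non-leap years

# Total datapoints stored over the whole retention period (constant rate assumed).
total_samples = SAMPLES_PER_SECOND * SECONDS_PER_YEAR * RETENTION_YEARS


def projected_bytes(bytes_per_datapoint: float) -> float:
    """Estimate disk usage in bytes for the full retention period."""
    return total_samples * bytes_per_datapoint


# Bytes per datapoint taken from the disk usage test.
for name, bpd in [
    ("Prometheus 2 without compression", 1.15),
    ("VictoriaMetrics without compression", 0.336),
    ("VictoriaMetrics with ZFS compression", 0.293),
]:
    print(f"{name}: {projected_bytes(bpd) / 1e9:.0f} GB")
```

Under these assumptions, two years of data is about 1.07 trillion datapoints, which works out to roughly 1.23 TB for Prometheus 2 without compression and roughly 360 GB for VictoriaMetrics without compression (decimal units).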

New architecture diagram:

New monitoring architecture with Prometheus 2 and VictoriaMetrics

The cluster consists of 3 servers. Two of them each run Prometheus and a VM (VictoriaMetrics) DB.
Each Prometheus writes to its local VM DB. The AlertManager instances are configured in cluster mode and de-duplicate alerts before sending them to Opsgenie.
The third server runs the Caddy web server, Grafana and promxy.
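Because both storage servers hold their own copy of the data, something has to present them as a single data source. The following Python sketch illustrates the general idea of merging one series from two replicas so that a gap on one side is filled from the other. It is a toy model under our own assumptions (identical timestamps on both replicas, first replica wins on conflicts), not promxy's actual algorithm, and `merge_replicas` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def merge_replicas(primary: List[Sample], secondary: List[Sample]) -> List[Sample]:
    """Merge samples of one series from two replicas, de-duplicating by timestamp.

    Assumption: both replicas use identical timestamps for the same sample.
    When both replicas have a sample for a timestamp, the primary one wins.
    """
    merged: Dict[int, float] = dict(secondary)
    merged.update(dict(primary))
    return sorted(merged.items())


# Hypothetical data: each replica missed one scrape at a different time.
replica_a = [(10, 1.0), (20, 2.0), (40, 4.0)]   # missing t=30
replica_b = [(20, 2.0), (30, 3.0), (40, 4.0)]   # missing t=10

merged = merge_replicas(replica_a, replica_b)
print(merged)
```

In a real setup the two Prometheus instances scrape at slightly different moments, so timestamps rarely line up exactly and a production proxy has to handle that. The sketch only shows why two independent replicas can still yield a gap-free view.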

Built on & with:

  • CentOS 7
  • ZFS
  • Ansible
  • Terraform

Some details from the VM dashboard for the last 30 days in production:

[Dashboard screenshots: VictoriaMetrics totals, query duration, active time series, ingestion rate, datapoints, disk usage, memory usage, CPU usage]

Extra links:

Evaluating Performance and Correctness: VictoriaMetrics response by Aliaksandr Valialkin