In November 2019 we started a project to migrate our existing monitoring infrastructure, based on Prometheus v1, to a new monitoring stack.
The Prometheus v1 instance had served us for a year and was reliable, but it consumed almost 2 TB of storage (uncompressed, ~4 bytes per sample on average), and even with 32 GB of RAM the instance hit several OOM kills when we tried to query more than 3 months of metrics.
Diagram of our old architecture:
The new monitoring cluster had to be:
- highly available
- more performant
- more space-efficient, as we store up to 2 years of time-series data.
Prometheus v2 itself is more efficient than v1, but we decided to check what storage options were available and came up with the following list:
| TSDB | Bytes per datapoint (from open sources) |
| --- | --- |
Fast-forward: we decided to give VictoriaMetrics a try.
Here are the results of a real-world disk-usage test. The test instance ran VictoriaMetrics and Prometheus 2 simultaneously for some time and collected 5.46B datapoints.
| TSDB | Bytes per datapoint |
| --- | --- |
| Prometheus 2 without compression | 1.15 |
| Prometheus 2 with ZFS compression | 0.713 (-38%) |
| VictoriaMetrics without compression | 0.336 (-70%) |
| VictoriaMetrics with ZFS compression | 0.293 (-74%) |
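For reference, the ZFS rows above correspond to enabling filesystem-level compression on the dataset that holds the TSDB data. A minimal sketch of how that is done; the pool and dataset names and the mountpoint are hypothetical, and lz4 is assumed as the algorithm:

```shell
# Create a ZFS dataset for the TSDB with lz4 compression enabled
# (names "tank/vmdata" and the mountpoint are placeholders).
zfs create -o compression=lz4 -o mountpoint=/var/lib/victoria-metrics tank/vmdata

# Later, check the compression ratio actually achieved on real data:
zfs get compressratio tank/vmdata
```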
With our current load (17K samples per second), VictoriaMetrics consistently shows faster query times than Prometheus 2 (10 ms on average vs. 500 ms on the same hardware), has stable and low RAM usage (goodbye, OOMs), and uses just 5% of CPU on a t3.large instance.
New architecture diagram:
The cluster consists of 3 servers; two of them each run a Prometheus instance and a VictoriaMetrics (VM) DB.
Each Prometheus writes to its local VM DB. The AlertManager instances are configured in cluster mode and de-duplicate alerts before sending them to Opsgenie.
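The Prometheus-to-VM write path uses Prometheus' standard `remote_write` protocol, which VictoriaMetrics accepts on its `/api/v1/write` endpoint. A minimal sketch of the relevant `prometheus.yml` fragment, assuming VM runs on the same host on its default port 8428:

```yaml
# prometheus.yml (fragment) -- a sketch, not our exact config.
# Assumes a single-node VictoriaMetrics on localhost:8428.
remote_write:
  - url: http://localhost:8428/api/v1/write
```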
The third server runs the Caddy web server, Grafana, and promxy.
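promxy fans each query out to both VM DBs and merges and de-duplicates the results, which is what makes the pair look like a single highly available datasource from Grafana's point of view. A minimal sketch of a promxy config; the hostnames are hypothetical, and options beyond `static_configs` should be checked against the promxy docs:

```yaml
# promxy.yaml (sketch) -- hostnames vm-1/vm-2 are placeholders.
promxy:
  server_groups:
    # Both VM nodes hold the same data, so they belong to one
    # server group; promxy treats the group as an HA pair and
    # de-duplicates the series it gets back.
    - static_configs:
        - targets:
            - vm-1.internal:8428
            - vm-2.internal:8428
      scheme: http
```

Grafana then points at promxy as an ordinary Prometheus datasource, so a single-node outage is invisible to dashboards.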
Built on & with:
- CentOS 7
Some details from VM dashboard for the latest 30 days in production:
Evaluating Performance and Correctness: VictoriaMetrics response by Aliaksandr Valialkin