-
Notifications
You must be signed in to change notification settings - Fork 189
Description
Background: I've been running consul-alerts for years in a cluster of 3 consul servers.
The last restart I've done was ~4 months ago.
Today external website monitoring alerted me that one of my 3 machines was down (the reason being that the underlying VPN excluded it), including 1 of the 3 consul servers. I was surprised that external monitoring told me that, and consul-alerts did not, as that is its main job.
I checked the configured consul-alerts/config/notifiers/log/path on all 3 machines, and the last outputs as recorded in those log files were from 7 days ago, 6 days ago, and 3 days ago respectively.
System logs on all 3 machines:
Details
[root@node-1:~]# systemctl status consul-alerts | cat
● consul-alerts.service
Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2019-11-03 02:23:58 UTC; 4 months 26 days ago
Main PID: 1703 (consul-alerts)
Tasks: 83 (limit: 4915)
CGroup: /system.slice/consul-alerts.service
├─ 1703 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
├─ 1769 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
├─ 1770 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
└─17052 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[root@node-2:~]# systemctl status consul-alerts | cat
● consul-alerts.service
Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-01-17 17:50:03 UTC; 2 months 12 days ago
Main PID: 21793 (consul-alerts)
Tasks: 76 (limit: 4915)
CGroup: /system.slice/consul-alerts.service
├─21793 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
├─21806 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
└─21807 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
Mar 30 16:14:05 node-2 consul-alerts[21793]: 2020/03/30 16:14:05 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:17:12 node-2 consul-alerts[21793]: 2020/03/30 16:17:12 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:20:20 node-2 consul-alerts[21793]: 2020/03/30 16:20:20 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:23:27 node-2 consul-alerts[21793]: 2020/03/30 16:23:27 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:26:34 node-2 consul-alerts[21793]: 2020/03/30 16:26:34 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:29:41 node-2 consul-alerts[21793]: 2020/03/30 16:29:41 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:32:48 node-2 consul-alerts[21793]: 2020/03/30 16:32:48 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:35:56 node-2 consul-alerts[21793]: 2020/03/30 16:35:56 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:39:03 node-2 consul-alerts[21793]: 2020/03/30 16:39:03 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:42:10 node-2 consul-alerts[21793]: 2020/03/30 16:42:10 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
[root@node-3:~]# systemctl status consul-alerts | cat
● consul-alerts.service
Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2019-11-03 02:24:38 UTC; 4 months 26 days ago
Main PID: 1592 (consul-alerts)
Tasks: 76 (limit: 4915)
CGroup: /system.slice/consul-alerts.service
├─1592 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
├─1624 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
└─1625 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
Mar 19 13:27:24 node-3 consul-alerts[1592]: 2020/03/19 13:27:24 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error making call: EOF), retry in 5s
Mar 19 13:27:36 node-3 consul-alerts[1592]: 2020/03/19 13:27:36 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 20s
Mar 30 15:07:48 node-3 consul-alerts[1592]: 2020/03/30 15:07:48 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error making call: EOF), retry in 5s
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Note the Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable. is not important here -- that is what systemd prints for log rotation, but the newer (perhaps relevant) logs are still present here).
In the output we can see that node-1 logged no errors, node-2 was in a retry loop (makes sense given that it is the one that was kicked out of the VPN, so it cannot talk to the Consul server quorum), and node 3 apparently had a short error today, but retried only once because its retry did not grow beyond 5s.
If I log into the consul dashboard (localhost:8500 on either server), I can see that lots of services are marked red.
Yet, consul-alerts today sent me no notifications at all -- neither to Slack (as I have it configured), nor into the configured log file. It usually does notify my.
So the question is: Why did it fail as a watchdog today?
I would have expected it to notify me of the errors shown in the Consul UI shortly after they ocurred.
A systemctl restart consul-alerts on node-1 immediately resulted in a Slack notification, and correct entries in the log file (but only in the log file of node-3!).
So it seems that something got stuck, and only manual restarting revived it (which is of course not good for an automated high-availability watchdog).
I cannot currently tell whether the fault is in consul-alerts, or in consul itself (consul-alerts starts consul watch, so in case that one hung and didn't output anything, it would be consul's fault).
Maybe anybody knows anything?