How to Install Monit to reboot go-zenon

What is Monit?

Monit is a utility for managing and monitoring processes, programs, files, directories and filesystems on a Unix system.

Node and Pillar operators can use Monit to monitor memory usage and restart go-zenon when the host's memory usage exceeds a certain percentage. This is especially useful if you are running an orchestrator on your Pillar: the orchestrator requires a healthy go-zenon node, and when the node goes offline the orchestrator will fail.

If you currently restart your node or Pillar with a cron job, consider implementing Monit instead to reduce unnecessary restarts. This will make the node more stable and help keep the orchestrator running.
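For comparison, a typical cron-based setup restarts go-zenon on a fixed schedule whether the node needs it or not. The schedule below is purely illustrative:

```shell
# Illustrative root crontab entry (edit with: sudo crontab -e)
# Restarts go-zenon every 6 hours regardless of actual memory pressure
0 */6 * * * /usr/bin/systemctl restart go-zenon
```

Monit's conditional check (below) only restarts when memory actually crosses the threshold, so well-behaved nodes are left alone.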

Step 1 - Install Monit

sudo apt install monit

Step 2 - Create a monit file to monitor host memory usage

sudo nano /etc/monit/conf-available/host

Paste the following content into the host file

check system $HOST
    # if loadavg (5min) > 3 then alert
    # if loadavg (15min) > 1 then alert
    if memory usage > 70% for 1 cycles then exec "/usr/bin/systemctl restart go-zenon"
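Before enabling the file, you can ask Monit to validate its control-file syntax. `monit -t` parses all configuration files and reports any errors:

```shell
# Validate Monit's control files after any configuration change
sudo monit -t
# On success Monit reports that the control file syntax is OK
```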

Step 3 - Activate the host monitoring configuration

sudo ln -s /etc/monit/conf-available/host /etc/monit/conf-enabled/host

Step 4 - Reload Monit to Activate memory monitoring

sudo monit reload

Step 5 - Check to make sure Monit is running properly

sudo systemctl status monit

Did you try pprof to debug the memory leaks?


Not yet. I’m not super familiar with pprof. But I do think solving this issue is critical to orchestrator stability. I’m talking to lots of pillars and I’m pretty sure stability is impacted because everyone is on a cron to reboot go-zenon. Some reboot 4x a day.

I can try to mess around with it this week.


Thanks for the instructions for Monit, 4 times a day lol that was me…


go-zenon is remarkably unreliable. I have a healthcheck running on 2 nodes. It runs a syncInfo check every 30 seconds and reports the results to me in Telegram. About every 30 minutes the service is either unresponsive (takes longer than 5 seconds to respond) or out of sync; it’s unresponsive more often than out of sync.
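A minimal version of such a healthcheck can be sketched as below. The RPC port (35997), the `stats.syncInfo` method name, and the response shape (a `state` field where 2 means fully synced) are assumptions about a typical go-zenon setup, not verified here; the Telegram delivery step is left as a placeholder.

```shell
#!/bin/sh
# Sketch of a go-zenon sync healthcheck.
# Assumptions: JSON-RPC listening on 127.0.0.1:35997, a "stats.syncInfo"
# method, and a "state" field where 2 means fully synced.

# classify_sync: read a syncInfo JSON response on stdin, print a status word
classify_sync() {
  if grep -q '"state"[[:space:]]*:[[:space:]]*2'; then
    echo "in-sync"
  else
    echo "out-of-sync"
  fi
}

# check_node: query the node; --max-time 5 treats responses slower than
# 5 seconds as "unresponsive", matching the check described above
check_node() {
  response=$(curl -s --max-time 5 \
    -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","id":1,"method":"stats.syncInfo","params":[]}' \
    "http://127.0.0.1:35997") || { echo "unresponsive"; return; }
  echo "$response" | classify_sync
}

# A cron job or systemd timer could run check_node every 30 seconds and
# forward the result to a Telegram bot (omitted here).
```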

I’m starting to think this memory issue is critical to solve. The orchestrators rely on the node. When memory usage approaches 70% on a 16 GB node, the results get worse.
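For reference, the kind of used-memory percentage a "memory usage > 70%" check compares against can be approximated from /proc/meminfo as total minus available. A small helper sketch (Monit's own accounting may differ slightly):

```shell
# mem_used_pct: print the used-memory percentage, given MemTotal and
# MemAvailable values in kB (as found in /proc/meminfo)
mem_used_pct() {
  total_kb=$1
  avail_kb=$2
  awk -v t="$total_kb" -v a="$avail_kb" 'BEGIN { printf "%d\n", (t - a) * 100 / t }'
}

# On a live host, feed it real numbers from /proc/meminfo:
#   mem_used_pct "$(awk '/MemTotal/{print $2}' /proc/meminfo)" \
#                "$(awk '/MemAvailable/{print $2}' /proc/meminfo)"
```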

After I finish up trying to stabilize the public node, I’ll try pprof.


Instructions to follow

Cron jobs work well for me; the orchestrator hasn’t gone down since I updated it correctly.
