What is Monit?
Monit is a utility for managing and monitoring processes, programs, files, directories and filesystems on a Unix system.
Node and Pillar operators can use Monit to monitor memory usage and restart go-zenon when the host memory usage exceeds a certain percentage. This is especially useful if you are running an orchestrator on your Pillar. The orchestrator requires a valid go-zenon node, and when the node goes offline the orchestrator will fail.
If you restart your node or Pillar with a cron job, consider implementing Monit to reduce the number of restarts. This will make the node more stable and will help with the operation of the orchestrator.
Step 1 - Install Monit
sudo apt install monit
Step 2 - Create a monit file to monitor host memory usage
sudo nano /etc/monit/conf-available/host
Paste the following content into the host file:
check system $HOST
# if loadavg (5min) > 3 then alert
# if loadavg (15min) > 1 then alert
if memory usage > 70% for 1 cycles then exec "/usr/bin/systemctl restart go-zenon"
Step 3 - Activate the host monitoring configuration
sudo ln -s /etc/monit/conf-available/host /etc/monit/conf-enabled
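You can optionally validate the configuration syntax before reloading; Monit ships a built-in syntax check for its control files:

sudo monit -t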
Step 4 - Reload Monit to activate memory monitoring
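Assuming Monit was installed from the distribution package in Step 1 and the service is already running, the configuration can be reloaded with:

sudo monit reload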
Step 5 - Check to make sure Monit is running properly
sudo systemctl status monit
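Monit can also summarize the checks it has loaded, provided its HTTP interface is enabled in /etc/monit/monitrc (on some distributions it is commented out by default). If it is enabled, the system check from Step 2 should appear in the output of:

sudo monit status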
Did you try pprof to debug the memory leaks?
Not yet. I’m not super familiar with pprof. But I do think solving this issue is critical to orchestrator stability. I’m talking to lots of pillars and I’m pretty sure stability is impacted because everyone is on a cron to reboot go-zenon. Some reboot 4x a day.
I can try to mess around with it this week.
Thanks for the instructions for Monit, 4 times a day lol that was me…
go-zenon is remarkably unreliable. I have a healthcheck running on 2 nodes. It runs a syncInfo check every 30 seconds and gives me the results in Telegram. About every 30 minutes the service is either unresponsive (takes longer than 5 seconds to respond) or out of sync. It’s more often unresponsive than out of sync.
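For anyone who wants to replicate that kind of healthcheck, here is a minimal sketch. It assumes go-zenon’s HTTP JSON-RPC endpoint is enabled on the default port 35997 with the stats API exposed; the Telegram bot token and chat ID are placeholders you would fill in:

#!/bin/sh
# Minimal go-zenon healthcheck sketch: query stats.syncInfo and forward the result to Telegram.
# Assumes the node's HTTP JSON-RPC endpoint is enabled on the default port 35997.
NODE_URL="http://127.0.0.1:35997"
BOT_TOKEN="<telegram-bot-token>"   # placeholder
CHAT_ID="<telegram-chat-id>"       # placeholder

# Treat the node as unresponsive if it takes longer than 5 seconds to answer.
RESPONSE=$(curl -s --max-time 5 -X POST "$NODE_URL" \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"stats.syncInfo","params":[]}')

if [ -z "$RESPONSE" ]; then
  MSG="go-zenon unresponsive (no reply within 5s)"
else
  MSG="go-zenon syncInfo: $RESPONSE"
fi

# Send the result to a Telegram chat.
curl -s "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
  --data-urlencode "chat_id=${CHAT_ID}" \
  --data-urlencode "text=${MSG}" > /dev/null

Run it on whatever timer you like (cron or a systemd timer); an out-of-sync check would additionally compare the current and target heights reported in the syncInfo response.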
I’m starting to think this memory issue is critical to solve. The orchestrators rely on it. When memory usage approaches 70% on a 16 GB node, the results get worse.
After I finish up trying to stabilize the public node, I’ll try pprof.
Cron jobs work well for me; the orchestrator hasn’t been down since I updated it correctly.