@aliencoder @sol not sure where to begin to analyze this. I’ll look for some visual analyzers.
The heap is growing over time. I will post some exports. This one is at 45% memory usage.
heap-45%.txt (4.0 MB)
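For reference, these exports are standard Go pprof heap profiles. A minimal sketch of how such an endpoint could be exposed, assuming the node is built with the stock net/http/pprof debug listener (znnd’s actual profiling hook may differ):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    go func() {
        // Heap snapshot: go tool pprof http://localhost:6060/debug/pprof/heap
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    select {} // stand-in for the node's main loop
}

With that listener running, go tool pprof can pull and inspect heap snapshots like the one attached.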
@sol @aliencoder @sumamu @georgezgeorgez
Guys, I’m starting to think the growing memory use issue is a real problem we need to fix now. The Orchestrator relies on znnd. I’m in comms with all (most) of the orchestrators, and they all seem to have go-zenon rebooting every 24 hours, which causes the Orchestrator to crash (reboot) every 24 hours.
Today @vilkris had a 1006 error in the Orchestrator and it froze rather than rebooting. I’m going to report an issue on the repo.
As we launch the sidechain I’m sure the infra will rely on local znnd. We simply cannot have a network where go-zenon must be restarted every 24 hours to function. I was able to get pprof data (posted above), but it’s clear to me that I cannot troubleshoot this issue. @sol had some ideas that I think we should discuss.
Disappointed at Mr. Kaine waving away the issue when we all reported it and when I insisted (loudly) that it WAS a memory issue / memory leak. It would be interesting to see if ChatGPT could spot such an issue in a large, multi-file codebase.
Where do you guys think it comes from?
Sol has done some diligence / research and is more qualified to respond.
I don’t want to jinx it, but I may have resolved the issue.
Currently soaking the fix on Linux and Windows to see if memory allocation creeps up over time.
Early numbers indicate sub-200 MB of memory required for a full node after it has synced, which is very different from the 5 GB my testnet node was consuming.
While syncing with the fix, it fluctuates between 500 and 1500 MB.
I might have discovered a second “leak”; not sure how to characterize it yet, but I may work on another optimization as well.
I’ll confirm my findings with the Golang devs then write a post with troubleshooting steps in case anyone needs to do some performance profiling in the future.
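In the meantime, here’s a rough sketch of the kind of step that guide would cover. The helper below is hypothetical (not znnd code); it just writes a heap profile to a file so snapshots taken hours apart can be compared:

package profiling // hypothetical helper package, not part of go-zenon

import (
    "os"
    "runtime"
    "runtime/pprof"
)

// WriteHeapProfile dumps the current heap profile to path.
func WriteHeapProfile(path string) error {
    runtime.GC() // run a collection first so the snapshot reflects live objects
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return pprof.WriteHeapProfile(f)
}

Two snapshots captured this way can then be diffed with go tool pprof’s -base flag to see which allocation sites keep growing.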
Like I said… Kaine’s cousin! Can’t wait to see the results. Great work.
@sol can you please share what you have found? I also started looking into the problem; maybe I could help you.
I’m getting some mixed results today so I haven’t committed any code to git.
Like I said, there may be more than one optimization required.
In terms of what I found:
The handleMsg function spawns hundreds of thousands of orphaned goroutines.
Part of my patch is to add a channel to the goroutine that signals when handleMsg() has finished parsing a peer’s message.
This is what my WIP looks like:
c := make(chan int8)
go func() {
    select {
    case <-pm.quitSync:
        // Sync is shutting down; drop the peer.
        p.Disconnect(ErrNoStatusMsg)
    case <-c:
        // handleMsg has finished parsing this message.
        close(c)
    }
}()
...
default:
    c <- 0
    return errResp(ErrInvalidMsgCode, "%v", msg.Code)
}
c <- 0
return nil
I’m aware the close may not be necessary and that the int8 could be a struct{}. I’m seeing conflicting results when I try to simplify the code above.
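For discussion, the simplification I have in mind looks roughly like this; it’s only a sketch against the snippet above, not what’s committed. Closing an empty struct{} channel, instead of sending on it, can’t block even if the watcher has already exited on quitSync:

done := make(chan struct{})
defer close(done) // fires on every return path of handleMsg

go func() {
    select {
    case <-pm.quitSync:
        p.Disconnect(ErrNoStatusMsg)
    case <-done:
        // handleMsg finished; the watcher simply exits.
    }
}()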
Anyways, any insight would be appreciated
I was trying to find this in the original Ethereum release code (1.6, 1.7, 1.8) and this go func was not there. Maybe we could just delete it; I will take a closer look at the code though. These URLs could help you.
I think I have fixed it. I looked at the Ethereum codebase and these methods were not present in the releases, so I removed them. pprof showed me these goroutines would be spawned continuously and never return. I have also tested it by resyncing and it looks good. Try it yourselves. @sol @0x3639
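If you want to double-check that the goroutines really return now, something like the hypothetical helper below (not part of the patch) dumps the goroutine profile with stacks; before a fix, a leak shows up as thousands of identical stacks parked in the same spot:

package profiling // hypothetical helper package, not part of go-zenon

import (
    "io"
    "runtime/pprof"
)

// DumpGoroutines writes every goroutine's stack, grouped by identical
// traces, so a leak shows up as one huge group with the same stack.
func DumpGoroutines(w io.Writer) error {
    return pprof.Lookup("goroutine").WriteTo(w, 1) // debug=1: human-readable output
}

Running it before and after a resync should show whether the helpers spawned by handleMsg still pile up.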
awesome. I will try today. Do I need to sync from genesis?
You should try it. And also do some RPC and maybe multiple connections.
OK - I’m syncing now. I’m trying with an 8 GB server. I’ll let you know when I’m synced up and start testing.
Mine’s at 2.9M momentums and syncing.
If this works it would be awesome to run this on a Raspberry Pi with no reboots.
Thanks for continuing to work on this.
Do you have any KPIs to quantify the improvement?
Anecdotally, it seemed to work at first, but I was still able to encounter situations when znnd would bloat to multiple GBs of memory.
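As for KPIs, one hypothetical way to quantify it (assumed helper, not in the repo) would be to log heap usage and the goroutine count at a fixed interval, so “bloats to multiple GBs” becomes a number we can compare before and after the patch:

package profiling // hypothetical helper package, not part of go-zenon

import (
    "log"
    "runtime"
    "time"
)

// SampleKPIs logs heap usage and the goroutine count every interval.
func SampleKPIs(interval time.Duration) {
    t := time.NewTicker(interval)
    defer t.Stop()
    for range t.C {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        log.Printf("heapAlloc=%d MB sys=%d MB goroutines=%d",
            m.HeapAlloc/1024/1024, m.Sys/1024/1024, runtime.NumGoroutine())
    }
}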
Here is my sync memory utilization. I’m about 75% done. It seems to be pretty stable around 25%, except for the spikes.
Has anyone been able to sync to 100%? I’m stuck around 75% and the node seems to be crashing: it syncs a few momentums, then znnd restarts.