Pprof results from go-zenon

@aliencoder @sol not sure where to begin to analyze this. I’ll look for some visual analyzers.

The heap is growing over time. I will post some exports. This one is at 45% memory usage.

heap-45%.txt (4.0 MB)

Heap at 60% memory usage

heap-60%.txt (4.7 MB)
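
In case anyone wants to reproduce these exports, here is a minimal sketch of dumping a heap profile from inside a Go process for later inspection with go tool pprof. This assumes a debug hook can be added to znnd; the function and file names are just examples, not anything in the repo.

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes the current heap profile to path so it can be
// compared against later snapshots with `go tool pprof`.
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // run a GC so the profile reflects the current live objects
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeapProfile("heap.pprof"); err != nil {
		log.Fatal(err)
	}
	log.Println("heap profile written to heap.pprof")
}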

@sol @aliencoder @sumamu @georgezgeorgez

Guys, I’m starting to think the growing memory use is a real problem we need to fix now. The Orchestrator relies on znnd. I’m in contact with all (most) of the orchestrator operators, and they all seem to have go-zenon rebooting every 24 hours. That causes the Orchestrator to crash (reboot) every 24 hours.

Today @vilkris had a 1006 error in the Orchestrator and it froze rather than rebooting. I’m going to report an issue on the repo.

As we launch the sidechain, I’m sure the infra will rely on a local znnd. We simply cannot have a network where go-zenon must be restarted every 24 hours to function. I was able to get pprof data (posted above), but it’s clear to me that I cannot troubleshoot this issue. @sol had some ideas that I think we should discuss.

2 Likes

Disappointed that Mr. Kaine waved away the issue when we all reported it and when I insisted (loudly) that it WAS a memory issue / memory leak. It would be interesting to see whether ChatGPT could spot such an issue in a large, multi-file codebase.

Where do you guys think it comes from?

Sol has done some diligence / research and is more qualified to respond.

1 Like

I don’t want to jinx it, but I may have resolved the issue.
Currently soaking the fix on Linux and Windows to see if memory allocation creeps up over time.

Early numbers indicate sub-200 MB of memory required for a full node after it has synced, which is very different from the 5 GB my testnet node was consuming.
While syncing with the fix, usage fluctuates between 500 and 1500 MB.

I might have discovered a second “leak”; not sure how to characterize it yet, but I may work on another optimization as well.

I’ll confirm my findings with the Golang devs then write a post with troubleshooting steps in case anyone needs to do some performance profiling in the future.

8 Likes

Like I said… Kaine’s cousin! Can’t wait to see the results. Great work.

1 Like

@sol can you please share what you have found? I also started looking into the problem; maybe I can help you.

4 Likes

I’m getting some mixed results today so I haven’t committed any code to git.
Like I said, there may be more than one optimization required.

In terms of what I found:

The handleMsg function spawns hundreds of thousands of orphaned goroutines.

Part of my patch adds a channel to the goroutine that signals when handleMsg() has finished parsing a peer’s message.

This is what my WIP looks like:

// Completion channel: the goroutine below waits for either a sync shutdown
// or a signal that handleMsg has finished with this message.
c := make(chan int8)
go func() {
	select {
	case <-pm.quitSync:
		p.Disconnect(ErrNoStatusMsg)
	case <-c:
		close(c)
	}
}()

// ... rest of handleMsg elided ...

default:
	c <- 0 // signal completion before the error return
	return errResp(ErrInvalidMsgCode, "%v", msg.Code)
}
c <- 0 // signal completion on the normal return path
return nil

I’m aware the close may not be necessary, and the int8 could be a struct{}. I’m seeing conflicting results when I try to simplify the code above.
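
For reference, a minimal, self-contained sketch of that simplification, using placeholder names for the peer type, the quitSync channel, and the disconnect error rather than the real go-zenon identifiers: a struct{} done channel that is closed via defer on every return path, so the helper goroutine always exits and no value ever has to be sent.

package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for the real types; names are illustrative only.
type peer struct{ id string }

func (p *peer) Disconnect(reason error) { fmt.Println("disconnecting", p.id, "reason:", reason) }

var errNoStatusMsg = errors.New("no status message")

// handleMsg sketches the done-channel pattern: the helper goroutine exits
// either when the sync loop shuts down or when message handling finishes,
// so it is never left orphaned.
func handleMsg(p *peer, quitSync <-chan struct{}) error {
	done := make(chan struct{})
	defer close(done) // runs on every return path, normal or error

	go func() {
		select {
		case <-quitSync:
			p.Disconnect(errNoStatusMsg)
		case <-done:
			// handleMsg returned; nothing left to do.
		}
	}()

	// ... real message decoding and handling would go here ...
	time.Sleep(10 * time.Millisecond) // placeholder for message handling work

	return nil
}

func main() {
	quit := make(chan struct{})
	_ = handleMsg(&peer{id: "peer-1"}, quit)
	time.Sleep(50 * time.Millisecond) // give the helper goroutine time to observe done
	fmt.Println("done")
}

Closing the channel instead of sending to it avoids the case where the helper goroutine has already exited via quitSync and a final send would block forever.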

Anyways, any insight would be appreciated :slight_smile:

3 Likes

I was trying to find this in the original Ethereum release code (1.6, 1.7, 1.8) and this go func was not there. Maybe we could just delete it; I will take a closer look at the code, though. These URLs could help you.

5 Likes

I think I have fixed it. I looked at the Ethereum codebase and these methods were not present in the releases, so I removed them. Pprof showed me they were being spawned continuously and never returning. I have also tested it by resyncing and it looks good. Try it yourselves. @sol @0x3639
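
For anyone who wants to verify the goroutine behaviour themselves, here is a minimal sketch; it assumes you can hook a debug call into znnd, and the file names and timing are only examples. It counts live goroutines and dumps their stacks so two snapshots can be diffed to see whether the handleMsg goroutines accumulate.

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpGoroutines writes the stacks of all live goroutines to path; comparing
// two dumps taken some time apart shows which call sites keep accumulating.
func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.Lookup("goroutine").WriteTo(f, 1) // debug=1 groups identical stacks
}

func main() {
	log.Printf("goroutines now: %d", runtime.NumGoroutine())
	if err := dumpGoroutines("goroutines-before.txt"); err != nil {
		log.Fatal(err)
	}
	time.Sleep(time.Second) // in a real node you would wait much longer between dumps
	log.Printf("goroutines later: %d", runtime.NumGoroutine())
	if err := dumpGoroutines("goroutines-after.txt"); err != nil {
		log.Fatal(err)
	}
}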

4 Likes

Awesome. I will try it today. Do I need to sync from genesis?

You should try it. Also make some RPC calls and maybe open multiple connections.

OK - I’m syncing now. I’m trying with an 8 GB server. I’ll let you know when I’m synced up and start testing.

1 Like

Mine’s at 2.9M momentums and syncing.

2 Likes

If this works, it would be awesome to run it on a Raspberry Pi with no reboots.

2 Likes

Thanks for continuing to work on this.

Do you have any KPIs to quantify the improvement?

Anecdotally, it seemed to work at first, but I was still able to hit situations where znnd would bloat to multiple GB of memory.
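
One lightweight way to get comparable numbers (a sketch only; the interval, fields, and wiring are assumptions, not anything znnd currently does): periodically log the goroutine count and heap-in-use from runtime.MemStats, then compare runs before and after the patch.

package main

import (
	"log"
	"runtime"
	"time"
)

// logRuntimeStats prints a few simple KPIs at a fixed interval so runs with
// and without the patch can be compared directly from the logs.
func logRuntimeStats(interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		log.Printf("goroutines=%d heap_inuse_mb=%d heap_objects=%d",
			runtime.NumGoroutine(), m.HeapInuse/(1<<20), m.HeapObjects)
	}
}

func main() {
	go logRuntimeStats(30 * time.Second)
	select {} // stand-in for the node's main loop
}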

Here is my sync memory utilization. I’m about 75% done. It seems to be pretty stable around 25%, except for the spikes.

Has anyone been able to sync to 100%? I’m stuck around 75% and the node seems to be crashing: it syncs a few momentums, then znnd restarts.

Fully synced, no issues here. @MoonBaze has cracked it!

1 Like