Let's build a Golang collector for monitoring Ethereum Full Nodes

Hey everyone,

Netdata will participate in EthCC with a workshop on how to monitor Ethereum Nodes in under 5 minutes.

In order to do that, we will be building a new collector in Golang. So I figured that it would be fun to build it in public, together!

In this topic I will be documenting my progress, so you can follow the journey. If you have any questions, just leave a comment on the topic!

For starters, we need a template collector to serve as the basis for ours.

We will use VerneMQ, as @ilyam8 kindly pointed out to me.

Moreover, we have a Golang guide to help us along the way:

Ethereum is currently transitioning to Ethereum 2.0, from PoW to PoS.

Given this migration, which will take place in July, we will be working on providing visualisations for Ethereum 2.0 nodes.

The system in Ethereum 2.0 is composed of 3 distinct parts:

  • The Beacon Chain
  • The validator
  • An Ethereum 1.0 node

You can read more here:

For our ETH1 implementation, we will use geth-mev.

It’s a fork of Geth with added endpoints for MEV-related activity.

The following are the metrics that are returned by Geth.

To expose these metrics, we need to run Geth with the following flags: --metrics --metrics.addr 0.0.0.0
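
For example, assuming the default metrics port (6060), we can start the node and quickly verify that the endpoint responds:

geth --metrics --metrics.addr 0.0.0.0
curl http://127.0.0.1:6060/debug/metrics/prometheus | head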

Let’s see them:

Continuing this line of thought, here are the metrics that are output by Prysm’s beacon chain and validator:

Prysm’s docs:

Now that we have our metrics, let’s talk about creating a Prometheus collector for Netdata.

Netdata currently has a Prometheus endpoint collector, which means that it can easily scrape any Prometheus endpoint and create a chart per metric.

That collector is limited, though: it can only perform very elementary computations on the metrics.

Thus, we opted to create our own collector, which is only slightly harder but offers much more flexibility. It’s still fairly easy, because all Prometheus endpoint-related collectors share the same architecture and functionality.

It’s really just a matter of changing the following types of variables:

  1. The metric names that are being scraped.
  2. The chart definitions.

Let’s see an example.

As the basis for our work, we will use the VerneMQ collector:

Let’s look at the files, in order:

vernemq.go

Changes: Change the endpoint, the object name, and the package name. For example:

package geth

import (
	"errors"
	"time"

	"github.com/netdata/go.d.plugin/pkg/prometheus"
	"github.com/netdata/go.d.plugin/pkg/web"

	"github.com/netdata/go.d.plugin/agent/module"
)

func init() {
	creator := module.Creator{
		Create: func() module.Module { return New() },
	}

	module.Register("geth", creator)
}

func New() *Geth {
	config := Config{
		HTTP: web.HTTP{
			Request: web.Request{
				URL: "http://127.0.0.1:6060/debug/metrics/prometheus",
			},
			Client: web.Client{
				Timeout: web.Duration{Duration: time.Second},
			},
		},
	}

	return &Geth{
		Config: config,
		charts: charts.Copy(),
		cache:  make(cache),
	}
}

type (
	Config struct {
		web.HTTP `yaml:",inline"`
	}

	Geth struct {
		module.Base
		Config `yaml:",inline"`

		prom   prometheus.Prometheus
		charts *Charts
		cache  cache
	}

	cache map[string]bool
)
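
The rest of the file holds the module’s interface methods, which carry over from VerneMQ almost unchanged. Here is a rough sketch, modeled on the VerneMQ collector; helper names such as web.NewHTTPClient and prometheus.New are taken from go.d.plugin and may differ between versions, and the internal collect() helper is elided:

func (g *Geth) Init() bool {
	// Build the HTTP client that will scrape the Prometheus endpoint.
	httpClient, err := web.NewHTTPClient(g.Client)
	if err != nil {
		g.Errorf("error on creating http client: %v", err)
		return false
	}
	g.prom = prometheus.New(httpClient, g.Request)
	return true
}

func (g *Geth) Check() bool {
	return len(g.Collect()) > 0
}

func (g *Geth) Charts() *Charts {
	return g.charts
}

func (g *Geth) Collect() map[string]int64 {
	// collect() (elided) scrapes the endpoint, calls collectGeth,
	// and converts the float64 values to int64.
	mx, err := g.collect()
	if err != nil {
		g.Error(err)
	}
	if len(mx) == 0 {
		return nil
	}
	return mx
}

func (g *Geth) Cleanup() {}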

metrics.go

This is the file responsible for the metric definitions. In essence, we simply translate each metric that we care about from the Prometheus format to a Go variable.

Changes: Choose the metrics that you will use and change both the Prometheus metric name and the corresponding Go variable.

const (
	// AUTH
	metricAUTHReceived = "mqtt_auth_received" // v5 has 'reason_code' label
	metricAUTHSent     = "mqtt_auth_sent"     // v5 has 'reason_code' label
	// ...
)
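
For Geth, the equivalent constants might look like the sketch below. The metric names are illustrative; Geth flattens its hierarchical metric names for Prometheus (e.g. chain/head/block becomes chain_head_block), so verify them against your own node’s endpoint output:

const (
	// Chain
	metricChainHeadBlock = "chain_head_block" // height of the local chain head
	// Networking
	metricP2PPeers = "p2p_peers" // number of connected peers
	// Transaction pool
	metricTxPoolPending = "txpool_pending" // executable transactions awaiting inclusion
)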

collect.go

This is the source file responsible for collecting the metrics from the endpoint. A best practice is to divide the functions by chart. We use the metric variables as we defined them in metrics.go above.

func (v *Geth) collectGeth(pms prometheus.Metrics) map[string]float64 {
	mx := make(map[string]float64)

	collectSockets(mx, pms)
	collectQueues(mx, pms)
	collectSubscriptions(mx, pms)
	v.collectErlangVM(mx, pms)
	collectBandwidth(mx, pms)
	collectRetain(mx, pms)
	collectCluster(mx, pms)
	collectUptime(mx, pms)

	v.collectAUTH(mx, pms)
	v.collectCONNECT(mx, pms)
	v.collectDISCONNECT(mx, pms)
	v.collectSUBSCRIBE(mx, pms)
	v.collectUNSUBSCRIBE(mx, pms)
	v.collectPUBLISH(mx, pms)
	v.collectPING(mx, pms)
	v.collectMQTTInvalidMsgSize(mx, pms)
	return mx
}

func (v *Geth) collectCONNECT(mx map[string]float64, pms prometheus.Metrics) {
	pms = pms.FindByNames(
		metricCONNECTReceived,
		metricCONNACKSent,
	)
	v.collectMQTT(mx, pms)
}

Changes: Change the name of every function to correspond to some “logical division” of the metrics.

We may want to make some computations, e.g.:

	collectNonMQTT(mx, pms)
	mx["open_sockets"] = mx[metricSocketOpen] - mx[metricSocketClose]
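
Adapted for Geth, a per-chart function built on the hypothetical constants sketched in metrics.go above might look like this (FindByNames, Name(), and Value come from go.d.plugin’s prometheus package):

func (g *Geth) collectChainHead(mx map[string]float64, pms prometheus.Metrics) {
	pms = pms.FindByNames(
		metricChainHeadBlock,
	)
	for _, pm := range pms {
		// Use the metric name as the dimension ID, summing duplicate series.
		mx[pm.Name()] += pm.Value
	}
}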

Finally,

charts.go

Now that we have defined our metrics and how we are going to collect them, it’s time to organize them into charts.

	chartOpenSockets = Chart{
		ID:    "sockets",
		Title: "Open Sockets",
		Units: "sockets",
		Fam:   "sockets",
		Ctx:   "vernemq.sockets",
		Dims: Dims{
			{ID: "open_sockets", Name: "open"},
		},
	}
	chartSocketEvents = Chart{
		ID:    "socket_events",
		Title: "Socket Open and Close Events",
		Units: "events/s",
		Fam:   "sockets",
		Ctx:   "vernemq.socket_operations",
		Type:  module.Stacked,
		Dims: Dims{
			{ID: metricSocketOpen, Name: "open", Algo: module.Incremental},
			{ID: metricSocketClose, Name: "close", Algo: module.Incremental},
		},
	}

Changes: Change every aspect of these charts. Moreover, make sure to change the charts object so that it uses the charts that you will define.

var charts = Charts{
	chartOpenSockets.Copy(),
	chartSocketEvents.Copy(),
	chartClientKeepaliveExpired.Copy(),
	chartSocketErrors.Copy(),
	chartSocketCloseTimeout.Copy(),
...
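
For Geth, a chart wired to the hypothetical constants from metrics.go might look like this sketch (ID, context, and dimension names are placeholders):

	chartChainHead = Chart{
		ID:    "chain_head_block",
		Title: "Chain Head Block",
		Units: "blocks",
		Fam:   "chain",
		Ctx:   "geth.chain_head_block",
		Dims: Dims{
			{ID: metricChainHeadBlock, Name: "head"},
		},
	}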

Here are some clarifications:

  • Fam = family
  • Ctx = context
  • Dims = dimensions (the ID of every dimension is the name of the metric variable that we defined above)

Yup, that was it. As you can see, the challenge here is not to create a new collector, but rather to understand how the metrics should be surfaced to the user.

We need you

Yup, that’s right.

We need your expertise to organize the metrics that I have posted in the previous post into meaningful charts. Then, after we define the charts, we will be able to define meaningful alerts and have a complete turn-key experience.

Just fire and forget. Exactly as you would expect from a modern monitoring solution.

On top of that, you get to leverage the rest of the features, such as eBPF monitoring and per-second metric granularity. Not to mention the whole range of sweet sweet things we offer with Netdata Cloud.

The goal is to offer a solution that is as transparent as possible to the Ethereum node operator. They will focus on keeping the network secure, while we focus on what we know best: monitoring systems and keeping them up and running.

As we discussed on Discord, I am engaging the Eth Staker and Rocket Pool community on the requested input. I’ll help answer questions where I can, ping you where I can’t. I’ll also help facilitate input collection to try and cut down on noise for you and help ensure it reflects several viewpoints. Look for updates over the next couple days.

That’s awesome, thanks!

BTW, the pastebins I posted above contain the full Prometheus endpoint output for all relevant services (geth/validator/beacon).

God, I love our community :fist:

After doing some additional digging, there is an interesting addition to the metrics I mentioned above.

Since we are building our collector in Golang, we could leverage the fact that Geth is also written in Go and import its packages directly into the collector.

In other words, we can access many more metrics, as the Geth library can communicate over RPC with the full node to extract information that is currently not accessible via the /metrics endpoint.
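
As a taste of what that could look like, here is a minimal sketch using go-ethereum’s ethclient package, assuming the node’s HTTP-RPC is enabled (--http) on the default port 8545:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Connect to the node's JSON-RPC endpoint.
	client, err := ethclient.Dial("http://127.0.0.1:8545")
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the latest block header (nil means "latest").
	header, err := client.HeaderByNumber(context.Background(), nil)
	if err != nil {
		log.Fatal(err)
	}

	// Ask the node for a suggested gas price.
	gasPrice, err := client.SuggestGasPrice(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("latest block: %s, suggested gas price: %s wei\n", header.Number, gasPrice)
}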

Here is an example of an exporter:

Although the code is fairly simple, we will most certainly not make it in time for the conference.

We’ll put a pin in this and revisit it at a later date.

After speaking with some community members, I decided to request additional input via reddit.com/r/ethstaker. This should help ensure broader community input on metrics, visualizations and alerts. https://www.reddit.com/r/ethstaker/comments/o1gtac/netdata_monitoring_customized_for_ethstakers/

Hey everyone,

A quick update on this:

I have opened a draft PR on a collector for go-ethereum. This collector surfaces geth specific metrics from inside the Netdata dashboard.

This gives a go-ethereum user turn-key monitoring for both their system and the geth node itself.

This is only a PoC, since there is much more data that we can actually gather and organize into charts.

The really good thing is that we can easily extend the collector to collect any data point that is exposed by the prometheus endpoint.

A note on data manipulation

As you know, Netdata’s native data structure is not raw metrics per se, but rather charts and dimensions. That means that the majority of data manipulation must be done inside the Netdata collector.

There are three ways to “alter” the raw metric points that you are gathering from the endpoint:

  • In the chart definition, we can define a multiplier or divider. The most obvious use is to invert a dimension so that we can intuitively show the IN/OUT sides of some input-output pair on the same chart (see the sketch after this list).
  • In the chart definition, we can use one of the stock algorithms that Netdata supports:
    • Absolute: show the value as-is (the default).
    • Incremental: show the delta between the last and the current value, per second. Thus, the value shown is delta/data-collection-frequency. This is great for counters that only increase (e.g. total written bytes).
    • The percentage variants of the above: show each dimension as a percentage of the sum of all dimensions.
  • Hard-code the computation in the collector. This is the most flexible approach, since we can do anything we like. For example, we can define a structure, store the last 100 values, and chart their average.
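
To make the first two bullets concrete, here is a sketch of a chart definition that uses both (metricBytesIn and metricBytesOut are hypothetical constants; Mul, Div, and Algo are standard fields of a go.d dimension):

	chartBandwidth = Chart{
		ID:    "bandwidth",
		Title: "Bandwidth",
		Units: "kilobits/s",
		Fam:   "network",
		Ctx:   "geth.bandwidth",
		Type:  module.Area,
		Dims: Dims{
			// Mul: 8, Div: 1000 turns a bytes/s rate into kilobits/s.
			{ID: metricBytesIn, Name: "in", Algo: module.Incremental, Mul: 8, Div: 1000},
			// A negative multiplier inverts the dimension, plotting OUT below the axis.
			{ID: metricBytesOut, Name: "out", Algo: module.Incremental, Mul: -8, Div: 1000},
		},
	}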

A note on data source

After close examination, the Prometheus endpoint is indeed missing a lot of higher-level data that is important and interesting, including the number of transactions, the gas price, and others.

It may very well be that the ideal collector will also gather data using the RPC endpoint of the Ethereum node. For example, the Prometheus exporter follows that approach:

As always, I am looking forward to your feedback and thoughts!

Hi @OdysLam, you may be interested in checking out the latest update from Rocket Pool. It now comes with included support for Grafana and dependencies as well as a pre-built dashboard for each of the ETH2 clients. Setting up the Grafana Dashboard | Rocket Pool

I have it running on one of my testnet servers, so let me know if you’d like to see it in action.

That’s awesome!

We could easily port this to Netdata, now that we know what data we actually need from the prometheus endpoints of the clients.

As I outline in this guide

It’s super easy to create a new integration between Netdata and a data source that exposes data over a Prometheus endpoint. We could replicate the Geth collector and have it up and running in no time.

Perhaps a community user would like to pick this up and help web3 users monitor their systems.