How to migrate a parent to a new server and keep the previously collected data?

Problem/Question

Our current netdata master is having RAM constraints.
We want to move to a bigger server.
How can we do this while keeping all the previously collected data?

Relevant docs you followed/actions you took to solve the issue

Nothing in docs.

Environment/Browser/Agent’s version etc

v1.36.1
db_engine

What I expected to happen

I expected one of the following:

  1. An Ansible playbook that does it for me
  2. Documentation with clear steps, so there is no chance of losing data
  3. Replication that does it seamlessly

I am willing to contribute to the cause.

This is unfortunately not something we have properly documented anywhere at the moment, but in general you can usually migrate an install by shutting down Netdata on the source system, copying /var/lib/netdata, /var/cache/netdata, and /etc/netdata from the source system to the target system, and then explicitly removing the contents of /var/lib/netdata on the source system.

If you just want to preserve metrics and config, you can skip the handling of /var/lib/netdata, though if you use Netdata Cloud that means you will need to claim the new system even if the old one was claimed previously.

You should certainly try replication first, but you will need a more recent version of Netdata, which is much more reliable and performant for replication. Use the latest stable. If you have any issues and the data isn’t completely replicated, it will be a bug that we’ll need to address.

Created a new feature request: [Feat]: Add support for offline migration of a Netdata install (Issue #14764, netdata/netdata on GitHub)

If the new server and the old server have the same CPU architecture, copying the files is definitely the fastest option.

If they are the same, copy the following directories:

  • /etc/netdata
  • /var/cache/netdata

Pay some attention to permissions. The directories and files you copy must maintain their permissions and ownership. This may be a bit tricky, because users and groups may have different UIDs and GIDs on the target system.
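One way to sidestep the numeric ID mismatch is to reset ownership by name after the copy. The following is only a minimal sketch: it assumes the service user and group on the target are both called netdata (check your install), and fix_ownership is a hypothetical helper name, not part of Netdata. It has to run as root on the target system, with Netdata stopped.

```python
import os
import shutil
from typing import Callable, Iterator


def iter_tree(root: str) -> Iterator[str]:
    """Yield root itself, then every directory and file below it."""
    yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def fix_ownership(
    root: str,
    user: str = "netdata",   # assumption: service user name on the target
    group: str = "netdata",  # assumption: service group name on the target
    chown: Callable[..., None] = shutil.chown,
) -> int:
    """Set owner and group by name on everything under root.

    Names are resolved to numeric IDs on the machine where this runs,
    so it does not matter which IDs the source system used.
    Returns the number of paths processed.
    """
    count = 0
    for path in iter_tree(root):
        chown(path, user=user, group=group)
        count += 1
    return count


if __name__ == "__main__":
    # Directories taken from the answer above.
    for directory in ("/etc/netdata", "/var/cache/netdata"):
        print(directory, fix_ownership(directory))
```

This only resets ownership. File modes still need to be preserved by whatever tool does the copy.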

The above will move the data, but the new netdata will have its own machine guid, so it will be a new netdata for cloud, replication and streaming, etc. If you need to completely replace the old netdata with the new one, you will also have to copy /var/lib/netdata. But be careful. The old and the new netdata installation should never run at the same time, or strange things will happen. So, if you plan to delete the old netdata installation, copy /var/lib/netdata too, but make sure the old netdata will never start after you do so.

If the CPU architectures are not the same, then replication is the only way to move forward. There are a few limitations you should know about:

  1. Both the old and the new netdata should be able to run concurrently.
  2. Replication only replicates tier 0, the high resolution data. The other tiers are re-generated from tier 0. But this means you can replicate for as long as tier 0 on the old netdata has data.
  3. Replication replicates only metrics (charts and nodes) that are currently being collected. If you have archived metrics (charts and nodes that are no longer being collected), they will not be replicated. This also means that if the old netdata is a parent for other nodes, these other nodes need to be connected to it (to the old netdata) to trigger replication of their data.
  4. The default replication period is 1 day (86400 seconds). You will need to change that in netdata.conf of the new netdata installation to cover as much as you want. It is ok if it is bigger than your tier 0 retention. So, you can set it to a month (2592000 seconds) or even more.
  5. Replication will need bandwidth between your old and new netdata servers. If they are in separate cloud providers, this may have some indirect cost, due to the amount of traffic these servers will exchange.
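As a quick check of the numbers in point 4, the replication period is just the number of days you want to cover, expressed in seconds. A tiny sketch (replication_period_seconds is a hypothetical helper name):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400, the default replication period


def replication_period_seconds(days: int) -> int:
    """Return the replication period, in seconds, that covers `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return days * SECONDS_PER_DAY


if __name__ == "__main__":
    print(replication_period_seconds(1))   # the default, one day
    print(replication_period_seconds(30))  # roughly a month
```

Since a value larger than the tier 0 retention is harmless, rounding up is the safe choice.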

Replication is pretty fast, but it will take time and resources, as both the old and the new netdata need to unpack and repack every single metric collected.

If you have a really large amount of data, you can also increase the replication threads on the old netdata. This will make it go faster. Each thread is usually able to send about 1-2 million points per second, so the more threads you add, the faster it will go. But on the receiver side, you will benefit from the increased speed only when there are multiple nodes in your data (the receiver always has 1 thread per node). Do not overdo it with the number of threads. Generally you will not gain more speed by adding more than 10 replication threads on the old netdata.
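To get a feel for how long this can take, here is a rough back-of-the-envelope sketch based on the figures above. It is an estimate under stated assumptions, not how Netdata schedules replication: it takes the lower end of the 1-2 million points per second per thread figure, treats the usable parallelism as the smaller of sender threads and node count (because the receiver has 1 thread per node), caps it at 10 threads, and ignores network limits. The function name and the example workload are hypothetical.

```python
def estimate_replication_seconds(
    nodes: int,
    metrics_per_node: int,
    retention_seconds: int,
    sender_threads: int = 1,
    points_per_thread_per_second: int = 1_000_000,  # lower end of 1-2 million
    update_every: int = 1,  # assumption: one point per metric per second
) -> float:
    """Rough lower bound on replication time, in seconds."""
    # Total tier 0 points that have to be sent.
    total_points = nodes * metrics_per_node * (retention_seconds // update_every)
    # Assumption: the receiver uses 1 thread per node, and more than
    # 10 sender threads brings no further gain.
    effective_threads = max(1, min(sender_threads, nodes, 10))
    return total_points / (effective_threads * points_per_thread_per_second)


if __name__ == "__main__":
    # Hypothetical parent: 10 nodes, 2000 metrics each, 1 day of tier 0.
    for threads in (1, 4, 10):
        seconds = estimate_replication_seconds(10, 2000, 86400, threads)
        print(f"{threads} thread(s): about {seconds / 60:.1f} minutes")
```

Real runs will be slower than this bound whenever the link or the disks are the bottleneck.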

If you increase the number of threads, be aware that netdata replication can saturate the link between the old and the new netdata. So, depending on your setup, it may not be a good idea to increase the number of threads too much.

Replication calculates the % of completion on both the sending and the receiving sides. On the sending side, it is located in Netdata Monitoring, workers replication sender (chart netdata.workers_replication_value_completion). On the receiving side it is located in Netdata Monitoring, workers streaming receive (chart netdata.workers_streamrcv_value_replication_completion).