Some words about "design" and "user experience" for monitoring dns queries

thomasmerz · October 13, 2022, 4:25pm

I’m monitoring DNS response times as a basis for deciding which DNS resolver to choose for performance reasons.

1. With some updates came a severe change in the scale of the graphs, which makes me unhappy. Before, all graphs used the same scale (milliseconds, because seconds for DNS queries really suck!), but now I see some graphs in milliseconds and others in seconds.
2. Another thing I dislike: I grouped the DNS servers of the same company into one job, but the charts are now on a per-server basis, which makes things messier than before. I really liked it before, as you can see in an older screenshot on my GitHub site.

- name: ext_nextdns
  update_every: 10
  domains:
    - bla.random.net
  servers:
    - 45.90.28.39
    - 45.90.30.39

3. And is “query status” new, meaning “how often did the DNS server respond” (in percent)? Well, this might indeed be a good indicator not to choose Quad9 (in my case).


andrewm4894 · October 21, 2022, 3:09pm

hmmm, let me see if i can help (and thanks for the feedback!).

Can you share a screenshot of this? I have not seen this before, so I am curious what might have changed for this to happen; a screenshot might help me figure out who to ask about it. Agreed, showing seconds and milliseconds on the same chart would be silly.
I think if you use the charts in Netdata Cloud, you could group by the job name from the config to aggregate per job. In the example below we just have one job, I believe, but if you had a job per company, you could aggregate by _collect_job and would end up with one line per job, aggregated over all of its servers.
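To make the idea of "one line per job" concrete, here is a minimal sketch of that aggregation in plain Python. It is not how Netdata Cloud implements grouping; the function name `aggregate_by_job` and all timings are made up for illustration, and only the job and server names are taken from this thread.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (job name, server, query time in seconds).
# Job and server names follow the thread; the timings are invented.
samples = [
    ("ext_nextdns", "45.90.28.39", 0.012),
    ("ext_nextdns", "45.90.30.39", 0.018),
    ("ext_cloudflare", "1.1.1.1", 0.020),
    ("ext_cloudflare", "1.0.0.1", 0.030),
]

def aggregate_by_job(rows):
    """Average the query time of all servers that belong to the same job."""
    per_job = defaultdict(list)
    for job, _server, seconds in rows:
        per_job[job].append(seconds)
    return {job: mean(values) for job, values in per_job.items()}

by_job = aggregate_by_job(samples)
for job, avg in sorted(by_job.items()):
    print(f"{job}: {avg * 1000:.1f} ms")
```

Averaging is only one possible choice here; a max or a sum over the servers of a job would be grouped the same way.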

here is a link to the public netdata-demo space that has some DNS stuff set up. I think in your case you might end up with a dimension called “ext_nextdns” for example.

Unsure.

ilyam8 · October 24, 2022, 6:55pm

The new grouping (a chart per server) was done taking into account Netdata Cloud aggregated charts. We will switch to chart aggregation only in the future.
The status chart reflects the query’s current status. The dimensions are a set of individual statuses (success, network_error, and dns_error). Each value is boolean (0 or 1), and only one of them is 1 at a given time (the current state). So the current value is not a percentage, but the aggregation over time can be read as one (e.g. a 30-minute average of success 0.8, network_error 0.1, dns_error 0.1 means that over the last 30 minutes 80% of queries succeeded, 10% hit a network error, and 10% hit a DNS error).
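The arithmetic behind reading the averaged status as a percentage can be sketched as follows. This is only an illustration of the calculation described above, not the collector's code; the helper names `sample` and `status_percentages` and the 10-sample window are assumptions.

```python
# Each sample sets exactly one status to 1, as described above.
STATUSES = ("success", "network_error", "dns_error")

def sample(current):
    """Build one sample in which only the current status is 1."""
    return {status: int(status == current) for status in STATUSES}

def status_percentages(samples):
    """Average each 0/1 status dimension over the window, in percent."""
    if not samples:
        return {status: 0.0 for status in STATUSES}
    return {
        status: 100.0 * sum(s[status] for s in samples) / len(samples)
        for status in STATUSES
    }

# Hypothetical window of 10 samples: 8 successes, 1 network error, 1 DNS error.
window = [sample("success")] * 8 + [sample("network_error"), sample("dns_error")]
shares = status_percentages(window)
print(shares)
```

Because exactly one dimension is 1 in every sample, the three shares always add up to 100%.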

some graphs with milliseconds, others with seconds

All of them are in seconds actually (decimal). It is the “Scale Units” feature (dashboard) that changes them based on precision.
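As a rough illustration of why two charts fed with seconds can end up with different Y-axis units, here is a sketch of per-chart unit scaling. The thresholds and the function `scale_units` are assumptions made for this example; they are not taken from the dashboard's actual “Scale Units” logic.

```python
def scale_units(values_in_seconds):
    """Pick one display unit for a whole chart based on its largest value."""
    peak = max((abs(v) for v in values_in_seconds), default=0.0)
    if peak >= 1.0:
        unit, factor = "s", 1.0
    elif peak >= 1e-3:
        unit, factor = "ms", 1e3
    else:
        unit, factor = "us", 1e6
    return unit, [v * factor for v in values_in_seconds]

# Hypothetical charts: a fast resolver and a slow one, both stored in seconds.
fast_unit, fast_values = scale_units([0.012, 0.018])
slow_unit, slow_values = scale_units([1.4, 0.9])
print(fast_unit, fast_values)
print(slow_unit, slow_values)
```

Under these assumptions the fast resolver's chart is labelled in milliseconds and the slow one in seconds, even though the stored data uses the same unit.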

you could agg by _collect_job

The charts can also be aggregated by server in Netdata Cloud. This gives a chart with a dimension per server (as on the old chart).

ilyam8 · October 24, 2022, 7:03pm

I agree with all points in OP. The problem is the current local dashboard. It will be replaced with something that works and is much better.

thomasmerz · October 25, 2022, 10:18am

@andrewm4894 I’m using and talking about the local Netdata “dashboard” or “system overview”, where no “group_by” exists in my Netdata version: v1.36.0-264-nightly (“You already have the latest netdata!”)
Seconds and milliseconds are NOT mixed in the SAME chart for “DNS Query Time”. But one chart uses seconds and another uses milliseconds on the Y-axis, as you can see in the screenshots I’ve already provided.

thomasmerz · October 25, 2022, 10:32am

Regarding “query status” (3.): I was on the right track. Thanks for confirming.

Regarding “grouping” (2.)
This is part of my /etc/netdata/go.d/dns_query.conf:

jobs:
- name: int_dns
  update_every: 30
  domains:
    - bla.random.net
  servers:
    - 192.168.0.13
    - 172.17.0.1
- name: ext_cloudflare
  update_every: 10
  domains:
    - bla.random.net
  servers:
    - 1.0.0.1
    - 1.1.1.1
- name: ext_google
  update_every: 10
  domains:
    - bla.random.net
  servers:
    - 8.8.4.4
    - 8.8.8.8
- name: ext_nextdns
  update_every: 10
  domains:
    - bla.random.net
  servers:
    - 45.90.28.39
    - 45.90.30.39

So what should it look like in order to group all “servers” under each “name”?

Regarding “scale” (1.) @andrewm4894 I indeed forgot to add screenshots for seconds:

(Quad9 really, really sucks over here!)

And for milliseconds:

(NextDNS is really “cool” and the fastest DNS I’ve ever seen over here!)

thomasmerz · November 20, 2022, 10:39pm

@ilyam8, do you really find this aggregation useful once there are “some more” DNS servers to check? I don’t see any useful detail anymore.

The local presentation is much better than the presentation in the cloud:

But even this is still less nice than “before” (all “servers” in one single chart for “name”) with this config in /etc/netdata/go.d/dns_query.conf:

- name: freifunk_mue
  update_every: 10
  domains:
    - nextwurz.mooo.com
  servers:
    - 5.1.66.255
    - 185.150.99.255

Here on my GitHub page you can see “the old and beautiful” overview.
