dnsquery-monitoring

I’ve been using the dnsquery-monitoring module and just wanted to share some practical experience that might be worth including in the documentation.

If you choose domain names with low TTL values then you will likely see a lot of cache-misses, which push up the latency scores in the charts. This is simply because, unless those names are being queried very frequently, once the TTL expires the DNS server/resolver has to do a lookup upstream.

In my case I have netdata querying an on-site DNS server, and I was using the default list of domains:

   domains:
     - google.com
     - github.com
     - reddit.com
     - netdata.cloud

But those domains mostly have very low TTLs:

google.com = 5mins
github.com = 60secs
reddit.com = 5mins
netdata.cloud = 60secs
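
If you want to check the TTLs of your own candidate domains, something like the following could work. It is only a minimal sketch using the Python standard library, not anything netdata provides: the function names (`build_query`, `answer_ttls`, `remaining_ttls`) are made up for illustration, and it only handles plain UDP A-record queries. Note that a caching resolver reports the remaining TTL rather than the full one, so the number you get depends on when you ask.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet (RFC 1035) asking for A records."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # type A, class IN


def skip_name(packet, offset):
    """Return the offset just past a (possibly compressed) domain name."""
    while True:
        length = packet[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # compression pointer is two bytes long
            return offset + 2
        offset += 1 + length


def answer_ttls(packet):
    """Return the TTL (in seconds) of every record in the answer section."""
    _, _, qdcount, ancount, _, _ = struct.unpack("!HHHHHH", packet[:12])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(packet, offset) + 4  # skip qtype and qclass
    ttls = []
    for _ in range(ancount):
        offset = skip_name(packet, offset)
        _rtype, _rclass, ttl, rdlength = struct.unpack(
            "!HHIH", packet[offset:offset + 10]
        )
        ttls.append(ttl)
        offset += 10 + rdlength
    return ttls


def remaining_ttls(name, server, timeout=2.0):
    """Ask `server` for `name` over UDP port 53 and return the answer TTLs."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        packet, _ = sock.recvfrom(4096)
    return answer_ttls(packet)


# Example usage (the resolver address is a placeholder, use your own):
#   print(remaining_ttls("github.com", "192.168.1.1"))
```

Running it twice a few seconds apart against the same resolver should show the TTL counting down if the answer is coming from cache.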

If you have a very busy DNS server/resolver then the chances of having a cache-miss at the moment netdata makes its queries is relatively low; conversely, if you don’t have a lot of DNS requests for those domains then the chances of netdata hitting a cache-miss scenario are statistically higher.

So the advice would be to be selective about which domain names you use. Either:

  1. Use only domain names that you know get a lot of DNS requests on your network; that way a cache-hit is likely to represent the actual user experience; or
  2. Use domain names with long TTL values so that your charts better reflect the actual resolver performance rather than whether the query happened to hit a cache-miss; or
  3. Leave it as-is and accept that the fluctuations you will see in the charts could be down to cache-misses or any other DNS problem (you won’t know which).

Please feel free to let me know if I’ve gone wrong anywhere :slight_smile:

On a related topic, the blog here makes mention of other parameters such as the base protocol. What is the syntax for that? And is there an option to choose whether a query is dnssec or not?

Personally I do NOT want to measure under best conditions, where everything is perfect: I want to, and I do, measure a real user’s experience. So for example I use an old-school way of monitoring some DNS servers/resolvers that I’m “interested” in from my home via WiFi:
https://thomasmerz.github.io/dnspingtest_rrd_ka/dashboard_month.html

But I’m also measuring via my cloudservers that have a direct interconnect at DE-CIX with a 10 Gigabit-Interface and a multiple-hundred Gigabit-Connection at DE-CIX. Just change “dnsping_rrd_ka” to “dnsping_rrd_nbg” or “dnsping_rrd_hel” in URL/address for my cloudservers.

Conclusion:

For me and my family it’s worthless to measure how fast Google DNS (for example) might be able to answer a query that comes from a direct interconnect like DE-CIX. What matters for real users is how fast a DNS server/resolver responds in real life.

@thomasmerz
Good feedback, but I think you may have misinterpreted my post: I was not talking about testing against specific nameservers but about how the choice of domains you query against those servers can affect your metrics.

Edit:
Just to add that your point (which I agree with) is exactly why I came across this stuff. I was measuring DNS performance from a node on my local network and the targets were a local DNS server and several DNS servers at different distances (so my ISP’s, then some major ones). But what I found strange was how the DNS latency for queries to my local DNS server seemed about the same as for ones at, say, 1.1.1.1. Even more strange because I knew that I was getting plenty of cached replies from the local DNS server and the performance from the client perspective was excellent.

It took a little while but I soon realised that the reason the latency was about the same was that the default domain-list contains two domains with really short TTLs, and so I was frequently hitting a cache-miss on my local DNS server, which resulted in a forwarded query upstream to, you guessed it, 1.1.1.1. These domains were not necessarily ones that get much traffic on my network. Simply swapping out the domain names for other ones with higher TTLs changed the metrics (without changing the DNS servers themselves).

The latency for the queries against the local DNS server decreased and, crucially, the latency spikes I had been getting before went away.
So the new scenario better represents user experience, because the domains that clients need to access are often cached.
Now, it is also the case that if a client does a lookup on one of those low-TTL domains then you’re likely to see a DNS query latency spike. Whether you want to monitor that is up to you; there is definite value in having a worst-case baseline to keep metrics on. My post was meant to provide information so people can make that distinction.
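
As a back-of-the-envelope illustration of why the local numbers drifted towards the upstream ones, the average probe latency can be sketched like this. It assumes a simple model where every query pays the local round trip and a cache-miss additionally pays one upstream lookup; the function name and the millisecond figures are hypothetical, not measurements from my setup.

```python
def mean_probe_latency_ms(local_ms, upstream_ms, miss_fraction):
    """Average latency seen by the probe under a simple forwarding model.

    Every query pays the local round trip; a cache-miss also pays the
    upstream lookup on top of that.
    """
    if not 0.0 <= miss_fraction <= 1.0:
        raise ValueError("miss_fraction must be between 0 and 1")
    return local_ms + miss_fraction * upstream_ms


# Hypothetical figures: 1 ms to the local server, 20 ms to the upstream one.
for miss_fraction in (0.0, 0.25, 1.0):
    print(miss_fraction, mean_probe_latency_ms(1.0, 20.0, miss_fraction))
```

The more often the probe misses, the more the "local" chart ends up showing upstream latency rather than what the local server does for cached names.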

A little bit of history: when implementing the dns_query collector, I only had the availability check in mind. Back then I worked at an ISP where we had a lot of DNS servers, and from time to time they would hang and stop responding. I needed something to check whether they were responding to queries.

So from your “availability check” arose a really great tool for also measuring performance / response times :heart:

In my setup I can also observe that my local Pi-hole responds on average a few ms slower than the fastest upstream DNS (26 vs 24 ms on average, while most upstream DNS servers respond slower than 40-50 ms) because my Pi-hole often also has to query an upstream DNS to respond. This is because I am NOT querying a domain that is 100% cached like “google.com” but some uncached domains (which reflects a more real-life user experience).