dnsquery-monitoring

Luis_Johnstone · April 28, 2023, 9:59pm

I’ve been using the dnsquery-monitoring module and just wanted to share some practical experience that might be worth including in the documentation.

If you choose domain names with low TTL values, you will likely see a lot of cache misses, which then bump up the latency scores in the charts. This is simply because, unless those names are being queried very frequently, once the TTL expires the DNS server/resolver will need to do a lookup upstream.

In my case I am having netdata query against an on-site DNS server; and I was also using the vanilla list of domains:

   domains:
     - google.com
     - github.com
     - reddit.com
     - netdata.cloud

But those domains are largely very low TTL records:

google.com = 5mins
github.com = 60secs
reddit.com = 5mins
netdata.cloud = 60secs

If you have a very busy DNS server/resolver then the chances of having a cache-miss at the moment netdata makes its queries is relatively low; conversely, if you don’t have a lot of DNS requests for those domains then the chances of netdata hitting a cache-miss scenario are statistically higher.
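As a rough illustration of that trade-off, here is a minimal sketch of a textbook TTL-cache model. It assumes that other clients query the name at random (Poisson) moments, that the resolver keeps the answer for exactly the TTL with no prefetching, and it ignores the effect of Netdata's own queries on the cache. The function name and the query rates are made up for the example; only the TTLs come from the list above.

```python
def miss_probability(ttl_seconds: float, queries_per_second: float) -> float:
    """Estimate the chance that a probe finds the record expired in the cache.

    Assumptions (illustrative only): background queries for the name arrive
    as a Poisson process, the resolver caches the answer for exactly
    ttl_seconds, and it does not prefetch or serve stale records.
    """
    if ttl_seconds < 0 or queries_per_second < 0:
        raise ValueError("ttl_seconds and queries_per_second must be non-negative")
    # The cache alternates between "filled" for ttl_seconds and "empty" for an
    # average of 1/rate seconds, so the empty share is 1 / (1 + rate * ttl).
    return 1.0 / (1.0 + queries_per_second * ttl_seconds)


if __name__ == "__main__":
    # TTLs taken from the list above; the query rates are made-up examples.
    ttls = {"google.com": 300, "github.com": 60, "reddit.com": 300, "netdata.cloud": 60}
    scenarios = (
        ("quiet network, 1 query per 10 min", 1 / 600),
        ("busy network, 1 query per second", 1.0),
    )
    for label, rate in scenarios:
        print(label)
        for domain, ttl in ttls.items():
            print(f"  {domain:<15} miss probability ~ {miss_probability(ttl, rate):.2f}")
```

Under these assumptions, a 60-second record that is only requested about once every ten minutes is almost always expired when the probe arrives, while the same record on a busy resolver is almost always cached.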

So the advice would be to be selective about which domain names you use. Either:

- Use only domain names that you know you get a lot of DNS requests for; that way you know that a cache-hit is likely to represent the actual user-experience; or
- Use domain names with long TTL values so that your charts better reflect the actual resolver performance rather than whether we happen to hit a cache-miss at the time of query; or
- Leave it as-is and accept that the fluctuations you will see in the charts could be down to cache-misses or any other DNS problem (you won’t know).

Please feel free to let me know if I’ve gone wrong anywhere.

On a related topic, the blog here makes mention of other parameters such as the base protocol. What is the syntax for that? And is there an option to choose whether a query is dnssec or not?

thomasmerz · May 5, 2023, 4:36pm

Personally I do NOT want to measure under best conditions, where everything is perfect: I want to, and I do, measure a real user’s experience. So for example I use an old-school way of monitoring some DNS servers/resolvers that I’m “interested” in from my home via WiFi:
https://thomasmerz.github.io/dnspingtest_rrd_ka/dashboard_month.html

But I’m also measuring via my cloudservers that have a direct interconnect at DE-CIX with a 10 Gigabit-Interface and a multiple-hundred Gigabit-Connection at DE-CIX. Just change “dnsping_rrd_ka” to “dnsping_rrd_nbg” or “dnsping_rrd_hel” in URL/address for my cloudservers.

Conclusion:

For me and my family it’s worthless to measure how fast Google DNS (for example) might be able to answer a query that comes from a direct interconnect like DE-CIX. What matters for real users is how fast a DNS server/resolver responds in real life.

Luis_Johnstone · May 5, 2023, 8:12pm

@thomasmerz
Good feedback, but I think you may have misinterpreted my post: I was not talking about testing against specific nameservers, but about how the choice of domains you query against those servers can affect your metrics.

Edit:
Just to add that your point (which I agree with) is exactly why I came across this stuff. I was measuring DNS performance from a node on my local network and the targets were a local DNS server and several DNS servers at different distances (so ISP, then some major ones). But what I found strange was how the DNS latency for queries to my local DNS server seemed about the same as for ones, say, at 1.1.1.1. Even more strange because I knew that I was getting plenty of cached replies from the local DNS server and the performance from the client perspective was excellent.

It took a little while but I soon realised that the latency was about the same because the default domain list contains two domains with really short TTLs, so I was frequently hitting a cache miss on my local DNS server, which resulted in a forwarded query upstream to, you guessed it, 1.1.1.1. These domains were not necessarily ones that get much traffic on my network. Simply swapping out the domain names for other ones with higher TTLs changed the metrics (without changing the DNS servers themselves).
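A back-of-the-envelope sketch of why that happens: the average query time against a forwarding local resolver is roughly the cached response time plus the upstream round trip weighted by how often the probe misses. The function name and all numbers below are hypothetical, picked only to show the shape of the effect, not measured values.

```python
def expected_latency_ms(miss_rate: float, cached_ms: float, upstream_ms: float) -> float:
    """Average query time seen by a probe against a forwarding local resolver.

    A cache hit costs cached_ms; a miss costs cached_ms plus the upstream
    round trip (upstream_ms). Both figures are assumptions supplied by the caller.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return cached_ms + miss_rate * upstream_ms


if __name__ == "__main__":
    cached_ms = 1.0     # hypothetical: answer served from the local cache
    upstream_ms = 25.0  # hypothetical: extra time for the forwarded query
    for label, miss_rate in (("short-TTL domains", 0.9), ("long-TTL domains", 0.05)):
        avg = expected_latency_ms(miss_rate, cached_ms, upstream_ms)
        print(f"{label}: about {avg:.2f} ms on average")
```

With a high miss rate the local server's average ends up close to the upstream figure, which matches the "about the same as 1.1.1.1" observation; with a low miss rate it drops towards the cached response time.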

The latency for the queries against the local DNS server decreased and, crucially, the latency spikes I had been getting before went away.
So the new scenario best represents user-experience because domains that need to be accessed are often cached.
Now, it is also the case that if a client does a lookup on those domains with low TTLs then you’re likely to see a DNS query latency spike. Whether you want to monitor that is up to you; there is definite value in having a worst-case baseline to keep metrics on. My post was designed to provide information so people could make that distinction.

ilyam8 · May 5, 2023, 9:37pm

A little bit of history: when implementing the dns_query collector, I only had the availability check in mind. Back then I worked at an ISP where we had a lot of DNS servers, and from time to time they would hang and stop responding. I needed something to check whether they responded to queries.

thomasmerz · May 8, 2023, 7:33am

So from your “availability check” arose a really great tool for also measuring performance / response times

thomasmerz · May 8, 2023, 7:43am

In my setup I can also observe that my local Pi-hole responds on average a few ms slower than pretty much the fastest upstream DNS (26 vs 24 ms on average, but most upstream DNS servers respond slower than 40-50 ms) because my Pi-hole often also has to query an upstream DNS before it can respond. This is because I am NOT querying a domain that is 100% cached like “google.com” but some uncached domains (which reflects a more real-life user experience).
