I just switched to Netdata, and I’m absolutely loving it. I’m having an odd issue though, and I’m not quite sure how to go about fixing it.
I have ten nodes attached to Netdata Cloud. Everything works pretty well, but from time to time I get web_log_1m_unmatched alerts on the five nodes that make up our CDN. This happens seemingly at random, and has thus far happened on all five of those nodes, although rarely at the same time. The description on the alert is “percentage of unparsed log lines over the last minute.” Usually, I’d assume this is because my logs are in a custom format, or contain lines that can’t be parsed. But the five servers it’s happening on are all functionally identical to one another (and use exactly the same logging format), so I can’t see why it would happen on one while the rest are working correctly.
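As I understand it, that alert simply reports the share of log lines collected in the last 60 seconds that couldn’t be matched against the configured log format. A minimal sketch of the arithmetic (my own illustration, not Netdata’s internals):

```go
package main

import "fmt"

// unmatchedPercent illustrates what the web_log_1m_unmatched alert appears to
// report: the percentage of log lines collected in the last minute that the
// parser could not match against the expected log format.
func unmatchedPercent(matched, unmatched int) float64 {
	total := matched + unmatched
	if total == 0 {
		return 0 // no traffic in the window, nothing to report
	}
	return 100 * float64(unmatched) / float64(total)
}

func main() {
	// e.g. 12 unparsable lines out of 4800 collected in the last 60 seconds
	fmt.Printf("%.2f%% unmatched\n", unmatchedPercent(4788, 12))
}
```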
I’m not sure if this is related, but I’ve also noticed that, again seemingly at random, some of those same five nodes don’t show up when I navigate to System Overview → web log → responses. For example: all five of those nodes are running the same version of NGINX (v1.17.3), all five have the same logging setup, and all are connected correctly to Netdata. Despite that, I get all five of them in:
web_log_nginx.responses_by_status_code_class
web_log_nginx.status_code_class_2xx_responses
web_log_nginx.status_code_class_3xx_responses
web_log_nginx.status_code_class_4xx_responses
But I only get three of them in web_log_nginx.status_code_class_5xx_responses.
Moreover, if I navigate to the System Overview pages of the two nodes that are missing from web_log_nginx.status_code_class_5xx_responses, there are no graphs at all for that class.
Final piece of information that may or may not be relevant: When I restart the Netdata agent with systemctl restart netdata, it hangs for about a minute before restarting. I don’t know if that’s expected behavior or not.
I think I’ve worked out the second problem. It’s because web_log_nginx.status_code_class_5xx_responses doesn’t populate until there’s an actual 5xx response (in my case a 503) in the log. Which makes sense.
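My assumption (I haven’t checked the collector’s source) is that each status-class chart is only created the first time a response in that class is actually observed, roughly like this:

```go
package main

import "fmt"

// Illustration only: a collector that creates a per-class counter the first
// time it sees a response in that class. Until a 5xx actually shows up in the
// log there is no "5xx" entry at all, so no 5xx chart is sent to the dashboard.
type responses struct {
	byClass map[string]int // "2xx", "3xx", "4xx", "5xx" -> count
}

func (r *responses) observe(status int) {
	if r.byClass == nil {
		r.byClass = make(map[string]int)
	}
	class := fmt.Sprintf("%dxx", status/100)
	r.byClass[class]++ // the class only exists once it has been seen
}

func main() {
	var r responses
	for _, code := range []int{200, 200, 301, 404, 200} {
		r.observe(code)
	}
	fmt.Println(r.byClass) // map[2xx:3 3xx:1 4xx:1]: no 5xx key, so no 5xx chart
}
```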
Now my issue is that the combined charts only show some of the servers; specifically, these show only three of the five:
web_log.response_statuses
web_log.response_codes
web_log.detailed_response_codes
None of those charts show up when I look at either of the two individual servers that aren’t in the combined charts. Any ideas?
I worked out what was happening here. I had a mixture of nodes running the web_log collector via Python and others running it via Go.
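As far as I can tell, the Python module produces the combined web_log.* charts while the Go module produces the web_log_nginx.* ones, so the combined charts only group the nodes running the same flavour. To check which flavour each node was running, I listed its web log chart IDs through the agent’s local API (port 19999 by default); this is just a throwaway helper of mine, not a Netdata tool:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// Lists the web log chart IDs exposed by a local Netdata agent. Chart IDs
// starting with "web_log." come from the python.d module, while IDs starting
// with "web_log_nginx." come from the go.d module (as observed on my nodes).
func main() {
	resp, err := http.Get("http://localhost:19999/api/v1/charts")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var payload struct {
		Charts map[string]json.RawMessage `json:"charts"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		panic(err)
	}

	for id := range payload.Charts {
		if strings.HasPrefix(id, "web_log") {
			fmt.Println(id)
		}
	}
}
```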
But I’m still having the issue with the unmatched logs. All servers use the same log format, but from time to time one of them just starts throwing huge numbers of these unmatched-line errors.
Hmm, that slow restart is something I’ve occasionally come across myself (it only started happening fairly recently) but could never figure out. I will look into the agent telemetry to see if we can narrow down or identify such slow restarts in some way.
Oh, I was going to suggest this. Funny edge cases like this can sometimes happen when some nodes are running the Python version of the collector and some the Go version.
Do all the “unmatched line: csv parse” log messages contain assign: 'Go-http-client/1.1'?
Let’s try to find a line that contains “Go-http-client/1.1” in the NGINX logs. From the error, it looks like the collector is trying to assign that value to the processing time field ($request_time).
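To make that failure mode concrete: requests made with Go’s default HTTP client carry the user agent “Go-http-client/1.1” (the agent’s own checks against NGINX are one likely source), and if those lines end up shorter than the configured log format expects, the user-agent token can land in the position where the parser expects $request_time. A toy sketch of that misalignment (not the collector’s actual parser; the field positions are invented for the example):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRequestTime splits a log line on spaces and expects the value at a
// fixed position to be $request_time. If a line was written with a shorter
// format, the positions shift, a non-numeric token lands where the number
// should be, and the line is counted as unmatched.
func parseRequestTime(line string, fieldIdx int) (float64, error) {
	fields := strings.Split(line, " ")
	if fieldIdx >= len(fields) {
		return 0, fmt.Errorf("line has only %d fields", len(fields))
	}
	v, err := strconv.ParseFloat(fields[fieldIdx], 64)
	if err != nil {
		return 0, fmt.Errorf("assign %q to $request_time: %w", fields[fieldIdx], err)
	}
	return v, nil
}

func main() {
	// Expected format: $request_time is the 4th field (index 3).
	good := `GET /index.html 200 0.012 "Mozilla/5.0"`
	// A shorter line: the user agent ends up at index 3 instead.
	bad := `GET /stub_status 200 "Go-http-client/1.1"`

	for _, line := range []string{good, bad} {
		if t, err := parseRequestTime(line, 3); err != nil {
			fmt.Println("unmatched:", err)
		} else {
			fmt.Println("request time:", t)
		}
	}
}
```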