Regular 'unmatched' errors, and some nodes excluded from web_log_nginx.status graphs

Hi there,

I just switched to Netdata, and I’m absolutely loving it. I’m having an odd issue though, and I’m not quite sure how to go about fixing it.

I have ten nodes attached to Netdata Cloud. Everything works pretty well, but from time to time I’m getting web_log_1m_unmatched alerts on the five nodes that make up our CDN. This happens seemingly at random, and has thus far happened on all five of those nodes, although rarely at the same time. The description on the alert is “percentage of unparsed log lines over the last minute”. Usually, I’d assume that this is because my logs are in a custom format, or are throwing errors that can’t be read. But the five servers it’s happening on are all functionally identical to one another (and have exactly the same logging format), so I can’t see why it would happen on one while the rest are working correctly.
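
For reference, this is roughly how I’ve been looking at the alert definition and its current state on an affected node - the stock health config path is from my install, so it may well differ on yours:

  # Stock health config location varies by install method; this path is from my install
  grep -B2 -A10 'web_log_1m_unmatched' /usr/lib/netdata/conf.d/health.d/web_log.conf

  # Dump current alarm states (JSON) from the local agent, then look for the
  # web_log_1m_unmatched entries in the output
  curl -s 'http://localhost:19999/api/v1/alarms?all'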

I’m not sure if this is related, but I’ve also noticed that, again seemingly at random, some of those same five nodes don’t show up when I navigate to System Overview → web log → responses. For example: all five of those nodes are running the same version of NGINX (v1.17.3), and all five have the same logging system, but, despite all being connected correctly to Netdata, I get all five of them in

  • web_log_nginx.responses_by_status_code_class
  • web_log_nginx.status_code_class_2xx_responses
  • web_log_nginx.status_code_class_3xx_responses
  • web_log_nginx.status_code_class_4xx_responses

But I only get three of them in web_log_nginx.status_code_class_5xx_responses.

Moreover, if I navigate to the System Overview pages of the two nodes that are missing from web_log_nginx.status_code_class_5xx_responses, there are no graphs at all for that class.

Final piece of information that may or may not be relevant: When I restart the Netdata agent with systemctl restart netdata, it hangs for about a minute before restarting. I don’t know if that’s expected behavior or not.
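
In case it’s useful, this is how I’ve been looking at the hang - nothing Netdata-specific, just timing the restart and then reading the agent’s shutdown/startup messages from the journal:

  # Time the restart, then check what the agent was doing while it hung
  time sudo systemctl restart netdata
  sudo journalctl -u netdata --no-pager -n 100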

Thanks!

I think I’ve worked out the second problem. It’s because the web_log_nginx.status_code_class_5xx_responses chart doesn’t populate until there’s an actual 5xx (in my case a 503) in the log. Which makes sense.
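
To double-check, I looked at whether the two nodes missing the chart had ever actually logged a 5xx at all - note that $status is the ninth whitespace-separated field in my log format, so adjust the column if yours differs:

  # Any 5xx responses at all in the access log? ($status is field 9 in my format)
  awk '$9 ~ /^5/' /var/log/nginx/access.log | head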

Now, my issue is that the combined charts only show some of the servers. Specifically, these only show three of the five servers:

  • web_log.response_statuses
  • web_log.response_codes
  • web_log.detailed_response_codes

None of those charts show up if I look at the two individual servers that aren’t in the combined charts, either. Any ideas?

I wonder if this is related. I seem to have multiple versions of the same charts: the plural ones (http methods, http versions, ip protocols) only show some of the nodes, while the non-plural ones show all of the nodes.

I worked out what was happening here: I had some nodes running the web_log service via Python and some running it via Go.
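
In case anyone else runs into this: the fix on my side was to make every node use the same implementation, i.e. disable the Python one and leave the Go one enabled. The paths below are just where things live on my install - edit-config should find the right copies for you:

  cd /etc/netdata                     # or wherever your Netdata config dir is
  sudo ./edit-config python.d.conf    # set:  web_log: no
  sudo ./edit-config go.d.conf        # make sure web_log is not disabled here
  sudo systemctl restart netdata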

But I’m still having the issue with the unmatched log lines. All the servers have the same format, but from time to time one of them just starts throwing huge numbers of these errors:

2021-10-12 16:59:16: go.d INFO: web_log[nginx] unmatched line: csv parse: assign 'request_time': assign 'Go-http-client/1.1': bad req processing time (parser: csv: $remote_addr - - [$time_local] "$request" $status $body_bytes_sent $request_length $request_time)

I can’t see anything in the log it’s trying to parse that is different from the other servers’ logs, or that should cause any issues.
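
For reference, the parser string in that error reads to me like the nginx log_format below - this is just my reconstruction, not something taken from the collector’s docs - and it’s what I’ve been diffing against the log_format on each server:

  # What the collector's csv parser appears to expect, written as an nginx log_format
  log_format netdata '$remote_addr - - [$time_local] "$request" '
                     '$status $body_bytes_sent $request_length $request_time';
  access_log /var/log/nginx/access.log netdata;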

Typically, I can stop it by restarting Netdata. But not always.

Hmm - that slow restart is something I have occasionally come across myself (it only started happening fairly recently) but could never figure out. I will look into the agent telemetry to see if we can narrow down or identify such slow restarts in some way.

Oh, I was going to suggest this - funny edge cases like this can sometimes happen if some nodes run the collector via Python and others via the Go version.
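
A rough way to sanity-check which implementation a node is actually running is to look for the plugin processes themselves - it won’t tell you which modules each plugin has enabled, but it narrows things down:

  ps -ef | grep -E 'go\.d\.plugin|python\.d\.plugin' | grep -v grep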

@enfant_terrible

Do all the “unmatched line: csv parse” log messages have assign 'Go-http-client/1.1'?

Let’s try to find a line that contains “Go-http-client/1.1” in the Nginx logs. I see that the collector tries to assign it as processing time ($request_time field).
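
Something like this should pull a few of those raw lines out so we can see exactly what ends up in each field - adjust the path to wherever that job’s access log actually lives:

  grep -m 5 'Go-http-client/1.1' /var/log/nginx/access.log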