Add enhanced logging to Citoid
Open, Needs Triage, Public

Description

This task involves adding enhanced logging to Citoid so that we (the maintainers of Citoid) can:

  1. Understand exactly why Citoid is failing
  2. Offer people who are impacted by these failures (e.g. in the process of attempting to generate a citation) informative and actionable feedback

The absence of the level of nuanced logging this task is asking for is impacting us in the following ways:

  1. We're unable to offer people specific error messages within Citoid (T364594)
  2. We're not able to determine which publishers we ought to prioritize contacting because we do not know the frequency with which specific domains are failing

Event Timeline

Change #1036250 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Log HTTP errors on warn level

https://gerrit.wikimedia.org/r/1036250

Splitting from the main thread, to inquire about this specifically:

I'll have a deeper look tomorrow.

urldownloaders don't have the visibility needed to look at URLs, since most sites are accessed over HTTPS. So the only thing they do see is URL domains, not paths. We need Citoid to log what requests it sees.

I'm implementing logging in Citoid now, but don't we mostly need to know the domain rather than the path anyway, since that lets us sort by hosts that are blocking us? I don't think we actually need the full path...

Just as a data point, I've had a look into the urldownloader logs today for nytimes, and apparently the last time we got an error was on May 4th (2 errors). Before that, we had a spew of errors on April 19th (~20), a pattern that continues back to March 4th, at which point we no longer have data in logstash.

Total number of errors: 743
Domains: nytimes.com, www.nytimes.com
Definition of error: HTTP 403
Time range of errors: 2024-03-04 to 2024-05-04

I noticed ScienceDirect is still giving us 200s across the board:

https://logstash.wikimedia.org/goto/d1c901fbadc2b726c357bcf357f9e37c

But obviously they are still blocking our requests... is it just that the CONNECT is fine, and it's only the GET requests that are blocked? Why aren't the GET requests getting logged, then? (I think I still don't understand exactly what urldownloader does!)

The best solution would be to split the logged URLs into url.domain and url.path. That way we can aggregate according to our needs in logstash.


urldownloader is just a Squid proxy set up in a forward-proxy configuration without caching. It makes connections from client applications like Citoid and Zotero appear to be initiated from its IP, rather than from the private internal IPs assigned to the Citoid and Zotero instances. That way, these applications are able to access the internet.

There are two different ways it functions:

  • HTTP forward proxying uses the usual HTTP methods you probably know: GET, POST, PUT, DELETE, etc. The proxy sees the entirety of the request/response pair. Of that data, it sends to the source/destination whatever the configuration tells it to send (e.g. the configuration might say to remove cookies, specific headers, etc.)
  • The CONNECT HTTP method starts a TCP tunnel from the requesting client to the destination. As a method, it is used for all HTTPS destinations by all HTTP(S)-proxy-compliant libraries (like the ones Citoid and Zotero use). In this case, due to the encryption, the proxy doesn't see anything of the data in the request/response pair. It's blind forwarding of data, which starts right after a client like Citoid issues something like
CONNECT www.sciencedirect.com:443 HTTP/1.1

Per https://www.rfc-editor.org/rfc/rfc9110#CONNECT

Any 2xx (Successful) response indicates that the sender (and all inbound proxies) will switch to tunnel mode immediately after the response header section; data received after that header section is from the server identified by the request target. Any response other than a successful response indicates that the tunnel has not yet been formed.

So that 200 is meaningless. All it tells Citoid is that the tunnel was successfully formed. Any other HTTP response (e.g. a 403) indicates the tunnel wasn't formed, and may or may not carry information as to the reason why.
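The distinction boils down to checking only the first line of the proxy's reply to the CONNECT request. A sketch (the response strings in the usage note are illustrative examples of what a proxy like Squid might return, not captured output):

```javascript
// Sketch: what the first response line to a CONNECT request tells us,
// per RFC 9110. Any 2xx means only that the TCP tunnel was formed; the
// origin server's real response arrives later, encrypted, inside the
// tunnel, so the proxy (and its logs) never see it.
function tunnelEstablished(responseHead) {
  return /^HTTP\/1\.[01] 2\d\d/.test(responseHead);
}
```

For example, "HTTP/1.1 200 Connection established" means only that the tunnel is up, while "HTTP/1.1 403 Forbidden" means the tunnel was refused; neither says anything about how the origin server will respond inside the tunnel.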

So, my comment above regarding no longer receiving errors from nytimes.com should be viewed in this light: Squid was able to form a tunnel to nytimes.com and allow Citoid to fetch the information it wanted. What happens after that is Citoid's job to interpret.

In other words, when we were seeing those 403s, nytimes.com was outright rejecting urldownloader initiated connections. The fact that we no longer see them for nytimes.com or sciencedirect.com just means that our IPs aren't blocked. Nothing more than that.

Which is why I am saying that Citoid/Zotero should be logging more about what is happening. urldownloader provides no more visibility here because it doesn't have any.

I deployed a change today for this. It is not everything we hoped, alas :).

We don't actually seem to log infos, only warns. However, it may look like we do because of the presence of NOTICE levels in logstash. Which brings me to a second issue: these are in fact NOTICEs caused by a parsing failure of what are actually WARNs. Some of them are just total garbage: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=TmlhDJAB_bxzc-Pm7TGI. Some of these NOTICEs are a single line of a stack trace. Guessing from this one, I am thinking the inclusion of the HTML of the response in the message of the error might be the cause: maybe it's not actually being sanitised, and it's escaping out somehow? Yikes :P And then we're getting something like 20 NOTICEs for parts of the response, each logged separately, e.g. https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=TdVkDJABDD1VxFBSzVVI

This is a preexisting bug, but only became very noticeable now when we started logging a lot more failures.

All the HTTPErrors include the body as the "msg" when logged. And if the body contains JSON, I think that might be how it's escaping out.
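A small illustration of the suspected failure mode (the log-line shapes are assumptions for demonstration, not Citoid's actual logger output): pasting a raw body into a hand-built log line produces invalid JSON as soon as the body contains quotes, while serializing a structured record escapes it properly.

```javascript
// Sketch: why embedding a raw response body in the log "msg" can break
// downstream parsing. The body's own quotes terminate the JSON string
// early in the hand-built line; JSON.stringify escapes them instead.
const body = '<html><body>{"error": "blocked"}</body></html>';

// Naive: interpolate the body straight into a JSON-looking log line.
const naiveLine = `{"level": "warn", "msg": "${body}"}`;

// Structured: let the serializer handle the escaping.
const structuredLine = JSON.stringify({ level: 'warn', msg: body });
```

Parsing `naiveLine` as JSON fails, which is consistent with the parse-failure NOTICEs described above; `structuredLine` round-trips cleanly.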

Here's a broken ScienceDirect one: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=ttluDJAB0bw1V4rvcXY8

On the positive side:

Here is an example of one warning actually working (from our probe): https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=n19qDJAB5VvKlwLuaeRS

Also, we apparently need to index the pattern to actually filter by uri: https://logstash.wikimedia.org/app/management/opensearch-dashboards/indexPatterns/patterns/logstash-*#/?_a=h@9293420. Does anyone know where the template lives, so we can add those fields?

Also, do you know how to enable INFO in logstash for citoid?

Yikes indeed. I had a look at a couple of those, and I can say that it's definitely a WARN-level logline. The entire event, however, was 21630 characters long. The garbage part that you noticed was apparently an inline base64-encoded image.

My current theory is that there is some truncation happening (I doubt logstash/elasticsearch ever expected 21 KB log events) plus a default NOTICE level. I base that on the size, and on the fact that these entries lack other fields.

But you are absolutely right that this is a bug and shouldn't be happening. We need to stop logging bodies of responses (or truncate them very heavily, keeping the first 100 or so chars)
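That heavy truncation could look something like this sketch (the helper name and the 100-character cutoff are assumptions, following the suggestion above):

```javascript
// Sketch: cap response bodies before they reach the logger, keeping
// only the first ~100 characters plus a note about what was dropped.
function truncateBody(body, max = 100) {
  const s = String(body);
  if (s.length <= max) return s;
  return s.slice(0, max) + ` [truncated ${s.length - max} chars]`;
}
```

This keeps enough of the body to identify a block page or error message while keeping log events well under the sizes that seem to be breaking parsing.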

It's citoid that needs to emit messages with INFO levels. That can be done at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/citoid/templates/_config.yaml#16

All it needs is an extra line of

level:  {{ .Values.main_app.log_level | default "info" }}

and then we override that in helmfile.d stanzas for when we don't want info.

But before proceeding with that, we probably want to stop logging the body of the response.