Add enhanced logging to Citoid
Open, Needs Triage, Public

Description

This task involves adding enhanced logging to Citoid so that we (the maintainers of Citoid) can:

  1. Understand exactly why Citoid is failing
  2. Offer people who are impacted by these failures (e.g. in the process of attempting to generate a citation) informative and actionable feedback

The absence of the level of nuanced logging this task is asking for is impacting us in the following ways:

  1. We're unable to offer people specific error messages within Citoid (T364594)
  2. We're not able to determine which publishers we ought to prioritize contacting because we do not know the frequency with which specific domains are failing

Event Timeline

Change #1036250 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Log HTTP errors on warn level

https://gerrit.wikimedia.org/r/1036250

Splitting from the main thread, to inquire about this specifically:

I'll have a deeper look tomorrow.

urldownloaders don't have the visibility needed to look at URLs, since most sites are accessed over HTTPS. So the only thing they do see is URL domains, not paths. We need Citoid to log what requests it sees.

I'm implementing logging in Citoid now, but don't we mostly need to know the domain rather than the path anyway, since that lets us sort by hosts that are blocking us? I don't think we actually need the full path...

Just as a data point, I've had a look into the urldownloader logs today for nytimes, and apparently the last time we got an error was on May 4th (2 errors). Before that, we had a spew of errors on April 19th (~20), a pattern that continues back to March 4th, at which point we no longer have data in logstash.

Total number of errors: 743
Domains: nytimes.com, www.nytimes.com
Definition of error: HTTP 403
Time range of errors: 2024-03-04 to 2024-05-04

I noticed ScienceDirect is still giving us 200s across the board:

https://logstash.wikimedia.org/goto/d1c901fbadc2b726c357bcf357f9e37c

But obviously they are still blocking our requests... is it just that the CONNECT is fine, and it's only the GET requests that are blocked? Why aren't the GET requests getting logged, then? (I think I still don't understand exactly what urldownloader does!)

The best solution would be to split the logged URLs into url.domain and url.path. That way we can aggregate according to our needs in logstash.


urldownloader is just a Squid proxy set up in a forward-proxy configuration without caching. It makes connections from client applications like Citoid and Zotero appear to be initiated from its IP, rather than from the private internal IPs assigned to the Citoid and Zotero instances. That way, these applications are able to access the internet.

There are two different ways it functions:

  • HTTP forward proxying uses the usual HTTP methods you probably know: GET, POST, PUT, DELETE, etc. The proxy sees the entirety of the request/response pair. Of that data, it sends to the source/destination whatever the configuration tells it to send (e.g. the configuration might say to remove cookies, specific headers, etc.)
  • The CONNECT HTTP method starts a TCP tunnel from the requesting client to the destination. As a method, it is used for all HTTPS destinations by all HTTP(S)-proxy-compliant libraries (like the ones Citoid and Zotero use). In this case, due to the encryption, the proxy doesn't see anything of the data in the request/response pair. It's blind forwarding of data, which starts right after a client like Citoid issues something like
CONNECT www.sciencedirect.com:443 HTTP/1.1

Per https://www.rfc-editor.org/rfc/rfc9110#CONNECT

Any 2xx (Successful) response indicates that the sender (and all inbound proxies) will switch to tunnel mode immediately after the response header section; data received after that header section is from the server identified by the request target. Any response other than a successful response indicates that the tunnel has not yet been formed.

So that 200 is meaningless. All it tells Citoid is that the tunnel was successfully formed. Any other HTTP response (e.g. a 403) indicates the tunnel wasn't formed, and may or may not carry information as to the reason why.
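The distinction boils down to checking only the first line of the proxy's reply to the CONNECT request. A sketch (the response strings in the usage note are illustrative examples of what a proxy like Squid might return, not captured output):

```javascript
// Sketch: what the first response line to a CONNECT request tells us,
// per RFC 9110. Any 2xx means only that the TCP tunnel was formed; the
// origin server's real response arrives later, encrypted, inside the
// tunnel, so the proxy (and its logs) never see it.
function tunnelEstablished(responseHead) {
  return /^HTTP\/1\.[01] 2\d\d/.test(responseHead);
}
```

For example, "HTTP/1.1 200 Connection established" means only that the tunnel is up, while "HTTP/1.1 403 Forbidden" means the tunnel was refused; neither says anything about how the origin server will respond inside the tunnel.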

So, my comment above regarding no longer receiving errors from nytimes.com should be viewed in this light: Squid was able to form a tunnel to nytimes.com and allow Citoid to fetch the information it wanted. What happens after that is Citoid's job to interpret.

In other words, when we were seeing those 403s, nytimes.com was outright rejecting urldownloader initiated connections. The fact that we no longer see them for nytimes.com or sciencedirect.com just means that our IPs aren't blocked. Nothing more than that.

Which is why I am saying that Citoid/Zotero should be logging more about what is happening. urldownloader provides no more visibility here because it doesn't have any.

I deployed a change today for this. It is not everything we hoped, alas :).

We don't actually seem to log infos, only warns. However, it may look like we do because of the presence of NOTICE levels in logstash. Which brings me to a second issue: these are in fact NOTICEs caused by a parsing failure of what are actually WARNs. Some of them are just total garbage: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=TmlhDJAB_bxzc-Pm7TGI. Some of these NOTICEs are a single line of a stack trace. Guessing from this one, I am thinking the inclusion of the HTML of the response in the message of the error might be the cause: maybe it's not actually being sanitised, and it's escaping out somehow? Yikes :P And then we're getting something like 20 NOTICEs for parts of the response, each logged separately, e.g. https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=TdVkDJABDD1VxFBSzVVI

This is a preexisting bug, but only became very noticeable now when we started logging a lot more failures.

All the HTTPErrors include the body as the "msg" when logged. And if the body contains JSON, I think that might be how it's escaping out.
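A small illustration of the suspected failure mode (the log-line shapes are assumptions for demonstration, not Citoid's actual logger output): pasting a raw body into a hand-built log line produces invalid JSON as soon as the body contains quotes, while serializing a structured record escapes it properly.

```javascript
// Sketch: why embedding a raw response body in the log "msg" can break
// downstream parsing. The body's own quotes terminate the JSON string
// early in the hand-built line; JSON.stringify escapes them instead.
const body = '<html><body>{"error": "blocked"}</body></html>';

// Naive: interpolate the body straight into a JSON-looking log line.
const naiveLine = `{"level": "warn", "msg": "${body}"}`;

// Structured: let the serializer handle the escaping.
const structuredLine = JSON.stringify({ level: 'warn', msg: body });
```

Parsing `naiveLine` as JSON fails, which is consistent with the parse-failure NOTICEs described above; `structuredLine` round-trips cleanly.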

Here's a broken ScienceDirect one: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=ttluDJAB0bw1V4rvcXY8

On the positive side:

Here is an example of one warning actually working (from our probe): https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.06.12?id=n19qDJAB5VvKlwLuaeRS

Also, we apparently need to index the pattern to actually filter by uri: https://logstash.wikimedia.org/app/management/opensearch-dashboards/indexPatterns/patterns/logstash-*#/?_a=h@9293420. Does anyone know where the template lives, so we can add those fields?

Also, do you know how to enable INFO in logstash for citoid?

Yikes indeed. I had a look at a couple of those, and I can say that it's definitely a WARN-level logline. The entire event, however, was 21630 characters long. The garbage part that you noticed was apparently an inline base64-encoded image.

My current theory is that there is some truncation happening (I doubt logstash/elasticsearch ever expected 21 KB log events) plus a default NOTICE level. I base that on the size, and on the fact that these entries lack other fields.

But you are absolutely right that this is a bug and shouldn't be happening. We need to stop logging bodies of responses (or truncate them very heavily, keeping the first 100 or so chars)
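That heavy truncation could look something like this sketch (the helper name and the 100-character cutoff are assumptions, following the suggestion above):

```javascript
// Sketch: cap response bodies before they reach the logger, keeping
// only the first ~100 characters plus a note about what was dropped.
function truncateBody(body, max = 100) {
  const s = String(body);
  if (s.length <= max) return s;
  return s.slice(0, max) + ` [truncated ${s.length - max} chars]`;
}
```

This keeps enough of the body to identify a block page or error message while keeping log events well under the sizes that seem to be breaking parsing.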

It's citoid that needs to emit messages with INFO levels. That can be done at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/citoid/templates/_config.yaml#16

All it needs is an extra line of

level:  {{ .Values.main_app.log_level | default "info" }}

and then we override that in helmfile.d stanzas for when we don't want info.

But before proceeding with that, we probably want to stop logging the body of the response.