Migrate wikibase-termbox to node20
Closed, Resolved · Public

Description

To keep our service secure and compatible with other infrastructure, we should update to Node v20 (the latest Node LTS).

A/C:

  • The wikibase-termbox service is updated to Node v20 on production infrastructure

Previously: T355685: Migrate Termbox SSR from Node 16 to 18

Event Timeline

(FTR – if only so I remember it myself ^^ – I was motivated to work on this because a new Envoy image is being rolled out to all Kubernetes services for T368366, and so deploying the node20 update would have also deployed that update and killed two birds with one stone.)

Change #1052746 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[wikibase/termbox@master] Migrate from node18 to node20

https://gerrit.wikimedia.org/r/1052746

Change #1052746 merged by jenkins-bot:

[wikibase/termbox@master] Migrate from node18 to node20

https://gerrit.wikimedia.org/r/1052746

Change #1052928 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] termbox: update to 2024-07-09-084416-production

https://gerrit.wikimedia.org/r/1052928

Change #1052928 merged by jenkins-bot:

[operations/deployment-charts@master] termbox: update to 2024-07-09-084416-production

https://gerrit.wikimedia.org/r/1052928

Hm, as far as I can tell the updated version is working on Wikidata, but not on Test Wikidata – there’s no SSR termbox there. But I don’t see any errors in logstash that would explain this.

Annoyingly, I didn’t check that the termbox SSR was working before deploying, so it’s possible that it was broken for a while but nobody noticed. I guess we can revert the test deployment to the earlier container image version and see how that behaves there.

Aha, “unable to get local issuer certificate”:

lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kube_env termbox staging
lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
termbox-staging-c74b86488-hwm58   3/3     Running   0          11m
termbox-test-d78f78c6d-k5xnb      2/2     Running   0          11m
lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kubectl logs termbox-test-d78f78c6d-k5xnb termbox-test --tail=1
{"name":"wikibase-termbox","hostname":"termbox-test-d78f78c6d-k5xnb","pid":21,"level":"ERROR","message":"unable to get local issuer certificate","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"test.wikidata.org:4446"},"url":"api.php","params":{"action":"query","meta":"allmessages","ammessages":"wikibase-edit|wikibase-publish|wikibase-cancel|wikibase-aliases-separator|wikibase-entitytermsforlanguagelistview-more|wikibase-entitytermsforlanguagelistview-less|wikibase-entitytermsview-entitytermsforlanguagelistview-toggler|wikibase-label-empty|wikibase-description-empty|wikibase-label-edit-placeholder|wikibase-description-edit-placeholder|wikibase-alias-edit-placeholder|wikibase-anonymouseditwarning-heading|wikibase-anonymouseditwarning-message|wikibase-anonymouseditnotificationtempuser-message|wikibase-anonymouseditwarning-dismiss-button|wikibase-anonymouseditwarning-dismiss-persist|pt-login|pt-createaccount|wikibase-shortcopyrightwarning-accept-persist|wikibase-shortcopyrightwarning-heading|wikibase-entity-save-error-heading|wikibase-entity-save-error-message","amlang":"en","format":"json"}},"url":"/termbox?entity=Q235006&revision=676047&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ235006&preferredLanguages=en%7Cmul","reqId":"34fc55cc-dbfc-4c12-bc43-b64ebe2e009f","levelPath":"error/service","msg":"unable to get local issuer certificate","time":"2024-07-10T12:28:17.803Z","v":0}

I have no idea if this could be related to the new envoy version (T368366) or not, though.

But the production termbox doesn’t seem to have this error, so I don’t think it’s a red herring.

lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kube_env termbox eqiad 
lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
termbox-production-7f7466fcb4-gxcb4   3/3     Running   0          12m
termbox-production-7f7466fcb4-hk5lj   3/3     Running   0          12m
termbox-production-7f7466fcb4-mft66   3/3     Running   0          13m
termbox-production-7f7466fcb4-pnm8x   3/3     Running   0          13m
lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kubectl logs termbox-production-7f7466fcb4-gxcb4 termbox-production --tail=10
{"name":"termbox","hostname":"termbox-production-7f7466fcb4-gxcb4","pid":1,"level":"INFO","levelPath":"info/service-runner","msg":"master(1) initializing 1 workers","time":"2024-07-10T12:22:48.275Z","v":0}
{"name":"wikibase-termbox","hostname":"termbox-production-7f7466fcb4-gxcb4","pid":22,"level":"INFO","levelPath":"info/service","msg":"Set config to: {\"WIKIBASE_REPO\":\"http://www.wikidata.org:6500/w\",\"WIKIBASE_REPO_HOSTNAME_ALIAS\":\"localhost\",\"SSR_PORT\":3030,\"MEDIAWIKI_REQUEST_TIMEOUT\":3000,\"MESSAGES_CACHE_MAX_AGE\":60000,\"LANGUAGES_CACHE_MAX_AGE\":300000,\"HEALTHCHECK_QUERY\":\"language=de&entity=Q1&revision=103&editLink=/edit/Q1347&preferredLanguages=de|en\"}","time":"2024-07-10T12:22:49.006Z","v":0}
{"name":"termbox","hostname":"termbox-production-7f7466fcb4-gxcb4","pid":1,"level":"WARN","levelPath":"warn/service-runner","msg":"startup finished","time":"2024-07-10T12:22:49.111Z","v":0}
{"name":"wikibase-termbox","hostname":"termbox-production-7f7466fcb4-gxcb4","pid":22,"level":"INFO","levelPath":"info/service","msg":"server is now running...","time":"2024-07-10T12:22:49.111Z","v":0}
{"name":"termbox","hostname":"termbox-production-7f7466fcb4-gxcb4","pid":22,"level":"WARN","levelPath":"warn/metrics","msg":"Calling 'Metrics.timing' directly is deprecated.","time":"2024-07-10T12:22:50.444Z","v":0}

as far as I can tell the updated version is working on Wikidata, but not on Test Wikidata

See also T355955: [SW] [GENERAL] Simplify Termbox SSR test release for the general issue of these two deployments being fairly different. At least in this case we got lucky and the broken deployment is the less important one ^^

I have no idea if this could be related to the new envoy version (T368366) or not, though.

I don't think this is related to the new envoy version. The certificate chain did stay the same.
I recall we had issues with Node.js in the past because it does not honor the system's CA store, though. IIRC the environment variable NODE_EXTRA_CA_CERTS had to be set properly. Maybe that changed with node20, or there is a better way to do it now (using the system's CA store)?

Hm, I can see NODE_EXTRA_CA_CERTS in both deployments at least (termbox-production and termbox-test)… and it’s still documented in Node 20 so I doubt it went away – but maybe it works differently now.
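A minimal sketch of a sanity check that could be run with the image's own node binary, to see whether NODE_EXTRA_CA_CERTS points at a readable file and how many certificates that file contains. The helper names (countPemCertificates, describeExtraCaCerts) are made up for illustration. As far as the Node documentation goes, the variable is read once at process start and extends the bundled roots rather than replacing them, and tls.rootCertificates lists only the bundled Mozilla store, not the extra file.

```javascript
const fs = require('fs');
const tls = require('tls');

// Count PEM certificate blocks in a string (hypothetical helper).
function countPemCertificates(pemText) {
  const matches = pemText.match(/-----BEGIN CERTIFICATE-----/g);
  return matches ? matches.length : 0;
}

// Summarise what NODE_EXTRA_CA_CERTS points at (hypothetical helper).
// `env` defaults to the real environment but can be replaced for testing.
function describeExtraCaCerts(env = process.env) {
  const path = env.NODE_EXTRA_CA_CERTS;
  if (!path) {
    return { path: null, readable: false, certificates: 0 };
  }
  try {
    const pem = fs.readFileSync(path, 'utf8');
    return { path, readable: true, certificates: countPemCertificates(pem) };
  } catch (err) {
    // Missing file or permission problem: report it instead of throwing.
    return { path, readable: false, certificates: 0 };
  }
}

// Bundled roots only; certificates from NODE_EXTRA_CA_CERTS are not listed here.
console.log('bundled roots:', tls.rootCertificates.length);
console.log('extra CA file:', describeExtraCaCerts());
```

This only shows that the file is present and parseable as PEM. It says nothing about whether the certificate in it actually issued the chain served by the endpoint.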

Let’s just, first of all, test whether reverting the test deployment to node18 fixes the issue. (That would already rule out envoy, because that wouldn’t be rolled back at the same time.)

Change #1053324 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] termbox: revert test deployment to 2024-03-14-121904-production

https://gerrit.wikimedia.org/r/1053324

Change #1053324 merged by jenkins-bot:

[operations/deployment-charts@master] termbox: revert test deployment to 2024-03-14-121904-production

https://gerrit.wikimedia.org/r/1053324

Alright, the image revert fixed the termbox on Test Wikidata. So it’s indeed an issue somewhere in the node20 version of the image, and unrelated to envoy. (But node20 is still deployed in eqiad/codfw for non-Test Wikidata and working fine there.)

Hm, this seems a bit suspicious… if I connect to the endpoint that the Test Wikidata Termbox is configured to connect to:

$ openssl s_client -connect mw-api-int-ro.discovery.wmnet:4446

It spits out a certificate which, according to openssl x509 -noout -text (copy+paste the CERTIFICATE block there), has a 2048-bit RSA key – which feels a bit short, according to my limited cryptography knowledge. I wonder if Node 20 requires 4kbit RSA keys?
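The same key-size check can be sketched in Node itself, assuming Node 15.7 or later for crypto.X509Certificate and KeyObject.asymmetricKeyDetails. describePublicKey is a hypothetical helper name. The demo runs against one of Node's bundled root certificates so that it works offline; the CERTIFICATE block from openssl s_client would be passed in the same way.

```javascript
const { X509Certificate } = require('crypto');
const tls = require('tls');

// Report the public key type and size of a PEM certificate (hypothetical helper).
function describePublicKey(pem) {
  const cert = new X509Certificate(pem);
  const key = cert.publicKey; // a crypto.KeyObject
  const details = key.asymmetricKeyDetails || {};
  return {
    subject: cert.subject,
    keyType: key.asymmetricKeyType,       // e.g. 'rsa' or 'ec'
    modulusLength: details.modulusLength, // set for RSA keys only
    namedCurve: details.namedCurve,       // set for EC keys only
  };
}

// Demo on a bundled root certificate, so no network access is needed.
console.log(describePublicKey(tls.rootCertificates[0]));
```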


Other things that didn’t go anywhere but I’ll dump them here in case they’re useful:

I’m not allowed to attach to the container to take a look (unless there’s a different command to do it?):

lucaswerkmeister-wmde@deploy1002 ~ $ kube_env termbox staging
lucaswerkmeister-wmde@deploy1002 ~ $ kubectl exec -it termbox-test-d78f78c6d-k5xnb termbox-test -- bash
Defaulted container "termbox-test" out of: termbox-test, test-metrics-exporter
Error from server (Forbidden): pods "termbox-test-d78f78c6d-k5xnb" is forbidden: User "termbox" cannot create resource "pods/exec" in API group "" in the namespace "termbox"

Locally, the image works:

$ docker run --rm -p "3030:3030" -e WIKIBASE_REPO=https://www.wikidata.org/w -e WIKIBASE_REPO_HOSTNAME_ALIAS=www.wikidata.org -e SSR_PORT=3030 -e LOGSTASH_HOST=localhost -e STATSD_HOST=localhost docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2024-07-09-084416-production
# in another terminal
$ curl -s 'http://localhost:3030/termbox?entity=Q42&revision=1841500264&language=en&editLink=%2Fw%2Findex.php%2FSpecial%3ASetLabelDescriptionAliases%2FQ42&preferredLanguages=en%7Cde'
# returns HTML beginning with <section class="wikibase-entitytermsview"

I’m assuming this is due to the different WIKIBASE_REPO, which only uses HTTPS in the test deployment:

values-test.yaml
# The port is the port of the mw-api-int deployment
WIKIBASE_REPO: https://test.wikidata.org:4446/w
WIKIBASE_REPO_HOSTNAME_ALIAS: mw-api-int-ro.discovery.wmnet
values.yaml
WIKIBASE_REPO: http://www.wikidata.org:6500/w
WIKIBASE_REPO_HOSTNAME_ALIAS: localhost

I can also get the configured NODE_EXTRA_CA_CERTS from kubectl describe configmap config-test, though I’m not sure if that’s relevant (it shouldn’t have changed, after all).

I wonder if Node 20 requires 4kbit RSA keys?

Doesn’t look like it – I can await fetch('https://rsa2048.badssl.com/') from a Node 22 REPL just fine (ditto from docker run -it --entrypoint=node docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2024-07-09-084416-production). (Also, https://rsa2048.badssl.com/ is bright green, so maybe I was too suspicious about that key length. I think 1024-bit is the one that’s considered really questionable.)

Okay, I can reproduce something somewhat close to the error message, I think.

In one terminal:

ssh -L '*:4446:mw-api-int-ro.discovery.wmnet:4446' deployment.eqiad.wmnet

In another terminal:

$ curl -v --connect-to test.wikidata.org:443:localhost:4446 'https://test.wikidata.org/w/api.php?action=query&format=json'
* Connecting to hostname: localhost
* Connecting to port: 4446
* Host localhost:4446 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:4446...
* Connected to localhost (::1) port 4446
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS alert, unknown CA (560):
* SSL certificate problem: unable to get local issuer certificate
* Closing connection
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

That’s to be expected – I didn’t tell my curl anything about the custom cert here. But it seems encouraging that it produces the same message, “unable to get local issuer certificate” – to me it wasn’t previously obvious that this could mean “server sent unknown certificate”. Let me see if I get any further with this, ideally even in Node.js.

Okay, the behavior seems pretty bizarre even without configuring the NODE_EXTRA_CA_CERTS=. With the new (node20) image, I get the “unable to get local issuer certificate” error:

$ # (the ssh -L from earlier is still in effect, forwarding port 4446 to mw-api-int-ro.discovery.wmnet:4446)
$ docker run -it --entrypoint=node --network=host docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2024-07-09-084416-production
> await fetch('https://localhost:4446/w/api.php?action=query&format=json')
Uncaught TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11576:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async REPL2:1:33 {
  cause: Error: unable to get local issuer certificate
      at TLSSocket.onConnectSecure (node:_tls_wrap:1627:34)
      at TLSSocket.emit (node:events:514:28)
      at TLSSocket.emit (node:domain:552:15)
      at TLSSocket._finishInit (node:_tls_wrap:1038:8)
      at ssl.onhandshakedone (node:_tls_wrap:824:12)
      at TLSWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
    code: 'UNABLE_TO_GET_ISSUER_CERT_LOCALLY'
  }
}

Whereas with the old (node18) image, even without the NODE_EXTRA_CA_CERTS=, the only error is that the name doesn’t match?! (Because I don’t know what Node’s equivalent to --connect-to is, so I just put localhost in the URL.)

$ docker run -it --entrypoint=node --network=host docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2024-03-14-121904-production
Welcome to Node.js v18.19.0.
Type ".help" for more information.
> await fetch('https://localhost:4446/w/api.php?action=query&format=json')
Uncaught TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:16289:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async REPL1:1:33 {
  cause: Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames: Host: localhost. is not in the cert's altnames: DNS:mw-api-int.discovery.wmnet, etc., DNS:*.wikidata.org, etc.

I don’t know why Node 18 seemingly accepts the internal CA certificate… I would’ve thought that, without the NODE_EXTRA_CA_CERTS= (or some other production config), this should always have been invalid? I certainly wouldn’t expect the internal cert to validate on my laptop…
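To make the difference between the two failures explicit, here is a small sketch that maps the two error codes from the REPL sessions above to a plain-language reading. explainTlsFailure and probe are hypothetical helper names. The reading of ERR_TLS_CERT_ALTNAME_INVALID as "the chain itself verified" is an assumption based on Node appearing to run the hostname check only after chain verification succeeded, which matches what was observed here.

```javascript
// Map the error codes seen in this thread to a plain-language reading (hypothetical helper).
function explainTlsFailure(err) {
  // fetch() wraps the underlying TLS error in err.cause; plain https errors carry err.code.
  const code = (err && err.cause && err.cause.code) || (err && err.code);
  switch (code) {
    case 'UNABLE_TO_GET_ISSUER_CERT_LOCALLY':
      return 'chain not trusted: issuer is not in the local trust store';
    case 'ERR_TLS_CERT_ALTNAME_INVALID':
      return 'chain trusted, but the hostname is not in the certificate altnames';
    default:
      return 'other failure: ' + String(code);
  }
}

// Try a URL and describe the outcome (needs Node 18+ for the global fetch).
async function probe(url) {
  try {
    const res = await fetch(url);
    return 'ok: HTTP ' + res.status;
  } catch (err) {
    return explainTlsFailure(err);
  }
}
```

In a REPL with the ssh -L forwarding still active, this would be used as await probe('https://localhost:4446/w/api.php?action=query&format=json').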

Okay, I found out how to emulate curl --connect-to.

## terminal A – SSH port forwarding
$ ssh -L '*:4446:mw-api-int-ro.discovery.wmnet:4446' deployment.eqiad.wmnet

## terminal B – Node 18
$ docker run -it --entrypoint=node --network=host --add-host=mw-api-int-ro.discovery.wmnet:127.0.0.1 docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2024-03-14-121904-production
> await fetch( 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php?action=query&format=json', { headers: { host: 'test.wikidata.org' } } ).then( r => r.json() )
{ batchcomplete: '' }

## terminal C – Node 20
$ docker run -it --entrypoint=node --network=host --add-host=mw-api-int-ro.discovery.wmnet:127.0.0.1 docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2024-07-09-084416-production
> await fetch( 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php?action=query&format=json', { headers: { host: 'test.wikidata.org' } } ).then( r => r.json() )
Uncaught TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11576:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async REPL1:1:33 {
  cause: Error: unable to get local issuer certificate
      at TLSSocket.onConnectSecure (node:_tls_wrap:1627:34)
      at TLSSocket.emit (node:events:514:28)
      at TLSSocket.emit (node:domain:552:15)
      at TLSSocket._finishInit (node:_tls_wrap:1038:8)
      at ssl.onhandshakedone (node:_tls_wrap:824:12)
      at TLSWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
    code: 'UNABLE_TO_GET_ISSUER_CERT_LOCALLY'
  }
}

I still haven’t configured the NODE_EXTRA_CA_CERTS=. Somehow, Node 18 supported whatever certificate mw-api-int-ro.discovery.wmnet:4446 serves out of the box. Node 20 doesn’t.

(And as seen in T368523#9969866, curl rejects the cert from mw-api-int-ro.discovery.wmnet:4446, so my feeling would be that Node 20 is right to reject it and Node 18 had a bug. But that doesn’t help us much if we want to get away from Node 18 :D)
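An alternative to the --add-host trick, sketched here as an assumption rather than what was actually run above: Node's https module accepts separate host, servername and Host header values, which is roughly what curl --connect-to does. connectToOptions and get are hypothetical helper names. Unlike the fetch calls above, this validates the certificate against the URL's hostname (as curl does), not against mw-api-int-ro.discovery.wmnet.

```javascript
const https = require('https');

// Build https.request options that connect to connectHost:connectPort
// while presenting the URL's hostname for SNI and the Host header (hypothetical helper).
function connectToOptions(url, connectHost, connectPort) {
  const u = new URL(url);
  return {
    host: connectHost,      // where the TCP connection actually goes
    port: connectPort,
    servername: u.hostname, // SNI name; Node also checks the certificate against it
    path: u.pathname + u.search,
    headers: { Host: u.host },
  };
}

// Perform the GET and collect the response body (hypothetical helper).
function get(url, connectHost, connectPort) {
  return new Promise((resolve, reject) => {
    https.get(connectToOptions(url, connectHost, connectPort), (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve({ status: res.statusCode, body }));
    }).on('error', reject);
  });
}
```

With the SSH port forwarding from terminal A in place, the equivalent of the curl command would be await get('https://test.wikidata.org/w/api.php?action=query&format=json', 'localhost', 4446).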

My TLS-fu isn’t the strongest, but to me it looks like the certificate configured for the service (kube_env termbox staging; kubectl get configmap config-test -o json | jq -r '.data["puppetca.crt.pem"]') and the ones sent by mw-api-int-ro (openssl s_client -connect mw-api-int-ro.discovery.wmnet:4446 -showcerts < /dev/null) are unrelated…? The former is for “CN = Puppet CA: palladium.eqiad.wmnet”, whereas the latter are both for “O = "Wikimedia Foundation, Inc"” (one “OU = SRE Foundations, CN = discovery”, the other “OU = Cloud Services, CN = Wikimedia_Internal_Root_CA”). I don’t see anything linking them.
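Rather than comparing subjects and issuers by eye, the link between two certificates can be checked programmatically. This is a sketch assuming Node 15.6 or later for crypto.X509Certificate; describeIssuerLink is a hypothetical helper name. The demo compares a bundled root certificate with itself so that it runs offline. For the case above, the two inputs would be the PEM from the configmap and a certificate from the openssl s_client -showcerts output.

```javascript
const { X509Certificate } = require('crypto');
const tls = require('tls');

// Does `caPem` look like the issuer of `certPem`? (hypothetical helper)
function describeIssuerLink(certPem, caPem) {
  const cert = new X509Certificate(certPem);
  const ca = new X509Certificate(caPem);
  return {
    certIssuer: cert.issuer,
    caSubject: ca.subject,
    // Plain string comparison of the distinguished names.
    namesMatch: cert.issuer === ca.subject,
    // checkIssued() asks OpenSSL whether `ca` could have issued `cert`.
    issued: cert.checkIssued(ca),
    // verify() checks the signature on `cert` against the CA's public key.
    signatureValid: cert.verify(ca.publicKey),
  };
}

// Demo: a self-signed root compared with itself, so no network access is needed.
const root = tls.rootCertificates[0];
console.log(describeIssuerLink(root, root));
```

If namesMatch and issued both come back false for the configmap certificate against every certificate in the served chain, that would support the impression that the two are unrelated.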

Is “palladium” still a thing? It was supposedly decommissioned in T147320 – but that would’ve been years before Termbox even started, I think, so maybe the hostname just lingered on in the Puppet CA for longer than that. (But even then, I don’t know if it’s correct that termbox would be configured to add this certificate to the root CAs when talking to mw-api-int-ro.)

My working theory right now is that the NODE_EXTRA_CA_CERTS=/etc/termbox/puppetca.crt.pem in Termbox’ configmap has been outdated for a while now, but that Node 18 (and earlier versions?) incorrectly accepted the connection anyway, so that the problem wasn’t noticed before.

(And at an even higher level, the fact that only the Test Wikidata version of the Termbox uses HTTPS at all, whereas the production version uses [IIUC] envoy / service mesh over HTTP since 2020, is another facet of “the test and production releases are weirdly different and this keeps biting us” aka T355955.)

Is “palladium” still a thing?

Searching for "Puppet CA: palladium.eqiad.wmnet" on Phabricator (rather than just “palladium” as I did earlier) indicates that the CA certificate is definitely still a thing; it can be found in wmf-certificates (that’s the same one as in the termbox configmap) and there are various other relatively recent tasks related to it. Remains to be seen whether the certificate is linked to the ones served by mw-api-int-ro in a way that I missed, or whether they’re still unconnected for some unknown reason.

I might be missing something obvious here (sorry for not mentioning earlier), but why does termbox test not use the service mesh to connect to mw-api-int-ro.discovery.wmnet like the other termbox releases do? That would leave the TLS part completely to envoy, removing the requirement to provide NODE_EXTRA_CA_CERTS etc.

I think that’s a very fair question that I’m not qualified to answer 😅 as far as I know, there were some concerns over whether this would work (T355685#9490969), but I don’t really understand the details.

That said, as far as I’m concerned breaking the service for Test Wikidata is not a big deal (especially if we can easily revert any change), so we could just experiment with it a bit if we wanted… something like this, maybe?

diff --git a/helmfile.d/services/termbox/values-test.yaml b/helmfile.d/services/termbox/values-test.yaml
index 28afb3cf87..24279d8028 100644
--- a/helmfile.d/services/termbox/values-test.yaml
+++ b/helmfile.d/services/termbox/values-test.yaml
@@ -6,9 +6,8 @@ config:
     LANGUAGES_CACHE_MAX_AGE: 300000
     MEDIAWIKI_REQUEST_TIMEOUT: 3000
     MESSAGES_CACHE_MAX_AGE: 60000
-    # The port is the port of the mw-api-int deployment
-    WIKIBASE_REPO: https://test.wikidata.org:4446/w
-    WIKIBASE_REPO_HOSTNAME_ALIAS: mw-api-int-ro.discovery.wmnet
+    WIKIBASE_REPO: http://test.wikidata.org:6500/w
+    WIKIBASE_REPO_HOSTNAME_ALIAS: localhost
 main_app:
   liveness_probe:
     tcpSocket:
@@ -25,15 +24,3 @@ service:
 tolerations: {}
 app:
   port: 3031
-mesh:
-  enabled: false
-# We need an egress to reach mw-api-int-ro:4446
-networkpolicy:
-  egress:
-    dst_nets:
-      # mw-api-int-ro.eqiad
-      - cidr: 10.2.2.81/32
-        port: 4446
-      # mw-api-int-ro.codfw
-      - cidr: 10.2.1.81/32
-        port: 4446

Something like that, yes. Although you will need to use a mesh.public_port different from the one used for the "not test" release, as the nodePort would clash otherwise. See https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports

I was hoping to only change how the Termbox connects to Wikidata, not the other direction (Termbox being called by Wikidata) – your comment sounds like it would affect both? (Maybe it’s not possible to separate them?) Would that mean changing the wmgWikibaseSSRTermboxServerUrl as well (and maybe a new hieradata/common/service.yaml and/or hieradata/common/profile/services_proxy/envoy.yaml entry in Puppet)?

Change #1055164 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] termbox: Enable mesh for termbox-test

https://gerrit.wikimedia.org/r/1055164

By default, the plain text port/service goes away when the service mesh/TLS termination is enabled, but that's not set in stone. The attached CR should add a second service for TLS on port 4018 (I've reserved the port(s) in the wikitech list) while leaving the plain text service/port intact. This should allow you to keep using the plain text port to connect to termbox-test, but I would recommend switching to TLS and then reverting the change to charts/termbox/templates/service.yaml to disable the plain text port.

Change #1055164 merged by jenkins-bot:

[operations/deployment-charts@master] termbox: Enable mesh for termbox-test

https://gerrit.wikimedia.org/r/1055164

The above change seems to have worked fine \o/ I’ll see if we can make the Test Wikidata → Termbox connection go through TLS later.

Change #1055234 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] Revert "termbox: revert test deployment to 2024-03-14-121904-production"

https://gerrit.wikimedia.org/r/1055234

Change #1055234 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "termbox: revert test deployment to 2024-03-14-121904-production"

https://gerrit.wikimedia.org/r/1055234

And with that, the Node 20 termbox seems to work on Test Wikidata \o/

Lucas_Werkmeister_WMDE claimed this task.

I think we can consider this task done, then – the Termbox now uses Node 20 everywhere, and we’ve even made some progress on T355955 along the way. Further improvements to the Termbox service can happen there.

Change #1063750 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Update termbox

https://gerrit.wikimedia.org/r/1063750

Change #1063750 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Update termbox

https://gerrit.wikimedia.org/r/1063750