
Use Envoy instead of LVS to route internal federation traffic for WDQS
Closed, Invalid (Public)

Description

As discussed in T361950#9934152, we don't want to go through LVS for internal traffic. Solution proposed by @Vgutierrez is to use local envoy to do the routing / load balancing.

AC:

  • federation traffic in WDQS does not go through LVS

Event Timeline

Discussed this at pairing with @RKemper today, we were curious how to implement in envoy.

I'm far from being an envoy expert, but it looks like we could use envoy as a load balancer by defining a static envoy config similar to cluster.yaml.erb and listener.yaml.erb. So envoy definitely supports this, but it's not a feature in use anywhere at WMF as far as I can tell. Tagging serviceops to comment on whether or not configuring it this way is recommended.
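To make the idea concrete, here is a minimal sketch of what such a static envoy config might look like. All names here are assumptions for illustration: the listener port, cluster name, and backend hostnames (wdqs-scholarly100x.eqiad.wmnet) are hypothetical, not taken from the actual puppet templates.

```yaml
# Hypothetical static envoy config: a local listener that load balances
# federation traffic across the wdqs-scholarly blazegraph backends.
# Port numbers and hostnames are illustrative only.
static_resources:
  listeners:
  - name: wdqs_scholarly_federation
    address:
      socket_address: { address: 127.0.0.1, port_value: 6001 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: wdqs_scholarly
          route_config:
            virtual_hosts:
            - name: wdqs_scholarly
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: wdqs_scholarly }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: wdqs_scholarly
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: wdqs_scholarly
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: wdqs-scholarly1001.eqiad.wmnet, port_value: 80 }
        - endpoint:
            address:
              socket_address: { address: wdqs-scholarly1002.eqiad.wmnet, port_value: 80 }
```

The blazegraph http client would then point federation requests at 127.0.0.1:6001 and let envoy pick a healthy backend; note this sketch lists endpoints statically and does not address the pybal/etcd depooling question raised below.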

ServiceOps, most of the context you need is in the parent ticket. Thanks for looking!

In the post below I'm trying to aggregate info from the previous phab ticket and from in-meeting discussions, to give service-ops a jumping-off point.

(Some of these details might be wrong, but I think the main points are correct)

Request Routing

Previous way: user submits a query at query.wikidata.org -> ATS -> LVS -> nginx -> blazegraph

New way: (if a federated query) user submits a query at query-(main|scholarly).wikidata.org -> ATS -> LVS -> nginx -> blazegraph -> http client inside blazegraph reaches out to endpoint specified in federation request

(Non-federated queries are equivalent to the previous non-graph-split way of doing things so are irrelevant for our purposes here)

Throttling

Previous way: throttling done via token-bucket algorithm on 2-tuple of (IP addr, user agent)

New way: Still the same token-bucket algorithm, but we need to identify internal federation queries and disable throttling for them (throttling will still occur at the external level, i.e. at the first wdqs-(main|scholarly) endpoint contacted by the user). The current proposal is to use envoy for internal federation queries (i.e. those emanating from the blazegraph http client), but that raises some questions, most pressingly: is there a mechanism by which we can query etcd state, so that hosts depooled from pybal are similarly excluded from the envoy pool?

Questions
  • Is it feasible to have envoy query etcd state in order to decide pool members? (i.e. we would want to maintain the invariant that if a host is depooled in pybal it won't be present in envoy's pool). My present understanding is that sophroid (https://gitlab.wikimedia.org/repos/sre/sophroid) is intended to support this but isn't production-ready
  • If it's not feasible, are there any recommendations about how to best achieve our goals?

Note that the throttling question is orthogonal to the LVS vs Envoy question. As long as we are able to identify which requests are coming from the outside (internet) vs internal, we can put in place throttling exceptions.
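To illustrate the throttling exception described above, here is a hypothetical sketch of the scheme: a token bucket keyed on the (IP address, user agent) 2-tuple, with a bypass for requests identified as internal federation traffic. How internal requests are identified (e.g. a header stamped by envoy, or a source-IP check) is an assumption here, abstracted into an `is_internal` flag; the class and parameter names are illustrative, not the actual WDQS implementation.

```python
# Sketch of token-bucket throttling on (IP, user agent), with an
# exemption for internal federation requests. Illustrative only.
import time
from collections import defaultdict


class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then try to spend a token.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Throttler:
    def __init__(self, capacity: float = 10, refill_rate: float = 1.0):
        # One bucket per (IP, user agent) pair, created on first use.
        self.buckets = defaultdict(lambda: TokenBucket(capacity, refill_rate))

    def allow(self, ip: str, user_agent: str, is_internal: bool) -> bool:
        # Internal federation requests bypass throttling entirely;
        # they were already throttled at the external entry point.
        if is_internal:
            return True
        return self.buckets[(ip, user_agent)].allow()
```

The key point is the early return for internal traffic: external requests pay the token cost once at the entry endpoint, and the downstream federated hop is exempt.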

It is not clear to me what solutions we have in place to route / load balance internal traffic. If LVS isn't the right solution for load balancing internal traffic, we need another one. I don't think that Data-Platform-SRE is the right team to put a new load balancing solution in place.

Whatever solution we use for internal load balancing should provide at least:

  • easy way to pool / depool individual nodes, manually or automatically based on health checks
  • redundancy: a failure of one of the load balancing nodes should not result in a complete loss of the service behind it
  • integrated operation for internal and external traffic routing: in this specific case, the Blazegraph pools are going to serve both internal and external traffic, so we need to ensure that depooling a node depools it for both. That could be achieved either by having a shared control plane or by having a single LB routing both internal and external traffic.
Gehel triaged this task as High priority. Jul 9 2024, 8:02 AM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Note that we should not have traffic loops, at least not in the sense that I understand loops. There are multiple blazegraph pools with different datasets. Each pool will be able to federate with another pool, but should never federate with itself. As an example:

Queries to wdqs-main, federated with wdqs-scholarly:
query.wikidata.org -> ATS -> LVS -> nginx -> blazegraph (main) -> LVS -> nginx -> blazegraph (scholarly)

And a graph to hopefully make all this more clear:

wdqs-graph-split-traffic.png (695×768 px, 54 KB)

Gehel claimed this task.
Gehel moved this task from Backlog to Done on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.
Gehel added a subscriber: Joe.

After discussion with @Vgutierrez and @Joe: given that we have separate LVS pools for the different blazegraph pools, we don't create traffic loops, so we should be fine using LVS.

I'm closing this as invalid, feel free to re-open if I misunderstood something.