[Search Update Pipeline] Source streams for private wikis
Closed, Resolved · Public · 13 Estimated Story Points

Description

Currently the search update pipeline relies on Kafka streams emitted, amongst other sources, from EventBus. Those streams are not emitted for private wikis, such as office.wikimedia.org.

As discussed on 23-09-11, a clone of the page_change stream that is dedicated to private wikis might be an option.

Minimum set of streams that are needed:

  • mediawiki.page_change.v1
  • mediawiki.cirrussearch.page_rerender.v1

Event Timeline

Ideally, we should think about private and public versions of every stream. Private streams still have all events, Public streams have only public events. Public streams can be compacted and created from private streams.

We can do this now for mediawiki_page_change, but doing so will cause events to be emitted to all other streams (e.g. mediawiki.recentchange, mediawiki.revision-create) too, and those older streams don't have configs for producing them to different stream names. EventBus also does not have per-stream enabling configs. So some dev work will need to happen in order to not expose private wikis' events in other streams.

Here's how this could be done now for mediawiki_page_change. Do not just do this though! We'll need to make sure other streams aren't produced first.

  • Declare the private stream in EventStreamConfig. Perhaps mediawiki.page_change.private.v1 ? See end of comment
  • Set wgEnableEventBus to TYPE_JOB|TYPE_EVENT in InitialiseSettings.php in mediawiki-config for the wikis in question. (Although, why aren't we sending purge events for these wikis? If we wanted to, we could just use TYPE_ALL) NOTE: This is what would cause ALL event types to be produced from the private wikis.
  • In mediawiki-config InitialiseSettings.php, set wgEventBusStreamNamesMap for the wikis in question to change the stream name that EventBus PageChangeHooks produces mediawiki_page_change events to (docs here):
'wgEventBusStreamNamesMap' => [
    'private' => [
        'mediawiki_page_change' => 'mediawiki.page_change.private.v1'
    ],
    // Other wikis if necessary? wikitech? etc.
]
'private' stream name convention:

Whatever we do here will set a convention for doing this in the future for other streams and datasets. I can imagine a 'public by default' future, which should be our dream, in which any dataset that is not named with this convention can be made public.

Alternatively, we could do this the other way around: explicitly name public datasets 'public' in some way. We are a little late considering this for mediawiki.page_change.v1, as it already exists and is exposed publicly.

This is relevant for streams as well as more static datasets. @mforns and others have had this discussion about the event vs event_sanitized Hive databases. We feel it would be better to invert their names, mainly because event_sanitized has the long term history of events; see T225751: Consider renaming event and event_sanitized Hive databases (I just updated that task with my understanding of that old discussion). It would be nice if, whatever we decide, the naming convention were consistent everywhere.

Ahoelzl set the point value for this task to 13. · Apr 4 2024, 12:33 AM

@Ahoelzl - what implementation is the 13 point estimate based on?

@gmodena @lbowmaker @tchin @Ahoelzl Let's discuss as a team what we want to do here. As stated above, a few code and config tweaks in EventBus will enable this, but the decision we make might be hard to change later.

I think we will want to implement some shorter term EventBus based solution here. But, there is a larger related conversation of how we manage public/private data in streams: {T241178}. We should consider this as we make a plan.

Discussed this in a meeting with Gabriele today. We discussed an ideal long term solution, and also a practical short term solution.

The long term solution has a lot of dependencies and would require us to implement platform capabilities we don't currently have. We are documenting this long term solution here so that we remember what we would like to do in the future.

I am drafting and editing these solutions here in this comment, and will move to task description once discussed, and then will document at Event_Platform/Decision_Log.

Ideal Long Term Solution

Ideally, we would be able to support this request as an Event & Data Platform capability:

We'd have

  • A standardized way (T354557: Dataset Config Store?) to configure a stream / dataset's privacy settings, hopefully in line with the new (as of 2024-06) Data Collection Guidelines. Ideally this would be the same way we configure sanitization of Table datasets in the Data Lake.
  • By default, streams are private.
  • A standard stream sanitization job would automate applying the configured privacy sanitization rules and producing a public version of a stream.
  • Public streams in Kafka would be compacted and keyed appropriately to allow for proactive deletion of messages. E.g. if a revision is suppressed, the public stream sanitization job would emit a new event either deleting the message by key (AKA tombstoning), or by producing a new message with the same key with the record redacted. (This would solve longstanding {T241178}.)
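
The compaction idea in the last bullet can be sketched with a small in-memory model. This is only an illustration of the semantics, not a Kafka client: compactLog() and the record shapes are hypothetical, and a real compacted topic keeps tombstones around for a retention period before dropping them.

```php
<?php
// Hypothetical in-memory model of a keyed, compacted topic (not a Kafka client).
// Each log entry is a [key, value] pair; a null value is a tombstone.
function compactLog(array $log): array {
    $latest = [];
    foreach ($log as [$key, $value]) {
        // A later record with the same key supersedes the earlier one.
        unset($latest[$key]);
        $latest[$key] = $value;
    }
    // After compaction, keys whose latest value is a tombstone are gone.
    return array_filter($latest, static fn ($value) => $value !== null);
}

$log = [
    ['rev:1', ['comment' => 'first edit']],
    ['rev:2', ['comment' => 'contains private info']],
    // Redaction: a new record with the same key and the sensitive field removed.
    ['rev:2', ['comment' => null, 'redacted' => true]],
    // Tombstone: delete the message by key.
    ['rev:1', null],
];

$compacted = compactLog($log);
var_dump($compacted);
```

Either approach relies on the public stream being keyed so that the suppressed record and its replacement share a key.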

Practical Short Term Solution

NOTE: this section is WIP.
  • Modify EventBus hook handlers to support configuration of the stream name of every stream they emit. This configuration already exists for the mediawiki_page_change stream, but not for any other stream. We just need to repeat this in every hook handler that emits a stream. Docs here and code here.
    • We might want to create a nicer PHP interface class to get EventBus related configs, rather than using MainConfig MW Service class directly like done in PageChangeHooks.php.
  • Set a wgEventStreamsDefaultSettings override for private wikis that sets producers.mediawiki_eventbus.enabled = false.

Something like

    'wgEventStreamsDefaultSettings' => [
        'default' => [ ... ],
        'private' => [
            'producers' => [
                'mediawiki_eventbus' => [
                    'enabled' => false
                ]
            ],
        ],
    ],

EventBus checks this setting to determine if events for a given stream should not be produced. By setting this as the default for private wikis, we can safely modify the value of wgEnableEventBus as described below.

NOTE: we need to check that mediawiki-config will properly deep merge this setting. E.g. if someone sets producers.mediawiki_eventbus.event_service_name, but not enabled, will the defaults be merged properly with the stream specific settings? (I think they will?)
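
To make the deep merge concern concrete, here is a minimal sketch of the two possible outcomes using plain PHP array functions. It does not show what mediawiki-config or EventStreamConfig actually do; the variable names and the 'eventgate-main' value are illustrative assumptions.

```php
<?php
// Illustrative values only; the setting names come from the discussion above.
$privateDefaults = [
    'producers' => [
        'mediawiki_eventbus' => [ 'enabled' => false ],
    ],
];
$streamSettings = [
    'producers' => [
        'mediawiki_eventbus' => [ 'event_service_name' => 'eventgate-main' ],
    ],
];

// Shallow merge: the stream's 'producers' array replaces the default wholesale,
// so the 'enabled' => false default is lost.
$shallow = array_merge($privateDefaults, $streamSettings);

// Recursive merge: nested keys are combined, so the default survives
// alongside the stream specific setting.
$deep = array_replace_recursive($privateDefaults, $streamSettings);

var_dump($shallow['producers']['mediawiki_eventbus']);
var_dump($deep['producers']['mediawiki_eventbus']);
```

If the real merge behaves like the shallow case, any stream that sets another producers.mediawiki_eventbus key would need to repeat enabled explicitly.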
  • Document a convention for naming private streams, and also how to declare and configure them, on the Stream_Configuration page, similar to how the Stream Versioning convention has been documented. Private streams will be named <base_stream_name>.private.<version>.
  • Declare separate private specific streams in wgEventStreams. These will need to specifically have producers.mediawiki_eventbus.enabled = true in order to override the default we set for private wikis.
  • Configure EventBus HookHandlers to use the private streams for private wikis:

In mediawiki-config InitialiseSettings.php, set wgEventBusStreamNamesMap for the private wikis to change the stream names that EventBus hooks produce to. E.g. (docs here)

'wgEventBusStreamNamesMap' => [
    'private' => [
        'mediawiki_page_change' => 'mediawiki.page_change.private.v1',
        // other streams as needed
    ],
    // Other wikis if necessary? wikitech? etc.
]

As I wrote the 'Practical Short Term Solution', I again came up against the awkwardness of the wgEnableEventBus config. I wonder if we shouldn't consider just removing and refactoring that, and to make use of the EventStreamConfig producer specific setting (producers.mediawiki_eventbus.enabled = false) added in T259712: Allow disabling/enabling configured streams via wgEventStreams config.

This would be more flexible than the wgEnableEventBus's 'TYPE_*' bitfield. But also, it could be risky, as this is currently what is being used to enable or disable JobQueue, purge, and 'regular event' emission on various wikis.

Change #1050060 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] [WIP] - Remove support for wgEnableEventBus and 'event type'

https://gerrit.wikimedia.org/r/1050060

I wonder if we shouldn't consider just removing and refactoring that, and to make use of the EventStreamConfig producer specific setting

I asked @Jdforrester-WMF in IRC what he thought, and he made this very valid point

It's probably fine, but a bit more complicated to reason about rather than a simple short list in prod config?

Here is an example of what ESC would look like if we removed wgEnableEventBus (1053750).

I agree that it is much more difficult to reason about. Even the per-wiki config rendering is a little confusing. To determine, e.g., whether 'mediawiki.job.*' streams are enabled on a private wiki, you have to either mentally combine the default ESC wgEventStreamsDefaultSettings with the +private wgEventStreamsDefaultSettings overrides, plus the default wgEventStreams with the +private wgEventStreams settings, or curl the action=streamconfigs API from the wiki in question.

I think it's too much. Let's not do it. Let's keep $wgEnableEventBus.


I will still consider getting rid of EventBusFactory, in favor of an EventBus singleton that uses wgEnableEventBus + StreamConfigs to do the right thing. Not sure yet, but I think that will simplify a few things.

I've been looking over the related code and pondering what all could potentially go wrong with the practical short term solution.

I'm a little worried about stream versioning. The proposal is to use EventBusStreamNamesMap which currently uses names like mediawiki_page_change when mapping mediawiki.page_change.v1 to the new value. What happens when it changes to v2? In this setup we would have to remember to update both the code and the configuration to match each other, and make sure they stay aligned with the train deployment. Proposal is to instead separate the name from the version, and have a small resolver that takes the name and version, maps the name, then appends the version.

Chatted with @EBernhardson in IRC.

Conclusion is that the keys in EventBusStreamNamesMap should match the default values of the stream name. E.g. to override the private wiki's stream name:

'wgEventBusStreamNamesMap' => [
    'private' => [
        'mediawiki.page_change.v1' => 'mediawiki.page_change.private.v1'
    ],
]

Creating a .v2 stream is the same as creating a new stream completely. Putting the '.v1' in the EventBusStreamNamesMap config will avoid confusion and difficulty if/when we need to do that.
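
A minimal sketch of a lookup keyed by the full default stream name, as concluded above. resolveStreamName() is a hypothetical helper for illustration, not the EventBus implementation.

```php
<?php
// Hypothetical helper, not the EventBus implementation: look up the full
// default stream name (including its version) and fall back to it when
// no override is configured.
function resolveStreamName(array $streamNamesMap, string $defaultName): string {
    return $streamNamesMap[$defaultName] ?? $defaultName;
}

// The map as it would be set for private wikis.
$privateWikiMap = [
    'mediawiki.page_change.v1' => 'mediawiki.page_change.private.v1',
];

$mapped = resolveStreamName($privateWikiMap, 'mediawiki.page_change.v1');
// A future v2 is a different stream: it is not remapped until someone
// adds an explicit entry for it.
$futureV2 = resolveStreamName($privateWikiMap, 'mediawiki.page_change.v2');

echo $mapped, "\n", $futureV2, "\n";
```

Because the key carries the version, introducing a .v2 stream requires a deliberate new map entry instead of silently inheriting the v1 mapping.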

Change #1050060 abandoned by Ottomata:

[mediawiki/extensions/EventBus@master] [WIP] - Remove support for wgEnableEventBus and 'event type'

Reason:

Won't do: https://phabricator.wikimedia.org/T346046#9974998

https://gerrit.wikimedia.org/r/1050060

Couple of WIP patches up for discussion.

The second depends on the first.

@gmodena I like the way this is headed, but I'm not sure if following through (and fully deprecating EventBusFactory) is worth it. Let's get together and discuss.

Change #1055239 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/EventBus@master] Centralize stream name mapping

https://gerrit.wikimedia.org/r/1055239

Change #1055242 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Utilize StreamNameMapper from EventBus

https://gerrit.wikimedia.org/r/1055242

Change #1055239 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Centralize stream name mapping

https://gerrit.wikimedia.org/r/1055239

Change #1055275 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Produce a limited set of event streams on private wikis

https://gerrit.wikimedia.org/r/1055275

Change #1055242 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Utilize StreamNameMapper from EventBus

https://gerrit.wikimedia.org/r/1055242

Couple of WIP patches up for discussion.

The second depends on the first.

@gmodena I like the way this is headed, but I'm not sure if following through (and fully deprecating EventBusFactory) is worth it. Let's get together and discuss.

I had a first pass over your CRs (and we chatted a bit about it), the approach makes sense IMHO.

1053785 lgtm, modulo fleshing out the todos/having tests pass (I left some comments).

I only have a couple of concerns about 1054934, but tbh I need to think more deeply about it. Some early thoughts (non blockers):

  • I do agree that EventBusFactory is a bit messy in its current state, and not necessarily needed.
  • This refactoring is moving complexity from EventBusFactory to EventBus and EventBus::send(). Is this really worth it?
  • Routing events by destination_event_service changes behavior. Now a single EventBus instance could talk to multiple EventGates. +1 for erroring if that's the case like you do now. I don't have a good understanding right now of when that code path would be triggered. Could you remind me why you introduced this feature (destination_event_service) ? Is it to noop streams that are disabled?
  • I would make EventBusFactory deprecation more explicit and move the discussion to a dedicated phab. To me, the two patches seem orthogonal.

This refactoring is moving complexity from EventBusFactory to EventBus and EventBus::send(). Is this really worth it?

I'm not sure. I think it removes some weirdness that is hard to reason about, especially the 'null' EventBus instance, but I'd bet that deprecating EventBusFactory will be more work than it looks like.

I don't have a good understanding right now of when that code path would be triggered.

It won't by any existing code. But with my change, it could be if someone calls send() with $events destined to different event services.

Could you remind me why you introduced this feature (destination_event_service) ?

destination_event_service maps a stream to an eventgate instance. E.g. mediawiki.page_change.v1 goes to eventgate-main (and thus to a Kafka main cluster), and mediawiki.api-request goes to eventgate-analytics (and thus to Kafka jumbo cluster). This is currently done by keeping EventBus instances that are keyed by event service name. The EventBus instance to use is selected by the stream's destination_event_service setting.
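
The routing described here can be sketched roughly as follows. This is not the EventBus code: groupByEventService() is a hypothetical helper, and reading the stream name from meta.stream on each event is an assumption of this sketch.

```php
<?php
// Illustrative stream settings; stream and service names are from the comment above.
$streamConfigs = [
    'mediawiki.page_change.v1' => [ 'destination_event_service' => 'eventgate-main' ],
    'mediawiki.api-request' => [ 'destination_event_service' => 'eventgate-analytics' ],
];

// Hypothetical helper: bucket events by the event service their stream is
// configured to use, failing loudly when a stream has no destination.
function groupByEventService(array $events, array $streamConfigs): array {
    $grouped = [];
    foreach ($events as $event) {
        // Assumption: each event carries its stream name in meta.stream.
        $stream = $event['meta']['stream'];
        $service = $streamConfigs[$stream]['destination_event_service'] ?? null;
        if ($service === null) {
            throw new InvalidArgumentException("No destination_event_service for $stream");
        }
        $grouped[$service][] = $event;
    }
    return $grouped;
}

$events = [
    [ 'meta' => [ 'stream' => 'mediawiki.page_change.v1' ] ],
    [ 'meta' => [ 'stream' => 'mediawiki.api-request' ] ],
    [ 'meta' => [ 'stream' => 'mediawiki.page_change.v1' ] ],
];
$grouped = groupByEventService($events, $streamConfigs);
print_r(array_map('count', $grouped));
```

A caller that supports only one event service per send() could treat more than one key in the result as an error.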

Is it to noop streams that are disabled?

Yes. The weirdness I was hoping to overcome is that currently, if the stream is disabled, destination_event_service will be ignored, and the 'null EventBus' configured to allow TYPE_NONE will be used.

I would make EventBusFactory deprecation more explicit and move the discussion to a dedicated phab. To me, the two patches seem orthogonal.

Okay, I agree. I was exploring this here because this change will make this weirdness more apparent, and thought perhaps a quick refactor could help. In hindsight the refactor isn't going to be so 'quick'.

Thanks for the help! I'll make a new phab and attach those WIP patches to it, and then let it ferment for years ;)

@EBernhardson thanks for picking up the practical implementation ;)

Change #1056965 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Produce a limited set of event streams on private wikis (pt 2)

https://gerrit.wikimedia.org/r/1056965

Change #1055275 merged by jenkins-bot:

[operations/mediawiki-config@master] Produce a limited set of event streams on private wikis (pt 1)

https://gerrit.wikimedia.org/r/1055275

Mentioned in SAL (#wikimedia-operations) [2024-07-29T20:34:48Z] <cjming@deploy1003> Started scap sync-world: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]]

Mentioned in SAL (#wikimedia-operations) [2024-07-29T20:36:37Z] <cjming@deploy1003> ebernhardson, cjming: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-07-29T20:42:19Z] <cjming@deploy1003> Finished scap: Backport for [[gerrit:1055275|Produce a limited set of event streams on private wikis (pt 1) (T346046)]] (duration: 07m 30s)

Change #1056965 merged by jenkins-bot:

[operations/mediawiki-config@master] Produce a limited set of event streams on private wikis (pt 2)

https://gerrit.wikimedia.org/r/1056965

Mentioned in SAL (#wikimedia-operations) [2024-07-29T20:55:34Z] <cjming@deploy1003> Started scap sync-world: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]]

Mentioned in SAL (#wikimedia-operations) [2024-07-29T21:00:11Z] <cjming@deploy1003> ebernhardson, cjming: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-07-29T21:06:14Z] <cjming@deploy1003> Finished scap: Backport for [[gerrit:1056965|Produce a limited set of event streams on private wikis (pt 2) (T346046)]] (duration: 10m 40s)

FTR, this broke all jobs at private wikis (T371433: JobQueueError: Could not enqueue jobs). I re-enabled mediawiki_eventbus via https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1058582 to unbreak stuff, and jobs appear to be back.

Unfortunately, reverting it like that has been producing events from private wikis to the public streams at stream.wikimedia.org.

Change #1058603 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] EventStreamConfig - fix for private wiki streams

https://gerrit.wikimedia.org/r/1058603

Change #1058603 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig - fix for private wiki streams

https://gerrit.wikimedia.org/r/1058603

Mentioned in SAL (#wikimedia-operations) [2024-07-31T13:42:16Z] <logmsgbot> lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1058603|EventStreamConfig - fix for private wiki streams (T346046 T371433)]]

Mentioned in SAL (#wikimedia-operations) [2024-07-31T13:44:23Z] <logmsgbot> lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, otto: Backport for [[gerrit:1058603|EventStreamConfig - fix for private wiki streams (T346046 T371433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-07-31T13:53:48Z] <logmsgbot> lucaswerkmeister-wmde@deploy1003 Finished scap: Backport for [[gerrit:1058603|EventStreamConfig - fix for private wiki streams (T346046 T371433)]] (duration: 11m 31s)