Getting added_lines data
Closed, Declined · Public

Assigned To

None

Authored By

	Leaderboard
	Mar 3 2023, 4:46 PM

Description

Is there a way to get added_lines data without resorting to abuse filters? The reason this is needed is that I plan to feed this data into a neural network, with the goal of catching LTAs while minimising false positives. It cannot simply use the diff, since that is likely to be noisier and more prone to false positives (compared to added_lines, which provides what I'm looking for in one go).

Things like storing full wikitext would not be feasible, since I plan to test this bot globally to see how well it can catch LTAs in a variety of cases (and especially watch out for false positives).

I tried to get limited adminship at Meta just for that, but concerns were raised at https://meta.wikimedia.org/wiki/Meta:Requests_for_limited_adminship/Leaderboard_(2) (which also provides context on the alternatives I considered), and hence here I am.

Event Timeline

Leaderboard created this task. Mar 3 2023, 4:46 PM

Restricted Application added a subscriber: Aklapper. Mar 3 2023, 4:46 PM

Leaderboard updated the task description. Mar 3 2023, 4:48 PM

Leaderboard updated the task description.

Leaderboard added subscribers: Urbanecm, WhatamIdoing, Billinghurst.

Leaderboard added a subscriber: Xaosflux. Mar 3 2023, 4:50 PM

@Leaderboard: Hi, could you please add some context? Which code base is this about? Thanks!

To be clear, you want to get some sort of "added_lines" feed (not an "added_edits" feed) from which projects? The meta-wiki discussion suggested all WMF small+medium wikis, plus the opt-in set (enwikisource, frwiki, incubatorwiki, mediawikiwiki, metawiki, ptwiki, ruwikinews, specieswiki, test2wiki, testwiki, warwiki, and wikidatawiki).

Are you only trying to gather edits that are successful, or even attempted edits stopped by other pre-publish processes?

How are you looking to actually transfer this data, what type of retention are you expecting?

In T331150#8664702, @Aklapper wrote:

@Leaderboard: Hi, could you please add some context? Which code base is this about? Thanks!

Hi Andre, the linked Meta-Wiki discussion provides some context. In short, Leaderboard wishes to get access to certain data about incoming edits for some/all wikis.

The exact codebase where this should/could live is yet to be determined. The data could be provided via a new stream available under https://stream.wikimedia.org, a regularly updated dataset under https://analytics.wikimedia.org/published/datasets/, or a new/updated MediaWiki API. The most relevant tags are probably Data-Engineering and AbuseFilter.

Billinghurst unsubscribed. Mar 4 2023, 12:40 AM

To be clear, you want to get some sort of "added_lines" feed (not an "added_edits" feed) from which projects?

As many projects as possible. This needs to be as real-time as possible (so something like EventStreams would be fine, but not an hourly dump).

Are you only trying to gather edits that are successful, or even attempted edits stopped by other pre-publish processes?

Ideally only successful edits - the existing AbuseFilter filters are good enough to work through attempted edits.

How are you looking to actually transfer this data, what type of retention are you expecting?

I have a Cloud VPS and Toolforge instance (both named statanalyser) that I plan to use for this purpose. There is no fixed retention period for the data, as it depends on how well the tests run.

Xaosflux renamed this task from Getting added_edits data to Getting added_lines data. Mar 4 2023, 9:49 AM

Leaderboard edited subscribers, added: Whatamidoing-WMF; removed: WhatamIdoing. Mar 4 2023, 10:45 AM

TheresNoTime subscribed. Mar 16 2023, 6:15 PM

Aklapper added a project: Data-Engineering. Apr 3 2023, 2:22 PM

lbowmaker moved this task from Incoming (new tickets) to Event Platform Backlog on the Data-Engineering board. Apr 7 2023, 2:12 PM

@lbowmaker it's been almost a year - is there an update on this?

@MW-Interfaces-Team - do we have an API to get added_lines?

@Leaderboard - Our team doesn’t plan on implementing this stream anytime soon, but you could use this existing event stream to get edited pages and then, if we have an API for added_lines, call that for the events you care about.

https://stream.wikimedia.org/v2/ui/#/?streams=mediawiki.page_change.v1
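
The suggestion above (watch the existing stream, then look up each edit separately) could be sketched roughly as follows. This is only an illustration, not an official client: the helper names iter_edits, compare_url and EditRef are made up here, the event field names (page_change_kind, revision.rev_id, revision.rev_parent_id, meta.domain) are assumptions about the mediawiki.page_change.v1 schema that should be checked against the stream itself, and the URL pattern is copied from the REST compare endpoint that comes up later in this thread. The code only parses server-sent-event lines that have already been read; it does no networking.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class EditRef(NamedTuple):
    """Hypothetical container for the revisions involved in one edit."""
    domain: str
    parent_rev_id: int
    rev_id: int


def iter_edits(sse_lines: Iterable[str]) -> Iterator[EditRef]:
    """Yield one EditRef per edit event found in raw SSE lines."""
    for line in sse_lines:
        # Server-sent events carry the JSON payload on "data:" lines.
        if not line.startswith("data:"):
            continue
        try:
            event = json.loads(line[len("data:"):])
        except json.JSONDecodeError:
            continue  # skip partial or malformed payloads
        if not isinstance(event, dict):
            continue
        # Assumed field names; verify them against the stream's schema.
        if event.get("page_change_kind") != "edit":
            continue
        revision = event.get("revision", {})
        rev_id = revision.get("rev_id")
        parent_rev_id = revision.get("rev_parent_id")
        domain = event.get("meta", {}).get("domain")
        if rev_id is None or parent_rev_id is None or domain is None:
            continue
        yield EditRef(domain, parent_rev_id, rev_id)


def compare_url(edit: EditRef) -> str:
    """Build a REST compare URL going from the parent to the new revision."""
    return (
        f"https://{edit.domain}/w/rest.php/v1/revision/"
        f"{edit.parent_rev_id}/compare/{edit.rev_id}"
    )
```

In practice the lines would come from a streaming HTTP response for the stream, and each yielded EditRef would be turned into one compare request; rate limiting and reconnection are left out of this sketch.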

Johannnes89 subscribed. Mar 10 2024, 1:45 PM

We actually have a somewhat nice REST API for fetching diffs that should fit the bill: https://en.wikipedia.org/w/rest.php/v1/revision/1215949997/compare/1215612564

An example of a diff where lines are added is https://en.wikipedia.org/w/index.php?title=User:Spicy/spihelper_log&curid=74418657&diff=1216025285&oldid=1216024788

With that REST API, you can query https://en.wikipedia.org/w/rest.php/v1/revision/1216025285/compare/1216024788 which has a "diff" object of:

[
    {
      "type": 0,
      "lineNumber": 4427,
      "text": "* [[Wikipedia:Sockpuppet investigations/Andy murrey]] (full case) 15:57, 28 March 2024 (UTC)",
      "offset": {
        "from": 321248,
        "to": 321248
      }
    },
    {
      "type": 0,
      "lineNumber": 4428,
      "text": "** moved/merged case to Anantam tripathi",
      "offset": {
        "from": 321341,
        "to": 321341
      }
    },
    {
      "type": 2,
      "text": "* [[Wikipedia:Sockpuppet investigations/Anantam tripathi]] (section 26 March 2024) 16:01, 28 March 2024 (UTC)",
      "offset": {
        "from": 321382,
        "to": null
      }
    },
    {
      "type": 2,
      "text": "** commented",
      "offset": {
        "from": 321492,
        "to": null
      }
    }
  ]

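In this response, the "type" 0 entries are unchanged context lines: they carry a "lineNumber" and non-null "from" and "to" offsets. The "type" 2 entries, whose "to" offset is null, exist only in the first ("from") revision of the URL. Because the query above compares the newer revision 1216025285 against the older 1216024788, those "type" 2 entries are the lines the edit added; with the revisions in the opposite order, the same lines would be reported as "type" 1 (added).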
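
A minimal sketch of pulling the added lines out of such a "diff" array, assuming the type codes documented for the MediaWiki REST compare endpoint (0 = context, 1 = present only in the "to" revision, 2 = present only in the "from" revision). The function name added_lines and the from_is_newer flag are made up for this example. Changed and moved lines (other type codes) are ignored here, so the result may not match AbuseFilter's added_lines variable exactly.

```python
from typing import Any, Dict, List

# Type codes as documented for the REST compare endpoint (assumed here).
CONTEXT, ADDED, DELETED = 0, 1, 2


def added_lines(diff: List[Dict[str, Any]], from_is_newer: bool = False) -> List[str]:
    """Return the text of lines that exist only in the newer revision.

    If the compare URL was built as /revision/{older}/compare/{newer},
    those lines have type 1. If the newer revision was given first, as in
    the query above, they appear as type 2 with a null "to" offset.
    """
    wanted = DELETED if from_is_newer else ADDED
    return [entry["text"] for entry in diff if entry.get("type") == wanted]
```

For the response quoted above, added_lines(diff, from_is_newer=True) picks out the two log entries stamped 16:01, while the two "type" 0 context lines are skipped.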
