Getting added_lines data
Closed, Declined (Public)

Description

Is there a way to get added_lines data without resorting to abuse filters? I need it because I plan to feed this data into a neural network, with the goal of catching LTAs while minimising false positives. A plain diff will not do, since it is likely to be noisier and more prone to false positives than added_lines, which provides what I'm looking for in one go.

Things like storing full wikitext would not be feasible, since I plan to test this bot globally to see how well it can catch LTAs in a variety of cases (and especially watch out for false positives).

I tried to get limited adminship at Meta just for that, but concerns were raised at https://meta.wikimedia.org/wiki/Meta:Requests_for_limited_adminship/Leaderboard_(2) (which also provides context on the alternatives I considered), and hence here I am.

Event Timeline

@Leaderboard: Hi, could you please add some context? Which code base is this about? Thanks!

To be clear, you want to get some sort of "added_lines" feed (not "added_edits") from which projects? The meta-wiki discussion suggested all WMF small+medium wikis, plus the opt-in set (enwikisource, frwiki, incubatorwiki, mediawikiwiki, metawiki, ptwiki, ruwikinews, specieswiki, test2wiki, testwiki, warwiki, and wikidatawiki).

Are you only trying to gather edits that are successful, or even attempted edits stopped by other pre-publish processes?

How are you looking to actually transfer this data, what type of retention are you expecting?

@Leaderboard: Hi, could you please add some context? Which code base is this about? Thanks!

Hi Andre, the linked Meta-Wiki discussion provides some context. In short, Leaderboard wishes to get access to certain data about incoming edits for some/all wikis.

The exact codebase where this should/could live is yet to be determined. The data can be provided via a new stream available under https://stream.wikimedia.org, a regularly updated dataset under https://analytics.wikimedia.org/published/datasets/, or it can be a new/updated MediaWiki API. The most relevant tags are probably Data-Engineering and AbuseFilter.

To be clear, you want to get some sort of "added_lines" feed (not "added_edits") from which projects?

As many projects as possible. This needs to be as real-time as possible (so something like EventStreams would be fine, but not an hourly dump).

Are you only trying to gather edits that are successful, or even attempted edits stopped by other pre-publish processes?

Ideally only successful edits - the existing AbuseFilter filters are good enough to work through attempted edits.

How are you looking to actually transfer this data, what type of retention are you expecting?

I have a Cloud VPS and Toolforge instance (both named statanalyser) that I plan to use for this purpose. There is no fixed retention period for the data, as it depends on how well the tests run.

Xaosflux renamed this task from Getting added_edits data to Getting added_lines data. Mar 4 2023, 9:49 AM

@lbowmaker it's been almost a year - is there an update on this?

@MW-Interfaces-Team - do we have an API to get added_lines?

@Leaderboard - Our team doesn't plan on implementing this stream anytime soon, but you could use this existing event stream to get edited pages and then, if we have an API for added_lines, call that for the events you care about.

https://stream.wikimedia.org/v2/ui/#/?streams=mediawiki.page_change.v1

We actually have a somewhat nice REST API for fetching diffs that should fit the bill: https://en.wikipedia.org/w/rest.php/v1/revision/1215949997/compare/1215612564
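One way to combine the two suggestions is to turn each event from the stream into a compare request. The sketch below is only an illustration: the field names `meta.domain`, `revision.rev_id` and `revision.rev_parent_id` are my assumption about the mediawiki.page_change.v1 event layout, `compare_url` is a made-up helper, and the `/w/rest.php` path is assumed to be the same on every wiki.

```python
from typing import Any, Optional


def compare_url(event: dict[str, Any]) -> Optional[str]:
    """Build a REST compare URL (parent revision -> new revision) for one event.

    Returns None when the domain or either revision id is missing,
    for example on a page creation that has no parent revision.
    """
    domain = event.get("meta", {}).get("domain")
    revision = event.get("revision", {})
    new_id = revision.get("rev_id")
    parent_id = revision.get("rev_parent_id")  # assumed field name
    if not (domain and new_id and parent_id):
        return None
    return f"https://{domain}/w/rest.php/v1/revision/{parent_id}/compare/{new_id}"


# Hypothetical event that reuses the revision ids mentioned in this task.
event = {
    "meta": {"domain": "en.wikipedia.org"},
    "revision": {"rev_id": 1216025285, "rev_parent_id": 1216024788},
}
print(compare_url(event))
```

Putting the parent revision first means lines introduced by the edit are on the "to" side of the comparison.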

aaron subscribed.

An example of a diff where lines are added is https://en.wikipedia.org/w/index.php?title=User:Spicy/spihelper_log&curid=74418657&diff=1216025285&oldid=1216024788

With that REST API, you can query https://en.wikipedia.org/w/rest.php/v1/revision/1216025285/compare/1216024788 which has a "diff" object of:

[
    {
      "type": 0,
      "lineNumber": 4427,
      "text": "* [[Wikipedia:Sockpuppet investigations/Andy murrey]] (full case) 15:57, 28 March 2024 (UTC)",
      "offset": {
        "from": 321248,
        "to": 321248
      }
    },
    {
      "type": 0,
      "lineNumber": 4428,
      "text": "** moved/merged case to Anantam tripathi",
      "offset": {
        "from": 321341,
        "to": 321341
      }
    },
    {
      "type": 2,
      "text": "* [[Wikipedia:Sockpuppet investigations/Anantam tripathi]] (section 26 March 2024) 16:01, 28 March 2024 (UTC)",
      "offset": {
        "from": 321382,
        "to": null
      }
    },
    {
      "type": 2,
      "text": "** commented",
      "offset": {
        "from": 321492,
        "to": null
      }
    }
  ]

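In this response the entries with "type": 0, whose "offset" has equal, non-null "from" and "to" values, are unchanged context lines. The entries with "type": 2 and a null "to" exist only in the first revision named in the URL. Because that URL puts the newer revision (1216025285) first, those are the lines the edit added; requesting /revision/1216024788/compare/1216025285 instead should report them as "type": 1.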
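A minimal sketch of pulling lines out of such a "diff" array by their "type" code. The meaning of the codes (0 context, 1 added, 2 removed) is taken from my reading of the REST API documentation rather than from anything stated in this task, and `lines_of_type` is a hypothetical helper.

```python
from typing import Any

# Line-type codes as I understand the REST compare endpoint (an assumption).
CONTEXT, ADDED, REMOVED = 0, 1, 2


def lines_of_type(diff: list[dict[str, Any]], wanted: int) -> list[str]:
    """Return the text of every diff entry whose "type" equals `wanted`."""
    return [entry["text"] for entry in diff if entry.get("type") == wanted]


# Trimmed copy of two entries from the response quoted above.
sample = [
    {
        "type": 0,
        "lineNumber": 4428,
        "text": "** moved/merged case to Anantam tripathi",
        "offset": {"from": 321341, "to": 321341},
    },
    {
        "type": 2,
        "text": "** commented",
        "offset": {"from": 321492, "to": None},
    },
]

# The request above names the newer revision first, so text that exists only
# in that revision is reported as REMOVED rather than ADDED.
print(lines_of_type(sample, REMOVED))
```

If the older revision is given first in the URL, the same helper would be called with `ADDED` instead.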