Investigate the impact of the WDQS graph split on constraints checks
Closed, Resolved, Public

Description

Problem:
With work under way on testing the WDQS graph split, we need to understand its impact on constraint checks on Wikidata.
Some constraint checks use SPARQL queries to determine whether a statement violates a constraint. It would be useful to understand at least the following:

  • Which constraint types are affected because they currently rely on SPARQL queries?
  • How will each of them be affected by the split? Will they produce more false positives? More false negatives? Not run at all?
  • Are there ideas for mitigation and the effort associated with them?

Acceptance criteria:

  • We understand how constraint checks are potentially going to be affected by the graph split

Event Timeline

Looking at the constraint checkers, I believe four may use SPARQL:

  • FormatChecker.php
  • TypeChecker.php
  • UniqueValueChecker.php
  • ValueTypeChecker.php

FormatChecker switched to using Shellbox, so I think it can be ignored.

TypeChecker and ValueTypeChecker use SPARQL to inspect the class hierarchy, which may or may not be affected by the split.
UniqueValueChecker, on the other hand, is most certainly affected by the split.

> Looking at the constraint checkers, I believe four may use SPARQL:
>
>   • FormatChecker.php
>   • TypeChecker.php
>   • UniqueValueChecker.php
>   • ValueTypeChecker.php

I think that’s all of them, yeah.

> FormatChecker switched to using Shellbox, so I think it can be ignored.

Agreed. Even if it does use SPARQL, it doesn’t use any of the data inside it, so we could still run it against any SPARQL server.

> TypeChecker and ValueTypeChecker use SPARQL to inspect the class hierarchy, which may or may not be affected by the split.

Yes. Notably, the initial lookup of the class to check (the subject’s “instance of” and/or “subclass of” statements) always happens in PHP, not in SPARQL. My assumption would be that the class hierarchy is always fully included in the main graph, and only individual instances are potentially in the scholarly graph; in that case, we could run all the “is subclass of” queries against the main graph. Is that correct?
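
To make the main-graph idea concrete, here is a minimal Python sketch of what such an “is subclass of” query could look like. The function name `build_subclass_ask` and the example item IDs are hypothetical, and this is not the query the extension actually sends; it only illustrates a `wdt:P279*` property path that would be run against the main graph endpoint.

```python
def build_subclass_ask(class_id: str, expected_class_ids: list[str]) -> str:
    """Build an ASK query: is class_id a transitive subclass of any expected class?

    Hypothetical helper for illustration; the real checker builds its own query.
    """
    values = " ".join(f"wd:{qid}" for qid in expected_class_ids)
    return (
        "ASK {\n"
        f"  VALUES ?expected {{ {values} }}\n"
        # P279 is "subclass of"; the * makes the path zero-or-more steps long.
        f"  wd:{class_id} wdt:P279* ?expected .\n"
        "}"
    )


# Example IDs only; the class to check would come from the PHP-side lookup.
query = build_subclass_ask("Q5", ["Q35120"])
print(query)
```

Since the class to check is already known before the query is built, only the hierarchy traversal depends on which endpoint is used.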

> UniqueValueChecker, on the other hand, is most certainly affected by the split.

True… I guess we’ll always have to query all the graphs for other items with the same value?

>> TypeChecker and ValueTypeChecker use SPARQL to inspect the class hierarchy, which may or may not be affected by the split.

> Yes. Notably, the initial lookup of the class to check (the subject’s “instance of” and/or “subclass of” statements) always happens in PHP, not in SPARQL. My assumption would be that the class hierarchy is always fully included in the main graph, and only individual instances are potentially in the scholarly graph; in that case, we could run all the “is subclass of” queries against the main graph. Is that correct?

Yes, this is my understanding as well. The undesirable effects I could see are:

  • someone tagging an entity with a P31 statement that points to a scholarly article
  • someone introducing a scholarly article into the “subclass of” chain, thus making the SPARQL property path ineffective

I'm not knowledgeable enough to be sure, but I suspect these problems should be quite rare and are perhaps already identified via other means?

>> UniqueValueChecker, on the other hand, is most certainly affected by the split.

> True… I guess we’ll always have to query all the graphs for other items with the same value?

Yes, this one might need some work to include federation. Without it, the impact might be false negatives, because duplicates in the scholarly graph would not be identified.

> Yes, this is my understanding as well. The undesirable effects I could see are:
>
>   • someone tagging an entity with a P31 statement that points to a scholarly article

I guess this would actually be okay? It should just result in “true positives” for the “wrong reason” (IIUC) – there would be a constraint violation because the scholarly article is not a subclass of the expected class, and that constraint violation is right, regardless of which graph was consulted.

>   • someone introducing a scholarly article into the “subclass of” chain, thus making the SPARQL property path ineffective
>
> I'm not knowledgeable enough to be sure, but I suspect these problems should be quite rare and are perhaps already identified via other means?

Yeah, I think so. Or at least, I’m pretty sure we can go with this as a first version, and revisit if we unexpectedly get feedback that both graphs are crucial for correct subclass checking.
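
For illustration, the “scholarly article in the subclass chain” case can be simulated in a few lines of Python. This is only a sketch under a stated assumption: that statements whose subject is a scholarly article are stored solely in the scholarly graph. The item IDs and the `is_subclass_of` helper are hypothetical; the traversal just mimics what a `wdt:P279*` property path does.

```python
def is_subclass_of(subclass_edges: dict, start: str, target: str) -> bool:
    """Follow "subclass of" edges transitively from start, like wdt:P279*."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(subclass_edges.get(current, ()))
    return False


# Hypothetical chain: Q_a -> Q_article -> Q_c, where Q_article is a scholarly article.
full_graph = {"Q_a": ["Q_article"], "Q_article": ["Q_c"]}

# Assumption: the main graph has no statements whose subject is the scholarly article.
main_graph = {item: parents for item, parents in full_graph.items() if item != "Q_article"}

reachable_in_full = is_subclass_of(full_graph, "Q_a", "Q_c")
reachable_in_main = is_subclass_of(main_graph, "Q_a", "Q_c")
print(reachable_in_full, reachable_in_main)  # prints: True False
```

In this toy setup the main graph alone cannot complete the chain, which is the rare case we would revisit if it turns out to matter in practice.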

>>> UniqueValueChecker, on the other hand, is most certainly affected by the split.

>> True… I guess we’ll always have to query all the graphs for other items with the same value?

> Yes, this one might need some work to include federation. Without it, the impact might be false negatives, because duplicates in the scholarly graph would not be identified.

Yeah.


I think we’ve basically completed this investigation already? ^^

> Which constraint types are affected because they currently rely on SPARQL queries?

FormatChecker is unaffected. TypeChecker and ValueTypeChecker should continue to query the main graph (but we might have to configure a different host name for that?). UniqueValueChecker needs to be changed to query a list of endpoints.

> How will each of them be affected by the split? Will they produce more false positives? More false negatives? Not run at all?

If we do nothing, UniqueValueChecker will produce some false negatives.

> Are there ideas for mitigation and the effort associated with them?

Yes, we should make UniqueValueChecker query a list of endpoints (and configure that list in production). Shouldn’t be too difficult.
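
A rough Python sketch of the “list of endpoints” idea, to show why querying only one graph gives false negatives. Everything here is hypothetical: the runner functions stand in for real SPARQL endpoints, and the item IDs, the DOI value and the function name `find_other_entities_with_value` are made up for the example. The real change would of course live in the PHP checker and its configuration.

```python
from typing import Callable, Iterable

# A runner takes a SPARQL query string and returns the matching entity IDs.
QueryRunner = Callable[[str], Iterable[str]]


def find_other_entities_with_value(
    runners: list[QueryRunner], query: str, own_id: str
) -> set[str]:
    """Run the same query on every endpoint and merge the results."""
    found: set[str] = set()
    for run in runners:
        found.update(run(query))
    found.discard(own_id)  # the entity being checked is not its own duplicate
    return found


# Stand-ins for the two graphs: Q1 lives in the main graph, Q2 in the scholarly graph.
def main_graph_runner(query: str) -> list[str]:
    return ["Q1"]


def scholarly_graph_runner(query: str) -> list[str]:
    return ["Q2"]


# P356 is the DOI property; the value is an invented example.
query = 'SELECT ?other WHERE { ?other wdt:P356 "10.1000/example" }'

main_only = find_other_entities_with_value([main_graph_runner], query, "Q1")
all_graphs = find_other_entities_with_value(
    [main_graph_runner, scholarly_graph_runner], query, "Q1"
)
print(main_only, all_graphs)  # prints: set() {'Q2'}
```

With only the main graph, the duplicate Q2 is missed (a false negative); with both runners in the list, it is found.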

We now have T369079 for the remaining work. Can this be closed?

Lucas_Werkmeister_WMDE claimed this task.

> We now have T369079 for the remaining work. Can this be closed?

I guess so, yeah.