Investigate the impact of the WDQS graph split on constraints checks
Closed, Resolved · Public

Description

Problem:
With testing of the graph split under way, we need to understand its impact on constraint checks on Wikidata.
Some constraint checks use SPARQL queries to determine whether a statement violates a constraint. It would be useful to understand at least the following:

Which constraint types are affected because they currently rely on SPARQL queries?
How will each of them be affected by the split? Will they produce more false positives? More false negatives? Not run at all?
Are there ideas for mitigation and the effort associated with them?

Acceptance criteria:

We understand how constraint checks are potentially going to be affected by the graph split

Related Objects

Status	Assigned	Task
Open	None	T335067 Epic: Wikidata Query Service stabilization
Open	None	T337013 [Epic] Splitting the graph in WDQS
Resolved	Lucas_Werkmeister_WMDE	T355298 Investigate the impact of the WDQS graph split on constraints checks

Event Timeline

Lydia_Pintscher created this task. · Jan 18 2024, 10:00 AM

Restricted Application added a subscriber: Aklapper. · Jan 18 2024, 10:00 AM

Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board. · Jan 22 2024, 1:35 PM

Gehel edited projects, added Discovery-Search (Current work); removed Wikidata-Query-Service.

Gehel moved this task from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.

karapayneWMDE moved this task from Incoming to Product Backlog on the Wikidata Dev Team board. · Feb 13 2024, 3:09 PM

Daniel_Mietchen subscribed. · Apr 6 2024, 1:08 AM

dr0ptp4kt subscribed. · Apr 8 2024, 3:19 PM

dcausse added a comment.

Looking at the constraints, I believe that four may use SPARQL:

FormatChecker.php
TypeChecker.php
UniqueValueChecker.php
ValueTypeChecker.php

FormatChecker switched to using Shellbox, so I think it can be ignored.

TypeChecker and ValueTypeChecker use SPARQL to inspect the class hierarchy, which may or may not be affected by the split.
UniqueValueChecker, on the other hand, is most certainly affected by the split.

Arian_Bozorg updated the task description. · Jun 19 2024, 8:40 AM

Lucas_Werkmeister_WMDE added a comment.

In T355298#9768137, @dcausse wrote:

> Looking at the constraints, I believe that four may use SPARQL:
>
> FormatChecker.php
> TypeChecker.php
> UniqueValueChecker.php
> ValueTypeChecker.php

I think that’s all of them, yeah.

> FormatChecker switched to using Shellbox, so I think it can be ignored.

Agreed. Even if it does use SPARQL, it doesn’t use any of the data inside it, so we could still run it against any SPARQL server.

> TypeChecker and ValueTypeChecker use SPARQL to inspect the class hierarchy, which may or may not be affected by the split.

Yes. Notably, the initial lookup of the class to check (the subject’s “instance of” and/or “subclass of” statements) always happens in PHP, not in SPARQL. My assumption would be that the class hierarchy is always fully included in the main graph, and only individual instances are potentially in the scholarly graph; in that case, we could run all the “is subclass of” queries against the main graph. Is that correct?
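As a rough illustration of the “is subclass of” check described above, the following sketch builds an ASK query that could be sent to the main graph only. It is not the query WikibaseQualityConstraints actually issues; the function name `build_subclass_ask`, the placeholder endpoint URL, and the item IDs are assumptions made for the example.

```python
import re

# Assumption: a placeholder URL; the real main-graph host name is a deployment detail.
MAIN_GRAPH_ENDPOINT = "https://main-graph.example/sparql"

ITEM_ID = re.compile(r"Q[1-9][0-9]*")


def build_subclass_ask(class_id, expected_class_ids):
    """Return an ASK query: is class_id a transitive subclass of any expected class?"""
    # Validate IDs so that nothing unexpected is spliced into the query text.
    for item_id in [class_id, *expected_class_ids]:
        if not ITEM_ID.fullmatch(item_id):
            raise ValueError(f"not an item ID: {item_id!r}")
    expected = " ".join(f"wd:{item_id}" for item_id in expected_class_ids)
    # P279 is "subclass of"; the * makes the property path transitive (zero or more steps).
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f"ASK {{ VALUES ?expected {{ {expected} }} wd:{class_id} wdt:P279* ?expected }}"
    )


# Illustrative item IDs only; the class to check would come from the PHP-side lookup.
query = build_subclass_ask("Q5", ["Q215627", "Q35120"])
print(f"POST to {MAIN_GRAPH_ENDPOINT}:\n{query}")
```

Because the class to check is already known before the query runs, only the `wdt:P279*` traversal depends on the SPARQL endpoint, which is why a single main-graph endpoint may be enough here.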

> UniqueValueChecker, on the other hand, is most certainly affected by the split.

True… I guess we’ll always have to query all the graphs for other items with the same value?

Arian_Bozorg moved this task from Product Backlog to Unified DOT Backlog on the Wikidata Dev Team board. · Jun 19 2024, 9:33 AM

dcausse added a comment.

In T355298#9906297, @Lucas_Werkmeister_WMDE wrote:

>> TypeChecker and ValueTypeChecker use SPARQL to inspect the class hierarchy, which may or may not be affected by the split.

> Yes. Notably, the initial lookup of the class to check (the subject’s “instance of” and/or “subclass of” statements) always happens in PHP, not in SPARQL. My assumption would be that the class hierarchy is always fully included in the main graph, and only individual instances are potentially in the scholarly graph; in that case, we could run all the “is subclass of” queries against the main graph. Is that correct?

Yes, this is my understanding as well. The undesirable effects I can see are:

someone tagging an entity with a P31 that points to a scholarly article
someone introducing a scholarly article into the chain of “subclass of” statements, making the SPARQL property path ineffective

I'm not knowledgeable enough, but I suspect these problems should be quite rare and perhaps already identified via other means?
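To make the second effect concrete, here is a small in-memory simulation (not real WDQS data). It assumes that the “subclass of” statements of a scholarly article are stored in the scholarly graph, so a traversal limited to the main graph stops at that item. The graph names, the `Q_...` identifiers, and the helper `is_subclass` are all hypothetical.

```python
from collections import deque

# Hypothetical "subclass of" edges, keyed by the graph that stores them.
EDGES = {
    "main": {"Q_A": ["Q_ARTICLE"], "Q_B": ["Q_TOP"]},
    # Q_ARTICLE is a scholarly article, so its own statements live in the scholarly graph.
    "scholarly": {"Q_ARTICLE": ["Q_TOP"]},
}


def is_subclass(edges_by_graph, graphs, start, target):
    """Breadth-first search over 'subclass of' edges, using only the given graphs."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for graph in graphs:
            for parent in edges_by_graph[graph].get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
    return False


# The chain Q_A -> Q_ARTICLE -> Q_TOP is cut when only the main graph is consulted.
print(is_subclass(EDGES, ["main"], "Q_A", "Q_TOP"))
print(is_subclass(EDGES, ["main", "scholarly"], "Q_A", "Q_TOP"))
# A chain that never passes through a scholarly article is unaffected.
print(is_subclass(EDGES, ["main"], "Q_B", "Q_TOP"))
```

In this toy setup the main-graph-only check would report a violation for `Q_A` even though the full graph contains the path, which is the kind of result that should only occur when a scholarly article sits inside a class chain.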

>> UniqueValueChecker, on the other hand, is most certainly affected by the split.

> True… I guess we’ll always have to query all the graphs for other items with the same value?

Yes, this one might need some work to include federation. Without it, the impact might be false negatives, because duplicates in the scholarly graph would not be identified.

In T355298#9908389, @dcausse wrote:

> Yes, this is my understanding as well. The undesirable effects I can see are:
>
> someone tagging an entity with a P31 that points to a scholarly article

I guess this would actually be okay? It should just result in “true positives” for the “wrong reason” (IIUC) – there would be a constraint violation because the scholarly article is not a subclass of the expected class, and that constraint violation is right, regardless of which graph was consulted.

> someone introducing a scholarly article into the chain of “subclass of” statements, making the SPARQL property path ineffective
>
> I'm not knowledgeable enough, but I suspect these problems should be quite rare and perhaps already identified via other means?

Yeah, I think so. Or at least, I’m pretty sure we can go with this as a first version, and revisit if we unexpectedly get feedback that both graphs are crucial for correct subclass checking.

>>> UniqueValueChecker, on the other hand, is most certainly affected by the split.

>> True… I guess we’ll always have to query all the graphs for other items with the same value?

> Yes, this one might need some work to include federation. Without it, the impact might be false negatives, because duplicates in the scholarly graph would not be identified.

Yeah.

I think we’ve basically completed this investigation already? ^^

> Which constraint types are affected because they currently rely on SPARQL queries?

FormatChecker is unaffected. TypeChecker and ValueTypeChecker should continue to query the main graph (but we might have to configure a different host name for that?). UniqueValueChecker needs to be changed to query a list of endpoints.

> How will each of them be affected by the split? Will they produce more false positives? More false negatives? Not run at all?

If we do nothing, UniqueValueChecker will produce some false negatives.

> Are there ideas for mitigation and the effort associated with them?

Yes, we should make UniqueValueChecker query a list of endpoints (and configure that list in production). Shouldn’t be too difficult.
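A minimal sketch of the “query a list of endpoints” idea, under stated assumptions: the names `build_duplicates_query`, `find_duplicates`, and `run_query` are hypothetical, the query shape is illustrative rather than the one WikibaseQualityConstraints uses, and the HTTP call is replaced by an injected function with canned results so the example runs without network access.

```python
def build_duplicates_query(property_id, value, subject_id):
    """Build a query for other items that have the same string value for a property."""
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f'SELECT DISTINCT ?other WHERE {{ ?other wdt:{property_id} "{escaped}" . '
        f"FILTER(?other != wd:{subject_id}) }} LIMIT 10"
    )


def find_duplicates(endpoints, run_query, property_id, value, subject_id):
    """Run the same query against every endpoint and merge the item IDs found."""
    query = build_duplicates_query(property_id, value, subject_id)
    found = []
    for endpoint in endpoints:
        for item_id in run_query(endpoint, query):
            if item_id not in found:  # keep first-seen order, drop repeats
                found.append(item_id)
    return found


# Stand-in for the HTTP request: canned results per (hypothetical) endpoint name.
CANNED_RESULTS = {"main": [], "scholarly": ["Q2"]}


def fake_run_query(endpoint, query):
    return CANNED_RESULTS[endpoint]


# With only the main graph the duplicate is missed; with both graphs it is found.
print(find_duplicates(["main"], fake_run_query, "P356", "10.1234/example", "Q1"))
print(find_duplicates(["main", "scholarly"], fake_run_query, "P356", "10.1234/example", "Q1"))
```

Querying each endpoint separately and merging in the application keeps the checker independent of SPARQL federation; whether that or a federated query is preferable is left open here.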

Lucas_Werkmeister_WMDE moved this task from In Task Breakdown to Ready for Peer Review on the Wikidata Dev Team (Wikidata.org Slice) board. · Thu, Jun 20, 1:17 PM

Lydia_Pintscher added a comment.

We now have T369079 for the remaining work. Can this be closed?

dcausse moved this task from Blocked/Waiting to Needs Reporting on the Discovery-Search (Current work) board. · Mon, Jul 15, 3:12 PM

In T355298#9948029, @Lydia_Pintscher wrote:

> We now have T369079 for the remaining work. Can this be closed?

I guess so, yeah.
