User Details
- User Since
- Jun 9 2015, 9:03 AM (473 w, 6 d)
- Availability
- Available
- IRC Nick
- dcausse
- LDAP User
- DCausse
- MediaWiki User
- DCausse (WMF) [ Global Accounts ]
Fri, Jul 5
@Lucas_Werkmeister_WMDE thanks for the fix! I manually re-indexed this item with our new (WIP) tooling; it would have been fixed automatically by the cleanup process, but that could have taken up to 2 weeks in the worst case.
Thu, Jul 4
Should T192361 be re-opened and added as a subtask here?
Wed, Jul 3
Seems like \EntitySchema\Wikibase\DataValues\EntitySchemaValue::getType() is returning EntityIdValue::getType(), so some code treats it as an EntityIdValue (`VT:wikibase-entityid`); here WikibaseCirrusSearch is calling https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/8b3312396b4b8b91790d7b33c4703fb31bd290d8/repo/WikibaseRepo.datatypes.php#421 with an EntitySchemaValue.
The process is unable to render this document: https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=120965176&prop=cirrusbuilddoc fails with Caught exception of type TypeError:
Seems like https://packages.sury.org/php/dists/buster/ has recently started returning a 403
Tue, Jul 2
Mon, Jul 1
Moving a page from one namespace to another should now properly clean up the search index; existing phantom redirects might still be around for a couple of weeks while the automated cleanup process takes care of them. Please let me know if you see new instances of this problem in the future. Sorry for the inconvenience.
Hi, I'm having issues with a flink job running in staging that fails to deploy with the following error:
>>> Status | Error | DEPLOYED | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"pods \"flink-app-consumer-search-784bc9fd87-9n862\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"flink-main-container\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"flink-main-container\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"flink-main-container\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"flink-main-container\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")","additionalMetadata":{"reason":"FailedCreate"},"throwableList":[]}
The talk page is indeed ranked very low; it is quite recent (created in May 2024) and has 0 incoming links, so it ranks far behind https://he.wikipedia.org/wiki/%D7%A9%D7%99%D7%97%D7%AA_%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9, which has more than 3k incoming links. CirrusSearch indeed does not prioritize master pages over their subpages; if we want to do this it would have to be carefully evaluated, because one thing we can't do is rank a subpage lower solely relative to its own master page: all subpages would be down-ranked.
Fri, Jun 28
I added some logging info to get a sense of the numbers, moving to waiting while we gather a bit more info.
Tagging serviceops for help on envoy, to see if it can be used as a load balancer for the internal requests made from one blazegraph cluster to another without using LVS.
@Vgutierrez thanks for the help!
Wed, Jun 26
In the meantime an ugly workaround is to search both the EntitySchema and EntitySchema talk namespaces but filter on the content model using the keyword contentmodel:EntitySchema: https://www.wikidata.org/w/index.php?search=contentmodel%3AEntitySchema+intitle%3A%2FE%2F&title=Special:Search&profile=advanced&fulltext=1&ns640=1&ns641=1 .
Yes, this is sadly kind of expected (I should have told you about this on the config patch, sorry). The cleanup process had already started moving pages around while the entity schema namespace was considered non-content, so these pages are no longer findable now that it has been brought back into the content namespaces. I need to reindex these pages to make search work again, but sadly our tooling is not working as expected and I need to deploy https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/143 first to be able to fix the index. If this is causing major disruption I can mess with the index by hand, but I'd rather not do that unless strictly required. Sorry for the inconvenience!
Another instance of this issue was reported on wiki:
@dcausse (WMF): fwiw, I have 6 items updated on the 19 & 20 June - https://w.wiki/ASz6 - for which WDQS has not been updated ... on the production WDQS, not test. Only one of them was edited within the June 19 03:00–15:30 UTC window, afaics. It's not a problem for me, more of a FYI. --Tagishsimon (talk) 16:01, 21 June 2024 (UTC)
Surprisingly, E378, which is one of the schemas that is not searchable, appears to be indexed in the "content" index of wikidata, but AFAICT 640 is not a content namespace.
But it might have been considered a content namespace a few weeks ago.
I wonder if T363153, and especially https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1040113/, might be the reason for this change. When a namespace with existing documents has its search characteristics changed (wgContentNamespaces and/or wgNamespacesToBeSearchedDefault), the indexed docs are not moved automatically from one index to another; we rely on the saneitizer to slowly fix the inconsistencies. This is what might have happened here, and it would explain why the schemas suddenly disappeared and got re-indexed slowly over time.
The above reindex did not work as I expected; the attached patch should remedy this by allowing non-indexed pages to be re-indexed properly when manually re-indexing a whole namespace.
The root cause as to why these schemas were not indexed in the first place is yet to be investigated.
Tue, Jun 25
There are currently 354 pages indexed in the entity schema namespace, while the allpages API seems to suggest that there are 397 schemas.
Mon, Jun 24
Fri, Jun 21
Thu, Jun 20
After discussing this with Erik we have a rough plan:
- add a new lvs endpoint dedicated to internal federation and targeting a new port opened by nginx
- add a new port in the nginx config for which we add the X-Disable-Throttling and x-bigdata-read-only headers to the requests forwarded to blazegraph
- use the blazegraph service alias feature to map https://query-main.wikidata.org/sparql -> https://wdqs-main.discovery.wmnet:$NEW_PORT/sparql
- adapt ProxiedHttpConnectionFactory to allow the bypass of *.wmnet hostnames
Yes, this is my understanding as well; the undesirable effects I could see if some mistakes are made:
- someone tagging an entity with a P31 that points to a scholarly article
- introducing a scholarly article in the subclass-of chain, thus making the SPARQL property path ineffective
I'm not knowledgeable enough but I suspect these problems should be quite rare and perhaps already identified via other means?
Tue, Jun 18
We need 4 weeks to be able to backfill after an import: from the time the wikidata dump process starts, through the time required to shuffle the data around (compression, hdfs-rsync to hdfs), to the end of the import into blazegraph. See the initial lag column in T241128 for past import times. Perhaps 3 weeks would be manageable, but we went with 4 weeks to have extra room.
Fri, Jun 14
Unsure if it's feasible, but perhaps manually flagging a list of safe and very popular regexes could help reduce the number of requests to shellbox?
I did some testing and sadly when a wdqs node makes a query to https://query.wikidata.org it hits varnish again:
from wdqs1020 to https://query.wikidata.org (echo 'SELECT ?test_dcausse { ?test_dcausse ?p ?o . } LIMIT 1' | curl -f -s --data-urlencode query@- https://query.wikidata.org/sparql?format=json)
"x-request-id": "b34bb930-ef85-4b23-956e-7dcb11f0f7ec", "content-length": "99", "x-forwarded-proto": "http", "x-client-port": "40256", "x-bigdata-max-query-millis": "60000", "x-wmf-nocookies": "1", "x-client-ip": "2620:0:861:10a:10:64:131:24", "x-varnish": "800949377", "x-forwarded-for": "2620:0:861:10a:10:64:131:24\\, 10.64.0.79\\, 2620:0:861:10a:10:64:131:24", "x-requestctl": "", "x-cdis": "pass", "accept": "*/*", "x-real-ip": "2620:0:861:10a:10:64:131:24", "via-nginx": "1", "x-bigdata-read-only": "yes", "host": "query.wikidata.org", "content-type": "application/x-www-form-urlencoded", "connection": "close", "x-envoy-expected-rq-timeout-ms": "65000", "x-connection-properties": "H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;", "user-agent": "curl/7.74.0"
Thu, Jun 13
@RKemper I think we should now do a full import to measure the time it takes, in order to have a rough estimate to answer T367409
To have a full run we need to re-enable the updater on wdqs2023 (which I think will be done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042965)
The command to run should be (using the latest dumps):
cookbook sre.wdqs.data-reload \
  --task-id T349069 \
  --reason "Test wdqs reload based on HDFS" \
  --reload-data wikidata_full \
  --from-hdfs hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 \
  --stat-host stat1009.eqiad.wmnet \
  wdqs2023.codfw.wmnet
Tue, Jun 11
Triggered a reindex of all the lexemes using https://gitlab.wikimedia.org/repos/search-platform/cirrus-rerender; it might take about 3 hours to complete.
Mon, Jun 10
Jun 6 2024
@RKemper for testing I created a smaller folder at hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/; it has only two chunks, so I hope it helps iterate a bit faster on this. The command should become:
cookbook sre.wdqs.data-reload \
  --task-id T349069 \
  --reason "Test wdqs reload based on HDFS" \
  --reload-data wikidata_full \
  --from-hdfs hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ \
  --stat-host stat1009.eqiad.wmnet \
  wdqs2023.codfw.wmnet
Jun 4 2024
Jun 3 2024
Yes (all the images under docker-registry.wikimedia.org/wikimedia/wikidata-query-flink-rdf-streaming-updater should no longer be used and can be safely removed if needed)
Sorry to see this happening again, it is probable that we missed some edge cases when deploying T317045.
May 31 2024
May 30 2024
Hi, we might have a use-case related to "other dumps" that might benefit from the Dumps 2.0 infrastructure; I filed T366248 with some details about it.
May 29 2024
The system should now index lexemes properly.
We still have to reindex all the lexemes to fix the ones created/edited before the fix was applied.
@BTullis thanks! Categories are reloaded via a cronjob on all WDQS machines; the job is due to run in about 30 minutes.
May 28 2024
Output with:
cirrus = (spark.table("discovery.cirrus_index").where('cirrus_replica="codfw" AND snapshot="20240428"'))
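For reference, a rough sketch of how this table can be poked at (the wikiid and namespace column names are assumptions about the cirrus_index schema; adjust as needed):

# Sketch (column names are assumptions about discovery.cirrus_index):
# count the indexed docs per namespace for wikidatawiki in this snapshot,
# useful to spot namespaces where the index and the wiki disagree.
per_ns = (
    cirrus
    .where('wikiid = "wikidatawiki"')
    .groupBy("namespace")
    .count()
    .orderBy("namespace")
)
per_ns.show(50, truncate=False)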
The search fields specific to Lexemes are currently ignored, which causes this NOTICE but also prevents lexemes (especially the new ones) from being searchable.
The schemas should be adapted to support these fields and the lexemes will have to be re-indexed.
@achou apart from expert search users explicitly searching for topics (which I suspect are rare), the Growth team is the only team using this data in a user-facing product. It is hard to tell what the impact would be for them, but I suspect that if only a few (<100) are lost it would hardly impact anything. If you suspect that more might be lost, perhaps having duplicates is better, if that is an option for you.
May 23 2024
May 16 2024
May 15 2024
May 14 2024
May 13 2024
May 7 2024
May 6 2024
Possible options I see so far:
- Run hdfs-rsync directly from the blazegraph hosts
  - cons: requires installing its dependencies
  - cons: opens a hole between blazegraph and the hadoop cluster
- Schedule hdfs-rsync on a stat machine, copying the ttl dumps from hdfs to /srv/analytics-search/wikibase_processed_dumps/wikidata/$SNAPSHOT
  - cons: consumes some space on a stat machine
- Run hdfs-rsync on-demand to copy the ttl dump from hdfs to /srv/analytics-search/wikibase_processed_dumps/temp and clean up this folder once done
  - cons: slows down the process a bit
Another approach could be to use the /mnt/hdfs mountpoint; I have been told that it might not be stable enough, but perhaps it's OK for doing a copy?
May 3 2024
Looking at the constraints, I believe that 4 may use SPARQL:
- FormatChecker.php
- TypeChecker.php
- UniqueValueChecker.php
- ValueTypeChecker.php
May 2 2024
@BTullis @bking I plan to use a cookbook to transfer some data out of hdfs to the blazegraph machines. A naive approach I thought about was to use a temp folder somewhere in /srv on a stat100x machine, populate it using hdfs dfs or hdfs-rsync, and then re-use the transferpy python module.
The current dumps are about 200G; do you think this option is viable? Can we use a folder in /srv as a temp folder for such transfers? This data is only useful for the transfer and should be deleted by the cookbook when it ends.
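To make the idea more concrete, a minimal sketch of what that cookbook step could look like (the destination directory and the transfer.py invocation are assumptions on my side; transferpy's exact interface may differ):

# Minimal sketch of the naive approach described above (destination dir and
# the transfer.py arguments are assumptions): copy the munged dump out of
# HDFS into a temp dir under /srv on a stat host, push it to the blazegraph
# host, then clean up.
import shutil
import subprocess

SNAPSHOT = "20240603"
HDFS_SRC = f"hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/{SNAPSHOT}"
TMP_DIR = "/srv/analytics-search/wikibase_processed_dumps/temp"
DEST_HOST = "wdqs2023.codfw.wmnet"
DEST_DIR = "/srv/wdqs/munged"  # assumed destination on the blazegraph host

try:
    # 1. Copy from HDFS to the local temp folder on the stat machine.
    subprocess.run(["hdfs", "dfs", "-get", HDFS_SRC, TMP_DIR], check=True)
    # 2. Ship the files to the blazegraph host (transferpy CLI assumed here;
    #    the cookbook could call the python module directly instead).
    subprocess.run(
        ["transfer.py", f"{TMP_DIR}/{SNAPSHOT}", f"{DEST_HOST}:{DEST_DIR}"],
        check=True,
    )
finally:
    # 3. The data is only useful for the transfer: always clean up the temp dir.
    shutil.rmtree(f"{TMP_DIR}/{SNAPSHOT}", ignore_errors=True)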