User Details
- User Since
- Jun 9 2015, 9:03 AM (473 w, 6 d)
- Availability
- Available
- IRC Nick
- dcausse
- LDAP User
- DCausse
- MediaWiki User
- DCausse (WMF) [ Global Accounts ]
Fri, Jul 5
@Lucas_Werkmeister_WMDE thanks for the fix! I manually re-indexed this item with our new (WIP) tooling; it would have been fixed automatically by the cleanup process, but that could have taken up to 2 weeks in the worst case.
Thu, Jul 4
Should T192361 be re-opened and added as a subtask here?
Wed, Jul 3
Seems like \EntitySchema\Wikibase\DataValues\EntitySchemaValue::getType() is returning EntityIdValue::getType(), so some code treats it as an EntityIdValue (`VT:wikibase-entityid`); here WikibaseCirrusSearch is calling https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/8b3312396b4b8b91790d7b33c4703fb31bd290d8/repo/WikibaseRepo.datatypes.php#421 with an EntitySchemaValue.
The process is unable to render this document: https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=120965176&prop=cirrusbuilddoc fails with Caught exception of type TypeError:
Seems like https://packages.sury.org/php/dists/buster/ has recently started returning a 403
Tue, Jul 2
Mon, Jul 1
Moving a page from one namespace to another should now properly clean up the search index; existing phantom redirects might still be around for a couple of weeks while the automated cleanup process takes care of them. Please let me know if you see new instances of this problem in the future. Sorry for the inconvenience.
Hi, I'm having issues with a flink job running in staging that fails to deploy with the following error:
>>> Status | Error | DEPLOYED | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"pods \"flink-app-consumer-search-784bc9fd87-9n862\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"flink-main-container\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"flink-main-container\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"flink-main-container\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"flink-main-container\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")","additionalMetadata":{"reason":"FailedCreate"},"throwableList":[]}
The talk page is indeed ranked very low; it is quite recent (created in May 2024) and has 0 incoming links, so it ranks far behind https://he.wikipedia.org/wiki/%D7%A9%D7%99%D7%97%D7%AA_%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9, which has more than 3k incoming links. CirrusSearch indeed does not prioritize master pages over their subpages; if we want to do this it would have to be carefully evaluated, because one thing we can't do is rank a subpage lower solely relative to its own master page: all subpages would be down-ranked.
Fri, Jun 28
I added some logging info to get a sense of the numbers, moving to waiting while we gather a bit more info.
Tagging serviceops for help on envoy, to see if it can be used as a load balancer for the internal requests made from one blazegraph cluster to another without using LVS.
@Vgutierrez thanks for the help!
Wed, Jun 26
In the meantime an ugly workaround is to search both the EntitySchema and EntitySchema talk namespaces but filter on the content model using the keyword contentmodel:EntitySchema: https://www.wikidata.org/w/index.php?search=contentmodel%3AEntitySchema+intitle%3A%2FE%2F&title=Special:Search&profile=advanced&fulltext=1&ns640=1&ns641=1 .
Yes, this is sadly kind of expected (I should have told you about this on the config patch, sorry). The cleanup process had already started moving pages around while the entity schema namespace was considered non-content, so these pages are no longer findable now that it has been brought back into the content namespaces. I need to reindex these pages to make search work again, but sadly our tooling is not working as expected and I need to deploy https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/143 first to be able to fix the index. If this is causing major disruption I can mess with the index by hand, but I'd rather not do that unless strictly required. Sorry for the inconvenience!
Another instance of this issue was reported on wiki:
@dcausse (WMF): fwiw, I have 6 items updated on the 19 & 20 June - https://w.wiki/ASz6 - for which WDQS has not been updated ... on the production WDQS, not test. Only one of them was edited within the June 19 03:00–15:30 UTC window, afaics. It's not a problem for me, more of a FYI. --Tagishsimon (talk) 16:01, 21 June 2024 (UTC)
Surprisingly, E378, which is one of the schemas that is not searchable, appears to be indexed in the "content" index of wikidata, but AFAICT 640 is not a content namespace.
But it might have been considered a content namespace a few weeks ago.
I wonder if T363153, and especially https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1040113/, might be the reason for this change. When a namespace with existing documents has its search characteristics changed (wgContentNamespaces and/or wgNamespacesToBeSearchedDefault), the indexed docs are not moved automatically from one index to another; we rely on the saneitizer to slowly fix the inconsistencies. This is what might have happened here, and it would explain why the schemas suddenly disappeared and got re-indexed slowly over time.
The above reindex did not work as I expected; the attached patch should remedy this by allowing non-indexed pages to be re-indexed properly when manually re-indexing a whole namespace.
The root cause as to why these schemas were not indexed in the first place is yet to be investigated.
Tue, Jun 25
There are currently 354 pages indexed in the entity schema namespace, while the allpages API seems to suggest that there are 397 schemas.
Mon, Jun 24
Fri, Jun 21
Thu, Jun 20
After discussing this with Erik we have a rough plan:
- add a new lvs endpoint dedicated to internal federation and targeting a new port opened by nginx
- add a new port in the nginx config for which we add the X-Disable-Throttling and x-bigdata-read-only headers to the requests forwarded to blazegraph
- use the blazegraph service alias feature to map https://query-main.wikidata.org/sparql -> https://wdqs-main.discovery.wmnet:$NEW_PORT/sparql
- adapt ProxiedHttpConnectionFactory to allow the bypass of *.wmnet hostnames
Yes, this is my understanding as well; the undesirable effects I could see if some mistakes are made:
- someone tagging an entity with a P31 that points to a scholarly article
- introducing a scholarly article in the subclass-of chain, thus making the SPARQL property path ineffective
I'm not knowledgeable enough but I suspect these problems should be quite rare and perhaps already identified via other means?
Tue, Jun 18
We need 4 weeks to be able to backfill after an import: from the time the wikidata dump process starts, through the time required to shuffle the data around (compression, hdfs-rsync to hdfs), to the end of the import into blazegraph. See the initial lag column in T241128 for past import times. Perhaps 3 weeks would be manageable, but we went with 4 weeks to have extra room.
Fri, Jun 14
Unsure if it's feasible, but perhaps manually flagging a list of safe and very popular regexes could help reduce the number of requests to shellbox?
I did some testing and sadly when a wdqs node makes a query to https://query.wikidata.org it hits varnish again:
from wdqs1020 to https://query.wikidata.org (echo 'SELECT ?test_dcausse { ?test_dcausse ?p ?o . } LIMIT 1' | curl -f -s --data-urlencode query@- https://query.wikidata.org/sparql?format=json)
"x-request-id": "b34bb930-ef85-4b23-956e-7dcb11f0f7ec", "content-length": "99", "x-forwarded-proto": "http", "x-client-port": "40256", "x-bigdata-max-query-millis": "60000", "x-wmf-nocookies": "1", "x-client-ip": "2620:0:861:10a:10:64:131:24", "x-varnish": "800949377", "x-forwarded-for": "2620:0:861:10a:10:64:131:24\\, 10.64.0.79\\, 2620:0:861:10a:10:64:131:24", "x-requestctl": "", "x-cdis": "pass", "accept": "*/*", "x-real-ip": "2620:0:861:10a:10:64:131:24", "via-nginx": "1", "x-bigdata-read-only": "yes", "host": "query.wikidata.org", "content-type": "application/x-www-form-urlencoded", "connection": "close", "x-envoy-expected-rq-timeout-ms": "65000", "x-connection-properties": "H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;", "user-agent": "curl/7.74.0"
Thu, Jun 13
@RKemper I think we should now do a full import to measure the time it takes, in order to have a rough estimate to answer T367409
To have a full run we need to re-enable the updater on wdqs2023 (which I think will be done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042965)
The command to run should be (using the latest dumps):
cookbook sre.wdqs.data-reload \
  --task-id T349069 \
  --reason "Test wdqs reload based on HDFS" \
  --reload-data wikidata_full \
  --from-hdfs hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603 \
  --stat-host stat1009.eqiad.wmnet \
  wdqs2023.codfw.wmnet
Tue, Jun 11
Triggered a reindex of all the lexemes using https://gitlab.wikimedia.org/repos/search-platform/cirrus-rerender; it might take about 3 hours to complete.
Mon, Jun 10
Jun 6 2024
@RKemper for testing I created a smaller folder at hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/; it has only two chunks, so I hope it helps iterate a bit faster on this. The command should become:
cookbook sre.wdqs.data-reload \
  --task-id T349069 \
  --reason "Test wdqs reload based on HDFS" \
  --reload-data wikidata_full \
  --from-hdfs hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ \
  --stat-host stat1009.eqiad.wmnet \
  wdqs2023.codfw.wmnet
Jun 4 2024
Jun 3 2024
Yes (all the images under docker-registry.wikimedia.org/wikimedia/wikidata-query-flink-rdf-streaming-updater should no longer be used and can be safely removed if needed)
Sorry to see this happening again, it is probable that we missed some edge cases when deploying T317045.
May 31 2024
May 30 2024
Hi, we might have a use-case related to "other dumps" that might benefit from the Dumps 2.0 infrastructure; I filed T366248 with some details about it.
May 29 2024
The system should now index lexemes properly.
We still have to reindex all the lexemes to fix the ones created/edited before the fix was applied.
@BTullis thanks! Categories are reloaded via a cronjob on all WDQS machines; the job is due to run in about 30 minutes.
May 28 2024
Output with:
cirrus = (spark.table("discovery.cirrus_index").where('cirrus_replica="codfw" AND snapshot="20240428"'))
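For reference, a rough sketch of how this table can be poked at (the wikiid and namespace column names are assumptions about the cirrus_index schema; adjust as needed):

# Sketch (column names are assumptions about discovery.cirrus_index):
# count the indexed docs per namespace for wikidatawiki in this snapshot,
# useful to spot namespaces where the index and the wiki disagree.
per_ns = (
    cirrus
    .where('wikiid = "wikidatawiki"')
    .groupBy("namespace")
    .count()
    .orderBy("namespace")
)
per_ns.show(50, truncate=False)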
The search fields specific to Lexemes are currently ignored, which causes this NOTICE but also prevents lexemes (especially the new ones) from being searchable.
The schemas should be adapted to support these fields and the lexemes will have to be re-indexed.
@achou apart from expert search users explicitly searching for topics (which I suspect are rare), the Growth team is the only team using this data in a user-facing product. It is hard to tell what the impact would be for them, but I suspect that if only a few (<100) are lost it would hardly impact anything. If you suspect that more might be lost, perhaps having duplicates is better, if that is an option for you.
May 23 2024
May 16 2024
May 15 2024
May 14 2024
May 13 2024
May 7 2024
May 6 2024
Possible options I see so far:
- Run hdfs-rsync directly from the blazegraph hosts
  - cons: requires installing its dependencies
  - cons: opens a hole between blazegraph and the hadoop cluster
- Schedule hdfs-rsync on a stat machine, copying the ttl dumps from hdfs to /srv/analytics-search/wikibase_processed_dumps/wikidata/$SNAPSHOT
  - cons: consumes some space on a stat machine
- Run hdfs-rsync on-demand to copy the ttl dump from hdfs to /srv/analytics-search/wikibase_processed_dumps/temp and clean up this folder once done
  - cons: slows down the process a bit
Another approach could be to use the /mnt/hdfs mountpoint; I have been told that it might not be stable enough, but perhaps it's OK for doing a copy?
May 3 2024
Looking at the constraints, I believe that 4 may use SPARQL:
- FormatChecker.php
- TypeChecker.php
- UniqueValueChecker.php
- ValueTypeChecker.php
May 2 2024
@BTullis @bking I plan to use a cookbook to transfer some data out of hdfs to the blazegraph machines. A naive approach I thought about was to use a temp folder somewhere in /srv on a stat100x machine, populate it using hdfs dfs or hdfs-rsync, and then re-use the transferpy python module.
The current dumps are about 200G; do you think this option is viable? Can we use a folder in /srv as a temp folder for such transfers? This data is only useful for the transfer and should be deleted by the cookbook when it ends.
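To make the idea more concrete, a minimal sketch of what that cookbook step could look like (the destination directory and the transfer.py invocation are assumptions on my side; transferpy's exact interface may differ):

# Minimal sketch of the naive approach described above (destination dir and
# the transfer.py arguments are assumptions): copy the munged dump out of
# HDFS into a temp dir under /srv on a stat host, push it to the blazegraph
# host, then clean up.
import shutil
import subprocess

SNAPSHOT = "20240603"
HDFS_SRC = f"hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/{SNAPSHOT}"
TMP_DIR = "/srv/analytics-search/wikibase_processed_dumps/temp"
DEST_HOST = "wdqs2023.codfw.wmnet"
DEST_DIR = "/srv/wdqs/munged"  # assumed destination on the blazegraph host

try:
    # 1. Copy from HDFS to the local temp folder on the stat machine.
    subprocess.run(["hdfs", "dfs", "-get", HDFS_SRC, TMP_DIR], check=True)
    # 2. Ship the files to the blazegraph host (transferpy CLI assumed here;
    #    the cookbook could call the python module directly instead).
    subprocess.run(
        ["transfer.py", f"{TMP_DIR}/{SNAPSHOT}", f"{DEST_HOST}:{DEST_DIR}"],
        check=True,
    )
finally:
    # 3. The data is only useful for the transfer: always clean up the temp dir.
    shutil.rmtree(f"{TMP_DIR}/{SNAPSHOT}", ignore_errors=True)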