FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0".
Closed, Resolved · Public

Description

Trying to upload T200223: Server side upload for -revi, I failed miserably (at least twice, IIRC) with the warning:

FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0".

I have no idea what's wrong with my file here <,< And I think this is partially the reason I can't see the UploadStash page? (It 503s)

Event Timeline

Restricted Application added a subscriber: Aklapper.
revi added a subscriber: Bawolff.

Odd that UploadStash would fail; can you try again? Does the 503 report any further error?

I tried once again; the stash was actually empty at that moment (meh), so I re-tried the upload expecting it to fail (and UploadStash to fail again), but it managed to upload successfully this time (meh).

Ack, thanks for the report @revi! I'll defer to some people more intimate with UploadStash.

I got the same error while trying to upload a file with Rillke's Chunked Upload script on https://commons.wikimedia.org/wiki/File:L%27Illustration,_mars-ao%C3%BBt_1849.pdf
( https://commons.wikimedia.org/wiki/User:Rillke/bigChunkedUpload.js )

03592: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/2/2a/16981zp2ivt0.wq8wbr.1.pdf.0".

I think the file is in Upload stash: 16986mjwpyvc.x9ltso.1.pdf

And this is the second time it has happened today. There is another file stuck in the upload stash: 1697wll1czuk.rcpwyj.1.pdf

That error message is backend-fail-read in FileBackendStore::doConcatenate(), I believe. The earliest error in the request is Memcached error for key "/*/mw-wan/WANCache:v:filebackend:local-swift-eqiad:wikipedia-commons:file:<hash>" on server "127.0.0.1:11213": A TIMEOUT OCCURRED; that key seems to come from FileBackendStore::fileCacheKey(). Tracing back code paths in FileBackend is not really feasible (we should probably add a backtrace to the memcached error), but I'm guessing something in FileBackendStore or SwiftFileBackend messes up cache error handling and interprets "file stat not in cache" as "file does not exist".
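To make the suspicion concrete, here is a purely illustrative sketch in plain PHP (not actual FileBackendStore code, and with a simplified cache key): if the cache client returns the same false for "key not cached" and "memcached timed out", the caller cannot tell a miss from an error, and treating that false as "file does not exist" would produce exactly this symptom.

```php
<?php
// Illustration only: a false from the cache is ambiguous (miss OR timeout),
// so the safe behaviour is to fall back to a real stat against the backend,
// not to report the file as missing.
function statViaCache( callable $cacheGet, callable $backendStat, string $path ) {
	$stat = $cacheGet( 'filebackend:local-swift-eqiad:file:' . sha1( $path ) );
	if ( $stat === false ) {
		// Could be a cache miss, could be a memcached timeout. Treating
		// this as "file does not exist" would give backend-fail-read even
		// though Swift still has the object.
		return $backendStat( $path );
	}
	return $stat;
}

// Simulate a memcached timeout surfacing as a plain `false`:
var_dump( statViaCache(
	fn ( $key ) => false,                  // cache "timed out"
	fn ( $path ) => [ 'size' => 123 ],     // the real backend still has the file
	'mwstore://local-swift-eqiad/local-temp/a/ac/example.webm'
) );
```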

I was able to upload the 3 files that caused this bug.

Just received the same error on Commons for File:116th United States Congress House Floor - 2019-01-10.webm. Trying again.

It failed twice. :/ I guess I am going to use video2commons and then make a Phabricator tag if (when) it fails again. </3

FWIW, I've been investigating this. It does seem to be happening at Commons with some regularity for larger uploads. However, in the modern cases it is backend-fail-notexists, not the read error tgr mentions above.

What I have observed so far:

  • Uploading usually seems to go fine. In the debug log, the PUT for each individual chunk gets a 200 from Swift, so presumably they succeed.
  • Concatenation fails fast, usually within 0.3 seconds. This indicates it is probably not a timeout issue. It is so fast that I imagine none of the chunks can be downloaded (MW tries to download them all in parallel).
  • The failure is always with the first chunk. The chunks are checked in order, so this might just mean all chunks are missing, but at least the first is always missing.
  • The failure is not a read failure but a missing-file failure.

So it seems like either Swift is randomly losing the files, or something is deleting them. (Unfortunately, there is no logging in Logstash of DELETE operations, so I don't know directly. That is probably for good reason, as the log would be huge.)

These are the theories I have thought of so far. Not a lot of evidence to back up any of them.

  • Theory 1: The concatenate job gets triggered multiple times. The first run silently errors or times out after it deletes the original file (but rolls back the DB). The job is requeued to try again, but all the source files are missing at that point.
    • Problem: why would there be no logs of this? Also, if the concatenate operation fails, MW should abort before deleting the source file.
  • Theory 2: The files are being removed by cleanupUploadStash.php. This script is supposed to remove stashed files after 48 hours (6 hours for wikis other than Commons). Perhaps it has a bug where it deletes files too soon.
    • The script runs once an hour. When I looked through the logs, it did seem like this issue mostly happened to files that took a long time to upload, like ~45 minutes. (You can tell because the first part of a chunked-upload file name is a timestamp of when the upload started: unix epoch * 10^8, base36-encoded; see the sketch after this list.) However, there were a few examples where the upload only took about 10 minutes. It is hard to know, though, as the trigger might be the number of chunks or the total file size rather than time; all three are correlated.
    • I tried uploading a large file to testwiki. It eventually failed with this issue. The upload took a little more than 6 hours, so some of the chunks were probably deleted by the script. However, unlike the examples at Commons where the concatenation fails in less than a second, my example took 9.7 seconds for the concatenation to fail. This suggests to me that only some of the chunks were gone and the rest were still downloaded, which seems like a different failure case than what is being experienced on Commons.
  • Theory 3: There exists some other mechanism to expire files in Swift that I am unaware of. Swift supports this (the X-Delete-At and X-Delete-After headers), but as far as I can tell we don't use that feature.
  • Theory 4: Some sort of rate limit in Swift that MediaWiki handles poorly. It seems unlikely that this wouldn't cause an error to be logged in Logstash. Also, T162793 says that the MW account is exempt from rate limits.
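As an aside on the timestamp encoding mentioned in theory 2, here is a quick sketch in plain PHP (not MediaWiki code) of decoding that prefix; the key used is the Commons example reported further down in this task:

```php
<?php
// Decode the timestamp prefix of a chunked-upload stash key, assuming the
// "unix epoch * 10^8, then base36" scheme described above.
$fileKey = '1aqaxe3dattw.u1suhi.6080484.webm.0';
$prefix  = explode( '.', $fileKey )[0];             // "1aqaxe3dattw"
$started = intdiv( intval( $prefix, 36 ), 10 ** 8 );
echo gmdate( 'r', $started ), "\n";                 // Wed, 21 Feb 2024 20:19:09 +0000
```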

I'd be surprised (and unhappy!) were swift randomly losing objects. If you have object names (ideally plus timestamps) from a recent example I could go grobbling in the logs to check.

[When I looked into a similar complaint recently (T355610), it looked like concatenation/assembly/what-have-you failed and DELETEd the fragments, which then resulted in an object-not-found being returned to the user.]

Apropos theory 3: we do run the Swift object expirer, but the relevant headers are not set (except for some specific use cases, e.g. Phonos). So I don't think it can be that.

Maybe what is happening is that two assemble jobs are running at the same time and the first one is successful. Previously I was investigating the possibility of two jobs where both fail, and wondering why no errors happened. However, if the first one succeeds there would be no error, and the source files would be deleted. The second would then fail due to missing files. The second one overwrites the status in the DB, so the user only sees the failure. The only thing is that the first job should then make a new stash file, which, while never published, should still be visible in the user's upload stash.

Maybe I should add more logging to MW around this part, although I have no idea how two jobs would happen. A log entry is generated when a job is inserted, and there was only one. The Assemble job has retrying disabled.

Evidence for this theory:

  • When looking at logs related to the assembly job for request id "cce2ebe2-cc67-40e3-a4e4-e09308a9d640", I see two different job runner servers showing up in Logstash: one with the error, and one with a harmless warning. This suggests at least two jobs are running, one without errors and the other with a lot of errors, although the one without errors might possibly be a different, unrelated job. T355610 is also consistent with this theory.
  • Occasionally we do see publish jobs (but not assembly jobs) fail with "Could not acquire lock. Somebody else is doing something to this file.", which suggests some concurrency. However, the user can also trigger publish requests multiple times (they can do this for publish, not assemble), so it is possibly induced by a user and not a bug on our end. Also, the lock might not be on the source file but on the destination file, which could be locked for an unrelated reason.

T355610 looks like the same issue, so that is helpful. Thanks :)


I'd be surprised (and unhappy!) were swift randomly losing objects. If you have object names (ideally plus timestamps) from a recent example I could go grobbling in the logs to check.

Thanks. A recent example from Commons: mwstore://local-swift-codfw/local-temp/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0. It should have been first uploaded at about Wed, 21 Feb 2024 20:19:09 and reported missing at Feb 21, 2024 @ 21:03:12.363.

Change 1005632 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/core@master] Improve chunked upload jobs and abort assemble job if already in progress

https://gerrit.wikimedia.org/r/1005632

This is a Gerrit patch to detect the situation where jobs are racing each other. It will also abort the job if a race is detected, but if this race is actually happening with any frequency (which hopefully the logging code will tell us), we would probably want to use a real lock for robustness. In principle there should never be more than one assemble job for a specific file (not just at a time, but ever), so races should not be happening 🤷
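To illustrate the intent (a sketch only, not the actual change; the helper and the 'stage'/'result' array keys are assumptions for illustration, not MediaWiki's exact data model):

```php
<?php
// Before doing any assembly work, look at the stashed upload-status record
// and bail out if another job has already moved this upload past "queued".
function shouldAbortAssembly( ?array $status ): bool {
	if ( !is_array( $status ) ) {
		return false; // no record yet, nothing is racing us
	}
	$stage = $status['stage'] ?? 'unknown';
	if ( $stage !== 'queued' ) {
		// Another job is (or was) already assembling; abort instead of
		// racing it and deleting chunks out from under it.
		trigger_error( "Tried to assemble upload that is in stage $stage", E_USER_WARNING );
		return true;
	}
	return false;
}

var_dump( shouldAbortAssembly( [ 'stage' => 'assembling', 'result' => 'Poll' ] ) ); // bool(true)
var_dump( shouldAbortAssembly( [ 'stage' => 'queued', 'result' => 'Poll' ] ) );     // bool(false)
```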

Here's the relevant logs, sorted by time:

moss-fe2001.codfw.wmnet: Feb 21 20:19:14 moss-fe2001 proxy-server: 10.192.48.105 10.192.32.51 21/Feb/2024/20/19/14 PUT /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - - - tx948576b6a72e4600a62d1-0065d65ac1 - 0.9693 - - 1708546753.577275038 1708546754.546539783 -
ms-fe1009.eqiad.wmnet: Feb 21 20:19:16 ms-fe1009 proxy-server: 10.192.48.105 10.64.0.166 21/Feb/2024/20/19/16 PUT /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tke5beae87e... - - - tx4683f2a851d249ea89bf1-0065d65ac2 - 1.5437 - - 1708546754.563668013 1708546756.107330084 -
ms-fe2013.codfw.wmnet: Feb 21 20:59:24 ms-fe2013 proxy-server: 10.194.152.76 10.192.0.87 21/Feb/2024/20/59/24 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 200 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 97498542 - tx8b68cb8506d34d1a9bb03-0065d66420 - 12.5442 - - 1708549152.229032040 1708549164.773195744 0
ms-fe2013.codfw.wmnet: Feb 21 21:00:02 ms-fe2013 proxy-server: 10.194.152.76 10.192.0.87 21/Feb/2024/21/00/02 DELETE /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 204 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - - - txfabad3d0f009413ea82dd-0065d66452 - 0.0458 - - 1708549202.850069761 1708549202.895848513 0
ms-fe1009.eqiad.wmnet: Feb 21 21:00:03 ms-fe1009 proxy-server: 10.194.152.76 10.64.0.166 21/Feb/2024/21/00/03 DELETE /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 204 - wikimedia/multi-http-client%20v1.1 AUTH_tke5beae87e... - - - tx26b0b6c03f1b4577a65b7-0065d66453 - 0.0454 - - 1708549203.070249796 1708549203.115652323 0
ms-fe2013.codfw.wmnet: Feb 21 21:03:12 ms-fe2013 proxy-server: 10.194.155.232 10.192.0.87 21/Feb/2024/21/03/12 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 404 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 70 - tx36a819fdb39c4d61bb4ee-0065d66510 - 0.0335 - - 1708549392.151752949 1708549392.185259581 0
ms-fe2013.codfw.wmnet: Feb 21 21:03:12 ms-fe2013 proxy-server: 10.194.155.232 10.192.0.87 21/Feb/2024/21/03/12 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 404 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 70 - txfffc016f7af4438eac05b-0065d66510 - 0.0284 - - 1708549392.332885504 1708549392.361268997 0

From which, I see:

  • First, 10.192.48.105 (mw2283) uploads to eqiad and codfw at 20:19
  • Then at 20:59, 10.194.152.76 (I guess a k8s node?) does a successful GET
  • Almost immediately, it issues a DELETE to both eqiad and codfw, both of which succeed
  • Finally, at 21:03, 10.194.155.232 (another k8s node?) tries two GETs, both of which return 404 (as you'd expect, the object has been deleted)

Change 1005632 merged by jenkins-bot:

[mediawiki/core@master] Improve chunked upload jobs and abort assemble job if already in progress

https://gerrit.wikimedia.org/r/1005632

Correlating the Swift logs above with other logs, I think the first GET/DELETE is coming from mw-jobrunner.codfw.main-6c569f5b7d-jt5sx, and the second one from mw-jobrunner.codfw.main-6c569f5b7d-pckkc (both with reqid cce2ebe2-cc67-40e3-a4e4-e09308a9d640). [Commons does not generally seem to be on k8s nodes, but I think the job queue is.] This is consistent with the theory that there are two jobs running that are stepping on each other. My MW change should go live soon, and then we'll know more. (I guess we could also correlate with which jobs are running under that reqid, but I don't have access to mw-log anymore to check that particular log; in any case it's probably not necessary, as I'm now feeling pretty confident in this theory.)

Change 1005698 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/core@wmf/1.42.0-wmf.19] Improve chunked upload jobs and abort assemble job if already in progress

https://gerrit.wikimedia.org/r/1005698

Change 1005698 merged by jenkins-bot:

[mediawiki/core@wmf/1.42.0-wmf.19] Improve chunked upload jobs and abort assemble job if already in progress

https://gerrit.wikimedia.org/r/1005698

Mentioned in SAL (#wikimedia-operations) [2024-02-22T22:30:38Z] <cjming@deploy2002> Started scap: Backport for [[gerrit:1005698|Improve chunked upload jobs and abort assemble job if already in progress (T200820)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-22T22:32:02Z] <cjming@deploy2002> bawolff and cjming: Backport for [[gerrit:1005698|Improve chunked upload jobs and abort assemble job if already in progress (T200820)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-22T22:40:24Z] <cjming@deploy2002> Finished scap: Backport for [[gerrit:1005698|Improve chunked upload jobs and abort assemble job if already in progress (T200820)]] (duration: 09m 46s)

OK, so the new logs do show a bunch of:

Tried to assemble upload that is in stage assembling/Poll

They show up exactly 4 minutes after the initial start. The initial start never seems to complete, either.

I'm now thinking that what is happening is this: there is some sort of 4-minute timeout on jobs, the job is automatically retried if it times out (despite being a no-retry job), and the timeouts aren't logged anywhere.

The upload status is in MainObjectStash, which is presumably why a rollback does not touch it. (Not that it should be touched; we want it not to be. But I was initially confused about how that worked.)


So, things we could do here: first, we should figure out how to give the AssembleUploadChunksJob a longer timeout.

I guess it would also be nice to actually make the job retryable, but that seems more complicated.

Longer term, using swift large object support might be a better way to handle these files, since they are already chunked.

I wonder if the slow part of the process is deleting all the chunks. I'm not 100% sure, but as far as I can tell that is not parallelized. It would also help explain the errors being missing-file errors rather than running out of time in the middle of the concatenate. It's also not super important, since the chunks will eventually get deleted anyway; maybe we should rearrange things so the code reports success before doing the delete step (a rough sketch of that is below).

Presumably that would also make the job retryable on failure.

Edit: we don't see the Stashed combined ({chunks} chunks) of {oldkey} under new name {filekey} log entry, which suggests the job must be failing during chunk deletion, during file verification, or during stashing of the combined file. It looks like stashing a file calls $fileProps = $mwProps->getPropsFromPath( $path, true );, which I imagine calculates a SHA-1 hash, which is probably quite slow on a large file.
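A rough sketch of the "report success first, clean up after" rearrangement mentioned above (plain PHP with hypothetical function names, not the actual change):

```php
<?php
// Record success for the assembled file first, then treat chunk deletion as
// best-effort cleanup, so slow or failed deletes can no longer turn a
// successful assembly into a user-visible failure.
function assembleChunks(
	array $chunkKeys,
	callable $concatenate,   // the expensive, important part
	callable $reportStatus,  // e.g. writes the status record the API polls
	callable $deleteChunk    // per-chunk delete against the stash/Swift
): void {
	$newKey = $concatenate( $chunkKeys );

	// Report success *before* the slow cleanup step.
	$reportStatus( [ 'result' => 'Success', 'filekey' => $newKey ] );

	// Best-effort cleanup; in MediaWiki terms this would live in a deferred
	// update or follow-up job rather than inline.
	foreach ( $chunkKeys as $key ) {
		try {
			$deleteChunk( $key );
		} catch ( Throwable $e ) {
			// Leftover chunks are eventually removed by cleanupUploadStash.php anyway.
		}
	}
}
```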

Longer term, using swift large object support might be a better way to handle these files, since they are already chunked.

As an aside here, we don't currently support swift large objects on the servers (and some of our infrastructure assumes that we will continue not to do so).

Change 1005965 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/core@master] Make chunked upload delete chunks in deferred job

https://gerrit.wikimedia.org/r/1005965

Change 1006083 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/core@master] Optimize AssembleUploadChunks to reuse SHA1 hash instead of recalc

https://gerrit.wikimedia.org/r/1006083

Change 1006083 merged by jenkins-bot:

[mediawiki/core@master] Optimize AssembleUploadChunks to reuse SHA1 hash instead of recalc

https://gerrit.wikimedia.org/r/1006083

Change 1005965 merged by jenkins-bot:

[mediawiki/core@master] Make chunked upload delete chunks in deferred update

https://gerrit.wikimedia.org/r/1005965

Change 1007998 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/core@master] Make verifyPartialFile reuse SHA1 hash if we already know

https://gerrit.wikimedia.org/r/1007998

On Discord, Sazpaimon suggested having chunked upload keep a running total of the SHA1 hash so far (e.g. using https://www.php.net/manual/en/function.hash-update.php ). That way MW never has to calculate the entire hash in one go; instead the work is split over all the uploaded chunks.
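For concreteness, a minimal sketch of that idea in plain PHP (not MediaWiki code):

```php
<?php
// Keep one running SHA-1 context, feed each chunk into it as it arrives,
// and finalize once at the end, so nothing ever hashes the whole file in
// one pass.
$chunks = [ 'first chunk of data', 'second chunk', 'last chunk' ]; // stand-ins for uploaded chunks

$ctx = hash_init( 'sha1' );
foreach ( $chunks as $chunk ) {
	hash_update( $ctx, $chunk );   // hash_update_file() for on-disk chunk files
}
$incremental = hash_final( $ctx );

// Same digest as hashing everything in one go:
var_dump( $incremental === sha1( implode( '', $chunks ) ) ); // bool(true)
```

The awkward part in a chunked upload would be persisting the hash state between chunk requests; if I recall correctly, HashContext objects only became serializable in PHP 8.1, which is presumably part of the complexity being weighed here.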

I'm not sure if this is worth the complexity right now; it seems easier to just increase the timeout on AssembleUploadChunks jobs.

I guess PublishStashedFile could be further optimized if we didn't verify the file. In theory, the file should already have been verified when it was uploaded to the stash. If we don't need to verify the file, then there is no need to download it locally, and all operations can take place directly on Swift. However, maybe that's not worth it, and it's better to just be paranoid and re-verify the file at all stages.

Change 1007998 merged by jenkins-bot:

[mediawiki/core@master] Make verifyPartialFile reuse SHA1 hash if we already know

https://gerrit.wikimedia.org/r/1007998

I think we can close this ticket?
I'm sure some incidental problems might still exist, but reading the history of this ticket, those are probably better investigated in separate new tickets if they are causing a nuisance.

Bawolff claimed this task.

The biggest known issue at this point is that the jobs aren't restartable, yet they will be restarted if a deploy happens in the middle of a job (which will cause a failure). But I agree that is best served by other tickets.