Description

When an exception is thrown during a compaction job, the task fails completely and terminates. The job stays on the queue and is retried when the message visibility timeout expires, which is 15 minutes by default.

If a job fails completely and is not returned to the queue (e.g. because it's sent to the dead letter queue), the files will never be compacted, since they're still assigned to that job.
Expected behaviour
A compaction job failing should not prevent a compaction task from continuing to process jobs.
A job which fails should be released back to the compaction job queue to be retried.
It will then automatically be moved to the dead letter queue if it has been retried too many times (this is built-in behaviour in SQS given that we've configured a dead letter queue). If a job ends up on the dead letter queue, the files can be left assigned to that job until a human deals with it.
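A minimal sketch of the intended behaviour, using an in-memory queue of `(receive_count, job)` pairs to stand in for SQS, and a hypothetical `process_job` function. In the real system SQS itself tracks receive counts and moves messages to the dead letter queue; here that redrive is simulated so the control flow is visible:

```python
MAX_RECEIVES = 3  # stands in for maxReceiveCount in the SQS redrive policy


def run_task(queue, dead_letter_queue, process_job):
    """Process jobs until the queue is empty; a failing job must not kill the task."""
    while queue:
        receive_count, job = queue.pop(0)
        try:
            process_job(job)
            # With SQS we would delete the message here on success.
        except Exception:
            if receive_count + 1 >= MAX_RECEIVES:
                # Retried too many times: park the job for a human to
                # investigate, leaving its input files assigned to it.
                dead_letter_queue.append(job)
            else:
                # Release the job back to the queue to be retried (with SQS
                # this happens when the visibility timeout expires).
                queue.append((receive_count + 1, job))
```

The key point is that the `try`/`except` sits inside the loop: one failing job is retried and eventually dead-lettered, while the task carries on with the other jobs.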
Background
This is also related to a separate issue where, if a compaction job fails its state store update, the files will never be compacted.
A job which fails should either be retried, or should result in the files being freed up to be assigned to another compaction job.
I'm not sure that if a job fails a few times we should free the files up to be assigned to another compaction job. If, for example, the compaction job is failing because one of the files is malformed, then this will just propagate the problem indefinitely. Or imagine that an iterator fails if a field takes a certain value - a file containing that value can never be compacted, so we don't want it to end up in another job as it still won't work.

I'd suggest we try a job N times and then, if it still fails, raise that as an issue for a human to investigate. We already try a job multiple times if it fails before it eventually ends up on the dead letter queue. The intention is that messages on the dead letter queue should be investigated manually. We could publish dead letters to an SNS topic to make it more obvious that something needs looking at.
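The "try N times then dead-letter" part needs no application code at all: it's the queue's redrive policy. A sketch of building that policy, assuming boto3 is used to configure the queue (the ARN below is hypothetical); SQS moves a message to the dead letter queue automatically once it has been received `maxReceiveCount` times without being deleted:

```python
import json


def build_redrive_policy(dlq_arn: str, max_receive_count: int) -> str:
    """Build the RedrivePolicy attribute value for an SQS queue.

    After max_receive_count failed receives of a message, SQS moves it to
    the dead letter queue named by dlq_arn - no application code involved.
    """
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })


# With boto3 this would be applied to the compaction job queue roughly as
# (not run here; queue URL and ARN are placeholders):
#   sqs = boto3.client("sqs")
#   sqs.set_queue_attributes(
#       QueueUrl=compaction_job_queue_url,
#       Attributes={"RedrivePolicy": build_redrive_policy(dlq_arn, 3)},
#   )
```

An SNS subscription (or a CloudWatch alarm on the dead letter queue's message count) could then surface dead letters to a human, as suggested above.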