
Compaction task fails entirely when an exception is thrown during a job #1533

Open
patchwork01 opened this issue Nov 14, 2023 · 1 comment

patchwork01 (Collaborator) commented Nov 14, 2023

Description

When an exception is thrown during a compaction job, the whole compaction task fails and terminates. The job stays on the queue and is retried once the message visibility timeout expires, which is 15 minutes by default.

If a job fails completely and is not returned to the queue (e.g. because it's sent to the dead letter queue), the files will never be compacted, since they're still assigned to that job.

Expected behaviour

A compaction job failing should not prevent a compaction task from continuing to process jobs.

A job which fails should be released back to the compaction job queue to be retried.

It will then automatically be moved to the dead letter queue if it has been retried too many times (this is built-in behaviour in SQS given that we've configured a dead letter queue). If a job ends up on the dead letter queue, the files can be left assigned to that job until a human deals with it.
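The expected behaviour above can be sketched as a task loop that catches per-job failures and keeps consuming jobs. This is a minimal, self-contained sketch, not Sleeper's actual implementation: the class and method names are hypothetical, and a plain Deque stands in for the SQS queue.

```java
import java.util.Deque;
import java.util.function.Consumer;

// Hypothetical sketch of the expected behaviour: the task loop catches
// per-job failures, releases the failed job back to the queue, and keeps
// consuming further jobs instead of terminating the whole task.
public class CompactionTaskSketch {
    private final Deque<String> queue;     // stands in for the SQS compaction job queue
    private final Consumer<String> runJob; // the actual compaction work
    private int jobsSucceeded = 0;
    private int jobsFailed = 0;

    public CompactionTaskSketch(Deque<String> queue, Consumer<String> runJob) {
        this.queue = queue;
        this.runJob = runJob;
    }

    public void run() {
        String job;
        while ((job = queue.poll()) != null) {
            try {
                runJob.accept(job);
                jobsSucceeded++;         // on success the message would be deleted
            } catch (RuntimeException e) {
                jobsFailed++;
                releaseBackToQueue(job); // make the message visible again for a retry
            }
        }
    }

    // With SQS this would be a ChangeMessageVisibility call with a timeout of 0,
    // so the job is retried promptly rather than after the 15 minute default.
    private void releaseBackToQueue(String job) {
        // Left as a no-op so this sketch terminates; a real task would hand
        // the message back and let SQS count the redelivery.
    }

    public int jobsSucceeded() { return jobsSucceeded; }
    public int jobsFailed() { return jobsFailed; }
}
```

The key point is that the try/catch sits inside the loop, so one failing job no longer ends the task.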

Background

This is also related to a separate issue: if a compaction job fails its state store update, the file will never be compacted.

patchwork01 added the bug label Nov 14, 2023
gaffer01 (Member) commented Nov 14, 2023

A job which fails should either be retried, or should result in the files being freed up to be assigned to another compaction job.

I'm not sure that if a job fails a few times we should free the files up to be assigned to another compaction job. If, for example, the compaction job is failing because one of the files is malformed, then this will just propagate the problem indefinitely. Or imagine that an iterator fails if a field takes a certain value: a file containing that value can never be compacted, so we don't want it to end up in another job, as it still won't work.

I'd suggest we try a job N times and then, if it still fails, raise that as an issue for a human to investigate. We already try a job multiple times if it fails before it eventually ends up on the dead letter queue. The intention is that messages on the dead letter queue should be investigated manually. We could surface dead letters to an SNS topic to make it more obvious that something needs looking at.
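The "try a job N times, then dead-letter it" behaviour described here is what an SQS redrive policy with a maxReceiveCount provides. Below is a minimal sketch of those semantics in plain Java, to make the flow concrete; the class name and in-memory counters are hypothetical, not how SQS or Sleeper implement it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of SQS redrive semantics: a message received more than
// maxReceiveCount times is moved to the dead letter queue for a human to
// investigate, instead of being retried (or reassigned) indefinitely.
public class RedriveSketch {
    private final int maxReceiveCount;
    private final Map<String, Integer> receiveCounts = new HashMap<>();
    private final Deque<String> deadLetterQueue = new ArrayDeque<>();

    public RedriveSketch(int maxReceiveCount) {
        this.maxReceiveCount = maxReceiveCount;
    }

    // Returns true if the job should be attempted, false if it was dead-lettered
    // (e.g. a malformed file or a failing iterator that will never succeed).
    public boolean receive(String job) {
        int count = receiveCounts.merge(job, 1, Integer::sum);
        if (count > maxReceiveCount) {
            deadLetterQueue.add(job);
            return false;
        }
        return true;
    }

    public Deque<String> deadLetterQueue() { return deadLetterQueue; }
}
```

In the real system the dead letter queue would then feed whatever alerting (e.g. an SNS topic) makes the failure visible to an operator.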
