
Compaction task fails entirely when an exception is thrown during a job #1533

Open
patchwork01 opened this issue Nov 14, 2023 · 1 comment

patchwork01 (Collaborator) commented Nov 14, 2023

Description

When an exception is thrown during a compaction job, the whole compaction task fails and terminates. The job stays on the queue and is retried once the message visibility timeout expires, which is 15 minutes by default.

If a job fails completely and is not returned to the queue (e.g. because it's sent to the dead letter queue), the files will never be compacted, since they're still assigned to that job.

Expected behaviour

A compaction job failing should not prevent a compaction task from continuing to process jobs.

A job which fails should be released back to the compaction job queue to be retried.

It will then automatically be moved to the dead letter queue if it has been retried too many times (this is built-in behaviour in SQS given that we've configured a dead letter queue). If a job ends up on the dead letter queue, the files can be left assigned to that job until a human deals with it.
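The expected behaviour above can be sketched as a task loop that catches per-job failures and keeps consuming jobs. This is a minimal, self-contained sketch, not Sleeper's actual implementation: the class and method names are hypothetical, and a plain Deque stands in for the SQS queue.

```java
import java.util.Deque;
import java.util.function.Consumer;

// Hypothetical sketch of the expected behaviour: the task loop catches
// per-job failures, releases the failed job back to the queue, and keeps
// consuming further jobs instead of terminating the whole task.
public class CompactionTaskSketch {
    private final Deque<String> queue;     // stands in for the SQS compaction job queue
    private final Consumer<String> runJob; // the actual compaction work
    private int jobsSucceeded = 0;
    private int jobsFailed = 0;

    public CompactionTaskSketch(Deque<String> queue, Consumer<String> runJob) {
        this.queue = queue;
        this.runJob = runJob;
    }

    public void run() {
        String job;
        while ((job = queue.poll()) != null) {
            try {
                runJob.accept(job);
                jobsSucceeded++;         // on success the message would be deleted
            } catch (RuntimeException e) {
                jobsFailed++;
                releaseBackToQueue(job); // make the message visible again for a retry
            }
        }
    }

    // With SQS this would be a ChangeMessageVisibility call with a timeout of 0,
    // so the job is retried promptly rather than after the 15 minute default.
    private void releaseBackToQueue(String job) {
        // Left as a no-op so this sketch terminates; a real task would hand
        // the message back and let SQS count the redelivery.
    }

    public int jobsSucceeded() { return jobsSucceeded; }
    public int jobsFailed() { return jobsFailed; }
}
```

The key point is that the try/catch sits inside the loop, so one failing job no longer ends the task.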

Background

This is also related to a separate issue: if a compaction job fails its state store update, the file will never be compacted.

patchwork01 added the bug label Nov 14, 2023
gaffer01 (Member) commented Nov 14, 2023

A job which fails should either be retried, or should result in the files being freed up to be assigned to another compaction job.

I'm not sure that if a job fails a few times we should free the files up to be assigned to another compaction job. If, for example, the compaction job is failing because one of the files is malformed, then this will just propagate the problem indefinitely. Or imagine that an iterator fails if a field takes a certain value: a file containing that value can never be compacted, so we don't want it to end up in another job, as it still won't work.

I'd suggest we try a job N times and then, if it still fails, raise that as an issue for a human to investigate. We already try a job multiple times if it fails before it eventually ends up on the dead letter queue. The intention is that messages on the dead letter queue should be investigated manually. We could surface dead letters to an SNS topic to make it more obvious that something needs looking at.
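The "try a job N times, then dead-letter it" behaviour described here is what an SQS redrive policy with a maxReceiveCount provides. Below is a minimal sketch of those semantics in plain Java, to make the flow concrete; the class name and in-memory counters are hypothetical, not how SQS or Sleeper implement it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of SQS redrive semantics: a message received more than
// maxReceiveCount times is moved to the dead letter queue for a human to
// investigate, instead of being retried (or reassigned) indefinitely.
public class RedriveSketch {
    private final int maxReceiveCount;
    private final Map<String, Integer> receiveCounts = new HashMap<>();
    private final Deque<String> deadLetterQueue = new ArrayDeque<>();

    public RedriveSketch(int maxReceiveCount) {
        this.maxReceiveCount = maxReceiveCount;
    }

    // Returns true if the job should be attempted, false if it was dead-lettered
    // (e.g. a malformed file or a failing iterator that will never succeed).
    public boolean receive(String job) {
        int count = receiveCounts.merge(job, 1, Integer::sum);
        if (count > maxReceiveCount) {
            deadLetterQueue.add(job);
            return false;
        }
        return true;
    }

    public Deque<String> deadLetterQueue() { return deadLetterQueue; }
}
```

In the real system the dead letter queue would then feed whatever alerting (e.g. an SNS topic) makes the failure visible to an operator.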
