[Flink Operations] Automate Replay of Failed Events
Open, Needs Triage, Public

Description

As an event platform engineer, I want to automate the replay of failed events
Why?
  • If the processing of an event fails, it should require as little intervention as possible from the platform or implementing teams, to reduce the impact and effort of manually replaying events
Done is:
  • Solution is discussed and agreed with the team
  • Preliminary work can start on implementing the process

To be groomed:

  • Should we agree on a convention for failure output, similar to HTTP status codes? For example, 418 could denote a code failure, etc.
  • Agree on how to handle each failure output type. For example, should code failures trigger an alert rather than a replay?
  • Agree on how to replay each failure. For example, retry 3 times within 24 hours, then alert? Should this be configurable per job?
  • What tool should be used? Batch Flink job? Airflow?
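
To make the grooming questions above more concrete, here is a minimal Python sketch of one possible shape for a failure convention and a per-job replay policy. Nothing in it has been agreed: the names `FailureKind`, `Action`, `ReplayPolicy` and `next_action` are hypothetical, the two failure kinds are only placeholders, and the defaults simply mirror the "3 times in 24 hours" example. It does not depend on Flink or Airflow, so it leaves the tooling question open.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class FailureKind(Enum):
    """Hypothetical failure convention; the real categories are still to be agreed."""
    CODE = "code"            # bug in the job logic: a replay would fail the same way
    TRANSIENT = "transient"  # e.g. a dependency timeout: a replay may succeed


class Action(Enum):
    REPLAY = "replay"
    ALERT = "alert"


@dataclass(frozen=True)
class ReplayPolicy:
    """Per-job retry settings (assumed defaults: 3 attempts within 24 hours)."""
    max_attempts: int = 3
    window: timedelta = timedelta(hours=24)


def next_action(kind, attempts, since_first_failure, policy=ReplayPolicy()):
    """Decide what to do with a failed event.

    kind: FailureKind reported for the event
    attempts: number of replays already attempted
    since_first_failure: timedelta elapsed since the event first failed
    policy: ReplayPolicy for the job that produced the failure
    """
    # Code failures are never replayed; they go straight to an alert.
    if kind is FailureKind.CODE:
        return Action.ALERT
    # Stop replaying once the attempt budget or the time window is exhausted.
    if attempts >= policy.max_attempts or since_first_failure > policy.window:
        return Action.ALERT
    return Action.REPLAY
```

A job that needs a different budget could pass its own policy, for example `next_action(kind, attempts, elapsed, ReplayPolicy(max_attempts=5))`, which is one way to keep the behaviour flexible per job.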
