Maniphest T328565

[Flink Operations] Automate Replay of Failed Events
Open, Needs Triage, Public
Assigned To

None

Authored By

	lbowmaker
	Feb 1 2023, 3:00 PM

Referenced Files

None

Description

As an event platform engineer, I want to automate the replay of failed events.

Why?

If the processing of an event fails, recovery should require as little intervention as possible from the platform or implementing teams, to reduce the impact and effort of manual replay.

Done is:

Solution is discussed and agreed with the team
Preliminary work can start on implementing the process

To be groomed:

Should we agree on a convention for failure output? For example, something like HTTP status codes, where 418 means a code failure, etc.
Agree on how to handle each failure output type. For example, don’t replay code failures, but alert instead?
Agree on how to replay each failure. For example, retry 3 times in 24 hours, then alert? Should this be configurable per job?
What tool should be used? Batch Flink job? Airflow?
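
To make the grooming questions above concrete, here is a minimal Python sketch of one possible policy. Nothing in it has been agreed: the names FailureKind, Action, ReplayPolicy and decide are hypothetical, the two failure kinds are placeholders for whatever convention is chosen, and the defaults (3 attempts, 24 hours) are just the example figures from the list. It is independent of the tooling choice and could be called from a batch Flink job or an Airflow task alike.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class FailureKind(Enum):
    # Hypothetical categories; the real convention is still to be agreed.
    CODE = "code"            # bug in the job logic, replaying would fail again
    TRANSIENT = "transient"  # e.g. a timeout calling a dependency


class Action(Enum):
    REPLAY = "replay"
    ALERT = "alert"


@dataclass(frozen=True)
class ReplayPolicy:
    # Defaults mirror the "3 times in 24 hours" example; override per job.
    max_attempts: int = 3
    window: timedelta = timedelta(hours=24)


def decide(kind: FailureKind, first_failed_at: datetime, attempts: int,
           now: datetime, policy: ReplayPolicy = ReplayPolicy()) -> Action:
    """Return what to do with a failed event under the given policy."""
    if kind is FailureKind.CODE:
        return Action.ALERT  # never replay code failures, alert instead
    if attempts >= policy.max_attempts:
        return Action.ALERT  # replay budget used up
    if now - first_failed_at > policy.window:
        return Action.ALERT  # outside the replay window
    return Action.REPLAY
```

A job that needs more attempts would pass its own policy, for example decide(kind, first_failed_at, attempts, now, ReplayPolicy(max_attempts=5)).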

Related Objects
		Status	Subtype	Assigned	Task
		Resolved		Gehel	T317045 [Epic] Re-architect the Search Update Pipeline
		Resolved		Gehel	T340548 [EPIC] Deployment of the Search Update Pipeline on Flink / k8s
		Open		lbowmaker	T328561 [Event Platform] Flink Operations
		Open		None	T328565 [Flink Operations] Automate Replay of Failed Events

Event Timeline

lbowmaker created this task. Feb 1 2023, 3:00 PM

lbowmaker moved this task from Backlog to To be Estimated/To be discussed on the Event-Platform board.

lbowmaker removed lbowmaker as the assignee of this task. Feb 1 2023, 3:09 PM

EChetty edited projects, added Data-Engineering-Planning; removed Data-Engineering. Feb 10 2023, 12:38 PM

EChetty moved this task from Backlog to Event Platform on the Data-Engineering-Planning board. Feb 10 2023, 12:45 PM

JArguello-WMF moved this task from To be Estimated/To be discussed to Estimated/ Discussed on the Event-Platform board. Feb 10 2023, 5:51 PM

JArguello-WMF moved this task from Estimated/ Discussed to To be Estimated/To be discussed on the Event-Platform board.

JArguello-WMF moved this task from To be Estimated/To be discussed to Backlog on the Event-Platform board. Mar 8 2023, 3:15 PM

JArguello-WMF removed a project: Data-Engineering-Planning. Jun 29 2023, 9:47 PM

Restricted Application added a project: Data-Engineering. Jun 29 2023, 9:47 PM

JArguello-WMF moved this task from Incoming (new tickets) to Event Platform Backlog on the Data-Engineering board. Jun 29 2023, 10:30 PM

JArguello-WMF added a project: Data Engineering and Event Platform Team. Jun 30 2023, 4:29 PM

JArguello-WMF moved this task from Data Eng Backlog to Event Platform Backlog on the Data Engineering and Event Platform Team board. Jun 30 2023, 4:38 PM

bking mentioned this in T342149: Test common operations in the flink operator/k8s/Flink ZK environment. Jul 18 2023, 5:44 PM

bking subscribed. Aug 22 2023, 6:27 PM

lbowmaker removed a project: Data Engineering and Event Platform Team. Nov 10 2023, 2:29 PM