
[Flink Operations] How to handle restarting a Flink application
Closed, Resolved · Public · 5 Estimated Story Points

Description

User Story
As an event platform engineer, I need to understand how I can restart a Flink application from the point in time that it failed
Why?
  • So that restarts can be handled cleanly, with minimal impact and with minimal manual intervention
Done is:
  • Process for restarts is documented in runbook (major potential failure points are documented and process at each step is documented)
  • Storage requirements for state and approach is documented

Event Timeline

lbowmaker renamed this task from How to handle restarting a Flink application to [Flink Operations] How to handle restarting a Flink application. Feb 1 2023, 2:56 PM
lbowmaker created this task.
lbowmaker moved this task from Backlog to To be Estimated/To be discussed on the Event-Platform board.
gmodena moved this task from Next Up to In Progress on the Event-Platform (Sprint 09) board.
gmodena set the point value for this task to 5.

I have a working setup on minikube that manages restarts and HA using the flink k8s operator, minio (for checkpointing) and the helm template.
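
To make the checkpointing piece concrete, here's a minimal sketch of a job configured to checkpoint to an S3-compatible store like minio. The bucket name, endpoint, and checkpoint interval are illustrative placeholders from the minikube experiments, not settled values:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Checkpoints go to an S3-compatible object store (minio on minikube).
        // Bucket and endpoint below are placeholders, not deployment values.
        conf.setString("state.backend", "hashmap");
        conf.setString("state.checkpoints.dir", "s3://flink-checkpoints/my-app");
        conf.setString("s3.endpoint", "http://minio:9000");
        conf.setString("s3.path.style.access", "true");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(60_000); // checkpoint every 60s

        // ... job graph goes here ...

        env.execute("checkpointed-job-sketch");
    }
}
```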

As a next step, I'd like to scale up experiments to DSE. @Ottomata, we need to touch base and validate some assumptions about our current deployment.
Flink HA services require that we _start the JobManager and TaskManager pods with a service account which has the permissions to create, edit, delete ConfigMaps_ (https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/kubernetes/).

Does our service account meet these requirements?

We'll also need storage for application checkpointing. This is required for HA services (e.g. the restart strategy), and we won't be able to fall back to the JobManager heap. I reached out to SRE Data Persistence for info on onboarding to Swift.
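
To spell out what those HA requirements translate to, here's a rough sketch of the relevant settings, expressed with the Java Configuration API for illustration. In a real deployment these would live in flink-conf.yaml (or the operator's flinkConfiguration), and all the values are placeholders:

```java
import org.apache.flink.configuration.Configuration;

public class HaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Kubernetes HA: leader election and recovery metadata are kept in
        // ConfigMaps, hence the create/edit/delete permissions above.
        conf.setString("high-availability",
                "org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory");
        conf.setString("kubernetes.cluster-id", "my-flink-app");

        // HA metadata and checkpoints need durable storage (minio/Swift);
        // there is no fallback to the JobManager heap in HA mode.
        conf.setString("high-availability.storageDir", "s3://flink-ha/recovery");
        conf.setString("state.checkpoints.dir", "s3://flink-checkpoints/my-flink-app");

        // A restart strategy only recovers state if checkpoints survive.
        conf.setString("restart-strategy", "fixed-delay");
        conf.setString("restart-strategy.fixed-delay.attempts", "10");
        conf.setString("restart-strategy.fixed-delay.delay", "30 s");

        System.out.println(conf);
    }
}
```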

permissions to create, edit, delete ConfigMaps

Yes, we got it!

FWIW, we MAYYYBE will want to use Zookeeper for the HA state. This would make k8s cluster restarts easier on Service Ops: modified ConfigMap state is not restored on a k8s cluster restart, so if we need to persist it, we have to do so manually whenever they want to restart k8s.
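
For reference, a sketch of what swapping the HA backend to Zookeeper would look like, with a hypothetical quorum address; same caveat as above that this belongs in flink-conf.yaml in practice:

```java
import org.apache.flink.configuration.Configuration;

public class ZookeeperHaSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Zookeeper HA instead of ConfigMaps, so leader/recovery metadata
        // survives a full k8s cluster restart without manual persistence.
        conf.setString("high-availability", "zookeeper");
        // Placeholder quorum; the real one would point at our ZK ensemble.
        conf.setString("high-availability.zookeeper.quorum",
                "zk-1:2181,zk-2:2181,zk-3:2181");

        // Durable storage is still required: Zookeeper holds only pointers
        // and leader information, the actual metadata lives in storageDir.
        conf.setString("high-availability.storageDir", "s3://flink-ha/recovery");

        System.out.println(conf);
    }
}
```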