Restart Agent when SystemD Network unit is restarted #103

Closed
ricbartm opened this issue Mar 26, 2021 · 3 comments · Fixed by #104

@ricbartm

Environment

OS: Ubuntu 20.04 LTS
Kernel: 5.4.0-1037-gcp #40-Ubuntu SMP Fri Feb 5 11:57:53 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
SystemD version: 245.4-4ubuntu3.5
Google Guest Agent version: 20201217.02-0ubuntu1~20.04.0

Problem

When Ubuntu releases a security upgrade for the systemd package, the systemd-networkd service is restarted. This can leave a GCP instance unable to serve traffic.

When the systemd-networkd.service unit is restarted, the operating system's local routing table is wiped. This causes the local host routes for Google Cloud regional TCP Load Balancers to disappear and produces the following behavior:

  • The health checks, which originate from the TCP LB service IP, start failing because the node no longer has a host route for it.
  • With all instances in a failed state, the TCP LB enters an always-open state. The instances drop the traffic directed to the TCP LB service IP (they never answer the TCP SYN packet) because the host route is missing.

The workaround for this issue is to restart the google-guest-agent.service SystemD unit, so that the host routes are added back and both health checks and traffic start working again.
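A minimal sketch of how that workaround could be automated, assuming Python 3 is available on the instance. The function names, the sample routing table, and the 203.0.113.10 address are hypothetical and only illustrate the idea: look for a `local` host route for the LB IP and restart the agent when it is missing.

```python
import ipaddress
import subprocess

# Illustrative output only: addresses and the interface name are made up.
SAMPLE_LOCAL_TABLE = """\
broadcast 10.128.0.0 dev ens4 proto kernel scope link src 10.128.0.5
local 10.128.0.5 dev ens4 proto kernel scope host src 10.128.0.5
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 203.0.113.10 dev ens4 proto 66 scope host
"""


def has_local_host_route(route_table: str, lb_ip: str) -> bool:
    """Return True if route_table holds a single-address 'local' route for lb_ip."""
    target = ipaddress.ip_address(lb_ip)
    for line in route_table.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] != "local":
            continue
        try:
            network = ipaddress.ip_network(fields[1], strict=False)
        except ValueError:
            continue  # not an address or prefix, skip the line
        # Only host routes count; wider prefixes such as 127.0.0.0/8 are ignored.
        if network.num_addresses == 1 and target in network:
            return True
    return False


def restart_agent_if_route_missing(lb_ip: str) -> bool:
    """Restart google-guest-agent.service when lb_ip has no local host route.

    Returns True when a restart was issued. Needs root privileges.
    """
    table = subprocess.run(
        ["ip", "route", "list", "table", "local"],
        capture_output=True, text=True, check=True,
    ).stdout
    if has_local_host_route(table, lb_ip):
        return False
    subprocess.run(
        ["systemctl", "restart", "google-guest-agent.service"], check=True
    )
    return True
```

Keeping the parsing separate from the `subprocess` calls means the check can be exercised against captured output without touching the instance.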

Reproduction steps

  1. Create a TCP regional LB in a given region (does not matter if the public IP is static or ephemeral)
  2. Configure a GCP instance in the same region as a backend instance. Configure a basic TCP health check on a TCP port that is wide open
  3. Configure a frontend listener on port 80 using an ephemeral IP
  4. Wait for it to be created
  5. SSH to the instance and verify that the TCP LB ephemeral IP is listed as a host route in the output of ip ro list table local
  6. Restart systemd-networkd using systemctl restart systemd-networkd
  7. Check the local route table again and verify the route is no longer there.

At this point, the route won't be re-added. You need to restart the google-guest-agent.service SystemD unit for the routes to be re-added.
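To make the comparison in steps 5 and 7 less error-prone, the two outputs of `ip ro list table local` can be diffed. The following is a rough sketch; the helper names and both sample tables are hypothetical, and real output on an instance will look different.

```python
from typing import List, Set

# Illustrative snapshots only: addresses and the interface name are made up.
BEFORE_RESTART = """\
local 10.128.0.5 dev ens4 proto kernel scope host src 10.128.0.5
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 203.0.113.10 dev ens4 proto 66 scope host
"""

AFTER_RESTART = """\
local 10.128.0.5 dev ens4 proto kernel scope host src 10.128.0.5
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
"""


def local_route_lines(route_table: str) -> Set[str]:
    """Collect the 'local' route lines, with whitespace normalized."""
    lines = set()
    for raw in route_table.splitlines():
        fields = raw.split()
        if fields and fields[0] == "local":
            lines.add(" ".join(fields))
    return lines


def lost_routes(before: str, after: str) -> List[str]:
    """Return the local routes present in 'before' but absent from 'after'."""
    return sorted(local_route_lines(before) - local_route_lines(after))


if __name__ == "__main__":
    for route in lost_routes(BEFORE_RESTART, AFTER_RESTART):
        print("missing after restart:", route)
```

In practice the two strings would come from running the command before and after step 6.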

Solution

The systemd-networkd.service unit is not listed as part of the PartOf directive in the Google Guest Agent service unit configuration. See https://github.com/GoogleCloudPlatform/guest-agent/blob/master/google-guest-agent.service#L7

The PartOf directive does list networking.service, but that systemd unit is managed by the ifupdown package. In this specific use case the network is managed by systemd-networkd, so the google-guest-agent.service configuration needs to account for that unit as well.
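For context, systemd propagates a stop or restart of any unit named in PartOf= to the unit that declares it, which is why the missing entry matters. A small sketch of how one might check a unit file for a given entry follows; the unit text is a made-up example rather than the agent's actual unit file, and the helper names are hypothetical.

```python
from typing import List

# Hypothetical unit file used for illustration; not the real agent unit.
SAMPLE_UNIT = """\
[Unit]
Description=Example agent
PartOf=network.service networking.service

[Service]
ExecStart=/usr/bin/example_agent
"""


def part_of_units(unit_text: str) -> List[str]:
    """Return every unit named by PartOf= lines in the [Unit] section."""
    units: List[str] = []
    in_unit_section = False
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            in_unit_section = line == "[Unit]"
            continue
        if in_unit_section and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "PartOf":
                # The directive takes a space-separated list and may repeat.
                units.extend(value.split())
    return units


def restarts_with(unit_text: str, other_unit: str) -> bool:
    """True if a restart of other_unit would propagate to this unit via PartOf=."""
    return other_unit in part_of_units(unit_text)


if __name__ == "__main__":
    print(restarts_with(SAMPLE_UNIT, "systemd-networkd.service"))
```

Running `systemctl cat google-guest-agent.service` on an instance and feeding the result to `restarts_with` would show whether the installed unit already covers systemd-networkd.service.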

@ricbartm changed the title from "Restart Agent when SystemD Networking service is restarted" to "Restart Agent when SystemD Network unit is restarted" on Mar 26, 2021
@zmarano
Contributor
zmarano commented Mar 30, 2021

/assign @hopkiw

@dnsmichi

We discovered this issue and its fix during an incident today on GitLab.com SaaS. Sharing the RCA for visibility: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5196#note_632054352

@ricbartm
Author

https://www.netlify.com/blog/2021/04/02/load-balancer-service-degradation-march-25-2021/ for visibility too

patelne pushed a commit to patelne/guest-agent that referenced this issue Feb 17, 2022