Intermittent DNS name resolution failure, likely IPv6-related #29069

Open
schams-net opened this issue Sep 5, 2023 · 11 comments
Labels
bug 🐛 Programming errors, that need preferential fixing resolve

Comments

@schams-net

systemd version the issue has been seen with

252

Used distribution

Debian 12

Linux kernel version used

6.1.0-10-cloud-amd64

CPU architectures issue was seen on

x86_64

Component

systemd-resolved

Expected behaviour you didn't see

DNS look-ups with a constant 100% success rate for IPv4 and IPv6 addresses.

Unexpected behaviour you saw

Intermittent DNS name resolution failures with a success rate of approx. 90% (approx. 10 errors out of 100 queries), likely IPv6-related.

Official Debian v12 AMIs on AWS show intermittent DNS name resolution failures when they perform DNS look-ups of an RDS Aurora endpoint. The instances use the Amazon-provided DNS server (VPC.2). While approx. 90 percent of the DNS queries succeed, 10 percent fail (name or service not known).

The issue occurs on Debian v12 but doesn't occur on Debian v11 or Amazon Linux 2 systems (same environment). No error occurs if the "systemd-resolved" service is disabled or bypassed on Debian v12. No error occurs if IPv6 is disabled on the Debian v12 system.

Steps to reproduce the problem

Please find a detailed problem description including analysis, steps how to reproduce the problem, and a Terraform plan to provision a test infrastructure in the following Git repository: https://github.com/schams-net/aws-ec2-rds-dns-lookup-issue
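
For readers who want a quick check without provisioning the Terraform setup, the measurement described above (repeated look-ups, count the failures) can be sketched roughly as follows. This is not the script from the linked repository. The function name `lookup_success_rate` and the injectable `resolver` parameter are assumptions made for illustration. The sketch relies on Python's `socket.getaddrinfo`, which goes through the system's name service switch and therefore through "resolve" when it is configured.

```python
import socket
from typing import Callable


def lookup_success_rate(
    host: str,
    attempts: int = 100,
    resolver: Callable[..., list] = socket.getaddrinfo,
) -> float:
    """Resolve `host` repeatedly and return the fraction of successful look-ups.

    `resolver` is injectable (an assumption of this sketch) so the counting
    logic can be exercised without real DNS traffic.
    """
    failures = 0
    for _ in range(attempts):
        try:
            # No port is needed for a pure name look-up, so None is passed.
            resolver(host, None)
        except socket.gaierror:
            # "Name or service not known" surfaces as socket.gaierror.
            failures += 1
    return (attempts - failures) / attempts
```

Calling `lookup_success_rate("<your RDS Aurora endpoint>")` on an affected instance would be expected to return a value below 1.0 if the behaviour reported here is present; the endpoint name is left as a placeholder on purpose.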

Additional program output to the terminal or log subsystem illustrating the issue

No response

@schams-net schams-net added the bug 🐛 Programming errors, that need preferential fixing label Sep 5, 2023
@wjsc
wjsc commented Oct 26, 2023

Hi! Any update on this issue? Any new workaround?

@jackarias

Hello! Same problem here. It would be nice to have a workaround for this ASAP.

@schams-net
Author

As documented here, the workaround across our server fleet is a small change to the file /etc/nsswitch.conf. Locate this line:

hosts: files resolve [!UNAVAIL=return] dns

...and remove the "resolve" service. After the change, the line reads:

hosts: files dns

Note that this is not the final solution but only a workaround.
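
If the change has to be rolled out to many hosts, the edit above could be scripted. The following is only a sketch of the line transformation, assuming the "hosts" line has the shape quoted in this comment; `strip_resolve` is a hypothetical helper name, not part of any tool mentioned here.

```python
import re


def strip_resolve(line: str) -> str:
    """Drop the "resolve" service (and its action clause) from a hosts line."""
    if not line.lstrip().startswith("hosts:"):
        return line  # leave every other database line untouched
    key, _, services = line.partition(":")
    # Remove "resolve" together with an optional "[...]" action that follows it.
    services = re.sub(r"\bresolve\b(\s*\[[^\]]*\])?", "", services)
    # Note: runs of whitespace between services are collapsed to single spaces.
    return f"{key}: {' '.join(services.split())}"
```

Applied to the line quoted above, this yields `hosts: files dns`. Writing the result back to /etc/nsswitch.conf (and keeping a backup) is deliberately left out of the sketch.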

@levinse
levinse commented Dec 18, 2023

We have been experiencing this since upgrading to Debian 12 AMIs.

@ahmgithubahm
ahmgithubahm commented Apr 9, 2024

We're affected by this, using Debian 12 AWS EC2 AMIs. I agree that it is somehow IPv6-related, and it also seems to cause resolution failures only for CNAMEs. At least in my testing with A and CNAME lookups, only CNAMEs exhibit the intermittent failures.

Our workaround has been to simply purge the systemd-resolved package entirely, which has the effect of sorting out nsswitch on its way out of the door. <edit> DO NOT USE THIS WORKAROUND, IT BREAKS DNS NAME RESOLUTION AFTER A REBOOT, SORRY 🤡

apt purge --auto-remove systemd-resolved

Thanks to @schams-net for the detailed repro steps. I shamelessly grabbed the bash fragment [1] to easily reproduce the issue. Using the script to resolve a CNAME produced intermittent failures. Pointing it at an A record gave no failures. Telling netcat to use IPv4 (nc -4) also meant no failures, for A or CNAME lookups.

[1] https://github.com/schams-net/aws-ec2-rds-dns-lookup-issue/blob/main/docs/how-to-reproduce-the-issue.md
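
The IPv4-only observation can be approximated without netcat. The sketch below is an analogy to `nc -4`, not the bash fragment from [1]: it compares look-ups with an unrestricted address family against IPv4-only look-ups. The name `failures_by_family` and the injectable `resolver` are assumptions for illustration.

```python
import socket
from typing import Callable, Dict


def failures_by_family(
    host: str,
    attempts: int = 100,
    resolver: Callable[..., list] = socket.getaddrinfo,
) -> Dict[str, int]:
    """Count failed look-ups of `host`, once for any family and once for IPv4 only."""
    families = {"any": socket.AF_UNSPEC, "ipv4": socket.AF_INET}
    counts: Dict[str, int] = {}
    for label, family in families.items():
        failed = 0
        for _ in range(attempts):
            try:
                # AF_INET restricts the query to IPv4, roughly like "nc -4".
                resolver(host, None, family=family)
            except socket.gaierror:
                failed += 1
        counts[label] = failed
    return counts
```

If the pattern described in this comment holds, the "any" count would be non-zero for a CNAME target while the "ipv4" count stays at zero; that is an expectation drawn from this thread, not a verified result.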

@nmeyerhans
Contributor

@ahmgithubahm You don't need to remove all of systemd-resolved. Just removing libnss-resolve will accomplish what you need without disabling resolv.conf management altogether. Debian 12.6 images, when they're published, will make this the default.

@ahmgithubahm

Thanks @nmeyerhans, that's good to know. I've edited my post now that I realise my workaround leaves you with no working /etc/resolv.conf after a reboot. Oops. 🤡

@nmeyerhans
Contributor

@schams-net @ahmgithubahm @jackarias @levinse @wjsc Be aware that the Debian 12 cloud images just released (version 20240415-1718) no longer enable libnss-resolve by default. So this issue should be mitigated in instances launched from these and future images.

If somebody who has encountered this issue on Debian 12 would be able to test the Debian sid images, that would help the systemd maintainers understand if the issue is still present.

@kuzminets

Thanks @nmeyerhans! Is the problem somehow specific to AWS? I can't reproduce it from a local VM with libnss-resolve being enabled.

@nmeyerhans
Contributor

I don't see any reason to think it'd be specific to AWS, but it's possible. As the problem is intermittent, it could be that something about the AWS DNS infrastructure triggers it more frequently, but that's just speculation.

@schams-net
Author

I can confirm that I'm unable to reproduce the issue with the latest AMIs at AWS anymore 🥳
I tested debian-12-amd64-20240429-1732 (for example AMI ID ami-02175b0058d3ce245 in us-east-1). The file /etc/nsswitch.conf contains the following configuration in the old vs. new image.

Previously (with "resolve"):

hosts: files resolve [!UNAVAIL=return] dns

New(er) images (without "resolve"):

hosts: files dns myhostname

(As @nmeyerhans pointed out).

I haven't had the chance to test the Debian sid images yet.
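
To tell quickly whether a given instance was launched from an old or a new image, the "hosts" line can be inspected programmatically. This is a minimal sketch that assumes the file follows the layout shown in the two lines above; `hosts_services` and `uses_resolve` are hypothetical helper names.

```python
import re
from typing import List


def hosts_services(nsswitch_text: str) -> List[str]:
    """Return the service names on the "hosts" line, ignoring "[...]" actions."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if line.startswith("hosts:"):
            spec = line[len("hosts:"):]
            # Action clauses such as "[!UNAVAIL=return]" are not services.
            spec = re.sub(r"\[[^\]]*\]", " ", spec)
            return spec.split()
    return []


def uses_resolve(nsswitch_text: str) -> bool:
    """True if the "resolve" service is configured for host look-ups."""
    return "resolve" in hosts_services(nsswitch_text)
```

On a live system the input would be the contents of /etc/nsswitch.conf, for example via `pathlib.Path("/etc/nsswitch.conf").read_text()`.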


7 participants