Core router error logs: "sshd: Did not receive identification string" from prometheus hosts
Closed, Resolved (Public)

Description

I noticed by chance that our CR routers have a lot of these entries in their logs. The source IPs in each case appear to be the local Prometheus instances:

cmooney@re0.cr1-eqiad> show log messages | match "Did not receive identification string|/usr/sbin/sshd" 
Jun 26 06:00:58  re0.cr1-eqiad sshd[18240]: Did not receive identification string from 10.64.0.82 port 35834
Jun 26 06:00:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18240]: exited, status 255
Jun 26 06:03:44  re0.cr1-eqiad sshd[18375]: Did not receive identification string from 10.64.16.62 port 39704
Jun 26 06:03:44  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18375]: exited, status 255
Jun 26 06:04:58  re0.cr1-eqiad sshd[18419]: Did not receive identification string from 10.64.0.82 port 41180
Jun 26 06:04:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18419]: exited, status 255
Jun 26 06:07:44  re0.cr1-eqiad sshd[18587]: Did not receive identification string from 10.64.16.62 port 40044
Jun 26 06:07:44  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18587]: exited, status 255
Jun 26 06:08:58  re0.cr1-eqiad sshd[18633]: Did not receive identification string from 10.64.0.82 port 49898
Jun 26 06:08:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18633]: exited, status 255
Jun 26 06:11:44  re0.cr1-eqiad sshd[18788]: Did not receive identification string from 10.64.16.62 port 51994
Jun 26 06:11:44  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18788]: exited, status 255
Jun 26 06:12:58  re0.cr1-eqiad sshd[18838]: Did not receive identification string from 10.64.0.82 port 36248
Jun 26 06:12:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18838]: exited, status 255
Jun 26 06:15:44  re0.cr1-eqiad sshd[19005]: Did not receive identification string from 10.64.16.62 port 35270
Jun 26 06:15:44  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[19005]: exited, status 255

We need to investigate the cause: are these failed SSH attempts from the Prometheus hosts to the routers?
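To see at a glance which sources are responsible, the log lines above can be tallied per source IP. This is only a sketch: the regular expression is written against the sample lines shown here, and the function name `count_probe_sources` is made up for illustration.

```python
import re
from collections import Counter

# Matches the sshd lines from the excerpt above; the inetd "exited" lines do not match.
LOG_PATTERN = re.compile(
    r"sshd\[\d+\]: Did not receive identification string from (?P<ip>\S+) port \d+"
)


def count_probe_sources(log_lines):
    """Count 'Did not receive identification string' entries per source IP."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# A few lines copied from the router log excerpt above.
sample = [
    "Jun 26 06:00:58  re0.cr1-eqiad sshd[18240]: Did not receive identification string from 10.64.0.82 port 35834",
    "Jun 26 06:00:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18240]: exited, status 255",
    "Jun 26 06:03:44  re0.cr1-eqiad sshd[18375]: Did not receive identification string from 10.64.16.62 port 39704",
    "Jun 26 06:04:58  re0.cr1-eqiad sshd[18419]: Did not receive identification string from 10.64.0.82 port 41180",
]

for ip, count in count_probe_sources(sample).most_common():
    print(ip, count)
```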

Event Timeline

cmooney triaged this task as Medium priority. Jun 26 2024, 8:40 AM
cmooney created this task.

Those are indeed SSH probes from the local Prometheus hosts. In this case the probe consists of a TCP connection that reads the SSH banner and then closes the connection. HTH!
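For illustration, a banner-reading probe of this kind can be sketched in a few lines of Python. This is not the actual Prometheus probe implementation, just a minimal stand-in; the function name `read_ssh_banner` and its defaults are assumptions. Because the client disconnects without ever sending its own identification string, an sshd on the other end would typically log the "Did not receive identification string" message seen on the routers.

```python
import socket


def read_ssh_banner(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Open a TCP connection, read the first line the server sends, then close."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rb") as stream:
            # An SSH server announces itself first, e.g. b"SSH-2.0-...\r\n".
            line = stream.readline(256)
    # The connection is closed here without the client sending anything,
    # which is what the server side reports as a missing identification string.
    return line.decode("ascii", errors="replace").strip()
```

A call such as `read_ssh_banner("re0.cr1-eqiad.mgmt.eqiad.wmnet")` would return the banner line if the host is reachable. A handshake-only probe would skip the read and simply close the socket after connecting.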

@cmooney @fgiunchedi I'm wondering if the probe could/should be changed to a TCP handshake only or totally removed, in order to reduce log spam.
Thanks to Rancid and the daily diff scripts we already get a notification within the hour if SSH becomes unreachable.
We would still keep the ICMP check as the "normal" connectivity probe.

Those are indeed SSH probes from the local Prometheus hosts. In this case the probe consists of a TCP connection that reads the SSH banner and then closes the connection. HTH!

Thanks for the info!

@cmooney @fgiunchedi I'm wondering if the probe could/should be changed to a TCP handshake only or totally removed.

In general it's probably good to have a check on SSH, but as you say Rancid mostly covers that for us, and a failure there is something we'd notice. So no objection if we remove the probe to reduce the log spam.

I looked at where the probes come from: they are part of the generic "probe mgmt network hosts for ssh" check, and the data comes from Netbox. Specifically, these bits in modules/profile/manifests/prometheus/ops.pp:

include profile::netbox::data

$site_mgmt_hosts = $profile::netbox::data::mgmt.filter |$host, $config| {
  $config['site'] == $::site
}

# icmp probes
netops::prometheus::mgmt { 'site':
  targets      => $site_mgmt_hosts,
  targets_file => "${targets_path}/smoke-mgmt_site.yaml",
}

# ssh probes
prometheus::targets::mgmt { 'site':
  targets      => $site_mgmt_hosts,
  targets_file => "${targets_path}/probes-mgmt_site.yaml",
}

Note that at this level each RE (routing engine) is a separate entity, rather than cr1-eqiad as a whole. From netbox-hiera/common.yaml on the puppetmaster:

re0.cr1-eqiad.mgmt.eqiad.wmnet:
  rack: A1
  row: eqiad-row-a
  site: eqiad
re1.cr1-eqiad.mgmt.eqiad.wmnet:
  rack: A1
  row: eqiad-row-a
  site: eqiad

To be honest, I'm not really sure what the most sustainable way is to say "exclude cr and other network devices from this probing".
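One way to express such an exclusion is to carry a role per device and filter on it, in the same spirit as the site filter above. The sketch below mirrors that filter in Python. The `role` key is not present in the data shown above, and the role names `cr` and `server` as well as the `NETWORK_ROLES` set are invented for illustration.

```python
# Hypothetical role names; the real values would come from Netbox.
NETWORK_ROLES = frozenset({"cr", "asw"})


def ssh_probe_targets(mgmt_hosts, site, excluded_roles=NETWORK_ROLES):
    """Keep hosts at the given site whose role is not in the excluded set."""
    return {
        host: config
        for host, config in mgmt_hosts.items()
        if config["site"] == site and config.get("role") not in excluded_roles
    }


# Entries shaped like the netbox-hiera data above, plus an assumed "role" key.
mgmt_hosts = {
    "re0.cr1-eqiad.mgmt.eqiad.wmnet": {
        "rack": "A1", "row": "eqiad-row-a", "site": "eqiad", "role": "cr",
    },
    "alert1001.mgmt.eqiad.wmnet": {
        "rack": "C6", "row": "eqiad-row-c", "site": "eqiad", "role": "server",
    },
}

print(sorted(ssh_probe_targets(mgmt_hosts, "eqiad")))
```

Hosts without a role are kept, since `config.get("role")` returns `None` for them. The ICMP targets could keep using the unfiltered per-site list.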

Change #1056880 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] Netbox-hiera: add device role to mgmt_hosts

https://gerrit.wikimedia.org/r/1056880

Change #1056899 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus SSH probe: ignore network devices

https://gerrit.wikimedia.org/r/1056899

Change #1056880 merged by jenkins-bot:

[operations/cookbooks@master] Netbox-hiera: add device role to mgmt_hosts

https://gerrit.wikimedia.org/r/1056880

Change #1056899 merged by Ayounsi:

[operations/puppet@production] Prometheus SSH probe: ignore network devices

https://gerrit.wikimedia.org/r/1056899

Change #1060385 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add role to type Netbox::Device::Location::BareMetal

https://gerrit.wikimedia.org/r/1060385

Change #1060385 merged by Ayounsi:

[operations/puppet@production] Add role to type Netbox::Device::Location::BareMetal

https://gerrit.wikimedia.org/r/1060385

Change #1060388 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus SSH probe: ignore network devices - try 2

https://gerrit.wikimedia.org/r/1060388

This went very well until it didn't. The changes have been fully rolled back.

The cookbook change (https://gerrit.wikimedia.org/r/1056880) did its job properly.
However, it silently broke Puppet on the Prometheus hosts: the Hiera structure changed and no longer matched the defined Netbox::Device::Location::BareMetal type.
Rolling forward by adding the new key to the existing Netbox::Device::Location::BareMetal type caused a larger Puppet issue, as this type is also used to define the Netbox::Device::Location type globally.

profile::netbox::data::mgmt:
  alert1001.mgmt.eqiad.wmnet:
    rack: C6
    row: eqiad-row-c
    site: eqiad
cat /srv/netbox-hiera/hosts/alert1001.yaml
profile::netbox::host::location:
  rack: C6
  row: eqiad-row-c
  site: eqiad
profile::netbox::host::status: active
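The failure mode can be illustrated with a strict key check. This sketch assumes the Puppet type behaves like a strict struct that rejects unknown keys; the function `validate_location` and the role value `server` are made up for illustration.

```python
# Keys of the location entries shown above.
LOCATION_KEYS = frozenset({"rack", "row", "site"})


def validate_location(entry):
    """Accept an entry only if it has exactly the expected keys."""
    unknown = set(entry) - LOCATION_KEYS
    missing = LOCATION_KEYS - set(entry)
    if unknown or missing:
        raise TypeError(
            f"unexpected keys {sorted(unknown)}, missing keys {sorted(missing)}"
        )
    return entry


old_entry = {"rack": "C6", "row": "eqiad-row-c", "site": "eqiad"}
new_entry = {**old_entry, "role": "server"}  # data after adding a role key

validate_location(old_entry)  # passes
try:
    validate_location(new_entry)
except TypeError as err:
    print("rejected:", err)
```

Under this assumption, changing the generated data without changing the type fails on every consumer of that data, and widening the shared type affects every other place that uses it.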

I see three possible paths forward:

As the naming has all been made around "location" and not "metadata", I'm a bit reluctant to rename it all at this point.

  • Add a new type for the keys under profile::netbox::data::mgmt, similar to BareMetal but with the role (so far my preferred option)

And of course, add big warnings about this dependency to the sync-netbox-hiera cookbook so future changes don't run into the same trap.

Thank you @ayounsi for the write-up! I agree with your preferred option, and would argue that ::mgmt really contains a list of Devices (in Netbox terms).
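A rough sketch of that option: keep the location type as it is and give the ::mgmt entries their own device type that also carries a role. The class names and the role value `server` below are illustrative stand-ins, not the actual Puppet types. Making the role optional at first is an assumption, chosen here so that data generated before and after the cookbook change both validate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Stand-in for the existing location type, left untouched."""
    rack: str
    row: str
    site: str


@dataclass(frozen=True)
class Device:
    """Hypothetical dedicated type for entries under ::mgmt."""
    rack: str
    row: str
    site: str
    role: Optional[str] = None  # optional, so old and new data both validate


old_entry = {"rack": "C6", "row": "eqiad-row-c", "site": "eqiad"}
new_entry = {**old_entry, "role": "server"}  # "server" is an invented role name

print(Device(**old_entry))
print(Device(**new_entry))
```

Once all generated data carries a role, the default could be dropped to make the field mandatory, without touching `Location` or its other users.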

Change #1060388 abandoned by Ayounsi:

[operations/puppet@production] Prometheus SSH probe: ignore network devices - try 2

Reason:

will do a different approach

https://gerrit.wikimedia.org/r/1060388

Change #1063990 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] profile::netbox::data: add type for Netbox::Device

https://gerrit.wikimedia.org/r/1063990

Change #1063990 merged by Ayounsi:

[operations/puppet@production] profile::netbox::data: add type for Netbox::Device

https://gerrit.wikimedia.org/r/1063990

Change #1064061 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] Netbox-hiera: add device role to mgmt_hosts (try 2)

https://gerrit.wikimedia.org/r/1064061

Change #1064061 merged by jenkins-bot:

[operations/cookbooks@master] Netbox-hiera: add device role to mgmt_hosts (try 2)

https://gerrit.wikimedia.org/r/1064061

Change #1064380 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus SSH probe: ignore network devices - try 2

https://gerrit.wikimedia.org/r/1064380

Change #1064380 merged by Ayounsi:

[operations/puppet@production] Prometheus SSH probe: ignore network devices - try 2

https://gerrit.wikimedia.org/r/1064380

Change #1064385 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] type Netbox::Device - make role mandatory

https://gerrit.wikimedia.org/r/1064385

Confirmed that cr1-eqiad stopped generating those logs for 10.64.0.82 (prometheus1005). The other host will stop as soon as Puppet picks up the changes there.

Change #1064385 merged by Ayounsi:

[operations/puppet@production] type Netbox::Device - make role mandatory

https://gerrit.wikimedia.org/r/1064385

ayounsi claimed this task.

All done!