Core router error logs: "sshd: Did not receive identification string" from prometheus hosts
Closed, Resolved, Public
Assigned To

Authored By

	cmooney
	Jun 26 2024, 8:40 AM

Description

I happened to notice that our CR routers have a lot of these messages in their logs. In each case the source IPs appear to be the local prometheus instances:

cmooney@re0.cr1-eqiad> show log messages | match "Did not receive identification string|/usr/sbin/sshd" 
Jun 26 06:00:58  re0.cr1-eqiad sshd[18240]: Did not receive identification string from 10.64.0.82 port 35834
Jun 26 06:00:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18240]: exited, status 255
Jun 26 06:03:44  re0.cr1-eqiad sshd[18375]: Did not receive identification string from 10.64.16.62 port 39704
Jun 26 06:03:44  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18375]: exited, status 255
Jun 26 06:04:58  re0.cr1-eqiad sshd[18419]: Did not receive identification string from 10.64.0.82 port 41180
Jun 26 06:04:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18419]: exited, status 255
Jun 26 06:07:44  re0.cr1-eqiad sshd[18587]: Did not receive identification string from 10.64.16.62 port 40044
Jun 26 06:07:44  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18587]: exited, status 255
Jun 26 06:08:58  re0.cr1-eqiad sshd[18633]: Did not receive identification string from 10.64.0.82 port 49898
Jun 26 06:08:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18633]: exited, status 255
Jun 26 06:11:44  re0.cr1-eqiad sshd[18788]: Did not receive identification string from 10.64.16.62 port 51994
Jun 26 06:11:44  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18788]: exited, status 255
Jun 26 06:12:58  re0.cr1-eqiad sshd[18838]: Did not receive identification string from 10.64.0.82 port 36248
Jun 26 06:12:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18838]: exited, status 255
Jun 26 06:15:44  re0.cr1-eqiad sshd[19005]: Did not receive identification string from 10.64.16.62 port 35270
Jun 26 06:15:44  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[19005]: exited, status 255

We need to investigate the cause. Are these failed SSH attempts from the prometheus hosts to the routers?
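
To check which sources account for these messages, the log lines can be tallied by source IP. The following is a minimal sketch in Python, assuming the log has been exported as plain text lines in the format shown above; the function name `count_sources` is only illustrative.

```python
import re
from collections import Counter

# Matches the sshd message format seen in the router output above.
PATTERN = re.compile(
    r"sshd\[\d+\]: Did not receive identification string from (\S+) port \d+"
)


def count_sources(lines):
    """Return a Counter mapping source IP to number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Jun 26 06:00:58  re0.cr1-eqiad sshd[18240]: Did not receive identification string from 10.64.0.82 port 35834",
    "Jun 26 06:00:58  re0.cr1-eqiad inetd[29367]: /usr/sbin/sshd[18240]: exited, status 255",
    "Jun 26 06:03:44  re0.cr1-eqiad sshd[18375]: Did not receive identification string from 10.64.16.62 port 39704",
    "Jun 26 06:04:58  re0.cr1-eqiad sshd[18419]: Did not receive identification string from 10.64.0.82 port 41180",
]

# Print sources ordered by how often they appear.
for ip, hits in count_sources(sample).most_common():
    print(ip, hits)
```

The inetd "exited, status 255" lines are deliberately ignored, since they carry no source address.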

Details

Subject	Repo	Branch	Lines +/-
type Netbox::Device - make role mandatory	operations/puppet	production	+1 -1
Prometheus SSH probe: ignore network devices - try 2	operations/puppet	production	+5 -1
Netbox-hiera: add device role to mgmt_hosts (try 2)	operations/cookbooks	master	+2 -0
profile::netbox::data: add type for Netbox::Device	operations/puppet	production	+10 -3
Prometheus SSH probe: ignore network devices - try 2	operations/puppet	production	+6 -1
Netbox-hiera: add device role to mgmt_hosts	operations/cookbooks	master	+2 -0
Add role to type Netbox::Device::Location::BareMetal	operations/puppet	production	+1 -0
Prometheus SSH probe: ignore network devices	operations/puppet	production	+5 -1

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• ayounsi	T368513 Core router error logs: "sshd: Did not receive identification string" from prometheus hosts
		Resolved		• ayounsi	T336275 Upgrade Netbox to 4.x

Event Timeline

cmooney triaged this task as Medium priority. Jun 26 2024, 8:40 AM

cmooney created this task.

Restricted Application added a subscriber: Aklapper. Jun 26 2024, 8:40 AM

Those are SSH probes from local prometheus hosts indeed, in this case the probe consists of a TCP connection reading the SSH banner, and then closing the connection, HTH!
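
For illustration, a banner-only probe of the kind described here can be sketched in Python as follows. This is not the actual prober code, only an assumption of what "connect, read the banner, close" looks like; `read_ssh_banner` is a hypothetical name. Since the client closes without ever sending its own identification string, this would be consistent with sshd on the router logging that it did not receive one.

```python
import socket


def read_ssh_banner(host, port=22, timeout=5.0):
    """Connect, read the server's identification line, then close.

    The client never sends its own identification string.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        # Read until the end of the first line or a small size cap.
        while b"\n" not in data and len(data) < 255:
            chunk = sock.recv(255 - len(data))
            if not chunk:
                break
            data += chunk
    # Keep only the first line and drop the trailing CR/LF.
    return data.split(b"\n", 1)[0].strip().decode("ascii", errors="replace")
```

Called against an SSH server, this should return a string starting with "SSH-", after which the connection is dropped.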

@cmooney @fgiunchedi I'm wondering if the probe could/should be changed to a TCP handshake only or totally removed, in order to reduce log spam.
Thanks to Rancid and the daily diff scripts we already get a notification within the hour if SSH becomes unreachable.
We would still keep the ICMP check as the "normal" connectivity probe.

In T368513#9938867, @fgiunchedi wrote:

Those are SSH probes from local prometheus hosts indeed, in this case the probe consists of a TCP connection reading the SSH banner, and then closing the connection, HTH!

Thanks for the info!

In T368513#9989379, @ayounsi wrote:

@cmooney @fgiunchedi I'm wondering if the probe could/should be changed to a TCP handshake only or totally removed.

In general it's probably good to have a check on SSH, but as you say Rancid mostly covers that for us and is something we'd see. So no objection if we remove it to reduce the log spam.

I looked at where the probes come from: they are part of the generic "probe mgmt network hosts for ssh" setup, and the data comes from netbox, specifically these bits in modules/profile/manifests/prometheus/ops.pp:

include profile::netbox::data

$site_mgmt_hosts = $profile::netbox::data::mgmt.filter |$host, $config| {
  $config['site'] == $::site
}

# icmp probes
netops::prometheus::mgmt { 'site':
  targets      => $site_mgmt_hosts,
  targets_file => "${targets_path}/smoke-mgmt_site.yaml",
}

# ssh probes
prometheus::targets::mgmt { 'site':
  targets      => $site_mgmt_hosts,
  targets_file => "${targets_path}/probes-mgmt_site.yaml",
}

Note that at this level each re is a separate entity, rather than cr1-eqiad as a whole. From netbox-hiera/common.yaml on the puppetmaster:

re0.cr1-eqiad.mgmt.eqiad.wmnet:
  rack: A1
  row: eqiad-row-a
  site: eqiad
re1.cr1-eqiad.mgmt.eqiad.wmnet:
  rack: A1
  row: eqiad-row-a
  site: eqiad

Tbh I'm not really sure what the most sustainable option is here for being able to say "exclude cr and other network devices from this probing".
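
One way to express that exclusion, sketched here in Python rather than Puppet, is to filter on a per-device role in addition to the site. This assumes each entry carries a `role` key, which the data shown above does not have; the role values (`cr`, `asw`, `pfw`, `server`) and the function name are hypothetical.

```python
# Hypothetical role names; the real Netbox role slugs may differ.
NETWORK_ROLES = frozenset({"cr", "asw", "pfw"})


def ssh_probe_targets(mgmt_hosts, site, excluded_roles=NETWORK_ROLES):
    """Keep hosts in the given site whose role is not an excluded one."""
    return {
        host: config
        for host, config in mgmt_hosts.items()
        if config["site"] == site and config.get("role") not in excluded_roles
    }


mgmt_hosts = {
    "re0.cr1-eqiad.mgmt.eqiad.wmnet": {
        "rack": "A1", "row": "eqiad-row-a", "site": "eqiad", "role": "cr",
    },
    "alert1001.mgmt.eqiad.wmnet": {
        "rack": "C6", "row": "eqiad-row-c", "site": "eqiad", "role": "server",
    },
}

print(sorted(ssh_probe_targets(mgmt_hosts, "eqiad")))
```

With this shape the ICMP probes could keep using the site-only filter, while only the SSH probes skip network devices.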

Change #1056880 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] Netbox-hiera: add device role to mgmt_hosts

https://gerrit.wikimedia.org/r/1056880

gerritbot added a project: Patch-For-Review. Jul 25 2024, 9:17 AM

Change #1056899 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus SSH probe: ignore network devices

https://gerrit.wikimedia.org/r/1056899

ayounsi added a subtask: T336275: Upgrade Netbox to 4.x. Jul 29 2024, 8:10 AM

ayounsi moved this task from Backlog to This quarter on the netops board. Jul 31 2024, 9:02 AM

Change #1056880 merged by jenkins-bot:

[operations/cookbooks@master] Netbox-hiera: add device role to mgmt_hosts

https://gerrit.wikimedia.org/r/1056880

Change #1056899 merged by Ayounsi:

[operations/puppet@production] Prometheus SSH probe: ignore network devices

https://gerrit.wikimedia.org/r/1056899

Change #1060385 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add role to type Netbox::Device::Location::BareMetal

https://gerrit.wikimedia.org/r/1060385

Change #1060385 merged by Ayounsi:

[operations/puppet@production] Add role to type Netbox::Device::Location::BareMetal

https://gerrit.wikimedia.org/r/1060385

Maintenance_bot removed a project: Patch-For-Review. Wed, Aug 7, 8:30 AM

Change #1060388 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus SSH probe: ignore network devices - try 2

https://gerrit.wikimedia.org/r/1060388

gerritbot added a project: Patch-For-Review. Wed, Aug 7, 8:31 AM

ayounsi closed subtask T336275: Upgrade Netbox to 4.x as Resolved. Wed, Aug 7, 9:37 AM

This went very well until it didn't. All changes have been fully rolled back.

The cookbook change (https://gerrit.wikimedia.org/r/1056880) did its job properly, but it silently broke Puppet for the Prometheus hosts: the Hiera structure changed and no longer matched the defined Netbox::Device::Location::BareMetal type.
Rolling forward by adding the new key to the existing Netbox::Device::Location::BareMetal type caused a larger Puppet issue, as this type is also used to define the Netbox::Device::Location type globally.

profile::netbox::data::mgmt:
  alert1001.mgmt.eqiad.wmnet:
    rack: C6
    row: eqiad-row-c
    site: eqiad

cat /srv/netbox-hiera/hosts/alert1001.yaml

profile::netbox::host::location:
  rack: C6
  row: eqiad-row-c
  site: eqiad
profile::netbox::host::status: active
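
The failure mode can be illustrated with a strict struct check, sketched here in Python. This is an assumption about the behaviour and not the actual Puppet type definition: it treats the type as requiring exactly rack, row and site, so an entry that gains an extra key stops matching unless that key is declared as optional. The `role` value is hypothetical.

```python
def matches_struct(value, required, optional=()):
    """Strict check: every required key present, no key outside the schema."""
    keys = set(value)
    allowed = set(required) | set(optional)
    return set(required) <= keys and keys <= allowed


LOCATION_KEYS = ("rack", "row", "site")

old_entry = {"rack": "C6", "row": "eqiad-row-c", "site": "eqiad"}
new_entry = dict(old_entry, role="server")  # hypothetical role value

# The original structure matches the strict schema.
print(matches_struct(old_entry, LOCATION_KEYS))
# Adding an undeclared key makes the same schema reject the entry.
print(matches_struct(new_entry, LOCATION_KEYS))
# Declaring the key as optional accepts both shapes.
print(matches_struct(new_entry, LOCATION_KEYS, optional=("role",)))
```

Under the same assumption, a type shared by both data sources has to accept both shapes, which is why changing it for one consumer affects the other.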

I see three possible paths forward:

1. Add the role as optional to the Netbox::Device::Location::BareMetal type. This is what this change does, but I don't get why CI is failing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060388
2. Also add the role to the location dict for all hosts in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/puppet/sync-netbox-hiera.py#317 and apply the same patch as above without the Optional.
   As the naming has all been built around "location" and not "metadata", I'm a bit reluctant to rename it all at this point.
3. Add a new type for the keys under profile::netbox::data::mgmt, similar to BareMetal but with the role. So far this is my preferred option.

And of course, add big warnings about this dependency to the sync-netbox-hiera cookbook so future changes don't run into the same trap.

Thank you @ayounsi for the write-up! I agree with your preferred option, and would argue that ::mgmt really contains a list of Devices (in netbox terms).

Change #1060388 abandoned by Ayounsi:

[operations/puppet@production] Prometheus SSH probe: ignore network devices - try 2

Reason:

will do a different approach

https://gerrit.wikimedia.org/r/1060388

Maintenance_bot removed a project: Patch-For-Review. Tue, Aug 13, 8:31 AM

Change #1063990 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] profile::netbox::data: add type for Netbox::Device

https://gerrit.wikimedia.org/r/1063990

gerritbot added a project: Patch-For-Review. Tue, Aug 20, 9:27 AM

Change #1063990 merged by Ayounsi:

[operations/puppet@production] profile::netbox::data: add type for Netbox::Device

https://gerrit.wikimedia.org/r/1063990

Change #1064061 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] Netbox-hiera: add device role to mgmt_hosts (try 2)

https://gerrit.wikimedia.org/r/1064061

Change #1064061 merged by jenkins-bot:

[operations/cookbooks@master] Netbox-hiera: add device role to mgmt_hosts (try 2)

https://gerrit.wikimedia.org/r/1064061

Change #1064380 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus SSH probe: ignore network devices - try 2

https://gerrit.wikimedia.org/r/1064380

Change #1064380 merged by Ayounsi:

[operations/puppet@production] Prometheus SSH probe: ignore network devices - try 2

https://gerrit.wikimedia.org/r/1064380

Change #1064385 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] type Netbox::Device - make role mandatory

https://gerrit.wikimedia.org/r/1064385

Confirmed that cr1-eqiad has stopped generating those logs for 10.64.0.82 (prometheus1005). The logs for the other source (10.64.16.62) should stop as soon as Puppet picks up the changes there.

Change #1064385 merged by Ayounsi:

[operations/puppet@production] type Netbox::Device - make role mandatory

https://gerrit.wikimedia.org/r/1064385

All done!

cmooney awarded a token. Wed, Aug 21, 2:44 PM
