[go: nahoru, domu]

×

Network Observability Operator 1.7.0

The following advisory is available for the Network Observability Operator 1.7.0:

New features and enhancements

OpenTelemetry support

You can now export enriched network flows to a compatible OpenTelemetry endpoint, such as the Red Hat build of OpenTelemetry. For more information see Export enriched network flow data.

Network Observability Developer perspective

You can now use Network Observability in the Developer perspective. For more information, see OpenShift Container Platform console integration.

TCP flags filtering

You can now use the tcpFlags filter to limit the volume of packets processed by the eBPF program. For more information, see Flow filter configuration parameters, eBPF flow rule filter, and Detecting SYN flooding using the FlowMetric API and TCP flags.

Network Observability for OpenShift Virtualization

You can observe networking patterns on an OpenShift Virtualization setup by identifying eBPF-enriched network flows coming from VMs that are connected to secondary networks, such as through Open Virtual Network (OVN)-Kubernetes. For more information, see Configuring virtual machine (VM) secondary network interfaces for Network Observability.

Network policy deploys in the FlowCollector custom resource (CR)

With this release, you can configure the FlowCollector CR to deploy a network policy for Network Observability. Previously, if you wanted a network policy, you had to manually create one. The option to manually create a network policy is still available. For more information, see Configuring an ingress network policy by using the FlowCollector custom resource.

FIPS compliance

  • You can install and use the Network Observability Operator in an OpenShift Container Platform cluster running in FIPS mode.

    To enable FIPS mode for your cluster, you must run the installation program from a RHEL computer configured to operate in FIPS mode. For more information about configuring FIPS mode on RHEL, see Installing the system in FIPS mode.

eBPF agent enhancements

The following enhancements are available for the eBPF agent:

  • If the DNS service maps to a different port than 53, you can specify this DNS tracking port using spec.agent.ebpf.advanced.env.DNS_TRACKING_PORT.

  • You can now use two ports for transport protocols (TCP, UDP, or SCTP) filtering rules.

  • You can now filter on transport ports with a wildcard protocol by leaving the protocol field empty.

For more information, see FlowCollector API specifications.

Network Observability CLI

The Network Observability CLI (oc netobserv), is now generally available. The following enhancements have been made since the 1.6 Technology Preview release: * There are now eBPF enrichment filters for packet capture similar to flow capture. * You can now use filter tcp_flags with both flow and packets capture. * The auto-teardown option is available when max-bytes or max-time is reached. For more information, see Network Observability CLI and Network Observability CLI 1.7.0.

Bug fixes

  • Previously, when using a RHEL 9.2 real-time kernel, some of the webhooks did not work. Now, a fix is in place to check whether this RHEL 9.2 real-time kernel is being used. If the kernel is being used, a warning is displayed about the features that do not work, such as packet drop and neither Round-trip Time when using s390x architecture. The fix is in OpenShift 4.16 and later. (NETOBSERV-1808)

  • Previously, in the Manage panels dialog in the Overview tab, filtering on total, bar, donut, or line did not show a result. Now the available panels are correctly filtered. (NETOBSERV-1540)

  • Previously, under high stress, the eBPF agents were susceptible to enter into a state where they generated a high number of small flows, almost not aggregated. With this fix, the aggregation process is still maintained under high stress, resulting in less flows being created. This fix improves the resource consumption not only in the eBPF agent but also in flowlogs-pipeline and Loki. (NETOBSERV-1564)

  • Previously, when the workload_flows_total metric was enabled instead of the namespace_flows_total metric, the health dashboard stopped showing By namespace flow charts. With this fix, the health dashboard now shows the flow charts when the workload_flows_total is enabled. (NETOBSERV-1746)

  • Previously, when you used the FlowMetrics API to generate a custom metric and later modified its labels, such as by adding a new label, the metric stopped populating and an error was shown in the flowlogs-pipeline logs. With this fix, you can modify the labels, and the error is no longer raised in the flowlogs-pipeline logs. (NETOBSERV-1748)

  • Previously, there was an inconsistency with the default Loki WriteBatchSize configuration: it was set to 100 KB in the FlowCollector CRD default, and 10 MB in the OLM sample or default configuration. Both are now aligned to 10 MB, which generally provides better performances and less resource footprint. (NETOBSERV-1766)

  • Previously, the eBPF flow filter on ports was ignored if you did not specify a protocol. With this fix, you can set eBPF flow filters independently on ports and or protocols. (NETOBSERV-1779)

  • Previously, traffic from Pods to Services was hidden from the Topology view. Only the return traffic from Services to Pods was visible. With this fix, that traffic is correctly displayed. (NETOBSERV-1788)

  • Previously, non-cluster administrator users that had access to Network Observability saw an error in the console plugin when they tried to filter for something that triggered auto-completion, such as a namespace. With this fix, no error is displayed, and the auto-completion returns the expected results. (NETOBSERV-1798)

  • When the secondary interface support was added, you had to iterate multiple times to register the per network namespace with the netlink to learn about interface notifications. At the same time, unsuccessful handlers caused a leaking file descriptor because with TCX hook, unlike TC, handlers needed to be explicitly removed when the interface went down. Furthermore, when the network namespace was deleted, there was no Go close channel event to terminate the netlink goroutine socket, which caused go threads to leak. Now, there are no longer leaking file descriptors or go threads when you create or delete pods. (NETOBSERV-1805)

  • Previously, the ICMP type and value were displaying 'n/a' in the Traffic flows table even when related data was available in the flow JSON. With this fix, ICMP columns display related values as expected in the flow table. (NETOBSERV-1806)

  • Previously in the console plugin, it wasn’t always possible to filter for unset fields, such as unset DNS latency. With this fix, filtering on unset fields is now possible. (NETOBSERV-1816)

  • Previously, when you cleared filters in the OpenShift web console plugin, sometimes the filters reappeared after you navigated to another page and returned to the page with filters. With this fix, filters do not unexpectedly reappear after they are cleared. (NETOBSERV-1733)

Known issues

  • WWhen you use the must-gather tool with Network Observability, logs are not collected when the cluster has FIPS enabled. (NETOBSERV-1830)

  • When the spec.networkPolicy is enabled in the FlowCollector, which installs a network policy on the netobserv namespace, it is impossible to use the FlowMetrics API. The network policy blocks calls to the validation webhook. As a workaround, use the following network policy:

    kind: NetworkPolicy
    apiVersion: networking.k8s.io/v1
    metadata:
      name: allow-from-hostnetwork
      namespace: netobserv
    spec:
      podSelector:
        matchLabels:
          app: netobserv-operator
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  policy-group.network.openshift.io/host-network: ''
      policyTypes:
        - Ingress

Network Observability Operator 1.6.2

The following advisory is available for the Network Observability Operator 1.6.2:

Bug fixes

  • When the secondary interface support was added, there was a need to iterate multiple times to register the per network namespace with the netlink to learn about interface notifications. At the same time, unsuccessful handlers caused a leaking file descriptor because with TCX hook, unlike TC, handlers needed to be explicitly removed when the interface went down. Now, there is no longer leaking file descriptors when creating and deleting pods. (NETOBSERV-1805)

Known issues

There was a compatibility issue with console plugins that would have prevented Network Observability from being installed on future versions of an OpenShift Container Platform cluster. By upgrading to 1.6.2, the compatibility issue is resolved and Network Observability can be installed as expected. (NETOBSERV-1737)

Network Observability Operator 1.6.1

The following advisory is available for the Network Observability Operator 1.6.1:

Bug fixes

  • Previously, information about packet drops, such as the cause and TCP state, was only available in the Loki datastore and not in Prometheus. For that reason, the drop statistics in the OpenShift web console plugin Overview was only available with Loki. With this fix, information about packet drops is also added to metrics, so you can view drops statistics when Loki is disabled. (NETOBSERV-1649)

  • When the eBPF agent PacketDrop feature was enabled, and sampling was configured to a value greater than 1, reported dropped bytes and dropped packets ignored the sampling configuration. While this was done on purpose, so as not to miss any drops, a side effect was that the reported proportion of drops compared with non-drops became biased. For example, at a very high sampling rate, such as 1:1000, it was likely that almost all the traffic appears to be dropped when observed from the console plugin. With this fix, the sampling configuration is honored with dropped bytes and packets. (NETOBSERV-1676)

  • Previously, the SR-IOV secondary interface was not detected if the interface was created first and then the eBPF agent was deployed. It was only detected if the agent was deployed first and then the SR-IOV interface was created. With this fix, the SR-IOV secondary interface is detected no matter the sequence of the deployments. (NETOBSERV-1697)

  • Previously, when Loki was disabled, the Topology view in the OpenShift web console displayed the Cluster and Zone aggregation options in the slider beside the network topology diagram, even when the related features were not enabled. With this fix, the slider now only displays options according to the enabled features. (NETOBSERV-1705)

  • Previously, when Loki was disabled, and the OpenShift web console was first loading, an error would occur: Request failed with status code 400 Loki is disabled. With this fix, the errors no longer occur. (NETOBSERV-1706)

  • Previously, in the Topology view of the OpenShift web console, when clicking on the Step into icon next to any graph node, the filters were not applied as required in order to set the focus to the selected graph node, resulting in showing a wide view of the Topology view in the OpenShift web console. With this fix, the filters are correctly set, effectively narrowing down the Topology. As part of this change, clicking the Step into icon on a Node now brings you to the Resource scope instead of the Namespaces scope. (NETOBSERV-1720)

  • Previously, when Loki was disabled, in the Topology view of the OpenShift web console with the Scope set to Owner, clicking on the Step into icon next to any graph node would bring the Scope to Resource, which is not available without Loki, so an error message was shown. With this fix, the Step into icon is hidden in the Owner scope when Loki is disabled, so this scenario no longer occurs.(NETOBSERV-1721)

  • Previously, when Loki was disabled, an error was displayed in the Topology view of the OpenShift web console when a group was set, but then the scope was changed so that the group becomes invalid. With this fix, the invalid group is removed, preventing the error. (NETOBSERV-1722)

  • When creating a FlowCollector resource from the OpenShift web console Form view, as opposed to the YAML view, the following settings were incorrectly managed by the web console: agent.ebpf.metrics.enable and processor.subnetLabels.openShiftAutoDetect. These settings can only be disabled in the YAML view, not in the Form view. To avoid any confusion, these settings have been removed from the Form view. They are still accessible in the YAML view. (NETOBSERV-1731)

  • Previously, the eBPF agent was unable to clean up traffic control flows installed before an ungraceful crash, for example a crash due to a SIGTERM signal. This led to the creation of multiple traffic control flow filters with the same name, since the older ones were not removed. With this fix, all previously installed traffic control flows are cleaned up when the agent starts, before installing new ones. (NETOBSERV-1732)

  • Previously, when configuring custom subnet labels and keeping the OpenShift subnets auto-detection enabled, OpenShift subnets would take precedence over the custom ones, preventing the definition of custom labels for in cluster subnets. With this fix, custom defined subnets take precedence, allowing the definition of custom labels for in cluster subnets. (NETOBSERV-1734)

Network Observability Operator 1.6.0

The following advisory is available for the Network Observability Operator 1.6.0:

Before upgrading to the latest version of the Network Observability Operator, you must Migrate removed stored versions of the FlowCollector CRD. An automated solution to this workaround is planned with NETOBSERV-1747.

New features and enhancements

Enhanced use of Network Observability Operator without Loki

You can now use Prometheus metrics and rely less on Loki for storage when using the Network Observability Operator. For more information, see Network Observability without Loki.

Custom metrics API

You can create custom metrics out of flowlogs data by using the FlowMetrics API. Flowlogs data can be used with Prometheus labels to customize cluster information on your dashboards. You can add custom labels for any subnet that you want to identify in your flows and metrics. This enhancement can also be used to more easily identify external traffic by using the new labels SrcSubnetLabel and DstSubnetLabel, which exists both in flow logs and in metrics. Those fields are empty when there is external traffic, which gives a way to identify it. For more information, see Custom metrics and FlowMetric API reference.

eBPF performance enhancements

Experience improved performances of the eBPF agent, in terms of CPU and memory, with the following updates:

  • The eBPF agent now uses TCX webhooks instead of TC.

  • The NetObserv / Health dashboard has a new section that shows eBPF metrics.

    • Based on the new eBPF metrics, an alert notifies you when the eBPF agent is dropping flows.

  • Loki storage demand decreases significantly now that duplicated flows are removed. Instead of having multiple, individual duplicated flows per network interface, there is one de-duplicated flow with a list of related network interfaces.

With the duplicated flows update, the Interface and Interface Direction fields in the Network Traffic table are renamed to Interfaces and Interface Directions, so any bookmarked Quick filter queries using these fields need to be updated to interfaces and ifdirections.

For more information, see Using the eBPF agent alert and Quick filters.

eBPF collection rule-based filtering

You can use rule-based filtering to reduce the volume of created flows. When this option is enabled, the Netobserv / Health dashboard for eBPF agent statistics has the Filtered flows rate view. For more information, see eBPF flow rule filter.

Technology Preview features

Some features in this release are currently in Technology Preview. These experimental features are not intended for production use. Note the following scope of support on the Red Hat Customer Portal for these features:

Network Observability CLI

You can debug and troubleshoot network traffic issues without needing to install the Network Observability Operator by using the Network Observability CLI. Capture and visualize flow and packet data in real-time with no persistent storage requirement during the capture. For more information, see Network Observability CLI and Network Observability CLI 1.6.0.

Bug fixes

  • Previously, a dead link to the OpenShift containter platform documentation was displayed in the Operator Lifecycle Manager (OLM) form for the FlowMetrics API creation. Now the link has been updated to point to a valid page. (NETOBSERV-1607)

  • Previously, the Network Observability Operator description in the Operator Hub displayed a broken link to the documentation. With this fix, this link is restored. (NETOBSERV-1544)

  • Previously, if Loki was disabled and the Loki Mode was set to LokiStack, or if Loki manual TLS configuration was configured, the Network Observability Operator still tried to read the Loki CA certificates. With this fix, when Loki is disabled, the Loki certificates are not read, even if there are settings in the Loki configuration. (NETOBSERV-1647)

  • Previously, the oc must-gather plugin for the Network Observability Operator was only working on the amd64 architecture and failing on all others because the plugin was using amd64 for the oc binary. Now, the Network Observability Operator oc must-gather plugin collects logs on any architecture platform.

  • Previously, when filtering on IP addresses using not equal to, the Network Observability Operator would return a request error. Now, the IP filtering works in both equal and not equal to cases for IP addresses and ranges. (NETOBSERV-1630)

  • Previously, when a user was not an admin, the error messages were not consistent with the selected tab of the Network Traffic view in the web console. Now, the user not admin error displays on any tab with improved display.(NETOBSERV-1621)

Known issues

  • When the eBPF agent PacketDrop feature is enabled, and sampling is configured to a value greater than 1, reported dropped bytes and dropped packets ignore the sampling configuration. While this is done on purpose to not miss any drops, a side effect is that the reported proportion of drops compared to non-drops becomes biased. For example, at a very high sampling rate, such as 1:1000, it is likely that almost all the traffic appears to be dropped when observed from the console plugin. (NETOBSERV-1676)

  • In the Manage panels pop-up window in the Overview tab, filtering on total, bar, donut, or line does not show any result. (NETOBSERV-1540)

  • The SR-IOV secondary interface is not detected if the interface was created first and then the eBPF agent was deployed. It is only detected if the agent was deployed first and then the SR-IOV interface is created. (NETOBSERV-1697)

  • When Loki is disabled, the Topology view in the OpenShift web console always shows the Cluster and Zone aggregation options in the slider beside the network topology diagram, even when the related features are not enabled. There is no specific workaround, besides ignoring these slider options. (NETOBSERV-1705)

  • When Loki is disabled, and the OpenShift web console first loads, it might display an error: Request failed with status code 400 Loki is disabled. As a workaround, you can continue switching content on the Network Traffic page, such as clicking between the Topology and the Overview tabs. The error should disappear. (NETOBSERV-1706)

Network Observability Operator 1.5.0

The following advisory is available for the Network Observability Operator 1.5.0:

New features and enhancements

DNS tracking enhancements

In 1.5, the TCP protocol is now supported in addition to UDP. New dashboards are also added to the Overview view of the Network Traffic page. For more information, see Configuring DNS tracking and Working with DNS tracking.

Round-trip time (RTT)

You can use TCP handshake Round-Trip Time (RTT) captured from the fentry/tcp_rcv_established Extended Berkeley Packet Filter (eBPF) hookpoint to read smoothed round-trip time (SRTT) and analyze network flows. In the Overview, Network Traffic, and Topology pages in web console, you can monitor network traffic and troubleshoot with RTT metrics, filtering, and edge labeling. For more information, see RTT Overview and Working with RTT.

Metrics, dashboards, and alerts enhancements

The Network Observability metrics dashboards in ObserveDashboardsNetObserv have new metrics types you can use to create Prometheus alerts. You can now define available metrics in the includeList specification. In previous releases, these metrics were defined in the ignoreTags specification. For a complete list of these metrics, see Network Observability Metrics.

Improvements for Network Observability without Loki

You can create Prometheus alerts for the Netobserv dashboard using DNS, Packet drop, and RTT metrics, even if you don’t use Loki. In the previous version of Network Observability, 1.4, these metrics were only available for querying and analysis in the Network Traffic, Overview, and Topology views, which are not available without Loki. For more information, see Network Observability Metrics.

Availability zones

You can configure the FlowCollector resource to collect information about the cluster availability zones. This configuration enriches the network flow data with the topology.kubernetes.io/zone label value applied to the nodes. For more information, see Working with availability zones.

Notable enhancements

The 1.5 release of the Network Observability Operator adds improvements and new capabilities to the OpenShift Container Platform web console plugin and the Operator configuration.

Performance enhancements
  • The spec.agent.ebpf.kafkaBatchSize default is changed from 10MB to 1MB to enhance eBPF performance when using Kafka.

    When upgrading from an existing installation, this new value is not set automatically in the configuration. If you monitor a performance regression with the eBPF Agent memory consumption after upgrading, you might consider reducing the kafkaBatchSize to the new value.

Web console enhancements:
  • There are new panels added to the Overview view for DNS and RTT: Min, Max, P90, P99.

  • There are new panel display options added:

    • Focus on one panel while keeping others viewable but with smaller focus.

    • Switch graph type.

    • Show Top and Overall.

  • A collection latency warning is shown in the Custom time range pop-up window.

  • There is enhanced visibility for the contents of the Manage panels and Manage columns pop-up windows.

  • The Differentiated Services Code Point (DSCP) field for egress QoS is available for filtering QoS DSCP in the web console Network Traffic page.

Configuration enhancements:
  • The LokiStack mode in the spec.loki.mode specification simplifies installation by automatically setting URLs, TLS, cluster roles and a cluster role binding, as well as the authToken value. The Manual mode allows more control over configuration of these settings.

  • The API version changes from flows.netobserv.io/v1beta1 to flows.netobserv.io/v1beta2.

Bug fixes

  • Previously, it was not possible to register the console plugin manually in the web console interface if the automatic registration of the console plugin was disabled. If the spec.console.register value was set to false in the FlowCollector resource, the Operator would override and erase the plugin registration. With this fix, setting the spec.console.register value to false does not impact the console plugin registration or registration removal. As a result, the plugin can be safely registered manually. (NETOBSERV-1134)

  • Previously, using the default metrics settings, the NetObserv/Health dashboard was showing an empty graph named Flows Overhead. This metric was only available by removing "namespaces-flows" and "namespaces" from the ignoreTags list. With this fix, this metric is visible when you use the default metrics setting. (NETOBSERV-1351)

  • Previously, the node on which the eBPF Agent was running would not resolve with a specific cluster configuration. This resulted in cascading consequences that culminated in a failure to provide some of the traffic metrics. With this fix, the eBPF agent’s node IP is safely provided by the Operator, inferred from the pod status. Now, the missing metrics are restored. (NETOBSERV-1430)

  • Previously, the Loki error 'Input size too long' error for the Loki Operator did not include additional information to troubleshoot the problem. With this fix, help is directly displayed in the web console next to the error with a direct link for more guidance. (NETOBSERV-1464)

  • Previously, the console plugin read timeout was forced to 30s. With the FlowCollector v1beta2 API update, you can configure the spec.loki.readTimeout specification to update this value according to the Loki Operator queryTimeout limit. (NETOBSERV-1443)

  • Previously, the Operator bundle did not display some of the supported features by CSV annotations as expected, such as features.operators.openshift.io/…​ With this fix, these annotations are set in the CSV as expected. (NETOBSERV-1305)

  • Previously, the FlowCollector status sometimes oscillated between DeploymentInProgress and Ready states during reconciliation. With this fix, the status only becomes Ready when all of the underlying components are fully ready. (NETOBSERV-1293)

Known issues

  • When trying to access the web console, cache issues on OCP 4.14.10 prevent access to the Observe view. The web console shows the error message: Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/. The recommended workaround is to update the cluster to the latest minor version. If this does not work, you need to apply the workarounds described in this Red Hat Knowledgebase article.(NETOBSERV-1493)

  • Since the 1.3.0 release of the Network Observability Operator, installing the Operator causes a warning kernel taint to appear. The reason for this error is that the Network Observability eBPF agent has memory constraints that prevent preallocating the entire hashmap table. The Operator eBPF agent sets the BPF_F_NO_PREALLOC flag so that pre-allocation is disabled when the hashmap is too memory expansive.

Network Observability Operator 1.4.2

The following advisory is available for the Network Observability Operator 1.4.2:

Network Observability Operator 1.4.1

The following advisory is available for the Network Observability Operator 1.4.1:

Bug fixes

  • In 1.4, there was a known issue when sending network flow data to Kafka. The Kafka message key was ignored, causing an error with connection tracking. Now the key is used for partitioning, so each flow from the same connection is sent to the same processor. (NETOBSERV-926)

  • In 1.4, the Inner flow direction was introduced to account for flows between pods running on the same node. Flows with the Inner direction were not taken into account in the generated Prometheus metrics derived from flows, resulting in under-evaluated bytes and packets rates. Now, derived metrics are including flows with the Inner direction, providing correct bytes and packets rates. (NETOBSERV-1344)

Network Observability Operator 1.4.0

The following advisory is available for the Network Observability Operator 1.4.0:

Channel removal

You must switch your channel from v1.0.x to stable to receive the latest Operator updates. The v1.0.x channel is now removed.

New features and enhancements

Notable enhancements

The 1.4 release of the Network Observability Operator adds improvements and new capabilities to the OpenShift Container Platform web console plugin and the Operator configuration.

Web console enhancements:
  • In the Query Options, the Duplicate flows checkbox is added to choose whether or not to show duplicated flows.

  • You can now filter source and destination traffic with arrow up long solid One-way, arrow up long solid arrow down long solid Back-and-forth, and Swap filters.

  • The Network Observability metrics dashboards in ObserveDashboardsNetObserv and NetObserv / Health are modified as follows:

    • The NetObserv dashboard shows top bytes, packets sent, packets received per nodes, namespaces, and workloads. Flow graphs are removed from this dashboard.

    • The NetObserv / Health dashboard shows flows overhead as well as top flow rates per nodes, namespaces, and workloads.

    • Infrastructure and Application metrics are shown in a split-view for namespaces and workloads.

For more information, see Network Observability metrics and Quick filters.

Configuration enhancements:
  • You now have the option to specify different namespaces for any configured ConfigMap or Secret reference, such as in certificates configuration.

  • The spec.processor.clusterName parameter is added so that the name of the cluster appears in the flows data. This is useful in a multi-cluster context. When using OpenShift Container Platform, leave empty to make it automatically determined.

Network Observability without Loki

The Network Observability Operator is now functional and usable without Loki. If Loki is not installed, it can only export flows to KAFKA or IPFIX format and provide metrics in the Network Observability metrics dashboards. For more information, see Network Observability without Loki.

DNS tracking

In 1.4, the Network Observability Operator makes use of eBPF tracepoint hooks to enable DNS tracking. You can monitor your network, conduct security analysis, and troubleshoot DNS issues in the Network Traffic and Overview pages in the web console.

For more information, see Configuring DNS tracking and Working with DNS tracking.

SR-IOV support

You can now collect traffic from a cluster with Single Root I/O Virtualization (SR-IOV) device. For more information, see Configuring the monitoring of SR-IOV interface traffic.

IPFIX exporter support

You can now export eBPF-enriched network flows to the IPFIX collector. For more information, see Export enriched network flow data.

Packet drops

In the 1.4 release of the Network Observability Operator, eBPF tracepoint hooks are used to enable packet drop tracking. You can now detect and analyze the cause for packet drops and make decisions to optimize network performance. In OpenShift Container Platform 4.14 and later, both host drops and OVS drops are detected. In OpenShift Container Platform 4.13, only host drops are detected. For more information, see Configuring packet drop tracking and Working with packet drops.

s390x architecture support

Network Observability Operator can now run on s390x architecture. Previously it ran on amd64, ppc64le, or arm64.

Bug fixes

  • Previously, the Prometheus metrics exported by Network Observability were computed out of potentially duplicated network flows. In the related dashboards, from ObserveDashboards, this could result in potentially doubled rates. Note that dashboards from the Network Traffic view were not affected. Now, network flows are filtered to eliminate duplicates before metrics calculation, which results in correct traffic rates displayed in the dashboards. (NETOBSERV-1131)

  • Previously, the Network Observability Operator agents were not able to capture traffic on network interfaces when configured with Multus or SR-IOV, non-default network namespaces. Now, all available network namespaces are recognized and used for capturing flows, allowing capturing traffic for SR-IOV. There are configurations needed for the FlowCollector and SRIOVnetwork custom resource to collect traffic. (NETOBSERV-1283)

  • Previously, in the Network Observability Operator details from OperatorsInstalled Operators, the FlowCollector Status field might have reported incorrect information about the state of the deployment. The status field now shows the proper conditions with improved messages. The history of events is kept, ordered by event date. (NETOBSERV-1224)

  • Previously, during spikes of network traffic load, certain eBPF pods were OOM-killed and went into a CrashLoopBackOff state. Now, the eBPF agent memory footprint is improved, so pods are not OOM-killed and entering a CrashLoopBackOff state. (NETOBSERV-975)

  • Previously when processor.metrics.tls was set to PROVIDED the insecureSkipVerify option value was forced to be true. Now you can set insecureSkipVerify to true or false, and provide a CA certificate if needed. (NETOBSERV-1087)

Known issues

  • Since the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate change periodically affects the flowlogs-pipeline pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate change. This issue has only been observed in large-scale environments of 120 nodes or greater. (NETOBSERV-980)

  • Currently, when spec.agent.ebpf.features includes DNSTracking, larger DNS packets require the eBPF agent to look for DNS header outside of the 1st socket buffer (SKB) segment. A new eBPF agent helper function needs to be implemented to support it. Currently, there is no workaround for this issue. (NETOBSERV-1304)

  • Currently, when spec.agent.ebpf.features includes DNSTracking, DNS over TCP packets requires the eBPF agent to look for DNS header outside of the 1st SKB segment. A new eBPF agent helper function needs to be implemented to support it. Currently, there is no workaround for this issue. (NETOBSERV-1245)

  • Currently, when using a KAFKA deployment model, if conversation tracking is configured, conversation events might be duplicated across Kafka consumers, resulting in inconsistent tracking of conversations, and incorrect volumetric data. For that reason, it is not recommended to configure conversation tracking when deploymentModel is set to KAFKA. (NETOBSERV-926)

  • Currently, when the processor.metrics.server.tls.type is configured to use a PROVIDED certificate, the operator enters an unsteady state that might affect its performance and resource consumption. It is recommended to not use a PROVIDED certificate until this issue is resolved, and instead using an auto-generated certificate, setting processor.metrics.server.tls.type to AUTO. (NETOBSERV-1293

  • Since the 1.3.0 release of the Network Observability Operator, installing the Operator causes a warning kernel taint to appear. The reason for this error is that the Network Observability eBPF agent has memory constraints that prevent preallocating the entire hashmap table. The Operator eBPF agent sets the BPF_F_NO_PREALLOC flag so that pre-allocation is disabled when the hashmap is too memory expansive.

Network Observability Operator 1.3.0

The following advisory is available for the Network Observability Operator 1.3.0:

Channel deprecation

You must switch your channel from v1.0.x to stable to receive future Operator updates. The v1.0.x channel is deprecated and planned for removal in the next release.

New features and enhancements

Multi-tenancy in Network Observability

Flow-based metrics dashboard

  • This release adds a new dashboard, which provides an overview of the network flows in your OpenShift Container Platform cluster. For more information, see Network Observability metrics.

Troubleshooting with the must-gather tool

  • Information about the Network Observability Operator can now be included in the must-gather data for troubleshooting. For more information, see Network Observability must-gather.

Multiple architectures now supported

  • Network Observability Operator can now run on an amd64, ppc64le, or arm64 architectures. Previously, it only ran on amd64.

Deprecated features

Deprecated configuration parameter setting

The release of Network Observability Operator 1.3 deprecates the spec.Loki.authToken HOST setting. When using the Loki Operator, you must now only use the FORWARD setting.

Bug fixes

  • Previously, when the Operator was installed from the CLI, the Role and RoleBinding that are necessary for the Cluster Monitoring Operator to read the metrics were not installed as expected. The issue did not occur when the operator was installed from the web console. Now, either way of installing the Operator installs the required Role and RoleBinding. (NETOBSERV-1003)

  • Since version 1.2, the Network Observability Operator can raise alerts when a problem occurs with the flows collection. Previously, due to a bug, the related configuration to disable alerts, spec.processor.metrics.disableAlerts was not working as expected and sometimes ineffectual. Now, this configuration is fixed so that it is possible to disable the alerts. (NETOBSERV-976)

  • Previously, when Network Observability was configured with spec.loki.authToken set to DISABLED, only a kubeadmin cluster administrator was able to view network flows. Other types of cluster administrators received authorization failure. Now, any cluster administrator is able to view network flows. (NETOBSERV-972)

  • Previously, a bug prevented users from setting spec.consolePlugin.portNaming.enable to false. Now, this setting can be set to false to disable port-to-service name translation. (NETOBSERV-971)

  • Previously, the metrics exposed by the console plugin were not collected by the Cluster Monitoring Operator (Prometheus), due to an incorrect configuration. Now the configuration has been fixed so that the console plugin metrics are correctly collected and accessible from the OpenShift Container Platform web console. (NETOBSERV-765)

  • Previously, when processor.metrics.tls was set to AUTO in the FlowCollector, the flowlogs-pipeline servicemonitor did not adapt the appropriate TLS scheme, and metrics were not visible in the web console. Now the issue is fixed for AUTO mode. (NETOBSERV-1070)

  • Previously, certificate configuration, such as used for Kafka and Loki, did not allow specifying a namespace field, implying that the certificates had to be in the same namespace where Network Observability is deployed. Moreover, when using Kafka with TLS/mTLS, the user had to manually copy the certificate(s) to the privileged namespace where the eBPF agent pods are deployed and manually manage certificate updates, such as in the case of certificate rotation. Now, Network Observability setup is simplified by adding a namespace field for certificates in the FlowCollector resource. As a result, users can now install Loki or Kafka in different namespaces without needing to manually copy their certificates in the Network Observability namespace. The original certificates are watched so that the copies are automatically updated when needed. (NETOBSERV-773)

  • Previously, the SCTP, ICMPv4 and ICMPv6 protocols were not covered by the Network Observability agents, resulting in a less comprehensive network flows coverage. These protocols are now recognized to improve the flows coverage. (NETOBSERV-934)

Known issues

  • When processor.metrics.tls is set to PROVIDED in the FlowCollector, the flowlogs-pipeline servicemonitor is not adapted to the TLS scheme. (NETOBSERV-1087)

  • Since the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate change periodically affects the flowlogs-pipeline pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate change. This issue has only been observed in large-scale environments of 120 nodes or greater.(NETOBSERV-980)

  • When you install the Operator, a warning kernel taint can appear. The reason for this error is that the Network Observability eBPF agent has memory constraints that prevent preallocating the entire hashmap table. The Operator eBPF agent sets the BPF_F_NO_PREALLOC flag so that pre-allocation is disabled when the hashmap is too memory expansive.

Network Observability Operator 1.2.0

The following advisory is available for the Network Observability Operator 1.2.0:

Preparing for the next update

The subscription of an installed Operator specifies an update channel that tracks and receives updates for the Operator. Until the 1.2 release of the Network Observability Operator, the only channel available was v1.0.x. The 1.2 release of the Network Observability Operator introduces the stable update channel for tracking and receiving updates. You must switch your channel from v1.0.x to stable to receive future Operator updates. The v1.0.x channel is deprecated and planned for removal in a following release.

New features and enhancements

Histogram in Traffic Flows view

  • You can now choose to show a histogram bar chart of flows over time. The histogram enables you to visualize the history of flows without hitting the Loki query limit. For more information, see Using the histogram.

Conversation tracking

  • You can now query flows by Log Type, which enables grouping network flows that are part of the same conversation. For more information, see Working with conversations.

Network Observability health alerts

  • The Network Observability Operator now creates automatic alerts if the flowlogs-pipeline is dropping flows because of errors at the write stage or if the Loki ingestion rate limit has been reached. For more information, see Health dashboards.

Bug fixes

  • Previously, after changing the namespace value in the FlowCollector spec, eBPF agent pods running in the previous namespace were not appropriately deleted. Now, the pods running in the previous namespace are appropriately deleted. (NETOBSERV-774)

  • Previously, after changing the caCert.name value in the FlowCollector spec (such as in Loki section), FlowLogs-Pipeline pods and Console plug-in pods were not restarted, therefore they were unaware of the configuration change. Now, the pods are restarted, so they get the configuration change. (NETOBSERV-772)

  • Previously, network flows between pods running on different nodes were sometimes not correctly identified as being duplicates because they are captured by different network interfaces. This resulted in over-estimated metrics displayed in the console plug-in. Now, flows are correctly identified as duplicates, and the console plug-in displays accurate metrics. (NETOBSERV-755)

  • The "reporter" option in the console plug-in is used to filter flows based on the observation point of either source node or destination node. Previously, this option mixed the flows regardless of the node observation point. This was due to network flows being incorrectly reported as Ingress or Egress at the node level. Now, the network flow direction reporting is correct. The "reporter" option filters for source observation point, or destination observation point, as expected. (NETOBSERV-696)

  • Previously, for agents configured to send flows directly to the processor as gRPC+protobuf requests, the submitted payload could be too large and is rejected by the processors' GRPC server. This occurred under very-high-load scenarios and with only some configurations of the agent. The agent logged an error message, such as: grpc: received message larger than max. As a consequence, there was information loss about those flows. Now, the gRPC payload is split into several messages when the size exceeds a threshold. As a result, the server maintains connectivity. (NETOBSERV-617)

Known issue

  • In the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate transition periodically affects the flowlogs-pipeline pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate transition. (NETOBSERV-980)

Notable technical changes

  • Previously, you could install the Network Observability Operator using a custom namespace. This release introduces the conversion webhook which changes the ClusterServiceVersion. Because of this change, all the available namespaces are no longer listed. Additionally, to enable Operator metrics collection, namespaces that are shared with other Operators, like the openshift-operators namespace, cannot be used. Now, the Operator must be installed in the openshift-netobserv-operator namespace. You cannot automatically upgrade to the new Operator version if you previously installed the Network Observability Operator using a custom namespace. If you previously installed the Operator using a custom namespace, you must delete the instance of the Operator that was installed and re-install your operator in the openshift-netobserv-operator namespace. It is important to note that custom namespaces, such as the commonly used netobserv namespace, are still possible for the FlowCollector, Loki, Kafka, and other plug-ins. (NETOBSERV-907)(NETOBSERV-956)

Network Observability Operator 1.1.0

The following advisory is available for the Network Observability Operator 1.1.0:

The Network Observability Operator is now stable and the release channel is upgraded to v1.1.0.

Bug fix

  • Previously, unless the Loki authToken configuration was set to FORWARD mode, authentication was no longer enforced, allowing any user who could connect to the OpenShift Container Platform console in an OpenShift Container Platform cluster to retrieve flows without authentication. Now, regardless of the Loki authToken mode, only cluster administrators can retrieve flows. (BZ#2169468)