Refresh architecture doc #1022
## Contributing

Please refer to [NetObserv projects contribution guide](https://github.com/netobserv/documents/blob/main/CONTRIBUTING.md).
# Network Observability Architecture

The Network Observability solution consists of a [Network Observability Operator (NOO)](https://github.com/netobserv/network-observability-operator)
that deploys, configures and controls the status of the following components:

* [Network Observability eBPF Agent](https://github.com/netobserv/netobserv-ebpf-agent/)
  * It is attached to all the interfaces in the host network and listens for each network packet
    sent or received on their egress/ingress. The agent aggregates packets by source
    and destination address, protocol, etc. into network flows that are submitted to the
    Flowlogs-Pipeline flow processor.
* [Network Observability Flowlogs-Pipeline (FLP)](https://github.com/netobserv/flowlogs-pipeline)
  * It receives the raw flows from the agent, decorates them with Kubernetes information (Pod
    and host names, namespaces, etc.), and stores them as JSON in a [Loki](https://grafana.com/oss/loki/)
    instance.
* [Network Observability Console Plugin](https://github.com/netobserv/network-observability-console-plugin)
  * It is attached to the OpenShift console as a plugin (see Figure 1), though it can also be
    deployed in standalone mode. The Console Plugin queries the flow information stored in Loki
    and allows filtering flows, showing network topologies, etc.

Figure 1: Console Plugin deployment

There are two deployment modes for Network Observability: direct mode and Kafka mode.

## Direct-mode deployment

In direct mode (Figure 2), the eBPF agent sends the flow information to Flowlogs-Pipeline encoded as Protocol Buffers (a binary representation) via [gRPC](https://grpc.io/). In this scenario, Flowlogs-Pipeline is usually deployed as a DaemonSet, so there is a 1:1 communication between the Agent and FLP internal to each host, which minimizes cluster network usage.

Figure 2: Direct deployment

## Kafka-mode deployment

In Kafka mode (Figure 3), the communication between the eBPF agent and FLP is done via a Kafka topic.

Figure 3: Kafka deployment

This has some advantages over the direct mode:
1. Flows are buffered in the Kafka topic, so if there is a peak of flows, FLP will still receive and process them without any kind of denial of service.
2. Flows are persisted in the topic, so if FLP is restarted for any reason (a configuration update or a crash), the forwarded flows remain in Kafka for later processing, and we don't lose them.
3. When FLP is deployed as a Deployment, you don't have to keep the 1:1 proportion: you can scale FLP pods up and down according to your load.
# NetObserv architecture

_See also: [architecture in the downstream documentation](https://docs.openshift.com/container-platform/latest/observability/network_observability/understanding-network-observability-operator.html#network-observability-architecture_nw-network-observability-operator)_

NetObserv is a collection of components that can run independently or together as a whole.

The components are:

- An [eBPF agent](https://github.com/netobserv/netobserv-ebpf-agent), which generates network flows from captured packets.
  - It is attached to any or all of the network interfaces in the host, and listens for packets (ingress and egress) with [eBPF](https://ebpf.io/).
  - Packets are aggregated into logical flows (similar to NetFlows), periodically exported to a collector, generally FLP.
  - Optional features can add rich data, such as TCP latency or DNS information.
  - It is able to correlate those flows with other events such as network policy rules and drops (network policy correlation requires the [OVN Kubernetes](https://github.com/ovn-org/ovn-kubernetes/) network plugin).
  - When used with the CLI or as a standalone, the agent can also do full packet captures instead of generating logical flows.
- [Flowlogs-pipeline](https://github.com/netobserv/flowlogs-pipeline) (FLP), a component that collects, enriches and exports these flows.
  - It uses Kubernetes informers to enrich flows with details such as Pod names, namespaces, availability zones, etc.
  - It derives all flows into metric counters, for Prometheus.
  - Raw flows can be exported to Loki and/or custom exporters (Kafka, IPFIX, OpenTelemetry).
  - As a standalone, FLP is very flexible and configurable. It supports more inputs and outputs, and allows arbitrary filters, sampling, aggregations, relabelling, etc. When deployed via the operator, only a subset of its capabilities is used.
- When used in OpenShift, [a Console plugin](https://github.com/netobserv/network-observability-console-plugin) provides flow visualization with powerful filtering options, a topology representation and more (outside of OpenShift, [it can be deployed as a standalone](https://github.com/netobserv/network-observability-operator/blob/main/FAQ.md#how-do-i-visualize-flows-and-metrics)).
  - It provides a polished web UI to visualize and explore the flow logs and metrics stored in Loki and/or Prometheus.
  - Different views include a metrics overview, a network topology and a table listing raw flow logs.
  - It supports multi-tenant access, making it relevant for various use cases: cluster/network admins, SREs, development teams...
- [An operator](https://github.com/netobserv/network-observability-operator) that manages all of the above.
  - It provides two APIs (CRDs): [FlowCollector](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowCollector.md), which configures and pilots the whole deployment, and [FlowMetrics](https://github.com/netobserv/network-observability-operator/blob/main/docs/FlowMetric.md), which lets you customize which metrics to generate out of flow logs.
  - As an [OLM operator](https://olm.operatorframework.io/), it is designed with `operator-sdk`, and allows subscriptions for easy updates.
- [A CLI](https://github.com/netobserv/network-observability-cli) that also manages some of the above components, for on-demand monitoring and packet capture.
  - It is provided as a `kubectl` or `oc` plugin, letting you capture flows (similar to what the operator does, except on-demand and in the terminal), full packets (much like a `tcpdump` command) or metrics.
  - It is also available via [Krew](https://krew.sigs.k8s.io/).
  - It offers live visualization via a TUI. For metrics, when used in OpenShift, it provides out-of-the-box dashboards.
  - Check out the blog post: [Network observability on demand](https://developers.redhat.com/articles/2024/09/17/network-observability-demand#what_is_the_network_observability_cli_).
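As an illustration of the FlowMetrics API mentioned above, here is an abridged sketch of a custom metric that counts ingress bytes per destination namespace. The field names follow the FlowMetric documentation linked above, but the metric name and label choices here are hypothetical; refer to that documentation for the full schema:

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: ingress-bytes-per-namespace   # hypothetical example name
  namespace: netobserv
spec:
  metricName: ingress_bytes_per_namespace_total
  type: Counter
  valueField: Bytes                   # sum the Bytes field of each flow log
  direction: Ingress
  labels: [DstK8S_Namespace]          # one counter series per destination namespace
```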

## Direct deployment model

When using the operator with the `FlowCollector` `spec.deploymentModel` set to `Direct`, agents and FLP are both deployed per node (as `DaemonSets`). This is perfect for an assessment of the technology and suitable for small clusters, but it isn't very memory efficient on large clusters, as every instance of FLP ends up caching the same cluster information, which can be huge.

Note that Loki isn't managed by the operator and must be installed separately, such as with the Loki operator. The same goes for Prometheus and any custom receiver.

<!-- You can use https://mermaid.live/ to test it -->

```mermaid
flowchart TD
    subgraph "for each node"
    A[eBPF Agent] -->|generates flows| F[FLP]
    end
    F -. exports .-> E[(Kafka/Otlp/IPFIX)]
    F -->|raw logs| L[(Loki)]
    F -->|metrics| P[(Prometheus)]
    C[Console plugin] <-->|fetches| L
    C <-->|fetches| P
    O[Operator] -->|manages| A
    O -->|manages| F
    O -->|manages| C
```
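To make the `Direct` model concrete, here is a minimal `FlowCollector` sketch, assuming the `v1beta2` API. Only a few fields are shown and the values are illustrative; see the FlowCollector documentation for the full schema:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  deploymentModel: Direct    # agents and FLP both run per node
  agent:
    type: eBPF
    ebpf:
      sampling: 50           # capture 1 packet out of 50
  loki:
    enable: true             # Loki itself must be installed separately
```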

## Kafka deployment model

When using the operator with the `FlowCollector` `spec.deploymentModel` set to `Kafka`, only the agents are deployed per node as a `DaemonSet`. FLP becomes a Kafka consumer that can be scaled independently of the number of nodes. This is the recommended mode for large clusters, as it is a more robust and resilient solution.

As in `Direct` mode, the data stores aren't managed by the operator. The same applies to the Kafka brokers: you can check the Strimzi operator for that.

<!-- You can use https://mermaid.live/ to test it -->

```mermaid
flowchart TD
    subgraph "for each node"
    A[eBPF Agent]
    end
    A -->|produces flows| K[(Kafka)]
    F[FLP] <-->|consumes| K
    F -. exports .-> E[(Kafka/Otlp/IPFIX)]
    F -->|raw logs| L[(Loki)]
    F -->|metrics| P[(Prometheus)]
    C[Console plugin] <-->|fetches| L
    C <-->|fetches| P
    O[Operator] -->|manages| A
    O -->|manages| F
    O -->|manages| C
```
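A corresponding `FlowCollector` sketch for the `Kafka` model might look like the following, again assuming the `v1beta2` API. The broker address and topic name are hypothetical (e.g. a cluster deployed with the Strimzi operator); check the FlowCollector documentation for the exact fields:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  namespace: netobserv
  deploymentModel: Kafka
  kafka:
    address: kafka-cluster-kafka-bootstrap.netobserv   # hypothetical broker address
    topic: network-flows                               # hypothetical topic name
  processor:
    kafkaConsumerReplicas: 3   # FLP scaled independently of the node count
```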

## CLI

When using the CLI, the operator is not involved, which means you can use it without installing NetObserv as a whole. It uses a special mode of the eBPF agents that embeds FLP.

When running a flow or packet capture, a collector Pod is deployed in addition to the agents. When capturing only metrics, the collector isn't deployed: metrics are exposed directly by the agents and pulled by Prometheus.

<!-- You can use https://mermaid.live/ to test it -->

```mermaid
flowchart TD
    subgraph "for each node"
    A[eBPF Agent w/ embedded FLP]
    end
    A -->|generates flows or packets| C[Collector]
    CL[CLI] -->|manages| A
    CL -->|manages| C
    A -..->|metrics| P[(Prometheus)]
```
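As a sketch of how the CLI is typically used (illustrative invocations; subcommand names and flags vary by version, so check `oc netobserv --help` and the CLI repository for the authoritative list):

```shell
# Install the plugin via Krew
kubectl krew install netobserv

# On-demand flow capture with a live TUI (agents + collector Pod are deployed)
oc netobserv flows

# Full packet capture, similar to tcpdump
oc netobserv packets

# Metrics-only capture: no collector Pod; agents expose metrics for Prometheus
oc netobserv metrics

# Remove the capture components when done
oc netobserv cleanup
```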