Commit 3bd5626

NETOBSERV-2606: increase default cacheActiveTimeout to 15s (#2412)
Adapt cacheMaxFlows to 120K - betting it's still sufficient to cache all the flows, but we can increase more if we find problems in our perf tests.

Add more info in README about fine tuning.
1 parent 179bc8b commit 3bd5626

10 files changed: +39 -37 lines changed

README.md

Lines changed: 6 additions & 4 deletions
@@ -208,16 +208,18 @@ More information on Prometheus metrics is available in a dedicated page: [Metric
 
 ### Performance fine-tuning
 
-In addition to sampling and using Kafka or not, other settings can help you get an optimal setup without compromising on the observability.
+In addition to sampling and using Kafka or not, other settings can help you get an optimal setup, with or without compromising on the observability.
 
 Here is what you should pay attention to:
 
-- Resource requirements and limits (`spec.agent.ebpf.resources`, `spec.agent.processor.resources`): adapt the resource requirements and limits to the load and memory usage you expect on your cluster. The default limits (800MB) should be sufficient for most medium sized clusters. You can read more about reqs and limits [here](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/).
-
-- eBPF agent's cache max flows (`spec.agent.ebpf.cacheMaxFlows`) and timeout (`spec.agent.ebpf.cacheActiveTimeout`) control how often flows are reported by the agents. The higher are `cacheMaxFlows` and `cacheActiveTimeout`, the less traffic will be generated by the agents themselves, which also ties with less CPU load. But on the flip side, it leads to a slightly higher memory consumption, and might generate more latency in the flow collection. There is [a blog entry](https://github.com/netobserv/documents/blob/main/blogs/agent_metrics_perf/index.md) dedicated to this fine-tuning.
+- eBPF agent's cache eviction interval (`spec.agent.ebpf.cacheActiveTimeout`) controls how often flows are reported by the agents. The higher it is, the more aggregated the flows are, which results in less traffic sent by the agents themselves, and also ties with less CPU load. But on the flip side, it leads to a slightly higher memory consumption in the agent, and generates more latency in the flow collection. It must be configured in relation with the max flows parameters (`spec.agent.ebpf.cacheMaxFlows`), which defines the size of the eBPF data structures, to make sure there is always enough room for new flows. There is [a blog entry](https://netobserv.io/posts/performance-fine-tuning-a-deep-dive-in-ebpf-agent-metrics/) dedicated to this tuning.
 
 - It is possible to reduce the overall observed traffic by restricting or excluding interfaces via `spec.agent.ebpf.interfaces` and `spec.agent.ebpf.excludeInterfaces`. Note that the interface names may vary according to the CNI used.
 
+- You can also add [eBPF filters](https://netobserv.io/posts/enhancing-netobserv-by-introducing-multi-rules-flow-filtering-capability-in-ebpf/) and [flowlogs-pipeline filters](https://github.com/netobserv/flowlogs-pipeline/blob/main/docs/filtering.md) to further narrow down what's being collected, if you find that you don't need every kind of flows. The former has the greatest impact on the performance of each component, while the latter mainly improves the storage/Loki end of the pipeline.
+
+- Resource requirements and limits (`spec.agent.ebpf.resources`, `spec.agent.processor.resources`): adapt the resource requirements and limits to the load and memory usage you expect on your cluster. The default limits (800MB) should be sufficient for most medium sized clusters. You can read more about reqs and limits [here](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/).
+
 - Each component offers more advanced settings via `spec.agent.ebpf.advanced`, `spec.processor.advanced`, `spec.loki.advanced` and `spec.consolePlugin.advanced`. The agent has [environment variables](https://github.com/netobserv/netobserv-ebpf-agent/blob/main/docs/config.md) that you can set through `spec.agent.ebpf.advanced.env`.
 
 #### Loki

api/flowcollector/v1beta2/flowcollector_types.go

Lines changed: 4 additions & 4 deletions
@@ -333,18 +333,18 @@ type FlowCollectorEBPF struct {
 //+optional
 Sampling *int32 `json:"sampling,omitempty"`
 
-// `cacheActiveTimeout` is the max period during which the reporter aggregates flows before sending.
+// `cacheActiveTimeout` is the period during which the agent aggregates flows before sending.
 // Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 // however you can expect higher memory consumption and an increased latency in the flow collection.
 //+kubebuilder:validation:Pattern:=^\d+(ns|ms|s|m)?$
-//+kubebuilder:default:="5s"
+//+kubebuilder:default:="15s"
 CacheActiveTimeout string `json:"cacheActiveTimeout,omitempty"`
 
-// `cacheMaxFlows` is the max number of flows in an aggregate; when reached, the reporter sends the flows.
+// `cacheMaxFlows` is the maximum number of flows in an aggregate; when reached, the reporter sends the flows.
 // Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 // however you can expect higher memory consumption and an increased latency in the flow collection.
 //+kubebuilder:validation:Minimum=1
-//+kubebuilder:default:=100000
+//+kubebuilder:default:=120000
 CacheMaxFlows int32 `json:"cacheMaxFlows,omitempty"`
 
 // `interfaces` contains the interface names from where flows are collected. If empty, the agent

bundle/manifests/flows.netobserv.io_flowcollectors.yaml

Lines changed: 4 additions & 4 deletions
@@ -1098,17 +1098,17 @@ spec:
 type: object
 type: object
 cacheActiveTimeout:
-default: 5s
+default: 15s
 description: |-
-`cacheActiveTimeout` is the max period during which the reporter aggregates flows before sending.
+`cacheActiveTimeout` is the period during which the agent aggregates flows before sending.
 Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 however you can expect higher memory consumption and an increased latency in the flow collection.
 pattern: ^\d+(ns|ms|s|m)?$
 type: string
 cacheMaxFlows:
-default: 100000
+default: 120000
 description: |-
-`cacheMaxFlows` is the max number of flows in an aggregate; when reached, the reporter sends the flows.
+`cacheMaxFlows` is the maximum number of flows in an aggregate; when reached, the reporter sends the flows.
 Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 however you can expect higher memory consumption and an increased latency in the flow collection.
 format: int32

bundle/manifests/netobserv-operator.clusterserviceversion.yaml

Lines changed: 2 additions & 2 deletions
@@ -39,8 +39,8 @@ metadata:
 "spec": {
 "agent": {
 "ebpf": {
-"cacheActiveTimeout": "5s",
-"cacheMaxFlows": 100000,
+"cacheActiveTimeout": "15s",
+"cacheMaxFlows": 120000,
 "excludeInterfaces": [
 "lo"
 ],

config/crd/bases/flows.netobserv.io_flowcollectors.yaml

Lines changed: 4 additions & 4 deletions
@@ -1024,17 +1024,17 @@ spec:
 type: object
 type: object
 cacheActiveTimeout:
-default: 5s
+default: 15s
 description: |-
-`cacheActiveTimeout` is the max period during which the reporter aggregates flows before sending.
+`cacheActiveTimeout` is the period during which the agent aggregates flows before sending.
 Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 however you can expect higher memory consumption and an increased latency in the flow collection.
 pattern: ^\d+(ns|ms|s|m)?$
 type: string
 cacheMaxFlows:
-default: 100000
+default: 120000
 description: |-
-`cacheMaxFlows` is the max number of flows in an aggregate; when reached, the reporter sends the flows.
+`cacheMaxFlows` is the maximum number of flows in an aggregate; when reached, the reporter sends the flows.
 Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 however you can expect higher memory consumption and an increased latency in the flow collection.
 format: int32

config/samples/flows_v1beta2_flowcollector.yaml

Lines changed: 8 additions & 8 deletions
@@ -11,11 +11,11 @@ spec:
 agent:
 type: eBPF
 ebpf:
-# imagePullPolicy: IfNotPresent
-# logLevel: info
+# imagePullPolicy: Always
+# logLevel: debug
 sampling: 50
-cacheActiveTimeout: 5s
-cacheMaxFlows: 100000
+cacheActiveTimeout: 15s
+cacheMaxFlows: 120000
 # Change privileged to "true" on old kernel version not knowing CAP_BPF or when using "PacketDrop" feature
 privileged: false
 # features:
@@ -77,8 +77,8 @@ spec:
 # certFile: user.crt
 # certKey: user.key
 processor:
-# imagePullPolicy: IfNotPresent
-# logLevel: info
+# imagePullPolicy: Always
+# logLevel: debug
 # Change logTypes to "Conversations", "EndedConversations" or "All" to enable conversation tracking
 # logTypes: Flows
 # Append a unique cluster name to each record
@@ -182,8 +182,8 @@ spec:
 # timeout: 30s
 consolePlugin:
 enable: true
-# imagePullPolicy: IfNotPresent
-# logLevel: info
+# imagePullPolicy: Always
+# logLevel: debug
 # Scaling configuration
 # replicas: 1
 # autoscaler:

docs/FlowCollector.md

Lines changed: 4 additions & 4 deletions
@@ -256,23 +256,23 @@ override the default Linux capabilities from there.<br/>
 <td><b>cacheActiveTimeout</b></td>
 <td>string</td>
 <td>
-`cacheActiveTimeout` is the max period during which the reporter aggregates flows before sending.
+`cacheActiveTimeout` is the period during which the agent aggregates flows before sending.
 Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 however you can expect higher memory consumption and an increased latency in the flow collection.<br/>
 <br/>
-<i>Default</i>: 5s<br/>
+<i>Default</i>: 15s<br/>
 </td>
 <td>false</td>
 </tr><tr>
 <td><b>cacheMaxFlows</b></td>
 <td>integer</td>
 <td>
-`cacheMaxFlows` is the max number of flows in an aggregate; when reached, the reporter sends the flows.
+`cacheMaxFlows` is the maximum number of flows in an aggregate; when reached, the reporter sends the flows.
 Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 however you can expect higher memory consumption and an increased latency in the flow collection.<br/>
 <br/>
 <i>Format</i>: int32<br/>
-<i>Default</i>: 100000<br/>
+<i>Default</i>: 120000<br/>
 <i>Minimum</i>: 1<br/>
 </td>
 <td>false</td>

helm/crds/flows.netobserv.io_flowcollectors.yaml

Lines changed: 4 additions & 4 deletions
@@ -1028,17 +1028,17 @@ spec:
 type: object
 type: object
 cacheActiveTimeout:
-default: 5s
+default: 15s
 description: |-
-`cacheActiveTimeout` is the max period during which the reporter aggregates flows before sending.
+`cacheActiveTimeout` is the period during which the agent aggregates flows before sending.
 Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 however you can expect higher memory consumption and an increased latency in the flow collection.
 pattern: ^\d+(ns|ms|s|m)?$
 type: string
 cacheMaxFlows:
-default: 100000
+default: 120000
 description: |-
-`cacheMaxFlows` is the max number of flows in an aggregate; when reached, the reporter sends the flows.
+`cacheMaxFlows` is the maximum number of flows in an aggregate; when reached, the reporter sends the flows.
 Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
 however you can expect higher memory consumption and an increased latency in the flow collection.
 format: int32

internal/controller/flowcollector_controller_ebpf_test.go

Lines changed: 2 additions & 2 deletions
@@ -57,7 +57,7 @@ func flowCollectorEBPFSpecs() {
 Type: "eBPF",
 EBPF: flowslatest.FlowCollectorEBPF{
 Sampling: ptr.To(int32(123)),
-CacheActiveTimeout: "15s",
+CacheActiveTimeout: "1s",
 CacheMaxFlows: 100,
 Interfaces: []string{"veth0", "/^br-/"},
 ExcludeInterfaces: []string{"br-3", "lo"},
@@ -96,7 +96,7 @@ func flowCollectorEBPFSpecs() {
 Expect(*spec.Containers[0].SecurityContext.RunAsUser).To(Equal(int64(0)))
 Expect(spec.Containers[0].Env).To(ContainElements(
 v1.EnvVar{Name: "EXPORT", Value: "grpc"},
-v1.EnvVar{Name: "CACHE_ACTIVE_TIMEOUT", Value: "15s"},
+v1.EnvVar{Name: "CACHE_ACTIVE_TIMEOUT", Value: "1s"},
 v1.EnvVar{Name: "CACHE_MAX_FLOWS", Value: "100"},
 v1.EnvVar{Name: "LOG_LEVEL", Value: "trace"},
 v1.EnvVar{Name: "INTERFACES", Value: "veth0,/^br-/"},

internal/controller/flowcollector_controller_iso_test.go

Lines changed: 1 addition & 1 deletion
@@ -110,7 +110,7 @@ func flowCollectorIsoSpecs() {
 },
 EBPF: flowslatest.FlowCollectorEBPF{
 Sampling: &zero,
-CacheActiveTimeout: "5s",
+CacheActiveTimeout: "15s",
 CacheMaxFlows: 100,
 ImagePullPolicy: "Always",
 Advanced: &flowslatest.AdvancedAgentConfig{},
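
For readers who want to see the new defaults in context, here is a minimal sketch of a FlowCollector resource that sets them explicitly. The field names and values come from the sample file changed above. The `apiVersion` is inferred from the CRD group and the sample file name, and the `metadata.name` is an assumption, since neither appears in this diff.

```yaml
# Sketch only: apiVersion is inferred and metadata.name is assumed; neither is shown in this commit.
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  agent:
    type: eBPF
    ebpf:
      sampling: 50
      # New default (was 5s): period during which the agent aggregates flows before sending.
      cacheActiveTimeout: 15s
      # New default (was 100000): raised alongside the timeout so there is still room for new flows.
      cacheMaxFlows: 120000
```

Setting both fields explicitly is only needed when deviating from the defaults. As the README change notes, the two values should be tuned together.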
