@essinghigh essinghigh commented Jan 15, 2026

Follow-on from #17906


Persistent Connection Pool
Updated get_docker_client to reuse a single Docker client instance. This eliminates the massive overhead of establishing new socket connections for each execution cycle.
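A minimal sketch of the reuse pattern described above, not the actual code from this PR. The name `get_shared_client` and the `factory` parameter are hypothetical; in the real module the factory would presumably build the Docker SDK client, for example `docker.DockerClient(base_url=...)`.

```python
import threading
from typing import Any, Callable, Optional

# Module-level cache for the single shared client (hypothetical names).
_client: Optional[Any] = None
_client_lock = threading.Lock()


def get_shared_client(factory: Callable[[], Any]) -> Any:
    """Return the shared client, creating it with `factory` on first use only."""
    global _client
    if _client is None:
        # Double-checked locking: only the first caller pays the setup cost,
        # and concurrent first callers do not create two clients.
        with _client_lock:
            if _client is None:
                _client = factory()
    return _client
```

Every later call returns the same object, so the socket connection is set up once rather than once per execution cycle.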

Removed Unnecessary Disk I/O
Explicitly disabled trust_env and defined the local Docker socket base URL. This prevents requests from looking up .netrc on the filesystem and consulting environment variables on every single API call.

Sparse Container Listing
Switched container listing to use sparse=True. With the original sparse=False, listing performed a full inspect API call for each container. We only list containers to identify them, and a sparse listing already returns all the data needed for that.

Race Condition Handling
Removed the blocking retry loop. If a container dies between listing and stats collection, we can just silently skip it.
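The skip-instead-of-retry behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `ContainerGone` is a hypothetical stand-in for the SDK's "not found" error (in the Docker SDK for Python that would be `docker.errors.NotFound`), and `collect_stats` and `fetch_stats` are made-up names.

```python
from typing import Callable, Dict, Iterable


class ContainerGone(Exception):
    """Hypothetical stand-in for the SDK's 'container not found' error."""


def collect_stats(
    container_ids: Iterable[str],
    fetch_stats: Callable[[str], dict],
) -> Dict[str, dict]:
    """Collect stats per container, skipping any that vanished after listing."""
    results: Dict[str, dict] = {}
    for container_id in container_ids:
        try:
            results[container_id] = fetch_stats(container_id)
        except ContainerGone:
            # The container exited between listing and stats collection.
            # No retry and no sleep: it simply drops out of this cycle.
            continue
    return results
```

A container that disappears mid-cycle is just absent from the result, and the next cycle's listing will no longer include it.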

Type Safety & Readability
I've rewritten the stats collection module to use TypedDict and strict type hints, which improves readability significantly compared to the previous nested dictionary approach.


I have tested this on the latest MASTER build without issues. I have also been running these changes on a production machine on 25.10.0 without any problems.

Before:
[image: before]

After:
[image: after]

I'm also still convinced that there's some bug in the WebUI and/or Middlewared causing the AppStatsEventSource to never terminate. However, I haven't been able to find anything to confirm that.


There are some areas I think could be improved still:

  • YAML parsing could be eliminated almost entirely if it were cached and then invalidated as needed (metadata updates / mtime change).
  • It might make sense to use actual periodic background collection instead of the "on-demand" polling (which seems broken), storing samples in something like a deque-based stats history and having the event source read from that instead of making real-time API calls. This would improve responsiveness on page load.
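Both ideas could look something like the sketch below. None of this exists in the PR: `MtimeCache` and `StatsHistory` are hypothetical names, the parser is passed in as a plain callable (a YAML loader in practice), and the background thread that would feed the history is left out.

```python
import os
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Optional, Tuple


class MtimeCache:
    """Cache parsed files and re-parse only when a file's mtime changes."""

    def __init__(self, parse: Callable[[str], Any]) -> None:
        self._parse = parse  # any callable that turns file text into data
        self._entries: Dict[str, Tuple[int, Any]] = {}

    def load(self, path: str) -> Any:
        mtime = os.stat(path).st_mtime_ns
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]  # unchanged on disk, so skip parsing
        with open(path, "r", encoding="utf-8") as handle:
            value = self._parse(handle.read())
        self._entries[path] = (mtime, value)
        return value


class StatsHistory:
    """Bounded history of samples: a collector records, readers only look."""

    def __init__(self, maxlen: int = 60) -> None:
        # deque(maxlen=...) drops the oldest sample automatically when full.
        self._samples: Deque[Dict[str, Any]] = deque(maxlen=maxlen)

    def record(self, sample: Dict[str, Any]) -> None:
        self._samples.append(sample)

    def latest(self) -> Optional[Dict[str, Any]]:
        return self._samples[-1] if self._samples else None

    def snapshot(self) -> List[Dict[str, Any]]:
        return list(self._samples)
```

With something like this, the event source would call `latest()` or `snapshot()` and return immediately, while a periodic job calls `record()`. Depending on the access pattern, a lock around the deque may still be warranted.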

  • Introduce TypedDict definitions (ResourceStats, BlkioStats, NetworkStats) and apply type annotations
  • Extract Block IO and Network stats parsing logic into dedicated helper functions (_parse_blkio and _parse_networks) to simplify get_container_stats
  • Move the project label check in get_container_stats before the expensive container.stats() API calls
  • Refactor the aggregation loop in list_resources_stats_by_project for easier readability
Neither is scanning the home directory for auth configs on every request. Switched to a cached singleton client.
The threads are spending more time fighting over the lock than actually getting stats, so we might as well just go back to sequential querying. container.stats() is very fast now that we aren't recreating the client constantly.
This has crept back up over time; we just need to ensure we are avoiding connection pool contention.
@essinghigh

Had to reopen this as I merged my personal & work github accounts, which closed out my open issues/PRs.

@bugclerk bugclerk changed the title NAS-139113 / 26.04 / Optimize Docker Stats Collection NAS-139113 / 26.0.0-BETA.1 / Optimize Docker Stats Collection Jan 26, 2026