Skip to content

refactor(otp): AgentServer hardening + storage contract cleanup#130

Draft
nshkrdotcom wants to merge 12 commits intoagentjido:mainfrom
nshkrdotcom:refactor/otp-cleanup
Draft

refactor(otp): AgentServer hardening + storage contract cleanup#130
nshkrdotcom wants to merge 12 commits intoagentjido:mainfrom
nshkrdotcom:refactor/otp-cleanup

Conversation

@nshkrdotcom
Copy link
Contributor

Refs #128

This PR is an ongoing, preliminary batch of changes addressing several high-priority findings from the v2.0.0-rc.4 lib/ audit. Posting early for review feedback on direction and API/behavior changes; more commits are expected to follow.

Included so far

  • AgentServer: lifecycle hardening and post-init sequencing; cron schedule tracking/delivery improvements
  • Storage: unify :not_found contract; remove legacy Agent.Store/Persistence layer; centralize thread reconstruction
  • Runtime/OTP: supervision and resource cleanup hardening
  • Tests: backfill coverage around AgentServer, cron delivery, persist/thread behavior, and ETS heir/ownership

Notes

…e cleanup

- Add Jido.RuntimeDefaults module centralizing all timeout, limit, and
  supervisor guardrail defaults with runtime config override support
- Add Jido.SystemTaskSupervisor for fire-and-forget async tasks
- Wrap GenServer.call in safe_call/3 returning {:error, :not_found} on
  noproc/shutdown exits instead of crashing callers
- Trap exits in AgentServer.init and handle {:EXIT, ...} messages for
  parent/child/ancestor classification
- Start plugin children and sensor runtimes under DynamicSupervisor
  instead of linking directly to AgentServer
- Track spawned children from Spawn directive with monitor refs
- Add scheduled_timers map to State for deterministic timer cleanup on
  terminate; schedule messages now carry a message_ref
- Add completion waiter expiry timers and reply {:error, :shutdown} to
  pending waiters on terminate
- Store parent_monitor_ref in State for precise parent-down detection
- Convert stream_status/2 from Stream.repeatedly to Stream.resource with
  :on_error option (:raise | :halt | :emit)
- Run plugin handle_signal/transform_result hooks in supervised tasks
  with configurable timeout (plugin_hook_timeout), returning structured
  timeout errors instead of blocking the AgentServer
- Run Lifecycle.Keyed hibernate in Task.async with yield/shutdown guard
- Add max_children limits to DynamicSupervisor in Jido, InstanceManager
- Sensor.Runtime: track timer refs, cancel on terminate, validate
  positive intervals, dispatch signals via async tasks
- Discovery.init_async returns {:ok, pid} via TaskSupervisor instead of
  bare Task struct
- Plugin.Instance: validate aliases against regex and max length
- Thread.Entry: generate default id when not provided
- Thread.Store: add @enforce_keys, use :global.trans for ETS append
  concurrency safety
- JournalBacked: decode unknown string kinds to :unknown atom safely
- Telemetry.Config: change default log level to :info, remove
  compile-time config, add runtime normalization/validation
- Scheduler.cancel/1: catch exit on already-dead SchedEx processes
- DefaultPlugins: simplify cond to if/else, use Enum.map_join
- Update Await.all/any to start waiters under SystemTaskSupervisor with
  proper error propagation and timeout diagnostic handling
- Relax uniq and excoveralls version constraints in mix.exs
- Add tests covering: max_agents limit, dead pid safe_call, waiter
  expiry cleanup, shutdown reply, supervised plugin/sensor children,
  sensor crash isolation, plugin hook timeout, timer ref tracking,
  alias validation, telemetry config normalization, journal kind safety
…andardize error contracts

- Replace bare DynamicSupervisor calls with safe_start_child/safe_terminate_child
  wrappers returning {:error, {:missing_supervisor, name}} on noproc/shutdown
- Add ensure_supervisor_running/1 and maybe_whereis_supervisor/1 guards to
  Jido, AgentServer, and directive executors to prevent silent fallback to
  legacy global supervisors
- Defer handle_call({:signal, ...}) to async task via :call_signal message,
  keeping GenServer mailbox responsive during action execution
- Extract with_telemetry_span/3 helper consolidating start/stop/exception
  telemetry pattern across signal and directive processing
- Add State.start_processing/1 and finish_processing/1 to centralize
  processing/idle status transitions
- Change Lifecycle.init callback contract from keyword opts to lifecycle
  struct as single source of truth
- Add Lifecycle.clear_idle_timer/1 accessor replacing inline map updates
- Unify Agent.Store.ETS and Agent.Store.File as thin wrappers over
  Jido.Storage.ETS/File checkpoint APIs with normalize_opts adapters
- Add Jido.Storage.ETS.Owner and Jido.Storage.ETS.Heir GenServers for
  dedicated ETS table ownership with heir-based crash recovery
- Wrap Storage.ETS table name resolution in {:ok, name} tuples, rejecting
  non-atom table names to prevent dynamic atom creation
- Add resolve_store_config/1 to Persistence supporting both :store (legacy)
  and :storage (unified) config keys with function_exported? dispatch
- Change InstanceManager.lookup/2 return from :error to {:error, :not_found}
- Add guard clauses to InstanceManager supervisor/registry/dynamic_supervisor
  name helpers rejecting non-atom manager arguments
- Normalize Agent.new/1 to handle nil/""/missing id via ensure_agent_id/1
- Return structured Jido.Error types from validate_name, validate_actions,
  validate_module, FSM.Machine.transition, and routing errors
- Add validate_name_refinement/1 and validate_actions_refinement/1 for Zoi
  schema compatibility (returning plain string errors)
- Add module name regex validation to jido.gen.agent plugin normalization
- Add jido_action_dep/0 helper preferring local path dependency when present
- Document not_found convention, raise-vs-return policy, and Thread.Store
  3-tuple contract in errors guide
- Add tests covering: supervisor race hardening, async call signal
  responsiveness, lifecycle init contract, ETS owner/heir policy, table
  name rejection, persistence dual-config, manager naming guards, structured
  error types, agent nil-id normalization, plugin module name validation,
  spawn directive no-fallback behavior
…d isolate plugin startup failures

- Extract initialize_strategy/1, build_signal_router/1 from inline
  handle_continue(:post_init) into pipeline of named private functions
- Wrap plugin child_spec, subscription, and schedule startup in
  safe_start_plugin_spec_children/3, safe_start_plugin_subscriptions/3,
  and safe_register_schedule/2 with rescue/catch returning state on failure
  instead of crashing the AgentServer during post_init
- Normalize Observe.Log threshold and level handling: add
  @telemetry_compatible_levels whitelist, map :trace to :debug for Logger,
  fall back to :info on unrecognized levels
- Lower default max_tasks from 2000 to 1000 in docs and examples
- Add tests covering: plugin child_spec crash isolation during post_init
  with mixed healthy/crashing plugins, invalid log_level fallback to :info,
  :trace threshold treated as most verbose level
…e/Persistence layer, and centralize thread reconstruction

- Change all storage/persistence APIs from bare :not_found to
  {:error, :not_found} tuples: Storage.ETS, Storage.File, Persist.thaw,
  Jido.thaw, InstanceManager.maybe_thaw
- Delete Jido.Agent.Persistence module, replacing with direct
  Jido.Persist calls in InstanceManager and Lifecycle.Keyed
- Delete Jido.Agent.Store behaviour and ETS/File adapter wrappers;
  InstanceManager now accepts :storage {Adapter, opts} directly with
  resolve_storage_config/1 handling legacy :persistence keyword migration
- Add Thread.prepare_entries/3 and Thread.from_entries/3 centralizing
  entry preparation and thread reconstruction across Storage.ETS,
  Storage.File, JournalBacked, and Thread.append
- Move generate_entry_id into Entry.generate_id/0 public function,
  removing duplicate implementations from ETS, File, and JournalBacked
- Rename :persistence to :storage across Options, State.Lifecycle,
  Lifecycle.Keyed, and InstanceManager config/docs
- Add resolve_server/1 Process.alive? guard and safe_whereis/safe_registry_lookup
  wrappers in AgentServer for dead-pid and missing-registry safety
- Replace hardcoded shutdown/timeout values with RuntimeDefaults accessors:
  agent_server_shutdown_timeout, sensor_runtime_shutdown_timeout,
  instance_manager_stop_timeout, stop_child_shutdown_timeout,
  await_child_timeout, jido_supervisor_shutdown_timeout,
  agent_supervisor_max_restarts, agent_supervisor_max_seconds
- Replace defdelegate for await/await_child/await_all/await_any with
  explicit wrapper functions using RuntimeDefaults timeout values
- Update errors guide to document single {:error, :not_found} convention
- Add tests covering: RuntimeDefaults shutdown/supervisor overrides,
  Entry.generate_id uniqueness, Thread.prepare_entries/from_entries,
  whereis registry safety, Jido start/stop/init wrappers, registry
  and counting helpers, storage-backed hibernate/thaw with unified API
- Delete legacy test suites: persistence_test, store_test
- Update all existing tests to match {:error, :not_found} contract
…esolve_server to preserve via/name refs, and add cron monitor tracking

- Split synchronous handle_continue(:post_init) into sequential message
  steps (:post_init_start_children, :post_init_start_subscriptions,
  :post_init_register_schedules, :post_init_finalize) so GenServer mailbox
  remains responsive during plugin startup
- Change resolve_server/1 to return {:ok, via} or {:ok, name} instead of
  resolving to bare pid, preserving registry-based routing through
  safe_call/safe_cast/GenServer.call sites
- Add safe_cast/2 wrapper catching noproc/shutdown exits like safe_call
- Add cron_monitors map to State tracking monitor_ref => job_id; handle
  :DOWN for cron processes via handle_non_cron_down extraction
- Add track_cron_job/3 replacing old cron pid with monitor and cleanup
  of stale refs on re-registration
- Add cleanup_tracked_children/1 on terminate demonitoring and stopping
  supervised children, skipping :spawn_agent directive children
- Add safe_dynamic_supervisor_terminate_child/2, safe_process_exit/2,
  normalize_status_pid/1 helpers
- Change alive?/1 for non-pid refs to use status/1 instead of
  resolve_server to avoid false positives on stale pids
- Change Jido.stop_agent/2 to return :ok when agent already gone
  (idempotent stop semantics)
- Add instance_manager_max_restarts, instance_manager_max_seconds to
  InstanceManager DynamicSupervisor config via RuntimeDefaults
- Add status_call_timeout, stream_status_interval_ms, debug_event_buffer_size
  to RuntimeDefaults with runtime config override support
- Move debug event buffer size from compile-time @max_debug_events to
  runtime RuntimeDefaults.debug_event_buffer_size/0
- Add validate_agent_refinement/1 Zoi refinement on State.agent field
- Change Lifecycle.Keyed hibernate to fire-and-forget via
  SystemTaskSupervisor instead of Task.async with yield/shutdown
- Normalize SpawnAgent meta to map, tag with :directive => :spawn_agent;
  auto-tag nil Spawn directives with {:spawn, pid}
- Make Discovery.refresh/0 skip persistent_term write when components
  unchanged, preserving last_updated timestamp
- Remove bare :not_found fallback clauses from Persist.thaw/restore_thread
  now that storage layer returns {:error, :not_found} uniformly
- Add Plugin.Config warning when config values present but no config_schema
- Expose Observe.observability_config/0 as public, use from Observe.Log
- Change Await task fallback from Task.start to {:error, :task_supervisor_not_found}
- Add tests covering: post_init non-blocking state calls, tracked child
  cleanup on terminate, nil-tag spawn tracking, spawn_agent meta normalization,
  cron schedule async readiness, discovery idempotent refresh, idempotent
  stop_agent, new RuntimeDefaults accessors
- Update existing tests with eventually_state/await_children helpers for
  async post_init readiness
…nd backfill unit tests

- Change Task.start to Task.Supervisor.start_child in Observe module
  doc examples to match supervised-task conventions
- Relax cmd/2 timeout assertion from 300ms to 1_000ms for CI stability
- Add Lifecycle.Noop test covering init/2 passthrough, handle_event/2
  continuation, and terminate/2 no-op contract
- Add Igniter.Helpers test covering module_to_name/1 atom/string
  conversion and parse_list/1 nil/comma-separated handling
- Add Storage.ETS.Heir test covering ETS-TRANSFER event recording
  and unknown message filtering
…ournal, and harden registry/cron routing

- Replace async call-signal offload (Task under SystemTaskSupervisor with
  :call_signal/:call_signal_result messages) with synchronous
  process_call_signal_with_trace inside handle_call, removing
  start_call_signal_task/3, safe_task_start/1, and
  resolve_task_supervisor_for_call/1 helpers
- Extract register_server/1 from init wrapping Registry.register in
  {:ok, _} / {:error, {:already_started, pid}} / {:error, :invalid_registry}
  tuple contract, preventing silent double-registration
- Allow explicit :registry override in Options.new/1 attrs so
  InstanceManager-spawned agents inherit the keyed registry
- Pass registry to InstanceManager child_spec so keyed agents resolve
  via_tuple through their own registry instead of the default
- Route cron and schedule ticks through via_tuple(agent_id, registry)
  instead of bare agent_id, with dispatch_tick/4 logging delivery
  failures instead of silently discarding them
- Add stop_child_now/4 fallback in StopChild directive executor when
  task supervisor resolution or async stop start fails, ensuring child
  termination is not silently skipped
- Simplify Lifecycle.Keyed hibernate from fire-and-forget async task
  under SystemTaskSupervisor to synchronous Persist.hibernate call
- Replace flush_journal blind append-at-rev-0 with delta-based strategy:
  current_persisted_rev/3 loads stored rev, validate_flush_rev/2 rejects
  stale in-memory threads, append_thread_delta/5 sends only new entries
- Add clarifying comment on :global.trans lock key tuple format in
  Storage.ETS.run_locked_append/4
- Switch jido_action and jido_signal deps from local paths to git refs
  on refactor/otp-cleanup branch
- Add tests covering: duplicate agent id already_started error,
  serialized call-signal blocking behavior, cron tick delivery to owning
  agent, plugin schedule tick delivery, subsequent hibernate delta-only
  append, concurrent ETS append serialization
- Update existing async call-signal tests to match synchronous semantics
…tion, and guard plugin state-key atom creation

- Add Jido.Runtime.Tasking module centralizing task supervisor resolution
  with configurable candidate lists, jido-instance derivation, and
  system supervisor fallback into resolve_task_supervisor/1 and
  start_child/2 helpers
- Replace inline resolve_task_supervisor/1 duplicates in Emit, StopChild,
  ErrorPolicy, AgentServer, Await, Discovery, and Sensor.Runtime with
  Tasking.start_child/2 and Tasking.resolve_task_supervisor/1 calls
- Change run_plugin_callback_task to accept task supervisor atom instead
  of raw pid, resolving via Tasking.resolve_task_supervisor/1
- Change Spawn directive executor from Logger.error swallow to
  enqueue_spawn_error/5 producing Directive.Error with :spawn context,
  propagating failure through the directive queue instead of silently
  discarding it
- Change SpawnAgent directive executor from Logger.error swallow to
  enqueue_spawn_agent_error/6 producing Directive.Error with
  :spawn_agent context with the same queue-based error propagation
- Guard Plugin.Instance.derive_state_key/2 against unbounded atom
  creation via derive_or_create_state_key_atom/1 backed by ETS guard
  table with @max_derived_state_keys (4096) limit and :global.trans
  serialization on first-time atom creation
- Switch Sensor.Runtime schedule_event from Process.send_after to
  :erlang.start_timer with {:sensor_timer, key, event} tagged tuples,
  adding handle_info clause that validates timer_ref against active
  timers map to reject stale timeout deliveries
- Remove bare Task.start fallback in Sensor.Runtime dispatch_async,
  returning error tuple on supervisor resolution failure instead
- Remove String.t() from AgentServer.server type union; update doc
  examples to use via_tuple/2 and whereis/1 instead of bare string ids
- Add tests covering: Tasking supervisor resolution and start_child,
  spawn directive error enqueueing, spawn_agent directive error
  enqueueing, stale tick timeout rejection, stale custom event timeout
  rejection
- Update existing directive exec tests to assert queue-based error
  propagation instead of silent state passthrough
…expiry, support restart and route opts passthrough

- Extract start/1 into build_child_spec/2, resolve_supervisor/1,
  start_under_supervisor/2, and extract_child_spec_overrides/1 pipeline
  so callers can pass :restart through opts to customize child_spec
- Move await_completion timeout from GenServer.call deadline to
  server-side waiter expiry, calling GenServer.call with :infinity to
  prevent late reply messages leaking into caller mailboxes after timeout
- Add build_timeout_diagnostic_from_state/2 producing diagnostic map
  (waited_ms, server_status, queue_length, iteration, hint) on waiter
  expiry, replying {:error, {:timeout, diagnostic}} from handle_info
- Store timeout in waiter struct so expiry handler has access to the
  original deadline for diagnostic reporting
- Change infer_timeout_hint/1 to accept bare status atom instead of
  full status struct
- Change supervised_runtime_child_spec from hardcoded restart: :temporary
  to empty overrides, pushing restart policy decision to call sites
- Add restart: :temporary to SpawnAgent directive executor child_opts so
  spawned child agents are not restarted by the DynamicSupervisor
- Add keyword opts support to SignalRouter.normalize_routes/2 accepting
  {path, target, opts} and {path, match_fn, target, opts} tuples with
  priority: keyword extraction via route_priority/2
- Add tests covering: await_completion timeout leaves no late reply in
  caller mailbox, spawned child agent uses temporary restart semantics,
  plugin signal_routes/1 tuples with keyword opts priority resolution
mikehostetler and others added 3 commits February 8, 2026 07:26
Harden module-based new/1 and base Agent.new/1 to treat nil and ""
as missing, auto-generating a UUID7 in those cases. Previously:

- Module-based new/1 used opts[:id] || generate_id(), which accepted
  empty string since it is truthy in Elixir
- Base Agent.new/1 used Map.put_new_lazy/3, which did not override
  an explicitly-passed id: nil (key present, value nil)

Addresses agentjido#131

Amp-Thread-ID: https://ampcode.com/threads/T-019c3d5b-a7dd-774a-9228-18595bf97861
Co-authored-by: Amp <amp@ampcode.com>
…to refactor/otp-cleanup

# Conflicts:
#	lib/jido/agent.ex
- Change agent schema id field from required Zoi.string to
  Zoi.optional() so struct literals and constructor paths can
  omit id and let UUID7 auto-generation handle it, aligning
  with upstream fix/agent-id-normalization nil/empty handling
- Update jido_action git ref to ad3a37a (refactor/otp-cleanup)
- Update jido_signal git ref to 3997911 (refactor/otp-cleanup)
- Bump spitfire dep from 0.3.2 to 0.3.3
- Add test asserting %Agent{} struct literal permits missing id
@nshkrdotcom
Copy link
Contributor Author

For the #128 audit work: I pushed a smaller “non-breaking first” branch refactor/otp-hardening (HEAD 0e4be69) that isolates the OTP/lifecycle hardening and schema fixes, without the larger storage/contract refactors.

What’s in refactor/otp-hardening:

  • AgentServer: trap_exit, O(1) child lookup (child_pids/child_monitors), parent_monitor_ref, scheduled timer tracking + prune, and best-effort terminate/2 cleanup (timers/cron/children/monitors).
  • Jido.SystemTask + Jido.SystemTaskSupervisor, used by Discovery.init_async/0 and Await (with fallback behavior).
  • Zoi: nullable() |> optional() for description/category/vsn where nil was previously rejected.
  • Restore callback refactor + simplified __jido_restore_create_agent__/1 while keeping Jido.Agent.id optional.

I ran mix test on the branch (1886 tests, 0 failures) and added otp_hardening_test.exs + system_task_test.exs to cover linked-child crash survival and terminate cleanup.

If we want to de-risk review/merge, we could land the refactor/otp-hardening slice first (as its own PR), then keep this PR focused on the bigger storage/contract changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants