refactor(otp): AgentServer hardening + storage contract cleanup#130
Draft
nshkrdotcom wants to merge 12 commits intoagentjido:mainfrom
Draft
refactor(otp): AgentServer hardening + storage contract cleanup#130nshkrdotcom wants to merge 12 commits intoagentjido:mainfrom
nshkrdotcom wants to merge 12 commits intoagentjido:mainfrom
Conversation
…e cleanup
- Add Jido.RuntimeDefaults module centralizing all timeout, limit, and
supervisor guardrail defaults with runtime config override support
- Add Jido.SystemTaskSupervisor for fire-and-forget async tasks
- Wrap GenServer.call in safe_call/3 returning {:error, :not_found} on
noproc/shutdown exits instead of crashing callers
- Trap exits in AgentServer.init and handle {:EXIT, ...} messages for
parent/child/ancestor classification
- Start plugin children and sensor runtimes under DynamicSupervisor
instead of linking directly to AgentServer
- Track spawned children from Spawn directive with monitor refs
- Add scheduled_timers map to State for deterministic timer cleanup on
terminate; schedule messages now carry a message_ref
- Add completion waiter expiry timers and reply {:error, :shutdown} to
pending waiters on terminate
- Store parent_monitor_ref in State for precise parent-down detection
- Convert stream_status/2 from Stream.repeatedly to Stream.resource with
:on_error option (:raise | :halt | :emit)
- Run plugin handle_signal/transform_result hooks in supervised tasks
with configurable timeout (plugin_hook_timeout), returning structured
timeout errors instead of blocking the AgentServer
- Run Lifecycle.Keyed hibernate in Task.async with yield/shutdown guard
- Add max_children limits to DynamicSupervisor in Jido, InstanceManager
- Sensor.Runtime: track timer refs, cancel on terminate, validate
positive intervals, dispatch signals via async tasks
- Discovery.init_async returns {:ok, pid} via TaskSupervisor instead of
bare Task struct
- Plugin.Instance: validate aliases against regex and max length
- Thread.Entry: generate default id when not provided
- Thread.Store: add @enforce_keys, use :global.trans for ETS append
concurrency safety
- JournalBacked: decode unknown string kinds to :unknown atom safely
- Telemetry.Config: change default log level to :info, remove
compile-time config, add runtime normalization/validation
- Scheduler.cancel/1: catch exit on already-dead SchedEx processes
- DefaultPlugins: simplify cond to if/else, use Enum.map_join
- Update Await.all/any to start waiters under SystemTaskSupervisor with
proper error propagation and timeout diagnostic handling
- Relax uniq and excoveralls version constraints in mix.exs
- Add tests covering: max_agents limit, dead pid safe_call, waiter
expiry cleanup, shutdown reply, supervised plugin/sensor children,
sensor crash isolation, plugin hook timeout, timer ref tracking,
alias validation, telemetry config normalization, journal kind safety
…andardize error contracts
- Replace bare DynamicSupervisor calls with safe_start_child/safe_terminate_child
wrappers returning {:error, {:missing_supervisor, name}} on noproc/shutdown
- Add ensure_supervisor_running/1 and maybe_whereis_supervisor/1 guards to
Jido, AgentServer, and directive executors to prevent silent fallback to
legacy global supervisors
- Defer handle_call({:signal, ...}) to async task via :call_signal message,
keeping GenServer mailbox responsive during action execution
- Extract with_telemetry_span/3 helper consolidating start/stop/exception
telemetry pattern across signal and directive processing
- Add State.start_processing/1 and finish_processing/1 to centralize
processing/idle status transitions
- Change Lifecycle.init callback contract from keyword opts to lifecycle
struct as single source of truth
- Add Lifecycle.clear_idle_timer/1 accessor replacing inline map updates
- Unify Agent.Store.ETS and Agent.Store.File as thin wrappers over
Jido.Storage.ETS/File checkpoint APIs with normalize_opts adapters
- Add Jido.Storage.ETS.Owner and Jido.Storage.ETS.Heir GenServers for
dedicated ETS table ownership with heir-based crash recovery
- Wrap Storage.ETS table name resolution in {:ok, name} tuples, rejecting
non-atom table names to prevent dynamic atom creation
- Add resolve_store_config/1 to Persistence supporting both :store (legacy)
and :storage (unified) config keys with function_exported? dispatch
- Change InstanceManager.lookup/2 return from :error to {:error, :not_found}
- Add guard clauses to InstanceManager supervisor/registry/dynamic_supervisor
name helpers rejecting non-atom manager arguments
- Normalize Agent.new/1 to handle nil/""/missing id via ensure_agent_id/1
- Return structured Jido.Error types from validate_name, validate_actions,
validate_module, FSM.Machine.transition, and routing errors
- Add validate_name_refinement/1 and validate_actions_refinement/1 for Zoi
schema compatibility (returning plain string errors)
- Add module name regex validation to jido.gen.agent plugin normalization
- Add jido_action_dep/0 helper preferring local path dependency when present
- Document not_found convention, raise-vs-return policy, and Thread.Store
3-tuple contract in errors guide
- Add tests covering: supervisor race hardening, async call signal
responsiveness, lifecycle init contract, ETS owner/heir policy, table
name rejection, persistence dual-config, manager naming guards, structured
error types, agent nil-id normalization, plugin module name validation,
spawn directive no-fallback behavior
…d isolate plugin startup failures - Extract initialize_strategy/1, build_signal_router/1 from inline handle_continue(:post_init) into pipeline of named private functions - Wrap plugin child_spec, subscription, and schedule startup in safe_start_plugin_spec_children/3, safe_start_plugin_subscriptions/3, and safe_register_schedule/2 with rescue/catch returning state on failure instead of crashing the AgentServer during post_init - Normalize Observe.Log threshold and level handling: add @telemetry_compatible_levels whitelist, map :trace to :debug for Logger, fall back to :info on unrecognized levels - Lower default max_tasks from 2000 to 1000 in docs and examples - Add tests covering: plugin child_spec crash isolation during post_init with mixed healthy/crashing plugins, invalid log_level fallback to :info, :trace threshold treated as most verbose level
…e/Persistence layer, and centralize thread reconstruction
- Change all storage/persistence APIs from bare :not_found to
{:error, :not_found} tuples: Storage.ETS, Storage.File, Persist.thaw,
Jido.thaw, InstanceManager.maybe_thaw
- Delete Jido.Agent.Persistence module, replacing with direct
Jido.Persist calls in InstanceManager and Lifecycle.Keyed
- Delete Jido.Agent.Store behaviour and ETS/File adapter wrappers;
InstanceManager now accepts :storage {Adapter, opts} directly with
resolve_storage_config/1 handling legacy :persistence keyword migration
- Add Thread.prepare_entries/3 and Thread.from_entries/3 centralizing
entry preparation and thread reconstruction across Storage.ETS,
Storage.File, JournalBacked, and Thread.append
- Move generate_entry_id into Entry.generate_id/0 public function,
removing duplicate implementations from ETS, File, and JournalBacked
- Rename :persistence to :storage across Options, State.Lifecycle,
Lifecycle.Keyed, and InstanceManager config/docs
- Add resolve_server/1 Process.alive? guard and safe_whereis/safe_registry_lookup
wrappers in AgentServer for dead-pid and missing-registry safety
- Replace hardcoded shutdown/timeout values with RuntimeDefaults accessors:
agent_server_shutdown_timeout, sensor_runtime_shutdown_timeout,
instance_manager_stop_timeout, stop_child_shutdown_timeout,
await_child_timeout, jido_supervisor_shutdown_timeout,
agent_supervisor_max_restarts, agent_supervisor_max_seconds
- Replace defdelegate for await/await_child/await_all/await_any with
explicit wrapper functions using RuntimeDefaults timeout values
- Update errors guide to document single {:error, :not_found} convention
- Add tests covering: RuntimeDefaults shutdown/supervisor overrides,
Entry.generate_id uniqueness, Thread.prepare_entries/from_entries,
whereis registry safety, Jido start/stop/init wrappers, registry
and counting helpers, storage-backed hibernate/thaw with unified API
- Delete legacy test suites: persistence_test, store_test
- Update all existing tests to match {:error, :not_found} contract
…esolve_server to preserve via/name refs, and add cron monitor tracking
- Split synchronous handle_continue(:post_init) into sequential message
steps (:post_init_start_children, :post_init_start_subscriptions,
:post_init_register_schedules, :post_init_finalize) so GenServer mailbox
remains responsive during plugin startup
- Change resolve_server/1 to return {:ok, via} or {:ok, name} instead of
resolving to bare pid, preserving registry-based routing through
safe_call/safe_cast/GenServer.call sites
- Add safe_cast/2 wrapper catching noproc/shutdown exits like safe_call
- Add cron_monitors map to State tracking monitor_ref => job_id; handle
:DOWN for cron processes via handle_non_cron_down extraction
- Add track_cron_job/3 replacing old cron pid with monitor and cleanup
of stale refs on re-registration
- Add cleanup_tracked_children/1 on terminate demonitoring and stopping
supervised children, skipping :spawn_agent directive children
- Add safe_dynamic_supervisor_terminate_child/2, safe_process_exit/2,
normalize_status_pid/1 helpers
- Change alive?/1 for non-pid refs to use status/1 instead of
resolve_server to avoid false positives on stale pids
- Change Jido.stop_agent/2 to return :ok when agent already gone
(idempotent stop semantics)
- Add instance_manager_max_restarts, instance_manager_max_seconds to
InstanceManager DynamicSupervisor config via RuntimeDefaults
- Add status_call_timeout, stream_status_interval_ms, debug_event_buffer_size
to RuntimeDefaults with runtime config override support
- Move debug event buffer size from compile-time @max_debug_events to
runtime RuntimeDefaults.debug_event_buffer_size/0
- Add validate_agent_refinement/1 Zoi refinement on State.agent field
- Change Lifecycle.Keyed hibernate to fire-and-forget via
SystemTaskSupervisor instead of Task.async with yield/shutdown
- Normalize SpawnAgent meta to map, tag with :directive => :spawn_agent;
auto-tag nil Spawn directives with {:spawn, pid}
- Make Discovery.refresh/0 skip persistent_term write when components
unchanged, preserving last_updated timestamp
- Remove bare :not_found fallback clauses from Persist.thaw/restore_thread
now that storage layer returns {:error, :not_found} uniformly
- Add Plugin.Config warning when config values present but no config_schema
- Expose Observe.observability_config/0 as public, use from Observe.Log
- Change Await task fallback from Task.start to {:error, :task_supervisor_not_found}
- Add tests covering: post_init non-blocking state calls, tracked child
cleanup on terminate, nil-tag spawn tracking, spawn_agent meta normalization,
cron schedule async readiness, discovery idempotent refresh, idempotent
stop_agent, new RuntimeDefaults accessors
- Update existing tests with eventually_state/await_children helpers for
async post_init readiness
…nd backfill unit tests - Change Task.start to Task.Supervisor.start_child in Observe module doc examples to match supervised-task conventions - Relax cmd/2 timeout assertion from 300ms to 1_000ms for CI stability - Add Lifecycle.Noop test covering init/2 passthrough, handle_event/2 continuation, and terminate/2 no-op contract - Add Igniter.Helpers test covering module_to_name/1 atom/string conversion and parse_list/1 nil/comma-separated handling - Add Storage.ETS.Heir test covering ETS-TRANSFER event recording and unknown message filtering
…ournal, and harden registry/cron routing
- Replace async call-signal offload (Task under SystemTaskSupervisor with
:call_signal/:call_signal_result messages) with synchronous
process_call_signal_with_trace inside handle_call, removing
start_call_signal_task/3, safe_task_start/1, and
resolve_task_supervisor_for_call/1 helpers
- Extract register_server/1 from init wrapping Registry.register in
{:ok, _} / {:error, {:already_started, pid}} / {:error, :invalid_registry}
tuple contract, preventing silent double-registration
- Allow explicit :registry override in Options.new/1 attrs so
InstanceManager-spawned agents inherit the keyed registry
- Pass registry to InstanceManager child_spec so keyed agents resolve
via_tuple through their own registry instead of the default
- Route cron and schedule ticks through via_tuple(agent_id, registry)
instead of bare agent_id, with dispatch_tick/4 logging delivery
failures instead of silently discarding them
- Add stop_child_now/4 fallback in StopChild directive executor when
task supervisor resolution or async stop start fails, ensuring child
termination is not silently skipped
- Simplify Lifecycle.Keyed hibernate from fire-and-forget async task
under SystemTaskSupervisor to synchronous Persist.hibernate call
- Replace flush_journal blind append-at-rev-0 with delta-based strategy:
current_persisted_rev/3 loads stored rev, validate_flush_rev/2 rejects
stale in-memory threads, append_thread_delta/5 sends only new entries
- Add clarifying comment on :global.trans lock key tuple format in
Storage.ETS.run_locked_append/4
- Switch jido_action and jido_signal deps from local paths to git refs
on refactor/otp-cleanup branch
- Add tests covering: duplicate agent id already_started error,
serialized call-signal blocking behavior, cron tick delivery to owning
agent, plugin schedule tick delivery, subsequent hibernate delta-only
append, concurrent ETS append serialization
- Update existing async call-signal tests to match synchronous semantics
…tion, and guard plugin state-key atom creation
- Add Jido.Runtime.Tasking module centralizing task supervisor resolution
with configurable candidate lists, jido-instance derivation, and
system supervisor fallback into resolve_task_supervisor/1 and
start_child/2 helpers
- Replace inline resolve_task_supervisor/1 duplicates in Emit, StopChild,
ErrorPolicy, AgentServer, Await, Discovery, and Sensor.Runtime with
Tasking.start_child/2 and Tasking.resolve_task_supervisor/1 calls
- Change run_plugin_callback_task to accept task supervisor atom instead
of raw pid, resolving via Tasking.resolve_task_supervisor/1
- Change Spawn directive executor from Logger.error swallow to
enqueue_spawn_error/5 producing Directive.Error with :spawn context,
propagating failure through the directive queue instead of silently
discarding it
- Change SpawnAgent directive executor from Logger.error swallow to
enqueue_spawn_agent_error/6 producing Directive.Error with
:spawn_agent context with the same queue-based error propagation
- Guard Plugin.Instance.derive_state_key/2 against unbounded atom
creation via derive_or_create_state_key_atom/1 backed by ETS guard
table with @max_derived_state_keys (4096) limit and :global.trans
serialization on first-time atom creation
- Switch Sensor.Runtime schedule_event from Process.send_after to
:erlang.start_timer with {:sensor_timer, key, event} tagged tuples,
adding handle_info clause that validates timer_ref against active
timers map to reject stale timeout deliveries
- Remove bare Task.start fallback in Sensor.Runtime dispatch_async,
returning error tuple on supervisor resolution failure instead
- Remove String.t() from AgentServer.server type union; update doc
examples to use via_tuple/2 and whereis/1 instead of bare string ids
- Add tests covering: Tasking supervisor resolution and start_child,
spawn directive error enqueueing, spawn_agent directive error
enqueueing, stale tick timeout rejection, stale custom event timeout
rejection
- Update existing directive exec tests to assert queue-based error
propagation instead of silent state passthrough
…expiry, support restart and route opts passthrough
- Extract start/1 into build_child_spec/2, resolve_supervisor/1,
start_under_supervisor/2, and extract_child_spec_overrides/1 pipeline
so callers can pass :restart through opts to customize child_spec
- Move await_completion timeout from GenServer.call deadline to
server-side waiter expiry, calling GenServer.call with :infinity to
prevent late reply messages leaking into caller mailboxes after timeout
- Add build_timeout_diagnostic_from_state/2 producing diagnostic map
(waited_ms, server_status, queue_length, iteration, hint) on waiter
expiry, replying {:error, {:timeout, diagnostic}} from handle_info
- Store timeout in waiter struct so expiry handler has access to the
original deadline for diagnostic reporting
- Change infer_timeout_hint/1 to accept bare status atom instead of
full status struct
- Change supervised_runtime_child_spec from hardcoded restart: :temporary
to empty overrides, pushing restart policy decision to call sites
- Add restart: :temporary to SpawnAgent directive executor child_opts so
spawned child agents are not restarted by the DynamicSupervisor
- Add keyword opts support to SignalRouter.normalize_routes/2 accepting
{path, target, opts} and {path, match_fn, target, opts} tuples with
priority: keyword extraction via route_priority/2
- Add tests covering: await_completion timeout leaves no late reply in
caller mailbox, spawned child agent uses temporary restart semantics,
plugin signal_routes/1 tuples with keyword opts priority resolution
Harden module-based new/1 and base Agent.new/1 to treat nil and "" as missing, auto-generating a UUID7 in those cases. Previously: - Module-based new/1 used opts[:id] || generate_id(), which accepted empty string since it is truthy in Elixir - Base Agent.new/1 used Map.put_new_lazy/3, which did not override an explicitly-passed id: nil (key present, value nil) Addresses agentjido#131 Amp-Thread-ID: https://ampcode.com/threads/T-019c3d5b-a7dd-774a-9228-18595bf97861 Co-authored-by: Amp <amp@ampcode.com>
…to refactor/otp-cleanup # Conflicts: # lib/jido/agent.ex
- Change agent schema id field from required Zoi.string to
Zoi.optional() so struct literals and constructor paths can
omit id and let UUID7 auto-generation handle it, aligning
with upstream fix/agent-id-normalization nil/empty handling
- Update jido_action git ref to ad3a37a (refactor/otp-cleanup)
- Update jido_signal git ref to 3997911 (refactor/otp-cleanup)
- Bump spitfire dep from 0.3.2 to 0.3.3
- Add test asserting %Agent{} struct literal permits missing id
Contributor
Author
|
For the #128 audit work: I pushed a smaller “non-breaking first” branch What’s in
I ran If we want to de-risk review/merge, we could land the |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Refs #128
This PR is an ongoing, preliminary batch of changes addressing several high-priority findings from the v2.0.0-rc.4 lib/ audit. Posting early for review feedback on direction and API/behavior changes; more commits are expected to follow.
Included so far
Notes