[Audit] Jido lib/ Comprehensive Code Quality & Safety Review — v2.0.0-rc.4
Scope: 99 files across lib/
Audit date: 2026-02-06
Categories: OTP supervision, GenServer blocking, race conditions, error handling, code duplication, resource leaks, configuration hygiene
Executive Summary
This issue consolidates the findings from a full audit of the Jido lib/ directory. Across seven audit categories, we identified 2 critical, 22 high, 34+ medium, and 35+ low severity findings. The findings cluster around a few central themes:
AgentServer is the primary hot spot. At ~1800 lines, agent_server.ex appears in 20+ findings across every audit category — missing trap_exit, synchronous signal processing, race conditions in resolve_server, signal-handling boilerplate duplication, inconsistent error normalization, timer/monitor cleanup gaps, and scattered timeout magic numbers.
The storage layer has a parallel-hierarchy problem. Two complete storage abstractions (Agent.Store vs Storage) coexist, and thread reconstruction / entry preparation logic is independently reimplemented 3–4 times across storage adapters. Consolidation eliminates ~230+ lines of structural duplication.
Error handling lacks a unified contract. Three error shapes coexist (%Jido.Error{} structs, bare atoms, bare strings), the :not_found sentinel appears in two incompatible forms, and Thread.Store uses 3-element tuples. Callers cannot reliably pattern-match on errors from different modules.
Timeouts and defaults are scattered as magic numbers. At least 10 different timeout defaults across 5 files, 4 copies of a 5_000ms shutdown timeout, and max_queue_size defined in 2 places with no centralized defaults module.
Critical Findings — Fix Immediately
C1: String.to_atom/1 fallback in journal_backed.ex — VM crash risk
File: lib/jido/thread/store/adapters/journal_backed.ex:192
defp to_atom(string) when is_binary(string) do
  String.to_existing_atom(string)
rescue
  ArgumentError -> String.to_atom(string)
end
If journal data contains arbitrary user-generated strings, each unique string creates a permanent atom. The BEAM atom table has a hard limit (~1,048,576 atoms). An attacker or misbehaving data source can exhaust this and crash the entire VM — not just the process, the entire node.
Fix: Remove the fallback entirely. Either keep the value as a string, map through a fixed whitelist of allowed atoms, or return {:error, :unknown_atom}. Never create atoms from untrusted data.
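A minimal sketch of the whitelist approach (the allowed atom list is illustrative, and callers would need to handle the new error tuple):
@allowed_atoms ~w(pending running completed failed)a

defp to_atom(string) when is_binary(string) do
  # Only atoms from a fixed, compile-time whitelist are ever produced from
  # journal data; anything else is rejected instead of minting a new atom.
  case Enum.find(@allowed_atoms, fn atom -> Atom.to_string(atom) == string end) do
    nil -> {:error, :unknown_atom}
    atom -> {:ok, atom}
  end
end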
C2: Parallel storage hierarchies — Agent.Store vs Storage
Files: 8 files across two hierarchies
The project maintains two completely separate storage abstraction hierarchies:
Hierarchy A (Legacy): agent/store.ex (behaviour), agent/store/ets.ex (68 lines), agent/store/file.ex (87 lines), agent/persistence.ex (214 lines facade).
Hierarchy B (Unified): storage.ex (behaviour), storage/ets.ex (314 lines), storage/file.ex (377 lines), persist.ex (356 lines facade).
The checkpoint operations in Hierarchy B are structurally identical to Hierarchy A — the ETS get implementations are nearly character-for-character duplicates. This creates ~150 lines of structural duplication plus ongoing maintenance confusion about which API to use.
Fix: Deprecate and remove the entire Agent.Store hierarchy and Agent.Persistence facade. Migrate all callers to Jido.Storage checkpoint operations. Jido.Persist is the evolved version.
High-Priority Findings
The following are organized by theme rather than by individual audit document, since many findings span multiple categories.
H1: AgentServer does not trap_exit despite starting linked children
File: lib/jido/agent_server.ex (init at ~646, start_plugin_child at ~1344, start_subscription_sensor at ~1413)
AgentServer starts plugin children via apply(m, f, a) on child specs and sensor runtimes via SensorRuntime.start_link/1. Both create linked processes. However, AgentServer never calls Process.flag(:trap_exit, true), and its exit handling is written for monitors (:DOWN messages), not {:EXIT, ...} messages.
If any linked child or sensor crashes, the AgentServer dies immediately without calling terminate/2, bypassing lifecycle persistence/hibernation, cron job cancellation, and completion waiter cleanup.
Fix (preferred): Start plugin children and sensors under a per-agent DynamicSupervisor rather than linking them directly to the GenServer. Alternative: Add Process.flag(:trap_exit, true) in init/1 and a handle_info({:EXIT, pid, reason}, state) clause that feeds into the existing child-down pipeline.
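A minimal sketch of the trap_exit alternative; handle_child_down/3 is a stand-in for whatever the existing monitor :DOWN pipeline is called:
# In init/1, before any linked child or sensor is started:
Process.flag(:trap_exit, true)

# New clause alongside the existing :DOWN handling; converts exits from
# linked processes into the same child-down path instead of killing the server.
@impl true
def handle_info({:EXIT, pid, reason}, state) do
  {:noreply, handle_child_down(pid, reason, state)}
end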
H2: handle_call({:signal, ...}) runs the full strategy pipeline synchronously
File: lib/jido/agent_server.ex:729-746
The call chain is handle_call → process_signal → do_process_signal → dispatch_action → agent_module.cmd/2 → strategy.cmd/3 → Jido.Exec.run/1 for each instruction sequentially. For a framework designed to work with LLM-based agents, a single handle_call blocks the GenServer for the duration of potentially slow external API calls (seconds to minutes). All other handle_call and handle_info messages queue behind it.
Additionally, before cmd/2, run_plugin_signal_hooks/2 iterates all plugins calling handle_signal/2, and after cmd/2, run_plugin_transform_hooks/4 iterates all plugins calling transform_result/3 — all user-provided code with no timeout enforcement.
The default 5-second timeout on call/3 (line 276) is insufficient for LLM agents, and the caller gets a timeout error while the GenServer continues processing, leading to state divergence.
Fix: handle_call({:signal, ...}) should only enqueue the signal and return immediately; actual execution should happen asynchronously (the drain loop already exists for directives). Alternatively, offload agent_module.cmd/2 to a Task and return {:noreply, state}, replying via GenServer.reply/2 when complete.
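A minimal sketch of the Task-offload variant, assuming the supervised Task.Supervisor proposed in H3 and a hypothetical run_signal_pipeline/2 wrapper around the existing process_signal path; folding the resulting agent state back into the server (via a follow-up message) is the real architectural work and is omitted here:
def handle_call({:signal, signal}, from, state) do
  # The slow strategy pipeline runs off the GenServer loop, so other calls
  # and infos are no longer blocked behind an LLM round-trip.
  Task.Supervisor.start_child(Jido.SystemTaskSupervisor, fn ->
    result = run_signal_pipeline(signal, state)
    # Reply from the task once the pipeline finishes.
    GenServer.reply(from, result)
  end)

  {:noreply, state}
end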
H3: Task.async/1 in Discovery.init_async/0 — linked, unsupervised, never awaited
File: lib/jido/discovery.ex:88-94, called from lib/jido/application.ex:16
Task.async/1 creates a linked process and expects Task.await/2 to be called. Here the return value is discarded — the task is started, then Supervisor.start_link is called, and the task result is never awaited. This violates the Task.async API contract, the unhandled reply message sits in the caller's mailbox, and if build_catalog() raises, it crashes the application start process.
Fix: Replace with Task.Supervisor.start_child/2 under a supervised Task.Supervisor started in Jido.Application:
# In Jido.Application children:
{Task.Supervisor, name: Jido.SystemTaskSupervisor}

# In Discovery:
def init_async do
  Task.Supervisor.start_child(Jido.SystemTaskSupervisor, fn ->
    catalog = build_catalog()
    :persistent_term.put(@catalog_key, catalog)
  end)
end
H4: ETS tables created without :heir by arbitrary owner processes
Files: lib/jido/storage/ets.ex:224, lib/jido/agent/store/ets.ex:60
Tables are created by whichever process first calls the storage function — not by a supervised process. There is no :heir option, so if the owning process crashes, the entire ETS table and all its data are destroyed. No :ets.delete/1 cleanup exists anywhere.
Fix: Create ETS tables from a dedicated supervised process and set the :heir option so another process can take ownership on crash.
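A sketch of a dedicated table-owner process; the module name, table name, and the idea of passing the heir pid in as an option are illustrative, not existing Jido code:
defmodule Jido.Storage.TableOwner do
  @moduledoc false
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # The heir (e.g. a long-lived supervisor pid) inherits the table if this
    # process crashes, so thread data is not destroyed with its owner.
    heir = Keyword.fetch!(opts, :heir)

    :ets.new(:jido_storage_threads, [
      :named_table, :public, :set,
      {:heir, heir, :jido_storage_threads}
    ])

    {:ok, %{}}
  end
end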
H5: ETS append_thread — read-then-write without locking
File: lib/jido/storage/ets.ex:128-177
current_rev = get_current_rev(threads_table, thread_id) # read
# ... no lock ...
:ets.insert(threads_table, ets_entries) # write based on stale read
Two concurrent processes can read the same current_rev, both compute entries starting at the same sequence number, both insert, and the result is duplicate sequence numbers and data corruption. The expected_rev parameter provides optimistic concurrency control, but it is optional — when not provided, the check is skipped entirely. Even when provided, the check is not atomic with the write. Contrast with Storage.File which correctly uses :global.trans.
Fix: Use :ets.update_counter/4 to atomically reserve sequence ranges, add a per-thread-id lock (like File's :global.trans), route all ETS thread writes through a single GenServer per thread, or at minimum make expected_rev mandatory.
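A sketch of the :ets.update_counter/4 approach, assuming a separate counters table keyed by thread_id (table and function names are illustrative):
defp reserve_seq_range(counters_table, thread_id, count) do
  # Atomically bump the per-thread counter by `count`; concurrent appenders
  # can never receive overlapping sequence ranges.
  last_seq = :ets.update_counter(counters_table, thread_id, {2, count}, {thread_id, 0})
  (last_seq - count + 1)..last_seq
end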
H6: Sensor runtime dispatch blocks the GenServer
File: lib/jido/sensor/runtime.ex:291-308
deliver_signal/2 calls Dispatch.dispatch(signal, agent_ref) synchronously within the sensor GenServer. If the dispatch target is an HTTP or webhook adapter, this blocks the sensor for the full network call duration. Subsequent :tick messages queue up.
Fix: Offload dispatch to a supervised Task, similar to how it's already done in the Emit directive executor (directive_executors.ex:31).
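A sketch of the offload, assuming the supervised Task.Supervisor from H3 and that agent_ref lives on the runtime state (return shape is illustrative):
defp deliver_signal(signal, state) do
  # The dispatch (possibly an HTTP/webhook call) runs in a supervised task,
  # so the sensor keeps processing :tick messages while it is in flight.
  Task.Supervisor.start_child(Jido.SystemTaskSupervisor, fn ->
    Dispatch.dispatch(signal, state.agent_ref)
  end)

  state
end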
H7: Jido.Util returns {:error, String.t()} — API contract violation
File: lib/jido/util.ex:62-240
All validation functions return bare string errors like {:error, "All actions must implement the Jido.Action behavior"}. Any caller pattern-matching on {:error, %Error{message: msg}} will miss these entirely.
Fix: Wrap all Jido.Util validation returns in Jido.Error.validation_error(...). This is the single highest-impact error normalization change.
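A before/after sketch, assuming Jido.Error.validation_error/1 accepts a message and builds the %Jido.Error{} struct:
# Before (Jido.Util today):
{:error, "All actions must implement the Jido.Action behavior"}

# After:
{:error, Jido.Error.validation_error("All actions must implement the Jido.Action behavior")}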
H8: :not_found returned in two incompatible forms
Bare :not_found is returned by Storage.ETS, Storage.File, Persist, and Agent.Store callbacks. Meanwhile {:error, :not_found} is returned by AgentServer, Await, Util, and InstanceManager. Any with chain expecting only {:ok, _} and {:error, _} silently falls through on bare :not_found.
Fix: Define a project-wide convention — either @type find_result :: {:ok, t()} | :not_found | {:error, term()} with explicit documentation, or normalize everything to {:error, :not_found}.
H9: AgentServer.process_signal doesn't normalize errors before returning
File: lib/jido/agent_server.ex:997, 1019, 1097, 1099-1100
process_signal can return errors of multiple shapes: raw plugin errors, a bare :no_matching_route atom, or routing reasons. Callers of AgentServer.call/3 cannot reliably pattern-match on what comes back.
Fix: Normalize all errors to Jido.Error structs before returning from process_signal.
H10: Thread reconstruction logic triplicated; entry preparation quadruplicated
Thread reconstruction (reconstruct_thread) is independently implemented in storage/ets.ex, thread/store/adapters/journal_backed.ex, and storage/file.ex — all building a %Thread{} from entries with entry_count, timestamps, metadata, and stats (~35 duplicated lines).
Entry preparation (assigning seq, timestamp, id) is implemented in four places: thread.ex (canonical), storage/ets.ex (copy), storage/file.ex (simplified), and journal_backed.ex (reimplementation). The fetch_entry_attr helper is character-for-character identical between Thread and Storage.ETS (~80 duplicated lines).
Entry ID generation uses three different approaches across four files — Jido.Util.generate_id(), Base.url_encode64(:crypto.strong_rand_bytes(12)), and random_string(12).
Fix: Add Thread.from_entries/2, Thread.prepare_entries/3, and Thread.Entry.generate_id/0 as public functions. All storage adapters delegate to them. Eliminates ~130 lines.
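A sketch of the proposed public helpers; the real %Jido.Thread{} struct carries more fields (timestamps, metadata, stats) than shown, and the field names here are illustrative:
defmodule Jido.Thread do
  defstruct [:id, :entries, :entry_count]

  # Canonical reconstruction used by every storage adapter instead of each
  # one rebuilding the thread shape itself.
  def from_entries(thread_id, entries) do
    %__MODULE__{id: thread_id, entries: entries, entry_count: length(entries)}
  end
end

defmodule Jido.Thread.Entry do
  # Single ID scheme replacing the three divergent generation approaches.
  def generate_id, do: Jido.Util.generate_id()
end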
H11: Signal processing boilerplate copy-pasted three times in AgentServer
File: lib/jido/agent_server.ex
The identical "trace + try/catch + process_signal" pattern appears verbatim in handle_cast({:signal, ...}), handle_info({:scheduled_signal, ...}), and handle_info({:signal, ...}):
{traced_signal, _ctx} = TraceContext.ensure_from_signal(signal)
try do
  case process_signal(traced_signal, state) do
    {:ok, new_state, _resolved_action} -> {:noreply, new_state}
    {:error, _reason, new_state} -> {:noreply, new_state}
  end
after
  TraceContext.clear()
end
Similarly, the Trace.put + new_root wrapping pattern is triplicated at lines 1678, 1712, and 1736.
Fix: Extract defp handle_signal_async(signal, state) and defp ensure_traced(signal) helpers. Eliminates ~42 lines.
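A sketch of the extraction; the helper body is the block quoted above, moved into one place, and each handler becomes a delegation. The ensure_traced/1 helper would analogously absorb the Trace.put + new_root wrapping triplicated at lines 1678, 1712, and 1736:
defp handle_signal_async(signal, state) do
  {traced_signal, _ctx} = TraceContext.ensure_from_signal(signal)

  try do
    case process_signal(traced_signal, state) do
      {:ok, new_state, _resolved_action} -> {:noreply, new_state}
      {:error, _reason, new_state} -> {:noreply, new_state}
    end
  after
    TraceContext.clear()
  end
end

# Call sites collapse to one-liners, e.g.:
def handle_cast({:signal, signal}, state), do: handle_signal_async(signal, state)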
H12: Persistence facade duplication
Two independent persistence facades exist: Agent.Persistence (214 lines, uses Agent.Store, dump/load callbacks) and Persist (356 lines, uses Storage, checkpoint/restore callbacks plus thread journal support). Both implement the same check-callback → call-adapter → handle-results pattern.
Fix: Jido.Persist is the evolved version. Remove Jido.Agent.Persistence and migrate all callers. Resolves as part of C2.
H13: Timeout defaults scattered as magic numbers
At least 10 different timeout values are hard-coded across the codebase; representative locations:
| Value | Locations |
|---|---|
| 5_000ms (call) | agent_server.ex:276, worker_pool.ex:88,117,175 |
| 5_000ms (shutdown) | directive_executors.ex:193, instance_manager.ex:226, agent_server.ex:251, sensor/runtime.ex:90 |
| 10_000ms (await) | agent_server.ex:351, await.ex:78, jido.ex:468,486,495 |
| 30_000ms (child await) | jido.ex:477 |
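Item 15 of the fix order below centralizes these; a minimal sketch of a defaults module (module and key names are proposals), with each value overridable via application config:
defmodule Jido.Defaults do
  @moduledoc false
  # Single source of truth for timeout defaults; call sites read from here
  # instead of repeating literal values.
  def call_timeout, do: Application.get_env(:jido, :call_timeout, 5_000)
  def shutdown_timeout, do: Application.get_env(:jido, :shutdown_timeout, 5_000)
  def await_timeout, do: Application.get_env(:jido, :await_timeout, 10_000)
  def child_await_timeout, do: Application.get_env(:jido, :child_await_timeout, 30_000)
end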
Recommended Fix Order
The following sequence respects dependency chains and maximizes safety impact per unit of effort.
Phase 1 — Safety-critical (do first, independent changes):
1. Remove the String.to_atom/1 fallback in journal_backed.ex (C1)
2. Add Process.flag(:trap_exit, true) to AgentServer init/1 plus a handle_info({:EXIT, ...}) clause (H1)
3. Replace Task.async/1 in Discovery.init_async/0 with a supervised task (H3)
4. Create ETS tables from a supervised process with :heir (H4)
Phase 2 — Data integrity & blocking:
5. Add locking or atomic ops to ETS append_thread (H5)
6. Offload sensor dispatch to a Task (H6)
7. Make signal processing async in AgentServer (H2) — largest architectural change
Phase 3 — Error contract unification:
8. Normalize Jido.Util error returns to %Error{} (H7)
9. Standardize :not_found convention project-wide (H8)
10. Add error normalization in AgentServer.process_signal (H9)
Phase 4 — Deduplication:
11. Remove Agent.Store hierarchy, migrate to Storage (C2, H12)
12. Extract Thread.from_entries/2, Thread.prepare_entries/3, Entry.generate_id/0 (H10)
13. Extract handle_signal_async/2 and ensure_traced/1 helpers (H11)
Phase 5 — Configuration & operational hygiene:
14. Fix telemetry compile_env + default log level + validation (H14, M15)
15. Centralize timeout defaults (H13)
16. Make supervisor restart intensity / max_children configurable (H15)
17. Add try/catch to all AgentServer public API functions (M2)
This issue was synthesized from 7 category-specific audit reports (OTP supervision, GenServer blocking, race conditions, state management & error tuples, code duplication, resource leaks, configuration hygiene) plus an independent cross-cutting review.