
[Audit] Jido lib/ Comprehensive Code Quality & Safety Review — v2.0.0-rc.4 #128

@nshkrdotcom

Description


Scope: 99 files across lib/
Audit date: 2026-02-06
Categories: OTP supervision, GenServer blocking, race conditions, error handling, code duplication, resource leaks, configuration hygiene


Executive Summary

This issue consolidates the findings from a full audit of the Jido lib/ directory. Across seven audit categories, we identified 2 critical, 22 high, 34+ medium, and 35+ low severity findings. The findings cluster around a few central themes:

AgentServer is the primary hot spot. At ~1800 lines, agent_server.ex appears in 20+ findings across every audit category — missing trap_exit, synchronous signal processing, race conditions in resolve_server, signal-handling boilerplate duplication, inconsistent error normalization, timer/monitor cleanup gaps, and scattered timeout magic numbers.

The storage layer has a parallel-hierarchy problem. Two complete storage abstractions (Agent.Store vs Storage) coexist, and thread reconstruction / entry preparation logic is independently reimplemented 3–4 times across storage adapters. Consolidation eliminates ~230+ lines of structural duplication.

Error handling lacks a unified contract. Three error shapes coexist (%Jido.Error{} structs, bare atoms, bare strings), the :not_found sentinel appears in two incompatible forms, and Thread.Store uses 3-element tuples. Callers cannot reliably pattern-match on errors from different modules.

Timeouts and defaults are scattered as magic numbers. At least 10 different timeout defaults across 5 files, 4 copies of a 5_000ms shutdown timeout, and max_queue_size defined in 2 places with no centralized defaults module.


Critical Findings — Fix Immediately

C1: String.to_atom/1 fallback in journal_backed.ex — VM crash risk

File: lib/jido/thread/store/adapters/journal_backed.ex:192

defp to_atom(string) when is_binary(string) do
  String.to_existing_atom(string)
rescue
  ArgumentError -> String.to_atom(string)
end

If journal data contains arbitrary user-generated strings, each unique string creates a permanent atom. The BEAM atom table has a hard limit (~1,048,576 atoms). An attacker or misbehaving data source can exhaust this and crash the entire VM — not just the process, the entire node.

Fix: Remove the fallback entirely. Either keep the value as a string, map through a fixed whitelist of allowed atoms, or return {:error, :unknown_atom}. Never create atoms from untrusted data.
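The whitelist option could be sketched as follows; the `@allowed_atoms` list is illustrative — the real set is whatever atoms the journal schema actually needs:

```elixir
# Sketch of a whitelist-based replacement for the to_atom/1 fallback.
# Unknown strings return an error instead of minting a new atom.
defmodule SafeAtoms do
  @allowed_atoms [:user, :assistant, :system, :tool]
  @allowed_map Map.new(@allowed_atoms, fn a -> {Atom.to_string(a), a} end)

  @spec to_atom(String.t()) :: {:ok, atom()} | {:error, :unknown_atom}
  def to_atom(string) when is_binary(string) do
    case Map.fetch(@allowed_map, string) do
      {:ok, atom} -> {:ok, atom}
      :error -> {:error, :unknown_atom}
    end
  end
end
```

Because the map is built at compile time from a fixed list, the atom table can never grow from untrusted input.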

C2: Parallel storage hierarchies — Agent.Store vs Storage

Files: 8 files across two hierarchies

The project maintains two completely separate storage abstraction hierarchies:

Hierarchy A (Legacy): agent/store.ex (behaviour), agent/store/ets.ex (68 lines), agent/store/file.ex (87 lines), agent/persistence.ex (214 lines facade).

Hierarchy B (Unified): storage.ex (behaviour), storage/ets.ex (314 lines), storage/file.ex (377 lines), persist.ex (356 lines facade).

The checkpoint operations in Hierarchy B are structurally identical to Hierarchy A — the ETS get implementations are nearly character-for-character duplicates. This creates ~150 lines of structural duplication plus ongoing maintenance confusion about which API to use.

Fix: Deprecate and remove the entire Agent.Store hierarchy and Agent.Persistence facade. Migrate all callers to Jido.Storage checkpoint operations. Jido.Persist is the evolved version.


High-Priority Findings

The following are organized by theme rather than by individual audit document, since many findings span multiple categories.

H1: AgentServer does not trap_exit despite starting linked children

File: lib/jido/agent_server.ex (init at ~646, start_plugin_child at ~1344, start_subscription_sensor at ~1413)

AgentServer starts plugin children via apply(m, f, a) on child specs and sensor runtimes via SensorRuntime.start_link/1. Both create linked processes. However, AgentServer never calls Process.flag(:trap_exit, true), and its exit handling is written for monitors (:DOWN messages), not {:EXIT, ...} messages.

If any linked child or sensor crashes, the AgentServer dies immediately without calling terminate/2, bypassing lifecycle persistence/hibernation, cron job cancellation, and completion waiter cleanup.

Fix (preferred): Start plugin children and sensors under a per-agent DynamicSupervisor rather than linking them directly to the GenServer. Alternative: Add Process.flag(:trap_exit, true) in init/1 and a handle_info({:EXIT, pid, reason}, state) clause that feeds into the existing child-down pipeline.
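The trap_exit alternative can be sketched as a minimal GenServer; the state shape and the child_down/3 helper are illustrative, not the real AgentServer internals:

```elixir
defmodule TrapExitSketch do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # Trap exits so a crashing linked child delivers {:EXIT, pid, reason}
    # instead of killing this server and skipping terminate/2.
    Process.flag(:trap_exit, true)
    {:ok, %{children: %{}}}
  end

  @impl true
  def handle_info({:EXIT, pid, reason}, state) do
    # Feed linked-child exits into the same cleanup path the real server
    # uses for monitor :DOWN messages.
    {:noreply, child_down(pid, reason, state)}
  end

  defp child_down(pid, _reason, state) do
    %{state | children: Map.delete(state.children, pid)}
  end
end
```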

H2: handle_call({:signal, ...}) runs the full strategy pipeline synchronously

File: lib/jido/agent_server.ex:729-746

The call chain is handle_call → process_signal → do_process_signal → dispatch_action → agent_module.cmd/2 → strategy.cmd/3 → Jido.Exec.run/1 for each instruction sequentially. For a framework designed to work with LLM-based agents, a single handle_call blocks the GenServer for the duration of potentially slow external API calls (seconds to minutes). All other handle_call and handle_info messages queue behind it.

Additionally, before cmd/2, run_plugin_signal_hooks/2 iterates all plugins calling handle_signal/2, and after cmd/2, run_plugin_transform_hooks/4 iterates all plugins calling transform_result/3 — all user-provided code with no timeout enforcement.

The default 5-second timeout on call/3 (line 276) is insufficient for LLM agents, and the caller gets a timeout error while the GenServer continues processing, leading to state divergence.

Fix: The handle_call({:signal, ...}) should only enqueue the signal and return immediately. Actual execution should happen asynchronously (the drain loop already exists for directives). Alternatively, offload agent_module.cmd/2 to a Task and return {:noreply, state}, replying via GenServer.reply/2 when complete.
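The Task-offload variant can be sketched as follows. Jido.SystemTaskSupervisor is the supervisor proposed in H3, and run_pipeline/2 is a stand-in for the real process_signal path; note that any state the pipeline produces must be merged back on the GenServer, not inside the task:

```elixir
defmodule AsyncSignalSketch do
  use GenServer

  def init(state), do: {:ok, state}

  # handle_call only launches the work and defers the reply, so the
  # server stays free to process other messages meanwhile.
  def handle_call({:signal, signal}, from, state) do
    server = self()

    Task.Supervisor.start_child(Jido.SystemTaskSupervisor, fn ->
      send(server, {:signal_done, from, run_pipeline(signal, state)})
    end)

    {:noreply, state}
  end

  # Reply once the task reports back. If the pipeline yields a new agent
  # state, merge it here rather than mutating it from the task.
  def handle_info({:signal_done, from, result}, state) do
    GenServer.reply(from, result)
    {:noreply, state}
  end

  defp run_pipeline(_signal, _state), do: {:ok, :done}
end
```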

H3: Task.async/1 in Discovery.init_async/0 — linked, unsupervised, never awaited

File: lib/jido/discovery.ex:88-94, called from lib/jido/application.ex:16

Task.async/1 creates a linked process and expects Task.await/2 to be called. Here the return value is discarded — the task is started, then Supervisor.start_link is called, and the task result is never awaited. This violates the Task.async API contract, the unhandled reply message sits in the caller's mailbox, and if build_catalog() raises, it crashes the application start process.

Fix: Replace with Task.Supervisor.start_child/2 under a supervised Task.Supervisor started in Jido.Application:

# In Jido.Application children:
{Task.Supervisor, name: Jido.SystemTaskSupervisor}

In Discovery:

def init_async do
  Task.Supervisor.start_child(Jido.SystemTaskSupervisor, fn ->
    catalog = build_catalog()
    :persistent_term.put(@catalog_key, catalog)
  end)
end

H4: ETS tables created without :heir by arbitrary owner processes

Files: lib/jido/storage/ets.ex:224, lib/jido/agent/store/ets.ex:60

Tables are created by whichever process first calls the storage function — not by a supervised process. There is no :heir option, so if the owning process crashes, the entire ETS table and all its data are destroyed. No :ets.delete/1 cleanup exists anywhere.

Fix: Create ETS tables from a dedicated supervised process and set the :heir option so another process can take ownership on crash.
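A sketch of the supervised-owner approach; TableOwner, the table name, and the heir data tag are all illustrative:

```elixir
defmodule TableOwner do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    heir = Keyword.fetch!(opts, :heir)

    # If this process dies, the table is handed to the heir via an
    # {:"ETS-TRANSFER", table, from, :jido_storage_table} message instead
    # of being destroyed along with its data.
    table =
      :ets.new(:jido_storage, [
        :named_table,
        :public,
        :set,
        {:heir, heir, :jido_storage_table}
      ])

    {:ok, %{table: table}}
  end
end
```

In practice the heir would be the supervisor or a sibling janitor process, so data survives an owner restart.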

H5: ETS append_thread — read-then-write without locking

File: lib/jido/storage/ets.ex:128-177

current_rev = get_current_rev(threads_table, thread_id)  # read
# ... no lock ...
:ets.insert(threads_table, ets_entries)                   # write based on stale read

Two concurrent processes can read the same current_rev, both compute entries starting at the same sequence number, and both insert; the result is duplicate sequence numbers and data corruption. The expected_rev parameter provides optimistic concurrency control, but it is optional — when not provided, the check is skipped entirely. Even when provided, the check is not atomic with the write. Contrast this with Storage.File, which correctly uses :global.trans.

Fix: Use :ets.update_counter/4 to atomically reserve sequence ranges, add a per-thread-id lock (like File's :global.trans), route all ETS thread writes through a single GenServer per thread, or at minimum make expected_rev mandatory.
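The :ets.update_counter/4 option can be sketched as a sequence-range reservation; the {:seq, thread_id} counter key is an illustrative convention, not the adapter's real table layout:

```elixir
table = :ets.new(:threads_sketch, [:set, :public])
thread_id = "t1"
entries = [%{body: "a"}, %{body: "b"}]

# Atomically advance the per-thread counter by length(entries); the
# returned value is the *last* reserved seq, so first = last - count + 1.
# Concurrent callers can never receive overlapping ranges, because the
# read-modify-write happens inside a single ETS operation.
last =
  :ets.update_counter(
    table,
    {:seq, thread_id},
    {2, length(entries)},
    {{:seq, thread_id}, 0}
  )

first = last - length(entries) + 1

numbered =
  entries
  |> Enum.with_index(first)
  |> Enum.map(fn {entry, seq} -> Map.put(entry, :seq, seq) end)
```

This reserves the range atomically but does not by itself enforce expected_rev; a mandatory expected_rev check or a per-thread lock is still needed for full optimistic concurrency.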

H6: Sensor runtime dispatch blocks the GenServer

File: lib/jido/sensor/runtime.ex:291-308

deliver_signal/2 calls Dispatch.dispatch(signal, agent_ref) synchronously within the sensor GenServer. If the dispatch target is an HTTP or webhook adapter, this blocks the sensor for the full network call duration. Subsequent :tick messages queue up.

Fix: Offload dispatch to a supervised Task, similar to how it's already done in the Emit directive executor (directive_executors.ex:31).

H7: Jido.Util returns {:error, String.t()} — API contract violation

File: lib/jido/util.ex:62-240

All validation functions return bare string errors like {:error, "All actions must implement the Jido.Action behavior"}. Any caller pattern-matching on {:error, %Error{message: msg}} will miss these entirely.

Fix: Wrap all Jido.Util validation returns in Jido.Error.validation_error(...). This is the single highest-impact error normalization change.
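The shape of the fix, sketched with an illustrative %ValidationError{} struct standing in for the real Jido.Error.validation_error(...):

```elixir
# Hypothetical error struct; the real target shape is %Jido.Error{}.
defmodule ValidationError do
  defstruct [:message]
end

# Normalize the shapes the audit found — bare strings, bare atoms, and
# already-wrapped structs — into one matchable contract.
normalize = fn
  {:error, %ValidationError{}} = err -> err
  {:error, msg} when is_binary(msg) -> {:error, %ValidationError{message: msg}}
  {:error, atom} when is_atom(atom) -> {:error, %ValidationError{message: to_string(atom)}}
  other -> other
end
```

With this in place, callers can match on {:error, %Error{}} alone and never miss a validation failure.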

H8: :not_found returned in two incompatible forms

Bare :not_found is returned by Storage.ETS, Storage.File, Persist, and Agent.Store callbacks. Meanwhile {:error, :not_found} is returned by AgentServer, Await, Util, and InstanceManager. Any with chain expecting only {:ok, _} and {:error, _} silently falls through on bare :not_found.

Fix: Define a project-wide convention — either @type find_result :: {:ok, t()} | :not_found | {:error, term()} with explicit documentation, or normalize everything to {:error, :not_found}.

H9: AgentServer.process_signal doesn't normalize errors before returning

File: lib/jido/agent_server.ex:997, 1019, 1097, 1099-1100

process_signal can return errors of multiple shapes: raw plugin errors, a bare :no_matching_route atom, or routing reasons. Callers of AgentServer.call/3 cannot reliably pattern-match on what comes back.

Fix: Normalize all errors to Jido.Error structs before returning from process_signal.

H10: Thread reconstruction logic triplicated; entry preparation quadruplicated

Thread reconstruction (reconstruct_thread) is independently implemented in storage/ets.ex, thread/store/adapters/journal_backed.ex, and storage/file.ex — all building a %Thread{} from entries with entry_count, timestamps, metadata, and stats (~35 duplicated lines).

Entry preparation (assigning seq, timestamp, id) is implemented in four places: thread.ex (canonical), storage/ets.ex (copy), storage/file.ex (simplified), and journal_backed.ex (reimplementation). The fetch_entry_attr helper is character-for-character identical between Thread and Storage.ETS (~80 duplicated lines).

Entry ID generation uses three different approaches across four files — Jido.Util.generate_id(), Base.url_encode64(:crypto.strong_rand_bytes(12)), and random_string(12).

Fix: Add Thread.from_entries/2, Thread.prepare_entries/3, and Thread.Entry.generate_id/0 as public functions. All storage adapters delegate to them. Eliminates ~130 lines.

H11: Signal processing boilerplate copy-pasted three times in AgentServer

File: lib/jido/agent_server.ex

The identical "trace + try/catch + process_signal" pattern appears verbatim in handle_cast({:signal, ...}), handle_info({:scheduled_signal, ...}), and handle_info({:signal, ...}):

{traced_signal, _ctx} = TraceContext.ensure_from_signal(signal)
try do
  case process_signal(traced_signal, state) do
    {:ok, new_state, _resolved_action} -> {:noreply, new_state}
    {:error, _reason, new_state} -> {:noreply, new_state}
  end
after
  TraceContext.clear()
end

Similarly, the Trace.put + new_root wrapping pattern is triplicated at lines 1678, 1712, and 1736.

Fix: Extract defp handle_signal_async(signal, state) and defp ensure_traced(signal) helpers. Eliminates ~42 lines.
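Sketched directly from the duplicated block above (a fragment meant to live inside AgentServer, so TraceContext and process_signal/2 are the module's existing names):

```elixir
# Single home for the trace + try/catch + process_signal pattern.
defp handle_signal_async(signal, state) do
  {traced_signal, _ctx} = TraceContext.ensure_from_signal(signal)

  try do
    case process_signal(traced_signal, state) do
      {:ok, new_state, _resolved_action} -> {:noreply, new_state}
      {:error, _reason, new_state} -> {:noreply, new_state}
    end
  after
    TraceContext.clear()
  end
end

# Each of the three duplicated clauses then collapses to a delegation:
def handle_cast({:signal, signal}, state), do: handle_signal_async(signal, state)
def handle_info({:scheduled_signal, signal}, state), do: handle_signal_async(signal, state)
def handle_info({:signal, signal}, state), do: handle_signal_async(signal, state)
```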

H12: Persistence facade duplication

Two independent persistence facades exist: Agent.Persistence (214 lines, uses Agent.Store, dump/load callbacks) and Persist (356 lines, uses Storage, checkpoint/restore callbacks plus thread journal support). Both implement the same check-callback → call-adapter → handle-results pattern.

Fix: Jido.Persist is the evolved version. Remove Jido.Agent.Persistence and migrate all callers. Resolves as part of C2.

H13: Timeout defaults scattered as magic numbers

At least 10 different timeout values across 5 files:

Value                    Locations
5_000ms (call)           agent_server.ex:276; worker_pool.ex:88, 117, 175
5_000ms (shutdown)       directive_executors.ex:193; instance_manager.ex:226; agent_server.ex:251; sensor/runtime.ex:90
10_000ms (await)         agent_server.ex:351; await.ex:78; jido.ex:468, 486, 495
30_000ms (child await)   jido.ex:477
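One way to centralize these — Jido.Defaults is a hypothetical module name, the values mirror the table above, and each is overridable via application config:

```elixir
defmodule Jido.Defaults do
  @moduledoc "Single source of truth for timeout defaults (sketch)."

  @call_timeout 5_000
  @shutdown_timeout 5_000
  @await_timeout 10_000
  @child_await_timeout 30_000

  # Call sites read from here instead of inlining magic numbers; the app
  # env lookup lets deployments override without code changes.
  def call_timeout, do: Application.get_env(:jido, :call_timeout, @call_timeout)
  def shutdown_timeout, do: Application.get_env(:jido, :shutdown_timeout, @shutdown_timeout)
  def await_timeout, do: Application.get_env(:jido, :await_timeout, @await_timeout)
  def child_await_timeout, do: Application.get_env(:jido, :child_await_timeout, @child_await_timeout)
end
```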

Recommended Fix Order

The following sequence respects dependency chains and maximizes safety impact per unit of effort.

Phase 1 — Safety-critical (do first, independent changes):

  1. Remove String.to_atom/1 fallback in journal_backed.ex (C1)
  2. Add Process.flag(:trap_exit, true) to AgentServer init/1 + handle {:EXIT, ...} (H1)
  3. Replace Task.async/1 in Discovery.init_async/0 with supervised task (H3)
  4. Create ETS tables from a supervised process with :heir (H4)

Phase 2 — Data integrity & blocking:

  5. Add locking or atomic ops to ETS append_thread (H5)
  6. Offload sensor dispatch to a Task (H6)
  7. Make signal processing async in AgentServer (H2) — largest architectural change

Phase 3 — Error contract unification:

  8. Normalize Jido.Util error returns to %Error{} (H7)
  9. Standardize the :not_found convention project-wide (H8)
  10. Add error normalization in AgentServer.process_signal (H9)

Phase 4 — Deduplication:

  11. Remove the Agent.Store hierarchy, migrate to Storage (C2, H12)
  12. Extract Thread.from_entries/2, Thread.prepare_entries/3, and Entry.generate_id/0 (H10)
  13. Extract handle_signal_async/2 and ensure_traced/1 helpers (H11)

Phase 5 — Configuration & operational hygiene:

  14. Fix telemetry compile_env, default log level, and validation (H14, M15)
  15. Centralize timeout defaults (H13)
  16. Make supervisor restart intensity / max_children configurable (H15)
  17. Add try/catch to all AgentServer public API functions (M2)


This issue was synthesized from 7 category-specific audit reports (OTP supervision, GenServer blocking, race conditions, state management & error tuples, code duplication, resource leaks, configuration hygiene) plus an independent cross-cutting review.
