[Audit] Jido lib/ Comprehensive Code Quality & Safety Review — v2.0.0-rc.4
Scope: 99 files across lib/
Audit date: 2026-02-06
Categories: OTP supervision, GenServer blocking, race conditions, error handling, code duplication, resource leaks, configuration hygiene
Executive Summary
This issue consolidates the findings from a full audit of the Jido lib/ directory. Across seven audit categories, we identified 2 critical, 22 high, 34+ medium, and 35+ low severity findings. The findings cluster around a few central themes:
AgentServer is the primary hot spot. At ~1800 lines, agent_server.ex appears in 20+ findings across every audit category — missing trap_exit, synchronous signal processing, race conditions in resolve_server, signal-handling boilerplate duplication, inconsistent error normalization, timer/monitor cleanup gaps, and scattered timeout magic numbers.
The storage layer has a parallel-hierarchy problem. Two complete storage abstractions (Agent.Store vs Storage) coexist, and thread reconstruction / entry preparation logic is independently reimplemented 3–4 times across storage adapters. Consolidation eliminates ~230+ lines of structural duplication.
Error handling lacks a unified contract. Three error shapes coexist (%Jido.Error{} structs, bare atoms, bare strings), the :not_found sentinel appears in two incompatible forms, and Thread.Store uses 3-element tuples. Callers cannot reliably pattern-match on errors from different modules.
Timeouts and defaults are scattered as magic numbers. At least 10 different timeout defaults across 5 files, 4 copies of a 5_000ms shutdown timeout, and max_queue_size defined in 2 places with no centralized defaults module.
Critical Findings — Fix Immediately
C1: String.to_atom/1 fallback in journal_backed.ex — VM crash risk
File: lib/jido/thread/store/adapters/journal_backed.ex:192
defp to_atom(string) when is_binary(string) do
  String.to_existing_atom(string)
rescue
  ArgumentError -> String.to_atom(string)
end
If journal data contains arbitrary user-generated strings, each unique string creates a permanent atom. The BEAM atom table has a hard limit (~1,048,576 atoms). An attacker or misbehaving data source can exhaust this and crash the entire VM — not just the process, the entire node.
Fix: Remove the fallback entirely. Either keep the value as a string, map through a fixed whitelist of allowed atoms, or return {:error, :unknown_atom}. Never create atoms from untrusted data.
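A minimal sketch of the whitelist approach (the allowed atom list is illustrative, and callers would need to handle the new error tuple):
@allowed_atoms ~w(pending running completed failed)a

defp to_atom(string) when is_binary(string) do
  # Only atoms from a fixed, compile-time whitelist are ever produced from
  # journal data; anything else is rejected instead of minting a new atom.
  case Enum.find(@allowed_atoms, fn atom -> Atom.to_string(atom) == string end) do
    nil -> {:error, :unknown_atom}
    atom -> {:ok, atom}
  end
end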
C2: Parallel storage hierarchies — Agent.Store vs Storage
Files: 8 files across two hierarchies
The project maintains two completely separate storage abstraction hierarchies:
Hierarchy A (Legacy): agent/store.ex (behaviour), agent/store/ets.ex (68 lines), agent/store/file.ex (87 lines), agent/persistence.ex (214 lines facade).
Hierarchy B (Unified): storage.ex (behaviour), storage/ets.ex (314 lines), storage/file.ex (377 lines), persist.ex (356 lines facade).
The checkpoint operations in Hierarchy B are structurally identical to Hierarchy A — the ETS get implementations are nearly character-for-character duplicates. This creates ~150 lines of structural duplication plus ongoing maintenance confusion about which API to use.
Fix: Deprecate and remove the entire Agent.Store hierarchy and Agent.Persistence facade. Migrate all callers to Jido.Storage checkpoint operations. Jido.Persist is the evolved version.
High-Priority Findings
The following are organized by theme rather than by individual audit document, since many findings span multiple categories.
H1: AgentServer does not trap_exit despite starting linked children
File: lib/jido/agent_server.ex (init at ~646, start_plugin_child at ~1344, start_subscription_sensor at ~1413)
AgentServer starts plugin children via apply(m, f, a) on child specs and sensor runtimes via SensorRuntime.start_link/1. Both create linked processes. However, AgentServer never calls Process.flag(:trap_exit, true), and its exit handling is written for monitors (:DOWN messages), not {:EXIT, ...} messages.
If any linked child or sensor crashes, the AgentServer dies immediately without calling terminate/2, bypassing lifecycle persistence/hibernation, cron job cancellation, and completion waiter cleanup.
Fix (preferred): Start plugin children and sensors under a per-agent DynamicSupervisor rather than linking them directly to the GenServer. Alternative: Add Process.flag(:trap_exit, true) in init/1 and a handle_info({:EXIT, pid, reason}, state) clause that feeds into the existing child-down pipeline.
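A minimal sketch of the trap_exit alternative; handle_child_down/3 is a stand-in for whatever the existing monitor :DOWN pipeline is called:
# In init/1, before any linked child or sensor is started:
Process.flag(:trap_exit, true)

# New clause alongside the existing :DOWN handling; converts exits from
# linked processes into the same child-down path instead of killing the server.
@impl true
def handle_info({:EXIT, pid, reason}, state) do
  {:noreply, handle_child_down(pid, reason, state)}
end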
H2: handle_call({:signal, ...}) runs the full strategy pipeline synchronously
File: lib/jido/agent_server.ex:729-746
The call chain is handle_call → process_signal → do_process_signal → dispatch_action → agent_module.cmd/2 → strategy.cmd/3 → Jido.Exec.run/1 for each instruction sequentially. For a framework designed to work with LLM-based agents, a single handle_call blocks the GenServer for the duration of potentially slow external API calls (seconds to minutes). All other handle_call and handle_info messages queue behind it.
Additionally, before cmd/2, run_plugin_signal_hooks/2 iterates all plugins calling handle_signal/2, and after cmd/2, run_plugin_transform_hooks/4 iterates all plugins calling transform_result/3 — all user-provided code with no timeout enforcement.
The default 5-second timeout on call/3 (line 276) is insufficient for LLM agents, and the caller gets a timeout error while the GenServer continues processing, leading to state divergence.
Fix: handle_call({:signal, ...}) should only enqueue the signal and return immediately; actual execution should happen asynchronously (the drain loop already exists for directives). Alternatively, offload agent_module.cmd/2 to a Task and return {:noreply, state}, replying via GenServer.reply/2 when complete.
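A minimal sketch of the Task-offload variant, assuming the supervised Task.Supervisor proposed in H3 and a hypothetical run_signal_pipeline/2 wrapper around the existing process_signal path; folding the resulting agent state back into the server (via a follow-up message) is the real architectural work and is omitted here:
def handle_call({:signal, signal}, from, state) do
  # The slow strategy pipeline runs off the GenServer loop, so other calls
  # and infos are no longer blocked behind an LLM round-trip.
  Task.Supervisor.start_child(Jido.SystemTaskSupervisor, fn ->
    result = run_signal_pipeline(signal, state)
    # Reply from the task once the pipeline finishes.
    GenServer.reply(from, result)
  end)

  {:noreply, state}
end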
H3: Task.async/1 in Discovery.init_async/0 — linked, unsupervised, never awaited
File: lib/jido/discovery.ex:88-94, called from lib/jido/application.ex:16
Task.async/1 creates a linked process and expects Task.await/2 to be called. Here the return value is discarded — the task is started, then Supervisor.start_link is called, and the task result is never awaited. This violates the Task.async API contract, the unhandled reply message sits in the caller's mailbox, and if build_catalog() raises, it crashes the application start process.
Fix: Replace with Task.Supervisor.start_child/2 under a supervised Task.Supervisor started in Jido.Application:
# In Jido.Application children:
{Task.Supervisor, name: Jido.SystemTaskSupervisor}

# In Discovery:
def init_async do
  Task.Supervisor.start_child(Jido.SystemTaskSupervisor, fn ->
    catalog = build_catalog()
    :persistent_term.put(@catalog_key, catalog)
  end)
end
H4: ETS tables created without :heir by arbitrary owner processes
Files: lib/jido/storage/ets.ex:224, lib/jido/agent/store/ets.ex:60
Tables are created by whichever process first calls the storage function — not by a supervised process. There is no :heir option, so if the owning process crashes, the entire ETS table and all its data are destroyed. No :ets.delete/1 cleanup exists anywhere.
Fix: Create ETS tables from a dedicated supervised process and set the :heir option so another process can take ownership on crash.
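A sketch of a dedicated table-owner process; the module name, table name, and the idea of passing the heir pid in as an option are illustrative, not existing Jido code:
defmodule Jido.Storage.TableOwner do
  @moduledoc false
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # The heir (e.g. a long-lived supervisor pid) inherits the table if this
    # process crashes, so thread data is not destroyed with its owner.
    heir = Keyword.fetch!(opts, :heir)

    :ets.new(:jido_storage_threads, [
      :named_table, :public, :set,
      {:heir, heir, :jido_storage_threads}
    ])

    {:ok, %{}}
  end
end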
H5: ETS append_thread — read-then-write without locking
File: lib/jido/storage/ets.ex:128-177
current_rev = get_current_rev(threads_table, thread_id) # read
# ... no lock ...
:ets.insert(threads_table, ets_entries) # write based on stale read
Two concurrent processes can read the same current_rev, both compute entries starting at the same sequence number, both insert, and the result is duplicate sequence numbers and data corruption. The expected_rev parameter provides optimistic concurrency control, but it is optional — when not provided, the check is skipped entirely. Even when provided, the check is not atomic with the write. Contrast with Storage.File which correctly uses :global.trans.
Fix: Use :ets.update_counter/4 to atomically reserve sequence ranges, add a per-thread-id lock (like File's :global.trans), route all ETS thread writes through a single GenServer per thread, or at minimum make expected_rev mandatory.
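A sketch of the :ets.update_counter/4 approach, assuming a separate counters table keyed by thread_id (table and function names are illustrative):
defp reserve_seq_range(counters_table, thread_id, count) do
  # Atomically bump the per-thread counter by `count`; concurrent appenders
  # can never receive overlapping sequence ranges.
  last_seq = :ets.update_counter(counters_table, thread_id, {2, count}, {thread_id, 0})
  (last_seq - count + 1)..last_seq
end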
H6: Sensor runtime dispatch blocks the GenServer
File: lib/jido/sensor/runtime.ex:291-308
deliver_signal/2 calls Dispatch.dispatch(signal, agent_ref) synchronously within the sensor GenServer. If the dispatch target is an HTTP or webhook adapter, this blocks the sensor for the full network call duration. Subsequent :tick messages queue up.
Fix: Offload dispatch to a supervised Task, similar to how it's already done in the Emit directive executor (directive_executors.ex:31).
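A sketch of the offload, assuming the supervised Task.Supervisor from H3 and that agent_ref lives on the runtime state (return shape is illustrative):
defp deliver_signal(signal, state) do
  # The dispatch (possibly an HTTP/webhook call) runs in a supervised task,
  # so the sensor keeps processing :tick messages while it is in flight.
  Task.Supervisor.start_child(Jido.SystemTaskSupervisor, fn ->
    Dispatch.dispatch(signal, state.agent_ref)
  end)

  state
end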
H7: Jido.Util returns {:error, String.t()} — API contract violation
File: lib/jido/util.ex:62-240
All validation functions return bare string errors like {:error, "All actions must implement the Jido.Action behavior"}. Any caller pattern-matching on {:error, %Error{message: msg}} will miss these entirely.
Fix: Wrap all Jido.Util validation returns in Jido.Error.validation_error(...). This is the single highest-impact error normalization change.
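A before/after sketch, assuming Jido.Error.validation_error/1 accepts a message and builds the %Jido.Error{} struct:
# Before (Jido.Util today):
{:error, "All actions must implement the Jido.Action behavior"}

# After:
{:error, Jido.Error.validation_error("All actions must implement the Jido.Action behavior")}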
H8: :not_found returned in two incompatible forms
Bare :not_found is returned by Storage.ETS, Storage.File, Persist, and Agent.Store callbacks. Meanwhile {:error, :not_found} is returned by AgentServer, Await, Util, and InstanceManager. Any with chain expecting only {:ok, _} and {:error, _} silently falls through on bare :not_found.
Fix: Define a project-wide convention — either @type find_result :: {:ok, t()} | :not_found | {:error, term()} with explicit documentation, or normalize everything to {:error, :not_found}.
H9: AgentServer.process_signal doesn't normalize errors before returning
File: lib/jido/agent_server.ex:997, 1019, 1097, 1099-1100
process_signal can return errors of multiple shapes: raw plugin errors, a bare :no_matching_route atom, or routing reasons. Callers of AgentServer.call/3 cannot reliably pattern-match on what comes back.
Fix: Normalize all errors to Jido.Error structs before returning from process_signal.
H10: Thread reconstruction logic triplicated; entry preparation quadruplicated
Thread reconstruction (reconstruct_thread) is independently implemented in storage/ets.ex, thread/store/adapters/journal_backed.ex, and storage/file.ex — all building a %Thread{} from entries with entry_count, timestamps, metadata, and stats (~35 duplicated lines).
Entry preparation (assigning seq, timestamp, id) is implemented in four places: thread.ex (canonical), storage/ets.ex (copy), storage/file.ex (simplified), and journal_backed.ex (reimplementation). The fetch_entry_attr helper is character-for-character identical between Thread and Storage.ETS (~80 duplicated lines).
Entry ID generation uses three different approaches across four files — Jido.Util.generate_id(), Base.url_encode64(:crypto.strong_rand_bytes(12)), and random_string(12).
Fix: Add Thread.from_entries/2, Thread.prepare_entries/3, and Thread.Entry.generate_id/0 as public functions. All storage adapters delegate to them. Eliminates ~130 lines.
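A sketch of the proposed public helpers; the real %Jido.Thread{} struct carries more fields (timestamps, metadata, stats) than shown, and the field names here are illustrative:
defmodule Jido.Thread do
  defstruct [:id, :entries, :entry_count]

  # Canonical reconstruction used by every storage adapter instead of each
  # one rebuilding the thread shape itself.
  def from_entries(thread_id, entries) do
    %__MODULE__{id: thread_id, entries: entries, entry_count: length(entries)}
  end
end

defmodule Jido.Thread.Entry do
  # Single ID scheme replacing the three divergent generation approaches.
  def generate_id, do: Jido.Util.generate_id()
end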
H11: Signal processing boilerplate copy-pasted three times in AgentServer
File: lib/jido/agent_server.ex
The identical "trace + try/catch + process_signal" pattern appears verbatim in handle_cast({:signal, ...}), handle_info({:scheduled_signal, ...}), and handle_info({:signal, ...}):
{traced_signal, _ctx} = TraceContext.ensure_from_signal(signal)
try do
  case process_signal(traced_signal, state) do
    {:ok, new_state, _resolved_action} -> {:noreply, new_state}
    {:error, _reason, new_state} -> {:noreply, new_state}
  end
after
  TraceContext.clear()
end
Similarly, the Trace.put + new_root wrapping pattern is triplicated at lines 1678, 1712, and 1736.
Fix: Extract defp handle_signal_async(signal, state) and defp ensure_traced(signal) helpers. Eliminates ~42 lines.
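A sketch of the extraction; the helper body is the block quoted above, moved into one place, and each handler becomes a delegation. The ensure_traced/1 helper would analogously absorb the Trace.put + new_root wrapping triplicated at lines 1678, 1712, and 1736:
defp handle_signal_async(signal, state) do
  {traced_signal, _ctx} = TraceContext.ensure_from_signal(signal)

  try do
    case process_signal(traced_signal, state) do
      {:ok, new_state, _resolved_action} -> {:noreply, new_state}
      {:error, _reason, new_state} -> {:noreply, new_state}
    end
  after
    TraceContext.clear()
  end
end

# Call sites collapse to one-liners, e.g.:
def handle_cast({:signal, signal}, state), do: handle_signal_async(signal, state)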
H12: Persistence facade duplication
Two independent persistence facades exist: Agent.Persistence (214 lines, uses Agent.Store, dump/load callbacks) and Persist (356 lines, uses Storage, checkpoint/restore callbacks plus thread journal support). Both implement the same check-callback → call-adapter → handle-results pattern.
Fix: Jido.Persist is the evolved version. Remove Jido.Agent.Persistence and migrate all callers. Resolves as part of C2.
H13: Timeout defaults scattered as magic numbers
At least 10 different timeout values are hard-coded across the codebase; representative locations:
| Value | Locations |
|---|---|
| 5_000ms (call) | agent_server.ex:276, worker_pool.ex:88,117,175 |
| 5_000ms (shutdown) | directive_executors.ex:193, instance_manager.ex:226, agent_server.ex:251, sensor/runtime.ex:90 |
| 10_000ms (await) | agent_server.ex:351, await.ex:78, jido.ex:468,486,495 |
| 30_000ms (child await) | jido.ex:477 |
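Item 15 of the fix order below centralizes these; a minimal sketch of a defaults module (module and key names are proposals), with each value overridable via application config:
defmodule Jido.Defaults do
  @moduledoc false
  # Single source of truth for timeout defaults; call sites read from here
  # instead of repeating literal values.
  def call_timeout, do: Application.get_env(:jido, :call_timeout, 5_000)
  def shutdown_timeout, do: Application.get_env(:jido, :shutdown_timeout, 5_000)
  def await_timeout, do: Application.get_env(:jido, :await_timeout, 10_000)
  def child_await_timeout, do: Application.get_env(:jido, :child_await_timeout, 30_000)
end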
Recommended Fix Order
The following sequence respects dependency chains and maximizes safety impact per unit of effort.
Phase 1 — Safety-critical (do first, independent changes):
1. Remove the String.to_atom/1 fallback in journal_backed.ex (C1)
2. Add Process.flag(:trap_exit, true) to AgentServer init/1 plus a handle_info({:EXIT, ...}) clause (H1)
3. Replace Task.async/1 in Discovery.init_async/0 with a supervised task (H3)
4. Create ETS tables from a supervised process with :heir (H4)
Phase 2 — Data integrity & blocking:
5. Add locking or atomic ops to ETS append_thread (H5)
6. Offload sensor dispatch to a Task (H6)
7. Make signal processing async in AgentServer (H2) — largest architectural change
Phase 3 — Error contract unification:
8. Normalize Jido.Util error returns to %Error{} (H7)
9. Standardize :not_found convention project-wide (H8)
10. Add error normalization in AgentServer.process_signal (H9)
Phase 4 — Deduplication:
11. Remove Agent.Store hierarchy, migrate to Storage (C2, H12)
12. Extract Thread.from_entries/2, Thread.prepare_entries/3, Entry.generate_id/0 (H10)
13. Extract handle_signal_async/2 and ensure_traced/1 helpers (H11)
Phase 5 — Configuration & operational hygiene:
14. Fix telemetry compile_env + default log level + validation (H14, M15)
15. Centralize timeout defaults (H13)
16. Make supervisor restart intensity / max_children configurable (H15)
17. Add try/catch to all AgentServer public API functions (M2)
This issue was synthesized from 7 category-specific audit reports (OTP supervision, GenServer blocking, race conditions, state management & error tuples, code duplication, resource leaks, configuration hygiene) plus an independent cross-cutting review.